Homework 3: SQL/XML & XPath

Due: October 20, 11:59pm

To be done individually!!

 

PART-I: SQL

 

1. Introduction

In this assignment, you will write SQL queries that answer questions about a database containing information about players, teams and games from the National Basketball Association (NBA). Yahoo Sports (http://sports.yahoo.com/nba/) and ESPN (http://sports.espn.go.com/nba/) are examples of web sites that use such a database. We have simplified the schema by maintaining information for only part of the 2003 season[1].

 

Some notes to keep in mind as you go:

 

From your home directory, run the command:

gtar -zxvf /home/cc/cs186/fa05/hw/hw3/hw3.tar.gz


This will create a directory Hw3/ in your home directory with the following scripts:

Script

Description

initdb.sh

Initialize database directory at $PGDATA_HW3. To reinitialize, you have to delete existing $PGDATA_HW3 first.

startpg.sh

Starts Postgres master process for database directory at $PGDATA_HW3

loadnba.sh

loadbignba.sh

Creates database ‘hw3’, and loads in some test data. The test data from loadbignba.sh is based on real data from this year’s NBA roster. We have also provided a smaller data set that you can load using loadnba.sh. Feel free to insert in your own data for your own testing. Read

loaddata.sql to see how to use the copy command to bulk load data.

loadxml.sh

Creates database hw3_xml, creates a table named AuctionSites, and loads in some test data.

runquery.sh <QUERY_FILE>

Runs query stored in <QUERY_FILE> using psql and outputs to screen. We have provided a sample query file query0.sql

runxml.sh <XPATH_FILE>

Runs xpath query stored in <XPATH_FILE> using psql and outputs to screen. We have provided a sample xml query file xml0.sql

stoppg.sh

Stops Postgres master process for the database directory at $PGDATA_HW3 before you log off.

 

To set up the initial database for the very first time, you should run:


To start up psql, type:

Some of these scripts are simply there for convenience to ensure that you run our version of pg_ctl, psql, initdb, createdb, etc and not your compiled version from previous homework. You can choose not to use our scripts (at your own risk!) and create your own database (using our schema.sql) as you may have done in hw0.

 

 

2. Schema

There are 4 relations in the schema, which are described below along with their integrity constraints. Columns in the primary key are underlined.

 

Player(playerID: integer, name : varchar(50), position : varchar(10),  height : integer, weight : integer, team: varchar(30)) 

Each Player is assigned a unique playerID. The position of a player can either be Guard, Center or Forward. The height of a player is in inches while the weight is in pounds. Each player plays for only one team. The team field is a foreign key to Team.


Team (name: varchar(30), city : varchar(20)) 

Each Team has a unique name associated with it. There can be multiple teams from the same city.

 

Game (gameID: integer, homeTeam: varchar(30), awayTeam : varchar(30), homeScore : integer, awayScore : integer)

Each Game has a unique gameID. The fields homeTeam and awayTeam are foreign keys to Team. Two teams may play each other multiple times each season. There is an integrity check to ensure homeTeam and awayTeam are different.

 

GameStats (playerID : integer, gameID: integer, points : integer, assists : integer,
rebounds : integer) 

GameStats records the performance statistics of a player within a game. A player may not play in every game, in which case it will not have its statistics recorded for that game. gameID is a foreign key to Game. playerID is a foreign key to Player. Assume that two assertions are in place. The first is to ensure that the player involved belongs to either the involving home or away teams, and the second is to ensure that the total score obtained by a team (in Game) is consistent with the total sum (in GameStats) of individual players in the team playing in the game[2].

 

 

3. Queries

 Please make sure you include the ORDER BY clause as stated in or queries. If your query output differs from ours because of a missing or mistaken  ORDER BY clause, you do not get any credit! Also, make sure you name the aggregate columns and other output columns in the SELECT lists as indicated using the AS clause (e.g., "... AS average_points, ..." for Query 2).

 

1. [1 point] Find the names of the shortest player(s) playing the "Guard" position for each team. (ORDER BY Players.name)

            select list: Player.name

            ordering: Player.name ascending

 

2. [1 point] List the names of all players and the average number of points and rebounds for all games that they played in, ignoring players who did not play in any games. (ORDER BY Player.name)

            select list:  Player.name, average_points, average_rebounds

            ordering: Player.name ascending

 

3. [1 point] List all distinct (player name, team name) pairs of players and their teams, where the player name contains "Mark" and the team is located in a city whose name starts with "B" or "C" or "L". (ORDER BY Player.name, Team.name) Hint: You'll need to use the 'LIKE' predicate operator.

            select list: Player.name, Team.name

            ordering: Player.name ascending, Team.name ascending

 

4. [1 point] For each city that has at least one NBA team, list the city name and the height of the tallest player playing the "Center" position in that city. (ORDER BY Team.city).

            select list: Team.city, max_center_height

            ordering: Team.city ascending

 

5. [1 point] List the names, heights, and weights for all players that have not scored more than 19 points or collected more than 9 rebounds in any of their teams' home games. (ORDER BY Player.name. Player.height)

            select list: Player.name, Player.height, Player.weight

            ordering: Player.name ascending, Player.height ascending

 

6. [2points]  List the names and teams of all players that have averaged at least 10 points for their teams' home games thus far, but have not scored 10 points or more in any of their teams' away games. The result should be ordered by team names, and within each team, player names should be presented in descending order.

            select list: Player.team, Player.name

            ordering: Player.team ascending, Player.name descending

 

7. [2 points] List the IDs, names, and teams for all players who have scored at least 15 points in every home game for their team. Do not output players whose teams have not played any home games. (ORDER BY Player.playerID)

            select list: Player.playerID, Player.name, Player.team

            ordering: Player.playerID

 

8. [2 points] List the names of all teams with a winning percentage of 60% or better at home games. Output the team name (calling it 'team'), and the number of home wins and losses. (ORDER BY team)

            select list: team, num_of_home_wins, num_of_home_losses

            ordering: team

 

9. [2 points] For each win of the "ChicagoBulls", list the game ID and the name(s) of player(s) who scored the maximum number of points for the Bulls in that win, along with that maximum number of points. (ORDER BY Game.gameID, Player.name)

            select list: gameID, Player.name, max_num_of_points

            ordering: gameID, Player.name

 

10. [2 points] List the names of all teams that have a winning record thus far in the season (total number of wins > total number of losses). (ORDER BY team_name)

            select list: team_name

            ordering: team_name ascending

 

 

 

 

PART-II: XML & XPath

 

1. Introduction & Schema Information

For this part of the assignment, you will experiment with the XML support provided by PostgreSQL by posing XPath queries over XML documents describing online-auction-site databases. The DTD schema for such auction-site XML data is given below. (The element names should be self-explanatory.)

<!ELEMENT site            (regions, categories, catgraph, people, open_auctions, closed_auctions)>

<!ELEMENT categories      (category+)>
<!ELEMENT category        (name, description)>
<!ATTLIST category        id ID #REQUIRED>
<!ELEMENT name            (#PCDATA)>
<!ELEMENT description     (text | parlist)>
<!ELEMENT text            (#PCDATA | bold | keyword | emph)*>
<!ELEMENT bold            (#PCDATA | bold | keyword | emph)*>
<!ELEMENT keyword         (#PCDATA | bold | keyword | emph)*>
<!ELEMENT emph            (#PCDATA | bold | keyword | emph)*>
<!ELEMENT parlist         (listitem)*>
<!ELEMENT listitem        (text | parlist)*>

<!ELEMENT catgraph        (edge*)>
<!ELEMENT edge            EMPTY>
<!ATTLIST edge            from IDREF #REQUIRED to IDREF #REQUIRED>

<!ELEMENT regions         (africa, asia, australia, europe, namerica, samerica)>
<!ELEMENT africa          (item*)>
<!ELEMENT asia            (item*)>
<!ELEMENT australia       (item*)>
<!ELEMENT namerica        (item*)>
<!ELEMENT samerica        (item*)>
<!ELEMENT europe          (item*)>
<!ELEMENT item            (location, quantity, name, payment, description, shipping, incategory+, mailbox)>
<!ATTLIST item            id ID #REQUIRED
                          featured CDATA #IMPLIED>
<!ELEMENT location        (#PCDATA)>
<!ELEMENT quantity        (#PCDATA)>
<!ELEMENT payment         (#PCDATA)>
<!ELEMENT shipping        (#PCDATA)>
<!ELEMENT reserve         (#PCDATA)>
<!ELEMENT incategory      EMPTY>
<!ATTLIST incategory      category IDREF #REQUIRED>
<!ELEMENT mailbox         (mail*)>
<!ELEMENT mail            (from, to, date, text)>
<!ELEMENT from            (#PCDATA)>
<!ELEMENT to              (#PCDATA)>
<!ELEMENT date            (#PCDATA)>
<!ELEMENT itemref         EMPTY>
<!ATTLIST itemref         item IDREF #REQUIRED>
<!ELEMENT personref       EMPTY>
<!ATTLIST personref       person IDREF #REQUIRED>

<!ELEMENT people          (person*)>
<!ELEMENT person          (name, emailaddress, phone?, address?, homepage?, creditcard?, profile?, watches?)>
<!ATTLIST person          id ID #REQUIRED>
<!ELEMENT emailaddress    (#PCDATA)>
<!ELEMENT phone           (#PCDATA)>
<!ELEMENT address         (street, city, country, province?, zipcode)>
<!ELEMENT street          (#PCDATA)>
<!ELEMENT city            (#PCDATA)>
<!ELEMENT province        (#PCDATA)>
<!ELEMENT zipcode         (#PCDATA)>
<!ELEMENT country         (#PCDATA)>
<!ELEMENT homepage        (#PCDATA)>
<!ELEMENT creditcard      (#PCDATA)>
<!ELEMENT profile         (interest*, education?, gender?, business, age?)>
<!ATTLIST profile         income CDATA #IMPLIED>
<!ELEMENT interest        EMPTY>
<!ATTLIST interest        category IDREF #REQUIRED>
<!ELEMENT education       (#PCDATA)>
<!ELEMENT income          (#PCDATA)>
<!ELEMENT gender          (#PCDATA)>
<!ELEMENT business        (#PCDATA)>
<!ELEMENT age             (#PCDATA)>
<!ELEMENT watches         (watch*)>
<!ELEMENT watch           EMPTY>
<!ATTLIST watch           open_auction IDREF #REQUIRED>

<!ELEMENT open_auctions   (open_auction*)>
<!ELEMENT open_auction    (initial, reserve?, bidder*, current, privacy?, itemref, seller, annotation, quantity, type, interval)>
<!ATTLIST open_auction    id ID #REQUIRED>
<!ELEMENT privacy         (#PCDATA)>
<!ELEMENT initial         (#PCDATA)>
<!ELEMENT bidder          (date, time, personref, increase)>
<!ELEMENT seller          EMPTY>
<!ATTLIST seller          person IDREF #REQUIRED>
<!ELEMENT current         (#PCDATA)>
<!ELEMENT increase        (#PCDATA)>
<!ELEMENT type            (#PCDATA)>
<!ELEMENT interval        (start, end)>
<!ELEMENT start           (#PCDATA)>
<!ELEMENT end             (#PCDATA)>
<!ELEMENT time            (#PCDATA)>
<!ELEMENT status          (#PCDATA)>
<!ELEMENT amount          (#PCDATA)>

<!ELEMENT closed_auctions (closed_auction*)>
<!ELEMENT closed_auction  (seller, buyer, itemref, price, date, quantity, type, annotation?)>
<!ELEMENT buyer           EMPTY>
<!ATTLIST buyer           person IDREF #REQUIRED>
<!ELEMENT price           (#PCDATA)>
<!ELEMENT annotation      (author, description?, happiness)>

<!ELEMENT author          EMPTY>
<!ATTLIST author          person IDREF #REQUIRED>
<!ELEMENT happiness       (#PCDATA)>

 

PostrgreSQL stores entire XML documents as text BLOBs (Binary Large Objects), inside text-valued columns of relational tables, and provides functions that support (in-memory) processing of XPath queries over the documents. For the purposes of this assignment, we will provide you with a simple relational table named "AuctionSites" with schema:  AuctionSites(auctionSiteID: integer, auctionSiteXML:text), where "auctionSiteID" is an integer site key and "auctionSiteXML" is the XML document for the corresponding site. You can initialize and load the data for the AuctionSites table using the loadxml.sh script. The runxml.sh script takes a file (containing your XPath query) as argument and runs is using psql.

 

2. XPath Queries

You should obviously use the DTD-schema information given above in formulating your queries. The queries can be expressed using the simple xpath_bool() and xpath_nodeset() functions supported by PostgreSQL. (See the README file under the contrib/xml2/ directory of the PostgreSQL distribution for more details, or this web page for a few examples.) Again, please make sure to rename the output columns in your query result as requested.

 

11. [1 point] List the address information for the person with id "person0" in each of the auction site tuples. (Output columns: auctionSiteID, person0_addresses)

 

12. [2 points] For each auction site, list the interval information/elements for each open auction where the type contains the substring "Dutch" and the quantity of auctioned items is greater than one. (Output columns: auctionSiteID, open_dutch_auctions)

 

13. [2 points] We classify a person as wealthy, if the income listed in his/her profile is more than 60000. For each auction site that contains at least one wealthy person, list the name information for all wealthy persons. Auction sites without any wealthy people should not appear in the output. (Output columns: auctionSiteID, wealthy_folks)

 


 

 

 


 

 

 

SUBMISSION INSTRUCTIONS

 

For both parts of the homework, submit each of your queries inside a separate file. Query X should go into a file called queryX.sql. E.g. query 1 goes into query1.sql, query 2 goes into query2.sql, and so on. You should be able to run ‘./runquery.sh queryX.sql to see the output of your query X. You are welcome to use view definitions within your query, as long as you ensure that runquery.sh command generates the correct output. This output will be used to be compared with our solutions to see if your query is written correctly.

 

.

 



[1] As a fun exercise, think about how to extend the schema we provide to support multiple seasons.

[2] A good exercise for you (outside the scope of this assignment) is to figure out how to add in these checks.