Homework 3: SQL

October 2003

Due: Sunday, November 2, 2003 (7pm)

To be done individually!!

 

1. Introduction

In this assignment, you will write SQL queries that answer questions about a database containing information about players, teams and games from the National Basketball Association (NBA). Yahoo Sports (http://sports.yahoo.com/nba/) and ESPN (http://sports.espn.go.com/nba/) are examples of web sites that use such a database. We have simplified the schema by maintaining information for only last season[1].

 

Some notes to keep in mind as you go:

 

From your home directory, run the command:

gtar -zxvf /home/cc/cs186/fa03/Hw3/Hw3.tar.gz


This will create a directory Hw3/ in your home directory with the following scripts:

Script

Description

initdb.sh

Initialize database directory at $PGDATA_HW3. To reinitialize, you have to delete existing $PGDATA_HW3 first.

startpg.sh

Starts Postgres master process for database directory at $PGDATA_HW3

loadnba.sh

loadbignba.sh

Creates database ‘hw3’, and loads in some test data. The test data from loadbignba.sh is based on real data from this year’s NBA roster. We have also provided a smaller data set that you can load using loadnba.sh. Feel free to insert in your own data for your own testing. Read

loaddata.sql to see how to use the copy command to bulk load data.

startpsql.sh

Starts psql for hw3

runquery.sh <QUERY_FILE>

Runs query stored in <QUERY_FILE> using psql and outputs to screen. We have provided a sample query file query0.sql

stoppg.sh

Stops Postgres master process for the database directory at $PGDATA_HW3 before you log off.

 

To set up the initial database for the very first time, you should run:


To start up psql, type:

Some of these scripts are simply there for convenience to ensure that you run our version of pg_ctl, psql, initdb, createdb, etc and not your compiled version from previous homework. You can choose not to use our scripts (at your own risk!) and create your own database (using our schema.sql) as you may have done in hw0.

 

 

2. Schema

There are 4 relations in the schema, which are described below along with their integrity constraints. Columns in the primary key are underlined.

 

Player(playerID: integer, name : varchar(50), position : varchar(10),  height : integer, weight : integer, team: varchar(30)) 

Each Player is assigned a unique playerID. The position of a player can either be Guard, Center or Forward. The height of a player is in inches while the weight is in pounds. Each player plays for only one team. The team field is a foreign key to Team.


Team (name: varchar(30), city : varchar(20)) 

Each Team has a unique name associated with it. There can be multiple teams from the same city.

 

Game (gameID: integer, homeTeam: varchar(30), awayTeam : varchar(30), homeScore : integer, awayScore : integer)

Each Game has a unique gameID. The fields homeTeam and awayTeam are foreign keys to Team. Two teams may play each other multiple times each season. There is an integrity check to ensure homeTeam and awayTeam are different.

 

GameStats (playerID : integer, gameID: integer, points : integer, assists : integer,
rebounds : integer) 

GameStats records the performance statistics of a player within a game. A player may not play in every game, in which case it will not have its statistics recorded for that game. gameID is a foreign key to Game. playerID is a foreign key to Player. Assume that two assertions are in place. The first is to ensure that the player involved belongs to either the involving home or away teams, and the second is to ensure that the total score obtained by a team (in Game) is consistent with the total sum (in GameStats) of individual players in the team playing in the game[2].

 

 

3. Queries

 

Part I

1. Find distinct names of players who play the “Guard” Position and have name containing “Jo”. (ORDER BY Players.name)

            select list: Player.name

            ordering: Player.name ascending

 

2. List cities that have more than 1 team playing there. (ORDER BY Team.city)

            select list:  Team.city

            ordering: Team.city ascending

 

3. Find the player(s) who has the highest score for this season (highest total score in all games of this season). Output the player’s ID, name, team, and total score.

            select list: Player.playerID, Player.name, Player.team, total_score

            ordering: Player.playerID ascending

 

4. List all players’ playerIDs and their average points in all home games that they played in. (ORDER BY Player.playerID)

            select list: Player.playerID, average_points

            ordering: Player.playerID ascending

 

5. For each different position, find the average height and weight of players that play that position for each team. Output the team and position first followed by the averages. The result should be ordered by the name of the team, and within each team, the different positions should be presented in descending order

            select list: Player.team, Player.position, avg_height, avg_weight

            ordering: Player.team ascending, Player.position descending

 

6. List each player’s playerID, their name, team and the number of “triple doubles” they earned.  (A “triple double” is a game in which the player’s number of assists, rebounds, and points are all in the double-digits). Even if a player doesn’t have any triple doubles, he should appear in the output with the corresponding attribute equal to zero. (ORDER BY Player.playerID).

            select list: Player.playerID, Player.name, Player.team, num_of_triple_doubles

            ordering: Player.playerID ascending

 

7. Find the number of games in which both “Joey Loo” and “Hairy Harry”
play, and Loo scores more points than Harry.

            select list: num_of_games

 

8. List the playerIDs and names of players who have played in at least two games during the season. Output the playerID and name followed by the number of games played. (ORDER BY Player.playerID)

            select list: Player.playerID, Player.name, num_of_games

            ordering: Player.playerID ascending

 

9. Find the team(s) with the most wins compared to other teams. Output the team name and the number of wins. (ORDER BY team name)

            select list: Team.name, num_of_wins

            ordering: Team.name ascending

 

10. List the playerIDs, names and teams of players who played in all home games for their team. Do not output players whose teams did not play any home games. (ORDER BY Player.ID)

            select list: Player.playerID, Player.name, Player.team

            ordering: Player.playerID ascending

 

Part II

For this part of the assignment you will have to use the EXPLAIN clause. You can observe the execution plan that the optimizer has chosen for a particular query, using the statement:

EXPLAIN <your_query>

 

This command displays the execution plan that the PostgreSQL planner generates for the supplied statement. The execution plan shows how the table(s) referenced by the statement will be scanned - by plain sequential scan, index scan, etc. - and if multiple tables are referenced, what join algorithms will be used to bring together the required row from each input table.

The most critical part of the display is the estimated statement execution cost, which is the planner's guess at how long it will take to run the statement (measured in units of disk page fetches). Actually two numbers are shown: the start-up time before the first row can be returned, and the total time to return all the rows.

 

Note: You will need to run the ANALYZE command before using EXPLAIN to collect statistics about the contents of the tables in the database.


For each of the following queries, run the query using EXPLAIN, choose which query is best, and explain why.

 

Query 1: List the shortest player(s).

 

select * from player p1 where p1.height <= all (select height from player);


select * from player p1 where p1.height in (select min(height) from player);

 
select * from player p1 where not exists (select * from player p2 where p2.height < p1.height);

(select * from player)
EXCEPT
(select p1.*  from player p1, player p2
where p1.height>p2.height AND p1.playerid<>p2.playerid);

 

Query 2: For each game return the total points scored in that game.


select gameId, homeScore + awayScore as totalScore from game;

select gameId, sum(points) as totalScore from gameStats group by gameid;


 

 

 

For all of the queries above, use the EXPLAIN clause to observe the execution plan chosen by the optimizer. Which query would you choose in each case and why?  Include at least one sentence explaining why one query might be better or worse than another.

 

 

4. Submission instructions

For Part I, submit each of your queries inside a separate file. Query X should go into a file called queryX.sql. E.g. query 1 goes into query1.sql, query 2 goes into query2.sql, and so on. You should be able to run ‘./runquery.sh queryX.sql to see the output of your query X. You are welcome to use view definitions within your query, as long as you ensure that runquery.sh command generates the correct output. This output will be used to be compared with our solutions to see if your query is written correctly.

 

For Part II, prepare a separate txt file for each query. The files should be named explain1.txt and explain2.txt respectively. Each file should contain the SQL queries provided for each question, each followed by the output of the EXPLAIN command for each one of them. Finally you should also add your comments about which of the options you would run on a large dataset and why.

 

You should submit the following files:

query[1,2,3,4,5,6,7,8,9,10].sql for each of your answers in Part I

explain[1,2].txt .sql for each of your answers in Part II

README – Your name and login

 

Make sure all the above files are in a directory called Project3/. cd into that directory and type submit Project3 to submit the files.

 



[1] As a fun exercise, think about how to extend the schema we provide to support multiple seasons.

[2] A good exercise for you (outside the scope of this assignment) is to figure out how to add in these checks.