UNIVERSITY OF CALIFORNIA
College
of Engineering
Department of EECS, Computer Science Division

CS186

Prof.  Michael Franklin

Spring 2006

Assignment 3

 

Assignment 3: PostgreSQL Query Optimizer

 

In this homework you will carry out a number of exercises involving the optimization of relational queries using the PostgreSQL optimizer and the visualization command EXPLAIN. You will not have to write any code. This assignment has to be done individually.

There are four parts to this assignment:

  • Reading some PostgreSQL documentation
  • Setting up the database, loading the tables, creating indexes
  • Answering questions by observing the query plans generated by PostgreSQL
  • Writing and running some queries of your own.

General Instructions

  • Please spend some time to read this assignment completely before beginning.
  • You can get all the files for this assignment at ~cs186/Hw3.
  • We will be using PostgreSQL version 8.0.3, which is now set up to be the default for cs186 accounts. The instructions provided here are designed to work with the EECS instructional machines and may not work as smoothly elsewhere. The TAs and Professors will only support EECS instructional machines.
  • By default, you should be able to access the 8.0.3 versions of pg_ctl, psql, etc. (test this out by using the commands "psql --version", "pg_ctl --version"). If your path is pointing to some other version of these binaries, you may have to explicitly add to your PATH environment variable, the directory ~cs186/pgsql/bin, which has the 8.0.3 binaries.
  • Before moving from one question to the next, you should remove any indexes that you created specifically for the previous question. Also, enable any access/join methods that you had disabled for the previous question.
  • To submit this assignment, print out the questions below, and write in brief answers.. You should bring your answer sheets to class on Thursday the 23rd, or drop your answer sheets in the box provided on 2nd floor of SODA by the deadline.   Be sure to clearly indicate your name, Student ID, and cs-186 login on the answer sheet.
  • The deadline for this assignment is 5 pm, Friday March 24, 2006.  Due to spring break – NO SLIP DAYS WILL BE ALLOWED FOR THIS HOMEWORK.  Remember, this assignment is not to be done in groups.
  • If you know how to use PostgreSQL, this assignment should be doable in a single sitting.

Documentation and PostgreSQL Tips

The PostgreSQL 8.0 Documentation is on-line (and searchable) at:  http://www.postgresql.org/docs/8.0/static/index.html

 

You will find the following parts particularly useful for this assignment:

  • PostgreSQL Performance Tips  http://www.postgresql.org/docs/8.0/static/performance-tips.html
  • The CREATE INDEX Command: http://www.postgresql.org/docs/8.0/static/sql-createindex.html
    • the command for creating indexes;  you will also need “DROP INDEX”
  • The SET Command: http://www.postgresql.org/docs/8.0/static/sql-set.html
    • this command lets you turn on and off operators that the query optimizer can consider.  For example, the SQL command “SET enable_hashjoin TO off;” tells the query optimizer not to use Hash Joins.
  • Runtime Configuration: http://www.postgresql.org/docs/8.0/static/runtime-config.html
    • see section 16.4.5.1 “planner method configuration” for  a list of the operators you can disable and re-enable for the optimizer using the SET command.
  • PostgreSQL System Catalogs: http://www.postgresql.org/docs/8.0/static/catalogs.html
    • note that PostgreSQL keeps its catalog information in tables.   The tables/views that are likely to be most useful in this assignment are “pg_stats”, “pg_class”, “pg_indexes,  and  pg_attribute”.   You can query these in psql just like regular tables.  For example, the query “select * from pg_indexes where tablename = 'orders';” gives you information about the indexes defined on the orders table.  You can see the statistics the optimizer has for the table ‘orders’ using the query:  “select * from pg_stats where tablename = 'orders'” (note the single quotes around the tablename, which is a string attributed  in the pg_stats view).  The documentation for pg_stats (linked from the above page) explains what the numbers  mean.  You can get information on tables by querying the “pg_class” table and/or using the “/d” command in psql. 

Setting up the Database

  1. Create a database directory with initdb .  After that, start the postmaster with pg_ctl  (initdb tells you the exact command).  Then, create a database with createdb.

 

  1. Create the following tables (it’s best to use the file schema.sql  with the  "psql -f" command for this):

 

    • CREATE TABLE companies ( co_id serial PRIMARY KEY, co_name varchar(64), co_lastchg timestamp );

 

    • CREATE TABLE products ( pr_code varchar(6) PRIMARY KEY, pr_desc varchar(64) );

 

    • CREATE TABLE orders ( ord_id serial, ord_company int4 REFERENCES companies(co_id), ord_product varchar(6) REFERENCES products(pr_code), ord_qty int4, ord_placed date, ord_delivered date, ord_paid date );

 

  1. Load the data into the tables using the files companies.data, products.data and orders.data. You should use "psql -f" to run these files (which contain SQL INSERT statements).  Note that orders.data should be run last (why?).

 

  1. If  you chose not to use “psql –f schema.sql” to create your tables, then create an index ord_placed_idx on the attribute ord_placed for the relation Orders. (if you used "psql -f schema.sql", then skip this step, the  index has already been created for you)

 

  1. You MUST update the PostgreSQL statistics after adding or deleting indexes. The command for updating the statistics (in a psql session) is "VACUUM ANALYZE;". You MUST update the statistics once before starting the exercises. This step is crucial to getting correct answers to the questions!

After completing the above steps you should be able to analyze query plans produced by the optimizer for a given query (using EXPLAIN). You should also be able to modify the runtime variables for enabling/disabling join and access methods (using SET).


Questions –Hand in from here below…

CS 186 SPRING 2006 Homework 3 Answer Sheet (don’t forget to staple it together!).

NAME: ____________________________              STUDENT ID:____ญญญญ______________   

IMPORTANT: Circle the last two letters of your class account:

 cs186 a b c d e f g h i j k l m n o p q r s t u v w x y z

       a b c d e f g h i j k l m n o p q r s t u v w x y z

 

1.      SOME EASY STUFF

Start psql and make sure you are running version 8.0.3 (it tells you when it starts up).  Then, using queries in psql, examine the set of tables and access methods in the database you just created. In particular examine the relation Orders.
Answer the following questions:

A.      What are the attributes of the Orders relation?

 

 

B.       How many indexes are built on this relation? Name them and write down their type.

 

 

C.       How many tuples are there in the Orders relation?

 

D.      How many distinct values of  ord_qty” does the query planner estimate there are for the Orders relation?  (hint: Query pg_stats to find out)

 

E.       How many distinct values of “ord_qty” are there actually in the orders table? (hint: it’s probably easiest to run a query to compute this!)

 

 

2.      USING THE QUERY PLAN VIEWER

This question requires you to use the PostgreSQL query plan visualization command EXPLAIN. Read the documentation for EXPLAIN at the link given above.   Note that (like System R) EXPLAIN estimates query costs in units of disk I/Os (CPU instructions are added in by multiplying times a conversion factor).
Consider the following query:

SELECT * FROM orders WHERE ord_qty = 55;

Answer the following questions looking at the query plan generated by the EXPLAIN command:

    1. Briefly describe the plan chosen.   (e.g., what kind of scan is used?).

 

 

    1. In what order would the tuples be returned by this plan? Why?

 

    1. What is the estimated total cost of running the plan?

 

    1. What is the estimated result cardinality for this plan? The estimated result cardinality is the number of orders that the optimizer estimates to have quantity 55.   Looking at the statistics, why does the optimizer come up with this estimate?

 

    1. How many orders actually do have “ord_qty = 55”?  (hint: it’s probably easiest to run a query to compute this!)

 

    1. Looking at the statistics, which value of “ord_qty” would you expect to return the most tuples?

 

    1. Which value of “ord_qty” is actually the most popular, and how many tuples have that “ord_qty”?

 

3.      SELECTS WITH INDEXES

Consider the same query from previous question:

SELECT * FROM orders WHERE ord_qty = 55;

Answer the following questions looking at the plans and the access methods:

    1. Create a btree index on the attribute ord_qty of the relation Orders. What is the plan chosen for the query now?   (e.g., what kind of scan is used and what is scanned?).

 

 

    1. What is the estimated total cost of running the plan?

 

    1. Compare this plan with the plan obtained in question 2.A above.  Which is cheaper and why?

 

 

 

4.      RANGE SELECTS

DROP the index that you created for Question 3.

Now analyze the query plan that PostgreSQL comes up for the following query:

SELECT * FROM orders WHERE ord_placed < '2002-09-1';

Answer the following questions:

    1. How many order tuples that have ord_placed < '2002-09-1' does the optimizer think there are?

 

    1. How does the optimizer arrive at this estimate of the number of orders? That is, what calculations does it perform, and where does the supporting data come from?

 

 

    1. In what order will the tuples be returned by this plan?

 

    1. Disable the access method used in the plan in step 1) What happens to the cost now? Does the order in which the tuples are returned change? Why?

 

 

    1. Explain why one of the access methods is costlier than the other in this case.

 

5.      SIMPLE JOIN

RE-ENABLE (using “SET”) the access method you turned off above.

Analyze the query plan for the following query that finds all products with quantity less than 5.

SELECT DISTINCT (co_name)
FROM orders, companies
WHERE ord_company = co_id AND ord_qty < 5;


Answer the following questions:

    1. What is the estimated cost of this plan?

 

    1. What kind of join is used by the plan?

 

    1. What relations are sorted in this plan, and why?

 

    1. Why does only one of the inputs to the join need to be sorted?

 

    1.  Disable the join type used in the above plan and re-optimize the query.  What type of join is used now, and what is the total estimated cost of the query?

 

 

6.      THREE-WAY JOIN

RE-ENABLE (using “SET”) the access method you turned off above.

Answer the following questions referring to the query below:

SELECT pr_desc, co_name
FROM orders, products, companies
WHERE ord_company = co_id AND ord_product = pr_code;

    1. Describe the best plan estimated by the optimizer. List the joins and access methods it uses, and the order in which the relations are joined.

 

 

 

    1. What is the Relational Algrebra expression for the above join order?

 

    1. Modify the query by adding a condition ord_qty < 5. What are the differences between this plan and the one above? Why is this new join ordering better for the extended query than the ordering obtained in part A?

 

 

 

7.      PLAYING WITH SQL 

Answer the following questions about the database (by writing queries!):

    1. What is the description (pr_desc)pr of  the most popular product and how many orders are there for  it?  (hint: while one way to get close to this is simply to list all products and the number of orders for them sorted by number of orders, you may also want to try writing a variant that only produces a single result row – it’s a bit ugly).

 

    1. For each company, what is the average number of days from the time an order is placed to the time it is delivered? (note: see how PostgreSQL handles missing (null) values in this case).

 

 

    1. Show a query to find out some other interesting fact in the database and tell us what the answer is (or at least summarize it).