UNIVERSITY OF CALIFORNIA
College of Engineering
Department of EECS, Computer Science Division
CS186
|
Prof. Michael
Franklin
|
Spring 2006
|
Assignment 3
|
Assignment 3: PostgreSQL
Query Optimizer
In this homework you will carry out a number
of exercises involving the optimization of relational queries using the PostgreSQL optimizer and the visualization command EXPLAIN.
You will not have to write any code. This assignment has to be done individually.
There are four parts to this assignment:
- Reading
some PostgreSQL documentation
- Setting
up the database, loading the tables, creating indexes
- Answering
questions by observing the query plans generated by PostgreSQL
- Writing
and running some queries of your own.
General Instructions
- Please
spend some time to read this assignment completely before beginning.
- You
can get all the files for this assignment at ~cs186/Hw3.
- We
will be using PostgreSQL version 8.0.3, which is
now set up to be the default for cs186 accounts. The instructions provided
here are designed to work with the EECS instructional machines and may not
work as smoothly elsewhere. The TAs and Professors will only support EECS
instructional machines.
- By
default, you should be able to access the 8.0.3 versions of pg_ctl,
psql, etc. (test this out by using the commands
"psql --version", "pg_ctl --version"). If your path is pointing to
some other version of these binaries, you may have to explicitly add to
your PATH environment variable, the directory ~cs186/pgsql/bin,
which has the 8.0.3 binaries.
- Before
moving from one question to the next, you should remove any indexes that
you created specifically for the previous question. Also, enable any
access/join methods that you had disabled for the previous question.
- To
submit this assignment, print out the questions below, and write in brief
answers.. You should bring your answer sheets to
class on Thursday the 23rd, or drop your answer sheets in the
box provided on 2nd floor of SODA by the deadline. Be sure to clearly indicate your name,
Student ID, and cs-186 login on the answer sheet.
- The
deadline for this assignment is 5 pm, Friday March 24, 2006. Due to
spring break – NO SLIP DAYS WILL BE ALLOWED FOR THIS HOMEWORK. Remember, this assignment is not to be
done in groups.
- If
you know how to use PostgreSQL, this assignment
should be doable in a single sitting.
Documentation and PostgreSQL Tips
The PostgreSQL
8.0 Documentation is on-line (and searchable) at: http://www.postgresql.org/docs/8.0/static/index.html
You will find the following parts
particularly useful for this assignment:
- PostgreSQL Performance Tips http://www.postgresql.org/docs/8.0/static/performance-tips.html
- The
CREATE INDEX Command: http://www.postgresql.org/docs/8.0/static/sql-createindex.html
- the command for creating indexes; you will also need “DROP INDEX”
- The
SET Command: http://www.postgresql.org/docs/8.0/static/sql-set.html
- this
command lets you turn on and off operators that the query optimizer can
consider. For example, the SQL
command “SET enable_hashjoin TO off;”
tells the query optimizer not to use Hash Joins.
- Runtime
Configuration: http://www.postgresql.org/docs/8.0/static/runtime-config.html
- see
section 16.4.5.1 “planner method configuration” for a list of the operators you can disable
and re-enable for the optimizer using the SET command.
- PostgreSQL System Catalogs: http://www.postgresql.org/docs/8.0/static/catalogs.html
- note
that PostgreSQL keeps its catalog information
in tables. The tables/views that
are likely to be most useful in this assignment are “pg_stats”, “pg_class”,
“pg_indexes”, and “pg_attribute”. You can query these in psql just
like regular tables. For example,
the query “select * from pg_indexes where
tablename = 'orders';” gives you
information about the indexes defined on the orders table. You can see the statistics the optimizer
has for the table ‘orders’ using the query: “select * from pg_stats where tablename =
'orders'” (note the single quotes around the tablename,
which is a string attributed in the pg_stats
view). The documentation for pg_stats (linked from the above page) explains what
the numbers mean. You can get information on tables by
querying the “pg_class” table
and/or using the “/d” command in psql.
Setting up the Database
- Create
a database directory with initdb . After that, start
the postmaster with pg_ctl (initdb tells you the exact command). Then, create a database with createdb.
- Create
the following tables (it’s best to use the file schema.sql
with the "psql
-f" command for this):
- CREATE TABLE companies ( co_id
serial PRIMARY KEY, co_name varchar(64),
co_lastchg timestamp );
- CREATE TABLE products ( pr_code
varchar(6) PRIMARY KEY, pr_desc
varchar(64) );
- CREATE TABLE orders ( ord_id
serial, ord_company int4 REFERENCES companies(co_id), ord_product varchar(6) REFERENCES products(pr_code),
ord_qty int4, ord_placed
date, ord_delivered date, ord_paid
date );
- Load
the data into the tables using the files companies.data,
products.data and orders.data.
You should use "psql -f" to run these
files (which contain SQL INSERT statements). Note that orders.data
should be run last (why?).
- If you chose not to use “psql
–f schema.sql” to create your
tables, then create an index ord_placed_idx
on the attribute ord_placed for the
relation Orders. (if you used "psql
-f schema.sql", then skip this step, the index has already been created for you)
- You
MUST update the PostgreSQL statistics
after adding or deleting indexes. The command for updating the statistics
(in a psql session) is "VACUUM ANALYZE;". You MUST update the statistics once
before starting the exercises. This step is
crucial to getting correct answers to the questions!
After completing the above
steps you should be able to analyze query plans produced by the optimizer for a
given query (using EXPLAIN). You should also be able to modify the
runtime variables for enabling/disabling join and access methods (using SET).
Questions –Hand in from here below…
CS 186 SPRING 2006 Homework 3 Answer Sheet (don’t forget to staple it
together!).
NAME: ____________________________ STUDENT ID:____ญญญญ______________
IMPORTANT: Circle the last two letters of your
class account:
cs186 a b c d e f g h
i j k l m n o p q r s t u v w x y z
a b c d e f g h
i j k l m n o p q r s t u v w x y z
1. SOME
EASY STUFF
Start psql and make sure you are
running version 8.0.3 (it tells you when it starts up). Then, using queries in psql, examine the set of tables
and access methods in the database you just created. In particular examine the
relation Orders.
Answer the following questions:
A. What are the attributes of the Orders relation?
B. How many indexes are built on this relation? Name them
and write down their type.
C. How many tuples are there in the Orders relation?
D. How many distinct values of “ord_qty”
does the query planner estimate there are for the Orders relation? (hint: Query pg_stats
to find out)
E. How many distinct values of “ord_qty”
are there actually in the orders table? (hint: it’s
probably easiest to run a query to compute this!)
2. USING
THE QUERY PLAN VIEWER
This question requires you to
use the PostgreSQL query plan visualization command EXPLAIN.
Read the documentation for EXPLAIN at the link given above. Note that (like System R) EXPLAIN estimates
query costs in units of disk I/Os (CPU instructions are added in by multiplying
times a conversion factor).
Consider the following query:
SELECT * FROM orders WHERE ord_qty = 55;
Answer the following questions looking at the query plan generated by the
EXPLAIN command:
- Briefly describe the plan chosen. (e.g., what kind of scan is used?).
- In what order would the tuples be returned by
this plan? Why?
- What is the estimated total cost of running the
plan?
- What is the estimated result cardinality for
this plan? The estimated result cardinality is the number of orders that
the optimizer estimates to have quantity 55. Looking at the statistics, why does the
optimizer come up with this estimate?
- How many orders actually do have “ord_qty = 55”?
(hint: it’s probably easiest to run
a query to compute this!)
- Looking at the statistics, which value of “ord_qty” would you expect to return the most
tuples?
- Which value of “ord_qty”
is actually the most popular, and how many tuples have that “ord_qty”?
3. SELECTS
WITH INDEXES
Consider the same query from previous
question:
SELECT * FROM orders WHERE ord_qty = 55;
Answer the following questions looking at the plans and the access methods:
- Create a btree index
on the attribute ord_qty of the relation
Orders. What is the plan chosen for the query now? (e.g., what kind of scan is used and
what is scanned?).
- What is the estimated total cost of running the
plan?
- Compare this plan with the plan obtained in
question 2.A above. Which is
cheaper and why?
4. RANGE
SELECTS
DROP the index
that you created for Question 3.
Now analyze the query plan
that PostgreSQL comes up for the following query:
SELECT * FROM orders WHERE ord_placed <
'2002-09-1';
Answer the following questions:
- How many order tuples that have ord_placed < '2002-09-1' does the optimizer think
there are?
- How does the optimizer arrive at this estimate
of the number of orders? That is, what calculations does it perform, and
where does the supporting data come from?
- In what order will the tuples be returned by
this plan?
- Disable the access method used in the plan in
step 1) What happens to the cost now? Does the
order in which the tuples are returned change? Why?
- Explain why one of the access methods is
costlier than the other in this case.
5. SIMPLE
JOIN
RE-ENABLE (using “SET”)
the access method you turned off above.
Analyze the query plan for the
following query that finds all products with quantity less than 5.
SELECT DISTINCT (co_name)
FROM orders, companies
WHERE ord_company = co_id
AND ord_qty < 5;
Answer the following questions:
- What is the estimated cost of this plan?
- What kind of join is used by the plan?
- What relations are sorted in this plan, and why?
- Why does only one of the inputs to the join need
to be sorted?
- Disable
the join type used in the above plan and re-optimize the query. What type of join is used now, and what
is the total estimated cost of the query?
6. THREE-WAY
JOIN
RE-ENABLE (using “SET”)
the access method you turned off above.
Answer the following questions
referring to the query below:
SELECT pr_desc, co_name
FROM orders, products, companies
WHERE ord_company = co_id
AND ord_product = pr_code;
- Describe the best plan estimated by the
optimizer. List the joins and access methods it uses, and the order in
which the relations are joined.
- What is the Relational Algrebra
expression for the above join order?
- Modify the query by adding a condition ord_qty < 5. What are the differences between this
plan and the one above? Why is this new join ordering better for the
extended query than the ordering obtained in part A?
7. PLAYING
WITH SQL
Answer the following questions
about the database (by writing queries!):
- What is the description (pr_desc)pr of the most
popular product and how many orders are there for it?
(hint: while one way to get close to this
is simply to list all products and the number of orders for them sorted
by number of orders, you may also want to try writing a variant that only
produces a single result row – it’s a bit ugly).
- For each company, what is the average number of
days from the time an order is placed to the time it is delivered? (note: see how PostgreSQL
handles missing (null) values in this case).
- Show a query to find out some other interesting
fact in the database and tell us what the answer is (or at least
summarize it).