Homework 3

Overview

In this assignment you will design and implement a tool to suggest potentially useful indexes for a Postgres database, based on the read and write workload on the database. For example, if a common query is "SELECT * FROM Emp WHERE Salary>100000" then it might be useful to create an index on Emp.Salary. On the other hand, if Emp.Salary is frequently updated, you will have to weigh the tradeoff of extra cost on update compared to the benefit to select queries.

The goal of this assignment is threefold: 1. to understand and modify the code of a real DBMS, 2. to think about index selection strategies, and 3. to design an architecture for a DBMS tool. In particular, you will modify the Postgres source code to gather the appropriate workload statistics, and a tool to use those statistics to suggest indexes to the DBA. You will also have to think carefully about how to make the decision whether to suggest a potential index.

Due

This assignment is due May 8. While we expect that the modifications to the Postgres source may not be very extensive (potentially just touching a few source files) understanding the structure of the Postgres source and how to get the information you need will likely be challenging.

1. Obtain and build Postgres source

We suggest for this assignment that you work on ilinux1.eecs.berkeley.edu or ilinux2.eecs.berkeley.edu. Once you start using ilinux1 or ilinux2 you should continue to use the same machine, as binaries compiled on one machine may not work properly on the other.

The full source and build of Postgres is 160 MB or more, which is too large to fit in your home directory given student quotas. So you will be doing most of your work in /home/tmp, which is a large filesystem. Because /home/tmp is shared between many classes, and we want to be good citizens, we strongly suggest that you work in pairs for this assignment, and that only one of you uncompress and build the postgres source. Thus, you will have to work together to write code in this shared code tree.

Warning: Make sure that you prefix all of the postgres commands with the appropriate path, to make sure you pick up the commands you have built and not the standard class postgres instance. For example:

./postgres
will run the postgres command in the current directory, while
postgres
will likely run the postgres command from your path, which won't be the one you built.

To obtain and build the postgres source:

  1. If you don't already have a directory in /home/tmp, create one by running the following command:
    /share/b/bin/mkhometmpdir
    
    If you have problems running this command on ilinux1 or ilinux2, try logging into pulsar.eecs.berkeley.edu and running it there. That is, do the project on one of the linux machines, but carry out just this step of making the directory on pulsar.

    After you have done this step, you should have a directory called /home/tmp/cs186-xx, where cs186-xx is your login.

  2. Change to this tmp directory, and untar the source from the "cs186" account home directory:
    cd /home/tmp/cs186-xx
    tar xzvf ~cs186/postgresql-8.3.1.tar.gz
    
    This will create a directory called "postgresql-8.3.1" in your /home/tmp/cs186-xx directory, containing all the source.

    The files in /home/tmp should not get deleted, but just to be safe you should periodically copy any files you edit to your home directory as a backup.

  3. Configure your build. First, choose a directory, preferably on /home/tmp/cs186-xx, where your built binaries will be installed. For example, you might choose "/home/tmp/cs186-xx/pgbin". This directory must be specified to the configure script using the --prefix flag. You should also use the --enable-debug flag, so that postgres will be built using debugging symbols and you can use a debugger like gdb during your project.
    cd /home/tmp/cs186-xx/postgresql-8.3.1
    ./configure --prefix=/home/tmp/cs186-xx/pgbin --enable-debug
    
    This step will create the necessary configuration scripts and makefiles to build postgres.

  4. Make postgres.
    gmake
    
    If all goes well, at the end you should get a message that says: "All of PostgreSQL successfully made. Ready to install."

  5. Install postgres in your build directory:
    gmake install
    
    If successful you should be see the message: "PostgreSQL installation complete."

  6. If necessary, create a data directory for postgres. You might still have a data directory from previous projects, but if you have trouble using your existing data directory with this version of postgres, you can throw it away and start over.
    mkdir /home/cc/cs186/sp08/class/c186-xx/pgdata
    cd /home/tmp/cs186-xx/pgbin/bin
    ./initdb -D /home/cc/cs186/sp08/class/c186-xx/pgdata
    
    Note that this will require about 33 MB in your home directory, so clear out files from previous assignments if necessary.

  7. Run your built postgres:
    cd /home/tmp/cs186-xx/pgbin/bin
    ./postgres -D /home/cc/cs186/sp08/class/c186-xx/pgdata
    
    Now, from another shell, you should be able to run standard postgres commands:
    cd /home/tmp/cs186-xx/pgbin/bin
    ./createdb test
    ./psql test
    

2. Modify Postgres

Next, we suggest you make a small change to Postgres, such as printing out "Hello world" from the server or something when a query is run. Make sure you can make, install and run your modified server. You will need to repeat the "gmake; gmake install" commands, and shut down and restart your running version of postgres.

For this assignment, when you turn in modified source files, you should clearly mark your changes by surrounding them with comments:

/* BEGIN HW3 */
your additions, modifications, deletions to code ...
/* END HW3 */
This will help us find the relevant portions of code.

3. Tasks

You should track the number of times Postgres does a sequential scan of a table, as well as the number of tuples inserted into or deleted from the table, and the number of times a particular attribute of a tuple in the table is updated. Using these statistics (and any other information you need to collect), your tool should suggest zero or more indexes to be created. This suggestion will require a "cost model" that weighs the potential benefits of an index versus its costs (more on this below.) The tool can be run offline; that is, the tool to suggest indexes itself does not have to be a part of the Postgres server process. However, it must use real statistics from plans that are actually executed against the postgres backend.

Your cost model is really just a formula that determines whether an index will produce a net benefit, or net cost, to the performance of the system. Ideally, you would count the number of I/O's done by plans before and after using the index. However, for the purposes of this assignment, you can use a simpler cost model based on the number of tuples examined under alternative plans. Part of this assignment is to design a simple but effective cost model, gather the necessary statistics from the execution of postgres to feed this cost model, and write a tool that uses these statistics and the cost model to suggest indexes.

Postgres is implemented in C and you will need to modify it by writing some C. Any external tools you write to parse runtime statistics, gather additional statistics, and suggest indexes can be in your language of choice, as long as you provide clear instructions on how to compile and run your tool.

Also, you should run some test workloads against your modified server that include both reads and writes. After running this workload, you should run your tool to make sure the appropriate indexes are being selected.

4. Some hints

5. Deliverables

You need to turn in several things:

For submission: