CS 186 - Homework 3

Homework 3

Overview

In this assignment you will design and implement a tool to suggest potentially useful indexes for a Postgres database, based on the read and write workload on the database. For example, if a common query is "SELECT * FROM Emp WHERE Salary>100000" then it might be useful to create an index on Emp.Salary. On the other hand, if Emp.Salary is frequently updated, you will have to weigh the tradeoff of extra cost on update compared to the benefit to select queries.

The goal of this assignment is threefold: 1. to understand and modify the code of a real DBMS, 2. to think about index selection strategies, and 3. to design an architecture for a DBMS tool. In particular, you will modify the Postgres source code to gather the appropriate workload statistics, and a tool to use those statistics to suggest indexes to the DBA. You will also have to think carefully about how to make the decision whether to suggest a potential index.

Due

This assignment is due May 8. While we expect that the modifications to the Postgres source may not be very extensive (potentially just touching a few source files) understanding the structure of the Postgres source and how to get the information you need will likely be challenging.

1. Obtain and build Postgres source

We suggest for this assignment that you work on ilinux1.eecs.berkeley.edu or ilinux2.eecs.berkeley.edu. Once you start using ilinux1 or ilinux2 you should continue to use the same machine, as binaries compiled on one machine may not work properly on the other.

The full source and build of Postgres is 160 MB or more, which is too large to fit in your home directory given student quotas. So you will be doing most of your work in /home/tmp, which is a large filesystem. Because /home/tmp is shared between many classes, and we want to be good citizens, we strongly suggest that you work in pairs for this assignment, and that only one of you uncompress and build the postgres source. Thus, you will have to work together to write code in this shared code tree.

Warning: Make sure that you prefix all of the postgres commands with the appropriate path, to make sure you pick up the commands you have built and not the standard class postgres instance. For example:

./postgres

will run the postgres command in the current directory, while

postgres

will likely run the postgres command from your path, which won't be the one you built.

To obtain and build the postgres source:

If you don't already have a directory in /home/tmp, create one by running the following command:
```
/share/b/bin/mkhometmpdir
```
If you have problems running this command on ilinux1 or ilinux2, try logging into pulsar.eecs.berkeley.edu and running it there. That is, do the project on one of the linux machines, but carry out just this step of making the directory on pulsar.
After you have done this step, you should have a directory called /home/tmp/cs186-xx, where cs186-xx is your login.
Change to this tmp directory, and untar the source from the "cs186" account home directory:
```
cd /home/tmp/cs186-xx
tar xzvf ~cs186/postgresql-8.3.1.tar.gz
```
This will create a directory called "postgresql-8.3.1" in your /home/tmp/cs186-xx directory, containing all the source.
The files in /home/tmp should not get deleted, but just to be safe you should periodically copy any files you edit to your home directory as a backup.
Configure your build. First, choose a directory, preferably on /home/tmp/cs186-xx, where your built binaries will be installed. For example, you might choose "/home/tmp/cs186-xx/pgbin". This directory must be specified to the configure script using the --prefix flag. You should also use the --enable-debug flag, so that postgres will be built using debugging symbols and you can use a debugger like gdb during your project.
```
cd /home/tmp/cs186-xx/postgresql-8.3.1
./configure --prefix=/home/tmp/cs186-xx/pgbin --enable-debug
```
This step will create the necessary configuration scripts and makefiles to build postgres.
Make postgres.
```
gmake
```
If all goes well, at the end you should get a message that says: "All of PostgreSQL successfully made. Ready to install."
Install postgres in your build directory:
```
gmake install
```
If successful you should be see the message: "PostgreSQL installation complete."
If necessary, create a data directory for postgres. You might still have a data directory from previous projects, but if you have trouble using your existing data directory with this version of postgres, you can throw it away and start over.
```
mkdir /home/cc/cs186/sp08/class/c186-xx/pgdata
cd /home/tmp/cs186-xx/pgbin/bin
./initdb -D /home/cc/cs186/sp08/class/c186-xx/pgdata
```
Note that this will require about 33 MB in your home directory, so clear out files from previous assignments if necessary.

Run your built postgres:

cd /home/tmp/cs186-xx/pgbin/bin
./postgres -D /home/cc/cs186/sp08/class/c186-xx/pgdata

Now, from another shell, you should be able to run standard postgres commands:

cd /home/tmp/cs186-xx/pgbin/bin
./createdb test
./psql test

2. Modify Postgres

Next, we suggest you make a small change to Postgres, such as printing out "Hello world" from the server or something when a query is run. Make sure you can make, install and run your modified server. You will need to repeat the "gmake; gmake install" commands, and shut down and restart your running version of postgres.

For this assignment, when you turn in modified source files, you should clearly mark your changes by surrounding them with comments:

/* BEGIN HW3 */
your additions, modifications, deletions to code ...
/* END HW3 */

This will help us find the relevant portions of code.

3. Tasks

You should track the number of times Postgres does a sequential scan of a table, as well as the number of tuples inserted into or deleted from the table, and the number of times a particular attribute of a tuple in the table is updated. Using these statistics (and any other information you need to collect), your tool should suggest zero or more indexes to be created. This suggestion will require a "cost model" that weighs the potential benefits of an index versus its costs (more on this below.) The tool can be run offline; that is, the tool to suggest indexes itself does not have to be a part of the Postgres server process. However, it must use real statistics from plans that are actually executed against the postgres backend.

Your cost model is really just a formula that determines whether an index will produce a net benefit, or net cost, to the performance of the system. Ideally, you would count the number of I/O's done by plans before and after using the index. However, for the purposes of this assignment, you can use a simpler cost model based on the number of tuples examined under alternative plans. Part of this assignment is to design a simple but effective cost model, gather the necessary statistics from the execution of postgres to feed this cost model, and write a tool that uses these statistics and the cost model to suggest indexes.

Postgres is implemented in C and you will need to modify it by writing some C. Any external tools you write to parse runtime statistics, gather additional statistics, and suggest indexes can be in your language of choice, as long as you provide clear instructions on how to compile and run your tool.

Also, you should run some test workloads against your modified server that include both reads and writes. After running this workload, you should run your tool to make sure the appropriate indexes are being selected.

4. Some hints

You'll need to gather statistics about plans that Postgres executes. This will require you to understand at least something about the plan data structure. Spend some time in relevant portions of the backend code to see what data structures are used. You might start your exploration in execMain.c, which contains the entry point into the executor.
Most of the time postgres takes a query and compiles it into a plan. However, there are a few operations where postgres takes a compiled plan and interprets it like a query. EXPLAIN PLAN is one example, and it may be useful to see how the code implementing EXPLAIN PLAN interprets the plan structure.
If you use elog or even printf inside of the postgres engine, the output will be produced in the console or logfile from the postgres command. In order to understand what is going on during the execution of a plan, it may be useful to liberally sprinkle some printfs or elogs around.
It may be important to debug your modified postgres. You can use gdb to attach to the running postgres process, and debug it. First, determine the process id to attach to. Postgress will start multiple processes, and when you connect with psql, it will start a separate process to handle the queries. So, if you type:
```
ps ux | grep postgres
```
you'll see something like:
```
cs186-md 17756  0.4  0.0  45468  3372 pts/8    S    21:38   0:00 ./postgres -D /home/cc/cs186/sp08/class/cs186-md/pgdata/
cs186-md 17758  0.0  0.0  45468   976 ?        Ss   21:38   0:00 postgres: writer process
cs186-md 17759  0.0  0.0  45468   816 ?        Ss   21:38   0:00 postgres: wal writer process
cs186-md 17760  0.0  0.0  45468   964 ?        Ss   21:38   0:00 postgres: autovacuum launcher process
cs186-md 17761  0.0  0.0   8156   868 ?        Ss   21:38   0:00 postgres: stats collector process
cs186-md 17864  0.0  0.0  46276  2704 ?        Ss   21:40   0:00 postgres: cs186-md test [local] idle
```
Postgres modifies the process name so that you can see more clearly what is going on. In the example above, process 17864 (labeled "postgres: cs186-md test [local] idle") is the process spawned to handle your queries. You can attach to this process using:
```
 gdb postgres
```
and then, in gdb, typing:
```
 (gdb) attach 17864
```
This will interrupt the process and allow you to use gdb to step through it and conduct other debugging tasks. There are many good tutorials for gdb on the web if you are not familiar with that tool.

5. Deliverables

You need to turn in several things:

The files from the postgres source tree that you have modified. You don't need to turn in the whole source tree; just the files you have modified. Remember to include "BEGIN HW3" and "END HW3" comments surrounding your modifications.
The source files and build instructions for any code you write outside of the postgres source. Make sure you are extremely clear on how to build and run these tools.
A text file, called README, which describes several things:
- The general architecture of your solution. What functionality did you add to postgres, what functionality did you build outside of postgres, how do they interact.
- Your cost model. What are the inputs, what is the output, what is the justification for the cost model.
- Some sample runs of your system. Describe the workload you ran, and cut and paste the results of your tool showing what indexes were suggested. Provide some discussion as to why you think those suggestions are reasonable.

For submission:

Compress all the source files into a modified.tar file.
Compress all the related files into an addition.tar file.