In this assignment you will design and implement a tool to suggest potentially useful indexes for a Postgres database, based on the read and write workload on the database. For example, if a common query is "SELECT * FROM Emp WHERE Salary>100000" then it might be useful to create an index on Emp.Salary. On the other hand, if Emp.Salary is frequently updated, you will have to weigh the tradeoff of extra cost on update compared to the benefit to select queries.
The goal of this assignment is threefold: 1. to understand and modify the code of a real DBMS, 2. to think about index selection strategies, and 3. to design an architecture for a DBMS tool. In particular, you will modify the Postgres source code to gather the appropriate workload statistics, and write a tool that uses those statistics to suggest indexes to the DBA. You will also have to think carefully about how to decide whether to suggest a potential index.
This assignment is due May 8. While we expect that the modifications to the Postgres source may not be very extensive (potentially just touching a few source files), understanding the structure of the Postgres source and how to get the information you need will likely be challenging.
1. Obtain and build Postgres source
We suggest for this assignment that you work on ilinux1.eecs.berkeley.edu or ilinux2.eecs.berkeley.edu. Once you start using ilinux1 or ilinux2 you should continue to use the same machine, as binaries compiled on one machine may not work properly on the other.
The full source and build of Postgres is 160 MB or more, which is too large to fit in your home directory given student quotas. So you will be doing most of your work in /home/tmp, which is a large filesystem. Because /home/tmp is shared between many classes, and we want to be good citizens, we strongly suggest that you work in pairs for this assignment, and that only one of you uncompress and build the postgres source. Thus, you will have to work together to write code in this shared code tree.
Warning: Make sure that you prefix all of the postgres commands with the appropriate path, to make sure you pick up the commands you have built and not the standard class postgres instance. For example:
    ./postgres
will run the postgres command in the current directory, while
    postgres
will likely run the postgres command from your path, which won't be the one you built.
To obtain and build the postgres source:
    /share/b/bin/mkhometmpdir
If you have problems running this command on ilinux1 or ilinux2, try logging into pulsar.eecs.berkeley.edu and running it there. That is, do the project on one of the linux machines, but carry out just this step of making the directory on pulsar.
After you have done this step, you should have a directory called /home/tmp/cs186-xx, where cs186-xx is your login.
    cd /home/tmp/cs186-xx
    tar xzvf ~cs186/postgresql-8.3.1.tar.gz
This will create a directory called "postgresql-8.3.1" in your /home/tmp/cs186-xx directory, containing all the source.
The files in /home/tmp should not get deleted, but just to be safe you should periodically copy any files you edit to your home directory as a backup.
    cd /home/tmp/cs186-xx/postgresql-8.3.1
    ./configure --prefix=/home/tmp/cs186-xx/pgbin --enable-debug
This step will create the necessary configuration scripts and makefiles to build postgres.
    gmake
If all goes well, at the end you should get a message that says: "All of PostgreSQL successfully made. Ready to install."
    gmake install
If successful, you should see the message: "PostgreSQL installation complete."
    mkdir /home/cc/cs186/sp08/class/cs186-xx/pgdata
    cd /home/tmp/cs186-xx/pgbin/bin
    ./initdb -D /home/cc/cs186/sp08/class/cs186-xx/pgdata
Note that this will require about 33 MB in your home directory, so clear out files from previous assignments if necessary.
    cd /home/tmp/cs186-xx/pgbin/bin
    ./postgres -D /home/cc/cs186/sp08/class/cs186-xx/pgdata
Now, from another shell, you should be able to run standard postgres commands:
    cd /home/tmp/cs186-xx/pgbin/bin
    ./createdb test
    ./psql test
2. Modify Postgres
Next, we suggest you make a small change to Postgres, such as printing out "Hello world" from the server or something when a query is run. Make sure you can make, install and run your modified server. You will need to repeat the "gmake; gmake install" commands, and shut down and restart your running version of postgres.
For this assignment, when you turn in modified source files, you should clearly mark your changes by surrounding them with comments:
    /* BEGIN HW3 */
    your additions, modifications, deletions to code ...
    /* END HW3 */
This will help us find the relevant portions of code.
3. Gather statistics and suggest indexes

You should track the number of times Postgres does a sequential scan of a table, the number of tuples inserted into or deleted from the table, and the number of times a particular attribute of a tuple in the table is updated. Using these statistics (and any other information you need to collect), your tool should suggest zero or more indexes to be created. This suggestion will require a "cost model" that weighs the potential benefits of an index against its costs (more on this below). The tool can be run offline; that is, the tool that suggests indexes does not itself have to be part of the Postgres server process. However, it must use real statistics from plans that are actually executed against the postgres backend.
Your cost model is really just a formula that determines whether an index will produce a net benefit, or net cost, to the performance of the system. Ideally, you would count the number of I/O's done by plans before and after using the index. However, for the purposes of this assignment, you can use a simpler cost model based on the number of tuples examined under alternative plans. Part of this assignment is to design a simple but effective cost model, gather the necessary statistics from the execution of postgres to feed this cost model, and write a tool that uses these statistics and the cost model to suggest indexes.
Postgres is implemented in C and you will need to modify it by writing some C. Any external tools you write to parse runtime statistics, gather additional statistics, and suggest indexes can be in your language of choice, as long as you provide clear instructions on how to compile and run your tool.
Also, you should run some test workloads against your modified server that include both reads and writes. After running this workload, you should run your tool to make sure the appropriate indexes are being selected.
4. Some hints
To see the running server processes, run:
    ps ux | grep postgres
You'll see something like:
    cs186-md 17756 0.4 0.0 45468 3372 pts/8 S  21:38 0:00 ./postgres -D /home/cc/cs186/sp08/class/cs186-md/pgdata/
    cs186-md 17758 0.0 0.0 45468  976 ?     Ss 21:38 0:00 postgres: writer process
    cs186-md 17759 0.0 0.0 45468  816 ?     Ss 21:38 0:00 postgres: wal writer process
    cs186-md 17760 0.0 0.0 45468  964 ?     Ss 21:38 0:00 postgres: autovacuum launcher process
    cs186-md 17761 0.0 0.0  8156  868 ?     Ss 21:38 0:00 postgres: stats collector process
    cs186-md 17864 0.0 0.0 46276 2704 ?     Ss 21:40 0:00 postgres: cs186-md test [local] idle
Postgres modifies the process name so that you can see more clearly what is going on. In the example above, process 17864 (labeled "postgres: cs186-md test [local] idle") is the process spawned to handle your queries. You can attach to this process using:
    gdb postgres
and then, in gdb, typing:
    (gdb) attach 17864
This will interrupt the process and allow you to use gdb to step through it and conduct other debugging tasks. There are many good tutorials for gdb on the web if you are not familiar with that tool.
You need to turn in several things: