Home

Basic Info

Blog

Newsgroup

Meetings

Homeworks

Lecture Notes

Syllabus

Homework 1: Prefix Key Compression (Updated 9/28/2006)

Prologue: Important!

For this project, you must do your work on ilinux1.eecs.berkeley.edu or ilinux2.eecs.berkeley.edu. Do not use any other instructional machines. If you attempt to compile PostgreSQL on any instructional machines other than ilinux1 and ilinux2, your code will break! You are welcome to attempt to install PostgreSQL on your home computer and use it there, but we will offer no support for such installations and you develop on them at your own risk.
You must work in a team of 2 people. If you do not have a teammate, email your TA immediately!

Overview

For homework 1, your group will be implementing prefix key compression for B+-Trees in the PostgreSQL engine.

This assignment is intended to give you the experience of adding a performance feature to a production DBMS engine (PostgreSQL), and encourage you to develop tests to explore the benefits of your feature. Along the way, you will get used to some software engineering issues in debugging a server-based system, and get familiar with the details of a real B+-tree implementation. You will also have the experience of diving into a big "legacy" codebase, and figuring out what is worth focusing on.

The feature you will add is prefix key compression, a simple technique to reduce the size of keys in a B+-tree index. Key compression increases node fanout, which in turn reduces index height, leading directly to fewer I/Os during B+-tree operations. The method behind prefix key compression is described in section 10.8.1 of your textbook.

Your deliverable in this homework centers on a single file, nbtinsert.c, which will include your prefix key compression code. To grade your homework, we will compile PostgreSQL with your modifications, and test the result both for correctness (no crashes, answers queries correctly) and for efficiency (maximal prefix key compression as described in class and the book.)

The rest of this writeup contains information to get you set up and installed, and running your own copy of PostgreSQL. Instructions are then given for the main task: completing the implementation of prefix key compression. We also provide information on debugging your implementation, browsing the PostgreSQL code base with CScope, and testing it for correctness and efficiency. Instructions for submitting your solution are also included.

Setup

Before getting into the details of the assignment, you need to get a PostgreSQL source tree you can use for compiling/debugging/etc.

Log into ilinux1.eecs or ilinux2.eecs, and type the following:

cd
~cs186/fa06/hw1/ConfigurePostgres.sh

After a lot of output, the following things should happen:

There is a directory pgsource in your home directory.
The directory ~/pgsource/src/backend/access/nbtree/ should contain a file called nbtinsert.c, which you will be editing in this assignment.

Note: most of the PostgreSQL source code files will not in fact appear in your pgsource directory. The class master source directory is in ~cs186/fa06/postgresql-8.1.4. If you find you want to look at other source files, or if you step into code via the debugger, you will find the source files in the master source directory.

Now you should be ready to compile and install your own copy of PostgreSQL.

Compiling and Installing PostgreSQL

Log into ilinux1.eecs or ilinux2.eecs, and type the following:

cd ~/pgsource
gmake clean

This will clean any compiled binaries from your source tree (this is also a good command to use if you find yourself running out of disk space, as it will free up a lot of room)

Once this completes, type:

gmake

This will automatically execute all commands needed to build PostgreSQL. The last line of output from this command before you are once again presented with a shell prompt should be the following:

If you do not see this text when gmake finishes, it is because an error has occurred, either because you ran out of disk space or one or more files had compile-time errors. Examining the output of gmake will usually give you a fairly concise explanation of the cause of the problem or problems that prevented PostgreSQL from compiling successfully.

Once PostgreSQL has been compiled, it's time to install it. We've set up your accounts so that all your binaries related to PostgreSQL will be installed into a pgsql directory in your home directory.

To install PostgreSQL, type the following (while still in the pgsource directory):

gmake install

This will copy the binary files created by the compilation process into the pgsql directory. The last line of output from this command should be the following:

PostgreSQL installation complete.

If you didn't see this line, it's probably because you ran out of disk space. See "What if I run out of disk space?" below.

To make sure that everything went smoothly, type the following:

which pg_ctl

You should see something like this:

/home/cc/cs186/fa06/class/<yourlogin>/pgsql/bin/pg_ctl

If you do not, log out and log back in again. If which pg_ctl is still not outputting a string of the correct form, please post your problem to the newsgroup.

We have made a screen dump of the install and setup process for your reference.

Getting PostgreSQL Running

Now that you have PostgreSQL built, you must initialize a database. To do this, type

initdb

This will initialize storage for the DBMS within your home directory in the pgdata directory.

The last output you see from the command should look something like this

Now we must start the server. To do this, type

pg_ctl start

You should see output similar to the following:

pg_ctl starts a running PostgreSQL process in the background, so even though messages are printing to the terminal, you should still be able to enter commands as you normally would. Simply hit Return and you'll be presented with a command prompt again.

Now that we've started the server, it's time to create a database. Type the following:

createdb

This creates a database named after your login. If you want to give your database a name, you can give that name as an argument to createdb, but you will probably find it easier to interact with a default-named database.

Now we can start interacting with the database. To do so, type the following:

psql

If you gave your database a non-default name, you'll have to specify that name as an argument to psql. From here, things should look familiar from HW0; it is through this familiar interface that you can query the database.

To stop the server at any time, type

pg_ctl stop

A screen-dump of this process is included for your reference.

Your First Task: Implementing Prefix Key Compression

For this project, you will be editing the _bt_prefixKeyCompress function in the file ~/pgsource/src/backend/access/nbtree/nbtinsert.c. You should not need to modify any other functions for this project.

We have modified the standard version of nbtinsert.c to include a skeleton of your solution. We are making life simpler by looking at a special case: single-column indexes of SQL type text (SQL's variable-length string data type). The compression logic in _bt_prefixKeyCompress is invoked only in this special case. For any other index key configuration, the code you write should simply not get called -- and certainly the system should never crash on other index configurations..

In general, when a B+-tree leaf node split takes place, half the data entries on the original node are moved onto a new "righthand" node -- this happens in a routine called _bt_split() in nbtinsert.c which you may want to examine. The smallest entry in the resulting righthand node (which would ordinarily be "copied" up unchanged to the parent node during split, as discussed in class and in the book) will have its key prefix-compressed before the copy occurs, and PostgreSQL stashes a version of the resulting compressed key in a special slot on the original page (the so-called "high key" mentioned in the comments in _bt_split(), which is maintained for concurrency control reasons that will not concern us in this homework). The compression is done by a routine we call _bt_prefixKeyCompress(). The arguments to _bt_prefixKeyCompress are:

Relation rel, a data structure representing the actual index file (which is of type Relation, actually, in PostgreSQL).
BTItem lowItem, the highest index key remaining on the original leaf page after the split (i.e. the key just below the one being compressed in alphabetical order)
BTItem highItem, a fresh copy of the lowest index key on the new rightmost node, which we can prefix-compress before it gets copied up.

We have given you skeleton code in _btPrefixKeyCompress that extracts the actual text for the keys from the BTItem data structures for lowItem and highItem (which include both a key and a pointer), to eventually generate two corresponding C char * pointers lowp and highp. Given these, you need to do three things:

Figure out how short you can truncate the string pointed to by highp by comparing it to lowp. You want the length to be just long enough to distinguish the two!
Update the length field of the highText struct to set it to the length you computed in the previous step. See comments in the code about including 4 bytes for the vl_len space!
Set the toReturn variable to the absolute difference in length between the high key pre- and post-compression.

As background, you may want to read through the code where _bt_PrefixKeyCompress is called, and generally poke around in nbtinsert.c. You can look for comments that say "CS186" for things we added to nbtinsert.c to support prefix key compression and debugging.

We have also provided code to output the contents of the B+-tree as text:

_btdump(Relation r) takes a B+-tree Relation struct, and outputs its pages in whatever order they physically appear in the file. This is available for you to call from the debugger (e.g. by typing print _btdump(rel) from a breakpoint in _bt_prefixKeyCompress). Beware that it generates a lot of output!
_btdumppage(Relation r, Buffer buf) is the inner loop of _btdump that prints a particular index page from the buffer pool.
We have put a call to _btdumppage() into the routine _bt_insertonpg() in nbtinsert.c, so that you can see the effects of your compression as internal index keys are generated and inserted. (If you want to call _btdumppage from the debugger, you first have to pin your page into the buffer pool via the ReadBuffer() function; see the code for _btdump for details.)

How do I know it's working? Testing your Code

If your key compression is effective, you should be able to see a performance benefit in terms of the number of I/Os needed to service a query over the index. PostgreSQL can allow you to observe the number of buffer pool page requests as follows:

In your ~/pgdata directory, find a file named postgresql.conf
Edit that file, and search for a line that looks like: #log_executor_stats = off Change it to say: log_executor_stats = on
Restart your postgres configuration by saying
pg_ctl restart.

When you next start psql, after each query you should be presented with executor statistics. The one you are looking for is prefaced by the phrase Shared blocks -- these are the number of I/O requests through the buffer pool made by your query. The larger the keys, the more I/O requests you'll see. For fun you can also observe the elapsed time, however since we're dealing with small databases (since the course accounts have limited disk quotas), you may not notice much of an absolute performance difference at this scale.

If you should want to turn off the executor statistics reporting, repeat the instructions above but change the value back to off in the configuration file.

Grading your code

We will test your code on its ability to compress keys. We will also check to see that your index still works properly. You can verify these properties yourself by creating a table and an index, and loading some data into your database. The combination of debug messages from _bt_prefixKeyCompress and output of _bt_dumppage should help you validate your implementation. We have provided sample data for you to experiment with. To load it, proceed as follows:

if you have not done so already, start up PostgreSQL via pg_ctl start, and run createdb as described above.
to create and load an index, run the following: script indexscript psql CREATE TABLE dict (id int4, word text); CREATE INDEX dictix on dict(word); COPY dict FROM '/home/ff/cs186/fa06/words.txt' WITH DELIMITER AS ' '; \q exit (that's a space between the single-quotes in the last line.)

You will see a lot of debugging output! It will be saved for you in a file indexscript. You can examine the debugging output there to see the index page dumps, and any key compression that is happening.

Important note: There are 3 lines in the _bt_prefixKeyCompress() routine that must remain unchanged in order for the autograder to evaluate your code. They are clearly marked in the code. Be sure not to modify or delete them!

Browsing the PostgreSQL Code Base with cs186scope

As was discussed at the debugging help session, CScope is a tool that allows you to browse a large collection of C code easily. Since instructional disk space places strict limitations on how much code you can have in your home directory at any time, your pgsource only contains actual copies of the file you will be editing. Because of this, you will need to use our version of CScope, called CS186Scope, to browse the PostgreSQL code base. To use CS186Scope, type the following command:

cs186scope

Please note that this version of cscope has indexed PostgreSQL without your modifications. As such, once you add lines to nbtinsert.c, CS186Scope might not point to the correct line in that file. All other files should be unaffected.

Debugging PostgreSQL

PostgreSQL runs in a client-server fashion, meaning that a user doesn't typically run a postgres process (the PostgreSQL server process), they run a client process like the psql command-line-interface which talks to a postgres process via inter-process communication. This complicates a number of techniques you're probably used to using:

Generating output to the screen. If you put printf()'s in your code, they will not be visible from the psql client. PostgreSQL provides a special function called elog() to get messages from the postgres server process to appear at the client (even if it's across a network, actually.) To insert debugging statements in your code, you should use the elog() function, with the first argument being DEBUG1. Example use of elog() is given in the nbtinsert.c file you'll be editing. Note that elog() takes a message string as its main argument; to construct such a string you may want to use the sprintf() routine; type man sprintf at a shell prompt for more information, or look at the sample code in nbtinsert.c.
Debugging the postgres server If you run psql, it will cause a postgres server process to be started on your behalf. To debug that process, you need to start gdb, and say attach XXX where XXX is the processID (number) of the corresponding postgres process. What is that number, you ask? Assuming you're only running one psql process, it is pretty easy to find:
- While psql is running, open another shell window on the same server and type
  ps -ef | grep <yourlogin> | grep post
  Look for a line something like <yourlogin> 29545 19588 0 22:01 pts/2 00:00:00 postgres: <yourlogin> <yourlogin> [local]
- The 2nd entry of that line (29545 in the example above) is the process ID of the process you want to attach to; call it XXX for the moment.
In your new shell window, run gdb postgres, and when you get a gdb prompt, type attach XXX. You will pause the process you attach to, which will allow you to set whatever breakpoints you like. Then you can type continue to gdb to let the postgres process keep working; it will behave as if you started it from the debugger. If you want to leave the debugger without killing the process, you can issue a detach command to gdb.

What If I Run Out of Disk Space?

Delete any unnecessary files (i.e. those not prudent for HW1) from your home directory. Space on the instructional machines is tight, so every kilobyte counts!
Try deleting your pgdata directory, executing your commands, and then recreating the pgdata directory with initdb. Note that all information in the database will be deleted.
If you have already installed PostgreSQL and run out of disk space during execution of initdb, try running gmake clean from within your pgsource directory. That should free up some space.

Submission

Once you have things working to your satisfaction, use the following steps to submit your solution:

One member of your team should create a directory at the top level of their class account called hw1.
Copy your completed ~/pgsource/src/backend/access/nbtree/nbtinsert.c file to that directory.
Also create a text file called README that contains two lines of text, each of the form
full name, class account
one line for each member of your team!
If you are going for extra credit, add a third line to your README file that says
extra credit and put your extra credit implementation in a file called nbtinsert_EXTRACREDIT.c.
cd into your hw1 directory and type
submit hw1.
To submit the extra credit, type submit hw1-extracredit

Extra Credit: i18n

An important practical issue in building "real" software is to deal with character sets for different languages. This is often referred to as "internationalization", or "i18n" for short (the 18 refers to the 18 characters omitted from the short version -- i.e. "nternationalizatio".)

PostgreSQL supports i18n in its text data type, but we have set up the class accounts to disable that support. This is done via an environment variable called LC_ALL, which we set in your accounts to use an i18n locale called C that follows the default rules of the C programming language. In particular, it ensures that the text sort ordering (called collation in i18n-speak) follows the rules of the C function strcmp(). This means that the keys in your B+-tree will be ordered alphabetically according to the rules of strcmp(), which you can read about in the corresponding manual page.

However, on our server machines, the default locale is a US English configuration called en_US.UTF-8, which is an example of a unicode locale. In these locales, the collation rules are different than in C. You can read about unicode collation on the web -- look especially at the discussion of Multi-level Comparison.

For extra credit, implement prefix compression code that works correctly regardless of whether the system is configured for the C locale or the en_US.UTF-8 locale. To try your solution under the en_US.UTF-8 locale, you need to delete your database and re-initialize with the --locale argument:
pg_ctl stop
/bin/rm ~/pgdata (beware of typos!)
initdb --locale=en_US.UTF-8

You may want to look in the file ~cs186/fa06/postgresql-8.1.4/src/backend/utils/adt/varlena.c which contains all the type-specific logic for the SQL text data type (also known in the source as a variable-length array, or varlena.)

If you are going for extra credit, add a line to that effect to your README file as described above.

Update You should be submitting your extra credit implementation in a separate file. See Submission for details.

Frequently Asked Questions

Frequently asked questions will be posted here as they arise. Also be sure to check the class newsgroup and blog regularly.

Q: I'm getting the following error when running initdb:

ilinux1 [11] ~/pgsource > initdb The files belonging to this database system will be owned by user "cs186-dh". This user must also own the server process. The database cluster will be initialized with locale C. initdb: directory "/home/cc/cs186/fa06/class/cs186-dh/pgdata" exists but is not empty If you want to create a new database system, either remove or empty the directory "/home/cc/cs186/fa06/class/cs186-dh/pgdata" or run initdb with an argument other than "/home/cc/cs186/fa06/class/cs186-dh/pgdata".

What's wrong?

A: You need to delete your pgdata directory. To do so, type rm -rf ~/pgdata. If it prompts you for file deletions, you can stop this by typing unalias rm, then re-executing the aforementioned command.

Q: I can't remove pgdata because it says some files are in use. Also, I can't pg_ctl stop because it says it can't find the server. What should I do?

A: Kill the postmaster. See below.

Q: How do I kill the postmaster?

A:First, you have to locate the postmaster process by executing ps -ef | grep $USER. The output of that command should look something like this:

ilinux1 [4] ~ > ps -ef | grep $USER root 22918 2317 0 17:57 ? 00:00:00 sshd: cs186-em [priv] cs186-em 22920 22918 0 17:57 ? 00:00:00 sshd: cs186-em@pts/3 cs186-em 22921 22920 0 17:57 pts/3 00:00:00 -tcsh cs186-em 22982 1 1 17:58 pts/3 00:00:00 /home/cc/cs186/fa06/class/cs186-em/pgsql/bin/postmaster cs186-em 22984 22982 0 17:58 pts/3 00:00:00 postgres: writer process cs186-em 22985 22982 0 17:58 pts/3 00:00:00 postgres: stats buffer process cs186-em 22986 22985 0 17:58 pts/3 00:00:00 postgres: stats collector process cs186-em 23018 22921 0 17:58 pts/3 00:00:00 ps -ef cs186-em 23019 22921 0 17:58 pts/3 00:00:00 grep cs186-em

The postmaster process is highlighted above. Its process ID (indicated by the first column of numbers in the output) is the argument you'll give to the kill command. Run it like this:

kill where is replaced by the process ID to kill (in this case, 22982)