Homework 1: Prefix Key Compression (Updated 9/28/2006)
Prologue: Important!
- For this
project, you must do your work on ilinux1.eecs.berkeley.edu or
ilinux2.eecs.berkeley.edu. Do not use any other instructional machines. If you
attempt to compile PostgreSQL on any instructional machines other than ilinux1 and
ilinux2, your code will break! You are welcome to attempt to install
PostgreSQL on your home computer and use it there, but we will offer no support for
such installations and you develop on them at your own risk.
- You must work in a team of 2 people.
If you do not have a teammate, email your TA immediately!
Overview
For homework 1, your group will be implementing prefix key
compression for B+-Trees in the PostgreSQL engine.
This assignment is intended to give you the experience of adding a
performance feature to a production DBMS engine (PostgreSQL), and
encourage you to develop tests to explore the benefits of your feature. Along the
way, you will get used to some software engineering issues in
debugging a server-based system, and get familiar with the details of
a real B+-tree implementation. You will also have the experience of
diving into a big "legacy" codebase, and figuring out what is worth focusing
on.
The feature you will add is prefix key compression, a simple
technique to reduce the size of keys in a B+-tree index. Key
compression increases node fanout, which in turn reduces index height,
leading directly to fewer I/Os during B+-tree operations. The method
behind prefix key compression is described in section 10.8.1 of your
textbook.
Your deliverable in this homework centers on a single
file, nbtinsert.c
, which will include your prefix key
compression code. To grade your homework, we will compile PostgreSQL
with your modifications, and test the result both for correctness (no
crashes, answers queries correctly) and for efficiency (maximal prefix
key compression as described in class and the book.)
Contents
The rest of this writeup contains information to get
you set up and installed,
and running your own copy of PostgreSQL.
Instructions are then given for the main
task: completing the implementation of prefix key
compression. We also provide information
on debugging your implementation,
browsing the PostgreSQL code base with CScope,
and testing it for correctness and efficiency.
Instructions for submitting your solution are
also included.
Setup
Before getting into the details of the assignment, you need to get
a PostgreSQL source tree you can use for compiling/debugging/etc.
Log into ilinux1.eecs or ilinux2.eecs, and type the following:
cd
~cs186/fa06/hw1/ConfigurePostgres.sh
After a lot of output, the following things should happen:
- There is a directory
pgsource
in your
home directory.
- The directory
~/pgsource/src/backend/access/nbtree/
should contain a file called nbtinsert.c
, which you
will be editing in this assignment.
Note: most of the PostgreSQL source code files will not in fact
appear in your pgsource
directory. The class master
source directory is in ~cs186/fa06/postgresql-8.1.4
. If
you find you want to look at other source files, or if you step into
code via the debugger, you will find the source files in the master
source directory.
Now you should be ready to compile and install your own copy of PostgreSQL.
Compiling and Installing PostgreSQL
Log into ilinux1.eecs or ilinux2.eecs, and type the following:
cd ~/pgsource
gmake clean
This will clean any compiled binaries from your source tree (this is
also a good command to use if you find yourself running out of disk
space, as it will free up a lot of room)
Once this completes, type:
gmake
This will automatically execute all commands needed to build PostgreSQL.
The last line of output from this command before you are once again
presented with a shell prompt should be the following:
If you do not see this text when gmake
finishes, it is
because an error has occurred, either because you ran out of disk space
or one or more files had compile-time errors. Examining the output
of gmake
will usually give you a fairly concise
explanation of the cause of the problem or problems that prevented
PostgreSQL from compiling successfully.
Once PostgreSQL has been compiled, it's time to install it. We've set
up your accounts so that all your binaries related to PostgreSQL will be
installed into a pgsql
directory in your home directory.
To install PostgreSQL, type the following (while still in the
pgsource
directory):
gmake install
This will copy the binary files created by the compilation process
into the pgsql
directory. The last line of output from
this command should be the following:
PostgreSQL installation complete.
If you didn't see this line, it's probably because you ran out of
disk space. See "What if I run out of disk space?" below.
To make sure that everything went smoothly, type the following:
which pg_ctl
You should see something like this:
/home/cc/cs186/fa06/class/<yourlogin>/pgsql/bin/pg_ctl
If you do not, log out and log back in again. If which pg_ctl
is
still not outputting a string of the correct form, please post your problem
to the newsgroup.
We have made a screen dump of the install and setup
process for your reference.
Getting PostgreSQL Running
Now that you have PostgreSQL built, you must initialize a
database. To do this, type
initdb
This will initialize storage for the DBMS within your home
directory in the pgdata
directory.
The last output
you see from the command should look something like this
Now we must start the server. To do this, type
pg_ctl start
You should see output similar to the following:
pg_ctl
starts a running PostgreSQL process in the
background, so even though messages are printing to the terminal, you
should still be able to enter commands as you normally would. Simply
hit Return and you'll be presented with a command prompt again.
Now that we've started the server, it's time to create a database.
Type the following:
createdb
This creates a database named after your login. If you want to give
your database a name, you can give that name as an argument to
createdb
, but you will probably find it easier to
interact with a default-named database.
Now we can start interacting with the database. To do so, type the
following:
psql
If you gave your database a non-default name, you'll have to specify
that name as an argument to psql
. From here, things
should look familiar from HW0; it is through this familiar interface
that you can query the database.
To stop the server at any time, type
pg_ctl stop
A screen-dump of this process is
included for your reference.
Your First Task: Implementing Prefix Key Compression
For this project, you will be editing the
_bt_prefixKeyCompress
function in the file
~/pgsource/src/backend/access/nbtree/nbtinsert.c
. You
should not need to modify any other functions for this
project.
We have modified the standard version of nbtinsert.c
to include a skeleton of your solution. We are making life simpler by
looking at a special case: single-column indexes of SQL type
text
(SQL's
variable-length string data type). The compression logic in
_bt_prefixKeyCompress
is invoked only in
this special case. For any other index key configuration, the code
you write should simply not get called -- and certainly the system
should never crash on other index configurations..
In general, when a B+-tree leaf node split takes place, half the
data entries on the original node are moved onto a new "righthand"
node -- this happens in a routine called _bt_split()
in
nbtinsert.c
which you may want to examine. The smallest
entry in the resulting righthand node (which would ordinarily be
"copied" up unchanged to the parent node during split, as discussed in
class and in the book) will have its key prefix-compressed before the
copy occurs, and PostgreSQL stashes a version of the resulting
compressed key in a special slot on the original page (the so-called
"high key" mentioned in the comments in _bt_split()
,
which is maintained for concurrency control reasons that
will not concern us in this homework). The compression is done by a
routine we call _bt_prefixKeyCompress()
. The arguments to
_bt_prefixKeyCompress
are:
-
Relation rel
, a data structure representing the
actual index file (which is of type Relation, actually, in
PostgreSQL).
BTItem lowItem
, the highest index key
remaining on the original leaf page after the split (i.e. the key just
below the one being compressed in alphabetical order)
BTItem highItem
, a fresh copy of the lowest index
key on the new rightmost node, which we can prefix-compress before
it gets copied up.
We have given you skeleton code in _btPrefixKeyCompress
that extracts
the actual text for the keys from the BTItem
data
structures for lowItem
and highItem
(which include both a key and a
pointer), to eventually generate two corresponding C char
*
pointers lowp
and highp
. Given
these, you need to do three things:
- Figure out how short you can truncate the string pointed to by
highp
by comparing it to lowp
. You want
the length to be just long enough to distinguish the two!
- Update the length field of the
highText
struct
to set it to the length you computed in the previous step. See comments in the code about
including 4 bytes for the vl_len
space!
- Set the
toReturn
variable to the absolute difference in length
between the high key pre- and post-compression.
As background, you may want to read through the code where
_bt_PrefixKeyCompress
is called, and generally poke around in
nbtinsert.c
. You can look for comments that say "CS186" for things we
added to nbtinsert.c
to support prefix key compression and debugging.
We have also provided code to output the contents of the B+-tree
as text:
_btdump(Relation r)
takes a B+-tree
Relation
struct, and outputs its pages in whatever
order they physically appear in the file. This is available for you
to call from the debugger (e.g. by typing print
_btdump(rel)
from a breakpoint in
_bt_prefixKeyCompress
). Beware that it generates a lot
of output!
_btdumppage(Relation r, Buffer buf)
is the inner
loop of _btdump
that prints a particular index page from
the buffer pool.
- We have put a call to
_btdumppage()
into the routine _bt_insertonpg()
in
nbtinsert.c
, so that you can see the effects of
your compression as internal index keys are generated and
inserted. (If you want to call _btdumppage
from the
debugger, you first have to pin your page into the buffer pool via
the ReadBuffer()
function; see the code for
_btdump
for details.)
How do I know it's working? Testing your Code
If your key compression is effective, you should be able to see a
performance benefit in terms of the number of I/Os needed to service a
query over the index. PostgreSQL can allow you to observe the number of
buffer pool page requests as follows:
- In your
~/pgdata
directory, find a file named
postgresql.conf
- Edit that file, and search for a line that looks like:
Change it to say:
- Restart your postgres configuration by saying
pg_ctl restart
.
When you next start psql
, after each query you should be
presented with executor statistics. The one you are looking for is
prefaced by the phrase Shared blocks
-- these are the
number of I/O requests through the buffer pool made by your query.
The larger the keys, the more I/O requests you'll see. For fun you
can also observe the elapsed time, however since we're dealing with
small databases (since the course accounts have limited disk quotas),
you may not notice much of an absolute performance difference at this scale.
If you should want to turn off the executor statistics reporting,
repeat the instructions above but change the value back to
off
in the configuration file.
Grading your code
We will test your code on its ability to compress keys. We will also
check to see that your index still works properly. You can verify these
properties yourself by creating a table and an index, and loading some
data into your database. The combination of debug messages from
_bt_prefixKeyCompress
and output of
_bt_dumppage
should help you validate your implementation.
We have provided sample data for you to experiment with. To load it,
proceed as follows:
- if you have not done so already, start up PostgreSQL via
pg_ctl start
, and run createdb
as
described above.
- to create and load an index, run the following:
(that's a space between the single-quotes in the last line.)
You will see a lot of debugging output! It will be saved for you in a
file indexscript
. You can examine the debugging output
there to see the index page dumps, and any key compression that is
happening.
Important note: There are 3 lines in the
_bt_prefixKeyCompress()
routine that must remain unchanged in order for the autograder to
evaluate your code. They are clearly marked in the code. Be sure not
to modify or delete them!
Browsing the PostgreSQL Code Base with cs186scope
As was discussed at the debugging help session, CScope is a tool
that allows you to browse a large collection of C code easily. Since
instructional disk space places strict limitations on how much code
you can have in your home directory at any time, your pgsource
only contains actual copies of the file you will be editing. Because
of this, you will need to use our version of CScope, called CS186Scope,
to browse the PostgreSQL code base. To use CS186Scope, type the following
command:
cs186scope
Please note that this version of cscope has indexed PostgreSQL
without your modifications. As such, once you add lines to
nbtinsert.c
, CS186Scope might not point to the
correct line in that file. All other files should be unaffected.
Debugging PostgreSQL
PostgreSQL runs in a client-server fashion, meaning that a user doesn't
typically run a postgres
process (the PostgreSQL
server process), they run a client process like the
psql
command-line-interface which talks to a postgres
process via
inter-process communication. This complicates a number of techniques
you're probably used to using:
- Generating output to the screen.
If you put
printf()
's in your code, they will not be
visible from the psql
client. PostgreSQL provides a special
function called elog()
to get messages from the
postgres
server process to appear at the client (even if it's across
a network, actually.) To insert debugging statements in your
code, you should use the elog()
function, with the first argument being
DEBUG1
. Example use of elog()
is given in the nbtinsert.c
file
you'll be editing. Note that elog()
takes a message string as its
main argument; to construct such a string you may want to use the
sprintf()
routine; type man sprintf
at a shell prompt for more
information, or look at the sample code in nbtinsert.c
.
- Debugging the postgres server
If you run
psql
, it will cause a postgres
server process to be started on
your behalf. To debug that process, you need to start gdb, and say
attach XXX
where XXX is the processID (number) of the
corresponding postgres
process. What is that number, you ask?
Assuming you're only running one psql process, it is pretty
easy to find:
- While
psql
is running, open another shell window on the same
server and type
ps -ef | grep <yourlogin> | grep post
Look for a line something like
- The 2nd entry of
that line (29545 in the example above) is the process ID of the
process you want to attach to;
call it XXX for the moment.
In your new shell window, run
gdb postgres
, and when you
get a gdb prompt, type attach XXX
.
You will pause the process you attach to, which will allow you to
set whatever breakpoints you like. Then you can type continue
to
gdb to let the postgres process keep working; it will behave as if
you started it from the debugger. If you want to leave the debugger
without killing the process, you can issue a detach
command to gdb
.
What If I Run Out of Disk Space?
- Delete any unnecessary files (i.e. those not prudent for HW1)
from your home directory. Space on the instructional machines is
tight, so every kilobyte counts!
- Try deleting your
pgdata
directory, executing your
commands, and then recreating the pgdata
directory with
initdb
. Note that all information in the database will
be deleted.
- If you have already installed PostgreSQL and run out of disk space
during execution of
initdb
, try running
gmake clean
from within your pgsource
directory. That should free up some space.
Submission
Once you have things working to your satisfaction, use the following
steps to submit your solution:
- One member of your team should create a directory at the
top level of their class account called
hw1
.
- Copy your completed
~/pgsource/src/backend/access/nbtree/nbtinsert.c
file to
that directory.
- Also create a text file called
README
that
contains two lines of text, each of the form
full name, class account
one line for each member of your team!
- If you are going for extra credit, add
a third line to your
README
file that says
extra credit
and put your extra credit implementation in
a file called nbtinsert_EXTRACREDIT.c
.
cd
into your hw1
directory and type
submit hw1
.
- To submit the extra credit, type
submit hw1-extracredit
Extra Credit: i18n
An important practical issue in building "real" software is to deal with
character sets for different languages. This is often referred to as
"internationalization", or "i18n" for short (the 18 refers to the 18
characters omitted from the short version --
i.e. "nternationalizatio".)
PostgreSQL supports i18n in its text
data type, but we have
set up the class accounts to disable that support. This is done via
an environment variable called
LC_ALL
, which we set in your accounts to use an i18n
locale called C
that follows the default rules of the C programming language. In
particular, it ensures that
the text sort ordering (called collation in i18n-speak) follows the rules of the C
function strcmp()
. This means that the keys in your
B+-tree will be ordered alphabetically according to the rules of
strcmp()
, which you can read about in the corresponding
manual page.
However, on our server machines, the default locale is a US
English configuration called en_US.UTF-8
, which is an
example of a unicode
locale. In these
locales, the collation rules are different than in C. You can
read about unicode
collation on the web -- look especially at the discussion of Multi-level
Comparison.
For extra credit, implement prefix compression code that works
correctly regardless of whether the system is configured for
the C
locale or the en_US.UTF-8
locale. To
try your solution under the en_US.UTF-8
locale,
you need to delete your database and re-initialize with the --locale argument:
pg_ctl stop
/bin/rm ~/pgdata
(beware of typos!)
initdb --locale=en_US.UTF-8
You may want to look in the file
~cs186/fa06/postgresql-8.1.4/src/backend/utils/adt/varlena.c
which contains
all the type-specific logic for the SQL text
data type
(also known in the source as a variable-length array, or
varlena
.)
If you are going for extra credit, add a line to that effect to your README file as
described above.
Update You should be submitting your extra credit implementation in a separate
file. See Submission for details.
Frequently Asked Questions
Frequently asked questions will be posted here as they arise. Also
be sure to check the class newsgroup and blog regularly.
Q: I'm getting the following error when running initdb
:
ilinux1 [11] ~/pgsource > initdb
The files belonging to this database system will be owned by user "cs186-dh".
This user must also own the server process.
The database cluster will be initialized with locale C.
initdb: directory "/home/cc/cs186/fa06/class/cs186-dh/pgdata" exists but is not empty
If you want to create a new database system, either remove or empty
the directory "/home/cc/cs186/fa06/class/cs186-dh/pgdata" or run initdb
with an argument other than "/home/cc/cs186/fa06/class/cs186-dh/pgdata".
What's wrong?
A: You need to delete your pgdata
directory. To
do so, type rm -rf ~/pgdata
. If it prompts you for file
deletions, you can stop this by typing unalias rm
, then
re-executing the aforementioned command.
Q: I can't remove pgdata because it says some files are
in use. Also, I can't pg_ctl stop
because it says it can't
find the server. What should I do?
A: Kill the postmaster. See below.
Q: How do I kill the postmaster?
A:First, you have to locate the postmaster
process by executing ps -ef | grep $USER
. The output of that command
should look something like this:
ilinux1 [4] ~ > ps -ef | grep $USER
root 22918 2317 0 17:57 ? 00:00:00 sshd: cs186-em [priv]
cs186-em 22920 22918 0 17:57 ? 00:00:00 sshd: cs186-em@pts/3
cs186-em 22921 22920 0 17:57 pts/3 00:00:00 -tcsh
cs186-em 22982 1 1 17:58 pts/3 00:00:00 /home/cc/cs186/fa06/class/cs186-em/pgsql/bin/postmaster
cs186-em 22984 22982 0 17:58 pts/3 00:00:00 postgres: writer process
cs186-em 22985 22982 0 17:58 pts/3 00:00:00 postgres: stats buffer process
cs186-em 22986 22985 0 17:58 pts/3 00:00:00 postgres: stats collector process
cs186-em 23018 22921 0 17:58 pts/3 00:00:00 ps -ef
cs186-em 23019 22921 0 17:58 pts/3 00:00:00 grep cs186-em
The postmaster process is highlighted above. Its process ID
(indicated by the first column of numbers in the output) is the argument
you'll give to the kill
command. Run it like this:
kill
where is replaced by the process ID to
kill (in this case, 22982)