University of California, Berkeley

CS 186 Spring 2007

Homework Project 2 - Implementing B-Trees

Due Tuesday, March 6, 2007 10:00 pm


 [Homeworks]       [Syllabus and Lecture Notes]      [Resources

Introduction

An unclustered B+-tree index consists of a collection of records of the form (key, rid), where key is a value for the search key of the index, and rid is the record id of the record being indexed. (It is an index based on what is described as Alternative 2 in the textbook.)

The nodes in a B+-tree can be divided into two categories: internal index nodes and leaf nodes. The leaf nodes store the actual (key, rid) pairs, while the internal nodes store key values and the page identifiers (pageids) of their children.

You should read the textbook carefully and get familiarized with the B+ Tree indexing structure and algorithm.

The files you need can be found here.

Background:

The space manager

The homework 0 assignment discussed heap files, and homework 1 dealt with the buffer manager.  At this point you should be familiar with the Minibase class' static methods for creating/opening databases, and for setting up a buffer manager.  As you may recall, the diskmgr.DB class is responsible for maintaining files, allocation/deallocation of pages and performing I/O. The dismgr.Page class just holds an array of bytes of size PAGE_SIZE, and the BufMgr is responsible for maintaining "frequently" used pages in memory. In this assignment we are going to build a B+ Tree on top of these abstractions.

As described in the "extra credit" section of homework 0, a HeapFile contains a bunch of data records in no particular order.  In class we discussed how often it is desirable to have an index that lets us find records with a particular value without having to scan the entire HeapFile.  In this assignment you will be implementing a BTree index that supplements a HeapFile, holding key values and record IDs to allow quick access to records with particular values.  This B-Tree will be implemented using some of the same classes as HeapFiles, since it must also be able to be stored in pages on disk, moved into memory as needed.

Class Summary

For this assignment, you'll only need to implement methods in 3 classes: BTreeFile, BTFileScan, and BTTest.  You'll be working with other classes, including the following:
The only files you will need to change are BTFileScan, BTreeFile, and BTTest.  Every other class should work correctly as provided.

Buffer Manager Notes

You should be very familiar with the Buffer Manager right now. In this assignment you will have a chance to use what you have implemented in the previous assignment.

You will use many Pin and Unpin operations in this assignment. In general, use pin() when you have a PageId, and you want to read in the page from disk into memory. After you use the page, you must remember to unpin() it with the correct dirty flag.  Please Note: the BTHeaderPage and BTSortedPage constructors pin pages when you pass in a PageId.  Thus, if you have the PageId of a header, leaf or index page, call the appropriate constructor to pin the page, and unpin the page when you're done with it.  Take a look at the BTreeFile.printTree() method for an example of traversing the tree, turning PageIds into Index and Leaf pages, and unpinning the pages when done with them.

The buffer manager will also give you a new page (using newPage()) when ever you need one (e.g. when you create a B+ Tree). Remember to free the page when you do not need it anymore (e.g. where you want to delete the B+ Tree, or if the page is deleted from the B+ Tree), by calling freePage().

Implementation of B+ tree

A B+ Tree index is stored in MINIBASE as a database file.

The class IndexFile is an abstract class for indices. BTreeFile is a subclass of IndexFile, and implements the B+-tree in MINIBASE. The class BTreeFile maintains two types of nodes: BTLeafPage and BTIndexPage, which correspond to the leaf nodes and index (internal) nodes in a B+-tree.

The BTreeFile contains provides methods for inserting, deleting, and searching the tree. It does not contain the tree itself. It maintains the tree by keeping a pointer to the root of the tree which is a BTIndexPage. The BTreeFile class is responsible for maintaining the integrity of the tree. (i.e. when an insertion is made, it has to ensure that the new record is inserted in the correct place, or if a new entry is added to an index node, it has to ensure that there is sufficient room in the index node to hold the new entry, and splitting or merging nodes when they become too full or too empty).

The leaf and index nodes in a B+ Tree are implemented by the classes BTLeafPage and  BTIndexPage, respectively. They are both subclasses of BTSortedPage and are responsible for inserting into and deleting from leaf node and index node, respectively. These two classes are not responsible for the integrity of the tree.

For this assignment, the BTLeafPage and BTIndexPage are provided for you. We recommend that you study the source code for these two classes carefully. Their implementation can be found in homework zip file. You should not change these classes.

BTLeafPage

Each indexing entry that we insert or delete from a BTreeFile is a tuple (key, record id). These tuples are stored as records of type KeyEntry in BTLeafPage. To facilitate a scan through the leaf nodes, the BTLeafPages should be linked together as a doubly linked list.  These "pointers" in the link list are actually PageID of the previous page and next page. The two members corresponding to these pointers are called nextPage and prevPage. They can be set and retreived with  the HFPage methods setNextPage(), setPrevPage(), getNextPage() and getPrevPage().

Other functions provided for BTLeafPage are insertRecord and deleteRecord, which (as the name suggest) insert and delete a LeafEntry to/from the page. GetNext(), GetFirst() and GetCurrent() are also provided to scan through the records in a BTLeafPage. A typical use of these functions is shown below.

    RID rid;
KeyEntry s = page.getFirst(rid);
while (s != null)
{
// do something with s.key and s.data
s = page.getNext(rid);
}

BTIndexPage

The BTIndexPage contains a sequence <pid1, key1, pid2, key2, ..., keyk, pidk+1> (the semantics of this sequence are those that are described in the textbook). The prevPage member in BTIndexPage (also referred to as the leftlink or the leftmost child page) is used to store pid1 while (keyi, pidi+1) are stored as records of type KeyEntry in the BTIndexPage.

Just like in BTLeafPage, insert(), delete(), getFirst() and getNext() are provided for this class.

Scanning of a B+ Tree

We can also open a scan on a B+ Tree, and retrieve all records within a range of key values. The BTFilescan class is a data structure to keep track of the current cursor in the scan. It provides a getNext() method for retrieving the next record in the BTreeFile We can also delete the current record at the cursor while we are scanning, using the deleteCurrent() method.

Assignment requirements

In this assignment, you are required to implement two classes: BTreeFile and BTFileScan.  You will probably also want to extend BTTest to more thoroughly ensure that your BTree code is working correctly. 

To simplify the assignment, we assume that a BTreeFile only handles integer keys, and overflow or underflow in the B+ tree is handled using the split and merge algorithm. (You are not required to implement redistribution). However, keys may not be unique. But you can assume that all entries with the same key fit into one page.

The details for each of the functions is provided below. We also provide some sample code to illustrate the use of the classes we provide. However these samples are extremely simplified and do not reflect the actual coding that we expect from you. For example, we do not show any error checking in these samples. You are expected to write robust programs by checking the return code and signal errors when necessary.

BTreeFile constructor

The constructor for the BTreeFile takes in a filename, and checks if a file with that name already exists in the database. If the file exists, we "open" the file. Otherwise we create a new file with that name.

You can use DB.GetFileEntry() to check if the file exists, it returns either the PageId of the first page in the file, or null if the file doesn't exist.  In the case of a BTreeFile, the page that is returned will be the PageId of the header page, which contains information about the index and a pointer to the BTree root page.

Once we get the PageId of the header page, we call the BTHeaderPage constructor with the header PageId which pins the header page, this will caused the page to be read from disk into memory. The following example shows how the constructor might read an existing file and create the header page.

    // filename contains the name of the BTreeFile to be opened
PageId page = Minibase.JavabaseDB.get_file_entry(filename);
BTHeaderPage header = new BTHeaderPage(page);

If the B+ tree index doesn't exist yet, the no-argument constructor to BTHeaderPage() allocates, pins, and initializes a new header page.

BTreeFile.close()

The destructor of BTreeFile just "closes" the index. This includes unpinning any pages that are pinned. Note that it does not delete the file.

BTreeFile.destroyFile()

This method deletes the entire index file from the database. You need to free all pages allocated for this index file using DB.deallocatePage(). The file entry in the database also needs to be removed using DB.deleteFileEntry().

BTreeFile.insert()

This method inserts a pair (key, rid) into the B+ Tree Index. The actual pair (key, rid) is inserted into a leaf node. But this insertion may cause one or more (key, pid) pair to be inserted into index nodes. You should always check to see if the current node have enough space before you insert. If you don't have enough space, you have to split the current node by creating a new node, and copy some of the data over from the current node to the new node. When implementing duplicate handling, you also have to make sure that entries with the same key remain on the same node. Therefore, in this special case of handling duplicates, we might end up with a node that is less than half full after the insertion. Therefore, in order to keep all the duplicate keys in the same page, the criterion that requires a non-leaf page to be at least half full might be relaxed if splitting this page might force some of the multiple key copies with the same value end up on different pages. Have that in mind when you code your insert and delete operations.  In any case, splitting will cause a new entry to be added in the parent node.

Splitting of the root node should be considered separately, since if we have a new root, we may need to update the file entry to reflect the changes. Splitting of a leaf node should also be considered separately since the leaf nodes are linked as a link list.

Due to the complexity of this function, we recommend that you write separate private functions for different cases. For example, it is a good idea to write a function to insert into a leaf node, and a function to insert into an index node. The following shows some simplified code fragment that may be helpful.

Checking if there is enough space to insert a record into a leaf node :

    if (currentLeafPage.available_space() >= keyEntry.getSizeInBytes())
currentLeafPage.insert(keyEntry.key, (RID)keyEntry.getData());

Note: When splitting nodes on insert, you need to select the middle key from the page.  If the page has an even number of keys, round up when choosing the middle.  When splitting a node with duplicate keys, it is possible that the middle key and the key to the left are identical.  It is not permitted to split there, since that would put keys with the same value on different pages.  You must check to ensure that the split is done between keys of different values.  If the middle key has a duplicate to the left, ;you must search forward and backward to find the closest permissible split to the middle.

BTreeFile.delete()

This method deletes an entry (key, rid) from a leaf node. Deletion from a leaf node may cause one or more entries in the index node to be deleted. You should always check if a node becomes underflowed (less than 50% full) after deletion. If a node becomes underflowed, merging may occur. We can merge if the current node and one of its siblings (either left or right) are both undeflowed. After we merge, one entry in the parent node will need to be deleted. Note that you can only merge with a sibling, where a sibling is a node that has the same parent. Note also, that you might not be able to merge, because both siblings are not less than 50% full. You do not have to implement redistribution.

You should consider different scenarios separately (maybe write separate private functions for them). You should consider deletion from a leaf node and index node separately. Deletion from the root should also be consider separately (what happens if the root becomes empty after some deletion, but there are still some child nodes?)

The following code fragment may be helpful. Checking if a node is less than half full :

    if ((page.available_space()) <= ((PAGE_SIZE - HFPage.DPFIXED) / 2))
{
// check if merge can occur
// merge
}

Note that the total size of a page is PAGE_SIZE, but given the overhead of page structures, the amount of space available for records is PAGE_SIZE - HFPage.DPFIXED.

BTreeFile.new_scan

This method should create new a BTreeFileScan object and initialize it based on the search range specified. It is useful to find out which leaf node the first record to scan is in.

BTreeFileScan.destroyBTreeFileScan

First, note that BTreeFileScan does not have a public constructor since it is created inside BTreeFile.new_scan. destroyBTreeFileScan should clean up the necessary things (such as unpinning any pinned pages).

BTreeFileScan.get_next

GetNext returns the next record in the B+ Tree in increasing order of key. Basically, GetNext traverses through the link list of leaf nodes, and returns all the records in them that are within the scan range, one at a time.

For example, if the B+ Tree contains records with keys 1,2,4,5,6,7,8, the following code should print out "2 4 5 ".

    Key lowKey = new Key(2);
Key highKey = new Key(5);
KeyEntry next = null;
BTreeFileScan scan = btree.new_scan(lowKey, highKey);
while ((next = scan.get_next()) != null)
System.out.println(next.key);

BTreeFileScan.delete_current

DeleteCurrent should delete the current record, i.e, the record returned by previous call to GetNext(). For example, if the B+ Tree contains records with keys 1,2,4,5,6,7,8, the following code should print out "2 4", and the remaining B+ Tree should contain records with keys 1,5,6,7,8.

    Key lowKey = new Key(2);
Key highKey = new Key(5);
KeyEntry next = null;
BTreeFileScan scan = btree.new_scan(lowKey, highKey);
while ((next = scan.get_next()) != null)
if ((key == 2) || (key == 4))
scan.delete_current();

Testing your code

The BTTest.java class has one test program, that will read instructions from a file called "test.txt" to insert or delete keys from the tree, or print the contents of the tree.  Several test files have been provided with different numbers of keys.  With the default page size, about 60 keys will fit on a page, so testing a 2-level tree requires at least the 100 key test, and a 3-level tree requires the 10000 key test.  Note that the code for printing a tree does not use BTFileScan, so you will need to write code to test that.

We will be testing your BTreeFile and BTFileScan code through the public methods of these classes.

Submission procedure

Using the normal process, submit BTreeFile.java and BTFileScan.java.

Advice

Marking criteria

Separate credit will be allocated for:

Miscellaneous

Please let us know if there is any ambiguity in the assignment and please report any possible bugs as soon as possible by mailing the instructor (cs186@cs.berkeley.edu).