Project 1 | CS 61B Fall 2017

Project 1: DB61B

Due: Friday, 13 October 2017

Introduction

This project involves writing a miniature relational database management system (DBMS) that stores tables of data, where a table consists of some number of labeled columns of information. Our system will include a very simple query language for extracting information from these tables. For the purposes of this project, we will deal only with very small databases, and therefore will not consider speed and efficiency at all. For that sort of stuff, you might consider taking CS186 at some point.

As an example, consider the following set of three tables containing information about students, about course numbers and locations, and about grades in these courses. Each table has a name (given above the upper-left corner) and each column of each table has a name (given above the double line).

students

SID	Lastname	Firstname	SemEnter	YearEnter	Major
101	Knowles	Jason	F	2003	EECS
102	Chan	Valerie	S	2003	Math
103	Xavier	Jonathan	S	2004	LSUnd
104	Armstrong	Thomas	F	2003	EECS
105	Brown	Shana	S	2004	EECS
106	Chan	Yangfan	F	2003	LSUnd

schedule

CCN	Num	Dept	Time	Room	Sem	Year
21228	61A	EECS	2-3MWF	1 Pimentel	F	2003
21231	61A	EECS	1-2MWF	1 Pimentel	S	2004
21229	61B	EECS	11-12MWF	155 Dwinelle	F	2003
21232	61B	EECS	1-2MWF	2050 VLSB	S	2004
21103	54	Math	1-2MWF	2050 VLSB	F	2003
21105	54	Math	1-2MWF	1 Pimentel	S	2004
21001	1A	English	9-10MWF	2301 Tolman	F	2003
21005	1A	English	230-5TuTh	130 Wheeler	S	2004

enrolled

SID	CCN	Grade
101	21228	B
101	21105	B+
101	21232	A-
101	21001	B
102	21231	A
102	21105	A-
102	21229	A
102	21001	B+
103	21105	B+
103	21005	B+
104	21228	A-
104	21229	B+
104	21105	A-
104	21005	A-
105	21228	A
105	21001	B+
106	21103	A
106	21001	B
106	21231	A

Describing Command Syntax

You communicate with the database system using an artificial notation usually known as a language, although it is much simpler than any human language. The definition and processing of such languages is an important skill for any computer scientists. We normally think of programming languages such as Java, but there are many other contexts where small, domain-specific languages (DSLs) are appropriate engineering solutions to a design or implementation problem.

We typically describe the syntax (grammatical structure) of a language using a specialized metalanguage---a language for describing languages. Here, we'll use a version of a common metalanguage: BNF (Backus-Naur Form). A BNF grammar consists of a set of rules, For example:


<create statement> ::= "create" "table" <name> <table definition>
<table definition> ::= "(" <name>_,⁺ ")"
    | "as" <select clause>

means that a create statement consists of the (literal) words "create" and "table" followed by a name, followed by a table definition. A table definition, in turn, consists of either

a left parenthesis, followed by a list of one or more names separated by commas, followed by a right parenthesis; or
the word "as" followed by a select clause (which is defined in another rule.)

The labels bracketed by <> are metavariables and stand for sets of possible strings, as described by the rules.

We have extended BNF with four concise definitions:

<statement>* means "zero or more statements".
<statement>⁺ means "one or more statements".
<statement>_,⁺ means "one or more statements separated by commas.
Finally, unquoted parentheses group, so that for example, ("and" <condition>)* means zero or more occurrences of the word "and", each followed by a condition.

We define a few of the metavariables in English (sort of as base cases):

<name> denotes a sequence of letters, digits, and underscores that does not start with a digit.
<literal> denotes a sequence of zero or more characters other than ends-of-lines, commas, or single quotes (apostrophes), surrounded by single quotes. For example,
```
 'Mary Smith'
```
Remove any whitespace at the beginning or end of a literal. Whitespace in the interior of a literal should remain.
<empty> The empty string: stands for a missing clause.

As is traditional, we ignore whitespace (spaces, tabs, newlines) in the descriptions that follow. Whitespace or other punctuation must separate words from each other. For example, createtable is just a name, and not the two words create and table. Comments, which are the same as /* */ comments in Java, are treated as whitespace.

Commands

Our database system uses a very restricted dialect of SQL (Structured Query Language), a widely used notation for communicating with relational databases. When you run the database system, it will accept a sequence of commands from the standard input (i.e., normally the terminal), according to the following syntax:

<program> ::= <statement>*
<statement> ::=
     <create> <statement>
    | <exit> <statement>
    | <insert> <statement>
    | <load> <statement>
    | <print> <statement>
    | <select> <statement>
    | <store> <statement>

<create statement> :: "create" "table" <name> <table definition> ";"
<table definition> ::=
      "(" <column name>_,⁺ ")"
    | "as" <select clause>
<print statement> ::= "print" <table name> ";"

<insert statement> ::=
      "insert" "into" <table name> "values" ( "(" <literal>_,⁺ ")" )_,⁺ ";"
<load statement> ::= "load" <name> ";"
<store statement> ::= "store" <table name> ";"
<exit statement> ::= "quit" ";" | "exit" ";"

<select statement> ::= <select clause> ";"
<select clause> ::= "select" <column name>⁺_, "from" <tables> <condition clause>
<condition clause> ::=
      "where" <condition> ("and" <condition>)*
    | <empty>
<tables> ::= <table name> | <table name> "," <table name>
<condition> ::=
      <column name> <relation> <column name>
    | <column name> <relation> <literal>

<relation> ::= "<" | ">" | "=" | "!=" | "<=" | ">="
<table name> ::= <name>
<column name> ::= <name>

The above only defines the syntax, but doesn't say what these statements mean or do (known as the semantics). For that, we just use English:

"create" "table" <table name> "(" <column name>_,⁺ ")" ";" Creates an empty table with the given name (replacing it if it is already loaded). The names of its columns are given by the column names in order. There must not be any duplicate column names.
"create" "table" <table name> "as" <select clause> ";" Creates a table with the given name (replacing it if it is already loaded) whose columns and contents are those produced by the select clause. Each distinct set of column values gets only one row in the table (no duplicate rows).
"load" <name> ";" Load data from the file name.db to create a table named name.
"store" <table name> ";" Store the data from the table table name into the file table name.db.
"insert" "into" <table name> "values" ( "(" <literal>_,⁺ ")" )_,⁺ ";" Adds new rows to the given table whose values are given by the parenthesized lists of literals. In each parenthesized list, there must be exactly one literal for each column of the table, and the table must already exist. Rows are not added if they duplicate a row already in the table.
"print" <table name> ";" Print all rows of the table with the given name (which must be loaded). The rows are printed one per line and indented. Separate columns with blanks, and print the columns in the order they were specified when the table was created. You do not need to print the column names. See the example below for the format. Print rows in lexicographic order: that is, print them in order by their first column (comparing the contents as strings). If two rows have the same first column, print them in order by their second columns, and so forth. (Warning: in standard SQL, there is no such ordering by default).
<select clause> ";" Select clauses are described below. They represent tables created from other tables. When used alone as a statement (terminated by a semicolon), they indicate that the resulting table is to be printed, using the format described above for the print command.
"quit" ";" Exit the program.
"exit" ";" Synonym for quit.

Select clauses. Select clauses are used in select statements and in create statements. They denote tables whose information is selected from other tables.

"select" <column name>_,⁺ "from" <table name> <condition clause> A new (unnamed) table consisting of the named columns from the given table from all rows that satisfy the condition clause.
"select" <column name>_,⁺ "from" <table name> "," <table name> <condition clause> A new (unnamed) table consisting of the given columns from all rows of the natural inner join of the two tables that satisfy the condition clause. A natural inner join is a table whose rows are formed by combining pairs of rows, one from the first table and one from the second, such that any columns that have the same name in both tables also have the same value in both rows. Any pairs of rows that don't have the same values in common columns are not included in the joined table. If there are no common columns between the two tables, all row combinations are included in the natural inner join.
See the example below for a better understanding of how a natural inner join combines rows between the tables "student" and "enrolled" from above, which share the column "SID." Try drawing out both tables and your join on paper, then seeing if the result of the query in the example makes sense.

Condition clauses. An empty condition clause does not restrict the rows that appear in a select. Otherwise, it contains one or more conditions separated by and, all of which must be satisfied. The tests compare column values against either other columns or literal values. The relation symbols mean what they do in Java, except that all values are treated as strings (use the compareTo method on Strings to compare them). Thus you can write things like

  Lastname >= 'Chan'

to get all rows in which the value of the Lastname column comes after "Chan" in lexicographic order (i.e, by their first character, then their second, etc., ordering characters according to Java's collating sequence). The result is roughly like alphabetical order, with the major difference being that all digits come before all capital letters, which come before all lower-case letters, with punctuation and other characters inserted in various places.

Format of .db Files

A .db file starts with a line containing all column names (at least one) separated by commas, with any leading or trailing whitespace removed (see the .split and .trim methods of the String class). Column names must be valid identifiers and must be distinct. This is followed by any number of lines (zero of more), one for each row, containing the values for that row, separated by commas (again, leading or trailing whitespace is removed). For example, the students table shown previously would look like this in a file:

  SID,Lastname,Firstname,SemEnter,YearEnter,Major
  101,Knowles,Jason,F,2003,EECS
  102,Chan,Valerie,S,2003,Math
  103,Xavier,Jonathan,S,2004,LSUnd
  104,Armstrong,Thomas,F,2003,EECS
  105,Brown,Shana,S,2004,EECS
  106,Chan,Yangfan,F,2003,LSUnd

Example

If the information in these tables exists in three files—students.db, schedule.db, and enrolled.db—then a session with our DBMS might look like the following transcript. Lines that start with prompts (> or ...) are input from the user; the rest are output from the program

  DB61B System.  Version 2.0
  > load students ;
  Loaded students.db
  > load enrolled ;
  Loaded enrolled.db
  > load schedule ;
  Loaded schedule.db
  > /* What are the names and SIDS of all students whose last name
  ...  is 'Chan'? */
  > select SID, Firstname from students
  ...  where Lastname = 'Chan';
  Search results:
    102 Valerie
    106 Yangfan
  > /* Who took the course with CCN 21001, and what were their grades? */
  > select Firstname, Lastname, Grade
  ...      from students, enrolled where CCN = '21001';
  Search results:
    Jason Knowles B
    Shana Brown B+
    Valerie Chan B+
    Yangfan Chan B
  > /* Who has taken the course named 61A from EECS? */
  > /* First, create a table that contains SIDs */
  > create table enrolled2 as select SID
  ...  from enrolled, schedule
  ...  where Dept = 'EECS' and Num = '61A';
  > /* Print these SIDs */
  > print enrolled2;
  Contents of enrolled2:
    101
    102
    104
    105
    106
  > /* Now print the names of the students in this list */
  > select Firstname, Lastname from students, enrolled2;
  Search results:
    Jason Knowles
    Shana Brown
    Thomas Armstrong
    Valerie Chan
    Yangfan Chan
  > quit ;

Your Task

First, make sure that everything in your repository is properly updated and checked in. Before you start, the command

$ cd  ~/repo
$ git status

should report that the directory is clean and that there are no untracked files that should be added and committed. Never start a new project without doing this.

To obtain the skeleton files (and set up an initial entry for your project in the repository), you can use the command sequence

git fetch shared
git merge shared/proj1 -m "Get proj1 skeleton"
git push

from your Git working directory. Should we update the skeleton, you can use the same sequence (with an appropriate change in the -m parameter) to update your project with the same changes.

You are free to modify anything in the skeleton, as long as the final program conforms to this specification and the requirements below. You needn't ask permission to do so. Filling in parts we've marked with "FIXME" or similar comments is not part of the assignment; rather, it's a suggestion. Do be sure, however, that your documentation comments reflect any changes or additions you make.

Be sure to include tests of your program (yes, that is part of the grade). The makefile we provide has a convenient target for running such tests. Our skeleton directory contains a couple of trivial tests, but these do not constitute an adequate set of tests! Make up your tests ahead of time and update your makefile to run them.

The input to your program will come from fallible humans. Therefore, part of the problem is dealing gracefully with errors. When the user makes a syntax error, or mentions a non-existent column, your program should not simply halt and catch fire, but should give some sort of message and then try to get back to a usable state. For syntactic errors (errors in the format of commands) you should skip to the next semicolon (if it hasn't occurred yet) or the end-of-file, whichever comes first. For all errors, you should print a single, separate line starting with the word "error" (in upper or lower case) and followed by any message, and the erroneous command should have no effect on any table. Your program should not exit with a cryptic stack trace.

Our testing of your projects (but not our grading!) will be automated. The testing program will be finicky, so be sure that:

Your makefile is set up to compile everything on the command make and to run all your tests on the command make check. The makefile provided in our skeleton files is already set up to do this. Be sure to keep it up to date if you add additional .java files.
Your main function is in a public class called db61b.Main. The skeleton is already set up this way.
The first line of output of your program identifies the program. It may contain anything.
Before reading the first command and on reading each subsequent end-of-line or comment, your program must print the prompt '>' (greater than followed by a blank), or (to indicate a continuation), `\verb*|...|'. The testing script will ignore these. The skeleton is set up to do this already.
Output things in exactly the format shown in the example, with no extra adornment.
Put your error messages on separate lines, starting with the word error (upper or lower case). The grading program will ignore the text of the message.
Your program should exit without further output when it encounters a quit (or exit) command or an end-of-file in the input.
Your final version must not print any debugging information.
The grading program will ignore extra blank lines and extra blanks at the ends of lines, and will treat any run of consecutive blanks as if it were a single blank.

Submission and Version Control

It is important that you commit work to your repository at frequent intervals. Version control is a powerful tool for saving yourself when you mess something up or your dog eats your project, but you must use it regularly if it is to be of any use. Feel free to commit every 15 minutes; Git only saves what has changed, even though it acts as if it takes a snapshot of your entire project.

The command git status will tell you what you have modified, removed, or possibly added since the last commit. It will also tell you how much you have not yet sent to your central repository. You needn't just assume that things are as you expect; git status will tell you whether you've committed and pushed everything.

If you are switching between using a clone of your central repository on the instructional computers and another at home (or on your laptop), be careful to synchronize your work. When you are done working on one system, be sure push your work to the central repository:

$ git status       # To see what needs to be added or committed.
$ git commit -a    # If needed to get everything committed.
$ git push

When you start working on the other system, you then do

$ git status          # To make sure you didn't accidentally leave
                      # stuff uncommitted or untracked.
$ git pull --rebase   # Get changes from your central repo.

Submit your project by committing and tagging it:

$ git tag proj1-0     # Or proj1-1, etc.
$ git push
$ git push --tags

Be sure to respond to all prompts and to make sure the messages you get indicate that the submission was successful. Don't just "say the magic words" and assume that everything's OK.

Advice

You will find this project challenging enough without helping to make it harder. Much of what might look complicated to you is actually pretty easy, once you take the time to look to see what has already been written for you—specifically what is in the skeleton and the standard Java library of classes. So, before doing any hard work on anything that looks tricky, browse the skeleton and the documentation.

If you're not using the skeleton, you'll have a couple of extra things worth mentioning to figure out at some point during the project, namely reading input files and choosing a way to store the rows of a table. For reading from files (the load command), you'll probably want to use java.io.FileReader together with java.util.Scanner, which is also good for reading from the standard input. In fact, we've tried to design the syntax to make it particularly easy to parse using Scanners. If you find yourself writing lots of complicated junk just to read in and interpret the input, you might back off and ask one of us to look over your approach. The standard Java string type, java.lang.String, also contains useful member functions for doing comparisons (including comparisons that ignore case). For choosing a way to store rows in a table, you can use arrays or devise a kind of linked list to do so, but you might instead want to take a look at such interfaces as java.util.List, java.util.Set, and java.util.Map and at the library classes that implement them (such as java.util.ArrayList, java.util.HashSet, and java.util.HashMap. Read the on-line documentation for all of these.

It's important to have something working as soon as possible. You'll prevent really serious trouble by doing so. I suggest the following order to getting things working:

Get the printing of prompts, handling of comments, and the `quit' and exit' commands, and the end of input to work. If you're using our skeleton, this step has already been completed for you.
Figure out how to represent the contents of a table, either by understanding the representation we propose in the skeleton or creating your own. Fill in the get method in Table.java.
Implement the parts of the Table class needed to create a new Table, add a row to it, and print an entire Table.
Implement the Database class.
Implement create table (the non-select version), insert, and print.
Implement load and store.
Implement the kind of select that takes a single table and has no conditions.
Now get single-table select with conditions to work.
Finally, work on the two-table variety of select.
In the above steps, as you decide on implementation strategies and the data representations you will use, write them down in your internal documentation. When you introduce new methods or classes, write the comments first.
Throughout the project, write test cases (in fact, do this every chance you get). Among other things, writing test cases gets you to understand the specification better. Whenever you find an input that breaks your program, make sure you capture it in a test case.

You may throw all our skeleton code away if you prefer. You are not required to implement Table, Column, etc. However, we strongly recommend that you don't do this if your only reason is that you can't figure out how to do it our way. In that case, pester us with questions until you do understand.

Navigation