Project 3: Gitlet, your own version-control system

Due Wednesday, 9 December 2015

A. Getting and Sharing Code

In a departure from our usual practice, we will allow partnerships for this project. Both partners will get the same grade. We'll will assign partners to those who want them and do not choose one.

Those of you who want a partner can register by means of the "Selecting a Partner" section of the Account Administration page. We will periodically update the repository settings to allow you to share each other's repository (read-only).

Those of you working alone can use the usual procedure:

$ git fetch shared
$ git checkout master
$ git merge -m "Start project 3" shared/proj3
$ git push

or, to put it in a new branch

$ git fetch shared
$ git checkout -b Proj3 shared/proj3
$ git push -u origin Proj3

If we update the skeleton, you can then merge in our updates with

$ git fetch shared
$ git merge -m "Update to current skeleton" shared/proj3

Those of you with a partner will want to proceed a bit differently. Let's say cs61b-xya and cs61b-xyb are partners. First, arrange to get access to each other's repositories. cs61b-xya should issue the commands

$ git remote add partner
$ git config remote.partner.tagOpt --no-tags

And cs61b-xyb should issue

$ git remote add partner
$ git config remote.partner.tagOpt --no-tags

Do this in every repository you'll use (home and on instructional machines). The git config command prevents you from pulling in all your partner's tags accidentally when you fetch (which would create considerable confusion). To fetch anything your partner has pushed to the central (cs61b-ta) repository, you can now use

$ git fetch partner

Now set up a proj3 branch as usual like this:

$ git fetch shared
$ git checkout -b Proj3 shared/proj3
$ git push -u origin Proj3

So now you each have a Proj3 branch in your central repository and locally.

At any time, you can merge what your partner has been doing into your Proj3 branch like this:

$ git commit ...         # If needed
$ git checkout Proj3     # If needed
$ git fetch partner Proj3
$ git merge -m "merging changes from cs61b-xya" partner/Proj3

You may have to resolve conflicts and commit before proceeding.

Naturally, you'll only see changes your partner has pushed. Thus, you had best not be as careless as we fear some of you have about committing and pushing regularly! You won't be able to mess up your partner's repository, since you will not have write access to it.

When the time comes to submit, only one of you should do so. That person should make very sure to have all his partner's changes merged in first.

B. Overview of Gitlet

In this project you'll be implementing a version-control system. This version-control system mimics some of the basic features of the popular version-control system git, but it is smaller and simpler, so we have named it gitlet.

A version-control system is essentially a backup system for files on your computer. The main functionality that gitlet supports is:

  1. Saving backups of directories of files. In gitlet, this is called committing, and the backups themselves are called commits.
  2. Restoring a backup version of one or more files or entire commits. In gitlet, this is called checking out those files or that commit.
  3. Viewing the history of your backups. In gitlet, you view this history in something called the log.
  4. Maintaining related sequences of commits, called branches.
  5. Merging changes made in one branch into another.

The point of a version-control system is to help you when coding complicated projects, or when collaborating with others on a project. You save versions of the project periodically. If at some later point in time you accidentally mess up your code, then you can restore your source to a previously committed version (without losing any of the changes you made since then).

In gitlet, you don't just commit individual files at a time. Instead, you can commit an arbitrary set files at the same time. We like to think of each commit as a snapshot of your entire project at one point in time. However, for simplicity, many of the examples in the remainder of this document involve just committing one file at a time. Just keep in mind you could add in multiple files to each commit.

In this project, it will be helpful for us to visualize the commits we make over time. Suppose we have a file wug.txt, we add some text to it, and commit it. Then we modify the file and commit these changes. Then we modify the file again, and commit the changes again. Now we have saved three total backup versions of this file, each one further in time than the previous. We can visualize these commits like so:

Three commits

Here we've drawn an arrow indicating that each commit contains some kind of reference to the commit that came before it. We call the commit that came before it the parent commit — this will be important later. But for now, does this drawing look familiar? That's right; it's a linked list!

The big idea behind gitlet is that we can visualize the history of the different versions of our files in a list like this. Then it's easy for us to restore old versions of files. You can imagine making a command like: "Gitlet, please revert to the state of the files at commit #2", and it would go to the second node in the linked list and restore the copies of files found there.

If we tell gitlet to revert to an old commit, the front of the linked list will no longer reflect the current state of your files, which might be a little misleading. In order to fix this problem, we introduce something called the head pointer. The head pointer keeps track of where in the linked list we're currently "at". Normally, as we make commits, the head pointer will stay at the front of the linked list, indicating that the latest commit reflects the current state of the files:

Simple head

However, let's say we revert to the state of the files at commit #2 (technically, this is the reset command, which you'll see later in the spec). We move the head pointer back to show this:

Reverted head

All right, now, if this were all gitlet could do, it would be a pretty simple system. But gitlet has one more trick up its sleeve: it doesn't just maintain older and newer versions of files, it can maintain differing versions. Imagine you're coding a project, and you have two ideas about how to proceed: let's call one Plan A, and the other Plan B. Gitlet allows you to save both versions, and switch between them at will. Here's what this might look like, in our pictures:

Two versions

It's not really a linked list anymore. It's more like a tree. We'll call this thing the commit tree. Keeping with this metaphor, each of the separate versions is called a branch of the tree. You can develop each version separately:

Two developed versions

There are two pointers into the tree, representing the furthest point of each branch. At any given time, only one of these is the currently active pointer, and this what's called the head pointer. The head pointer is the pointer at the front of the current branch.

That's it for our brief overview of the gitlet system! Don't worry if you don't fully understand it yet; the section above was just to give you a high level picture of what its meant to do. A detailed spec of what you're supposed to do for this project follows this section.

But a last word here: one feature of the commit tree that it is in some sense immutable: once a commit node has been created, it can never be destroyed (or changed at all). We can only add new things to the commit tree, not modify existing things. This is an important feature of gitlet! Remember, it's a version-control system, and one of our goals with it is to allow us to save things so we don't delete them accidentally.

C. Internal Structures

Real git distinguishes several different kinds of objects. For our purposes, the important ones are

We will simplify from git still further by

Every object — every blob and every commit in our case — has a unique integer id that serves as a reference to the object. An interesting feature of git is that these ids are universal: unlike a typical Java implementation, two objects with exactly the same content will have the same id on all systems. In the case of blobs, "same content" means the same file contents. In the case of commits, it means the same metadata, the same mapping of names to references, and the same parent reference. The objects in a repository are thus said to be content addressable.

Both git and gitlet accomplish this the same way: by using a cryptographic hash function called SHA-1 (Secure Hash 1), which produces a 160-bit integer hash from any sequence of bytes. Cryptographic hash functions have the property that it is extremely difficult to find two different byte streams with the same hash value (or indeed to find any byte stream given just its hash value), so that essentially, we may assume that the probability of that any two objects with different contents have the same SHA-1 hash value is 2-160 or about 10-48. Basically, we simply ignore the possibility of a hashing collision, so that the system has, in principle, a fundamental bug that in practice never occurs!

Fortunately, there are library classes for computing SHA-1 values, so you won't have to deal with the actual algorithm. All you have to do is to make sure that you correctly label all your objects. In particular, this involves

By the way, the SHA-1 hash value, rendered as a 40-character hexadecimal string, makes a convenient file name for storing your data in your .gitlet directory (more on that below). It also gives you a convenient way to compare two files (blobs) to see if they have the same contents: if their SHA-1s are the same, we simply assume the files are the same.

For remotes (like origin and shared, which we've been using all semester), we'll simply use other gitlet repositories. Pushing simply means copying all commits and blobs that the remote repository does not yet have to the remote repository, and resetting a branch reference. Pulling is the same, but in the other direction

Reading and writing your internal objects from and to files is actually pretty easy, thanks to Java's serialization facilities. The interface has no methods, but if a class implements it, then the Java runtime will automatically provide a way to convert to and from a stream of bytes, which you can then write to a file using the I/O class and read back (and deserialize) with The term "serialization" refers to the conversion from some arbitrary structure (array, tree, graph, etc.) to a serial sequence of bytes.

Here is a summary example of the structures discussed in this section. As you can see, each commit (rectangle) points to some blobs (circles), which contain file contents. The commits contain the file names and references to these blobs, as well as a parent link. These references, depicted as arrows, are represented in the .gitlet directory using their SHA-1 hash values (the small hexdecimal numerals above the commits and below the blobs). The newer commit contains an updated version of wug1.txt, but shares the same version of wug2.txt as the older commit.

Two commits and their blobs

D. Detailed Spec of Behavior

Overall Spec

The only structure requirement we're giving you is that you have a class named gitlet.Main and that it has a main method. Here's your skeleton code for this project (in package gitlet):

public class Main {
    public static void main(String[] args) {
       // FILL IN

We are also giving you some utility methods for performing a number of mostly file-system-related tasks, so that you can concentrate on the logic of the project rather than the peculiarities of dealing with the OS.

You may, of course, write additional Java classes to support your project — in fact, please do. But don't use any external code (aside from JUnit), and don't use any programming language other than Java. You can use all of the Java Standard Library that you wish, plus utilities we provide.

The majority of this spec will describe how's main method must react when it receives various arguments which correspond to commands to the gitlet system. But before we break down command-by- command, here are some overall guidelines the whole project should satisfy:

E. The Commands




Here's a picture of before-and-after commit:

Before and after commit



   Commit a0da1ea5a15ab613bf9961fd86f010cf74c7ee48
   2015-03-14 11:59:26
   A commit message.

   Commit 3e8bf1d794ca2e9ef8a4007275acf3751c7170ff
   2015-03-14 11:49:29
   Another commit message.

   Commit e881c9575d180a215d1a636545b8fd9abfb1d2bb
   2015-03-14 11:39:26
   initial commit

There is a === before each commit and an empty line after it. As in real git, each entry displays the unique SHA-1 id of the commit object. Display commits with the most recent at the top. By the way, there's a class in the Java standard library that will help you format the dates really easily. Look into that instead of trying to construct it manually yourself!

Here's a picture of the history of a particular commit. If the current branch's head pointer happened to be pointing to that commit, log would print out information about the circled commits:


The history ignores other branches and the future. Now that we have the concept of history, let's refine what we said earlier about the commit tree being immutable. It is immutable precisely in the sense that the history of a commit with a particular id may never change, ever. If you think of the commit tree as nothing more than a collection of histories, then what we're really saying is that each history is immutable.





Checkout is a kind of general command that can do a few different things depending on what its arguments are. There are 3 possible use cases. In each section below, you'll see 3 bullet points. Each corresponds to the respective usage of checkout.

A [commit id] is, as described earlier, a hexadecimal numeral. A convenient feature of real git is that one can abbreviate commits with a unique prefix. For example, one might abbreviate




in the (likely) event that no other object exists with a SHA-1 identifier that starts with the same six digits. You should arrange for the same thing to happen for commit ids that contain fewer than 40 characters. Unfortunately, using shortened ids might slow down the finding of objects if implemented naively (making the time to find a file linear in the number of objects), so we won't worry about timing for commands that use shortened ids. We suggest, however, that you poke around in a .git directory (specifically, .git/objects) and see how it manages to speed up its search. You will perhaps recognize a familiar data structure implemented with the file system rather than pointers.

None of these versions modifies the staging area: files scheduled for addition or removal remain so.


All right, let's see what branch does in detail. Suppose our state looks like this:

Simple history

Now we call java gitlet.Main branch cool-beans. Then we get this:

Just called branch

Hmm... nothing much happened. Let's switch to the branch with java Gitlet checkout cool-beans:

Just switched branch

Nothing much happened again?! Okay, say we make a commit now. Modify some files, then java gitlet.Main add... then java gitlet.Main commit....

Commit on branch

I was told there would be branching. But all I see is a straight line. What's going on? Maybe I should go back to my other branch with java Gitlet checkout master:

Checkout master

Now I make a commit...


Phew! So that's the whole idea of branching. Did you catch what's going on? All that creating a branch does is to give us a new pointer. At any given time, one of these pointers is considered the currently active pointer, or the head pointer (indicated by *). We can switch the currently active head pointer with checkout [branch name]. Whenever we commit, it means we add a new commit in front of the currently active head pointer, even if one is already there. This naturally creates branching behavior.

Make sure that the behavior of your branch, checkout, and commit match what we've described above. This is pretty core functionality of gitlet that many other commands will depend upon. If any of this core functionality is broken, very many of our autograder tests won't work!




F. Miscellaneous Things to Know about the Project

Phew! That was a lot of commands to go over just now. But don't worry, not all commands are created equal. You can see for each command the approximate number of lines we took to do each part (that this only counts code specific to that command — it doesn't double-count code reused in multiple commands). You shouldn't worry about matching our solution exactly, but hopefully it gives you an idea about the relative time consumed by each command. Merge is a lengthier command than the others, so don't leave it for the last minute!

Anyway, by now this spec has given you enough information to get working on the project. But to help you out some more, there are a couple of things you should be aware of:

Dealing with Files

This project requires reading and writing of files. In order to do these operations, you might find the classes and java.nio.file.Files helpful. Actually, you may find various things in the and java.nio packages helpful. Be sure to read the gitlet.Utils package for other things we've written for you. If you do a little digging through all of these, you might find a couple of methods that will make the io portion of this project much easier! One warning: If you find yourself using readers, writers, scanners, or streams, you're making things more complicated than need be. all.

Serialization Details

If you think about gitlet, you'll notice that you can only run one command every time you run the program. In order to successfully complete your version-control system, you'll need to remember the commit tree across commands. This means you'll have to design not just a set of classes to represent internal gitlet structures during execution, but you'll a parallel representation as files within your .gitlet directories, which will carry across multiple runs of your program.

As indicated earlier, the convenient way to do this is to serialize the runtime objects that you will also need to store permanently in files. In Java, this simply involves implementing the interface:


class MyObject implements Serializable {

This interface has no methods; it simply marks its subtypes for the benefit of some special Java classes for performing I/O on objects. For example,

    MyObject obj = ....;
    File outFile = new File(someFileName);
    try {
        ObjectOutputStream out =
            new ObjectOutputStream(new FileOutputStream(outFile));
    } catch (IOException excp) {

will convert obj to a stream of bytes and store them in the file whose name is stored in someFileName. The object may then be reconstructed with a code sequence such as

    MyObject obj;
    File inFile = new File(someFileName);
    try {
        ObjectInputStream inp =
            new ObjectInputStream(new FileInputStream(inFile));
        obj = (MyObject) inp.readObject();
    } catch (IOException | ClassNotFoundException excp) {
        obj = null;

The Java runtime does all the work of figuring out what fields need to be converted to bytes and how to do so.

There is, however, one annoying subtlety to watch out for: Java serialization follows pointers. That is, not only is the object you pass into writeObject serialized and written, but any object it points to as well. If your internal representation of commits, for example, represents the parent commit as a pointer to another commit object, then writing the head of a branch will write all the commits (and blobs) in the entire chain of commits into one file, which is generally not what you want. With a little work, you can avoid this problem. One technique is simply not to use Java pointers to refer to commits and blobs in your runtime objects, but instead to use SHA-1 hash strings. You then have a runtime map between these strings and the runtime objects they refer to. You create and fill in this map while gitlet is running, but never read or write it to a file.

G. Testing

As usual, testing is part of the project. Be sure to provide your own integration tests for each of the commands, covering all the specified functionality. Also add unit tests to or other testing classes it invokes in its main method.

We have provided a testing program that makes it relatively easy to write integration tests: testing/ As for Project #2, this interprets testing files with an .in extension. Running the command

python3 testing/

will provide a message documenting the format of these files. Specifically, the tester allows you to

As usual, we will test your code on the the instructional machines, so do be sure it works there! You can tell us how "it works on my machine" until you are blue in the face; we heartlessly won't care about that.

We've added a few things to the makefile to adjust for differences in people's setups. If your system's command for invoking Python 3 is simply python, you can still use our makefile unchanged by using

  make PYTHON=python check

You can pass additional flags to with, for example,

  make TESTER_FLAGS="--show=all --keep"

H. Extra Credit: Going Remote

This project is all about mimicking git's local features. These are useful because they allow you to backup your own files and maintain multiple versions of them. However, git's true power is really in its remote features, allowing collaboration with other people over the internet. The point is that both you and your friend could be collaborating on a single code base. If you make changes to the files, you can send them to your friend, and vice versa. And you'll both have access to a shared history of all the changes either of you have made.

To get extra credit, implement some basic remote commands: namely add-remote, rm-remote, push, fetch, and pull You will get 5 extra-credit points for completing them. Don't attempt or plan for extra credit until you have completed the rest of the project.

Depending on how flexibly you have designed the rest of the project, 5 extra-credit points may not be worth the amount of effort it takes to do this section. We're certainly not expecting everyone to do it. Our priority will be in helping students complete the main project; if you're doing the extra credit, we expect you to be able to stand on your own a little bit more than most students.

The Commands

All right, now that you've gotten scp working, onto the rest of the project!

A few notes about the remote commands:

So now let's go over the commands:






I. Acknowledgments

Thanks to Alicia Luengo, Josh Hug, Sarah Kim, Austin Chen, Andrew Huang, Yan Zhao, Matthew Chow, especially Alan Yao, Daniel Nguyen, and Armani Ferrante for providing feedback on this project. Thanks to git for being awesome.

This project was largely inspired by this excellent article by Philip Nilsson.

This project was created by Joseph Moghadam. Modifications for Fall 2015 by Paul Hilfinger.