Project 2: Gitlet, your own version control system

Before we begin, please note that this is a group project. You must work in a group of total size 3 or 4. Your code will not be accepted unless it is turned in with a group.

A. Overview of Gitlet

In this project you'll be implementing a version control system. This version control system mimics some of the basic features of the popular version control system git, but it is smaller and simpler, so we have named it gitlet.

A version control system is essentially a backup system for files on your computer. The main functionality that gitlet supports is:

  1. Saving backups of files. In gitlet, this is called committing.
  2. Restoring a backup version of a file. In gitlet, this is called checking out that version of the file.
  3. Viewing the history of your backups. In gitlet, you view this history in something called the log.

The point of a version control system is to help you when coding complicated projects. You save various versions of your project when you think it's working pretty well. If at some later point in time you accidentally mess up your code, then you can look back over the history of your backups, and choose one that you would like to restore.

In gitlet, you don't just commit individual files at a time. Instead, you can commit a bunch of files at the same time. We like to think of each commit as a snapshot of some part of your project at one point in time. However, for simplicity, many of the examples in the remainder of this document involve just committing one file at a time. Just keep in mind you could add in multiple files to each commit.

In this project, it will be helpful for us to visualize the commits we make over time. Suppose we have a file wug.txt, we add some text to it, and commit it. Then we modify the file and commit these changes. Then we modify the file again, and commit the changes again. Now we have saved three total backup versions of this file, each one further in time than the previous. We can visualize these commits like so:

Three commits

Here we've drawn an arrow indicating that each commit contains some kind of reference to the commit that came before it. We call the commit that came before it the parent commit — this will be important later. But for now, does this drawing look familiar? That's right; it's a linked list!

The big idea behind gitlet is that we can visualize the history of the different versions of our files in a list like this. Then it's easy for us to restore old versions of files. You can imagine making a command like: "Gitlet, please revert to the state of the files at commit #2", and it would go to the second node in the linked list and restore the copies of files found there.

If we tell gitlet to revert to an old commit, the front of the linked list will no longer reflect the current state of your files, which might be a little misleading. In order to fix this problem, we introduce something called the head pointer. The head pointer keeps track of where in the linked list we're currently "at". Normally, as we make commits, the head pointer will stay at the front of the linked list, indicating that the latest commit reflects the current state of the files:

Simple head

However, let's say we revert to the state of the files at commit #2 (technically, this is the reset command, which you'll see later in the spec). We move the head pointer back to show this:

Reverted head

All right, now, if this were all gitlet could do, it would be a pretty simple system. But gitlet has one more trick up its sleeve: it doesn't just maintain older and newer versions of files, it can maintain differing versions. Imagine you're coding a project, and you have two ideas about how to proceed: let's call one Plan A, and the other Plan B. Gitlet allows you to save both versions, and switch between them at will. Here's what this might look like, in our pictures:

Two versions

It's not really a linked list anymore. It's more like a tree. We'll call this thing the commit tree. Keeping with this metaphor, each of the separate versions is called a branch of the tree. You can develop each version separately:

Two developed versions

Notice there are two pointers into the tree, representing the furthest point of each branch. At any given time, only one of these is the currently active pointer, and this what's called the head pointer. The head pointer is the pointer at the front of the current branch.

That's it for our brief overview of the gitlet system! Don't worry if you don't fully understand it yet; the section above was just to give you a high level picture of what its meant to do. A detailed spec of what you're supposed to do for this project follows this section.

But a last word here: one feature of the commit tree that it is in some sense immutable: once a commit node has been created, it can never be destroyed (or changed at all). We can only add new things to the commit tree, not modify existing things. This is an important feature of gitlet! Remember, it's a version control system, and one of our goals with it is to allow us to save things so we don't delete them accidentally.

B. Detailed Spec of Behavior

Overall Spec

The only structure requirement we’re giving you is that you have a class named Gitlet and that it has a main method. Here’s your skeleton code for this project:

public class Gitlet {
    public static void main(String[] args) {
    }
}

You may, of course, write additional Java classes to support your project — in fact, please do. But don’t use any external code (aside from JUnit), and don’t use any programming language other than Java. You can use all of the Java Standard Library that you wish.

The majority of this spec will describe how Gitlet.java's main method must react when it receives various arguments which correspond to commands to the gitlet system. But before we break down command-by-command, here are some overall guidelines the whole project should satisfy:

C. The Commands

init

add

commit

Here's a picture of before-and-after commit:

Before and after commit

rm

log

Notice there is a === separating each commit. There is also an empty line between each commit. Also notice that commits are displayed with the most recent at the top. By the way, there's a class in the Java standard library that will help you format the dates really easily. Look into that instead of trying to construct it manually yourself!

Here's a picture of the history of a particular commit. If the current branch's head pointer happened to be pointing to that commit, log would print out information about the circled commits:

History

Note that it ignores other branches and the future. Now that we have the concept of history, let's refine what we said earlier about the commit tree being immutable. It is immutable precisely in the sense that the history of a commit with a particular id may never change, ever. If you think of the commit tree as nothing more than a collection of histories, then what we're really saying is that each history is immutable.

global-log

find

status

Notice there is an empty line between each section. The order of branches/files within each section does not matter.

checkout

Checkout is a kind of general command that can do a few different things depending on what its arguments are. There are 3 possible use cases. In each section below, you'll see 3 bullet points. Each corresponds to the respective usage of checkout.

In addition, you might wonder: what happens if you have a file name that's the same as a branch name? In this case, let the branch name take precedence.

branch

All right, let's see what branch does in detail. Suppose our state looks like this:

Simple history

Now we call java Gitlet branch cool-beans. Then we get this:

Just called branch

Hmm... nothing much happened. Let's switch to the branch with java Gitlet checkout cool-beans:

Just switched branch

Nothing much happened again?! Okay, say we make a commit now. Modify some files, then java Gitlet add... then java Gitlet commit....

Commit on branch

I was told there would be branching. But all I see is a straight line. What's going on? Maybe I should go back to my other branch with java Gitlet checkout master:

Checkout master

Now I make a commit...

Branched

Phew! So that's the whole idea of branching. Did you catch what's going on? All creating a branch does is give us a new pointer. At any given time, one of these pointers is considered the currently active pointer, or the head pointer (indicated by *). We can switch the currently active head pointer with checkout [branch name]. Whenever we commit, it means we add a new commit in front of the currently active head pointer, even if one is already there. This naturally creates branching behavior.

Make sure that the behavior of your branch, checkout, and commit match what we've described above. This is pretty core functionality of gitlet that many other commands will depend upon. If any of this core functionality is broken, very many of our autograder tests won't work!

rm-branch

reset

merge

rebase

Rebase has one special case to look out for. If the current branch pointer is in the history of the given branch, rebase just moves the current branch to point to the same commit that the given branch points to. No commits are replayed in this case.

There's one more point to make about rebase: If the commit at the front of the given branch has files that have been modified since the split point, these these changes should propagate through the replay. This means, essentially, that the versions of the files in the given branch should take the place of their counterparts in the replayed commits, up until one of the replayed commits has a version of the file that had also been modified since the split point. In this case, what you might expect to happen is that you would get conflicted files, much like merge. However, for simplicity, we're not going to have you deal with conflicts in rebase: in this case, just keep the current branch's copies of the files. The bottom line: A file from the given branch stops propagating through once it meets a modified file in the replayed branch.

Finally, after successfully replaying nodes, reset to the node at the front of the replayed branch.

By the way, if there are multiple branches after the split point, you should NOT replay the other branches. For example, say we are on branch branch1 and we make the call java Gitlet rebase master: Branching rebase

D. Miscellaneous Things to Know about the Project

Phew! That was a lot of commands to go over just now. But don't worry, not all commands are created equal. You can see for each command the approximate number of lines we took to do each part (note that this only counts code specific to that command — it doesn't double-count code reused in multiple commands). You shouldn't worry about matching our solution exactly, but hopefully it gives you an idea about the relative time consumed by each command. Notice that merge and rebase are lengthier commands than the others, so don't leave them for the last minute!

Anyway, by now this spec has given you enough information to get working on the project. But to help you out some more, there are a couple of things you should be aware of:

E. Style

A portion of your grade for this project will be based on the style of your code. Here are the requirements.

  1. All methods must have a Javadoc comment. A Javadoc comment is a comment at the top of the method that starts with /** (instead of just /*). The comment describes what the method does, as well as describes the return value, parameters, and thrown exceptions of the method. It uses a specific @ syntax to communicate this information. Below is an example of a Javadoc comment for a method (the method is complete nonsense. It's just to show you an example of the structure of Javadoc).

    /**
     * Reveals the secrets of the human mind.
     * 
     * @param berko
     *            a string that will be used for very important things
     * @param gleason
     *            a list that will be used for not very important things
     * @return true if the sentence below is a lie
     * @throws TooManyListenersException
     *             if the sentence above is not a lie
     */
    private boolean jeansMethod(String berko, List gleason)
    		throws TooManyListenersException {
    	return false == false;
    }

    Hint: If you start typing /** at the top of a method and then hit enter, Eclipse will automatically create a skeleton of the Javadoc for you.

  2. No method may be longer than 60 lines (and most should get nowhere near this big). Break them up into helper methods!
  3. All code must follow Eclipse's standard indentation conventions (essentially the same as the standard for all Java). This is easy to enforce using Eclipse. Just hit ctrl + shift + f (or command + shift + f), and Eclipse will auto-format your code for you (You can even set this up so that Eclipse will auto-format your code every time you save).

F. Testing

So far in this course we've introduced you to unit testing. These tests test small parts of your code in isolation, such as individual methods. Unit tests, however, may not be sufficient for verifying the correctness of large and complicated systems. Another type of test, end-to-end tests, may be helpful. End-to-end tests test higher level behavior of a system. They don't just test individual methods; they test whether the complete system provides its expected functionality as the user would experience it.

For this project, you'll be required to submit both some unit tests and some end-to-end tests. Please at least provide the following end-to-end tests:

These are just basic ones we're requiring you to have to get you started. You should of course do more of your own testing to ensure your code is fully correct! You should also provide unit tests where you feel they are more appropriate. Although unit tests are not strictly required, we may use them to give partial credit in case your project catastrophically fails in the autograder.

You can write unit tests just as you always have done in this course: using JUnit. But how do you write end-to-end tests? The point of the end-to-end tests is that they should simulate how the user would interact with gitlet, that is, using the terminal. The most straightforward way to write end-to-end tests would be to use shell scripting — this is essentially just automating terminal commands. However, since you're not required to be familiar with shell scripting for this course, we recommend you to just use JUnit even for your end-to-end testing, although it is a little awkward. JUnit, as its name suggests, was really designed for unit testing.

To help you over this awkwardness, we've written some example end-to-end tests you can look at in GitletTest.java. We recommend you pattern your own after these. Notice they use a helper method called gitlet, which calls your code by executing a terminal command that calls Java. For end-to-end tests, this is the only way you should interact with your code; you shouldn't call your methods directly.

By the way, you should also try running your code from the command line and use it just like git! Don't only test with JUnit. In addition, if you're using Windows, be sure to test out your code and tests on a unix/linux/mac machine, such as the lab computers. You want to make sure that your code does not only work in a Windows environment, since our autograders will be run in linux.

G. Submission, Grading, and Checkpoints

Submit your project by putting it in a directory called proj2 on the lab machines and typing submit proj2. In your submission, you should include Gitlet.java, other .java files you wrote to complete the project, and your unit and end-to-end tests.

About grading: because we gave you complete freedom to organize your code how you want, we can't unit test your code, because we don't know exactly what methods you used. We can only end-to-end test your code. What this means is that we have to test sequences of commands together. Be aware that if any of the core functionality is broken (namely add, commit, checkout, or log), then many of our tests may break, and you will end up with little points. Make sure they work exactly as the spec describes! Although the fringe functionality is more difficult and time consuming to write (like merge and rebase), fewer tests depend on these methods, so they won't impact your grade as much.

In addition, this project has two checkpoints due along the way so we can ascertain you're on the right track. These will be worth a small portion of your final project grade.

For the first checkpoint, you'll have to submit a design document for your project. In the design document, describe all classes you plan to have, and all of their instance variables and methods. In addition, describe the structure of your .gitlet folder — what kinds of things will it contain, and how are they organized? It's okay if you change all of this stuff later — we just want to see that your group has thought through the whole project by this point. Bring a paper copy of the design document to lab.

At the checkpoint, expect your TA to randomly quiz individual group members on how the project works. Your whole group's grade will depend on how each member does, so make sure that every group member understands the whole project.

For the second checkpoint, you must be able to demonstrate for your TA that all of the commands except possibly merge and rebase basically work (this is easiest to do if you have tests that demonstrate this). At the second checkpoint, your TA will again randomly quiz different group members on the project's functionality. Everyone should understand how different parts of the project work, even if they didn't code it themselves.

H. Extra Credit: Going Remote

This project is all about mimicking git’s local features. These are useful because they allow you to backup your own files and maintain multiple versions of them. However, git’s true power is really in its remote features, allowing collaboration with other people over the internet. The point is that both you and your friend could be collaborating on a single code base. If you make changes to the files, you can send them to your friend, and vice versa. And you'll both have access to a shared history of all the changes either of you have made.

This project’s extra point is to implement some basic remote commands: namely add-remote, rm-remote, push, fetch, pull, and clone. You will get 2 extra credits point for completing them. This extra credit will be significantly more challenging than the rest of the project: don't attempt or plan for it until you have completed the rest of the project. In fact, we don't even recommend reading the remainder of the spec until you've completed the rest of the project.

We should also mention we think that 2 extra credits point is clearly not worth the amount of effort it takes to do this section. The extra credit is meant as a fun little additional challenge for students who might be bored with the class so far. We think this extra credit portion is super awesome, but we're not expecting everyone to do it. We should also say that there will be significantly less support from the staff to help you complete the extra credit. If you're doing the extra credit, we expect you to be able to stand on your own a little bit more than most students.

And one more note: You cannot use slip days to complete the extra credit. If you take a slip day, you will automatically get 0 on the extra credit.

Setting Up scp

Okay, have you finished the rest of the project now? Let's go on then. But before describing how the specific remote commands work, let’s first go over the basics of how the commands can interact with the internet.

All of the remote commands should work off scp (or pretend like they do). scp is a terminal command that allows you to copy files back and forth between computers across the internet. Your remote gitlet commands should work on anything that you have scp access to. That means, before beginning coding this part of the project, you should check to make sure you can use scp from the terminal. Note that scp is a unix command; Windows users should probably just use the lab computers for this (it is possible to get this set up on Windows, but it's a bit complicated and probably not worth your time unless you're familiar with this sort of thing).

Anyway, you should have scp access to your user account on the berkeley lab computers. For example, try out a command like:

scp [some file] cs61bl-[xx]@torus.cs.berkeley.edu:[some other file]

Where [xx] is your login, [some file] is the name of a file on your local computer, and [some other file] is what you want the file to be named when you copy it onto the lab computers. torus.cs.berkeley.edu is the name of a computer on campus; you can use other ones too, such as cory.eecs.berkeley.edu.

After you do the scp command, you’ll be prompted for your password to log-in to your account on the lab computer.

Unfortunately, it just won’t do to have to enter your password every time you run gitlet’s remote commands. So, we’re actually going to need to take advantage of scp’s password-less login features.

So let’s revise what we said earlier to: Your remote gitlet commands should work on anything that you have password-less scp access to.

In order to get yourself password-less login to stuff over scp, you’ll want to set up an ssh key.

You can look up guides for setting up password-less ssh online. For example, this guide on github has some instructions on creating an ssh key. Only steps 1 - 3 will be relevant to you, though, because you don't want to add the ssh key to your github account; you want to add your ssh key to your user account on the berkeley lab computers. For instructions specifically about the lab computers, you might want to check out inst's help page here (see the sections SSH Public and Private Keys (passphrases) and Password-less Logins (OpenSSH)), though the instructions aren't as clear as github's are.

You can look up other resources too, if these aren't good enough for you. Keep in mind that setting up scp is not supposed to be the difficult part of this extra credit! If you get stuck, ask questions.

Note: the simplest way to get Java to transfer files over scp is probably just to make Java call terminal commands; though there are more legit ways using scp in Java, you’re not required to use them. You can try if you want, but don't feel compelled. That said, please do not just make Java directly call terminal commands in the other portions of the project; take advantage of the file system abstractions that Java offers.

The Commands

All right, now that you've gotten scp working, onto the rest of the project!

A few notes about the remote commands:

So now let's go over the commands:

add-remote

rm-remote

push

fetch

pull

clone

More About Remotes

There are two final notes to make about the remote commands.

  1. If you think about the remote commands hard enough, you'll realize that using an arbitrary id scheme won't work in the hypothetical scenario where you're collaborating with a friend using a remote. If you and your friend make different commits on each of your local machines that end up with the same id even though they're different commits, then the remote commands will break.

    One way real git solves this problem is that each commit's id is generated from its contents and metadata. The id number is not simply an arbitrary id number, but a hash. So, if you and your friend make different commits, they have to end up with different ids.

    So to do the remote commands, please change your arbitrary id scheme to a hashing id scheme. The hashes should be determined at least from the commit's message, time, and hash of its parent (if it has one). This means it's okay to get collisions in a scenario where you have two commits that have the same message, time, and history. However, it is imperative you don't get collisions otherwise. To ensure this, look into using an existing hashing algorithm, rather than writing your own. Also, to help you avoid collisions, commit ids can be long hex numbers instead of just integers. Look into java standard library methods for hashing strings.

  2. In order to be able to push after completing a pull, you have to be able to tell whether you've already pulled down everything there is to pull down. However, the pull scheme uses merge, and because merge only maintains one back pointer, the current branch won't necessarily be able to tell that the remote commit is in its history. You'll have to change that. Modify merge so that it adds multiple back pointers to the commit following a merge (The additional histories won't appear in the log, however). Notice this has two consequences. One, the definition of one node being in another's history has changed: being in the history of a commit means being anywhere at all behind it, which can now branch out. Two, during a fetch, there may actually be mutiple latest common nodes. You'll have to attach the newly pulled down commits at all of the appropriate places.

I. Acknowledgements

Thanks to Alicia Luengo, Josh Hug, Sarah Kim, Austin Chen, Andrew Huang, Yan Zhao, Matthew Chow, especially Alan Yao, Daniel Nguyen, and Armani Ferrante for providing feedback on this project. Thanks to git for being awesome.

This project was largely inspired by this excellent article by Philip Nilsson.

This project was created by Joseph Moghadam.