CS61C Fall 2013 Lab 02: MapReduce I

Goals

Setup

Make sure you are using a machine from 330 Soda or 273 Soda when doing this lab. If you are not sitting physically in front of a lab machine, you can access one (list of machine names) remotely by following these instructions.

Copy the Lab 2 files into your account. We recommend you create a separate repository for each assignment. For example, if you would like to have the Lab 2 files in a folder called lab02 in your home directory, you can do:

$ mkdir ~/lab02
$ cd ~/lab02
$ git init
$ git pull ~cs61c/labs/fa13/02 master

Please look at Lab 1 or the additional git notes if you need a quick refresher on git.

Background Information

In lecture we've talked about how the MapReduce system is set up and executed, but now it's time to get some hands-on experience running programs on it!

The MapReduce programming framework is primarily designed to be used on large distributed clusters. However, large, distributed jobs are harder to debug. For this lab, we'll be using Hadoop -- an open-source platform that implements the MapReduce programming model -- in "local mode", where your Map and Reduce routines run entirely within one process. Because Hadoop is open source, you can download and install it (see the Hadoop webpage) on your home machine! This is rather difficult to set up on Windows, but manageable on Mac or Linux.

Quick Tips/FAQ

Don't Reinvent The Wheel!

Sometimes within your map() or reduce() functions you'll need to make use of data structures, such as lists or dictionaries, or common string methods, such as splitting a string. Instead of writing your own classes/methods, you should use those that are already implemented in Java. The classes java.util.HashMap, java.util.HashSet and java.util.ArrayList are particularly likely to be useful to you. If you are unfamiliar with what these classes do, you should take a moment to review them (see link for Java API documentation below).
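For instance, here is a sketch of how a map() might use java.util.HashMap to tally words within a single document; the variable names are purely illustrative and not part of the skeleton code:

// Count occurrences of each word in one document's text (illustrative only)
HashMap<String, Integer> counts = new HashMap<String, Integer>();
for (String word : value.toString().split("\\s+")) {   // split the document text on whitespace
    Integer old = counts.get(word);
    counts.put(word, (old == null) ? 1 : old + 1);
}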

Avoid Global Variables.

One of the core tenets of MapReduce is that we want to avoid multiple machines working on a single, unpartitioned data set because of the associated overhead. As a result, your algorithms will very rarely need to use global variables. In the worst case, you may need to share one or two variables for configuration across machines. If this is necessary, we will indicate it to you specifically in the spec.
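If you do end up needing such a shared value, the usual Hadoop mechanism is the job Configuration rather than a Java static variable. A minimal sketch, assuming a hypothetical property name "cs61c.mode" and the standard Job setup:

// In main(), before submitting the job:
Configuration conf = new Configuration();
conf.set("cs61c.mode", "2");        // small configuration value shipped with the job

// Inside map() or reduce():
String mode = context.getConfiguration().get("cs61c.mode");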

Don't Hash Mutable Objects.

Hashing mutable objects is dangerous and can be a key source of bugs that are very difficult to find and resolve. For more background, see this post on StackOverflow.
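In Hadoop code this matters because classes like Text are mutable, and the framework is free to reuse the same object between calls. A sketch of the pattern to avoid, assuming a hypothetical HashSet named seenDocs:

// Risky: docId is a mutable Text that Hadoop may overwrite later,
// leaving seenDocs with a stale hash bucket
seenDocs.add(docId);

// Safer: store an immutable copy instead
seenDocs.add(docId.toString());     // or: seenDocs.add(new Text(docId));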

Wrapper Classes.

Sometimes, Java primitives need to be stored as objects, so Java provides classes that "wrap" the primitive in an object. For example, the Integer class is a wrapper for the int primitive. See this link for more info.

Hadoop implements its own wrappers for many common Java data types (e.g., LongWritable for long and Text for String) that implement the Writable and Comparable interfaces. In Hadoop, you should use classes that implement Writable as values, and classes that implement both Writable and Comparable as keys.
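Concretely, these wrapper types appear in the type parameters of your Mapper and Reducer classes and in every call to context.write(). A representative (not lab-specific) mapper might look like this sketch:

// Imports assumed: java.io.IOException, org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Mapper
// Input key/value and output key/value are all Hadoop wrapper types, never bare primitives
public static class Map extends Mapper<Text, Text, Text, IntWritable> {
    public void map(Text docId, Text docContents, Context context)
            throws IOException, InterruptedException {
        context.write(docId, new IntWritable(1));   // wrap the int before emitting it
    }
}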

Additional Documentation

Exercises

The following exercises use three different sample input files, two of which are provided by the staff and can be found in ~cs61c/data:

  1. billOfRights.txt.seq -- the 10 Amendments split into separate documents (a very small input)
  2. complete-works-mark-twain.txt.seq -- The Complete Works of Mark Twain (a medium-sized input)

Notice the .seq extension, which signifies a Hadoop sequence file. These are NOT human-readable. To get a sense of the texts you'll be using, simply drop the .seq portion to view the text file (e.g. ~cs61c/data/billOfRights.txt).

Although an exercise may not explicitly ask you to use it, we recommend testing your code on the billOfRights data set first in order to verify correct behavior and help you debug.

We recommend deleting output directories when you have completed the lab, so you don't run out of your 500MB of disk quota. You can do this by running:

$ make destroy-all

Please be careful with this command as it will delete all MapReduce outputs generated in this lab.

Exercise 0: Generating an Input File for MapReduce

For this exercise you will need the Makefile and Importer.java. In this lab, we'll be working heavily with textual data. We have some pre-generated datasets as indicated above, but it's always more fun to use a dataset that you find interesting. This section of the lab will walk you through generating your own dataset using works from Project Gutenberg (a database of public-domain literary works).

Step 1: Head over to Project Gutenberg, pick a work of your choosing, and download the "Plain Text UTF-8" version into your lab directory.

Step 2: Open up the file you downloaded in your favorite text editor and insert "---END.OF.DOCUMENT---" (without the quotes) by itself on a new line wherever you want MapReduce to split the input file into separate (key, value) pairs. The importer we're using will assign an arbitrary key (like "doc_xyz") and the value will be the contents of our input file between two "---END.OF.DOCUMENT---" markers. You'll want to break the work into reasonably-sized chunks, but don't spend too much time on this part (chapters/sections within a single work or individual works in a body of works are good splitting points).
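For example, after Step 2 a chunked file might look roughly like this (text abbreviated):

CHAPTER I. ...text of the first chunk...
---END.OF.DOCUMENT---
CHAPTER II. ...text of the second chunk...
---END.OF.DOCUMENT---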

Step 3: Now, we're going to run our Importer to generate a .seq file that we can pass into the MapReduce programs we'll write. The importer is actually itself a MapReduce program! You can take a look at Importer.java if you want, but the implementation details aren't important for this part of the lab. You can generate your input file like so:

$ make generate-input myinput=YOUR_FILE_FROM_STEP_2.txt

Your generated .seq file can now be found in the convertedOut directory in your lab02 directory. Throughout the rest of this lab, you'll be able to run the MapReduce programs we write using make commands. The make commands will be of the form make PROGRAMNAME-INPUTSIZE. If you wish to try out the input file you generated here, you can instead run:

$ make PROGRAMNAME myinput=YOUR_SEQ_FILE_FROM_STEP_3.txt.seq # Output in wc-out-PROGRAMNAME/ directory

Exercise 1: Running Word Count

For this exercise you will need the Makefile and already-completed WordCount.java. You must compile and package the .java source file into a .jar and then run it on our desired input. Luckily, this is available as a convenient make command:

$ make wordcount-small

This will run WordCount over billOfRights.txt.seq. Your output should be visible in wc-out-wordcount-small/part-r-00000. If we had used multiple reduce tasks, the output would be split across part-r-[id.num] files, where Reducer "id.num" outputs to the corresponding file. The key-value pair for your Map tasks is a document identifier and the actual document text.
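If you read through WordCount.java, its overall shape should resemble the following sketch of a classic Hadoop word count (the provided file may differ in details such as class and variable names):

// map: emit (word, 1) for every word occurrence in the document
public void map(Text docId, Text docContents, Context context)
        throws IOException, InterruptedException {
    for (String word : docContents.toString().split("\\s+")) {
        context.write(new Text(word), new IntWritable(1));
    }
}

// reduce: sum the 1s emitted for each word
public void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
        sum += c.get();
    }
    context.write(word, new IntWritable(sum));
}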

Next, try your code on the larger input file complete-works-mark-twain.txt.seq. In general, Hadoop requires that the output directory not exist when a MapReduce job is executed; however, our Makefile takes care of this by removing our old output directory. Remember that we DON'T need to rebuild wc.jar separately; the Makefile takes care of all the details.

$ make wordcount-medium

Your output for this command will be located in the wc-out-wordcount-medium directory. The first few lines will be confusing since the words you see there are actually numbers (for example, chapter numbers). Search through the file for a word like "the" to get a better understanding of the output. You may also notice that the Reduce "percentage-complete" moves in strange ways. There's a reason for it -- your code is only the last third of the progress counter. Hadoop treats the distributed shuffle as the first third of the Reduce. The sort is the second third. The actual Reduce code is the last third. Locally, the sort is quick and the shuffle doesn't happen at all. So don't be surprised if progress jumps to 66% and then slows.

Exercise 2: Document Word Count

Open DocWordCount.java. Notice that it currently contains the same code as WordCount.java (but with modified class names), which you just compiled and tried for yourself. Modify it to count the number of documents containing each word rather than the number of times each word occurs in the input.

You should only need to modify the code inside the map() function for this part. Each call to map() gets a single document, and each document is passed to exactly one map().
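One possible approach (a sketch only, not the only correct solution) is to deduplicate words within each document before emitting them, for example with a java.util.HashSet; the variable names below are illustrative:

// Emit each distinct word at most once per document
HashSet<String> seen = new HashSet<String>();
for (String word : docContents.toString().split("\\s+")) {
    if (seen.add(word)) {     // add() returns true only the first time the word is seen
        context.write(new Text(word), new IntWritable(1));
    }
}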

You can test DocWordCount using either of the following (for our two data sets):

$ make docwordcount-small  # Output in wc-out-docwordcount-small/

OR

$ make docwordcount-medium # Output in wc-out-docwordcount-medium/

Check-off

Exercise 3: Full Text Index Creation

Open Index.java. Notice again that it contains similar code to WordCount.java (class names have been modified, but the code and the comments have not). Modify it to output, for every word and document pair, a list of comma-separated indices for the word. Please make sure your word indices start at zero and that your output matches the format specified below:

word1 document1-id:	word#, word#, ...
word1 document2-id: 	word#, word#
. . .
word2 document1-id: 	word#
word2 document3-id: 	word#, word#, ...
. . .

Note: 1) when outputting key-value pairs, Hadoop automatically inserts a tab between the key and the value, and 2) there is no comma after the last index.

You can assume that there's just one reducer and hence just one output file. Remember that the Reduce task will execute on key-value pairs in whatever order they are passed to it. The order of your key-value pairs won't necessarily be the same every time you run the program.

For this exercise, you may need to modify map(), reduce(), and the type signature for reduce(). You will also need to make a minor edit to main() to tell the framework about the new type signature for reduce().
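That minor edit usually amounts to telling the Job about the new output types. A sketch, assuming your reduce() now emits Text values (adjust the classes to match your actual types):

// In main(): types emitted by map()
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Final types written by reduce()
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);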

You can test Index using either of the following (for our two data sets):

$ make index-small # Output in wc-out-index-small/

OR

$ make index-medium # Output in wc-out-index-medium/

Compile and run Index.java on both data sets. The output from running make index-medium will be a large file. In order to more easily look at its contents, you can use the commands cat, head, more, and grep:

$ head -25 OUTPUTFILE       # view the first 25 lines of output
$ cat OUTPUTFILE | less     # scroll through output one screen at a time (use Space)
$ cat OUTPUTFILE | grep the # output only lines containing 'the' (case-sensitive)

Make sure to verify your output. Open ~cs61c/data/complete-works-mark-twain.txt and pick a few words. Manually count a few of their word indices and make sure they all appear in your output file.

Check-off

  1. Explain your code in Index.java to your TA.
  2. Show your TA the first page of your output for the word "Mark" in complete-works-mark-twain.txt.seq to verify correct output. You can do this by running: cat wc-out-index-medium/part-r-00000 | grep Mark | less