CS61C Spring 2013 Lab 6

MapReduce

Java API Quickstart

Lab 2 will introduce you to the Hadoop MapReduce framework, which is implemented in the Java programming language. This section will provide you with a quick introduction to relevant APIs of the Java programming language. If you are interested in learning Java in greater detail, the recommended book is Head First Java.

The most essential part to this lab is harnessing Java's String API. Here is an example of important API method that you might find useful for this lab:

To split strings using a delimiter such as a space (" "), or a comma (","), you can use the split method, as shown in the following example:

/* SplitTest.java */
// To compile and execute (the main method) this file, use the following commands:
// $ javac SplitTest.java
// $ java SplitTest

public class SplitTest
{
	//
	// The main method will break a string (consisting of multiple words) up into an array
	// of strings (which would be the constituent words of the original sentence),
	// which are separated by the specified delimiter regular expression (in this case, simply a space).
	//
	// Using the array, the main method will output all words of length greater or equal to 3 characters.
	//
	public static void main(String args[])
	{
		String testString = "This is a test string consisting of words of varying lengths";
		String words[] = testString.split(" ");
		for(int i = 0; i < words.length; i++)
		{
			String currentWord = words[i];
			if(currentWord.length() >= 3)
			{
				System.out.println(currentWord);
			}
		}
	}
}
		

Other useful String methods can be found in the Java String API documentation.

Complex MapReduce applications sometime require additional data structures to be used, such as java.util.HashSet, java.util.HashMap, and java.util.ArrayList. You will find these classes to be particularly useful in the later exercises of this lab.

Here is an example of each of these three data structures, demonstrating some handy API methods:

Additional Resources

Setup

You may work on this lab in partners. It will be especially helpful if one of the partners has some experience in coding Java.

The MapReduce programming framework is primarily designed to be used on large distributed clusters. However, large distributed jobs are harder to debug. So instead, for this lab, we’ll be using Hadoop -- an open source platform which implements MapReduce programming -- in “local mode”, where your Map and Reduce routines are run entirely within one process. (Never fear, the next lab will involve running your code on distributed clusters on EC2!)

You should complete this lab on the machines in 330 Soda. If you are not sitting physically in front of one of these lab machines, you can access one of them remotely by following these instructions.

Pull the template files for this lab by doing the following:

$ mkdir ~/lab06
$ cp -r ~cs61c/labs/06/ ~/lab06

Exercise 0: Running Word Count

First, compile the code by running:

$ make

The make command follows the instructions inside of Makefile to compile the source in WordCount.java into wc.jar. Next, run the word count example

$ hadoop jar wc.jar WordCount ~cs61c/data/billOfRights.txt.seq wc-out

This will run word count over a sample input file (US Bill of Rights). Your output should be visible in wc-out/part-r-00000. If you had multiple reduces, then the output would be split across part-r-[id. num], where Reducer "id. num" outputs to the corresponding file. The plain-text for the test code is in ~cs61c/data/billOfRights.txt. For the input to your MapReduce job, the key for map() is the document identifier and the value is the actual text.

Once you have things working on the small test case, try your code on the larger input file ~cs61c/data/sample.seq (approx. 34 MiB). This file contains the text of one week's worth of newsgroup posts (corpus).

$ rm -rf wc-out
$ hadoop jar wc.jar WordCount ~cs61c/data/sample.seq wc-out

Hadoop requires the output directory not to exist when a MapReduce job is executed, which is why we deleted the wc-out directory. Alternatively, we could have chosen a different output directory.

You may notice that the Reduce percentage-complete percentage moves in strange ways. There’s a reason for it. Your code is only the last third of the progress counter. Hadoop treats the distributed shuffle as the first third of the Reduce and the sort as the second third. Locally, the sort is quick and the shuffle doesn‘t happen at all. So don’t be surprised if progress jumps to 66% and then slows.

Exercise 1: Document Word Count

Now, open DocWordCount.java. Notice that it contains the same code as WordCount.java, which you just compiled and tried for yourself. Modify it to count the number of documents containing each word rather than the number of times each word occurs in the input. Run make to compile your modified version, and then run it on the same inputs as before.

You should only need to modify the code inside the map() function for this part. Each call to map() gets a single document, and each document is passed to exactly one map().

Exercise 2: Full Text Index Creation

Create a new class file called Index.java based on WordCount.java (don't forget to change the name of the class to Index too!). Modify Index.java to output, for each word, a list of the locations (word number in the document and identifier for the document) in which the word occurs. (An identifier for each document is provided as the key to the mapper.) Your output should have lines that look like:

word1   document1-id:word#,word#,word#...
word1	document2-id:word#,word#,word#...
...
word2	document1-id:word#,word#,word#...
word2	document2-id:word#,word#,word#...
...

Minor line formatting details don’t matter. You should number words in a document starting with zero. The output for each word should be together in your output file. You can assume that there’s just one reducer and hence just one output file. For each word, there should be one line for each document containing that word. To lower the output size and memory requirements, don’t output more than a thousand locations for any given word.

For this exercise, you probably need to modify map(), reduce() and the type signature for Reduce. You will also need to make a minor edit to main() to tell the framework about the new type signature for Reduce.

Compile and run this MapReduce program on the same inputs as the previous exercises. Do not request checkoff until Index has successfully completed on sample.seq and you have a file called part-r-00000 in the wc-out directory. This will be a very large file. In order to easily look at its contents, run $ head -25 wc-out/part-r-00000 to view the first 25 lines of output.

Exercise 3: Midsemester Survey

Please complete this anonymous survey. You will have to show your TA the survey submission confirmation page.

Checkoff

Show the TA your DocWordCount.java
Show the TA your Index.java
Show the TA the output from running your Index on sample.seq
Show your TA the submission confirmation page (ie. the page you get after hitting submit).

Before you leave, be sure to save your code since you may want to look at it while doing Project 2. We recommend deleting output directories when you have completed the lab, so you don't run out of your 500MB of disk quota