CS61C Fall 2013 Lab 03 - MapReduce II

Goals

In this lab you will run MapReduce jobs on real EC2 clusters, measure how their performance scales with cluster and input size, and implement a custom Hadoop key type.

Setup

Copy the Lab 3 files into your local lab03 directory. For example, if you are using the work folder you created in the last lab, you can do:

$ mkdir ~/lab03
$ cd ~/lab03
$ git init
$ git pull ~cs61c/labs/fa13/03 master

Please look at Lab 1 or the additional git notes if you need a quick refresher on git.

Now run these commands to configure your account to work with EC2.

$ bash ec2-init.sh
$ source ~/ec2-environment.sh

You should only need to run these commands once. If you have Hadoop or EC2 problems during this lab, please see the "When Things Go Wrong" section.

There are some slight changes to "WordCount.java" from last week's Lab 2, so be sure to use this week's copy. As with the last lab, you can compile it by typing:

$ make

It is okay to ignore the warning that arises during compilation.

Before You Start

As mentioned in the last lab, MapReduce is primarily designed to run on large, distributed clusters, and now you will have the chance to use one. We will be using Amazon's EC2 service, which rents virtual machines by the hour.

You should complete this lab on the machines in 330 Soda or 273 Soda, which have the relevant tools and scripts for starting up a Hadoop cluster on EC2. If you are not sitting physically in front of a lab machine, you can access one remotely (the list of machine names and remote-login instructions are linked from the course site). The lab relies on commands that are available only on the instructional machines, so you won't be able to do this from your own desktop. As a result, you'll use your course account to complete this lab.

Our Test Corpus and Accessing it with Hadoop

We have several datasets stored on Amazon's Simple Storage Service (S3), which provides storage located in the same datacenter as the EC2 virtual machines. In this lab, we will be working with data collected from Usenet newsgroups in 2005 and 2006. You can access it from Hadoop jobs by specifying the "filename" s3n://cs61cUsenet ("s3n" stands for S3 Native). The data is stored compressed, so you should expect the output of a job run against it to be substantially bigger than the reported input size.
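For reference, an s3n:// path is handed to a Hadoop job the same way as any other path; the lab's WordCount already does this for you via its command-line arguments. The fragment below is only a sketch (the class name is made up and it is not part of the lab files), showing roughly how a job driver wires up an S3 input and an HDFS output.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver sketch: Hadoop picks the filesystem (S3 or HDFS) from
// the scheme of each Path, so both locations are set the same way.
public class PathSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        FileInputFormat.addInputPath(job, new Path("s3n://cs61cUsenet/s2006"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///wordcount-out"));
        // Mapper, reducer, and output types would be configured here before
        // submitting the job; see WordCount.java for the real driver.
    }
}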

EC2 Billing

Amazon provides virtual machines at several price points. In this lab, we will be using "High-CPU Extra-Large" ("c1.xlarge") virtual machines, which cost about $0.58 an hour (when rented on an on-demand basis) and provide the equivalent of roughly eight 2.5 GHz commodity processor cores and 7 GB of RAM. Note that 1) hours of use are rounded up and 2) we are billed for all time when the machines are on, regardless of whether the machines are active. Thus, we pay for at least one hour every time new machines are started, so starting a machine for 5 minutes, terminating it, and starting an identical one for another 5 minutes causes us to be billed for 2 hours of machine time. In addition to billing for virtual machine usage by the hour, Amazon also charges by usage for out-of-datacenter network bandwidth and long-term storage; these costs are usually negligible compared to the virtual machine costs.
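To make the billing rules concrete, here is a small illustrative sketch (not Amazon's actual billing logic) that estimates the virtual-machine portion of a bill, using the on-demand c1.xlarge rate quoted above and ignoring bandwidth and storage charges.

// Illustrative only: estimates EC2 machine costs under the rules above.
public class Ec2BillEstimate {
    // On-demand c1.xlarge rate quoted above, in dollars per machine-hour.
    static final double RATE = 0.58;

    // Each launch is billed separately, and hours are rounded up.
    static double estimate(int machines, double hoursPerLaunch, int launches) {
        return launches * machines * Math.ceil(hoursPerLaunch) * RATE;
    }

    public static void main(String[] args) {
        // The example from the text: one machine started for 5 minutes, twice.
        System.out.println(estimate(1, 5.0 / 60.0, 2));  // about 1.16 (2 machine-hours)
        // A 5-node cluster left up for just under 3 hours.
        System.out.println(estimate(5, 2.9, 1));          // about 8.70 (15 machine-hours)
    }
}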

Exercise 0: Custom Keys

Note: This exercise is designed to be done concurrently with the other parts of the lab. You should work on it while you are waiting for your code to finish running on the EC2 servers, for example in a second terminal window. While working on this, make sure to check back periodically on your EC2 servers, as we do not want them sitting idle for too long. Please start a cluster (see Exercise 1) before continuing with this exercise.

First, read the sections "Writable Types", "Custom Key Types", "Using Custom Keys", and "Final Writable Notes" in the Hadoop tutorial linked from the lab page. You do not need to do the exercise stated in the link. As a minor clarification, keys in Hadoop do not need to override the equals() function.

Now it is time for you to implement your own custom key. The file NumStr.java contains a partial implementation of a custom key that contains a long and a String. Please fill out the sections marked with the comment "YOUR CODE HERE". It may be helpful to look up the Java documentation for the DataInput and DataOutput interfaces to see what methods are available for serializing and deserializing your data. Hint: You should use matching pairs of read and write methods.
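For orientation, here is a generic sketch of what a WritableComparable key holding a long and a String can look like. This is not the lab's NumStr.java (which has its own field names, methods, and tests), so fill in that file rather than copying this; it only illustrates the pattern, in particular that write() and readFields() must handle the fields with matching methods in the same order.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Generic custom-key sketch: a long paired with a String.
public class ExampleKey implements WritableComparable<ExampleKey> {
    private long num;
    private String str;

    public ExampleKey() {}  // Hadoop requires a no-argument constructor

    public ExampleKey(long num, String str) {
        this.num = num;
        this.str = str;
    }

    // Serialization: write the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeLong(num);
        out.writeUTF(str);
    }

    // ...and read them back with the matching methods in the same order.
    public void readFields(DataInput in) throws IOException {
        num = in.readLong();
        str = in.readUTF();
    }

    // Keys must be comparable so Hadoop can sort them between map and reduce.
    public int compareTo(ExampleKey other) {
        if (num != other.num) {
            return num < other.num ? -1 : 1;
        }
        return str.compareTo(other.str);
    }

    // Used by the default partitioner to assign keys to reducers.
    public int hashCode() {
        return 31 * (int) num + str.hashCode();
    }
}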

The NumStr class contains test cases for each of the methods you need to implement. You can compile and run the test cases with the command:

$ make numstr

Exercise 1: Running WordCount

Starting a Cluster

Start a Hadoop cluster of 4 c1.xlarge worker nodes and 1 c1.xlarge master and worker node using:

$ hadoop-ec2 launch-cluster --auto-shutdown=170 large 5 

This command may take several minutes to complete. When it is done, it should give you two URLs, one for the "namenode" and one for the "jobtracker". Open both URLs in a web browser. If you lose track of these URLs, you can list them again with

$ hadoop-ec2 list large

(The --auto-shutdown option specifies to terminate the cluster after 170 minutes. This is intended to reduce the likelihood of expensive accidents, but please do not rely on this option to terminate your machines. Amazon rounds up to the nearest hour, so 170 is intended to give you nearly 3 hours -- which should be plenty for a two-hour lab!)

Hadoop includes both a distributed filesystem implementation (the Hadoop Distributed File System, or HDFS), which is similar to the Google File System, and a MapReduce implementation. Both of these use a single master program and several worker programs. The master program for HDFS is called the "namenode", since it stores all the filesystem metadata (file and directory names, locations, permissions, etc.), and the worker programs for HDFS are called "datanodes", since they handle unnamed blocks of data. Similarly, the master program for the MapReduce implementation is called the "jobtracker" (which keeps track of the state of MapReduce jobs) and the workers are called "tasktrackers" (which keep track of tasks within a job). Storage on HDFS tends to be substantially faster but more expensive than S3, so we're going to use it for outputs. This storage is automatically cleared when you shut down your instances.
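As an aside, HDFS paths like hdfs:///wordcount-out can also be reached programmatically through Hadoop's FileSystem API. The sketch below is purely illustrative (the class name is made up, it is not part of the lab, and it assumes the cluster's Hadoop configuration is on the classpath); in this lab you will use the hc ... dfs commands shown later instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: list the files a job wrote to HDFS. The namenode's
// address comes from the Hadoop configuration on the classpath.
public class ListHdfsOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("hdfs:///wordcount-out"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}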

Run WordCount against the corpus using:

$ hadoop-ec2 proxy large
$ hc large jar wc.jar WordCount s3n://cs61cUsenet/s2006 hdfs:///wordcount-out

The first command sets up an encrypted connection from your lab machine to the master machine of the cluster. The hc command runs the "hadoop" command you ran in last week's lab against the remote cluster, using the connection created by the first command to submit jobs, etc. (The "large" here refers to the size of the individual machines, not to the size of your cluster.)

(If you get errors like "11/01/24 14:47:41 INFO ipc.Client: Retrying connect to server: .... Already tried 0 time(s)." from hc, that indicates that the connection probably went down, and you should run the proxy command again to reconnect.)

The filename "hdfs:///wordcount-out" specifies a location on the distributed filesystem of the cluster you have started.

Monitoring the Job

You should watch the job's progress through the web interface for the jobtracker. After the job starts, the main webpage for the job tracker will have an entry for the running job. While the job is running, use the web interface to look at a map task's logs. (Click on the number under "running map tasks".)

Find out where the following are displayed on the web interface and record (to a file) the final values when the job completes:

  1. The total runtime
  2. The total size of the input in bytes and records. (labeled as S3N_BYTES_READ)
  3. The total size of the map output in bytes and records.
  4. The total size of the "shuffled" (sent from map tasks to reduce tasks) bytes
  5. The number of records (i.e. words) in the output
  6. The number of killed map and reduce task attempts.
  7. The number of combine input and output records.

Retrieving the Output

You can view your output with the following command:

$ hc large dfs -cat hdfs:///OUTPUT_DIR/FILE_NAME | less

For example, if you wanted to view the file part-r-00000 in directory hdfs:///wordcount-out, you would type hc large dfs -cat hdfs:///wordcount-out/part-r-00000 | less (or you can substitute less with another text viewer of your choice).

After the job is complete, you could in principle retrieve your output to the local filesystem with a command like:

$ hc large dfs -copyToLocal hdfs:///FILE_NAME LOCAL_FILE_NAME

The output file for these jobs is prohibitively large, however, so we won't require you to actually retrieve the output.

Exercise 2: On a Larger Cluster

Now, run the same job on a larger cluster. Terminate your existing cluster with:

$ hadoop-ec2 terminate-cluster large

and start another with 10 worker nodes using:

$ hadoop-ec2 launch-cluster --auto-shutdown=170 large 10

Once the new cluster is up, re-run the proxy command against it and then run the same WordCount job as in Exercise 1.

From the web interface, record or answer the following:

  1. How long did your job take?
  2. What was the total size of the "shuffled" bytes on the larger cluster?
  3. How many times faster was the larger cluster than the smaller one?
  4. What was the throughput of the large and small cluster (in MB/sec)? (A short sketch of this arithmetic follows the list.)
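For the throughput questions, here is a minimal sketch of the arithmetic; the numbers are placeholders that you should replace with the values you actually recorded. The same calculation applies to Exercise 3 below.

// Placeholder arithmetic for throughput; substitute your recorded values.
public class Throughput {
    public static void main(String[] args) {
        long bytesRead = 20000000000L;    // hypothetical S3N_BYTES_READ value
        double runtimeSeconds = 15 * 60;  // hypothetical runtime: 15 minutes

        double megabytes = bytesRead / (1024.0 * 1024.0);
        System.out.printf("Throughput: %.1f MB/sec%n", megabytes / runtimeSeconds);
    }
}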

Exercise 3: Big and Little Data

And now for one last experiment. This is designed to give you some insight into the difference between "strong scaling" and "weak scaling". Strong scaling is when adding more machines makes your algorithm run faster on a fixed amount of data. Weak scaling is when adding more machines lets you process more data in the same amount of time. As we alluded to in lecture, these don't always go together.

You should have the bigger cluster running at this point. Try running the job with a smaller input, 's3n://cs61cUsenet/s2005'. (This is a substantially smaller amount of data than 's3n://cs61cUsenet/s2006'.) You should use a different output directory than you did for Exercise 1. There is plenty of space, but if you want to delete old output, see "Deleting Old Output Directories" below for instructions.

Record the amount of data your job processed (given as S3N_BYTES_READ) and how long your job took. Then, answer the following:

  1. What was the throughput of this job (in MB/sec)?
  2. What type of scaling is observed with WordCount?

Checkoff

Before you leave, terminate any clusters using:

$ hadoop-ec2 terminate-cluster large

Then, answer the following:

  1. How much money did you spend on EC2?