Updates
- 03/20/2013 - Clarified meaning of input size in Assignment section.
- 03/19/2013 - Removing/Listing files section of Troubleshooting is now updated with new commands.
- 03/19/2013 - ec2-usage is known to provide incorrect values sometimes, so please don't be concerned if it reports ridiculously large costs/usage times.
Administrative
- PLEASE START EARLY. A limited number of clusters can be initialized at once under the cs61c master account. That means that if you procrastinate on the project right now, there's a good chance that you won't be able to start a cluster easily towards the end of the week, when traffic is heaviest.
- Please consult the troubleshooting guide at the bottom of this page if you run into issues. DO NOT post on Piazza until you have exhausted all of the relevant troubleshooting tips.
- You MUST work with a partner on this project (your partner does not have to be in your section). You may not work alone and you may not work in groups larger than two.
Summary
In this part of the project, we'll use the power of EC2 to run our massively parallel BFS on huge graphs. We'll be using 6, 9, or 12 machines at a time, each with the following specifications:
High-CPU Extra Large Instance:
- 7 GiB of memory
- 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each; according to Amazon, this is approximately equivalent to a machine with 20 early-2006 1.7 GHz Intel Xeon processors)
- 1.65 TiB of instance storage
- 64-bit platform
- I/O Performance: High
- API name: c1.xlarge
Thus, when we're working with 12 instances (machines), we're working under the following "constraints":
12 High-CPU Extra Large Instances:
- Total of 84 GiB of memory
- Total of 240 early-2006 1.7 GHz Intel Xeon processors
- Total of 19.8 TiB of instance storage
That's quite a bit of power!
Prepare Your Account
First, we need to take a few steps to prepare your account for launching jobs on EC2. Begin by running these commands (you should only need to do this once):
$ mkdir ~/proj2-2
$ cd ~/proj2-2
$ cp ~cs61c/proj/02_EC2/ec2-init.sh .
$ bash ec2-init.sh
$ source ~/ec2-environment.sh
Next, copy over your Makefile and SmallWorld.java from part 1 of the project into the current directory (proj2-2).
Now that we have a copy of SmallWorld.java, we need to modify it to use multiple reducers (Hadoop will scale the number of mappers automatically). To do this, we use the method job.setNumReduceTasks(int num). We will set the following number of reducers for each cluster size:
Cluster Size | Number of Reducers
-------------|-------------------
6 Machines   | 24
9 Machines   | 36
12 Machines  | 48
For the first two jobs (Loader and BFS), we'll want to do the following after the job object is created in main:
job.setNumReduceTasks(NUMBER FROM ABOVE TABLE);
Since we only want a single output file, we need to limit the last job (Histogram) to use only a single reducer. We do this by adding the following after the job object for the Histogram phase is created in main:
job.setNumReduceTasks(1);
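As a sanity check, the table above works out to exactly 4 reducers per machine for the Loader and BFS jobs. If you'd rather compute the argument than hard-code it each time you change cluster sizes, here is a minimal plain-Java sketch (the helper name `reducersFor` is our own, not part of the project skeleton):

```java
public class ReducerCount {
    // The table above assigns 4 reducers per machine for the Loader
    // and BFS jobs; the Histogram job always uses exactly 1 reducer
    // so that it produces a single output file.
    static final int REDUCERS_PER_MACHINE = 4;

    static int reducersFor(int clusterSize) {
        return clusterSize * REDUCERS_PER_MACHINE;
    }

    public static void main(String[] args) {
        for (int machines : new int[] {6, 9, 12}) {
            System.out.println(machines + " machines -> "
                + reducersFor(machines) + " reducers");
        }
    }
}
```

You would then pass `reducersFor(clusterSize)` to `job.setNumReduceTasks` for the Loader and BFS jobs, and the literal `1` for Histogram.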
Yay! Everything should now be ready to go for EC2. Don't forget to keep updating the argument of setNumReduceTasks as we try different cluster sizes. Also, don't forget to run make every time you change something in SmallWorld.java (the compiled sw.jar file is the file given to your cluster to execute).
Head over to the next section to learn how to actually run things on EC2.
How to Run on EC2
Here, we'll practice by running our code on EC2 with the graph ring4.seq. Please read this entire section carefully before performing any of it. Also, please do not start a cluster for the sole purpose of running this example. Even if you run it only for 30 seconds and then shut down the instances, the university is charged for a full hour.
First, we'll want to go ahead and launch our cluster. To launch a large cluster with N workers, we run the following command. Note that this may take a few minutes to complete.
$ hadoop-ec2 launch-cluster --auto-shutdown=230 large N
The execution of this command may take a while, but once it successfully completes, you'll see some URLs in the output. If you paste these into your browser, you'll be able to view job status and all kinds of cool stats, as long as the cluster is still running. You'll need some of this information to answer the required questions later on. If you ever forget what these URLs are, you can run hadoop-ec2 list large to view them again, along with a list of all of your running instances.
Now that the cluster is started, we can go ahead and run the following to connect to our cluster:
$ hadoop-ec2 proxy large
Once this completes successfully, we can go ahead and launch the job (in this case a run of SmallWorld on ring4.seq) on EC2.
$ hc large jar sw.jar SmallWorld s3n://cs61cSmallWorld/ring4.seq hdfs:///smallworld-out <DENOM>
After the job is complete (which you can check using the web GUI or the stream of output in terminal), you can fetch your output using:
$ hc large dfs -cat hdfs:///smallworld-out/part-r-00000 > ec2_output.txt
If everything ran correctly, your output should now be located in ec2_output.txt in your current working directory.
Finally, once we're all done grabbing all the information/data we need, we go ahead and shut down the cluster. This destroys all data that we placed on the machines in the cluster. In addition, the URLs you used earlier to see the web GUI will cease to function. Thus, be sure that you grab all the data you need to finish your project before terminating. To terminate, run the following:
$ hadoop-ec2 terminate-cluster large
To confirm that your cluster was successfully shut down, run hadoop-ec2 list large in the terminal. There should be no output if your cluster has successfully terminated.
Please note that if you leave an instance running for an unreasonable amount of time (for example, overnight), you will lose points. We are able to track the usage of each account and you will need to give us a cost assessment in your final report.
Finally, for reference, here are the graphs available to you on EC2:
Graphs on EC2 (s3n://cs61cSmallWorld/GRAPH_NAME_HERE):
- hollywood.sequ - Hollywood co-star database circa 2009 (1M actors)
- wikipedia.seq - Wikipedia links circa 2011 (5.7M articles). For fun: not part of the assignment. Do not try this until you have completed the rest of the project, and do not try this unless you have efficient code.
- ring4.seq - The same 4-vertex ring from part one.
Assignment
Part 2 (due 3/24/13 @ 23:59:59)
Before running your code on EC2, run it locally to be sure it is correct and decently
efficient (you should have done this for part 1 anyway). Complete the following
questions (and do the required runs on EC2) and place the answers in a plain-text file
called proj2.txt
. The contents of this file are primarily what will be graded for
this part of the project. Please read through all of the questions before starting to run
anything on EC2, so that you know exactly what information you need to collect before
terminating a cluster.
- Run your code on hollywood.sequ with denom=100000 on clusters of size 6, 9, and 12. Be sure to set the appropriate number of reducers for each cluster size as indicated in the setup section. How long does each take? How many searches did each perform? How many reducers did you use for each? (Read the rest of the questions to see what other data you will need.)
- For the Hollywood dataset, at what distance are the 50th, 90th, and 95th percentiles? (Choose the output of any one of your three runs to use as source data.)
- What was the mean processing rate (MB/s) for 6, 9, and 12 instances? You can approximate the data size to be (input size) * (# of searches). Input size is equal to the value given for S3N_BYTES_READ on the job page for your first mapper.
- What was the speedup for 9 and 12 instances relative to 6 instances? What do you conclude about how well Hadoop parallelizes your work? Is this a case of strong scaling or weak scaling? Why or why not?
- Do some research about the combiner in Hadoop. Can you add a combiner to any of your MapReduces? ("MapReduces" being Loader, BFS, and Histogram.) If so, for each MapReduce phase that you indicated "yes" for, briefly discuss how you could add a combiner and what impact it would have on processing speed, if any. (NOTE: You DO NOT have to actually code anything here. Simply discuss/explain.)
- What was the price per GB processed for each cluster size? (Recall that an extra-large instance costs $0.68 per hour, rounded up to the nearest hour.)
- How many dollars in EC2 credits did you use to complete this project? If ec2-usage returns bogus values, please try to approximate (and indicate this).
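To make the arithmetic behind the rate, speedup, and cost questions concrete, here is a minimal plain-Java sketch. The run times and byte counts in main are made-up placeholders, not real measurements; substitute your own numbers from the web GUI.

```java
public class ClusterStats {
    // Processing rate in MB/s, approximating total data processed as
    // (input size in bytes) * (number of searches).
    static double rateMBps(long inputBytes, int searches, double seconds) {
        return inputBytes * (double) searches / (1024.0 * 1024.0) / seconds;
    }

    // Speedup of a run relative to the 6-machine baseline time.
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    // Price per GB processed: each extra-large instance costs $0.68 per
    // hour, rounded UP to the nearest hour.
    static double pricePerGB(int machines, double seconds,
                             long inputBytes, int searches) {
        double hoursBilled = Math.ceil(seconds / 3600.0);
        double cost = machines * hoursBilled * 0.68;
        double gb = inputBytes * (double) searches
            / (1024.0 * 1024.0 * 1024.0);
        return cost / gb;
    }

    public static void main(String[] args) {
        // Placeholder numbers only -- substitute your own measurements.
        long inputBytes = 500_000_000L; // S3N_BYTES_READ of the first mapper
        int searches = 20;
        double t6 = 1800.0, t12 = 1000.0; // run times in seconds
        System.out.printf("6-machine rate: %.1f MB/s%n",
            rateMBps(inputBytes, searches, t6));
        System.out.printf("12 vs 6 speedup: %.2fx%n", speedup(t6, t12));
        System.out.printf("12-machine $/GB: %.3f%n",
            pricePerGB(12, t12, inputBytes, searches));
    }
}
```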
Submission
In order to submit part 2, type:
$ submit proj2-2
Only one partner should submit. Please be sure to indicate your partner's login when prompted.
Submit proj2.txt, your modified SmallWorld.java, the Makefile (if modified), and any additional source files your code needs. Be sure to put both your name and login and your partner's name and login at the top of proj2.txt.
Troubleshooting
Stopping Running Hadoop Jobs
Often it is useful to kill a job. Stopping the Java program that launches the job is not enough; Hadoop on the cluster will continue running the job without it. The command to tell Hadoop to kill a job is:
$ hc large job -kill JOBID
where JOBID is a string like "job_201101121010_0002" that you can get from the output of your console log or the web interface. You can also list jobs using:
$ hc large job -list
Proxy Problems
Seeing output like the following?
"12/34/56 12:34:56 INFO ipc.Client: Retrying connect to server: ec2-123-45-67-89.amazonaws....."
or
"Exception in thread "main" java.io.IOException: Call to ec2-123-45-67-89.compute-1.amazonaws.com/123.45.67.89:8020 failed on local exception: java.net.SocketException: Connection refused <long stack trace>"
If you get this error from hc, try running
$ hadoop-ec2 proxy large
again. If you continue getting this error from hc after doing that, check that your cluster is still running using
$ hadoop-ec2 list large
and by making sure the web interface is accessible.
Deleting Configuration Files
If you've accidentally deleted one of the configuration files created by ec2-init.sh, you can recreate it by rerunning:
$ bash ec2-init.sh
Last Resort
It's okay to stop and restart a cluster if things are broken, but doing so wastes time and money.
Appendix: Miscellaneous EC2-related Commands
We don't think you'll need any of these, but just in case:
Terminating/Listing Instances Manually
You can get a raw list of all virtual machines you have running using
$ ec2-my-instances
This will include the public DNS name (starts with "ec2-" and ends with "amazonaws.com") and the private name (starts with "ip-...") of each virtual machine you are running, as well as whether it is running or shutting down or recently terminated, its type, the SSH key associated with it (probably USERNAME-default) and the instance ID, which looks like "i-abcdef12". You can use this instance ID to manually shut down an individual machine:
$ ec2-terminate-instances i-abcdef12
Note that this command will not ask for confirmation. ec2-terminate-instances comes from the EC2 command line tools. ec2-my-instances is an alias for the command line tools' ec2-describe-instances command with options to only show your instances rather than all instances belonging to the class.
Listing/removing files from your cluster
LISTING FILES:
$ hc large dfs -ls hdfs:///[DIRECTORY NAME HERE]
REMOVING FILES/DIRECTORIES:
$ hc large dfs -rmr hdfs:///[FILENAME HERE]
Note that you may be able to delete/view other people's files. Please do not abuse this. As usual, tampering with files belonging to other people is a student conduct violation.
Logging into your EC2 virtual machines
$ hadoop-ec2 login large
# or, using a machine name listed by ec2-my-instances or hadoop-ec2 list:
$ ssh-nocheck -i ~/USERNAME-default.pem root@ec2-....amazonaws.com
The cluster you start is composed of ordinary Linux virtual machines. The file ~/USERNAME-default.pem is the private part of an SSH keypair for the machines you have started.
Viewing/changing your AWS credentials
You can view your AWS access key + secret access key using:
$ ec2-util --list
If you somehow lose control of your AWS credentials, you can get new AWS access keys using:
$ ec2-util --rotate-secret
$ new-ec2-certificate
This is likely to break any running instances you have.
Acknowledgements
Thanks to Scott Beamer (a former TA), the original creator of this project, and Alan Christopher and Ravi Punj for the "Troubleshooting" and "Appendix" sections.
Additional Links (NOT required reading) on Social Network Analysis
- Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a Social Network or a News Media?. WWW 2010. (analysis of Twitter and source of our Twitter data)
- Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature:393, 1998. (mathematics behind small-world networks)
- Stanford Network Analysis Project (a great source of social network data and the source of cit-HepPh)
- Laboratory for Web Algorithmics (source of the Hollywood data and has current metrics on largest social networks)
- Wikipedia Crawl