CS61C Spring 2014 Project 2 Part 2: Running MapReduce on Amazon EC2

TA: Sung-Roa Yoon / Shreyas Chand
Part 1 Due: 03/19 @ 23:59:59
Part 2 Due: 04/02 @ 23:59:59

Summary

In this part of the project, we'll use the power of EC2 to solve our exponentially large game trees for Connect N. We'll be using 6, 9, or 12 machines at a time, each with the following specifications:

High-CPU Extra Large Instance:
7 GiB of memory
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
(according to Amazon, this is approximately equivalent to a machine
with 20 early-2006 1.7 GHz Intel Xeon Processors)
1.65 TiB of instance storage
64-bit platform
I/O Performance: High
API name: c1.xlarge

Thus, when we're working with 12 instances (machines), we're working under the following "constraints":

12 High-CPU Extra Large Instances:
Total of 84 GiB of memory
Total of 240 early-2006 1.7 GHz Intel Xeon Processors
Total of 19.8 TiB of instance storage

That's quite a bit of power!

Prepare Your Account

First, we need to take a few steps to prepare your account for launching jobs on EC2. Begin by running these commands (you should only need to do this once):

$ cd ~/proj2
$ bash ec2-init.sh
$ source ~/ec2-environment.sh

Next, copy over our updated Proj2.java and Makefile, overwriting the old version. Be careful not to accidentally overwrite all the code you wrote for part 1! You may want to create a backup copy just to be safe.

$ cp ~cs61c/proj/02-2/Proj2.java ~/proj2
$ cp ~cs61c/proj/02-2/Makefile ~/proj2

In order to take advantage of the many machines we will be using, we need to tell Hadoop to scale up the number of reducers it uses (the number of mappers is scaled automatically based on the input). To do so, the updated version of Proj2.java adds the following line right after each MapReduce job is created:

job.setNumReduceTasks([NUMBER]);

For InitFirst and FinalMoves, we always set the number of reducers to one, since they are not compute-intensive tasks and we would like only one final output file. For PossibleMoves and SolveMoves, however, we will change the number for different cluster sizes. To make the change easy, an instance variable called NUMBER_OF_REDUCERS has been added; it is used when setting the number of reducers for PossibleMoves and SolveMoves.
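
For reference, here is a minimal sketch of how this wiring might look. It is illustrative only: aside from NUMBER_OF_REDUCERS and setNumReduceTasks, the names and structure below are assumptions, not the exact contents of Proj2.java.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class Proj2Sketch {
    // Set per the table below: 24, 36, or 48 depending on cluster size.
    static final int NUMBER_OF_REDUCERS = 24;

    // Compute-heavy phase: spread the work across many reducers.
    static Job makePossibleMovesJob(Configuration conf) throws Exception {
        Job job = new Job(conf, "PossibleMoves");
        job.setNumReduceTasks(NUMBER_OF_REDUCERS);
        return job;
    }

    // Lightweight phase: one reducer, so we get a single output file.
    static Job makeFinalMovesJob(Configuration conf) throws Exception {
        Job job = new Job(conf, "FinalMoves");
        job.setNumReduceTasks(1);
        return job;
    }
}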

This table lists the number of reducers we will use for each cluster size:
Cluster Size  |  Number of Reducers
--------------|--------------------
 6  Workers   |         24
 9  Workers   |         36
 12 Workers   |         48 

Yay! Everything should now be ready to go for EC2. Don't forget to keep updating the value of NUMBER_OF_REDUCERS as we try different cluster sizes. Also, don't forget to run make every time you change something in Proj2.java (the compiled proj2.jar file is the file given to your cluster to execute). Head over to the next section to learn how to actually run things on EC2.
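For example, after editing Proj2.java:

$ cd ~/proj2
$ make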

How to Run on EC2

Here, we'll practice by running our code on EC2 for Connect 3 on a 3 by 3 board. Please read this entire section carefully before performing any of it. Also, please do not start a cluster for the sole purpose of running this example. Even if you run it only for 30 seconds and then shut down the instances, the University is charged for a full hour.

First, we'll want to go ahead and launch our cluster. To launch a cluster with N workers, we run the following command (here, large is the cluster name, which the other hadoop-ec2 commands also use). Note that this may take a few minutes to complete.

$ hadoop-ec2 launch-cluster --auto-shutdown=230 large N
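
For example, with N = 6:

$ hadoop-ec2 launch-cluster --auto-shutdown=230 large 6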

Once the command completes successfully, you'll see some URLs in the output. Paste these into your browser to view job status and all kinds of cool stats, as long as the cluster is still running. You'll need some of this information to answer the required questions later on. If you ever forget what these URLs are, you can run hadoop-ec2 list large to view them again, along with a list of all of your running instances.

Now that the cluster is started, we can go ahead and run the following to connect to our cluster:

$ hadoop-ec2 proxy large

Once this completes successfully, we can go ahead and launch the job (in this case, solving Connect 3 on a 3 by 3 board) on EC2. We've made this really simple by putting the appropriate commands in the updated Makefile.

$ make proj2-hadoop WIDTH=3 HEIGHT=3 CONNECT=3

After the job is complete (which you can check using the web GUI or the stream of output in terminal), you can fetch your output using:

$ make proj2-hadoop-output

If everything ran correctly, your output should now be located in ec2_output.txt in your current working directory.

For the larger board sizes, this command is not feasible, since the output file will be gigabytes in size. Specifically, do NOT copy the output from running your code on Connect 4 on a 5 by 5 board. Instead, you can copy over selected parts of the output by using one of the following commands:

# Copy the last kilobyte of the output file
$ hc large dfs -tail hdfs:///all_data/final/part-r-00000 > output_ec2.txt

# Read the first NUM lines of the output file
$ hc large dfs -cat hdfs:///all_data/final/part-r-00000 | head -n NUM > output_ec2.txt
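
If you want to check how large the output is before copying anything, the standard HDFS du command should also work through the same hc wrapper (this is an optional extra, not one of the commands we provide):

# Show the size (in bytes) of the files in the output directory
$ hc large dfs -du hdfs:///all_data/final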

If you need to run another MapReduce job after one concludes, you must first clean the output directory on the Amazon servers. Before you clean the output directory, make sure you have retrieved all the data you need! Once you are sure you are done, run the following:

$ make proj2-hadoop-clean

Finally, once we've grabbed all the information/data we need and have run all the required jobs, we shut down the cluster. This destroys all data that we placed on the machines in the cluster. In addition, the URLs you used earlier to see the web GUI will cease to function. Thus, be sure that you grab all the data you need to finish your project before terminating. To terminate, run the following:

$ hadoop-ec2 terminate-cluster large 

To confirm that your cluster was successfully shut down, run hadoop-ec2 list large in a terminal. There should be no output if your cluster has terminated.

Please note that if you leave an instance running for an unreasonable amount of time (for example, overnight), you will lose points. We are able to track the usage of each account and you will need to give us a cost assessment in your final report.

Assignment

Part 2 (due 4/02/14 @ 23:59:59)

Before running your code on EC2, run it locally to be sure it is correct and decently efficient (you should have done this for part 1 anyway). Complete the following questions (and do the required runs on EC2) and place the answers in a plain-text file called proj2.txt. The contents of this file are what will be graded for this part of the project. Please read through all of the questions before starting to run anything on EC2, so that you know exactly what information you need to collect before terminating a cluster.

  1. Run your code on Connect 4 on a 5 by 5 board (WIDTH=5 HEIGHT=5 CONNECT=4) on clusters of size 6, 9, and 12. Be sure to set the appropriate number of reducers for each cluster size as indicated above. How long does each take? How many total mappers and reducers did you use for each? (Read the rest of the questions to see what other data you will need.)
  2. What was the mean processing rate (in MB/s) of your code for 6, 9, and 12 instances? You can approximate the total data size as (output size * 2), since you process all the game state data once each in PossibleMoves and SolveMoves. Output size is equal to the value given for HDFS_BYTES_READ on the job page for the mapper in FinalMoves. (A worked example with made-up numbers appears after this list.)
  3. What was the speedup for 9 and 12 instances relative to 6 instances? What do you conclude about how well Hadoop parallelizes your work? Is this a case of strong scaling or weak scaling? Why or why not?
  4. Do some research about the combiner in Hadoop. Can you add a combiner to any of your MapReduces? ("MapReduces" being any of the four jobs: PossibleMoves, SolveMoves, InitFirst, or FinalMoves.) If so, for each MapReduce phase you answered "yes" for, briefly discuss how you could add a combiner and what impact it would have on processing speed, if any. (NOTE: You DO NOT have to actually code anything here. Simply discuss/explain.)
  5. What was the price per GB processed for each cluster size? (Recall that an extra-large instance costs $0.68 per hour, rounded up to the nearest hour.)
  6. How many dollars in EC2 credits did you use to complete this project? You can try the ec2-usage command to find out; however, if it returns bogus values, please approximate (and indicate that you did so).
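
To make questions 2 and 5 concrete, here is a worked example with made-up numbers (illustrative only; your measurements will differ): suppose the FinalMoves mapper reports HDFS_BYTES_READ of 10,000 MB and the 6-worker run takes 2,000 seconds in total. The total data processed is then about 2 * 10,000 MB = 20,000 MB, so the mean processing rate is 20,000 MB / 2,000 s = 10 MB/s. For cost, 2,000 s rounds up to 1 hour, so the run costs 6 instances * 1 hour * $0.68 = $4.08, or roughly $4.08 / 19.5 GB, about $0.21 per GB. For question 4, it may help to know that Hadoop's Job API provides job.setCombinerClass(...) for exactly this purpose, though you only need to discuss it, not implement it.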

Submission

The full proj2-2 is due Wednesday (4/02/2014). To submit proj2-2, enter the following command. You should be turning in proj2.txt:

$ submit proj2-2

Grading

Part 1 is worth 2/3 of your Project 2 grade.

Part 2 is worth 1/3 of your Project 2 grade.

Troubleshooting

Stopping Running Hadoop Jobs

Often it is useful to kill a job. Stopping the Java program that launches the job is not enough; Hadoop on the cluster will continue running the job without it. The command to tell Hadoop to kill a job is:

$ hc large job -kill JOBID 

where JOBID is a string like "job_201101121010_0002" that you can get from the output of your console log or the web interface. You can also list jobs using:

$ hc large job -list

Proxy Problems

Seeing output like the following?

"12/34/56 12:34:56 INFO ipc.Client: Retrying connect to server: ec2-123-45-67-89.amazonaws....."

or

"Exception in thread "main" java.io.IOException: Call to ec2-123-45-67-89.
compute-1.amazonaws.com/123.45.67.89:8020 failed on local exception: java.
net.SocketException: Connection refused

<long stack trace>"

If you get this error from hc, try running

$ hadoop-ec2 proxy large

again. If you continue getting this error from hc after doing that, check that your cluster is still running using

$ hadoop-ec2 list large

and by making sure the web interface is accessible.

Deleting Configuration Files

If you've accidentally deleted one of the configuration files created by ec2-init.sh, you can recreate it by rerunning:

$ bash ec2-init.sh

Last Resort

It's okay to stop and restart a cluster if things are broken. But it wastes time and money.

Appendix: Miscellaneous EC2-related Commands

We don't think you'll need any of these, but just in case:

Terminating/Listing Instances Manually

You can get a raw list of all virtual machines you have running using

$ ec2-my-instances 

This will include the public DNS name (starts with "ec2-" and ends with "amazonaws.com") and the private name (starts with "ip-...") of each virtual machine you are running, as well as whether it is running or shutting down or recently terminated, its type, the SSH key associated with it (probably USERNAME-default) and the instance ID, which looks like "i-abcdef12". You can use this instance ID to manually shut down an individual machine:

$ ec2-terminate-instances i-abcdef12

Note that this command will not ask for confirmation. ec2-terminate-instances comes from the EC2 command line tools. ec2-my-instances is an alias for the command line tools' ec2-describe-instances command with options to only show your instances rather than all instances belonging to the class.

Listing/removing files from your cluster

# Listing files
$ hc large dfs -ls hdfs:///[DIRECTORY NAME HERE]

# Removing files/directories
$ hc large dfs -rmr hdfs:///[FILENAME HERE]

Note that you may be able to delete/view other people's files. Please do not abuse this. As usual, tampering with files belonging to other people is a student conduct violation.

Logging into your EC2 virtual machines

$ hadoop-ec2 login large
# or using a machine name listed by ec2-my-instances or hadoop-ec2 list
$ ssh-nocheck -i ~/USERNAME-default.pem root@ec2-....amazonaws.com

The cluster you start is composed of ordinary Linux virtual machines. The file ~/USERNAME-default.pem is the private part of an SSH keypair for the machines you have started.

Viewing/changing your AWS credentials

You can view your AWS access key + secret access key using:

$ ec2-util --list

If you somehow lose control of your AWS credentials, you can get new AWS access keys using:

$ ec2-util --rotate-secret 
$ new-ec2-certificate

This is likely to break any running instances you have.

Acknowledgements

Thanks to Scott Beamer (a former TA), the original creator of this project, and Alan Christopher and Ravi Punj for the "Troubleshooting" and "Appendix" sections.