Updates
- 03/20/2014: Fixed the name of the field to look for in question 2 from S3N_BYTES_READ to HDFS_BYTES_READ
- 03/19/2014 : Released Project 2-2
Administrative
- PLEASE START EARLY. A limited number of clusters can be initialized at once under the cs61c master account. That means that if you procrastinate on the project right now, there's a good chance that you won't be able to start a cluster easily as we get closer to the deadline, when traffic is heaviest.
- Please consult the troubleshooting guide at the bottom of this page if you run into issues. DO NOT post on Piazza until you have exhausted all of the relevant troubleshooting tips.
- It is suggested that you work with one partner for this project. Keep in mind that this must be a different person from the one you worked with on project 1. You may share code with your partner, but ONLY your partner. The submit command will ask you for your partner's name, so only one partner needs to submit (and please have only one partner submit).
Summary
In this part of the project, we'll use the power of EC2 to solve our exponentially large game trees for Connect N. We'll be using 6, 9, or 12 machines at a time, each with the following specifications:
High-CPU Extra Large Instance:
- 7 GiB of memory
- 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each; according to Amazon, this is approximately equivalent to a machine with 20 early-2006 1.7 GHz Intel Xeon processors)
- 1.65 TiB of instance storage
- 64-bit platform
- I/O Performance: High
- API name: c1.xlarge
Thus, when we're working with 12 instances (machines), we're working under the following "constraints":
12 High-CPU Extra Large Instances:
- Total of 84 GiB of memory
- Total of 240 early-2006 1.7 GHz Intel Xeon processors
- Total of 19.8 TiB of instance storage
That's quite a bit of power!
Prepare Your Account
First, we need to take a few steps to prepare your account for launching jobs on EC2. Begin by running these commands (you should only need to do this once):
$ cd ~/proj2
$ bash ec2-init.sh
$ source ~/ec2-environment.sh
Next, copy over our updated Proj2.java and Makefile, overwriting the old versions. Be careful not to accidentally overwrite the code you wrote for part 1! You may want to create a backup copy just to be safe.
$ cp ~cs61c/proj/02-2/Proj2.java ~/proj2
$ cp ~cs61c/proj/02-2/Makefile ~/proj2
In order to take advantage of the many machines we will be using, we need to tell Hadoop to scale up the number of reducers it uses (the number of mappers scales automatically with the reducers). To do so, the updated version of Proj2.java adds the following line for each MapReduce job:
job.setNumReduceTasks([NUMBER]);
For InitFirst.java and FinalMoves.java, we always set the number of reducers to one, since they are not compute-intensive tasks and we would like only one final output file. For PossibleMoves and SolveMoves, however, we will change the number for different cluster sizes. To make the change easy, an instance variable called NUMBER_OF_REDUCERS has been added; it is used when setting the number of reducers for PossibleMoves and SolveMoves.
Cluster Size | Number of Reducers
-------------|-------------------
6 Workers    | 24
9 Workers    | 36
12 Workers   | 48
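The mapping in the table above is linear: four reducers per worker. As an illustrative sketch only (this class and method are hypothetical, not part of Proj2.java), the table could be computed rather than hard-coded:

```java
// Hypothetical helper, NOT part of the project code: the table above
// assigns four reducers per worker, so 6 -> 24, 9 -> 36, 12 -> 48.
public class ReducerCount {
    static final int REDUCERS_PER_WORKER = 4;

    static int reducersFor(int workers) {
        return workers * REDUCERS_PER_WORKER;
    }

    public static void main(String[] args) {
        System.out.println(reducersFor(12)); // prints 48
    }
}
```

In the actual assignment you simply edit the NUMBER_OF_REDUCERS instance variable by hand before each run.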
Yay! Everything should now be ready to go for EC2. Don't forget to keep updating the value of NUMBER_OF_REDUCERS as we try different cluster sizes. Also, don't forget to run make every time you change something in Proj2.java (the compiled proj2.jar file is the file given to your cluster to execute). Head over to the next section to learn how to actually run things on EC2.
How to Run on EC2
Here, we'll practice by running our code on EC2 for Connect 3 on a 3 by 3 board. Please read this entire section carefully before performing any of it. Also, please do not start a cluster for the sole purpose of running this example. Even if you run it only for 30 seconds and then shut down the instances, the University is charged for a full hour.
First, we'll want to go ahead and launch our cluster. To launch a large cluster with N workers, run the following command. Note that this may take a few minutes to complete.
$ hadoop-ec2 launch-cluster --auto-shutdown=230 large N
The execution of this command may take a while, but once it successfully completes, you'll see some URLs in the output. If you paste these into your browser, you'll be able to view job status and all kinds of cool stats, as long as the cluster is still running. You'll need some of this information to answer the required questions later on. If you ever forget what these URLs are, you can run hadoop-ec2 list large to view them again, along with a list of all of your running instances.
Now that the cluster is started, we can go ahead and run the following to connect to our cluster:
$ hadoop-ec2 proxy large
Once this completes successfully, we can go ahead and launch the job (in this case, solving Connect 3 for a 3 by 3 board) on EC2. We've made this really simple by putting the appropriate commands in the updated Makefile.
$ make proj2-hadoop WIDTH=3 HEIGHT=3 CONNECT=3
After the job is complete (which you can check using the web GUI or the stream of output in terminal), you can fetch your output using:
$ make proj2-hadoop-output
If everything ran correctly, your output should now be located in ec2_output.txt
in your current working directory.
For the larger board sizes, this command will not be feasible, since the output file will be gigabytes in size. Specifically, do NOT copy the output from running your code on Connect 4 on a 5 by 5 board. In that case, you can instead copy over selected parts of the output using one of the following commands:
# Copy the last kilobyte of the output file
$ hc large dfs -tail hdfs:///all_data/final/part-r-00000 > output_ec2.txt

# Read the first NUM lines of the output file
$ hc large dfs -cat hdfs:///all_data/final/part-r-00000 | head -n NUM > output_ec2.txt
If you need to run another MapReduce job after one concludes, you need to clean the output directory on the Amazon servers. Before you clean the output directory, make sure you retrieve all the data you need! Once you are sure you are done, run the following:
$ make proj2-hadoop-clean
Finally, once we're done grabbing all the information/data we need and have run all the necessary jobs, we go ahead and shut down the cluster. This destroys all data that we placed on the machines in the cluster. In addition, the URLs you used earlier to see the web GUI will cease to function. Thus, be sure that you grab all the data you need to finish your project before terminating. In order to terminate, run the following:
$ hadoop-ec2 terminate-cluster large
To confirm that your cluster was successfully shut down, run hadoop-ec2 list large in a terminal. There should be no output if your cluster has successfully terminated.
Please note that if you leave an instance running for an unreasonable amount of time (for example, overnight), you will lose points. We are able to track the usage of each account and you will need to give us a cost assessment in your final report.
Assignment
Part 2 (due 4/02/14 @ 23:59:59)
Before running your code on EC2, run it locally to be sure it is correct and decently efficient (you should have done this for part 1 anyway). Complete the following questions (and do the required runs on EC2) and place the answers in a plain-text file called proj2.txt. The contents of this file are what will be graded for this part of the project. Please read through all of the questions before starting to run anything on EC2, so that you know exactly what information you need to collect before terminating a cluster.
- Run your code on Connect 4 on a 5 by 5 board (WIDTH=5 HEIGHT=5 CONNECT=4) on clusters of size 6, 9, and 12. Be sure to set the appropriate number of reducers for each cluster size as indicated above. How long does each take? How many total mappers and reducers did you use for each? (Read the rest of the questions to see what other data you will need.)
- What was the mean processing rate (in MB/s) of your code for 6, 9, and 12 instances? You can approximate the total data size to be (output size * 2), since you process all the game state data once each in PossibleMoves and SolveMoves. Output size is equal to the value given for HDFS_BYTES_READ on the job page for the mapper in FinalMoves.
- What was the speedup for 9 and 12 instances relative to 6 instances? What do you conclude about how well Hadoop parallelizes your work? Is this a case of strong scaling or weak scaling? Why or why not?
- Do some research about the combiner in Hadoop. Can you add a combiner to any of your MapReduces? ("MapReduces" being any of the four jobs: PossibleMoves, SolveMoves, InitFirst, or FinalMoves.) If so, for each MapReduce phase that you indicated "yes" for, briefly discuss how you could add a combiner and what impact it would have on processing speed, if any. (NOTE: You DO NOT have to actually code anything here. Simply discuss/explain.)
- What was the price per GB processed for each cluster size? (Recall that an extra-large instance costs $0.68 per hour, rounded up to the nearest hour.)
- How many dollars in EC2 credits did you use to complete this project? You can try to use the ec2-usage command to do this; however, if it returns bogus values, please approximate (and indicate that you did so).
Submission
The full proj2-2 is due Wednesday (4/02/2014). To submit proj2-2, enter the following. You should be turning in proj2.txt.
$ submit proj2-2
Troubleshooting
Stopping Running Hadoop Jobs
Often it is useful to kill a job. Stopping the Java program that launches the job is not enough; Hadoop on the cluster will continue running the job without it. The command to tell Hadoop to kill a job is:
$ hc large job -kill JOBID
where JOBID is a string like "job_201101121010_0002" that you can get from the output of your console log or the web interface. You can also list jobs using:
$ hc large job -list
Proxy Problems
Seeing output like the following?
"12/34/56 12:34:56 INFO ipc.Client: Retrying connect to server: ec2-123-45-67-89.amazonaws....."
or
"Exception in thread "main" java.io.IOException: Call to ec2-123-45-67-89. compute-1.amazonaws.com/123.45.67.89:8020 failed on local exception: java. net.SocketException: Connection refused <long stack trace>"
If you get this error from hc, try running
$ hadoop-ec2 proxy large
again. If you continue getting this error from hc after doing that, check that your cluster is still running using
$ hadoop-ec2 list large
and by making sure the web interface is accessible.
Deleting Configuration Files
If you've accidentally deleted one of the configuration files created by ec2-init.sh, you can recreate it by rerunning:
$ bash ec2-init.sh
Last Resort
It's okay to stop and restart a cluster if things are broken. But it wastes time and money.
Appendix: Miscellaneous EC2-related Commands
We don't think you'll need any of these, but just in case:
Terminating/Listing Instances Manually
You can get a raw list of all virtual machines you have running using
$ ec2-my-instances
This will include the public DNS name (starts with "ec2-" and ends with "amazonaws.com") and the private name (starts with "ip-...") of each virtual machine you are running, as well as whether it is running or shutting down or recently terminated, its type, the SSH key associated with it (probably USERNAME-default) and the instance ID, which looks like "i-abcdef12". You can use this instance ID to manually shut down an individual machine:
$ ec2-terminate-instances i-abcdef12
Note that this command will not ask for confirmation. ec2-terminate-instances comes from the EC2 command line tools. ec2-my-instances is an alias for the command line tools' ec2-describe-instances command with options to only show your instances rather than all instances belonging to the class.
Listing/removing files from your cluster
LISTING FILES:
$ hc large dfs -ls hdfs:///[DIRECTORY NAME HERE]

REMOVING FILES/DIRECTORIES:
$ hc large dfs -rmr hdfs:///[FILENAME HERE]
Note that you may be able to delete/view other people's files. Please do not abuse this. As usual, tampering with files belonging to other people is a student conduct violation.
Logging into your EC2 virtual machines
$ hadoop-ec2 login large
# or, using a machine name listed by ec2-my-instances or hadoop-ec2 list:
$ ssh-nocheck -i ~/USERNAME-default.pem root@ec2-....amazonaws.com
The cluster you start is composed of ordinary Linux virtual machines. The file ~/USERNAME-default.pem is the private part of an SSH keypair for the machines you have started.
Viewing/changing your AWS credentials
You can view your AWS access key + secret access key using:
$ ec2-util --list
If you somehow lose control of your AWS credentials, you can get new AWS access keys using:
$ ec2-util --rotate-secret $ new-ec2-certificate
This is likely to break any running instances you have.
Acknowledgements
Thanks to Scott Beamer (a former TA), the original creator of this project, and Alan Christopher and Ravi Punj for the "Troubleshooting" and "Appendix" sections.