Updates
- 03/20/2013 - Clarified meaning of input size in Assignment section.
- 03/19/2013 - Removing/Listing files section of Troubleshooting is now updated with new commands.
- 03/19/2013 - ec2-usage is known to provide incorrect values sometimes, so please don't be concerned if it reports ridiculously large costs/usage times.
Administrative
- PLEASE START EARLY. A limited number of clusters can be initialized at once under the cs61c master account. That means that if you procrastinate on the project right now, there's a good chance that you won't be able to start a cluster easily towards the end of the week, when traffic is heaviest.
- Please consult the troubleshooting guide at the bottom of this page if you run into issues. DO NOT post on Piazza until you have exhausted all of the relevant troubleshooting tips.
- You MUST work with a partner on this project (your partner does not have to be in your section). You may not work alone and you may not work in groups larger than two.
Summary
In this part of the project, we'll use the power of EC2 to run our massively parallel BFS on huge graphs. We'll be using 6, 9, or 12 machines at a time, each with the following specifications:
High-CPU Extra Large Instance:
- 7 GiB of memory
- 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each; according to Amazon, this is approximately equivalent to a machine with 20 early-2006 1.7 GHz Intel Xeon processors)
- 1.65 TiB of instance storage
- 64-bit platform
- I/O Performance: High
- API name: c1.xlarge
Thus, when we're working with 12 instances (machines), we're working under the following "constraints":
12 High-CPU Extra Large Instances:
- Total of 84 GiB of memory
- Total of 240 early-2006 1.7 GHz Intel Xeon processors
- Total of 19.8 TiB of instance storage
That's quite a bit of power!
Prepare Your Account
First, we need to take a few steps to prepare your account for launching jobs on EC2. Begin by running these commands (you should only need to do this once):
$ mkdir ~/proj2-2
$ cd ~/proj2-2
$ cp ~cs61c/proj/02_EC2/ec2-init.sh .
$ bash ec2-init.sh
$ source ~/ec2-environment.sh
Next, copy over your Makefile and SmallWorld.java from part 1 of the project into the current directory (proj2-2).
Now that we have a copy of SmallWorld.java, we need to modify it to use multiple reducers (Hadoop will scale the number of mappers automatically). To do this, we use the method job.setNumReduceTasks(int num). We will set the following number of reducers for each cluster size:
Cluster Size | Number of Reducers
-------------|-------------------
6 Machines   | 24
9 Machines   | 36
12 Machines  | 48
For the first two jobs (Loader and BFS), we'll want to do the following after the job object is created in main:
job.setNumReduceTasks(NUMBER FROM ABOVE TABLE);
Since we only want a single output file, we need to limit the last job (Histogram) to use only a single reducer. We do this by adding the following after the job object for the Histogram phase is created in main:
job.setNumReduceTasks(1);
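As a sanity check, the table above works out to exactly 4 reducers per machine for the Loader and BFS jobs. If you'd rather compute the argument than hard-code it each time you change cluster sizes, here is a minimal plain-Java sketch (the helper name `reducersFor` is our own, not part of the project skeleton):

```java
public class ReducerCount {
    // The table above assigns 4 reducers per machine for the Loader
    // and BFS jobs; the Histogram job always uses exactly 1 reducer
    // so that it produces a single output file.
    static final int REDUCERS_PER_MACHINE = 4;

    static int reducersFor(int clusterSize) {
        return clusterSize * REDUCERS_PER_MACHINE;
    }

    public static void main(String[] args) {
        for (int machines : new int[] {6, 9, 12}) {
            System.out.println(machines + " machines -> "
                + reducersFor(machines) + " reducers");
        }
    }
}
```

You would then pass `reducersFor(clusterSize)` to `job.setNumReduceTasks` for the Loader and BFS jobs, and the literal `1` for Histogram.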
Yay! Everything should now be ready to go for EC2. Don't forget to keep updating the argument of setNumReduceTasks as we try different cluster sizes. Also, don't forget to run make every time you change something in SmallWorld.java (the compiled sw.jar file is the file given to your cluster to execute).
Head over to the next section to learn how to actually run things on EC2.
How to Run on EC2
Here, we'll practice by running our code on EC2 with the graph ring4.seq. Please read this entire section carefully before performing any of it. Also, please do not start a cluster for the sole purpose of running this example. Even if you run it only for 30 seconds and then shut down the instances, the university is charged for a full hour.
First, we'll want to go ahead and launch our cluster. To launch a large cluster with N workers, we run the following command. Note that this may take a few minutes to complete.
$ hadoop-ec2 launch-cluster --auto-shutdown=230 large N
The execution of this command may take a while, but once it successfully completes, you'll see some URLs in the output. If you paste these into your browser, you'll be able to view job status and all kinds of cool stats, as long as the cluster is still running. You'll need some of this information to answer the required questions later on. If you ever forget what these URLs are, you can run hadoop-ec2 list large to view them again, along with a list of all of your running instances.
Now that the cluster is started, we can go ahead and run the following to connect to our cluster:
$ hadoop-ec2 proxy large
Once this completes successfully, we can go ahead and launch the job (in this case a run of SmallWorld on ring4.seq) on EC2.
$ hc large jar sw.jar SmallWorld s3n://cs61cSmallWorld/ring4.seq hdfs:///smallworld-out <DENOM>
After the job is complete (which you can check using the web GUI or the stream of output in terminal), you can fetch your output using:
$ hc large dfs -cat hdfs:///smallworld-out/part-r-00000 > ec2_output.txt
If everything ran correctly, your output should now be located in ec2_output.txt in your current working directory.
Finally, once we're all done grabbing all the information/data we need, we go ahead and shut down the cluster. This destroys all data that we placed on the machines in the cluster. In addition, the URLs you used earlier to see the web GUI will cease to function. Thus, be sure that you grab all the data you need to finish your project before terminating. To terminate, run the following:
$ hadoop-ec2 terminate-cluster large
To confirm that your cluster was successfully shut down, run hadoop-ec2 list large in the terminal. There should be no output if your cluster has successfully terminated.
Please note that if you leave an instance running for an unreasonable amount of time (for example, overnight), you will lose points. We are able to track the usage of each account and you will need to give us a cost assessment in your final report.
Finally, for reference, here are the graphs available to you on EC2:
Graphs on EC2 (s3n://cs61cSmallWorld/GRAPH_NAME_HERE):
- hollywood.sequ - Hollywood co-star database circa 2009 (1M actors)
- wikipedia.seq - Wikipedia links circa 2011 (5.7M articles). For fun: not part of the assignment. Do not try this until you have completed the rest of the project, and do not try this unless you have efficient code.
- ring4.seq - The same 4-vertex ring from part one.
Assignment
Part 2 (due 3/24/13 @ 23:59:59)
Before running your code on EC2, run it locally to be sure it is correct and decently
efficient (you should have done this for part 1 anyway). Complete the following
questions (and do the required runs on EC2) and place the answers in a plain-text file
called proj2.txt
. The contents of this file are primarily what will be graded for
this part of the project. Please read through all of the questions before starting to run
anything on EC2, so that you know exactly what information you need to collect before
terminating a cluster.
- Run your code on hollywood.sequ with denom=100000 on clusters of size 6, 9, and 12. Be sure to set the appropriate number of reducers for each cluster size as indicated in the setup section. How long does each take? How many searches did each perform? How many reducers did you use for each? (Read the rest of the questions to see what other data you will need.)
- For the Hollywood dataset, at what distance are the 50th, 90th, and 95th percentiles? (Choose the output of any one of your three runs to use as source data.)
- What was the mean processing rate (MB/s) for 6, 9, and 12 instances? You can approximate the data size to be (input size) * (# of searches). Input size is equal to the value given for S3N_BYTES_READ on the job page for your first mapper.
- What was the speedup for 9 and 12 instances relative to 6 instances? What do you conclude about how well Hadoop parallelizes your work? Is this a case of strong scaling or weak scaling? Why or why not?
- Do some research about the combiner in Hadoop. Can you add a combiner to any of your MapReduces? ("MapReduces" being Loader, BFS, and Histogram.) If so, for each MapReduce phase that you indicated "yes" for, briefly discuss how you could add a combiner and what impact it would have on processing speed, if any. (NOTE: You DO NOT have to actually code anything here. Simply discuss/explain.)
- What was the price per GB processed for each cluster size? (Recall that an extra-large instance costs $0.68 per hour, rounded up to the nearest hour.)
- How many dollars in EC2 credits did you use to complete this project? If ec2-usage returns bogus values, please try to approximate (and indicate this).
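To make the arithmetic behind the rate, speedup, and cost questions concrete, here is a minimal plain-Java sketch. The run times and byte counts in main are made-up placeholders, not real measurements; substitute your own numbers from the web GUI.

```java
public class ClusterStats {
    // Processing rate in MB/s, approximating total data processed as
    // (input size in bytes) * (number of searches).
    static double rateMBps(long inputBytes, int searches, double seconds) {
        return inputBytes * (double) searches / (1024.0 * 1024.0) / seconds;
    }

    // Speedup of a run relative to the 6-machine baseline time.
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    // Price per GB processed: each extra-large instance costs $0.68 per
    // hour, rounded UP to the nearest hour.
    static double pricePerGB(int machines, double seconds,
                             long inputBytes, int searches) {
        double hoursBilled = Math.ceil(seconds / 3600.0);
        double cost = machines * hoursBilled * 0.68;
        double gb = inputBytes * (double) searches
            / (1024.0 * 1024.0 * 1024.0);
        return cost / gb;
    }

    public static void main(String[] args) {
        // Placeholder numbers only -- substitute your own measurements.
        long inputBytes = 500_000_000L; // S3N_BYTES_READ of the first mapper
        int searches = 20;
        double t6 = 1800.0, t12 = 1000.0; // run times in seconds
        System.out.printf("6-machine rate: %.1f MB/s%n",
            rateMBps(inputBytes, searches, t6));
        System.out.printf("12 vs 6 speedup: %.2fx%n", speedup(t6, t12));
        System.out.printf("12-machine $/GB: %.3f%n",
            pricePerGB(12, t12, inputBytes, searches));
    }
}
```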
Submission
In order to submit part 2, type:
$ submit proj2-2
Only one partner should submit. Please be sure to indicate your partner's login when prompted.
Submit proj2.txt, your modified SmallWorld.java, the Makefile (if modified), and any additional source files your code needs. Be sure to put both your name and login and your partner's name and login at the top of proj2.txt.
Troubleshooting
Stopping Running Hadoop Jobs
Often it is useful to kill a job. Stopping the Java program that launches the job is not enough; Hadoop on the cluster will continue running the job without it. The command to tell Hadoop to kill a job is:
$ hc large job -kill JOBID
where JOBID is a string like "job_201101121010_0002" that you can get from the output of your console log or the web interface. You can also list jobs using:
$ hc large job -list
Proxy Problems
Seeing output like the following?
"12/34/56 12:34:56 INFO ipc.Client: Retrying connect to server: ec2-123-45-67-89.amazonaws....."
or
"Exception in thread "main" java.io.IOException: Call to ec2-123-45-67-89.compute-1.amazonaws.com/123.45.67.89:8020 failed on local exception: java.net.SocketException: Connection refused <long stack trace>"
If you get this error from hc, try running
$ hadoop-ec2 proxy large
again. If you continue getting this error from hc after doing that, check that your cluster is still running using
$ hadoop-ec2 list large
and by making sure the web interface is accessible.
Deleting Configuration Files
If you've accidentally deleted one of the configuration files created by ec2-init.sh, you can recreate it by rerunning:
$ bash ec2-init.sh
Last Resort
It's okay to stop and restart a cluster if things are broken, but doing so wastes time and money.
Appendix: Miscellaneous EC2-related Commands
We don't think you'll need any of these, but just in case:
Terminating/Listing Instances Manually
You can get a raw list of all virtual machines you have running using
$ ec2-my-instances
This will include the public DNS name (starts with "ec2-" and ends with "amazonaws.com") and the private name (starts with "ip-...") of each virtual machine you are running, as well as whether it is running or shutting down or recently terminated, its type, the SSH key associated with it (probably USERNAME-default) and the instance ID, which looks like "i-abcdef12". You can use this instance ID to manually shut down an individual machine:
$ ec2-terminate-instances i-abcdef12
Note that this command will not ask for confirmation. ec2-terminate-instances comes from the EC2 command line tools. ec2-my-instances is an alias for the command line tools' ec2-describe-instances command with options to only show your instances rather than all instances belonging to the class.
Listing/removing files from your cluster
LISTING FILES:
$ hc large dfs -ls hdfs:///[DIRECTORY NAME HERE]
REMOVING FILES/DIRECTORIES:
$ hc large dfs -rmr hdfs:///[FILENAME HERE]
Note that you may be able to delete/view other people's files. Please do not abuse this. As usual, tampering with files belonging to other people is a student conduct violation.
Logging into your EC2 virtual machines
$ hadoop-ec2 login large
# or, using a machine name listed by ec2-my-instances or hadoop-ec2 list:
$ ssh-nocheck -i ~/USERNAME-default.pem root@ec2-....amazonaws.com
The cluster you start is composed of ordinary Linux virtual machines. The file ~/USERNAME-default.pem is the private part of an SSH keypair for the machines you have started.
Viewing/changing your AWS credentials
You can view your AWS access key + secret access key using:
$ ec2-util --list
If you somehow lose control of your AWS credentials, you can get new AWS access keys using:
$ ec2-util --rotate-secret
$ new-ec2-certificate
This is likely to break any running instances you have.
Acknowledgements
Thanks to Scott Beamer (a former TA), the original creator of this project, and Alan Christopher and Ravi Punj for the "Troubleshooting" and "Appendix" sections.
Additional Links (NOT required reading) on Social Network Analysis
- Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a Social Network or a News Media?. WWW 2010. (analysis of Twitter and source of our Twitter data)
- Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature:393, 1998. (mathematics behind small-world networks)
- Stanford Network Analysis Project (a great source of social network data and the source of cit-HepPh)
- Laboratory for Web Algorithmics (source of the Hollywood data and has current metrics on largest social networks)
- Wikipedia Crawl