CS61C Fall 2010 — Lab 3b: EC2

Introduction: EC2 and Hadoop

EC2

Amazon’s Elastic Compute Cloud (EC2) is one of the most popular “cloud computing” services. EC2 allows users to rent virtual machines on an hourly basis. These virtual machines are hosted on physical machines in Amazon’s datacenters. Because of the economies of scale involved in datacenters and the higher utilization achieved through more sharing of machines, Amazon can rent the virtual machines at a fairly low cost and still make a profit.

 

EC2 offers several types of virtual machines, called “instances,” two of which you will use in this lab. A small instance (“m1.small”) has the equivalent of roughly a 1 GHz CPU core, 2 GB of RAM, a single hard drive, and a 400 Megabit network connection. A high-CPU extra large instance (“c1.xlarge”) has the equivalent of eight 3 GHz cores, 8 GB of RAM, four hard drives, and a Gigabit network connection. Amazon sells these instances by the hour, rounded up to the next hour: $0.085 per hour for a small instance and $0.68 per hour for a high-CPU extra large instance. The fee per virtual machine hour is the same no matter what the virtual machine is doing; however, there are additional charges for data transfer out of the datacenter. Amazon charges $0.15 per gigabyte transferred out of EC2. (Currently, Amazon does not charge for data transfer into EC2.)
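For example, a cluster of five high-CPU extra large instances that runs for 50 minutes is billed for five instance-hours (each instance’s 50 minutes rounds up to one hour), or 5 × $0.68 = $3.40, plus $0.15 per gigabyte for any data transferred out of the datacenter.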

 

Along with EC2, Amazon offers related services. In this lab, we will also be using Amazon’s Simple Storage Service (S3). S3 offers long-term storage at a rate of about 15 cents per gigabyte-month, billed by the hour. S3 charges the same data transfer rates as EC2; however, data transfer between S3 and EC2 (in either direction) is free.

EC2 Tools

Most new users of EC2 and S3 access them through Amazon’s web interface. Because this class shares a single account, we cannot usefully give you access to that web interface; instead, you will access EC2 and S3 through a variety of command-line tools. Most of these tools were not created for this class; they were developed for “real” users of EC2 and S3. We will give you credentials that allow you to use these and other third-party EC2 and S3 tools with our account. For some functionality, such as starting virtual machines, we have developed custom wrappers so we can track students’ EC2 usage accurately.

Hadoop

The Apache Foundation’s Hadoop project is the most popular open-source implementation of MapReduce. Google has never released its original implementation of MapReduce or much of the infrastructure it depends on (for example, the Google File System), so Hadoop provides an implementation of both. In this lab, we will be deploying Hadoop’s Distributed File System (HDFS) and Hadoop’s MapReduce implementation across a cluster of virtual machines on EC2.

 

HDFS divides files up into fixed-sized chunks and distributes multiple copies of those chunks across all the machines in the cluster. This gives Hadoop’s MapReduce implementation the opportunity, when using one of those files as input, to avoid reading the file over the network and, when outputting files, to spread the load evenly over the cluster. Hadoop’s MapReduce implementation can also take input from other sources or send output to other destinations, for example, S3. (Of course, accessing files not stored on the cluster will be slower than accessing files stored in HDFS.)
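If you want to explore HDFS by hand, Hadoop ships a filesystem shell; the commands below are just an illustration (on the EC2 clusters in this lab you will reach HDFS through the 'hc' wrapper described later, and the paths here are made up):

        hadoop dfs -ls /                           # list the root of the distributed filesystem
        hadoop dfs -put local.txt /data/input.txt  # copy a local file into HDFS (chunked and replicated)
        hadoop dfs -cat /data/input.txt | head     # stream a file back out of HDFS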

Setup

Copy the lab materials to your home directory:

 

cp -R ~cs61c/labs/03b ./lab3b

 

Further commands will assume you are working from the “lab3b” directory you copied.

 

We are assuming that you are using the lab machines (possibly remotely). This lab is unlikely to work on other machines.

Exercise 0. Setup AWS Access

From your class account, run the commands

 

        ec2-util --init

        new-ec2-certificate

        . ~/ec2-environment.sh

 

These commands will allocate you an AWS access key and secret access key (two short, random-looking strings) and an SSH private key (a file called username-default.pem in your home directory), and they will set up several configuration files for the command-line utilities you will use in this lab to access EC2 and S3. See Note #1 for more information regarding your AWS keys.

 

If you find that you are having trouble with your AWS secret key, see Note #2 below!

 

For more interesting information about command-line utilities, see Note #3.

 

Accounting for your EC2 Usage

We have provided a command ec2-usage which will estimate the cost of your EC2 usage.

 

Note that this is an estimate and does not account for networking or storage charges. The estimate of instance-hours may also be off because the tool does not observe instances’ exact finishing times, so it may not notice when an instance terminates before crossing the next hour mark.

Exercise 1. Testing Locally

The lab directory contains a MapReduce implementation of word count. Type make to build the programs.

 

The program files in the directory are:

  1. Makefile — compilation instructions used by ‘make’.
  2. common.c, common.h — headers and utility functions for reading and writing keys and values. By default, these expect input in the form “key\tvalue\n”, but they can also be used with a lower-overhead “typedbytes” format.
  3. wordcount-mapper (source: wordcount-mapper.c) — outputs (word, 1) for each word in the input keys and values. Takes an optional argument giving how many fields to skip (0 = count words in both the key and the value, 1 = skip the key, 2 = skip the key and everything before the first tab in the value, etc.).
  4. wordcount-reducer (source: wordcount-reducer.c)
     a. Takes in a sorted list of keys and values like

        (bar, 1)
        (bar, 4)
        (foo, 5)
        (foo, 7)
        (foo, 3)
        (quux, 1)

        and outputs a list like

        (bar, 5)
        (foo, 15)
        (quux, 1)

     b. Numbers are represented with leading zeros in fixed width, so they can be sorted as text.
     c. Takes an optional argument “flip”. If supplied, outputs (count, word) instead of (word, count). (A quick sanity check of the mapper and reducer together appears after this list.)
  5. wordcount-mapper-linux-x86, wordcount-reducer-linux-x86 — 32-bit Intel x86 Linux versions of wordcount-mapper and wordcount-reducer. You will use these to run this program on EC2, where your virtual machines will run 32-bit x86 Linux.
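To get a feel for the key/value text format these programs use, you can run them by hand on a tiny input. This is only a sanity check; the exact output formatting (in particular the zero-padded counts described above) may look slightly different from what you expect.

        printf 'the quick brown fox jumps over the lazy dog the end\n' \
            | ./wordcount-mapper | sort | ./wordcount-reducer
        # Each distinct word should appear once, tab-separated from its
        # (zero-padded) count; "the" should show a count of 3.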

 

Now run

 

./wordcount-mapper <~cs61c/data/shaks12.txt | sort | ./wordcount-reducer flip >out.txt

 

This runs wordcount-mapper with ~cs61c/data/shaks12.txt as input, sorts the output (using the Unix sort utility), and then runs wordcount-reducer on the sorted output, saving the result in out.txt. You can use the Unix sort utility on out.txt to see the most common word in our original corpus, which is Project Gutenberg’s version of the complete works of Shakespeare.
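For example, because wordcount-reducer was given “flip”, the counts in out.txt are the keys and are zero-padded, so a plain text sort puts the most frequent words at the end:

        sort out.txt | tail -20     # the 20 most common words appear last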

 

To parallelize this task across multiple machines, we will be running this in Hadoop Streaming, an interface to the Hadoop MapReduce implementation which allows the mapper, reducer, and combiner programs to be ordinary executables. Before using EC2 credits, you should test the program on Hadoop in local mode.

 

Test the same sequence with hadoop using

 

hadoop jar ~cs61c/hadoop/streaming.jar \
    -mapper ./wordcount-mapper \
    -reducer './wordcount-reducer flip' \
    -combiner './wordcount-reducer' \
    -input ~cs61c/data/shaks12.txt \
    -output hadoop-local-out

 

(The "combiner" program takes reducer input and produces new, ideally smaller reducer input. Hadoop runs combiners to reduce the volume of data saved to disks or sent over the network.)

 

This will create a directory called 'hadoop-local-out' containing a file named 'part-00000'. Hadoop MapReduce creates output directories like this with one file for each reducer. (In local mode, Hadoop only supports one reducer.) The output collected in part-00000 should be the same as that collected in 'out.txt'.
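One quick way to check that the two runs agree (the line ordering may differ, so sort both outputs before comparing):

        sort out.txt > /tmp/manual-sorted.txt
        sort hadoop-local-out/part-00000 > /tmp/hadoop-sorted.txt
        diff /tmp/manual-sorted.txt /tmp/hadoop-sorted.txt && echo "outputs match"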

 

Question 1, Part 1:

What were the most common words in our corpus?

 

Question 1, Part 2:

The input file is around 5.3 MiB. Use the shell's "time" command ("time ./wordcount-mapper ...") to estimate the throughput of the manual pipeline and of Hadoop local mode. Estimate the startup overhead of Hadoop Streaming by using the empty input file "~cs61c/data/empty.txt" instead of "~cs61c/data/shaks12.txt". What did you find out about the overhead associated with Hadoop?
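One possible way to collect these timings (Hadoop refuses to overwrite an existing output directory, so remove it between runs; substitute the empty input file to measure startup overhead):

        # Manual pipeline; the parentheses group the whole pipeline under 'time'.
        time ( ./wordcount-mapper < ~cs61c/data/shaks12.txt | sort | ./wordcount-reducer flip > out.txt )

        # Hadoop local mode.
        rm -rf hadoop-local-out
        time hadoop jar ~cs61c/hadoop/streaming.jar \
            -mapper ./wordcount-mapper \
            -reducer './wordcount-reducer flip' \
            -combiner './wordcount-reducer' \
            -input ~cs61c/data/shaks12.txt \
            -output hadoop-local-out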

 

Exercise 2a. Testing Remotely

Running make should have produced Linux versions of the above programs, called wordcount-mapper-linux-x86 and wordcount-reducer-linux-x86.

 

First, let's upload these files to S3 using the command-line program 's3cmd'. Create a new bucket on S3:

 

        s3cmd mb s3://YOUR-USERNAME

 

Then upload the binaries into that bucket:

 

        s3cmd put wordcount-mapper-linux-x86 wordcount-reducer-linux-x86 s3://YOUR-USERNAME/

 

Now edit wordcount.sh and replace each instance of “BUCKET” with the name of the S3 bucket you just created. As you might guess, a URL like “s3n://BUCKET/name” specifies the file “/name” stored in “BUCKET” on S3 (note that Hadoop expects 's3n://', not 's3://' as s3cmd does). The “-files” option instructs Hadoop to make a file available to every map and reduce task that is run. “-D mapred.reduce.tasks=2” specifies that 2 reducers should be run (the default is 1).
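For reference, the command in wordcount.sh should look roughly like the sketch below. This is only a hypothetical reconstruction: the streaming jar path and the input path are assumptions, and you should edit and run the provided wordcount.sh rather than typing this in.

        # Sketch only -- paths marked as assumptions; use the provided wordcount.sh.
        hadoop jar /path/to/hadoop-streaming.jar \
            -D mapred.reduce.tasks=2 \
            -files s3n://BUCKET/wordcount-mapper-linux-x86,s3n://BUCKET/wordcount-reducer-linux-x86 \
            -mapper ./wordcount-mapper-linux-x86 \
            -reducer './wordcount-reducer-linux-x86 flip' \
            -input s3n://SOME-INPUT \
            -output s3n://BUCKET/wordcount-out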

 

After editing wordcount.sh, start a MapReduce cluster on EC2 using:

 

        hadoop-ec2 launch-cluster --auto-shutdown=115 YOUR-USERNAME-testing 1 nn,snn,dn,jt,tt

 

For details regarding this command and the associated options, see Note #4.

 

It will take some time (possibly minutes) for the virtual machine to boot. Note the URL that is output.

 

Now, copy wordcount.sh to your instance with

 

        hadoop-ec2 push YOUR-USERNAME-testing wordcount.sh

 

And run it with

 

        hadoop-ec2 exec YOUR-USERNAME-testing bash wordcount.sh

 

When complete, this job will create a directory in your S3 bucket called “wordcount-out” containing two files (“part-00000” and “part-00001”).

 

List these files with

 

        s3cmd ls s3://YOUR-USERNAME/wordcount-out/

 

and download them with

 

        s3cmd get s3://YOUR-USERNAME/wordcount-out/part-00000

 

These files are the output of each reducer. Though each reducer received its keys in sorted order, the keys (the words) were split randomly (by a hash function) across the reducers, so each file will contain the complete word counts for approximately half the words.
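If you want to eyeball the combined result, download both part files and sort them together; since this output has the (zero-padded) counts as keys, a plain text sort puts the most common words at the end:

        s3cmd get s3://YOUR-USERNAME/wordcount-out/part-00001
        sort part-00000 part-00001 | tail -20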

 

We’d probably be more interested in sorted output so we can see the most frequently occurring words. Hadoop provides a preconstructed sorting MapReduce job that assigns each reducer a contiguous range of the keyspace. To divide the keys approximately evenly across the reducers, the first step of this sorting job takes a sample of the input files and chooses where to divide the keyspace. With two reducers on your word count output (which has number of occurrences as keys), counts 0 through 2 will be assigned to the first reducer and the remaining keys to the second reducer.

 

sort.sh contains a template to run this sort job. Edit it to use your S3 bucket and run it as before. Collect its output as before. (Note that our script first copies the output from S3 into the cluster's distributed filesystem (HDFS) because the sort job is not written to handle input on S3.)

 

You can monitor your Hadoop cluster through the URL that was output by your launch-cluster command. For security, access to this URL is restricted to the machine you ran hadoop-ec2 on. If you have misplaced this URL, look for the hostname (starting with "ec2-") in the output of 'ec2din | grep YOUR-USERNAME' and go to "http://ec2-...:50030/".

 

If you are doing the lab remotely, see Note #6 below!!

 

Question 2:

        

   From the web interface, how long did each of the MapReduce map and reduce tasks take? What was the size of their inputs and outputs?

 

After you have gathered your answer to the previous question, shut down this 1-node cluster by running:

 

        hadoop-ec2 terminate-cluster YOUR-USERNAME-testing

Exercise 3. Scaling Up

Now we’d like to run this program over a much larger dataset. We will use wordcount-big.sh. Modify this file to use your S3 bucket (to get the wordcount-mapper and wordcount-reducer programs) as before. This file specifies a sequence of two jobs; the first is the word count job, set to output to the Hadoop Distributed File System (HDFS). HDFS storage is local to the virtual machines of the cluster you will start, and is thus faster (and cheaper) than S3. The second job reads that output, sorts it, and stores the result back in HDFS. (Since the output is quite large and we are only interested in a small part of it, we don't bother writing it all to S3.)

 

The input to this set of jobs is in s3n://cs61c-data/enwiki-pages/*, a copy of the article text from the English-language Wikipedia as of January 2010. This corpus is about 20 GB and contains about 3 billion words of text.
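Conceptually, wordcount-big.sh chains two jobs along these lines. This is an outline only, not the literal script; the streaming jar path, the intermediate HDFS path, and the use of a combiner are assumptions, so edit the provided wordcount-big.sh rather than typing this in.

        # Outline only; see the provided wordcount-big.sh for the real commands.
        # Job 1: word count over the Wikipedia text on S3, writing counts into HDFS.
        hadoop jar /path/to/hadoop-streaming.jar \
            -files s3n://BUCKET/wordcount-mapper-linux-x86,s3n://BUCKET/wordcount-reducer-linux-x86 \
            -mapper ./wordcount-mapper-linux-x86 \
            -reducer './wordcount-reducer-linux-x86 flip' \
            -combiner ./wordcount-reducer-linux-x86 \
            -input 's3n://cs61c-data/enwiki-pages/*' \
            -output hdfs:///user/root/wikipedia-wordcount-out
        # Job 2: Hadoop's preconstructed sort job reads that HDFS output and writes
        # the sorted result to hdfs:///user/root/wikipedia-wordcount-sorted-out.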

 

We will use 5 high-CPU extra large instances for our cluster:

 

        hadoop-ec2 launch-cluster  --auto-shutdown=60 YOUR-USERNAME-large 1 nn,snn,dn,jt,tt 4 dn,tt

 

Copy wordcount-big.sh to that cluster and run it as before. With this much compute power, the job should complete in about 20 minutes. Observe the jobs through the web interface as before.

 

Question 3, Part 1:

While your code is running on the cluster, consider this problem:

 

“Determine which Wikipedia articles can't be reached by clicking links in articles up to 10 times starting with the article "2009".”

 

Outline a way to solve this problem using MapReduce over the Wikipedia text. (Hint: You may use multiple MapReduce jobs.) Discuss your approach with the TA.

 

To download files from the Hadoop cluster securely, we need to set up an encrypted network tunnel from your local machine. Start it with

 

eval `hadoop-ec2 proxy YOUR-USERNAME-large`

 

Now, look at hdfs:///user/root/wikipedia-wordcount-sorted-out using the 'hc' command we have provided:

 

hc YOUR-USERNAME-large dfs -ls hdfs:///user/root/wikipedia-wordcount-sorted-out

 

And download the last reducer's output (because this output file is large, download it to /tmp or similar, not your home directory) using:

 

        hc YOUR-USERNAME-large dfs -copyToLocal \
            hdfs:///user/root/wikipedia-wordcount-sorted-out/part-00031 /tmp/YOUR-NAME-out

 

The hc command runs the same local 'hadoop' program we used to test locally, but configured to use your Hadoop cluster through the proxy we started.

 

Now stop the tunnel you started earlier with:

 

        kill $HADOOP_CLOUD_PROXY_PID

 

Question 3, Part 2:

        What was the throughput of this large word count + sort? What's the ratio of the number of times "a" appears in the English Wikipedia to the number of times "an" appears?

 

After you're done, shut down your cluster with:

 

        hadoop-ec2 terminate-cluster YOUR-USERNAME-large

 

And remove the output file you downloaded.

 

Question 3, Part 3:

        Estimate the total cost of your Amazon Web Services (EC2 + S3) usage today.

Before You Go (even if you don’t finish)

Cleanup (save money!):

  1. Terminate all virtual machines:
     a. Check whether you have any clusters running with hadoop-ec2 list.
     b. Run ec2-usage or ec2-describe-instances | grep YOUR-USERNAME and see if it reports that you have any instances running.
  2. Delete files from S3:
     a. Run 's3cmd del 's3://YOUR-USERNAME/**/*' 's3://YOUR-USERNAME/*'' to delete all files in your S3 bucket.
     b. Then run 's3cmd rb s3://YOUR-USERNAME/' to remove your S3 bucket.

 

Notes        

  1. Your access key and secret access key permit you to (among other things):
     a. monitor all virtual machines under the course's EC2 (Elastic Compute Cloud) account
     b. shut down any virtual machine under the course's EC2 account
     c. read and delete any files under the course's S3 (Simple Storage Service) account
     d. upload files to, and create, S3 buckets whose names start with your student account username under our S3 account
  2. If you lose control of your AWS secret access key, you can run ec2-util --rotate-secret --init to generate a new access key and secret key pair and invalidate your previous access key and secret key.

     You can also use ec2-util to create a new SSH private key (see ec2-util --help).

     The command new-ec2-certificate generates an X.509 certificate and private key (~/.aws-cert-private.pem and ~/.aws-cert-public.pem), which are used instead of the access key and secret key by some AWS tools. Rerunning new-ec2-certificate will invalidate your previous X.509 certificate and generate a new one.
  3. We have provided a utility ec2-run to start virtual machines under our EC2 account that are associated with one of your SSH private keys. (Associating each instance with an SSH key allows us to determine approximately how much of our AWS credit each student used. To ensure this accounting happens, the access key we have distributed is not permitted to start new virtual machines.) The hadoop-ec2 script that we have provided has been modified to call ec2-run instead of using the Amazon API directly.

     We have also provided a copy of the EC2 command-line API tools, which include commands like:
     a. ec2-terminate-instances
     b. ec2-cancel-spot-requests
     c. ec2-describe-instances
     d. ec2-reboot-instances
  4. Looking at the command to start a cluster: YOUR-USERNAME-testing is a name specified in the configuration file ~/.hadoop-cloud/clusters.cfg, '1' specifies the number of instances (virtual machines) to run, and 'nn,snn,dn,jt,tt' specifies the services that this instance should run:
     a. nn, snn (namenode and secondary namenode) — master programs for the distributed filesystem (keep track of which files are stored on the cluster and where their data is)
     b. dn (datanode) — worker program for the distributed filesystem (stores the data for files kept on the cluster)
     c. jt (jobtracker) — master program for MapReduce (keeps track of and runs MapReduce jobs)
     d. tt (tasktracker) — worker program for MapReduce (runs tasks, i.e., parts of MapReduce jobs)
     e. --auto-shutdown=115 will automatically shut down this instance after 115 minutes.
  5. Once you have started a cluster, you can log in to your virtual machine using hadoop-ec2 login YOUR-USERNAME-testing. Alternately, look up your machine in 'ec2din | grep YOUR-USERNAME' and log into it with 'ssh -i ~/YOUR-USERNAME-default.pem root@HOSTNAME'. (The hostname will start with ec2-.)
  6. If you are doing the lab remotely, figure out your IP address and authorize it to access the instance's web interfaces with

 

        ec2-authorize YOUR-USERNAME-testing -p 50000-60000 -s YOUR-IP/32

ec2-authorize YOUR-USERNAME-testing -p 80 -s YOUR-IP/32

Alternately, you can proxy access over an SSH tunnel:

  1. Copy the YOUR-USERNAME-default.pem file from your class account home directory to your own machine.
  2. Retrieve the public hostname of the master machine using ‘ec2din | grep YOUR-USERNAME-testing-master’ (the hostname starts with “ec2-”)
  3. If you are using Mac OS X or Linux, move the .pem file to your home directory, then open a terminal and run ssh -i YOUR-USERNAME-default.pem -D localhost:6666 root@ec2-... . If you are using PuTTY on Windows, follow the instructions in this document to import your private key (the .pem file) into a format PuTTY will understand. Then, when configuring a connection to your master instance, go to “Connection > SSH > Tunnels” and add a new tunnel with port 6666 and a destination of “dynamic”.
  4. Set up your browser to use this Proxy Autoconfiguration File (Firefox 4: under Preferences > Advanced > Network > Settings..., enter the URL; other browsers: follow the instructions at http://www.lib.berkeley.edu/Help/proxy.html but use “http://cloudera-public.s3.amazonaws.com/ec2/proxy.pac” instead of the Berkeley library proxy PAC file).