For this project, you're finally going to roll up your sleeves and use MapReduce to answer a big data problem.
The question you're going to answer is this: given a word, what other words are statistically associated with it? If I say 'love', or 'death', or 'terrorism', what other words and concepts go with it?
A reasonable statistical definition is:
Let Aw be the number of occurrences of w in the corpus.
Let Cw be the number of occurrences of w in documents that also have the target word.
Co-occurrence rate := if (Cw > 0) Cw * (log(Cw))^3 / Aw
else 0
Here is an example that illustrates this definition. Let the target word be "Dave"
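To make the formula concrete, here is a minimal sketch of the basic score with invented numbers (the class name and the sample counts are ours, not from the skeleton, and we assume log means the natural logarithm): if a word w appears Aw = 1000 times in the corpus, 8 of those in documents that also contain "Dave" (Cw = 8), its score is 8 * (ln 8)^3 / 1000, roughly 0.072.

```java
public class CooccurrenceExample {
    // Basic co-occurrence rate: Cw * (log Cw)^3 / Aw, or 0 if Cw == 0.
    // Natural log is assumed here; check the reference outputs to confirm the base.
    public static double rate(long cw, long aw) {
        if (cw <= 0) return 0.0;
        return cw * Math.pow(Math.log(cw), 3) / aw;
    }
}
```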
This does nothing to account for the distance between words, however. A fairly straightforward generalization is, instead of giving each co-occurrence a value of 1, to give it a value f(d), where d is the minimum distance (number of spaces separating the word occurrences, or infinity if one of the words is not in the document) from the word occurrence to the nearest instance of the target word. To keep our definition clean, we restrict our choice of f so that f sends infinity to 0 and positive numbers to numbers greater than or equal to 1. The result of this refinement is as follows:
Let W be the set of all instances of a given word in the corpus
Let Sw be the sum of f(dw) over all w in W.
Co-occurrence rate := if (Sw > 0) Sw * (log(Sw))^3 / |W|
else 0
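The distance-weighted definition can be sketched as follows. The actual three f's are selected by a number on the command line and are not spelled out here, so the f below is purely illustrative (it sends infinity, encoded as a negative d, to 0 and any finite d >= 1 to a value in (1, 2], as the restriction requires); the class and method names are ours.

```java
public class WeightedScore {
    // Illustrative f only: the real f's are chosen by funcNum in conf.xml.
    // A negative d stands in for infinity (target absent from the document).
    public static double f(long d) {
        if (d < 0) return 0.0;       // infinity -> 0
        return 1.0 + 1.0 / d;        // finite d >= 1 -> value >= 1, decaying with distance
    }

    // Generalized rate: Sw * (log Sw)^3 / |W|, or 0 if Sw == 0.
    public static double rate(long[] distances) {
        double sw = 0.0;
        for (long d : distances) sw += f(d);
        if (sw <= 0) return 0.0;
        return sw * Math.pow(Math.log(sw), 3) / distances.length;
    }
}
```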
Here's another generalization: instead of only looking at how words relate to one another, we can look at how phrases relate to one another. To do that, we will look at sequences of n words (n-grams) instead of single words (1-grams). Instead of defining a target word, we define a target gram, and we define the distance between two n-gram instances to be the number of spaces between their left-most words, or infinity if they are not in the same document. Thus, the final refinement of our model is:
Let G be the set of all instances of a given n-gram in the corpus
Let Sg be the sum of f(dg) over all g in G.
Co-occurrence rate := if (Sg > 0) Sg * (log(Sg))^3 / |G|
else 0
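Finding n-gram instances and their distances might look something like the sketch below. It is only an illustration, not the reference implementation: the helper names are ours, and we approximate the "number of spaces between left-most words" as the difference of token positions, with -1 standing in for infinity when the target never occurs.

```java
import java.util.ArrayList;
import java.util.List;

public class NgramDistance {
    // Positions (index of the left-most word) at which the target n-gram occurs.
    public static List<Integer> positions(String[] tokens, String[] target) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i + target.length <= tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < target.length; j++) {
                if (!tokens[i + j].equalsIgnoreCase(target[j])) { match = false; break; }
            }
            if (match) out.add(i);
        }
        return out;
    }

    // Minimum distance from position p to the nearest target occurrence,
    // or -1 (standing in for infinity) if the target never occurs.
    public static int minDistance(int p, List<Integer> targetPositions) {
        int best = -1;
        for (int t : targetPositions) {
            int d = Math.abs(p - t);
            if (best < 0 || d < best) best = d;
        }
        return best;
    }
}
```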
Your task is to produce an ordered list of n-grams for the target gram sorted by generalized Co-occurrence rate.
Your list should be ordered with the biggest co-occurrence rates at the top.
The data will be the same Usenet corpus as in Lab 3 (and in turn, the same format as in Lab 2). Hence, the test data supplied for Lab 2 will also work here.
We'll give you skeleton code for two MapReduce jobs and a Makefile in a git repo at ~cs61c/proj/fa12/01. You should pull from that repository into your working repository (e.g. git pull ~cs61c/proj/fa12/01 master), and then start modifying your copy of Proj1.java. Both jobs live in the same source file. That directory will also include some test data which will contain some of the results from our reference implementation. You should get the same output as we did.
We'll also give you a program (Importer.java) that you can use to convert text files to the appropriate sequence file format. This will let you create your own test data.
Your solution should implement the algorithm described above, with the exception that we will only be implementing three different f's for convenience. Still, it should be trivial to modify your code to accept more f's. Our reference solution does define new classes and methods, and we encourage you to do the same so that your code remains readable. You should not need to change much in main, but it is likely that you will want to change the type signatures of your MR jobs, and this requires some modifications to main. We've commented the regions we expect you to need to edit.
Note that Hadoop guarantees that a reduce task receives its keys in sorted order. You may (should!) use this fact in your implementation.
It's possible that some Map or Reduce phases will be the identity function. This is the default behavior if you don't override the base class map() and reduce() (i.e., if you simply delete the map or reduce function).
As you parse the text, you should convert it to lower case. Have a look at the Javadoc for java.lang.String here for a Java method that should be helpful to you. You may also need to use Java's HashMap and Math classes.
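For instance, lower-casing and tallying words with a HashMap might look like this minimal sketch (the class name and whitespace tokenization are our assumptions; your mapper may tokenize differently):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    // Count occurrences of each word in a chunk of text, lower-cased first.
    public static Map<String, Integer> counts(String text) {
        Map<String, Integer> out = new HashMap<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            if (w.isEmpty()) continue;
            out.merge(w, 1, Integer::sum);  // add 1, or insert 1 if absent
        }
        return out;
    }
}
```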
We've defined Combiner functions for both jobs. These are disabled by default; you can turn them on by specifying "-Dcombiner=true" on the command line. You should try to write the code to make use of these combiners. It'll make your job run faster, and more importantly you can't get full credit without implementing a non-trivial combiner. It shouldn't be hard — our combiner implementation was just a handful of lines.
The framework expects you to output (from the second job) a DoubleWritable and Text pair. These should be the score for each word and the word itself, respectively.
The way you should launch the jobs is via:
hadoop jar proj1.jar Proj1 -conf conf.xml <input> <intermediateDir> <outDir>
You should edit conf.xml to refer to the target gram and f for the analysis (the three f's we'll be using are numbered 0, 1, and 2), and <intermediateDir> and <outDir> can be whatever paths you like.
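A conf.xml along these lines should work; note that targetGram is the property name named in this handout, while funcNum is our guess based on the option naming used later in this document, so check the skeleton's conf.xml for the exact property names and values:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>targetGram</name>
    <value>court order</value>
  </property>
  <property>
    <name>funcNum</name>
    <value>1</value>
  </property>
</configuration>
```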
The -D syntax sets a Hadoop configuration parameter, and the -conf syntax specifies an xml which defines multiple configuration parameters simultaneously. Inside your job, you can retrieve this parameter via context.getConfiguration().get("targetGram"). The framework we give you has this code already there for you.
The intermediate directory will hold the output from the first MapReduce job, which is used as input for the second job. This may be helpful in debugging.
The framework we give you supports two other options: specifying -Dcombiner=true will turn on the combiner, and specifying -DrunJob2=false will cause only the first job to run, with its output in a friendly (Text) format instead of sequence files. This is intended for debugging.
Make sure you delete the intermediate and final outputs between runs.
STOP. You have now done enough work for the intermediate checkpoint, due on the 16th. For the intermediate milestone, submit the contents of your proj1-1 directory with the tag proj1-1.
Before submitting, you should verify that the invoke command above causes your code to run correctly. We'll be using that command, or one very similar to it, for our grading.
We're going to grade your code by running it in an automated test harness on sample data. We expect you to give [approximately] the right numeric answers, barring floating point roundoff. So don't try to improve on the formula we give you to get "better" results (you are free to toy around with additional f's, but do not modify the standard three). We promise to avoid tests that are overly sensitive to roundoff.
PLEASE START EARLY. Only a certain number of clusters can be initialized under the cs61c master account. That means that if you procrastinate on the project right now, there's a good chance that you won't be able to start a cluster easily towards the end of the week, when traffic is heaviest.
We expect you to test your code in the cloud. This should be a fun experience, not a chore. You're making machines hundreds of miles away jump to your bidding. And you don't even have to pay for it. Berkeley, however, does have to pay for it so please try to be cost-conscious.
Start by creating a proj1-2 directory in your work directory, and copying files over to the proj1-2 directory. You can use the following commands to achieve this:
mkdir ~/<working directory>/proj1-2
cp -r ~/<working directory>/proj1-1/* ~/<working directory>/proj1-2/
On EC2 servers, Hadoop assigns LongWritable document IDs, instead of Text. So, you should change the signature of the Map1 class and its map function. Changing the input key class to org.apache.hadoop.io.LongWritable or org.apache.hadoop.io.WritableComparable will work.
For this assignment, we want you to run your job a total of six times in the cloud.
You'll be exclusively using "large" instances. For each of the three (n-gram, funcNum) pairs { ("jurisdiction", 0), ("court order", 1), ("in my opinion", 2) }, run using both 5 workers and then 9 workers in the cluster. Be sure to save the output and the job status pages (so you can determine input size and runtime) for each run. Once you terminate a cluster, you will not be able to access that cluster's logs, nor its Hadoop distributed filesystem.
We estimate that each run with 5 workers will take around 15-20 minutes (a few minutes less with a combiner).
Do not leave EC2 machines running when you are not using them. The virtual machines cost the same amount of money when they are active as when they are sitting idle.
The EC2 usage instructions are substantially the same as those in Lab 3.
If you haven't already set up your account for EC2 usage, do so using:
bash ~/<working directory>/lab03/ec2-init.sh
source ~/ec2-environment.sh
To start a cluster with N workers, say:
hadoop-ec2 launch-cluster --auto-shutdown=230 <cluster name> N
To redisplay the web interface URLs use:
hadoop-ec2 list <cluster name>
You can then start the job on this cluster via:
hadoop-ec2 proxy <cluster name>
hc <cluster name> jar proj1.jar Proj1 -conf conf.xml -Dcombiner=<true/false> -DrunJob2=<true/false> s3n://cs61cUsenet/s2006 hdfs:///<intermediateDir> hdfs:///<outDir>
Remember to terminate your cluster when you're done.
We don't think you'll need any of these...