Updates
- 03/16/2013 - If you modified MAX_ITERATIONS for testing, make sure you set it back to 20 before submitting. If it is not set to 20, you will not pass the autograder.
- 03/14/2013 - A FAQ page has been posted at https://piazza.com/class#spring2013/cs61c/976
- 03/14/2013 - Submission instructions are now up; please see the bottom of this page.
Administrative
- You MUST work with a partner on this project (your partner does not have to be in your section). You may not work alone and you may not work in groups larger than two.
- If you don't know Java well, partner with someone who does.
Summary
In this project you will use MapReduce to analyze data from real social networks. In Part 1, you will develop and test your code locally (on hive) and for Part 2, you will run it on Amazon EC2.
Introduction
Have you ever heard of the infamous six degrees of separation or played the game Six Degrees of Kevin Bacon? Both are outcomes of the small-world property of social networks, which states that the average distance between any two vertices grows very slowly (logarithmically) relative to the size of the network. In practice, even for very large social networks, the median distance is small: Twitter user follows (4), MSN Messenger conversations (6), Facebook friendships (5), and Hollywood co-actors (4).
We will use the large processing capabilities of a Hadoop cluster to analyze data from real social networks to approximate the distance distribution to get a feel for how far apart most things are. To do this, we will randomly pick a subset of the vertices in the graph, and compute their distances to all other vertices. Your final program will take in a social graph and a constant for sampling and it will return data that we could use to produce a histogram of distance versus frequency of that distance for our selected sample.
Algorithm
Formally, we can think of each input social network as defining a graph: each person is a vertex and each relationship is an edge. This graph is unweighted, since all relationships (friendships, follows, collaborations) are equivalent. In general, the distance between any two vertices is the sum of the weights of the edges on the shortest path between them. To find all of the distances from one starting point, one would normally use a single-source shortest path algorithm such as Dijkstra's algorithm, but since the graph is unweighted, we can get away with a simple breadth-first search (BFS). Better yet, BFS is much easier to parallelize. To get a feel for how this works, examine the serial BFS implementation below:
def bfs(vertices, s):
    dist = {}
    for v in vertices:
        dist[v] = -1          # -1 marks "not yet reached"
    dist[s] = 0
    queue = [s]
    for v in queue:           # the queue grows as new vertices are reached
        for n in neighbors(v):
            if dist[n] == -1:
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist
This variant of breadth-first search returns the distance of each vertex from the starting point s. Notice that each vertex is visited at most once: its distance is updated only the first time it is reached. Your parallel version for MapReduce will be very different (this project is not as straightforward as dumping the above code into a mapper!), but the example shows that BFS is enough to compute distances on an unweighted graph.
We will only perform breadth-first searches from a subset of the nodes to save on time since not much more fidelity is gained by examining all of them. In short, our program will: load in the social graph, perform multiple breadth-first searches, and compute the histogram. We'll be performing all of these breadth first searches simultaneously, since we want to maximize the amount of work we complete in a given unit of time. This will also result in fewer copies of the graph (lower disk/memory usage) and fewer sequential mapreduces (saving time), both of which accelerate processing drastically.
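One way to build intuition for advancing many BFS frontiers simultaneously is a single map/reduce round written in plain Python. This is only a sketch under our own naming (`map_phase`, `reduce_phase`, and `bfs_round` are not from the skeleton, and the real project uses Hadoop mappers and reducers in Java): each vertex's value carries its known distance from every search source, the map emits tentative distances to neighbors, and the reduce keeps the minimum per source.

```python
# Illustrative sketch only -- not the Hadoop solution.
# graph: {vertex: [neighbors]}; dists: {vertex: {source: distance}}

def map_phase(graph, dists):
    """Emit (vertex, {source: tentative_distance}) pairs."""
    for v, per_source in dists.items():
        yield v, per_source                      # keep what we already know
        for n in graph.get(v, []):               # propagate one hop outward
            yield n, {s: d + 1 for s, d in per_source.items()}

def reduce_phase(pairs):
    """For each vertex, keep the minimum distance seen from each source."""
    out = {}
    for v, per_source in pairs:
        best = out.setdefault(v, {})
        for s, d in per_source.items():
            if s not in best or d < best[s]:
                best[s] = d
    return out

def bfs_round(graph, dists):
    """One combined map/reduce iteration over all searches at once."""
    return reduce_phase(map_phase(graph, dists))

# Example: the ring 0->1->2->3->0, searching from sources 0 and 2.
graph = {0: [1], 1: [2], 2: [3], 3: [0]}
dists = {0: {0: 0}, 2: {2: 0}}
for _ in range(3):                               # three rounds reach depth 3
    dists = bfs_round(graph, dists)
```

Each call to bfs_round corresponds to one MapReduce iteration; repeating it MAX_ITERATIONS times advances every search frontier by one hop per round while keeping only a single copy of the graph structure.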
To pick the starting points, we search from each vertex with probability 1/denom, so that in expectation the program will perform num_vertices/denom searches. We leave the actual number of searches to chance because this method is better suited to a distributed setting like MapReduce: for each vertex, the choice of whether to search from it is completely parallel, and the only global information it needs is the probability with which it should be selected. (Hint/sidenote: at no point should you be calculating how many vertices are in the graph.) Once we're done running our mapper and reducer for BFS, we'll have a final MapReduce phase that produces the histogram data, which is simply the total number of shortest paths at each distance.
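The per-vertex coin flip can be sketched in a few lines. This is illustrative Python, not project code; in the real program the flip would happen inside your mapper, with denom read from the job's Configuration object:

```python
import random

def is_search_origin(denom, rng=random):
    """Decide independently, per vertex, whether to start a search here.

    Each vertex is chosen with probability 1/denom, so in expectation
    num_vertices/denom searches run -- no global vertex count is needed.
    """
    return rng.random() < 1.0 / denom

# With denom = 1 every vertex becomes an origin, which is exactly why
# denom = 1 is the deterministic setting for testing.
assert all(is_search_origin(1) for _ in range(1000))
```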
At this point, you're probably wondering how you'll test your code if everything is randomized. Notice, however, that if we set denom to one, we'll run a BFS starting at every vertex. Thus, the deterministic way to test your code is to set denom to one and pass in a relatively small graph.
Problem Details
The input graphs are encoded in Hadoop's SequenceFile format, and each element is (key, value) = (LongWritable source, LongWritable destination). The output should be of the form (key, value) = (LongWritable distance, LongWritable total), where total is the number of shortest paths with that distance. In order to keep things manageable, we use the variable MAX_ITERATIONS to limit the depth of our BFS. By default, MAX_ITERATIONS = 20. For our purposes, this is a reasonable limit that keeps runtime under control in the presence of a few outliers.
The vertices in each graph are identified by long integers. The identifier range is not necessarily contiguous (e.g. a graph could have exactly the vertices {0, 1, 5, 9}). Each input relation should be treated as a directed edge; if the original relation is undirected, the reverse direction will appear elsewhere in the input. There can be repeated relations, but there will be no self-loops (self-edges).
The denom constant will be fed in on the command line. However, you'll notice that the skeleton takes care of making denom accessible from within your mappers and reducers by attaching it to the Configuration object for each MapReduce job.
Finally, note that if a vertex is unreachable (or has a distance greater than MAX_ITERATIONS), it should not contribute to the histogram data.
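As an illustration of that filtering rule, here is a hedged Python sketch (our own helper, not part of the skeleton) of how a histogram phase might discard unreachable vertices and over-deep distances before counting:

```python
from collections import Counter

MAX_ITERATIONS = 20  # same default depth cap as the project

def histogram(distances, max_iterations=MAX_ITERATIONS):
    """Count how many shortest paths have each distance.

    distances: an iterable of per-(source, vertex) distances, where -1
    marks an unreachable vertex. Unreachable vertices and distances
    beyond the iteration cap contribute nothing to the histogram.
    """
    return Counter(d for d in distances
                   if d != -1 and d <= max_iterations)

# e.g. two searches that each saw distances 0, 1, 1, plus one unreachable vertex
print(histogram([0, 1, 1, 0, 1, 1, -1]))  # Counter({1: 4, 0: 2})
```

In the real job this counting would be split across a mapper and reducer, with distance as the key and the per-key totals summed in the reduce.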
To help understand the intended operation, we provide an example.
Provided Resources
We provide you with SmallWorld.java and a Makefile in ~cs61c/proj/02. You can copy them to your home directory by:
$ mkdir ~/proj02
$ cp -r ~cs61c/proj/02 ~/proj02
There should not be a need for you to create any additional files, but if you do, be sure that make compiles them all and that main is still present in the SmallWorld class in sw.jar. Please note that if you submit code that does not compile by running make, we will not grade it and you will receive a zero.
Feel free to modify SmallWorld.java however you want while still completing the program requirements. The code in there is intended to take care of tedious coding details or to provide examples of useful things you may want to modify or mimic in your solution.
The skeleton code assumes your code will use three types of mapreduces:
- Graph loading, which will run once. (Provided as an example.)
- Breadth-first search, which will run MAX_ITERATIONS times. (You'll need to add these classes.)
- Histogram making, which will run once. (You'll need to add these classes.)
main supports this and is intended to demonstrate how to chain and even iterate multiple MapReduce jobs. Currently all of the maps and reduces are the identity, but they contain code demonstrating how to access and use denom and other variables you may wish to pass into your mappers/reducers from main. (Hint: a common use for such a variable is to maintain the iteration count while doing BFS.)
The EValue class is provided as an example implementation of a class that implements Hadoop's Writable interface, which allows it to be the input or output value of a map or reduce phase. For it to be used as a key type for map or reduce, it would also need to implement the WritableComparable interface. EValue currently shows you how to store various fields, including an array, but feel free to modify it to suit your implementation.
You should complete this project on the hive machines in 330 Soda. If you are not sitting physically in front of one of these lab machines, you can access one of them remotely by following these instructions. The code should run locally (on the hive machines) in the same manner as in lab 6. We won't concern ourselves with running remotely on EC2 until Part II. We recommend spending the majority of your development time for this part working locally and with a small dataset to speed up and simplify debugging. The syntax for using the completed program is:
$ make clean   # This will clean up old output files and other stuff
$ make         # This will compile SmallWorld.java
$ hadoop jar sw.jar SmallWorld input_graph output_dir denom
For input_graph, you'll be using one of the following:
Local Graphs (~cs61c/proj2data/)
- ring4.seq - 4 vertices in a ring (0→1, 1→2, 2→3, 3→0)
- cit-HepPh.sequ - 35K vertices from High Energy Physics collaborations
Later on we'll run BFS on some more interesting graphs. However, analyzing those graphs will use a large amount of resources, so we will do that analysis on EC2 in Part II of the project.
Tips
- Both Java and Hadoop have many popular versions with different APIs. We are using: Java 6 and Hadoop 0.20.2
- You may want to change the output (and input) formats to something more readable during development, such as TextOutputFormat. SequenceFileOutputFormat is the fastest because it is a compressed binary encoding.
- The number of distance-0 shortest paths is the number of searches performed.
- Try to keep the number of searches for the huge graphs on the order of a few dozen or less.
- A denom value of 1 will search from every vertex.
- The correct output for ring4.seq with denom=1 is {(0,4),(1,4),(2,4),(3,4)}.
- When running cit-HepPh.sequ on hive, you should be able to stay within your disk quota with denom=10000.
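The ring4.seq expectation above can be double-checked with a plain-Python sanity script, independent of Hadoop (the bfs here is our own serial helper, not project code): searching from every vertex of the 4-ring should produce four shortest paths at each distance 0 through 3.

```python
from collections import Counter, deque

def bfs(graph, s):
    """Serial BFS returning {vertex: distance from s}; unreached omitted."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for n in graph.get(v, []):
            if n not in dist:
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist

# ring4.seq: 0->1, 1->2, 2->3, 3->0; denom = 1 searches from every vertex.
ring4 = {0: [1], 1: [2], 2: [3], 3: [0]}
hist = Counter(d for s in ring4 for d in bfs(ring4, s).values())
print(sorted(hist.items()))  # [(0, 4), (1, 4), (2, 4), (3, 4)]
```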
Assignment
Part 1 (due 3/17/13 @ 23:59:59)
Complete the above problem locally and submit SmallWorld.java. Also submit the Makefile (if modified) and any additional source files if needed.
Submission
Submissions for part one are now open. You may submit the project by running the following in the directory containing SmallWorld.java:
$ submit proj2-1
Only one partner should submit. Please be sure to indicate your partner's login when prompted. Make sure you submit only SmallWorld.java (and any other .java files your implementation depends on, plus the Makefile if you modified it). DO NOT submit any output files. Lastly, if you modified MAX_ITERATIONS, be sure to set it back to 20 before submitting (as it was originally); otherwise you will not pass the autograder.
Grading
Part 1 is worth 2/3 of your Project 2 grade.
Part 2 will be worth 1/3 of your Project 2 grade.
Acknowledgements
Thanks to Scott Beamer (a former TA), the original creator of this project.
Additional Links (NOT required reading) on Social Network Analysis
- Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a Social Network or a News Media?. WWW 2010. (analysis of Twitter and source of our Twitter data)
- Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature:393, 1998. (mathematics behind small-world networks)
- Stanford Network Analysis Project (a great source of social network data and the source of cit-HepPh)
- Laboratory for Web Algorithmics (source of the Hollywood data and has current metrics on largest social networks)
- Wikipedia Crawl