Updates
- 03/16/2013 - If you modified MAX_ITERATIONS for testing, make sure you set it back to 20 before submitting. If it is not set to 20, you will not pass the autograder.
- 03/14/2013 - A FAQ page has been posted at https://piazza.com/class#spring2013/cs61c/976
- 03/14/2013 - Submission instructions are now up; please see the bottom of this page.
Administrative
- You MUST work with a partner on this project (your partner does not have to be in your section). You may not work alone and you may not work in groups larger than two.
- If you don't know Java well, partner with someone who does.
Summary
In this project you will use MapReduce to analyze data from real social networks. In Part 1, you will develop and test your code locally (on hive) and for Part 2, you will run it on Amazon EC2.
Introduction
Have you ever heard of the infamous six degrees of separation or played the game Six Degrees of Kevin Bacon? Both are outcomes of the small-world property of social networks, which states that the average distance between any two vertices grows very slowly (logarithmically) relative to the size of the network. In practice, even for very large social networks, the median distance is small: Twitter user follows (4), MSN Messenger conversations (6), Facebook friendships (5), and Hollywood co-actors (4).
We will use the large processing capabilities of a Hadoop cluster to analyze data from real social networks to approximate the distance distribution to get a feel for how far apart most things are. To do this, we will randomly pick a subset of the vertices in the graph, and compute their distances to all other vertices. Your final program will take in a social graph and a constant for sampling and it will return data that we could use to produce a histogram of distance versus frequency of that distance for our selected sample.
Algorithm
Formally, we can think of each input social network as defining a graph: each person is a vertex and each relationship is an edge. This graph is unweighted, since all relationships (friendships, follows, collaborations) are equivalent. In general, the distance between any two vertices is the sum of the weights of the edges on the shortest path between them. To find all of the distances from one starting point, one would normally use a single-source shortest path algorithm such as Dijkstra's algorithm, but since the graph is unweighted, we can get away with a simple breadth-first search (BFS). Better yet, BFS is much easier to parallelize. To get a feel for how this works, examine the serial BFS implementation below:
def bfs(vertices, s):
    dist = {}
    for v in vertices:
        dist[v] = -1          # -1 marks "not yet reached"
    dist[s] = 0
    queue = [s]
    for v in queue:           # the queue grows as new vertices are reached
        for n in neighbors(v):
            if dist[n] == -1:
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist
This variant of breadth-first search returns the distance of each vertex from the starting point s. Notice that each vertex is visited at most once: its distance is updated only the first time it is reached. Your parallel version for MapReduce will be very different (this project is not as straightforward as dumping the above code into a mapper!), but the example shows that BFS is enough to compute distances on an unweighted graph.
We will only perform breadth-first searches from a subset of the nodes to save on time since not much more fidelity is gained by examining all of them. In short, our program will: load in the social graph, perform multiple breadth-first searches, and compute the histogram. We'll be performing all of these breadth first searches simultaneously, since we want to maximize the amount of work we complete in a given unit of time. This will also result in fewer copies of the graph (lower disk/memory usage) and fewer sequential mapreduces (saving time), both of which accelerate processing drastically.
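One way to build intuition for advancing many BFS frontiers simultaneously is a single map/reduce round written in plain Python. This is only a sketch under our own naming (`map_phase`, `reduce_phase`, and `bfs_round` are not from the skeleton, and the real project uses Hadoop mappers and reducers in Java): each vertex's value carries its known distance from every search source, the map emits tentative distances to neighbors, and the reduce keeps the minimum per source.

```python
# Illustrative sketch only -- not the Hadoop solution.
# graph: {vertex: [neighbors]}; dists: {vertex: {source: distance}}

def map_phase(graph, dists):
    """Emit (vertex, {source: tentative_distance}) pairs."""
    for v, per_source in dists.items():
        yield v, per_source                      # keep what we already know
        for n in graph.get(v, []):               # propagate one hop outward
            yield n, {s: d + 1 for s, d in per_source.items()}

def reduce_phase(pairs):
    """For each vertex, keep the minimum distance seen from each source."""
    out = {}
    for v, per_source in pairs:
        best = out.setdefault(v, {})
        for s, d in per_source.items():
            if s not in best or d < best[s]:
                best[s] = d
    return out

def bfs_round(graph, dists):
    """One combined map/reduce iteration over all searches at once."""
    return reduce_phase(map_phase(graph, dists))

# Example: the ring 0->1->2->3->0, searching from sources 0 and 2.
graph = {0: [1], 1: [2], 2: [3], 3: [0]}
dists = {0: {0: 0}, 2: {2: 0}}
for _ in range(3):                               # three rounds reach depth 3
    dists = bfs_round(graph, dists)
```

Each call to bfs_round corresponds to one MapReduce iteration; repeating it MAX_ITERATIONS times advances every search frontier by one hop per round while keeping only a single copy of the graph structure.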
To pick the starting points, we search from each vertex with probability 1/denom, so that in expectation the program will perform num_vertices/denom searches. We leave the actual number of searches to chance because this method is better suited to a distributed setting like MapReduce: for each vertex, the choice of whether to search from it is completely parallel, and the only global information it needs is the probability with which it should be selected. (Hint/sidenote: at no point should you be calculating how many vertices are in the graph.) Once we're done running our mapper and reducer for BFS, we'll have a final MapReduce phase that produces the histogram data, which is simply the total number of shortest paths at each distance.
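The per-vertex coin flip can be sketched in a few lines. This is illustrative Python, not project code; in the real program the flip would happen inside your mapper, with denom read from the job's Configuration object:

```python
import random

def is_search_origin(denom, rng=random):
    """Decide independently, per vertex, whether to start a search here.

    Each vertex is chosen with probability 1/denom, so in expectation
    num_vertices/denom searches run -- no global vertex count is needed.
    """
    return rng.random() < 1.0 / denom

# With denom = 1 every vertex becomes an origin, which is exactly why
# denom = 1 is the deterministic setting for testing.
assert all(is_search_origin(1) for _ in range(1000))
```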
At this point, you're probably wondering how you'll test your code if everything is randomized. Notice, however, that if we set denom to one, we'll run a BFS starting at every vertex. Thus, the deterministic way to test your code is to set denom to one and pass in a relatively small graph.
Problem Details
The input graphs are encoded in Hadoop's SequenceFile format, and each element is (key, value) = (LongWritable source, LongWritable destination). The output should be of the form (key, value) = (LongWritable distance, LongWritable total), where total is the number of shortest paths with that distance. In order to keep things manageable, we use the variable MAX_ITERATIONS to limit the depth of our BFS. By default, MAX_ITERATIONS = 20. For our purposes, this is a reasonable limit that keeps runtime under control in the presence of a few outliers.
The vertices in each graph are identified by long integers. The identifier range is not necessarily contiguous (e.g. a graph could have exactly the vertices {0, 1, 5, 9}). Each input relation should be treated as a directed edge; if the original relation is undirected, the reverse direction will appear elsewhere in the input. There can be repeated relations, but there will be no self-loops (self-edges).
The denom constant will be fed in on the command line. However, you'll notice that the skeleton takes care of making denom accessible from within your mappers and reducers by attaching it to the Configuration object for each MapReduce job.
Finally, note that if a vertex is unreachable (or has a distance greater than MAX_ITERATIONS), it should not contribute to the histogram data.
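As an illustration of that filtering rule, here is a hedged Python sketch (our own helper, not part of the skeleton) of how a histogram phase might discard unreachable vertices and over-deep distances before counting:

```python
from collections import Counter

MAX_ITERATIONS = 20  # same default depth cap as the project

def histogram(distances, max_iterations=MAX_ITERATIONS):
    """Count how many shortest paths have each distance.

    distances: an iterable of per-(source, vertex) distances, where -1
    marks an unreachable vertex. Unreachable vertices and distances
    beyond the iteration cap contribute nothing to the histogram.
    """
    return Counter(d for d in distances
                   if d != -1 and d <= max_iterations)

# e.g. two searches that each saw distances 0, 1, 1, plus one unreachable vertex
print(histogram([0, 1, 1, 0, 1, 1, -1]))  # Counter({1: 4, 0: 2})
```

In the real job this counting would be split across a mapper and reducer, with distance as the key and the per-key totals summed in the reduce.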
To help understand the intended operation, we provide an example.
Provided Resources
We provide you with SmallWorld.java and a Makefile in ~cs61c/proj/02. You can copy them to your home directory by:
$ mkdir ~/proj02
$ cp -r ~cs61c/proj/02 ~/proj02
There should not be a need for you to create any additional files, but if you do, be sure that make compiles them all and that main is still present in the SmallWorld class in sw.jar. Please note that if you submit code that does not compile by running make, we will not grade it and you will receive a zero.
Feel free to modify SmallWorld.java however you want while still completing the program requirements. The code in there is intended to take care of tedious coding details or to provide examples of useful things you may want to modify or mimic in your solution.
The skeleton code assumes your code will use three types of mapreduces:
- Graph loading, which will run once. (Provided as an example.)
- Breadth-first search, which will run MAX_ITERATIONS times. (You'll need to add these classes.)
- Histogram making, which will run once. (You'll need to add these classes.)
main supports this and is intended to demonstrate how to chain and even iterate multiple MapReduce jobs. Currently all of the maps and reduces are the identity, but they contain code demonstrating how to access and use denom and other variables you may wish to pass into your mappers/reducers from main. (Hint: a common use for such a variable is to maintain the iteration count while doing BFS.)
The EValue class is provided as an example implementation of a class that implements Hadoop's Writable interface, which allows it to be the input or output value of a map or reduce phase. For it to be used as a key type for map or reduce, it would also need to implement the WritableComparable interface. EValue currently shows you how to store various fields, including an array, but feel free to modify it to suit your implementation.
You should complete this project on the hive machines in 330 Soda. If you are not sitting physically in front of one of these lab machines, you can access one of them remotely by following these instructions. The code should run locally (on the hive machines) in the same manner as in lab 6. We won't concern ourselves with running remotely on EC2 until Part II. We recommend spending the majority of your development time for this part working locally and with a small dataset to speed up and simplify debugging. The syntax for using the completed program is:
$ make clean   # This will clean up old output files and other stuff
$ make         # This will compile SmallWorld.java
$ hadoop jar sw.jar SmallWorld input_graph output_dir denom
For input_graph, you'll be using one of the following:
Local Graphs (~cs61c/proj2data/)
- ring4.seq - 4 vertices in a ring (0→1, 1→2, 2→3, 3→0)
- cit-HepPh.sequ - 35K vertices from High Energy Physics collaborations
Later on we'll run BFS on some more interesting graphs. However, analyzing those graphs will use a large amount of resources, so we will do that analysis on EC2 in Part II of the project.
Tips
- Both Java and Hadoop have many popular versions with different APIs. We are using: Java 6 and Hadoop 0.20.2
- You may want to change the output (and input) formats to something more readable during development, such as TextOutputFormat. SequenceFileOutputFormat is the fastest because it is a compressed binary encoding.
- The number of distance-0 shortest paths is the number of searches performed.
- Try to keep the number of searches for the huge graphs on the order of a few dozen or less.
- A denom value of 1 will search from every vertex.
- The correct output for ring4.seq with denom=1 is {(0,4),(1,4),(2,4),(3,4)}.
- When running cit-HepPh.sequ on hive, you should be able to stay within your disk quota with denom=10000.
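The ring4.seq expectation above can be double-checked with a plain-Python sanity script, independent of Hadoop (the bfs here is our own serial helper, not project code): searching from every vertex of the 4-ring should produce four shortest paths at each distance 0 through 3.

```python
from collections import Counter, deque

def bfs(graph, s):
    """Serial BFS returning {vertex: distance from s}; unreached omitted."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for n in graph.get(v, []):
            if n not in dist:
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist

# ring4.seq: 0->1, 1->2, 2->3, 3->0; denom = 1 searches from every vertex.
ring4 = {0: [1], 1: [2], 2: [3], 3: [0]}
hist = Counter(d for s in ring4 for d in bfs(ring4, s).values())
print(sorted(hist.items()))  # [(0, 4), (1, 4), (2, 4), (3, 4)]
```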
Assignment
Part 1 (due 3/17/13 @ 23:59:59)
Complete the above problem locally and submit SmallWorld.java. Also submit the Makefile (if modified) and any additional source files if needed.
Submission
Submissions for part one are now open. You may submit the project by running the following in the directory containing SmallWorld.java:
$ submit proj2-1
Only one partner should submit. Please be sure to indicate your partner's login when prompted. Make sure you submit only SmallWorld.java (and any other .java files your implementation depends on, plus the Makefile if you modified it). DO NOT submit any output files. Lastly, if you modified MAX_ITERATIONS, be sure to set it back to 20 before submitting (as it was originally); otherwise you will not pass the autograder.
Grading
Part 1 is worth 2/3 of your Project 2 grade.
Part 2 will be worth 1/3 of your Project 2 grade.
Acknowledgements
Thanks to Scott Beamer (a former TA), the original creator of this project.
Additional Links (NOT required reading) on Social Network Analysis
- Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a Social Network or a News Media?. WWW 2010. (analysis of Twitter and source of our Twitter data)
- Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature:393, 1998. (mathematics behind small-world networks)
- Stanford Network Analysis Project (a great source of social network data and the source of cit-HepPh)
- Laboratory for Web Algorithmics (source of the Hollywood data and has current metrics on largest social networks)
- Wikipedia Crawl