CS61C Spring 2016 Project 5: HARBOR on Apache Hadoop


People: Jack Kolb, Nick Weaver
Due 4/26 @ 23:59:59

NOT SECRET//SIGINT//REL 61C

Background

As part of the larger CHAOS project, the mysterious No Such Agency has effectively wiretapped a huge fraction of the internet, recording every page request and, for web traffic that identifies the user, also recording the username in the reply. Each record includes both a timestamp and the network 4-tuple (source IP, source port, destination IP, destination port). For requests, it includes any tracking cookie seen in the HTTP request, while web replies include any extracted username. Requests also include the full HTTP headers (which contain other features), but for our purposes the headers are simply treated as an opaque blob of data and excluded from the analysis.
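To make the record format concrete, here is a minimal sketch of the fields described above; the class and field names are illustrative assumptions, not the skeleton's actual types.

// A minimal sketch of one captured record, per the description above.
// Names are illustrative; the skeleton code defines the real types.
public class TrafficRecord {
    long timestamp;     // when the request or reply was seen
    String sourceIP;    // the network 4-tuple: source IP and port,
    int sourcePort;     // destination IP and port
    String destIP;
    int destPort;
    String cookie;      // requests only: tracking cookie, if present
    String username;    // replies only: extracted username, if present
}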

Of course, simply recording this data is of little use; it needs to be both analyzed and searched. Your job is to prototype the HARBOR internal database structure using Hadoop. HARBOR performs three primary tasks:

1: It matches requests and replies: when the data includes a request and a reply with the same network 4-tuple and timestamps within 10 seconds of each other, it creates a stream of matched records, each containing both the request and the reply.

2: It writes the output stream of matched request/reply pairs into a set of independent "Query Focused Datasets" (QFDs). In a QFD, a particular key (such as the source IP, destination IP, cookie, or username) is hashed using a cryptographic hash function. The lowest 2 bytes of the hash are then used to select which file to write into, with the resulting file containing both the hash and the data (see the bucketing sketch after this list).

This enables efficient searching for all records that share a particular key. The QFD keys we will be using include the source IP and the tracking cookie; the TOTALFAIL tool described next adds a third QFD, toruser.

3: It provides a second tool to create the TOTALFAIL database of Tor users (the toruser QFD). This tool first queries the QFD that indexes matched records by source IP to find the users seen connecting through a set of Tor exit nodes. It then takes any tracking cookies present in those results, queries the tracking cookie QFD for all activity (and thus all usernames) associated with those cookies, and writes the results out as the TOTALFAIL QFD.
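To make the bucketing in step 2 concrete, here is a minimal sketch of mapping a QFD key to a bucket file. The choice of SHA-1 and the file-naming scheme are assumptions for illustration; the skeleton code defines the actual hash function and file layout.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BucketSelector {
    // Maps a QFD key (e.g. a source IP or cookie) to one of 65536 buckets
    // by taking the lowest 2 bytes of a cryptographic hash of the key.
    public static String bucketFor(String key) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1"); // assumed hash
        byte[] hash = md.digest(key.getBytes(StandardCharsets.UTF_8));
        int bucket = ((hash[hash.length - 2] & 0xFF) << 8)
                   | (hash[hash.length - 1] & 0xFF);
        return String.format("srcIP_%04x.qfd", bucket); // hypothetical naming
    }
}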

Why this structure?

This is a structure optimized for the particular problem faced by the No Such Agency. They are obtaining a massive amount of data, recording the activity of a huge fraction of the people on the planet. Most of this data is completely irrelevant, but it is not known in advance which parts will turn out to be relevant.

Instead, the No Such Agency wants the capability to easily answer queries like "At this IP, who was connected to the network?" and "For this person, from which IPs and at what times did they connect to the network?"

Query Focused Datasets solve this problem. In a parallel structure like Hadoop, access time is dominated by disk latency and network latency. By splitting the data into buckets, this can be efficiently parallelized (with each bucket potentially written by a separate process) and easily communicated in parallel. Then when any particular entry is needed, only that particular bucket needs to be searched for the resulting value.

Within a bucket we don't bother sorting: most buckets will never be read, and accessing a bucket is dominated by the time it takes to load it from disk (rather than the time to search for an entry within it), so from a total-cost perspective it is better not to sort the data.
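To see why the unsorted layout is acceptable, consider what a lookup actually does: load an entire bucket from disk, then scan it in memory. A rough sketch, under an assumed record layout of (key hash, data):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BucketScan {
    // One bucket entry: the key's hash plus the associated record data.
    static class Entry {
        final byte[] keyHash;
        final byte[] data;
        Entry(byte[] keyHash, byte[] data) { this.keyHash = keyHash; this.data = data; }
    }

    // Once the bucket is in memory (the disk read was the expensive part),
    // a linear scan for the target hash is comparatively free.
    static List<byte[]> lookup(List<Entry> bucket, byte[] targetHash) {
        List<byte[]> results = new ArrayList<>();
        for (Entry e : bucket) {
            if (Arrays.equals(e.keyHash, targetHash)) {
                results.add(e.data);
            }
        }
        return results;
    }
}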

The TOTALFAIL analysis then simply takes advantage of this parallel structure. With the initial QFDs in place, it is easy to query for "all Tor users" based on the IPs of known Tor exit nodes, and then to query for each of those users' activity. When a No Such Agency analyst later wants to find out more about a particular Tor user, they can easily discover all of that user's activity.

Your Project (Due 4/26 @ 23:59:59)

Getting Started

Initialize your repository on the Bitbucket website as proj5-xx, replacing xx with your CS 61C login. Next, you can check out a copy of the skeleton code by entering the following commands:

git clone https://mybitbucketusername@bitbucket.org/mybitbucketusername/proj5-xx.git
cd proj5-xx
git remote add proj5-starter https://github.com/cs61c-spring2016/proj5-starter.git
git fetch proj5-starter
git merge proj5-starter/master -m "merge proj5 skeleton code"

The files we have provided in src/main break down into three major groups.

Utilities

Matching and Writing Initial QFDs

The second group of files forms a Hadoop job that analyzes the initial data and matches user requests with replies before writing them to QFD files. We've already configured the job for you in QFDWriterJob.java. It consists of two subtasks.

The first subtask identifies matching request/reply pairs (identical 4-tuples and occurring within 10 seconds of each other). You should implement this subtask by modifying the following files.
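Conceptually, the matching looks like the sketch below: group records by their 4-tuple, then pair each request with any reply whose timestamp falls within 10 seconds. The types here are illustrative stand-ins (with timestamps assumed to be in seconds), not the skeleton's classes.

import java.util.ArrayList;
import java.util.List;

public class MatchSketch {
    static class Request { long timestamp; /* cookie, headers, ... */ }
    static class Reply   { long timestamp; /* username, ... */ }
    static class Match {
        final Request request;
        final Reply reply;
        Match(Request request, Reply reply) { this.request = request; this.reply = reply; }
    }

    // Given all requests and replies sharing one network 4-tuple, pair each
    // request with every reply seen within 10 seconds of it.
    static List<Match> pair(List<Request> requests, List<Reply> replies) {
        List<Match> matches = new ArrayList<>();
        for (Request req : requests) {
            for (Reply rep : replies) {
                if (Math.abs(rep.timestamp - req.timestamp) <= 10) {
                    matches.add(new Match(req, rep));
                }
            }
        }
        return matches;
    }
}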

The second subtask takes these matches and constructs QueryFocusedDataSet objects that are then serialized to HDFS. You should implement this subtask by modifying the following files.
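For the serialization step, Hadoop's FileSystem API hands back an output stream that can be wrapped in an ObjectOutputStream. A minimal sketch, assuming the object you serialize implements java.io.Serializable and using a hypothetical output path:

import java.io.IOException;
import java.io.ObjectOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QFDSerializer {
    // Serializes an object to a file on HDFS (or the local FS in local mode).
    static void writeQFD(Configuration conf, Object qfd, String pathStr)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(pathStr); // e.g. "qfds/srcIP_1a2b.qfd" (hypothetical)
        try (ObjectOutputStream out = new ObjectOutputStream(fs.create(path))) {
            out.writeObject(qfd);
        }
    }
}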

TOTALFAIL

The last group of files forms a Hadoop job that collects data on Tor users. Again, we've configured the Hadoop job for you in TotalFailJob.java. We can reuse the reducer class from the previous job, QFDWriterReducer, to write to the toruser QFD files.

You must complete the implementation by modifying TotalFailMapper.java. Your code should take the following steps:

  1. Take the input, a known IP address for a Tor exit node, and query the srcIP QFDs to find all cookies associated with any request originating from this address. To read the proper QFD files, you will need to use the hash of the srcIP.
  2. Using the cookies gathered in the previous step, query the cookie QFDs to find all request/reply pairs associated with those cookies.
  3. Emit each match associated with the cookies, along with a WTRKey that organizes matches by hash value, to the reducers, which will then write the toruser QFD files.
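Pulling those steps together, the mapper might take roughly the following shape. This is only a sketch: the QFD-lookup helpers are hypothetical stand-ins for code you will write, and it emits Text keys for simplicity where your real mapper emits a WTRKey.

import java.io.IOException;
import java.util.Collections;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TotalFailMapperSketch extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String exitNodeIP = value.toString().trim();

        // Step 1: hash the source IP to locate the right srcIP QFD files,
        // then collect every cookie seen in requests from that address.
        for (String cookie : cookiesBySrcIP(exitNodeIP)) {
            // Step 2: query the cookie QFDs for all matches with this cookie.
            for (String match : matchesByCookie(cookie)) {
                // Step 3: emit each match, keyed so the reducer can group
                // records into toruser QFD files by hash value.
                context.write(new Text(cookie), new Text(match));
            }
        }
    }

    // Hypothetical helpers standing in for the QFD-reading code you write.
    private Iterable<String> cookiesBySrcIP(String ip) { return Collections.emptyList(); }
    private Iterable<String> matchesByCookie(String cookie) { return Collections.emptyList(); }
}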

Note: You are allowed to use any features included in the Java 8 standard libraries in your project. In fact, some of the newer features introduced in Java 7 and 8 can be extremely useful for this project!
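For instance, Java 8 streams make grouping operations concise; a small, self-contained illustration (unrelated to the skeleton's actual types):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StreamExample {
    public static void main(String[] args) {
        List<String> cookies = Arrays.asList("a=1", "b=2", "a=3");
        // Group cookie strings by the name before the '=' in one expression.
        Map<String, List<String>> byName = cookies.stream()
                .collect(Collectors.groupingBy(c -> c.split("=")[0]));
        System.out.println(byName); // e.g. {a=[a=1, a=3], b=[b=2]}
    }
}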

Compiling, Debugging, and Testing

We have provided a Makefile that will let you compile and run your code on the Hive machines. If you wish to run the project on your own system as well, you can install Hadoop 2.7.2 and Java 8.

To compile your code, you can run make from the project's top directory. This will produce two JAR files, one for each of the Hadoop jobs described above. We have included a sample set of web traffic data in the input directory, which is used for automated testing.

To execute the code, run make test from the project's top directory. This will take the two JARs and run them on a local installation of Hadoop. It then invokes a test program we have written, QFDPrinter.java, and saves the output to output/actual/test_1.txt. This is compared against the expected output of your program, which we have provided in output/expected/test_1.txt.

You will likely notice that running Hadoop produces many logging messages on the terminal. If you are trying to debug with print statements, these messages can make it hard to find the output you're looking for, so we have provided an additional Makefile target, test-quiet. Run make test-quiet to run Hadoop with most of its logging disabled. Be aware, however, that you may miss important debugging information, such as exception stack traces, when you use this option.

Submission

The project is due on 4/26 @ 23:59:59. To submit proj5, enter the following:

submit proj5

Additionally, you must submit proj5 to your Bitbucket repository:

cd ~/proj5-xx                              # Or where your shared git repo is
git add -u
git commit -m "project 5 submission"       # The commit message doesn't have to match exactly.
git tag "proj5-sub"                        # The tag MUST be "proj5-sub". Failure to do so will result in loss of credit.
git push origin proj5-sub                  # This tells git to push the commit tagged proj5-sub