Project 4: MapReduce and Spark

Updates and clarifications

11/24 6:46 pm: Removed last transformation of classify() and fixed one comment typo
11/24 9:50 pm: It may make more sense to split on all whitespace, but since Yelp reviews are written by real people, their typos and extra whitespace may actually be indicative of their sentiment (the extra whitespace count is differentiating between the joints and boosting the accuracy of our implementation). Accordingly, please split on a single space (str.split(" ")) instead of all whitespace (str.split())
11/24 9:53 pm: Updated staff debug output to be consistent with splitting on one whitespace

Goals

In this project, you will use the MapReduce programming paradigm to parallelize a simple Naive Bayes classifier with a Bag of Words model in Spark to predict Yelp review ratings.

Getting Started

Logistics

You may work with a partner or alone for this project.
We highly recommend that you use the Hive machines so that your development of your project uses the same environment we will be grading in.

Setup

Intialize your repository and get the skeleton code files by entering the following commands:

$ git clone https://mybitbucketusername@bitbucket.org/mybitbucketusername/proj4-xxx-yyy.git/
$ cd proj4-xxx-yyy
$ git remote add proj4-starter https://github.com/61c-teach/fa17-proj4-starter.git
$ git fetch proj4-starter
$ git merge proj4-starter/master -m "merge proj4 skeleton code"

Also, you need to set up a virtual environment for this project. To do this, run:

$ conda create --name proj4env python=2.7

Respond to the prompt to install packages with "y" (no quotes). After these install, run the following command to activate the virtual environment:

$ source activate proj4env

Remember, every time you want to work on the project, you must activate the virtual environment in order to run Spark.

If you are not familiar with Spark, read this programming guide, especially the section on Resilient Distributed Datasets (RDDs) and on RDD Operations.

Background

In this project, we will be classifying and predicting the "stars" of a large Yelp dataset. The dataset contains lots of various information on Yelp reviews, but we are focused on classifying the review text into star "bins". We will be classifying reviews into 3 bins for this project: 1, 3, and 5 stars. To simplify the classification process and to improve accuracy, we have already taken the raw JSON data and grouped the reviews into the 3 bins. To do this, we took the actual star rating given to the review, and grouped it according to the following rules: 4-5 star -> 5 star, 2-4 star -> 3 star, 0-2 star -> 1 star. This is definitely not the best way to group the data, but we chose this for simplicity.

There are many machine learning models/techniques for text classification. Perhaps the simplest (and most surprisingly effective) is a Naive Bayes classification with a Bag of Words model for text.

tl;dr

(Training) We will calculate the likelihood of a word occuring in reviews of each possible number of stars:
P(word | num_stars) = (1 + # of times word appears in reviews with num_stars) / (1 + # of words total in reviews with num_stars)
We will do this for every word that occurs in the group of reviews with num_stars, for all possible number of stars.
(Training) We will calculate the prior probability of a review with num_stars occuring, for all possible numbers of stars: P(num_stars) = # of reviews with num_stars / # of reviews total .
(Classification) Given a review (word1, word2, word3,...), for all possible numbers of stars, we will calculate the joint probability P(num_stars, word1, word2, word3...) = P(num_stars) P(word1|num_stars) P(word2|num_stars) P(word3|num_stars).... Our prediction for the number of stars for the review is then the num_stars that has the highest joint probability.

Bag of Words

When thinking about the relationship of words in a sentence to their sentiment or meaning, the sequence of the words seems like a likely factor. However, the Bag of Words text model ignores the ordering of words, and instead considers each word independently. A word in a document (or a review in our case) is represented only by the number of times it appears in the document and nothing else (none of ordering, the word's part of speech, or common phrases is considered).

For example, if we had a review "This restaurant is amazing! The best. The food is never bad.", then it would be represented as:

Word	Count
this	1
restaurant	1
is	2
amazing	1
the	2
best	1
food	1
never	1
bad	1

You might think that this representation of text is fairly naive ("never bad" is a lot different than "bad" for instance), but it works surpisingly well.

Naive Bayes

Next, we'll understand the machine learning classifier we'll be using for this task. Given some data X, Naive Bayes attempts to predict the probability P(Y|X) that the data has a label Y, otherwise known as the posterior probability. (In our case, given the text of a review, we are trying to predict the number of stars that review gave.) In order to calculate this posterior probability, Baye's Rule is applied: P(Y|X) = [P(X|Y)P(Y)] / P(X) . However, in practice, although the posterior probability is desired, it is actually proportional to the joint probability P(Y, X) , which is more easily calculated; thus, Naive Bayes ultimately attempts to estimate P(Y, X) = P(X|Y)P(Y) . With this goal of estimating P(Y, X), Naive Bayes then tries to estimate P(X|Y) and P(Y) given some set of training data and labels. In our case, our Naive Bayes classifier will be given a training set, the words in the review (data) and the star rating of the reviews that the words appear in (label), and then estimate P(word | star rating of the review it appears in) and P(star rating) (e.g. P("awesome" | it appeared in a 5 star review) and P(5 star review).

To train our Naive Bayes Classifier, we will estimate P(X|Y), otherwise known as the likelihood, as P(word | num_stars) = # of times word appears in reviews with num_stars / # of words total in reviews with num_stars. To estimate P(Y) , otherwise known as the prior, we will estimate it as P(num_stars) = # of reviews with num_stars / # of reviews total .

For example, say we only had four reviews: ("I hate the food.", 1 star), ("The food is good.", 3 stars), ("Service is good.", 3 stars), and ("I love the good food.", 5 stars). Since there are two reviews with three stars, and four reviews total, the prior probability of a review being three stars is P(3 stars) = 2/4. The same calculations would then be done for one star and five stars, such that we have a table that maps a star rating to its prior probability. For estimating likelihoods, we would estimate the likelihood of "good", given that we know it appeared in a 3 stars review, as P("good" | 3 stars) = 2 appearances of "good" in three star reviews / 7 words total over all three star reviews = 2/7. This calculation would be repeated for every other word that appears at least once in a three stars review, and similarly for the words in one star and five stars reviews. In the end, we would then have a likelihood table for each possible number of stars, where each table maps a word to its likelihood given that table's number of stars. The full prior and likelihood tables (concatenated together for brevity) are shown below:

Priors

P(1 star)	P(3 stars)	P(5 stars)
1/4	2/4	1/4

Likelihoods

word	P(word\|1 star)	P(word\|3 stars)	P(word\|5 stars)
I	1/4	0	1/5
hate	1/4	0	0
the	1/4	1/7	1/5
food	1/4	1/7	1/5
is	0	2/7	0
good	0	2/7	1/5
service	0	1/7	0
love	0	0	1/5

Now, for classification. If a Naive Bayes Classifier classifies an unlabeled datum, X, then for all possible labels, Y = y1, y2, y3..., the joint probabilities P(Y=y1, X), P(Y=y2, X), P(Y=y3, X)... are all calculated, and then the datum is classified as the label corresponding to the greatest joint probability. Specifically, the joint probability is calculated as P(Y=y, X) = P(Y=y)P(x_1 | Y=y)P(x_2 | Y=y) P(x_3 | Y=y)..., where Naive Bayes makes the independence assumption that the probabilities of each x_i are independent given the label y. In our case, our labels are 1, 3, and 5 stars, and each datum is a single review, with each word in the review corresponding to an x_i.

More concretely, suppose that we have the same four reviews as earlier, and would like to now predict the number of stars corresponding to the review, "Good food!". For each possible number of stars, we would calculate the probability of that number, given this review, as P(num_stars, "good", "food") = P(num_stars) * P("good" | num_stars) * P("food" | num_stars) . For each of our star ratings:

P(1 star, "good", "food") = (1/4) * (0) * (1/4) = 0
P(3 stars, "good", "food") = (2/4) * (2/7) * (1/7) = 1/49
P(5 stars, "good", "food") = (1/4) * (1/5) * (1/5) = 1/100

Since the joint probability of the review being "Good food!" and being 3 stars is the greatest, Naive Bayes would classify this review as giving 3 stars.

Lastly, for our task, we will handle a common problem of using Naive Bayes classification. Suppose we were to classify, "The price is good". To calculate the probability that this review is 3 stars, we would calculate P(3 stars, "the", "price", "is", "good") = P(3 stars) [P("the"|3 stars) * P("price"|3 stars) * P("is"|3 stars) * P("good"|3 stars) = (2/4) * (1/7) * (0) * (2/7) * (2/7) = 0 . Since one word, "price", was never found in a 3 star review, the joint probability for this review and a 3 star rating was calculated as zero--despite this review having almost all words that also occur in 3 star reviews. To fix this issue, our Naive Bayes classifier will use Laplace Smoothing, a fancy sounding term for calculating the likelihood as P(word|num_stars) = (1 + # of times word appears in reviews with num_stars) / (1 + # of words total in reviews with num_stars) . This way, words like "price" that are never found in the training set of reviews will be assigned a very small nonzero likelihood instead of zero.

Additionally, multiplying many floating point numbers runs into precision problems (do you remember why?). To handle this, our implementation of classification will take the log of our likelihoods and priors, and add them together (instead of multiplying the likelihoods and priors themselves).

Your Task

Your task will be to fill in some of the functionality of the Naive Bayes classifier. All of the code for the Naive Bayes classifier is in classifier/yelpClassifier.py. The parts for you to fill in are clearly marked. Take some time to understand the main driver functions train and classify and all of the comments in the file.

You are welcome to come up with your own framework for the classifier if you choose. However, we will only be accepting code in classifier/yelpClassifier.py. If your code has any other dependencies and/or does not work together with run-classifier.py, it will not work and you will lose points.

The driver Python file is run-classifier.py. Feel free to modify this file to debug, but remember that none of your changes to this file will be used when grading.

Testing

There will be no autograder for this project. We are releasing the output of the staff train and classify functions to a sample dataset. This sample dataset contains 10 reviews for training and 3 reviews for classification. Feel free to use this to help you debug the output of each part of your implementation. When run on the sample dataset, run-classifier.py will automatically generate output in your_debug_output.txt, compare it to the reference staff output given in staff_debug_output.txt, and print out any diffs in debug_diffs.txt. Do realize that because it is such a small set of data, do not worry about accuracy. You can run this sample using:

$ spark-submit run-classifier.py sample

Once you have implemented all of the missing parts, we have three sample datasets for you to test out. They are located in ~cs61c/data/yelp-data/yelp-reviews-(train|test)-(small|medium|large).txt. Each line of every file is in the format review_id num_stars review_text, with num_stars being either 1, 3, or 5. You can run your Spark code on each dataset by using the command:

$ spark-submit run-classifier.py (small|medium|large)

The dataset breakdown is as follows:

Dataset	# Reviews for Training	# Reviews for Testing	Staff Accuracy	Staff Timing
small	1200	400	45%	several seconds
medium	30,000	10,000	69%	~1 minute
large	1,625,389	541,796	70%	~20 minutes

The small dataset should run very quickly, so you should use this to test your implementation and make sure it matches the staff benchmarks. The large dataset (which is significantly larger than the medium dataset) is for you to run once you match the staff benchmarks for the small and medium datasets. You should only run the large dataset when you think you have finished because it will use up a lot of your time and a lot of the Hive computing resources. If your implementation does better than the staff accuracy, great! However, please note that we will be setting time limits. That is, if your accuracy is matching (or better than) the staff accuracy but is significantly slower than our implementation, this will not pass the grader.

Grading will be simple for this project. We will run your implementation on a set of Yelp reviews not released. If it matches (or exceeds) the staff accuracy and does not take significantly longer to run, you will get full points. If you match the staff benchmark for the three datasets, you should be confident that your code will also match the staff on the unreleased dataset. Partial credit will be given out as follows:

20% for a compiling implementation
20% for non-zero accuracy
20% for correct calculate_num_reviews_and_words_per_num_stars
20% for correct calculate_likelihoods
20% for correct classify_reviews

For the last three items, we will test these functions in isolation to award partial credit.

Reminder:

We will only be accepting classifier/yelpClassifier.py. If you make any changes to any other file, including run-classifier.py, it will not be included in grading.

Submission and Grading

Congratulations! You just used Spark to provide some insight into a huge Yelp dataset. Yelp even puts out a challenge to any person interested in using tools such as Spark to analyze their data. Feel free to check it out.

To submit, run:

$ submit proj4

You should only submit classifier/yelpClassifier.py. Anyting else will be overwritten.

In addition, you should submit to your bitbucket repository as well.

$ cd proj4-XXX-YYY
$ git add classifier/yelpClassifier.py
$ git commit -m "proj4 submission"
$ git tag -f "proj4-sub"
$ git push origin proj4 --tags

CS61C Fall 2017 Project 4: Predicting Yelp Review Ratings with Spark