Due at 11:59pm on 04/29/2016.

Starter Files

Download lab13.zip. Inside the archive, you will find starter files for the questions in this lab, along with a copy of the OK autograder.

Submission

By the end of this lab, you should have submitted the lab with python3 ok --submit. You may submit more than once before the deadline; only the final submission will be graded.

  • You will be conducting this lab on the Databricks Spark platform, which can be accessed via a browser.
  • After completing each question in the Databricks environment, you will copy a token into lab13.py.
  • Upon completing the lab, use ok to submit it as usual.

MapReduce

In this lab, we'll be writing MapReduce applications using Apache Spark.

A MapReduce application is defined by a mapper function and a reducer function.

  • A mapper takes a single input value and emits a list of key-value pairs.
  • A reducer takes an iterable over all values for a common key and emits a single value.
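For example, here is a minimal word-count sketch of this model in plain Python (the function names and the driver comments are hypothetical, not part of the lab's starter code):

    def mapper(line):
        """Emit a (word, 1) key-value pair for each word in a line."""
        return [(word, 1) for word in line.split()]

    def reducer(values):
        """Reduce an iterable of counts for one word to a single total."""
        return sum(values)

    # mapper("to be or not to be") emits
    # [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)];
    # after the framework groups pairs by key, reducer([1, 1]) returns 2
    # for the key 'to'.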

The following diagram summarizes the MapReduce pipeline:

Mapreduce Diagram

Apache Spark and Databricks

Spark is a framework that builds on MapReduce. The AMPLab (here at Cal!) first developed this framework to improve upon another MapReduce project, Hadoop. In this lab, we will run Spark on the Databricks platform, which demonstrates how you can write programs that harness parallel processing on big data.

Databricks is a company that was founded out of UC Berkeley by the creators of Spark. They have been generous enough to donate computing resources for the entire class to write Spark code in Databricks notebooks.

Creating an Account

You will need a community edition account in order to start using Spark. You should have received an email containing a signup link - if it's not in your inbox, check your spam folder. If you're certain that you didn't receive an email, talk to your TA to get an account.

Databricks signup

Once you've signed up for an account, log in and click the profile icon at the top right. You should see an invitation to join the class organization:

Invitation menu

Accept invitation

In the same profile menu, you should now see an option to switch to the CS61A 2016 organization. Make sure to select it before continuing.

Switch workspaces

Creating a Cluster

The first thing you should do is create a cluster. In the sidebar on the left, click on the "Clusters" icon and select the "Create Cluster" button. Enter the name "cs61a" for your cluster, then hit "Create Cluster" to start it up. This can take anywhere from one to ten minutes, so be patient! Once your cluster appears as "active", you can proceed to the next step.

Create Cluster

Active Clusters

Loading the Workspace

In the sidebar, select the "Workspace" icon. You should see a "cs61a-labs" folder. Navigate into it and check out question 0 as an example:

Workspace

To get started on this problem, first attach to your cluster. In the menubar at the top left, you should see a button that says "Detached." Click on it, then select the cluster that you created earlier:

Attach to Cluster

Interacting with Notebooks

A Databricks notebook is very similar to an IPython notebook: an interactive Python session that you work with inside a web browser. Databricks and IPython notebooks can contain code, text, and even images! Here is a screenshot of a notebook. Notice that the code is organized into cells. We can run all the cells by pressing the "Run All" button, or run a single cell by pressing Shift+Enter while our cursor is in the desired cell.

Databricks Notebook
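
To give a sense of what a cell might contain, here is a small Spark sketch of the word-count example from earlier (hypothetical cell contents; the SparkContext sc is predefined for you in every Databricks notebook):

    # A hypothetical example cell; sc is the SparkContext that
    # Databricks provides automatically in every notebook.
    lines = sc.parallelize(["to be or not to be"])
    counts = (lines.flatMap(lambda line: line.split())  # split into words
                   .map(lambda word: (word, 1))         # emit key-value pairs
                   .reduceByKey(lambda a, b: a + b))    # sum counts per word
    counts.collect()  # e.g. [('or', 1), ('not', 1), ('to', 2), ('be', 2)]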

Now that you know how notebooks work, complete questions 1-3, which are mandatory. Question 4 is optional but recommended. After completing and passing the tests for each question, you should see a token. Copy the token that is shown in the last cell into your lab13.py file as a string.
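
For instance, if the last cell of a question displays the token abc123 (a made-up value here), the corresponding line in lab13.py would look something like the following, with the variable name matching whatever the starter file already defines:

    # In lab13.py; the variable name and token below are hypothetical.
    q1_token = 'abc123'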