Lab 13: Spark
Due at 11:59pm on 04/29/2016.
By the end of this lab, you should have submitted the lab with
python3 ok --submit. You may submit more than once before the
deadline; only the final submission will be graded.
- You will conduct this lab on the Databricks Spark platform, which can be accessed via a browser.
- After completing each question in the Databricks environment, you will copy a token into your lab13.py file.
- Upon completing the lab, use ok to submit the lab as usual.
In this lab, we'll be writing MapReduce applications using Apache Spark.
A MapReduce application is defined by a mapper function and a reducer function.
- A mapper takes a single input value and emits a list of key-value pairs.
- A reducer takes an iterable over all values for a common key and emits a single value.
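To make the two roles concrete, here is a minimal single-machine sketch in plain Python, using word counting as the example. It is not Spark code and not part of the lab's starter files: the names mapper, reducer, and run_mapreduce are hypothetical, and the driver just groups emitted pairs by key in memory, which a real framework would do across many machines.

```python
from collections import defaultdict

def mapper(line):
    """Take a single input value (a line of text) and emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(values):
    """Take an iterable over all values for one key and emit a single value."""
    return sum(values)

def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical in-memory driver: map each input, group by key, reduce."""
    groups = defaultdict(list)
    for value in inputs:
        for key, emitted in mapper(value):
            groups[key].append(emitted)
    return {key: reducer(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "The end"]
counts = run_mapreduce(lines, mapper, reducer)
print(counts["the"])  # 3
```

Note that the mapper never sees more than one input at a time, and the reducer never sees values from more than one key. That independence is what lets a framework run many mappers and reducers in parallel.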
The following diagram summarizes the MapReduce pipeline:
Apache Spark and Databricks
Spark is a framework that builds on MapReduce. The AMPLab (here at Cal!) first developed this framework to improve upon another MapReduce project, Hadoop. In this lab, we will run Spark on the Databricks platform, which will demonstrate how you can write programs that can harness parallel processing on Big Data.
Databricks is a company that was founded out of UC Berkeley by the creators of Spark. They have been generous enough to donate computing resources for the entire class to write Spark code in Databricks notebooks.
Creating an Account
You will need a community edition account in order to start using Spark. You should have received an email containing a signup link - if it's not in your inbox, check your spam folder. If you're certain that you didn't receive an email, talk to your TA to get an account.
Once you've signed up for an account, log in and click the profile icon at the top right. You should see an invitation to join the class organization:
After you accept the invitation, the same profile menu should show an option to switch to the CS61A 2016 organization. Make sure to select it before continuing.
Creating a Cluster
The first thing you should do is create a cluster. In the sidebar on the left, click on the "Clusters" icon and select the "Create Cluster" button. Enter the name "cs61a" for your cluster, then hit "Create Cluster" to start up the cluster. This can take anywhere from a minute to ten minutes, so be patient! Once your cluster appears as "active", you can proceed to the next step.
Loading the Workspace
In the sidebar, select the "Workspace" icon. You should see a "cs61a-labs" folder. Navigate into it and check out question 0 as an example:
To get started on this problem, first attach to your cluster. In the menubar at the top left, you should see a button that says "Detached." Click on it, then select the cluster that you created earlier:
Interacting with Notebooks
A Databricks notebook is very similar to an IPython notebook, which is a Python instance that you can interact with inside a web browser. Databricks and IPython notebooks can contain code, text, and even images! Here is a screenshot of a notebook. You can see that the code is organized into cells. We can run all the cells by pressing the "Run All" button, or we can run a single cell by pressing Shift+Enter while the cursor is in the desired cell.
Now that you know how notebooks work, complete questions 1-3, which are mandatory. Question 4 is optional but recommended. After completing and passing the tests for each question, you should see a token. Copy the token that is shown in the last cell into your lab13.py file as a string.