University of California, Berkeley
EECS Department - Computer Science Division
CS3 Lecture 14 : MapReduce
- Are there any questions about the project?
Review - Fractals
- We saw six examples of fractals; what a wonderful, beautiful world is mathematics...
- Today we're going to see the power of using multiple computers to help solve a problem, and how easy this can be if you use the right language and abstraction!
Non-computer example : Sorting cards
- Problem: We have a shuffled deck of cards and want to sort it
- If you get good at this, you can try to compete for the card sorting world record (now 36 sec)!
- Here's the order: A♣ K♣ Q♣ J♣ 10♣ ... 2♣ A♠ K♠ ... 2♠ A♥ K♥ ... 2♥ A♦ K♦ ... 2♦
- Let's have a competition ... Dan vs the class (who will be given time to plot a strategy).
- How did they do? Were they faster or slower than Dan? How much? Why weren't they 5 times faster?
Working in parallel
- When dividing a problem among others, there are overhead costs:
- Time to divide the "data" among the helpers
- Time to ship the data to the helpers
- Time for the helpers (sub-contractors) to complete their part
- Time to put the parts together
- Time to ship the result back to the "dispatcher" who returns the answer
- There's also an issue -- sometimes problems (i.e., programs) have parts that cannot be parallelized! Those parts get no speedup at all, no matter how many helpers you add.
- We haven't even talked about how hard this might be to do in a programming language!
- Failures : What if one of the workers gets sick? How does the dispatcher know and replace him/her?
- Load balancing : What if one of the workers is really fast and gets their part done early? How does the dispatcher know so that they can be given another piece of work to do?
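- The effect of overhead and of parts that cannot be parallelized is usually estimated with Amdahl's law. The lecture does not name it, so treat the sketch below as an illustration only. It is in Python so it can be run anywhere. The function name `speedup`, its parameters, and the lump `overhead` term are hypothetical; all times are fractions of the time one person needs alone.

```python
def speedup(parallel_fraction, helpers, overhead=0.0):
    """Estimated speedup over one worker (Amdahl's law plus a fixed overhead).

    All times are fractions of the time one worker needs alone.
    `overhead` is a hypothetical lump cost for dividing the data,
    shipping it, and putting the parts back together.
    """
    serial_time = 1.0 - parallel_fraction          # cannot be split up
    parallel_time = parallel_fraction / helpers    # split evenly among helpers
    return 1.0 / (serial_time + parallel_time + overhead)


# Five helpers, everything parallelizable, no overhead: the ideal 5x.
print(speedup(1.0, 5))
# Same five helpers, but 20% of the job is serial and overhead costs 10%.
print(round(speedup(0.8, 5, overhead=0.1), 2))
```

With a fully serial job (`parallel_fraction` of 0) the estimate stays at 1.0 however many helpers are added, which is one answer to "why weren't they 5 times faster?"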
Real-world problems that currently use parallel computation
- SETI @ Home - search for signals that may indicate extraterrestrial intelligence!
- Nutritious Rice for the World - help farmers use breeding to produce rice strains with "higher crop yields, greater disease and pest resistance, and that will provide a full range of bioavailable nutrients thereby benefiting those in regions where hunger and nutrient deficiency is a critical concern."
- Help Conquer Cancer - "Improving the protein crystallography pipeline will enable researchers to determine the structure of many cancer-related proteins faster. This will lead to improving our understanding of the function of these proteins, and enable potential pharmaceutical interventions to treat this deadly disease."
- Folding @ home - "understand protein folding, misfolding, and related diseases"
- Generate Google's index of the web - They also use parallel computation to search this index
- Render farms - allow companies like Pixar, Dreamworks, Sony, ILM, etc. to make 3D images
- Many many others!
Programming in a way that makes parallel programming easy ... MapReduce
- Google (perhaps you've heard of them) took a look at a common parallel software pattern called MapReduce, which takes some data (often a large amount), maps it, and then reduces it.
- The important fact here is that their thousands of computers can help with the mapping part and the reducing part
- The MapReduce software handles failures and load balancing automatically!
- This is such a common pattern that they've found literally thousands of small uses for it in their company, and they use it to build up their internal databases which are queried when you type something to the Google search field.
- Yahoo (and others) wrote an open-source version of the same thing called Hadoop, and we have spent over a year making Scheme talk to it and developing curricula for it.
- The take-away big idea here is that by staying with a beautiful functional language (like, oh, say Scheme), we can parallelize the computation with almost no work!
MapReduce in CS3 on one local machine
- Here's what the pattern looks like for CS3:
;; 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100 = 385
STk> (reduce + (map square '(1 2 3 4 5 6 7 8 9 10)))
- That's it! You take some data, map something over it, and then reduce it.
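- For comparison, here is the same pattern sketched in Python (chosen only so it runs outside STk; it is not part of the CS3 code). `functools.reduce` plays the role of `reduce` and the built-in `map` plays the role of `map`.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


# Map square over the data, then reduce the results with +.
total = reduce(add, map(square, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
print(total)  # 385
```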
MapReduce in CS3 on a cluster of machines!
- A cluster is a whole bunch of machines (usually in a machine room or datacenter) that can be called upon to help you with a big compute task.
- We have a cluster of 26 Dell computers we house in the Soda basement (thanks to a grant from Google and Intel) called icluster. Imagine 26 computers helping you with your big compute task!
- In CS3, we're working to hide a mountain of technical details of the implementation from you, but, like the little Dutch boy with his finger in the dike, sometimes these details slip through. Here goes...
- Files -- the cluster takes its data from files (or directories) instead of Scheme expressions.
- This means we have to ship our data to the cluster file system first before we can work on it. We've done that for some interesting datasets, so you don't have to worry about how to do this. That said, if you have a particularly fun dataset you'd like to crunch, let your TA know and we'll look into putting it on the cluster.
- New functions that use the cluster:
(reduce-map-letter reducer-fun mapper-fun filename-or-directory)
(reduce-map-word reducer-fun mapper-fun filename-or-directory)
(reduce-map-sent reducer-fun mapper-fun filename-or-directory)
- The reducer-fun is a standard reducer (two inputs, one output), but with two significant constraints. They come from the fact that we want this process to be as fast as possible, so we use many machines, and those machines might return their answers out of order.
- It has to support reversed input arguments:
(reducer-fun a b) and (reducer-fun b a) both have to be valid.
- Since the order of the output might be different depending on what machines returned first, if you use sentence or word (or similar function) as your combiner, you might get different output. If the reducer-fun is commutative, you won't notice a difference.
- It has to be associative:
(reducer-fun a (reducer-fun b c)) = (reducer-fun (reducer-fun a b) c)
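- The two constraints can be spot-checked on sample values. The Python sketch below is an illustration, not part of the CS3 library: the helper names `order_independent`, `associative_on`, and `join` are hypothetical, and checking a few inputs is evidence, not a proof.

```python
from functools import reduce
from itertools import permutations
from operator import add, sub


def order_independent(reducer, values):
    """True if reducing `values` gives one answer for every arrival order."""
    results = {reduce(reducer, p) for p in permutations(values)}
    return len(results) == 1


def associative_on(reducer, a, b, c):
    """Check (a . (b . c)) == ((a . b) . c) for one triple of inputs."""
    return reducer(a, reducer(b, c)) == reducer(reducer(a, b), c)


def join(a, b):
    # Stand-in for a sentence-building combiner like `se`.
    return a + " " + b


print(order_independent(add, [1, 4, 9, 16]))          # True: + is safe
print(associative_on(sub, 10, 4, 3))                  # False: 9 vs 3
print(order_independent(join, ["i", "love", "cs3"]))  # False: order shows
```

Note that `join` is associative but not commutative: it is a valid reducer, yet its output depends on which machine answers first.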
- The filename-or-directory last argument is a word (enclosed in double-quotes) specifying a file or directory in our system. Directories with files in them are concatenated together and treated as if they are one big file. For this example, we'll refer to a file called ilovecs3.txt, which contains exactly the following:
- The effect of each of these functions is almost exactly the same; the only thing that differs is how the file is processed and turned into a list (that's why we have -letter, -word and -sent versions). That is, each one below does exactly this:
(reduce reducer-fun (map mapper-fun file-as-list))
| Name of function | How does it convert a file to a list to pass to map? | What does map see as its second argument if the file is ilovecs3.txt? (I.e., what is file-as-list?) | Domain of mapper-fun? |
| reduce-map-letter | Ignores carriage returns and spaces (actually whitespace in general) and treats the file as a big list of all its letters | '( i l o v e c s 3 ) | A letter |
| reduce-map-word | Ignores carriage returns and treats the file as a big list of words | '( i love cs3 ) | A word |
| reduce-map-sent | Converts every line into a sentence, and makes a list of all the sentences | '( (i love) (cs3) ) | A sentence |
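- The three ways of turning a file into a list can be sketched in Python. This is an illustration, not the cluster's implementation: the function names are hypothetical, and the file contents are an assumption inferred from the three lists shown for ilovecs3.txt.

```python
def as_letters(text):
    """All non-whitespace characters, in order."""
    return [ch for ch in text if not ch.isspace()]


def as_words(text):
    """Whitespace-separated words, ignoring line breaks."""
    return text.split()


def as_sentences(text):
    """One list of words per non-empty line."""
    return [line.split() for line in text.splitlines() if line.strip()]


# Assumed contents of ilovecs3.txt, inferred from the lists above.
ilovecs3 = "i love\ncs3\n"

print(as_letters(ilovecs3))    # ['i', 'l', 'o', 'v', 'e', 'c', 's', '3']
print(as_words(ilovecs3))      # ['i', 'love', 'cs3']
print(as_sentences(ilovecs3))  # [['i', 'love'], ['cs3']]
```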
- ...and what makes this so powerful is that we don't have to have just ONE machine doing the mapping, but an army of machines!
- ...and when we're doing the reduction, we can use lots of students
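- Here is a rough single-machine sketch of that idea in Python, with threads standing in for the machines. It is not how Hadoop or the CS3 functions are implemented; `parallel_reduce_map` and `map_then_reduce` are hypothetical names, and the sketch shows the structure (split, map and reduce each chunk, combine partial answers as they arrive) without any real speed gain. It assumes a non-empty input.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import reduce
from operator import add


def map_then_reduce(reducer, mapper, chunk):
    """What one helper does: map its chunk, then reduce it to one value."""
    return reduce(reducer, map(mapper, chunk))


def parallel_reduce_map(reducer, mapper, items, helpers=4):
    """Split `items` into chunks, process each on a helper, combine results.

    Assumes `items` is non-empty.
    """
    items = list(items)
    size = max(1, -(-len(items) // helpers))  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=helpers) as pool:
        futures = [pool.submit(map_then_reduce, reducer, mapper, c)
                   for c in chunks]
        # Partial answers are collected in whatever order helpers finish.
        partials = [f.result() for f in as_completed(futures)]
    return reduce(reducer, partials)


print(parallel_reduce_map(add, lambda x: x * x, range(1, 11)))  # 385
```

Because the partial answers are combined in completion order, the result is only predictable when the reducer is associative and commutative, as `+` is here.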
Show me the money! Let's see an example...
- We'll have to log into the cluster to run this. We're going to use the file "/numbers", which simply has the numbers 1-10:
unix% ssh email@example.com
icluster1  ~ > stk-simply
STk> (define (square x) (* x x))
STk> (reduce + (map square '(1 2 3 4 5 6 7 8 9 10)))
STk> (reduce-map-letter + square "/numbers")
Mapreduce in progress! Your ID number is 933510232. For progress info, see
STk> (reduce-map-word + square "/numbers")
Mapreduce in progress! Your ID number is 28540981. For progress info, see
STk> (reduce-map-sent + (lambda (s) (reduce + (map square s))) "/numbers")
Mapreduce in progress! Your ID number is 475145543. For progress info, see
- You'll probably be annoyed by the overhead on small examples. That's because a lot of behind-the-scenes work is needed to make it all run. That said, when you get to really big datasets, you'll be really psyched to have so many machines helping you out.
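- To predict what the three cluster calls compute, the arithmetic can be worked out locally. The Python sketch below is not output copied from the cluster. It assumes "/numbers" holds the numerals 1 through 10 one per line, and that the -letter version splits "10" into the letters 1 and 0, as the description of letter processing suggests.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


# Assumed contents of "/numbers": the numerals 1 through 10, one per line.
numbers_text = "\n".join(str(n) for n in range(1, 11))

letters = [int(ch) for ch in numbers_text if not ch.isspace()]
words = [int(w) for w in numbers_text.split()]
sentences = [[int(w) for w in line.split()]
             for line in numbers_text.splitlines()]

by_letter = reduce(add, map(square, letters))
by_word = reduce(add, map(square, words))
by_sentence = reduce(add, map(lambda s: reduce(add, map(square, s)),
                              sentences))

print(by_letter, by_word, by_sentence)  # 286 385 385
```

Under these assumptions the -letter version differs from the other two only because 10 contributes 1 + 0 instead of 100.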
How about other files and examples?
"/beatles-songs" is small and has all the Beatles song names. There are 13 files in this directory, which you can think of as being all in one file. The files are:
The collected works of William Shakespeare
The collected works of Charles Dickens
;; I wonder how many times Shakespeare wrote the word love?
STk> (define (love? w) (equal? w 'love))
STk> (reduce-map-word + (lambda (w) (if (love? w) 1 0)) "/gutenberg/shakespeare")
Mapreduce in progress! Your ID number is 2052777448. For progress info, see
;; Let's double-check that
STk> (reduce-map-sent + (lambda (s) (appearances 'love s)) "/gutenberg/shakespeare")
Mapreduce in progress! Your ID number is 861649500. For progress info, see
;; I wonder what words in the Beatles songs start with u?
STk> (reduce-map-word se (lambda (w) (if (equal? (first w) 'u) w '())) "/beatles-songs")
Mapreduce in progress! Your ID number is 447796622. For progress info, see
("universe" "us" "u.s.s.r.")
;; What songs from the Beatles' Abbey Road have the word "the" in them?
STk> (reduce-map-sent se (lambda (s) (if (member? 'the s) s '())) "/beatles-songs/abbey-road")
Mapreduce in progress! Your ID number is 1945545779. For progress info, see
("here" "comes" "the" "sun" "she" "came" "in" "through" "the" "bathroom" "window" "the" "end")
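- Two patterns show up in these examples: counting (map each item to 1 or 0, reduce with +) and filtering (map each item to itself or to an empty sentence, reduce with se). A Python sketch of both follows. The sample text is made up for illustration and is not a file on the cluster; list concatenation stands in for `se`.

```python
from functools import reduce
from operator import add

# Made-up sample text standing in for a file on the cluster.
words = "love me do all you need is love under us".split()

# Counting: map each word to 1 or 0, then add everything up.
love_count = reduce(add, map(lambda w: 1 if w == "love" else 0, words))

# Filtering: map each word to a one-item list (keep) or an empty
# list (drop), then concatenate the lists, as `se` does with '().
starts_with_u = reduce(add, map(lambda w: [w] if w.startswith("u") else [],
                                words))

print(love_count)     # 2
print(starts_with_u)  # ['under', 'us']
```

Concatenation is associative but not commutative, so on a real cluster the kept items may come back in a different order from run to run.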
- We saw a CS3 implementation that allows you to play with cluster computing...
In lab this week
- You'll continue to work on your project and you'll experiment with this yourself
In life this week
- Will the Dow drop below 7k?
- We celebrate winning the Big Game and beating Rutgers! Bowl-bound we are! Bowl-bound Stanford is not!