University of California, Berkeley
EECS Department - Computer Science Division
CS3 Lecture 14 : MapReduce
- Are there any questions about the project?
Review - Fractals
- We saw six examples of fractals; what a wonderful, beautiful world is mathematics...
- Today we're going to see the power of using multiple computers to help solve a problem, and how easy this can be if you use the right language and abstraction!
- Make sure you also have the four excellent CS3 MapReduce Computer Science Illustrated illustrations handy when reading these notes.
Non-computer example : Sorting cards
- Problem: We have a shuffled deck of cards and want to sort it
- If you get good at this, you can try to compete for the card sorting world record (now 36 sec)!
- Here's the order: A♣ K♣ Q♣ J♣ 10♣ ... 2♣ A♠ K♠ ... 2♠ A♥ K♥ ... 2♥ A♦ K♦ ... 2♦
- Let's have a competition ... Dan vs the class (who will be given time to plot a strategy).
- How did they do? Were they faster or slower than Dan? How much? Why weren't they 5 times faster?
Working in parallel
- When dividing a problem among others, there are overhead costs:
- Time to divide the data among the helpers
- Time to ship the data to the helpers
- Time to ship the result back to the "dispatcher"
- Time to recombine the parts together
- There's also an issue -- sometimes problems (i.e., programs) have parts that cannot be parallelized! Those serial parts get no speedup at all, which limits the overall speedup no matter how many helpers we add (this limit is known as Amdahl's law).
- We haven't even talked about how hard this might be to do in a programming language!
- Failures : What if one of the workers gets sick? How does the dispatcher know and replace him/her?
- Load balancing : What if one of the workers is really fast and gets their part done early? How does the dispatcher know so that they can be given more work to do?
- Out-of-order : What if the workers each take an arbitrary amount of time to return the answer? How do we put Humpty Dumpty back together if the pieces are returned in an order we didn't expect?
Real-world problems that currently use parallel computation
- SETI @ Home - search for signals that may indicate extraterrestrial intelligence!
- Nutritious Rice for the World - help farmers use breeding to produce rice strains with "higher crop yields, greater disease and pest resistance, and that will provide a full range of bioavailable nutrients thereby benefiting those in regions where hunger and nutrient deficiency is a critical concern."
- Help Conquer Cancer - "Improving the protein crystallography pipeline will enable researchers to determine the structure of many cancer-related proteins faster. This will lead to improving our understanding of the function of these proteins, and enable potential pharmaceutical interventions to treat this deadly disease."
- Folding @ home - "understand protein folding, misfolding, and related diseases"
- Generate Google's index of the web - They also use parallel computation to search this index
- Render Farm - Allows companies like Pixar, Dreamworks, Sony, ILM, etc. to render 3D films
- Many many others!
Programming in a way that makes parallel programming easy ... MapReduce
- Google (perhaps you've heard of them) took a look at a common parallel software pattern called MapReduce, which takes some (often large amount of) data, maps it, and then reduces it.
- The important fact here is that their thousands of computers can help with the mapping part and the reducing part
- The mapreduce software handles failures and load balancing automatically!
- This is such a common pattern that they've found literally thousands of small uses for it in their company, and they use it to build up their internal databases which are queried when you type something into the Google search field.
- Yahoo (and others) wrote an open-source version of the same thing called Hadoop, and we have spent over a year making Scheme talk to it and developing curricula for it.
- The take-away big idea here is that by staying with a beautiful functional language (like, oh, say, Scheme), we can parallelize the computation with almost no work!
MapReduce in CS3 on one local machine
- Here's what the pattern looks like for CS3:
;; 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100 = 385
STk> (reduce + (map square '(1 2 3 4 5 6 7 8 9 10)))
- That's it! You take some data, map something over it, and then reduce it.
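To see the two stages separately, try them at the STk prompt (square multiplies a number by itself):

```scheme
;; Stage 1: map squares each element of the list.
STk> (map square '(1 2 3 4 5 6 7 8 9 10))
(1 4 9 16 25 36 49 64 81 100)
;; Stage 2: reduce combines the mapped results into a single value.
STk> (reduce + '(1 4 9 16 25 36 49 64 81 100))
385
```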
MapReduce in CS3 on a cluster of machines!
- A cluster is a whole bunch of machines (usually in a machine room or datacenter) that can be called upon to help you with a big compute task.
- We have a cluster of 26 Dell computers we house in the Soda basement (thanks to a grant from Google and Intel) called icluster. Imagine 26 computers helping you with your big compute task!
- In CS3, we're working to hide a mountain of technical details of the implementation from you, but, like the little Dutch boy with his finger in the dike, sometimes these details slip through. Here goes...
- Files -- the cluster takes its data from files (or directories) instead of scheme expressions.
- This means we have to ship our data to the cluster file system first before we can work on it. We've done that for some interesting datasets, so you don't have to worry about how to do this. That said, if you have a particularly fun dataset you'd like to crunch, let your TA know and we'll look into putting it on the cluster.
- There's a web interface to the MapReduce filesystem, if you'd like to browse some of the files and directories you can use.
- New functions that use the cluster:
(reduce-map-letter reducer-fun mapper-fun filename-or-directory)
(reduce-map-word reducer-fun mapper-fun filename-or-directory)
(reduce-map-sent reducer-fun mapper-fun filename-or-directory)
- The reducer-fun is a standard reducer (two inputs, one output), but with two significant differences from the usual reducer. These arise because we want this process to be as fast as possible, and the helper machines might return their answers out of order.
- We cannot predict the order of the arguments. This is because the system starts the reduction as soon as it can, with the first two machines that return map results. Conceptually, you can imagine the input to the reducer being passed through an "unsort" stage, which returns the same list with its elements in an arbitrary order. E.g., (unsort '(a b c)) could return (a b c), (a c b), (b a c), (b c a), (c a b), or (c b a). If you don't want your output to depend on which machines returned first (i.e., on what unsort produced), make sure your reducer-fun is commutative: (reducer-fun x y) = (reducer-fun y x). E.g., +, *, max. (Note that - is not commutative!) That way, you won't notice a difference, and your output will go from being nondeterministic (unpredictable) to deterministic (predictable).
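A quick way to check whether a reducer is commutative is to swap its arguments at the STk prompt; here + passes and - fails:

```scheme
STk> (+ 10 3)
13
STk> (+ 3 10)
13    ;; same answer: + is commutative
STk> (- 10 3)
7
STk> (- 3 10)
-7    ;; different answer: - is NOT commutative
```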
- We cannot predict the associativity of the reductions; we're no longer in right-associative land, Toto! This is because the system starts an additional reduction as soon as it has two valid inputs, whether from a mapper or from another reducer. Remember, the goal of the reduction is to take many values (the results of all the mappers) and get down to one. Conceptually, you can imagine the reduction being done not by our standard, right-associative reduce function, but by a new reduce-arbitrarily function which chooses adjacent pairs arbitrarily and calls the reducer on them. If you don't want your output to depend on how the reductions were grouped (i.e., on what reduce-arbitrarily did), make sure your reducer-fun is associative: (reducer-fun a (reducer-fun b c)) = (reducer-fun (reducer-fun a b) c). E.g., +, *, max. (Note that - is not associative!) That way, you won't notice a difference, and your output will go from being nondeterministic (unpredictable) to deterministic (predictable).
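Similarly, you can check associativity by regrouping. Our ordinary reduce groups from the right, so (reduce - '(10 3 2)) computes (- 10 (- 3 2)); the cluster might group from the left instead:

```scheme
STk> (- 10 (- 3 2))   ;; right-associative grouping
9
STk> (- (- 10 3) 2)   ;; left-associative grouping
5                     ;; different answer: - is NOT associative
STk> (+ 10 (+ 3 2))
15
STk> (+ (+ 10 3) 2)
15                    ;; same answer: + is associative
```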
- The filename-or-directory last argument is a word (enclosed in double-quotes) specifying a file or directory in our system. Directories with files in them are concatenated together and treated as if they were one big file. For this example, we'll refer to a file called "/ilovecs3", which contains exactly the following two lines:
i love
cs3
- The effect of each of these functions is almost exactly the same; the only thing that differs is how the file is processed and turned into a list (that's why we have -letter, -word, and -sent versions). That is, each one below does exactly this:
(reduce-arbitrarily reducer-fun (unsort (map mapper-fun file-as-list)))
- Here's how each function converts a file into the list that map sees (i.e., file-as-list):
- reduce-map-letter : ignores carriage returns and spaces (actually whitespace in general) and treats the file as a big list of all its letters. If the file is "/ilovecs3", file-as-list is '(i l o v e c s 3), so the domain of mapper-fun is a letter.
- reduce-map-word : ignores carriage returns and treats the file as a big list of words. If the file is "/ilovecs3", file-as-list is '(i love cs3), so the domain of mapper-fun is a word.
- reduce-map-sent : converts every line into a sentence and makes a list of all the sentences. If the file is "/ilovecs3", file-as-list is '((i love) (cs3)), so the domain of mapper-fun is a sentence.
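Since the cluster is only available in lab, here's a single-machine sketch of unsort and reduce-arbitrarily you can play with locally. This is just a model of the cluster's behavior, not the real implementation, and it assumes STk's (random n) primitive:

```scheme
;; Remove the nth element (0-indexed) from a list.
(define (remove-nth lst n)
  (if (= n 0)
      (cdr lst)
      (cons (car lst) (remove-nth (cdr lst) (- n 1)))))

;; Return the first n elements of a list.
(define (first-n lst n)
  (if (= n 0)
      '()
      (cons (car lst) (first-n (cdr lst) (- n 1)))))

;; unsort: return the same elements in an arbitrary order.
(define (unsort lst)
  (if (null? lst)
      '()
      (let ((n (random (length lst))))
        (cons (list-ref lst n) (unsort (remove-nth lst n))))))

;; reduce-arbitrarily: repeatedly pick an arbitrary adjacent pair
;; and combine it with the reducer, until one value remains.
(define (reduce-arbitrarily reducer lst)
  (if (null? (cdr lst))
      (car lst)
      (let ((n (random (- (length lst) 1))))
        (reduce-arbitrarily
         reducer
         (append (first-n lst n)
                 (list (reducer (list-ref lst n) (list-ref lst (+ n 1))))
                 (list-tail lst (+ n 2)))))))
```

Try (reduce-arbitrarily + (unsort '(1 2 3 4 5))) a few times: with + you always get 15, but with list as the reducer the shape of the answer changes from run to run, just like the cluster examples below.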
- ...and what makes this so powerful is that we don't have to have just ONE machine doing the mapping, but an army of machines!
- ...and when we're doing the reduction, we can use lots of machines, too.
Show me the money! Let's see an example...
- We'll have to log into the cluster to run this. We're going to use the file "/1-10", which simply has the numbers 1-10:
unix% ssh firstname.lastname@example.org
icluster1  ~ > stk-simply
STk> (define (square x) (* x x))
STk> (define (identity arg) arg)
STk> (reduce + (map square '(1 2 3 4 5 6 7 8 9 10)))
;; This should be 286, because the "10" will be treated as a "1" and a "0": 385 - 10^2 + 1^2 + 0^2 = 286
STk> (reduce-map-letter + square "/1-10")
Mapreduce in progress! Your ID number is 933510232. For progress info, see
STk> (reduce-map-word + square "/1-10")
Mapreduce in progress! Your ID number is 28540981. For progress info, see
STk> (reduce-map-sent + (lambda (s) (reduce + (map square s))) "/1-10")
Mapreduce in progress! Your ID number is 475145543. For progress info, see
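You can predict the -letter answer on one machine: ignoring whitespace, the characters of "/1-10" are the digits 1 through 9 followed by the "1" and "0" of "10":

```scheme
STk> (reduce + (map square '(1 2 3 4 5 6 7 8 9 1 0)))
286
```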
;; Let's now try a non-associative, non-commutative reducer...
STk> (reduce-map-word list identity "/1-10")
Mapreduce in progress! Your ID number is 200766683. For progress info, see
(3 ((5 (8 (7 2))) ((1 9) ((4 6) 10))))
STk> (reduce-map-word list identity "/1-10")
Mapreduce in progress! Your ID number is 1615407800. For progress info, see
(9 ((7 1) (2 ((6 10) (8 ((3 5) 4))))))
- You'll probably be annoyed by the overhead on small examples. That's because a lot of behind-the-scenes work is needed to make it all happen. That said, when you get to really big datasets, you'll be really psyched to have so many machines helping you out.
How about other files and examples?
- "/beatles-songs" -- This one is small and has all the Beatles' song names. There are 13 files in this directory, which you can think of as being all in one file.
- "/gutenberg/shakespeare" -- The collected works of William Shakespeare
- The collected works of Charles Dickens
;; I wonder how many times Shakespeare wrote the word love?
STk> (define (love? w) (equal? w 'love))
STk> (reduce-map-word + (lambda (w) (if (love? w) 1 0)) "/gutenberg/shakespeare")
Mapreduce in progress! Your ID number is 2052777448. For progress info, see
;; Let's double-check that
STk> (reduce-map-sent + (lambda (s) (appearances 'love s)) "/gutenberg/shakespeare")
Mapreduce in progress! Your ID number is 861649500. For progress info, see
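On one machine, the same counting pattern looks like this (with a made-up six-word snippet standing in for all of Shakespeare); appearances counts how many times a word occurs in a sentence:

```scheme
;; The -word way: map each word to 1 or 0, then sum.
STk> (reduce + (map (lambda (w) (if (equal? w 'love) 1 0))
                    '(love me do i love cs3)))
2
;; The -sent way: count appearances per sentence, then sum.
STk> (reduce + (map (lambda (s) (appearances 'love s))
                    '((love me do) (i love cs3))))
2
```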
;; I wonder what words in the Beatles songs start with u?
STk> (reduce-map-word se (lambda (w) (if (equal? (first w) 'u) w '())) "/beatles-songs")
Mapreduce in progress! Your ID number is 447796622. For progress info, see
(us universe u.s.s.r.)
;; What songs from the Beatles' Abbey Road have the word "the" in them?
STk> (define (keep-the-sents s) ;; sentences with "the" in them pass through
(if (member 'the s)
(list s) ;; "buffer" the sentence with a list
(list))) ;; the rest are turned into null lists.
STk> (reduce-map-sent append keep-the-sents "/beatles-songs/abbey-road")
Mapreduce in progress! Your ID number is 1945545779. For progress info, see
((the end) (here comes the sun) (she came in through the bathroom window))
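You can test keep-the-sents by hand before sending it to the cluster. Note that append (the reducer above) is associative but not commutative, so you'll always get the same set of songs, but their order may differ from run to run:

```scheme
STk> (keep-the-sents '(here comes the sun))
((here comes the sun))
STk> (keep-the-sents '(oh darling))
()
```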
- We saw a CS3 implementation that allows you to play with cluster computing...
In lab this week
- You'll continue to work on your project and you'll experiment with this yourself
In life this week
- Um, sorry, I missed lab, I'm working from home! (Swine Flu outbreak in the world!!)