MapReduce

MapReduce is a framework for performing computations on BIG sets of data by parallelizing it across computers. It is built on many things we've learned about in this class: concurrency, client-server communication, higher-order procedures, list processing, functional programming, and streams. But the nice thing is you really don't have to worry about most of that when you use it! The whole point of a framework like this is that you only have to give it information about your specific problem, and it'll handle all the parallelism and computation for you. Higher-order procedures on steroids!

It's called "MapReduce", but really it's Map[Filter][Group][Sort]Reduce! Let's break it down.

; (mapreduce <mapper> <reducer> <base-case> <input>)
(mapreduce (lambda (line-pair)
             (list (make-kv-pair (kv-key line-pair) 1)) )
           +
           0
           "/beatles-songs")

input is either the name of a MapReduce file or directory (like "/gutenberg/shakepeare" or "/beatles-songs"), or the output from a previous call to MapReduce.
mapper is a procedure that takes a key-value pair and returns a list of key-value pairs. Usually, the input key will be the name of a file and the value will be a line from that file. This is what lets you perform the map and filter steps of MapReduce.
The group and sort stages happen invisibly! Behind-the-scenes, MapReduce takes all the output from the mappers and puts them together, sorting them all by key. Then it groups all values with the same key, so the association list ((a . 1) (a . 2) (a . 3) (b . 4)) becomes ((a 1 2 3) (b 4)). The values are then sent on for reduction.
reducer and base-case are just like the arguments to accumulate (also known as reduce). The two arguments of reducer are a value (not a kv-pair!) and the accumulated value so far (or the base-case if nothing's been accumulated yet). The return value is also a value, not a kv-pair.
After the reducer is done, the keys are reattached to the now-reduced values. ((a . 6) (b . 4)) The result is a stream of key-value pairs.

The diagram in the lecture notes should help a lot.

The first thing you might want to know about MapReduce is how to do Nothing. That is, how can we take an input and pass it through unchanged?

For regular map, we'd just use the identity function, (lambda (x) x). But MapReduce's mappers take a key-value pair and return a list of key-value pairs. If we want to keep everything the same, we need to put our input in a list.

The reduction stage, on the other hand, is exactly like accumulate, so what we need is something that puts a list back together out of our input data. Put these two together, and...

(define (do-nothing input)
  (mapreduce list cons '() input) )

You can use cons-stream (and the-empty-stream) if there's a lot of data for each key and you don't want to show it all at once. (Sometimes you'll even get a crash if you use cons, cause there's just so much data. In these cases, you need to use cons-stream.)

A lot of MapReduce problems are about formulating your problem in terms of mapping, filtering, sorting, and reducing. For instance, here's a variation on a past midterm question, similar to what we did in section:

Consider the following MapReduce query:
(define (count-words input)
  (list (make-kv-pair (kv-key input) (length (kv-value input))) ))

(mapreduce count-words + 0 "/shakespeare")
The result is a stream of key-value pairs, where the keys are the names of Shakespeare plays, and the values are the number of words in the play.
((a-lovers-complaint . 2568) (a-midsummer-nights-dream . 17608) ...)

Change either the mapper or the reducer (but not both) to find the length of the longest line in each play. (You can use an existing Scheme primitive, or write an entirely new procedure. If you change the reducer, you can also change the base case.) Show your new call to mapreduce.
(show answer)
```
; A correct solution:
(mapreduce count-words max 0 "/shakespeare")
```
Here, the key concept is to realize that the reducer is called with 2 arguments at a time, which in this case are 2 line-counts. Since we want the longest line from each play, we can simply find the "max" line length, so we need to change the reducer to max.

(hide answer)
Change either the mapper or the reducer (but not both) to count the number of times the word "thou" appears in each play. (You can use an existing Scheme primitive, or write an entirely new procedure. If you change the reducer, you can also change the base case.) Show your new call to mapreduce.
(show answer)
```
; A correct solution:
(define (count-thou input)
  (list (make-kv-pair (kv-key input)
                      (length (filter (lambda (x) (eq? x 'thou))
                                      (kv-value input))) )))
(map-reduce count-thou + 0 "/shakespeare")
```
count-thou filters every thou in the line from the play, and then finds the length of the remaining line to get the count of thous. The reducer does not need to be changed from the original problem, since all we need to do is add every count of thou together for each play.

This is not the only correct solution; you could also, for example, make one key-value pair for every thou, or even for every word (with a value of 1 for thou and 0 otherwise).

(hide answer)

Back to Jordy's home page | Course home page