MapReduce

MapReduce is a framework for performing computations on BIG sets of data by parallelizing it across computers. It is built on many things we've learned about in this class: concurrency, client-server communication, higher-order procedures, list processing, functional programming, and streams. But the nice thing is you really don't have to worry about most of that when you use it! The whole point of a framework like this is that you only have to give it information about your specific problem, and it'll handle all the parallelism and computation for you. Higher-order procedures on steroids!

It's called "MapReduce", but really it's Map[Filter][Group][Sort]Reduce! Let's break it down.

; (mapreduce <mapper> <reducer> <base-case> <input>)
(mapreduce (lambda (line-pair)
             (list (make-kv-pair (kv-key line-pair) 1)) )
           +
           0
           "/beatles-songs")

The diagram in the lecture notes should help a lot.


The first thing you might want to know about MapReduce is how to do Nothing. That is, how can we take an input and pass it through unchanged?

For regular map, we'd just use the identity function, (lambda (x) x). But MapReduce's mappers take a key-value pair and return a list of key-value pairs. If we want to keep everything the same, we need to put our input in a list.

The reduction stage, on the other hand, is exactly like accumulate, so what we need is something that puts a list back together out of our input data. Put these two together, and...

(define (do-nothing input)
  (mapreduce list cons '() input) )

You can use cons-stream (and the-empty-stream) if there's a lot of data for each key and you don't want to show it all at once. (Sometimes you'll even get a crash if you use cons, cause there's just so much data. In these cases, you need to use cons-stream.)


A lot of MapReduce problems are about formulating your problem in terms of mapping, filtering, sorting, and reducing. For instance, here's a variation on a past midterm question, similar to what we did in section:

Consider the following MapReduce query:

(define (count-words input)
  (list (make-kv-pair (kv-key input) (length (kv-value input))) ))

(mapreduce count-words + 0 "/shakespeare")

The result is a stream of key-value pairs, where the keys are the names of Shakespeare plays, and the values are the number of words in the play.

((a-lovers-complaint . 2568) (a-midsummer-nights-dream . 17608) ...)

Back to Jordy's home page | Course home page