University of California at Berkeley
           Department of Electrical Engineering & Computer Sciences
                         Instructional Support Group


/share/b/pub/mapreduce.help

								Oct 26, 2007

MapReduce
---------

  MapReduce is a programming model for processing and generating large data 
  sets.  See http://labs.google.com/papers/mapreduce.html for information.
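
  As a rough local analogy only (this is not Hadoop and uses no cluster),
  the three phases of the model can be sketched with ordinary Unix tools
  for the word-count problem.  The function names map_words, shuffle_pairs,
  reduce_counts and wordcount_local are hypothetical names chosen for this
  illustration.

```shell
# Single-machine sketch of the map, shuffle and reduce phases for word
# counting.  All function names here are hypothetical, for illustration.

# Map: emit one "word<TAB>1" pair for every word on standard input.
map_words() {
    tr -s '[:space:]' '\n' | awk 'NF { print $1 "\t1" }'
}

# Shuffle: sort the pairs so that identical keys become adjacent.
shuffle_pairs() {
    LC_ALL=C sort
}

# Reduce: sum the counts of each run of identical keys.
reduce_counts() {
    awk -F'\t' '
        $1 != key { if (NR > 1) print key "\t" total; key = $1; total = 0 }
                  { total += $2 }
        END       { if (NR > 0) print key "\t" total }'
}

# Chain the three phases together.
wordcount_local() {
    map_words | shuffle_pairs | reduce_counts
}
```

  For example, "wordcount_local < somefile.txt" prints one word and its
  count per line.  On a real cluster the framework runs many map and reduce
  tasks in parallel and moves the intermediate pairs between machines.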

  MapReduce is typically run on a parallel computing cluster using a framework
  such as Hadoop to manage the distributed computing platform.  See
  http://lucene.apache.org/hadoop/ for more information.

  EECS Instruction has received grants from Google and Intel for the creation
  of a 26-node computing cluster, which will be available to EECS classes in 
  Spring 2008.  The cluster is called the "Icluster".   We are installing 
  Hadoop and MapReduce there.

MapReduce on the Icluster
-------------------------

  The Instructional "Icluster" is still under development (Oct 2007).  Some
  users are developing programs for classes now.  Information about logging
  onto the cluster and running programs will be added here later.

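  Here is a simple test of the MapReduce implementation on the Icluster.
  It copies the sample Gutenberg texts into the Hadoop file system, runs
  the bundled wordcount example on them, and prints the results: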

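	# Set up the Hadoop environment (sh/bash syntax).  The variables are
	# exported so that the hadoop command can see them.
	export HADOOP_HOME=/home/aa/projects/hadoop
	export HADOOP_INSTALL=$HADOOP_HOME/hadoop
	export HADOOP_CONF_DIR=$HADOOP_HOME/hadoop-conf
	export PATH=$HADOOP_INSTALL/bin:$PATH

	# Copy the sample texts into the Hadoop file system.
	hadoop dfs -copyFromLocal $HADOOP_HOME/sample/gutenberg gutenberg
	# Remove output left by any earlier run (an error here is harmless).
	hadoop dfs -rmr gutenberg-output
	# Run the bundled wordcount example.
	hadoop jar $HADOOP_INSTALL/hadoop-0.14.1-examples.jar wordcount \
		gutenberg gutenberg-output
	# List the output directory and print the results.
	hadoop dfs -ls gutenberg-output
	hadoop dfs -cat gutenberg-output/part-00000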
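
  The raw output can be long.  A small filter such as the following sketch
  shows only the most frequent words.  It assumes that each output line has
  the form word<TAB>count, which is what the stock wordcount example is
  expected to produce; top_words is a hypothetical helper name.

```shell
# top_words [N]: read "word<TAB>count" lines on standard input and print
# the N lines with the highest counts (default 10).  Ties are broken by
# word, in ascending order.  Hypothetical helper, for illustration.
top_words() {
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "${1:-10}"
}
```

  With the environment set up as above, it could be used as
  "hadoop dfs -cat gutenberg-output/part-00000 | top_words 20".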


References
----------

  http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html

  A web search for "hadoop run map-reduce" turns up programming examples.


						EECS Instructional Support
						378/384/386 Cory, 333 Soda
						inst@eecs.berkeley.edu