Department of Electrical Engineering & Computer Sciences
Instructional Support Group
/share/b/pub/mapreduce.help
/share/b/pub/hadoop.help
/share/b/pub/mpi.help
/share/b/pub/icluster.help
Apr 19, 2019
EECS Instruction is no longer running a Hadoop and MapReduce cluster.
The documentation below is obsolete but is retained as a reference until
we can rewrite it.
The Savio cluster is a current alternative for classes that process
large data sets through parallel and distributed computing.
Instructors can use their Instructional Computing Allowances to get some
nodes on Savio. Savio supports Spark (https://spark.apache.org/), but
not MapReduce or YARN (https://hadoop.apache.org/).
For more information about the Savio cluster, please see
http://research-it.berkeley.edu/services/high-performance-computing/using-hadoop-and-spark-savio
http://research-it.berkeley.edu/services/high-performance-computing/getting-account/instructional-computing-allowance
Feb 23, 2012
CONTENTS
Icluster: Computing Cluster for EECS Students
Icluster: How to Logon
Icluster: How to Run MapReduce via ssh shell
Icluster: How to Run MapReduce within UCB Scheme
Icluster: How to Run Parallel Programs using Torque/Maui
ICluster: How to View the WEB Logs
Icluster: Reserving Time for a Class
Icluster: Setup info for ISG Sysadmins
MapReduce and MPI
Computing Clusters for EECS Researchers
Google/IBM "Cloud Cluster"
Amazon AWS EC2 Elastic Compute Cloud
Icluster: Computing Cluster for EECS Students
---------------------------------------------
For students to do coursework in parallel computing, EECS Instruction
supports an on-site cluster called "Icluster" that is available to all
students who have Instructional computer accounts:
"ICluster" (aka "Google/Intel Cluster"):
This is a cluster of 26 Intel-based Dell 1950 servers (each with 2
quad-core 2.33GHz Xeon processors and 8 GB of RAM) running Debian Linux.
It sits in a water-cooled rack in an EECS server room and is managed by
EECS technical support staff (inst@eecs.berkeley.edu). The computers
were purchased from Dell in August 2007 with funds that were donated to
EECS by Google and Intel. We deployed it as an 8-node cluster that was
used by CS61A in Fall 2007 and expanded it to 26 nodes in January 2008.
Any EECS class could use it.
MapReduce/Hadoop programming tools are available.
MPI programming tools are not currently installed.
Icluster: How to Logon
----------------------
Students with EECS Instructional computer accounts can use MapReduce on
the multi-node Instructional cluster by logging into the job submission
server
icluster1.eecs.berkeley.edu
using 'ssh' or 'putty' (see http://inst.eecs.berkeley.edu/connecting.html).
Use your Instructional UNIX login/password, as on cory.eecs.berkeley.edu.
For information about getting an EECS Instructional computer account, see
http://inst.eecs.berkeley.edu/new-users.html.
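As a minimal sketch of the login step, the snippet below prints the ssh
command for a given account rather than running it. The account name
"cs61a-xx" is a hypothetical placeholder, not a real login; substitute
your own Instructional login.

```shell
#!/bin/bash
# Sketch only: "cs61a-xx" is a hypothetical account name.
ICLUSTER_HOST=icluster1.eecs.berkeley.edu

# Print the ssh command for a given login instead of running it, so the
# command can be checked before use.
icluster_ssh_command() {
    local login="$1"
    printf 'ssh %s@%s\n' "$login" "$ICLUSTER_HOST"
}

icluster_ssh_command cs61a-xx
```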
Icluster: How to Run MapReduce via ssh shell
--------------------------------------------
Here is a simple test of the map-reduce implementation on
icluster1.eecs.berkeley.edu:
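#!/bin/bash -x
# Locate the Hadoop installation and put its commands on the PATH.
export HADOOP_ROOT=/home/aa/projects/hadoop
export HADOOP_INSTALL=$HADOOP_ROOT/hadoop
export HADOOP_EXAMPLES=$HADOOP_INSTALL/hadoop-examples*.jar
export HADOOP_CONF_DIR=$HADOOP_ROOT/conf
export PATH=$HADOOP_INSTALL/bin:$PATH
# Remove the results of any earlier run (an error here is harmless on a first run).
hadoop fs -rmr gutenberg gutenberg-output
# Copy the sample texts into HDFS and run the wordcount example on them.
hadoop dfs -copyFromLocal $HADOOP_ROOT/examples/gutenberg gutenberg
hadoop jar $HADOOP_EXAMPLES wordcount gutenberg gutenberg-output
# List and print the output, then report the state of the HDFS nodes.
hadoop dfs -ls gutenberg-output
hadoop dfs -cat gutenberg-output/part-r-00000
hadoop dfsadmin -report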
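To get a feel for what the wordcount job computes, the same idea can be
sketched locally with standard UNIX tools. This is only an illustration
of the map, shuffle and reduce steps on one machine. It does not use
Hadoop, and its word-splitting rule (whitespace only) is an assumption
that may differ in detail from the Hadoop example. The function name
wordcount_local is made up for this sketch.

```shell
#!/bin/bash
# Local stand-in for the wordcount job: no Hadoop involved.
#   "map":     split the input into one word per line
#   "shuffle": sort so that identical words are adjacent
#   "reduce":  count each run of identical words
wordcount_local() {
    tr -s '[:space:]' '\n' | grep -v '^$' | sort | uniq -c |
        awk '{ print $2 "\t" $1 }'
}

# Each output line is a word, a tab, and the number of times it occurred.
printf 'the quick fox\nthe lazy dog\n' | wordcount_local
```

On the cluster, Hadoop spreads the same three steps across many nodes,
which is what makes it practical for inputs too large for one machine.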
Icluster: How to Run MapReduce within UCB Scheme
------------------------------------------------
CS61A teaching staff have developed a module for UCB Scheme that runs
MapReduce functions within UCB Scheme, a Scheme interpreter that is
written in STk and has home-grown extensions for our CS Lower Division
classes. It requires a customized version of Hadoop that is owned by
CS61A and that runs on the Icluster.
For instructions about how the CS61A instructor can start Hadoop for the
UCB Scheme/MapReduce functions and how students can use it, please see
http://inst.eecs.berkeley.edu/cgi-bin/pub.cgi?file=scheme.help.
For the authors' technical report, please see "Infusing Parallelism into
Introductory Computer Science Curriculum using MapReduce"
(http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-34.html).
Icluster: How to Run Parallel Programs using Torque/Maui
--------------------------------------------------------
The ICluster runs the Torque Resource Manager and the Maui Cluster Scheduler:
http://www.clusterresources.com/pages/products/torque-resource-manager.php
http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php
These are used to schedule programs that can run on multiple nodes of the
cluster. Please see http://www.millennium.berkeley.edu/docs/torque.html
for commands and examples [thanks to Mike Howard].
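As a rough illustration of what a Torque job looks like, the sketch
below writes a small job script to a file and shows the command that
would submit it. The job name, node count and walltime are illustrative
values chosen for this example (ppn=8 matches the 8 cores per node
described above); they are not site policy.

```shell
#!/bin/bash
# Sketch: write a Torque job script to a file and show how it would be
# submitted. The job name, node count and walltime are illustrative only.
job_file="${TMPDIR:-/tmp}/hello-nodes.pbs"

cat > "$job_file" <<'EOF'
#!/bin/bash
#PBS -N hello-nodes
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:05:00
#PBS -j oe
# Torque lists the allocated processor slots in $PBS_NODEFILE.
sort "$PBS_NODEFILE" | uniq -c
EOF

# Nothing is submitted here; the script only prints what to run next.
if command -v qsub >/dev/null 2>&1; then
    echo "submit with: qsub $job_file   (then check with: qstat)"
else
    echo "qsub not found; job script written to $job_file"
fi
```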
ICluster: How to View the WEB Logs
----------------------------------
The Hadoop JobTracker web page displays job statistics for the Icluster:
http://icluster1.eecs.berkeley.edu:50030/jobtracker.jsp
Icluster: Reserving Time for a Class
------------------------------------
These classes wish to reserve the Icluster for their exclusive use:
CS61A - week 13 of the semester
CS61C - week 15 of the semester
Instructors should contact inst@eecs.berkeley.edu if they wish
to assign projects involving parallelism on the Icluster.
Programs that do not use parallelism should not be run on the Icluster.
Please refer to http://inst.eecs.berkeley.edu/~inst/iesglabs.html for
information about our other computers.
Icluster: Setup info for ISG Sysadmins
--------------------------------------
There are 2 Hadoop versions that run on the Icluster:
1) The current version is in ~hadoop/hadoop.
This is what most people should use.
2) The old version for UCB Scheme is in ~cs61a/projects/hadoop.
This is for classes that want to use MapReduce with STk, which
was developed by students for CS61A. CS61AS may still use the
old version for UCB Scheme. Starting in Fall 2011, CS61A has
converted to Python3 and uses the current version (#1, above).
Here are the setup commands for a new installation of the HDFS file
system for the current version (#1, above):
Login as 'hadoop' on Icluster1.
In ~hadoop/conf, edit these files to include the following settings
(each property name is followed by its value):
core-site.xml:
    hadoop.tmp.dir      /scratch/hadoop/${user.name}
    fs.default.name     hdfs://icluster1:9010
mapred-site.xml:
    mapred.job.tracker  icluster1:9011
    mapred.system.dir   /scratch/hadoop/mapred/system
hadoop-env.sh:
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
slaves:
    list all of the nodes in the cluster
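In the two site files, each setting is written as a property element
with a name and a value. As a sketch, the script below generates both
files in a scratch directory so they can be reviewed before being copied
into place. CONF_OUT is a hypothetical variable introduced for this
example; the real files live in ~hadoop/conf.

```shell
#!/bin/bash
# Sketch: generate the two site files in a scratch directory for review.
# CONF_OUT is a hypothetical location, not the live configuration directory.
CONF_OUT="${CONF_OUT:-$(mktemp -d)}"

# The quoted heredoc delimiter keeps ${user.name} literal, so that Hadoop
# (not the shell) expands it.
cat > "$CONF_OUT/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/scratch/hadoop/${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://icluster1:9010</value>
  </property>
</configuration>
EOF

cat > "$CONF_OUT/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>icluster1:9011</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/scratch/hadoop/mapred/system</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_OUT/core-site.xml and $CONF_OUT/mapred-site.xml"
```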
Recreate the UNIX and HDFS directory structures:
% export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.12 # for example
% cd ~hadoop/hadoop/bin
% eval `./setup.sh -s` # set env variables for bash
% ./stop-all.sh # stop hadoop if running
% cd /scratch; chmod 755 /scratch # do this on all cluster nodes
% mkdir /scratch/hadoop # do this on all cluster nodes
% chmod 777 /scratch/hadoop # do this on all cluster nodes
% ./hadoop fs -mkdir /user
% ./hadoop fs -chown hadoop /user
% ./hadoop fs -chmod 777 /user
% ./hadoop fs -chmod 777 /scratch/hadoop
% ./hadoop namenode -format
% ./start-dfs.sh
% ./hadoop dfsadmin -refreshNodes
% ./stop-all.sh
% ./start-all.sh
% ./hadoop dfsadmin -report # verify that all nodes are OK
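The steps marked "do this on all cluster nodes" can be generated from
the slaves file. The sketch below only prints one ssh command per node;
nothing runs remotely until the output is reviewed and piped to a shell.
It assumes that the account used has permission to change /scratch on
each node, and it uses "mkdir -p" (rather than plain mkdir) so that it
can be repeated safely. The function name and the demo node names are
hypothetical.

```shell
#!/bin/bash
# Sketch: print the per-node /scratch setup commands for every host in a
# slaves file. Nothing is executed remotely.
print_scratch_setup() {
    local slaves_file="$1" node
    while read -r node || [ -n "$node" ]; do
        # Skip blank lines and comment lines.
        case "$node" in ''|'#'*) continue ;; esac
        printf 'ssh %s "chmod 755 /scratch; mkdir -p /scratch/hadoop; chmod 777 /scratch/hadoop"\n' "$node"
    done < "$slaves_file"
}

# Example with a throwaway slaves file; the node names are hypothetical.
demo_slaves="$(mktemp)"
printf 'icluster2\nicluster3\n' > "$demo_slaves"
print_scratch_setup "$demo_slaves"
```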
To add a new node, add it to ~hadoop/conf/slaves and run:
% ~hadoop/hadoop/bin/stop-all.sh
% ~hadoop/hadoop/bin/start-all.sh
% ~hadoop/hadoop/bin/hadoop dfsadmin -refreshNodes
The 'hadoop' user communicates from the master (icluster1) to the slaves
via ssh, so be sure that icluster1:~hadoop/.ssh/known_hosts includes the
host key of each slave node (from /etc/ssh/ssh_host_rsa_key.pub).
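One way to collect the host keys is ssh-keyscan, run as 'hadoop' on
icluster1. The sketch below prints one ssh-keyscan command per node
listed in a slaves file instead of running it. Because ssh-keyscan
fetches keys over the network, the results should still be compared with
/etc/ssh/ssh_host_rsa_key.pub on each node. The function name and the
demo node names are hypothetical.

```shell
#!/bin/bash
# Sketch: print an ssh-keyscan command for each slave node. Review the
# output before running it as the 'hadoop' user on icluster1.
print_keyscan_commands() {
    local slaves_file="$1" node
    while read -r node || [ -n "$node" ]; do
        # Skip blank lines and comment lines.
        case "$node" in ''|'#'*) continue ;; esac
        printf 'ssh-keyscan -t rsa %s >> ~/.ssh/known_hosts\n' "$node"
    done < "$slaves_file"
}

# Example with a throwaway slaves file; the node names are hypothetical.
demo_nodes="$(mktemp)"
printf 'icluster2\n\nicluster3\n' > "$demo_nodes"
print_keyscan_commands "$demo_nodes"
```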
MapReduce and MPI
-----------------
Two basic types of programming are done on clusters: MapReduce (Hadoop)
and MPI.
MapReduce/Hadoop programming tools (installed on the Icluster):
1) Distributed processing of large input files. The programming model for
that is "MapReduce" (http://labs.google.com/papers/mapreduce.html). Hadoop
(http://lucene.apache.org/hadoop/) is a framework that implements MapReduce
and manages the distributed processing over multiple computers (nodes).
MPI programming tools (not installed on the Icluster):
2) Development and testing of MPI tools. MPI (http://www.open-mpi.org) is
a standardized API typically used for parallel and distributed computing.
MPI tools may include Intel performance tools, Sun's JDK, SBCL (Lisp),
UPC (C), CAF (Fortran), Titanium (Java), CVS, SVN, Git (versioning),
Stow, Python, Octave (scripting), Torque/Maui (batch scheduling). They
need batch scheduling with no contention for performance and timing tests.
Hadoop is still in development and does not yet provide batch scheduling,
so it could interfere with the controlled environment that the MPI users
need for performance and timing tests.
References:
http://code.google.com/edu/parallel/mapreduce-tutorial.html
http://www.alphaworks.ibm.com/tech/mapreducetools/
http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
Googling for "hadoop run map-reduce" reveals programming examples.
http://radlab.cs.berkeley.edu/wiki/Berkeley_Hadoop_SIG
https://bspace.berkeley.edu/portal/site/ff98d738-1d6b-4cf4-876f-e15dda302f6f
http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html
http://inews.berkeley.edu/articles/Spring2009/cloud-computing
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf
Computing Clusters for EECS Researchers
---------------------------------------
The Millennium project (http://www.millennium.berkeley.edu/) provides
clusters for EECS researchers.
The Cal-Grid cluster (http://ist.berkeley.edu/is/platforms/unix/calgrid/)
is available to researchers on a recharge basis. It is supported by the
UC Berkeley IST group (servicedesk@berkeley.edu). For more information,
also see http://inews.berkeley.edu/articles/Fall2008/computing-cluster
and http://scs.lbl.gov/.
Google/IBM "Cloud Cluster"
--------------------------
In 2008/2009, selected classes also used an off-site cluster called the
Google/IBM "Cloud Cluster". For documentation about that, please see
http://inst.eecs.berkeley.edu/cgi-bin/pub.cgi?file=ccluster.help.
Amazon AWS EC2 Elastic Compute Cloud
------------------------------------
The Amazon EC2 Elastic Compute Cloud (http://aws.amazon.com/ec2) is a web
interface to a virtual computing environment. Costs range from
$0.10/hr to $0.80/hr for compute time, plus
$0.10/GB to $0.18/GB for data transfer to and from EC2.
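As a worked example of the arithmetic, the sketch below estimates the
low and high cost of a job from the rates quoted above. The workload (10
instances for 5 hours each, plus 20 GB of transfer) is a hypothetical
example, and the function name ec2_cost_range is made up for this sketch.

```shell
#!/bin/bash
# Sketch: estimate the low and high cost (in dollars) of a job from the
# rates quoted above. Arguments: instance-hours, then GB transferred.
ec2_cost_range() {
    awk -v hours="$1" -v gb="$2" 'BEGIN {
        low  = hours * 0.10 + gb * 0.10
        high = hours * 0.80 + gb * 0.18
        printf "%.2f %.2f\n", low, high
    }'
}

# 10 instances for 5 hours each (50 instance-hours) and 20 GB of transfer.
ec2_cost_range 50 20    # prints: 7.00 43.60
```

The low figure is 50 x $0.10 + 20 x $0.10 = $7.00, and the high figure is
50 x $0.80 + 20 x $0.18 = $43.60.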
EECS Instructional Support
378/384/386 Cory, 333 Soda
inst@eecs.berkeley.edu