Department of Electrical Engineering & Computer Sciences
Instructional Support Group

/share/b/pub/mapreduce.help
/share/b/pub/hadoop.help
/share/b/pub/mpi.help
/share/b/pub/icluster.help

Apr 19, 2019

EECS Instruction is no longer running a Hadoop and MapReduce cluster.  The
documentation below is obsolete but is retained as a reference until we can
rewrite it.

The Savio cluster is a current alternative for classes that want to process
large data sets through parallel and distributed computing.  Instructors can
use their Instructional Computing Allowances to get some nodes on Savio.
Savio supports Spark (https://spark.apache.org/), but not MapReduce or YARN
(https://hadoop.apache.org/).  For more information about the Savio cluster,
please see

http://research-it.berkeley.edu/services/high-performance-computing/using-hadoop-and-spark-savio
http://research-it.berkeley.edu/services/high-performance-computing/getting-account/instructional-computing-allowance

Feb 23, 2012

CONTENTS

Icluster: Computing Cluster for EECS Students
Icluster: How to Logon
Icluster: How to Run MapReduce via ssh shell
Icluster: How to Run MapReduce within UCB Scheme
Icluster: How to Run Parallel Programs using Torque/Maui
ICluster: How to View the WEB Logs
Icluster: Reserving Time for a Class
Icluster: Setup info for ISG Sysadmins
MapReduce and MPI
Computing Clusters for EECS Researchers
Google/IBM "Cloud Cluster"
Amazon AWS EC2 Elastic Compute Cloud

Icluster: Computing Cluster for EECS Students
---------------------------------------------
For students to do coursework in parallel computing, EECS Instruction
supports an on-site cluster called the "Icluster" that is available to all
students who have Instructional computer accounts:

"ICluster" (aka "Google/Intel Cluster"): This is a cluster of 26 DELL
computers in a water-cooled rack in an EECS server room, managed by EECS
technical support staff (inst@eecs.berkeley.edu).  It consists of 26
Intel/DELL 1950s (2 quad-core 2.33GHz Xeon, 8-GB RAM) running Debian Linux.
The computers were purchased from DELL in August 2007 with funds that were
donated to EECS by GOOGLE and INTEL.  We deployed it as an 8-node cluster
that was used by CS61A in Fall 2007 and expanded it to 26 nodes in January
2008.  Any EECS class can use it.

MapReduce/Hadoop programming tools are available.  MPI programming tools
are not currently installed.

Icluster: How to Logon
----------------------
Students with EECS Instructional computer accounts can use MapReduce on the
multi-node Instructional cluster by logging into the job submission server
icluster1.eecs.berkeley.edu using 'ssh' or 'putty' (see
http://inst.eecs.berkeley.edu/connecting.html).  Use your Instructional UNIX
login/password, as on cory.eecs.berkeley.edu.  For information about getting
an EECS Instructional computer account, see
http://inst.eecs.berkeley.edu/new-users.html.

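For example, a typical login session starts like this ('yourlogin' is a
placeholder for your own Instructional account name):

   % ssh yourlogin@icluster1.eecs.berkeley.edu

The Hadoop commands described in the next section are run from that shell
on icluster1.
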
Icluster: How to Run MapReduce via ssh shell
--------------------------------------------
Here is a simple test of the MapReduce implementation on
icluster1.eecs.berkeley.edu:

   #!/bin/bash -x
   # Point to the shared Hadoop installation and configuration.
   export HADOOP_ROOT=/home/aa/projects/hadoop
   export HADOOP_INSTALL=$HADOOP_ROOT/hadoop
   export HADOOP_EXAMPLES=$HADOOP_INSTALL/hadoop-examples*.jar
   export HADOOP_CONF_DIR=$HADOOP_ROOT/conf
   export PATH=$HADOOP_INSTALL/bin:$PATH
   # Clear output from any previous run, copy the sample input into HDFS,
   # run the example wordcount job, and display the results.
   hadoop fs -rmr gutenberg gutenberg-output
   hadoop dfs -copyFromLocal $HADOOP_ROOT/examples/gutenberg gutenberg
   hadoop jar $HADOOP_EXAMPLES wordcount gutenberg gutenberg-output
   hadoop dfs -ls gutenberg-output
   hadoop dfs -cat gutenberg-output/part-r-00000
   hadoop dfsadmin -report

Icluster: How to Run MapReduce within UCB Scheme
------------------------------------------------
CS61A teaching staff have developed a module that runs MapReduce functions
within UCB Scheme, a Scheme interpreter that is written in STk and has
home-grown extensions for our CS Lower Division classes.  It requires a
customized version of Hadoop that is owned by CS61A and that runs on the
Icluster.

For instructions about how the CS61A instructor can start Hadoop for the
UCB Scheme/MapReduce functions and how students can use it, please see
http://inst.eecs.berkeley.edu/cgi-bin/pub.cgi?file=scheme.help.  For the
authors' technical report, please see "Infusing Parallelism into
Introductory Computer Science Curriculum using MapReduce"
(http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-34.html).

Icluster: How to Run Parallel Programs using Torque/Maui
--------------------------------------------------------
The ICluster runs the Torque Resource Manager and the Maui Cluster
Scheduler:

http://www.clusterresources.com/pages/products/torque-resource-manager.php
http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php

These are used to schedule programs that can run on multiple nodes of the
cluster.  Please see http://www.millennium.berkeley.edu/docs/torque.html
for commands and examples [thanks to Mike Howard].

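As a rough sketch of what a Torque batch job looks like (the resource
requests and the program name 'my_parallel_program' are placeholders, not
Icluster defaults; see the Millennium page above for the locally supported
options), create a small job script such as:

   #!/bin/bash
   #PBS -N myjob                   # job name
   #PBS -l nodes=2:ppn=8           # example request: 2 nodes, 8 processors each
   #PBS -l walltime=00:10:00       # example 10-minute time limit
   cd $PBS_O_WORKDIR               # start in the directory where 'qsub' was run
   ./my_parallel_program           # placeholder for your own parallel program

Submit it with '% qsub myjob.sh' and check its status with '% qstat'.
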
ICluster: How to View the WEB Logs
----------------------------------
Here is a WEB site that displays statistics for the Icluster:

http://icluster1.eecs.berkeley.edu:50030/jobtracker.jsp

Icluster: Reserving Time for a Class
------------------------------------
These classes wish to reserve the Icluster for their exclusive use:

CS61A - week 13 of the semester
CS61C - week 15 of the semester

Instructors should please contact inst@eecs.berkeley.edu if they wish to
assign projects involving parallelism on the Icluster.  Programs that do
not use parallelism should not be run on the Icluster.  Please refer to
http://inst.eecs.berkeley.edu/~inst/iesglabs.html for information about
our other computers.

Icluster: Setup info for ISG Sysadmins
--------------------------------------
There are 2 Hadoop versions that run on the Icluster:

1) The current version is in ~hadoop/hadoop.  This is what most people
   should use.

2) The old version for UCB Scheme is in ~cs61a/projects/hadoop.  This is
   for classes that want to use MapReduce with STk; it was developed by
   students for CS61A.  CS61AS may still use the old version for UCB
   Scheme.  Starting in Fall 2011, CS61A has converted to Python3 and uses
   the current version (#1, above).

Here are the setup commands for a new installation of the HDFS file system
for the current version (#1, above):

Login as 'hadoop' on Icluster1.  In ~hadoop/conf, edit these files to
include:

   core-site.xml:
      hadoop.tmp.dir        /scratch/hadoop/${user.name}
      fs.default.name       hdfs://icluster1:9010

   mapred-site.xml:
      mapred.job.tracker    icluster1:9011
      mapred.system.dir     /scratch/hadoop/mapred/system

   hadoop-env.sh:
      export JAVA_HOME=/usr/lib/jvm/java-6-sun

   slaves:
      list all of the nodes in the cluster

Recreate the UNIX and HDFS directory structures:

   % export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.12   # for example
   % cd ~hadoop/hadoop/bin
   % eval ./setup.sh -s                  # set env variables for bash
   % ./stop-all.sh                       # stop hadoop if running
   % cd /scratch; chmod 755 /scratch     # do this on all cluster nodes
   % mkdir /scratch/hadoop               # do this on all cluster nodes
   % chmod 777 /scratch/hadoop           # do this on all cluster nodes
   % ./hadoop fs -mkdir /user
   % ./hadoop fs -chown hadoop /user
   % ./hadoop fs -chmod 777 /user
   % ./hadoop fs -chmod 777 /scratch/hadoop
   % ./hadoop namenode -format
   % ./start-dfs.sh
   % ./hadoop dfsadmin -refreshNodes
   % ./stop-all.sh
   % ./start-all.sh
   % ./hadoop dfsadmin -report           # verify that all nodes are OK

To add a new node, add it to ~hadoop/conf/slaves and run:

   % ~hadoop/hadoop/bin/stop-all.sh
   % ~hadoop/hadoop/bin/start-all.sh
   % ~hadoop/hadoop/bin/hadoop dfsadmin -refreshNodes

The 'hadoop' user communicates from the master (icluster1) to the slaves
via ssh, so be sure that icluster1:~hadoop/.ssh/known_hosts includes the
host key of each slave node (from /etc/ssh/ssh_host_rsa_key.pub).

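One way to collect those host keys (a sketch, assuming the slave hostnames
appear one per line in ~hadoop/conf/slaves) is with 'ssh-keyscan':

   % ssh-keyscan -t rsa -f ~hadoop/conf/slaves >> ~hadoop/.ssh/known_hosts

Compare the collected keys against /etc/ssh/ssh_host_rsa_key.pub on each
node before relying on them.
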
MapReduce and MPI
-----------------
Two basic types of programming are done on clusters: Hadoop and MPI.

MapReduce/Hadoop programming tools (installed on the Icluster):

1) Distributed processing of large input files.  The software for that is
   "MapReduce" (http://labs.google.com/papers/mapreduce.html).  MapReduce
   uses a framework called Hadoop (http://lucene.apache.org/hadoop/) to
   manage the distributed processing over multiple computers (nodes).

MPI programming tools (not installed on the Icluster):

2) Development and testing of MPI tools.  MPI (http://www.open-mpi.org) is
   a standardized API typically used for parallel and distributed
   computing.  MPI tools may include Intel performance tools, SUN's JDK,
   SBCL (Lisp), UPC (C), CAF (Fortran), Titanium (Java), CVS, SVN, Git
   (versioning), Stow, Python, Octave (scripting), and Torque/Maui (batch
   scheduling).  MPI users need batch scheduling with no contention for
   performance and timing tests.  Hadoop is still in development and does
   not yet provide batch scheduling, so it could interfere with the
   controlled environment that MPI users need for performance and timing
   tests.

References:

http://code.google.com/edu/parallel/mapreduce-tutorial.html
http://www.alphaworks.ibm.com/tech/mapreducetools/
http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
http://radlab.cs.berkeley.edu/wiki/Berkeley_Hadoop_SIG
https://bspace.berkeley.edu/portal/site/ff98d738-1d6b-4cf4-876f-e15dda302f6f
http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html
http://inews.berkeley.edu/articles/Spring2009/cloud-computing
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf

Googling for "hadoop run map-reduce" reveals further programming examples.

Computing Clusters for EECS Researchers
---------------------------------------
The Millennium project (http://www.millennium.berkeley.edu/) provides
clusters for EECS researchers.

The Cal-Grid cluster (http://ist.berkeley.edu/is/platforms/unix/calgrid/)
is available to researchers on a recharge basis.  It is supported by the
UC Berkeley IST group (servicedesk@berkeley.edu).

For more information, also see
http://inews.berkeley.edu/articles/Fall2008/computing-cluster and
http://scs.lbl.gov/.

Google/IBM "Cloud Cluster"
--------------------------
In 2008/2009, selected classes also used an off-site cluster called the
Google/IBM "Cloud Cluster".  For documentation about that, please see
http://inst.eecs.berkeley.edu/cgi-bin/pub.cgi?file=ccluster.help.

Amazon AWS EC2 Elastic Compute Cloud
------------------------------------
The Amazon EC2 Elastic Compute Cloud (http://aws.amazon.com/ec2) is a WEB
interface to a virtual computing environment.  Costs range from $0.10/hr
to $0.80/hr for compute time, plus $0.10/GB to $0.18/GB for data transfer
to and from EC2.

EECS Instructional Support
378/384/386 Cory, 333 Soda
inst@eecs.berkeley.edu