Department of Electrical Engineering & Computer Sciences
Instructional Support Group
/share/b/pub/ccluster.help
Nov 30, 2011

CONTENTS

   Google/IBM "Cloud Cluster"
   Cloud Cluster: How to Get an Account
   Cloud Cluster: How to Run MapReduce via ssh shell
   Cloud Cluster: How to Run MapReduce via Eclipse plugin
   Cloud Cluster: How to Install the MapReduce plugin at home
   Cloud Cluster: How to Transfer Files
   Cloud Cluster: How to View the WEB Logs
   Cloud Cluster: Announcements and Scheduled Downtime
   Icluster: Computing Cluster for EECS Students

Google/IBM "Cloud Cluster"
--------------------------

In 2008/2009, selected classes ran MapReduce on an off-site cluster called
the Google/IBM "Cloud Cluster".  We have not used it lately, and this
usage information is old.  It is retained here for historical purposes.

"Cloud Cluster" (aka "Google/IBM Cluster"):  This is a remote facility
managed by Google/IBM.  It consists of 40 64-bit dual-core Opterons
(80 cpus) running Redhat SELinux on a Xen virtual machine.  Google and
IBM have donated access time to EECS students.

Google and IBM provide EECS with a limited set of "tokens" that we give
to students in selected classes.  The students generate accounts on the
cluster and use either "ssh" (or "putty") or a special Eclipse plugin to
run jobs on the cluster.  Only MapReduce/Hadoop is available.

Cloud Cluster: How to Get an Account
------------------------------------

Instructors:  Please notify inst@eecs.berkeley.edu if you are interested
in running MapReduce on the Google/IBM Cloud Cluster.

Google/IBM provides EECS with "tokens" that students use to request
accounts on the Cloud Cluster.  To get an account, students in specific
classes are authorized to check out a token from an EECS WEB site.

To get an account on the Cloud Cluster:

1) Login to http://inst.eecs.berkeley.edu/webacct/ using CalNet and
   select "Request Cloud Cluster account".  [not currently available]

2) Review the Usage Agreement.  A form with your "token" will be
   displayed.

3) Go to http://univsupport.hipods.ihost.com and select "Register".
   This puts you on the "New User Registration" page.  (Your browser
   must allow Javascript for this WEB site.)

4) Enter the token.  Request an account (a login name and password).
   There should be NO SPACES in the login name or password.  We'll refer
   to this login name as "$USER".  (Note: this login and password will
   also let you login to http://univsupport.hipods.ihost.com to
   communicate in user forums.)

5) Within 10 minutes, that account will be enabled on the Cloud Cluster.

6) You can login to http://univsupport.hipods.ihost.com using that
   account to read the documentation and notices and to submit a help
   ticket.

Now you can run programs on the Cloud Cluster either from your local
computer via the ssh proxy setup or via an Eclipse plugin (see below).

REMEMBER YOUR NEW LOGIN and PASSWORD.  Your one-time token (step 2) is
now obsolete, and it would be difficult to recover the login name and
password you created if you forget them.

NOTE: Your files on the Cloud Cluster are temporary.  They are not
backed up and could be purged at any time by the Cloud Cluster staff.
So keep a copy of your programs and input data in a safe place.

Documentation about using the Google/IBM Virtual Infrastructure (Cloud
Cluster) is under http://64.88.164.203/documents/.
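Once the account is enabled, one quick sanity check (assuming the gateway
address used in the ssh section below, 64.88.164.202, is still current)
is to try logging in to the gateway from any machine with ssh:

      ssh $USER@64.88.164.202      # $USER is the Cloud Cluster login you created

If the password you chose is accepted, the account is working and you can
continue with either the ssh or the Eclipse setup described below.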
Cloud Cluster: How to Run MapReduce via ssh shell
-------------------------------------------------

To run Hadoop on the Cluster from the command line of your local
computer, you will use ssh on your local computer to open a SOCKS proxy
to the Cloud Cluster through its gateway.

1) Login to your local computer account, open a terminal window and type:

      ssh -D 6789 -L 50128:127.0.0.1:50128 $USER@64.88.164.202

   This sets up a SOCKS proxy (via the -D option) on port 6789 that
   connects your local hadoop installation to the cluster.  The -L option
   connects your computer to a remote proxy, allowing you to browse the
   remote Hadoop web interfaces.

   Be sure to run this ssh command before you use the cluster.  Keep this
   ssh connection open for as long as you are using the cluster.

2) You also need the "hadoop-site.xml" file for the Cloud Cluster.  That
   file is already available on the Instructional UNIX computers, but you
   will have to install it yourself if you are using your own computer,
   from

      http://64.88.164.203/media/portal/sitexmls/1/3/oitdC6/hadoop-site.xml

   The local "hadoop" command determines where to run map-reduce from the
   contents of the local "hadoop-site.xml" file.

3) In your local computer account, open a new terminal window to run
   hadoop commands.  To run a basic hadoop application on the Cloud
   Cluster:

      /bin/bash                  # start a bash shell (the exports below use bash syntax)
      export HADOOP_HOME=/home/aa/projects/hadoop
      export HADOOP=$HADOOP_HOME/hadoop
      export HADOOP_EXAMPLES=$HADOOP/hadoop-*-examples.jar
      export HADOOP_CONF_DIR=$HADOOP_HOME/ccluster-conf
      export PATH=$HADOOP/bin:$PATH
      cd $HADOOP/bin
      hadoop jar hadoop-*-examples.jar pi 50 100000

   This will start a job that uses the computing power of the entire
   cluster to calculate pi.  Hopefully the value will be close to 3.14!

4) Here is an example of how to upload an input file ("gutenberg") to the
   Cloud Cluster and run map-reduce on it:

      export GUTENBERG=$HADOOP_HOME/examples/gutenberg
      cd $HADOOP/bin
      hadoop dfs -rmr gutenberg gutenberg-output    # delete old files
      hadoop dfs -put $GUTENBERG gutenberg          # install input file
      hadoop dfs -lsr                               # list the files

      # generate lists of wordcounts
      hadoop jar $HADOOP_EXAMPLES wordcount gutenberg gutenberg-output
      hadoop dfs -lsr gutenberg-output              # list the files
      hadoop dfs -cat gutenberg-output/part-00000   # display the output

      # search for the text string "Warranty"
      hadoop dfs -mkdir input output
      hadoop jar $HADOOP_EXAMPLES grep input output 'Warranty'
      hadoop dfs -cat output/*                      # display the output

The 'hadoop' command without arguments gives usage information.

Other examples that might help you write Hadoop applications are in the
java package org.apache.hadoop.examples.  You can download the Hadoop
source code from http://hadoop.apache.org/core/version_control.html and
look at how these example applications were written.
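Because your files on the cluster are temporary, it is worth copying
results back out of HDFS when a job finishes.  A minimal sketch, assuming
the same environment variables and the "gutenberg" wordcount example from
step 4 above (the local destination path is just an illustration):

      cd $HADOOP/bin
      hadoop dfs -get gutenberg-output $HOME/gutenberg-output   # copy results out of HDFS
      hadoop dfs -rmr gutenberg gutenberg-output                # clean up when you are done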
Cloud Cluster: How to Run MapReduce via Eclipse plugin
------------------------------------------------------

***********************************************************************
*** March 2009: see http://64.88.164.203/documents/2/ for current
*** instructions.  The Eclipse plugin instructions below are from
*** Spring 2008 and will be updated soon.
***********************************************************************

IBM has written an Eclipse plugin called "MapReduce" that facilitates the
use of Hadoop through the Eclipse IDE.  The "MapReduce" plugin is
installed in Eclipse on the Instructional Solaris (UNIX) computers (for
hostnames, see http://inst.eecs.berkeley.edu/labs).

To use the MapReduce plugin on an Instructional computer:

1) Start "eclipse33" and select Window->Open Perspective->Other->Mapreduce.
   You should see a new tab called "MapReduce Servers" in the Eclipse
   window.

2) Create a new MapReduce project, package and classes:

   Select File->New->Project->"MapReduce Project".  Enter a project name,
   select "Use default Hadoop" and configure the Hadoop install directory
   to be "/home/aa/projects/hadoop/hadoop-0.14.3".

   Select File->New->Package.  Enter the project name and a new package
   name.

   Select File->New->Class.  Create classes for Mapper, Reducer and
   Driver.

3) To set up your connection to the Cloud Cluster, right-click on your
   Driver class, select "Run As"->"Run on Hadoop" and enter:

      Server name:             Hadoop Server
      Hostname:                10.1.130.119
      Installation directory:  /hadoop/hadoop-0.16.0
      Username:                [your Cloud Cluster login]
      Tunnel Connections (y/n) y
      Tunnel via:              64.88.164.202
      Tunnel username:         [your Cloud Cluster login]

   Test with "Validate Location" (enter the password to the gateway, then
   the password to the job submission server ("localhost:35044")).  Look
   for the "Found Hadoop" message at the top of the window.

4) Now when you call Mapper, Reducer and Driver functions in your code,
   they can be run on the Cloud Cluster.

5) You can find help by selecting Help->Cheat Sheets->MapReduce and at
   http://www.alphaworks.ibm.com/tech/mapreducetools/

You can create Mappers, Reducers, and Drivers with coding templates.  You
can also send the applications you create to be run on the Hadoop cluster
by providing the cluster's network information, and you can track the
progress of your jobs in real time.  See the cheat sheets included with
the plugin for more details.

The University of Maryland is developing a MapReduce library for Hadoop
called Cloud9.  It is available from
http://www.umiacs.umd.edu/~jimmylin/cloud9/umd-hadoop-dist/cloud9-docs/.

If you're using Java 1.6, you must change your compiler compliance level
to 5.0 or else your code will not run on the IBM cluster.  [Thanks to
Jimmy Lin at University of Maryland for this info.]  You can do that by
changing the 'Compiler compliance level' to "5.0" in one or more of these
places, depending on the version of Eclipse:

   Window -> Preferences -> Java -> Compiler
   Eclipse -> Preferences -> Java -> Compiler
   Project -> Properties -> Java Compiler

Cloud Cluster: How to Install the MapReduce plugin at home
-----------------------------------------------------------

***********************************************************************
*** March 2009: see http://64.88.164.203/documents/2/ for current
*** instructions.  The Eclipse plugin instructions below are from
*** Spring 2008 and will be updated soon.
***********************************************************************

To connect to the Cloud Cluster, you need your own copy of Hadoop, and
you must replace the default "hadoop-site.xml" file with this one for the
Cloud Cluster:

   http://64.88.164.203/media/portal/sitexmls/1/3/oitdC6/hadoop-site.xml

This will point your local hadoop installation to the cluster through a
SOCKS proxy.  You only need to do this the first time you use the Cloud
Cluster.
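For example, on a UNIX-like machine with wget, and assuming you unpacked
Hadoop under $HOME/hadoop (that path is just an illustration), the
replacement might look like this:

      cd $HOME/hadoop/conf                     # hypothetical Hadoop install location
      mv hadoop-site.xml hadoop-site.xml.orig  # keep the stock configuration
      wget http://64.88.164.203/media/portal/sitexmls/1/3/oitdC6/hadoop-site.xml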
To install the "MapReduce" plugin on your own computer:

0) If you don't have Eclipse 3.2+ (http://www.eclipse.org/) and Java 1.5+
   (http://java.sun.com/), install those.

1) Download the MapReduce plugin from

      http://inst.eecs.berkeley.edu/~hadoop/hadoop-eclipse-plugin.jar

   OR download Hadoop 0.14.4 or later from

      http://www.apache.org/dist/hadoop/core/stable/

   and open the tar.gz archive and extract the hadoop-eclipse-plugin.jar
   file from the contrib directory.  (On UNIX, you can unbundle the *.gz
   file using 'gunzip' and 'tar'.  On Windows, you can use
   http://7-zip.org.)

2) If you are interested in running hadoop applications with scripts
   written in non-Java languages, get hadoop-streaming.jar from the same
   directory.

3) Place the Eclipse plugin jar file in the plugins directory of your
   Eclipse home directory.  The plugin requires Java 1.5+ and Eclipse
   3.2+.

4) To load the Eclipse plugin, start Eclipse with the -clean option.  You
   should now have a MapReduce perspective available.

5) Follow the instructions above to set up your connection to the Cloud
   Cluster.

6) You can find help by selecting Help->Cheat Sheets->MapReduce.

Cloud Cluster: How to Transfer Files
------------------------------------

NOTE: Your files on the Cloud Cluster are temporary.  They are not backed
up and could be purged at any time by the Cloud Cluster staff.  So keep a
copy of your programs and input data in a safe place.

To access your files on the Cloud Cluster submission computer
(10.1.130.119), you have to login from the gateway computer
(64.88.164.202) and then login to 10.1.130.119.  To avoid 2 logins, you
can set up a tunnel from your computer and use it to login and transfer
files to the Cluster.

Assuming you are logged onto an EECS UNIX computer, these commands set up
a tunnel through 64.88.164.202 to 10.1.130.119:

      set PORT1 = `echo "$$ + 1024" | bc`
      set PORT2 = `echo "$$ + 1025" | bc`
      echo $PORT1 $PORT2
      ssh -L ${PORT1}:10.1.130.119:22 -n -N $USER@64.88.164.202
      CNTL/Z
      bg

(Note, in this simple example we don't check for ports that are already
in use, so if you get an error just try new random numbers above 1024.)

Enter your password, then put the ssh command in the background by typing
the CONTROL and Z keys (both at the same time) and then "bg".  You can
regain access to this command later and kill it by typing "fg" and then
"CONTROL/C".

      ssh -D $PORT2 localhost -p $PORT1     # logs you in

Enter your password, and you will find yourself logged directly onto the
submission system (the name has "XenHost" in it) from the EECS computer.

On the EECS UNIX computer, these commands will now transfer files:

      scp -P $PORT1 yourfile $USER@localhost:       # copy to the Cluster
      scp -P $PORT1 $USER@localhost:yourfile .      # copy from the Cluster

The "$$ + 1024" part creates a random port number based on your current
process id ($$).  You need 2 port numbers above 1024 that nobody else on
the computer is using.  So if you get the error "bind: Address already in
use..." then add one to the port number and try again.  Or, type "fg"
again to see if you have left previous ssh processes running with the
port in the background.
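If you would rather check in advance whether the two ports you picked are
free (the commands above choose them blindly from your process id), a
rough check like this should work on the Instructional UNIX machines; no
output means the port number does not appear to be in use:

      netstat -an | grep $PORT1     # any output suggests the port is already taken
      netstat -an | grep $PORT2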
Cloud Cluster: How to View the WEB Logs
---------------------------------------

Hadoop has a web interface running on the cluster master (10.1.130.117).
However, for security reasons the cluster master machine is not
accessible from the Internet, which makes the interface hard to reach.
In order to access this interface, which contains logs for each
individual node, you need to set up an SSH tunnel and a SOCKS proxy.

1) First, open a tunnel to the cluster submission computer (10.1.130.119):

      set PORT1 = `echo "$$ + 1024" | bc`
      set PORT2 = `echo "$$ + 1025" | bc`
      echo $PORT1 $PORT2
      ssh -n -N -L ${PORT1}:10.1.130.119:22 $USER@64.88.164.202
      CNTL/Z
      bg
      ssh -n -N -D $PORT2 localhost -p $PORT1
      CNTL/Z
      bg

2) Then, start your browser and edit its proxy preferences to create a
   SOCKS 4.0 proxy (not SOCKS 5.0) on "localhost" using port $PORT2
   (the value for $PORT2 that you created above).

3) Then enter one of these addresses in your browser

      http://10.1.130.117:50030     # JobTracker
      http://10.1.130.117:50060     # TaskTracker
      http://10.1.130.117:50070     # HDFS Info

   and you will have full access to all the logs.

(You can use the Firefox SwitchProxy extension to easily switch between
browsing via the proxy and normal proxyless browsing.)

Cloud Cluster: Announcements and Scheduled Downtime
---------------------------------------------------

Starting May 1 2008, you can logon to http://univsupport.hipods.ihost.com
to read announcements and forums about the Cloud Cluster.

The Google/IBM Cloud Cluster will be down for scheduled maintenance every
Monday starting at 2pm.

Icluster: Computing Cluster for EECS Students
---------------------------------------------

For students to do coursework in parallel computing, EECS Instruction
supports an on-site cluster called "Icluster" that is available to all
students who have Instructional computer accounts.  For information about
that, please see
http://inst.eecs.berkeley.edu/cgi-bin/pub.cgi?file=icluster.help.

EECS Instructional Support
378/384/386 Cory, 333 Soda
inst@eecs.berkeley.edu