University of California, Berkeley
College of Engineering
Department of Electrical Engineering and Computer Science

CS 61C, Spring 2012 HW 1

TA: Alan Christopher

Due Sunday, January 22, 2011 @ 11:59pm

Goals

This assignment is designed to get you back into the swing of things, thinking about Warehouse Scale Computing, and to help you make sure your account works so that you can submit assignments.

Submitting Your Solution

Submit your solution by creating a directory named "hw1". In this directory, create a file named "hw1.txt" and a file named cs61c-XX.jpg (see problem 0) and put your answers inside. Then run submit hw1.

If you have an account already, or you have not yet had your first lab (accounts will be distributed then), do not email me with your submission. Seriously. Use the standard inst submission mechanism.

In order to avoid blowing through our disk quota we are requiring that the cs61c-XX.jpg's submitted for problem 0 be of size less than approximately 250 kB.

Problem 0: Photograph

Acquire a jpeg of yourself in which your face is clearly visible. Submit this file as cs61c-XX.jpg, where cs61c-XX is your account login (these will be distributed in lab). Additionally, we request that the image you submit not be much larger than 250 kB in size. Note also that your submission can go through without this file, so make sure that you include it in your submission.

Problem 1: Basics Potpourri

  1. Match these words to their definitions (each is used exactly once): virtual worlds, desktop computers, servers, low-end servers, supercomputers, terabyte, petabyte, datacenters, embedded computers, multicore processors, VHDL, RAM, CPU, operating system, compiler, bit, instruction, assembly language, machine language, C, assembler, high-level language, system software, application software, cobol, fortran.
    1. Computer used to run large problems and usually accessed via a network
    2. 2^50 bytes
    3. Computers composed of hundreds to thousands of processors and terabytes of memory
    4. Today's science fiction application that will be available in the near future
    5. Part of a computer called central processor unit
    6. A kind of memory called random access memory
    7. Desktop computer without screen or keyboard usually accessed via a network
    8. A microprocessor containing several processors in the same chip
    9. Thousands of processors forming a large cluster
    10. Currently the largest class of computer that runs one application or one set of related applications
    11. Program that translates statements in high-level language to assembly language
    12. Special language used to describe hardware components
    13. Personal computer delivering good performance to single users at low cost
    14. Program that translates symbolic instructions to binary instructions
    15. High level language for scientific computation
    16. Commands that the processors understand
    17. Binary language that the processor can understand
    18. High level language for business data processing
    19. Symbolic representation of machine instructions
    20. Interface between user's program and hardware providing a variety of services and supervision functions
    21. Software/programs developed by the users
    22. Software layer between the application software and the hardware that includes the operating system and the compilers
    23. High-level language used to write application and system software
    24. Portable language composed of words and algebraic expressions that must be translated into assembly language before run in a computer
    25. 2^40 bytes
    26. Binary digit (0 or 1)
  2. If a computer connected to a 1 gigabit Ethernet network needs to send a 256 Kbytes files, how long would it take.
  3. Assuming that a cache memory is ten times faster than a DRAM memory, that DRAM is 100,000 times faster than magnetic disk, and that flash memory is 1000 times faster than disk, find how long it takes to read a file from a DRAM, a disk, and a flash memory if it takes 2 microseconds from cache memory.
  4. Lots of service providers boast about offering "five nines" of uptime in their service level agreements.
    1. What does this mean?
    2. How much downtime can they tolerate per year without breaking their SLA?
  5. Why do the designs of warehouse scale computers need to be inherently fault-tolerant? Give a quantitative example that shows why (don't just cite a failure rate!).
  6. Why does Google use desktop-grade hardware, instead of server-grade hardware? What challenges does this decision create?

Problem 2: MapReduce

For the parts that ask you to design an algorithm, feel free to use pseudocode, explaining at a high level what the algorithm is.

  1. Give an example of a problem that, while in some sense parallellizable, is not a good fit for MapReduce. Justify your answer.

  2. Assume you have a list of key-value pairs of the form (document, text), associating a body of text with the name of a document. Give the format of intermediate key-value pairs, and the pseudocode for the mapper and reducer, that you could use to calculate for each word, how many times it was used across all documents.

  3. A fun game to play with your friends is "degrees of Wikipedia separation": given article A and article B, who can get from A to B with the fewest number of clicks? The number of pages you can get to in just a few clicks from a popular article is quite large, larger than you could wrap your head around, so you decide to use a MapReduce framework to search for the optimal path through Wikipedia.

    Assume that you have access to a list of 2-tuples (article-name, outgoing-link) that describe Wikipedia's site graph. Design an algorithm that uses MapReduce to find the pages reachable in n clicks or less from some start page s. You should describe any mappers, reducers, and key-value formats you use in your solution.

    Hint: you may find it useful to think of there being two or more different "classes" of input key-value pairs, one of which will be the (article, outgoing-link) tuples described above. You could tell these classes of key-value pairs apart by having the first part of every key be a type tag, for instance.