Homework 7: Pipelining and Caches

TAs In Charge: Matt Johnson (cs61c-tf) and Valerie Ishida (cs61c-td)

Submitting Your Solution

Submit your solution online by 11:59pm on Wednesday, 2007-04-11. To do so, create a directory named hw7 containing a text file named README, and save any additional files in that directory. From within the directory, type submit hw7 and answer 'y' to include any additional files.

Problems

Q1: Consider the following program segment:

add     $5,     $6,     $7
lw      $6,     200($5)
sub     $5,     $6,     $7
Suppose that this code runs on a pipeline that detects hazards and stalls when one arises (i.e., there is no forwarding/bypassing). Count how many cycles are needed to execute this code, and display each instruction's progress through the pipeline in a format similar to that of Figure 6.3 on page 373 of P&H. You can submit this online as ASCII art; the lower half of Figure 6.3 would look like this:
  LW $1   F  D  A  M  R
  LW $2      F  D  A  M  R
  LW $3         F  D  A  M  R
The letters F, D, A, M, R stand for Fetch, Decode, ALU, Memory, Register. Only enough of each instruction is shown to make clear which is which.


The following problems are from P&H: Exercises 6.2, 6.3, 6.4, and 6.5 (pp. 454-455) and 7.9, 7.10, and 7.11 (pp. 556-558).

6.2: A computer architect needs to design the pipeline of a new microprocessor. She has an example workload program core with 10^6 instructions. Each instruction takes 100 ps to finish.

  1. How long does it take to execute this program core on a nonpipelined processor?
  2. The current state-of-the-art microprocessor has about 20 pipeline stages. Assume it is perfectly pipelined. How much speedup will it achieve compared to the nonpipelined processor?
  3. Real pipelining isn't perfect, since implementing pipelining introduces some overhead per pipeline stage. Will this overhead affect instruction latency, instruction throughput, or both?

6.3: Using a drawing similar to Figure 6.5 on page 377 (reproduced below), show the forwarding paths needed to execute the following four instructions:

  add $3, $4, $6
  sub $5, $3, $2
  lw $7, 100($5)
  add $8, $7, $2

6.4: Identify all of the data dependencies in the following code. Which dependencies are data hazards that will be resolved via forwarding? Which dependencies are data hazards that will cause a stall?

  add $3, $4, $2
  sub $5, $3, $1
  lw $6, 200($3)
  add $7, $3, $6

6.5: How could we modify the following code to make use of a delayed branch slot?

  Loop: lw   $2, 100($3)
        addi $3, $3, 4
        beq  $3, $4, Loop

7.9: Here is a series of address references given as word addresses: 2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 6, and 11. Assuming a direct-mapped cache with 16 one-word blocks that is initially empty, label each reference in the list as a hit or a miss and show the final contents of the cache.

7.10: Using the series of references given in Exercise 7.9, show the hits and misses and the final cache contents for a direct-mapped cache with four-word blocks and a total cache data size of 16 words.

7.11: Given the following pseudocode:

  int array[1000][1000];
  for each element array[i][j] {
    array[i][j] = array[i][j] * 2;
  }

Write two C programs that implement this algorithm: one should access the elements of the array in row-major order, and the other should access them in column-major order. Compare the execution times of the two programs. What does this tell you about the effects of memory layout on cache performance?

UPDATE: You can accurately time a program with the 'time' command in Unix. Run it as: time ./executable