Submit your solution online by 11:59pm on Wednesday 2007-04-11. Do this by creating a directory named hw7 that contains a text file named README. Save any additional files in this folder. From within this directory, type submit hw7 and type 'y' to include any additional files.
Q1: Consider the following program segment:
add $5, $6, $7
lw $6, 200($5)
sub $5, $6, $7
Suppose that this code runs using a pipeline that detects hazards and stalls if a hazard arises (ie. there is no forwarding/bypassing). Count how many cycles will be needed to execute this code, and display each instruction's progress through the pipeline in a format similar to that of Figure 6.3 on page 373 of P&H. You can submit this online as ASCII art, so Figure 6.3 (the lower half) would look like this:
LW $1 F D A M R
LW $2 F D A M R
LW $3 F D A M R
The letters FDAMR stand for Fetch, Decode, ALU, Memory, Register. I only showed enough of each instruction to make it clear which is which.
The following are from P&H Exercises 6.2, 6.3, 6.4, 6.5, (pp.454-455) and 7.9, 7.10, 7.11 (pp. 556-558)
6.2:
A computer architect needs to design the pipeline of a new
microprocessor. She has an example workload program core with 106
instructions. Each instruction takes 100ps to finish.
6.3: Using a drawing similar to Figure 6.5 on page 377 (reproduced below), show the forwarding paths needed to execute the following four instructions:
add $3, $4, $6
sub $5, $3, $2
lw $7, 100($5)
add $8, $7, $2
6.4: Identify all of the data dependencies in the following code. Which dependencies are data hazards that will be resolved via forwarding? Which dependencies are data hazards that will cause a stall?
add $3, $4, $2
sub $5, $3, $1
lw $6, 200($3)
add $7, $3, $6
6.5: How could we modify the following code to make use of a delayed branch slot?
Loop: lw $2, 100($3)
addi $3, $3, 4
beq $3, $4, Loop
7.9: Here is a series of address references given as word addresses: 2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 6, and 11. Assuming a direct-mapped cache with 16 one-word blocks that is initially empty, label each reference in the list as a hit or a miss and show the final contents of the cache.
7.10: Using the series of references given in Exercise 7.9, show the hits and misses and the final cache contents for a direct-mapped cache with four-word blocks and a total cache data size of 16 words.
7.11: Given the following pseudocode:
int array[1000][1000];
for each element array[i][j] {
array[i][j] = array[i][j] * 2;
}
Write two C programs that implement this algorithm: one should access the elements of the array in row-major order, and the other should access them in column-major order. Compare the execution times of the two programs. What does this tell you about the effects of memory layout on cache performance?
UPDATE You can accurately time a program with the 'time' command in Unix. Run it with time executable
.