CA 714CA - Homework #6 solution

3.21
Tomasulo processor. Assume ROB has 3 buffer entries, named 0, 1,
and 2.
ADD.D F0, F8, F8
MULT.D F2, F8, F8
SUB.D F4, F0, F2
ADDI R10, R12, R12

MULT.D = 10 cycles. The rest 1 cycle.

Fill in ROB contents as it would exist on the cycle that ADDI writes
its result. Assume F8 and R12 are initialized and ROB is initially
empty.

Entry Instruction Destination Value Committed

0 ADD.D F0 F8+F8 Yes

1 MULT.D F2 F8*F8 (not ready) No

2 SUB.D F4 #0-#1 (not ready) No

0 ADDI R10 R12+R12 No

The issue order follows the instruction code. By the time ADDI writes, there are sufficient cycles
for ADD.D to write its result into ROB and commit. For MULT.D, since it takes 10 cycles, it is not
complete yet. Therefore, its result is not available to ROB and also it did not commit. Since SUB.D
depends on MULT.D's F2 result, it is still waiting. Therefore its result is not available to ROB and it
did not commit.

3.11
a) Program size is an integer less than infinity and program is not self modifying.
Therefore the number of branches is a constant integer and their locations are static.
As the buffer size increase, sharing will decreases till zero. This brings prediction
accuracy to a constant. Any additional improvement requires the predictor to better
model the temporal behavior of each branch.

b) Based on the fact that the branch prediction performance increased when
moving from a 4096 element buffer to an infinite one for gcc, tomcatv, and
nasa7, one can presume that the code for these programs is either a) very
large or b) has an unusually high density of branch instructions. Either of
these would cause collisions in the branch buffer lowering the prediction
accuracy.

c) For the programs that do not improve when moving from an 4096 element
to an infinite branch buffer, we can conclude that either a) the programs are
relatively small, or b) the programs have a very small branch density.

d) The optimizing compiler has to know about the kind of predictor used to
change the structure in such a way, that branches a likely to be more
regular. This would only work for the programs mentioned in c, when the
compiler has knowledge about the dynamic behavior of the program, for
example while using profiling information.

For programs that execute a great deal of loops, the compiler can also completely unroll loops to
eliminate the branch miss prediction of the loop exit as described in. For this optimization to make a
significant difference, however, the program must have a great deal of loops that can be easily unrolled
completely (e.g. the compiler must know the number of iterations of the loop, and unrolling the loop
must not result in an egregious code size).

Entry	Instruction	Destination	Value	Committed
0	ADD.D	F0	F8+F8	Yes
1	MULT.D	F2	F8*F8 (not ready)	No
2	SUB.D	F4	#0-#1 (not ready)	No
0	ADDI	R10	R12+R12	No