3.21
Tomasulo processor. Assume ROB has 3 buffer entries,
named 0, 1,
and 2.
ADD.D F0, F8, F8
MULT.D F2, F8, F8
SUB.D F4, F0, F2
ADDI R10, R12, R12
MULT.D = 10 cycles. The rest 1 cycle.
Fill in ROB contents as it would exist on the cycle that
ADDI writes
its result. Assume F8 and R12 are initialized and ROB
is initially
empty.
Entry | Instruction | Destination | Value | Committed |
0 | ADD.D | F0 | F8+F8 | Yes |
1 | MULT.D | F2 | F8*F8 (not ready) | No |
2 | SUB.D | F4 | #0-#1 (not ready) | No |
0 | ADDI | R10 | R12+R12 | No |
The issue order follows the instruction code. By
the time ADDI writes, there are sufficient cycles
for ADD.D to write its result into ROB and commit.
For MULT.D, since it takes 10 cycles, it is not
complete yet. Therefore, its result is not available
to ROB and also it did not commit. Since SUB.D
depends on MULT.D's F2 result, it is still waiting.
Therefore its result is not available to ROB and it
did not commit.
3.11
a) Program size is an integer less than infinity and
program is not self modifying.
Therefore the number of branches is a constant integer
and their locations are static.
As the buffer size increase, sharing will decreases till
zero. This brings prediction
accuracy to a constant. Any additional improvement
requires the predictor to better
model the temporal behavior of each branch.
b) Based on the fact that the branch prediction performance
increased when
moving from a 4096 element buffer to an infinite one
for gcc, tomcatv, and
nasa7, one can presume that the code for these programs
is either a) very
large or b) has an unusually high density of branch instructions.
Either of
these would cause collisions in the branch buffer lowering
the prediction
accuracy.
c) For the programs that do not improve when moving from
an 4096 element
to an infinite branch buffer, we can conclude that either
a) the programs are
relatively small, or b) the programs have a very small
branch density.
d) The optimizing compiler has to know about the kind
of predictor used to
change the structure in such a way, that branches a likely
to be more
regular. This would only work for the programs mentioned
in c, when the
compiler has knowledge about the dynamic behavior of
the program, for
example while using profiling information.
For programs that execute a great deal of loops, the compiler
can also completely unroll loops to
eliminate the branch miss prediction of the loop exit
as described in. For this optimization to make a
significant difference, however, the program must have
a great deal of loops that can be easily unrolled
completely (e.g. the compiler must know the number of
iterations of the loop, and unrolling the loop
must not result in an egregious code size).