Lecture #19: Pipelining II

2006-07-31

Andy Carle
Review: Datapath for MIPS

- Use datapath figure to represent pipeline
Review: Problems for Computers

• Limits to pipelining: **Hazards** prevent next instruction from executing during its designated clock cycle
  
  • **Structural hazards**: HW cannot support this combination of instructions (single person to fold and put clothes away)
  
  • **Control hazards**: Pipelining of branches & other instructions **stall** the pipeline until the hazard; “**bubbles**” in the pipeline
  
  • **Data hazards**: Instruction depends on result of prior instruction still in the pipeline (missing sock)
Review: C.f. Branch Delay vs. Load Delay

• Load Delay occurs only if necessary (dependent instructions).

• Branch Delay always happens (part of the ISA).

• Why not have Branch Delay interlocked?
  • Answer: Interlocks only work if you can detect hazard ahead of time. By the time we detect a branch, we already need its value ... hence no interlock is possible!
Outline

• Pipeline Control

• Forwarding Control

• Hazard Control
Piped Proc So Far ...
New Representation: Regs more explicit

IF/DE.Ir = Instruction
DE/EX.A = BusA out of Reg
EX/ME.S = AluOut
EX/ME.D = Bus B pass-through for sw
ME/WB.S = ALuOut pass-through
ME/WB.M = Mem Result from lw
New Representation: Regs more explicit

What's Missing???
Pipelined Processor (almost) for slides

Idea: Parallel Piped Control …
Pipelined Control

IR <- Mem[PC]; PC <- PC+4;

A <- R[rs]; B <- R[rt]

S <- A + B;
S <- A or ZX;
S <- A + SX;
S <- A + SX;

M <- Mem[S]
Mem[S] <- B
R[rd] <- S;
R[rt] <- S;
R[rd] <- M;

If Cond
PC < PC + SX;

Exec
Reg.
File
Mem
Access
Data
Mem
A
B
S
D
M
E
Qual
Reg.
File
PC
Next PC
Inst. Mem
Reg File
A Carle, Summer 2006 © UCB
**Data Stationary Control**

- The Main Control generates the control signals during Reg/Dec
  - Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  - Control signals for Mem (MemWr Branch) are used 2 cycles later
  - Control signals for Wr (MemtoReg MemWr) are used 3 cycles later
Let’s Try it Out

10 lw   r1, 36(r2)
14 addl r2, r2, 3
20 sub  r3, r4, r5
24 beq  r6, r7, 100
28 ori  r8, r9, 17
32 add  r10, r11, r12

100 and r13, r14, 15
Start: Fetch 10

10 lw r1, 36(r2)
14 addl r2, r2, 3
20 sub r3, r4, r5
24 beq r6, r7, 100
30 ori r8, r9, 17
34 add r10, r11, r12
100 and r13, r14, 15
Fetch 14, Decode 10

INST. MEM

lw r1, 36(r2)

REG. FILE

lw r1, 36(r2)
add r10, r11, r12
ori r8, r9, 17
and r13, r14, 15

EXEC

14 add r2, r2, 3
20 sub r3, r4, r5
24 beq r6, r7, 100
30 ori r8, r9, 17
34 add r10, r11, r12

MEM ACCESS

10 lw r1, 36(r2)
100 and r13, r14, 15

WB CTRL

14 add r2, r2, 3
20 sub r3, r4, r5
24 beq r6, r7, 100
34 add r10, r11, r12

REG. FILE
Fetch 24, Decode 20, Exec 14, Mem 10

IR

Inst. Mem

sub r3, r4, r5

Decide

add r2, r2, 3

Reg File

lw r1

Exec

r2 + 36

Mem Ctrl

Mem Access

Data Mem

Reg. File

WB Ctrl

M

10 lw r1, 36(r2)

14 add r2, r2, 3

20 sub r3, r4, r5

24 beq r6, r7, 100

30 ori r8, r9, 17

34 add r10, r11, r12

100 and r13, r14, 15

PC

Next PC
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10

Note Delayed Branch: always execute ori after beq
Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14

IR

Inst. Mem

ori r8, r9, 17

9 xx 100

Reg File

ori r8, r9, 17

r6 r7

Next PC

100

PC

Reg.

File

Exec

Mem

Access

Data

Mem

r6 = M[r2 + 35]

10 lw r1, 36(r2)

14 addI r2, r2, 3

20 sub r3, r4, r5

24 beq r6, r7, 100

30 ori r8, r9, 17

34 add r10, r11, r12

100 and r13, r14, 15

100

IF

EX

M

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB

IF

Decod

MEM

WB
• Remember: ^ means triggered on edge.
• What is wrong here?
Double-Clocked Signals

- Some signals are double clocked!
- In general: Inputs to edge components are their own pipeline regs

Watch out for stalls and such!
Administrivia

- HW 5 – Due Wednesday in class
- ProjWork 3.6 – Due 8/5 & 8/8

- Midterm 2:
  - Friday, August 4: 11:00 – 2:00
  - 390 Hearst Mining
  - Same rules as last time
Outline

• Pipeline Control

• Forwarding Control

• Hazard Control
Review: Forwarding

Fix by **Forwarding** result as soon as we have it to where we need it:

- **add** $t0, t1, t2$
- **sub** $t4, t0, t3$
- **and** $t5, t0, t6$
- **or** $t7, t0, t8 *$
- **xor** $t9, t0, t10$

*“or” hazard solved by register hardware*
Forwarding

In general:

• For each stage i that has reg inputs
  • For each stage j after I that has reg output
    - If i.reg == j.reg ➔ forward j value back to i.
    - Some exceptions ($0, invalid)

In particular:

• ALUinput ➔ (ALUResult, MemResult)
• MemInput ➔ (MemResult)
Pending Writes In Pipeline Registers

Diagram showing the flow of data through pipeline registers and units such as IAU, npc, I mem, op, rw, rs, rt, PC, Regs, B, A, alu, S, D mem, m, and n.
Pending Writes In Pipeline Registers

- Current operand registers
- Pending writes
- hazard <=

\[
((rs == \text{rw}_{\text{ex}}) \land \text{regW}_{\text{ex}}) \lor
((rs == \text{rw}_{\text{mem}}) \land \text{regW}_{\text{me}}) \lor
((rs == \text{rw}_{\text{wb}}) \land \text{regW}_{\text{wb}}) \lor
((rt == \text{rw}_{\text{ex}}) \land \text{regW}_{\text{ex}}) \lor
((rt == \text{rw}_{\text{mem}}) \land \text{regW}_{\text{me}}) \lor
((rt == \text{rw}_{\text{wb}}) \land \text{regW}_{\text{wb}})
\]
Forwarding Muxes

• Detect nearest valid write op operand register and forward into op latches, bypassing remainder of the pipe

• Increase muxes to add paths from pipeline registers

• Data Forwarding = Data Bypassing
What about memory operations?

Tricky situation:
MIPS:
lw 0($t0)
sw 0($t1)

RTL:
R1 <- Mem[ R2 + I ];
Mem[R3+34] <- R1
What about memory operations?

Tricky situation:

**MIPS:**
lw 0($t0)
sw 0($t1)

**RTL:**
R1 <- Mem[ R2 + I ];
Mem[R3+34] <- R1

**Solution:**
Handle with bypass in memory stage!
Outline

- Pipeline Control

- Forwarding Control

- Hazard Control
Data Hazard: Loads (1/4)

• Forwarding works if value is available (but not written back) before it is needed. But consider …

\[
\text{lw } \$t0,0(\$t1) \quad \text{sub } \$t3,\$t0,\$t2
\]

• Need result before it is calculated!
• Must stall use (sub) 1 cycle and then forward. …
Data Hazard: Loads (2/4)

- Hardware must stall pipeline
- Called “interlock”

lw $t0, 0($t1)
sub $t3,$t0,$t2
and $t5,$t0,$t4
or $t7,$t0,$t6
Data Hazard: Loads (3/4)

• Instruction slot after a load is called "load delay slot"

• If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle.

• If the compiler puts an unrelated instruction in that slot, then no stall

• Letting the hardware stall the instruction in the delay slot is equivalent to putting a nop in the slot (except the latter uses more code space)
Data Hazard: Loads (4/4)

- Stall is equivalent to nop

\[
lw \ $t0, 0($t1)\]

\[
nop\]

\[
sub \ $t3,$t0,$t2\]

\[
and \ $t5,$t0,$t4\]

\[
or \ $t7,$t0,$t6\]
Hazards / Stalling

In general:

• For each stage i that has reg inputs
  • If I’s reg is being written later on in the pipe but is not ready yet
    - Stages 0 to i: Stall (Turn CEs off so no change)
    - Stage i+1: Make a bubble (do nothing)
    - Stages i+2 onward: As usual

In particular:

• ALUinput ← (MemResult)
Hazards / Stalling

Alternative Approach:

- Detect non-forwarding hazards in decode
  - Possible since our hazards are formal.
    - Not always the case.
- Stalling then becomes:
  - Issue nop to EX stage
  - Turn off nextPC update (refetch same inst)
  - Turn off InstReg update (re-decode same inst)
Stall Logic

- 1. Detect non-resolving hazards.
- 2a. Insert Bubble
- 2b. Stall nextPC, IF/DE
Stall Logic

• Stall-on-issue is used quite a bit
  • More complex processors: many cases that stall on issue.
  • More complex processors: cases that can’t be detected at decode
    - E.g. value needed from mem is not in cache
      – proc must stall multiple cycles
By the way …

• Notice that our forwarding and stall logic is stateless!

• Big Idea: Keep it simple!
  
  • **Option 1**: Store old fetched inst in reg ("stall_temp"), keep state reg that says whether to use stall_temp or value coming off inst mem.
  
  • **Option 2**: Re-fetch old value by turning off PC update.