# inst.eecs.berkeley.edu/~cs61c UC Berkeley CS61C: Machine Structures

Lecture 29
CPU Design: Pipelining to Improve Performance

2007-04-02

**Lecturer SOE Dan Garcia** 

www.cs.berkeley.edu/~ddgarcia

Wirelessly recharge batt ⇒ Powercast & Philips have

developed a wireless power system. Pacemakers & defibrillators req surgery to replace dead batteries... not any more!





# Review: Single cycle datapath

## °5 steps to design a processor

- 1. Analyze instruction set ⇒ datapath <u>requirements</u>
- 2. <u>Select</u> set of datapath components & establish clock methodology

Control

**Datapath** 

- 3. Assemble datapath meeting the requirements
- 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
- 5. Assemble the control logic
- °Control is the hard part
- °MIPS makes that easier
  - Instructions same size
  - Source registers always in same place
  - Immediates same size, location

Operations always on registers/immediates



Input

**Output** 

**Memory** 

#### **How We Build The Controller**

opcode func

"AND" logic

ladd

sub

ori

llw

SW

beq

<u>jump</u>

```
RegDst = add + sub
```

ALUSrc = ori + lw + sw

MemtoReg = lw

RegWrite = add + sub + ori + lw

MemWrite = sw

nPCsel = beq

Jump = jump

 $\mathbf{ExtOp} \qquad = \mathbf{lw} + \mathbf{sw}$ 

ALUctr[0] = sub + beq (assume ALUctr is 0 ADD, 01: SUB, 10: OR)

ALUctr[1] = or

where,

$$\begin{aligned} \text{rtype} &= \sim \text{op}_5 \bullet \sim \text{op}_4 \bullet \sim \text{op}_3 \bullet \sim \text{op}_2 \bullet \sim \text{op}_1 \bullet \sim \text{op}_0, \\ \text{ori} &= \sim \text{op}_5 \bullet \sim \text{op}_4 \bullet \quad \text{op}_3 \bullet \quad \text{op}_2 \bullet \quad \sim \text{op}_1 \bullet \quad \text{op}_0 \\ \text{lw} &= \quad \text{op}_5 \bullet \sim \text{op}_4 \bullet \sim \text{op}_3 \bullet \sim \text{op}_2 \bullet \quad \text{op}_1 \bullet \quad \text{op}_0 \\ \text{sw} &= \quad \text{op}_5 \bullet \sim \text{op}_4 \bullet \quad \text{op}_3 \bullet \sim \text{op}_2 \bullet \quad \text{op}_1 \bullet \quad \text{op}_0 \\ \text{beq} &= \sim \text{op}_5 \bullet \sim \text{op}_4 \bullet \sim \text{op}_3 \bullet \quad \text{op}_2 \bullet \quad \sim \text{op}_1 \bullet \sim \text{op}_0 \\ \text{jump} &= \sim \text{op}_5 \bullet \sim \text{op}_4 \bullet \sim \text{op}_3 \bullet \sim \text{op}_2 \bullet \quad \text{op}_1 \bullet \sim \text{op}_0 \end{aligned}$$

How do we implement this in gates?

"OR" logic

add = rtype •  $func_5$  •  $\sim func_4$  •  $\sim func_3$  •  $\sim func_2$  •  $\sim func_1$  •  $\sim func_0$  sub =  $func_5$  •  $\sim func_4$  •  $\sim func_3$  •  $\sim func_2$  •  $func_1$  •  $\sim func_0$ 

▶ ReqDst

→ ALUSrc

MemtoReg

RegWrite

MemWrite

nPCsel

→ Jump

► ExtOp

→ ALUctr[0]
→ ALUctr[1]

#### **Processor Performance**

- Can we estimate the clock rate (frequency) of our single-cycle processor? We know:
  - 1 cycle per instruction
  - 1w is the most demanding instruction.
  - Assume approximate delays for major pieces of the datapath:
    - Instr. Mem, ALU, Data Mem : 2ns each, regfile 1ns
    - Instruction execution requires: 2 + 1 + 2 + 2 + 1 = 8ns
    - ⇒ 125 MHz
- What can we do to improve clock rate?
- Will this improve performance as well?
  - We want increases in clock rate to result in programs executing quicker.



#### **Gotta Do Laundry**

Output
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away



° Washer takes 30 minutes



° Dryer takes 30 minutes



° "Folder" takes 30 minutes



° "Stasher" takes 30 minutes to put clothes into drawers





#### **Sequential Laundry**



Sequential laundry takes
 8 hours for 4 loads



#### **Pipelined Laundry**





#### **General Definitions**

- Latency: time to completely execute a certain task
  - for example, time to read a sector from disk is disk access time or disk latency
- Throughput: amount of work that can be done over a period of time



# **Pipelining Lessons (1/2)**



- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Multiple tasks operating simultaneously using different resources
- Potential speedup = Number pipe stages
- Time to "fill" pipeline and time to "drain" it reduces speedup: 2.3X v. 4X in this example



## **Pipelining Lessons (2/2)**



- Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?
- Pipeline rate limited by <u>slowest</u> pipeline stage
- Unbalanced lengths of pipe stages reduces speedup



## **Steps in Executing MIPS**

- 1) IFtch: Instruction Fetch, Increment PC
- 2) Dcd: Instruction Decode, Read Registers
- 3) <u>Exec</u>:

Mem-ref: Calculate Address Arith-log: Perform Operation

4) <u>Mem</u>:

**Load:** Read Data from Memory

Store: Write Data to Memory

5) WB: Write Data Back to Register



#### **Pipelined Execution Representation**

Time

IFtch Dcd Exec Mem WB

 Every instruction must take same number of steps, also called pipeline "stages", so some will go idle sometimes

#### **Review: Datapath for MIPS**



## Use datapath figure to represent pipeline





#### **Graphical Pipeline Representation**

(In Reg, right half highlight read, left half write)



#### **Example**

- Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write; compute instr rate
- Nonpipelined Execution:
  - •1w: IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns
  - add: IF + Read Reg + ALU + Write Reg
    = 2 + 1 + 2 + 1 = 6 ns
    (recall 8ns for single-cycle processor)
- Pipelined Execution:
  - Max(IF,Read Reg,ALU,Memory,Write Reg)= 2 ns



#### Pipeline Hazard: Matching socks in later load



A depends on D; stall since folder tied up

#### **Administrivia**

- Want to redo your autograded assignments for more credit?
  - We may have an opportunity for you...
- Performance Competition Up!
  - Rewrite HW2 to be as fast as possible
  - · It'll be run on real MIPS machine (PS2)
    - You can optimize C or MIPS or BOTH!!
  - Do it for pride, fame (& EPA points)
  - Two competitions
    - Traditional (same spec as H2)
    - Unbounded (same H2 Extra for Experts spec)



#### **Problems for Pipelining CPUs**

- Limits to pipelining: <u>Hazards</u> prevent next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support some combination of instructions (single person to fold and put clothes away)
  - Control hazards: Pipelining of branches causes later instruction fetches to wait for the result of the branch
  - <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline (missing sock)
- These might result in pipeline stalls or "bubbles" in the pipeline.

# Structural Hazard #1: Single Memory (1/2)



Read same memory twice in same clock cycle

# Structural Hazard #1: Single Memory (2/2)

#### Solution:

- infeasible and inefficient to create second memory
- (We'll learn about this more next week)
- so simulate this by having two Level 1
   Caches (a temporary smaller [of usually most recently used] copy of memory)
- have both an L1 <u>Instruction Cache</u> and an L1 <u>Data Cache</u>
- need more complex hardware to control when both caches miss



# Structural Hazard #2: Registers (1/2)



an we read and write to registers simultaneously?

## Structural Hazard #2: Registers (2/2)

- Two different solutions have been used:
  - 1) RegFile access is *VERY* fast: takes less than half the time of ALU stage
    - Write to Registers during first half of each clock cycle
    - Read from Registers during second half of each clock cycle
  - 2) Build RegFile with independent read and write ports
- Result: can perform Read and Write during same clock cycle

#### **Peer Instruction**

- A. Thanks to pipelining, I have <u>reduced the time</u> it took me to wash my shirt.
- B. Longer pipelines are <u>always a win</u> (since less work per stage & a faster clock).
- C. We can <u>rely on compilers</u> to help us avoid data hazards by reordering instrs.



- O: FFF
- 1: FFT
- 2: **FTF**
- 3: **FTT**
- 4: **TFF**
- 5: **TFT**
- 6: **TTF**
- 7: TTT



## **Things to Remember**

#### Optimal Pipeline

- Each stage is executing part of an instruction each clock cycle.
- One instruction finishes during each clock cycle.
- On average, execute far more quickly.
- What makes this work?
  - Similarities between instructions allow us to use same stages for all instructions (generally).
  - Each stage takes about the same amount of time as all others: little wasted time.

