inst.eecs.berkeley.edu/~cs61c
CS61C : Machine Structures

#### Lecture #21 CPU Design: Pipelining to Improve Performance

2007-7-31





## **Review: Single cycle datapath**

#### 5 steps to design a processor

- 1. Analyze instruction set ⇒ datapath <u>requirements</u>
- 2. <u>Select</u> set of datapath components & establish clock methodology
- 3. <u>Assemble</u> datapath meeting the requirements
- 4. <u>Analyze</u> implementation of each instruction to determine the setting of control points that effects the register transfer
- 5. <u>Assemble</u> the control logic

- Control is the hard part
- MIPS makes that easier
  - Instructions same size
  - Source registers always in same place
  - Immediates same size, location
  - Operations always on registers/immediates

CS61C L21 CPU Design : Pipelining to Improve Performance (2)



#### An Abstract View of the Critical Path



Beamer, Summer 2007 © UCB

#### **Processor Performance**

- Can we estimate the clock rate (frequency) of our single-cycle processor? We know:
  - 1 cycle per instruction
  - **lw** is the most demanding instruction.
  - Assume approximate delays for major pieces of the datapath:
    - Instr. Mem, ALU, Data Mem : 2ns each, regfile 1ns
    - Instruction execution requires: 2 + 1 + 2 + 2 + 1 = 8ns
    - ⇒ 125 MHz
- What can we do to improve clock rate?
- Will this improve performance as well?
  - We want increases in clock rate to result in programs executing quicker.
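
The estimate above can be reproduced in a few lines of Python. This is a minimal sketch: the names `DELAYS_NS`, `LW_PATH`, `critical_path_ns`, and `clock_rate_mhz` are illustrative and not from the lecture, and the delays are the approximate values assumed on the slide.

```python
# Approximate component delays from the slide, in nanoseconds.
DELAYS_NS = {"instr_mem": 2, "regfile": 1, "alu": 2, "data_mem": 2}

# lw uses every major component; the register file appears twice
# (read the base register, then write the loaded value back).
LW_PATH = ["instr_mem", "regfile", "alu", "data_mem", "regfile"]


def critical_path_ns(path, delays=DELAYS_NS):
    """Sum the delays along one instruction's path through the datapath."""
    return sum(delays[component] for component in path)


def clock_rate_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000 / period_ns


period = critical_path_ns(LW_PATH)
print(period, "ns ->", clock_rate_mhz(period), "MHz")
```

Because the single-cycle clock must fit the slowest instruction, every instruction pays the `lw` period in this model.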



#### Ways to Improve Clock Frequency

- Smaller Process Size
  - Smallest feature possible in silicon fabrication
  - Smaller process is faster because of EE reasons, and smaller means things are closer together
- Optimize Logic
  - Re-arrange CL to be faster
  - Sometimes more logic can be used to reduce delay
- Parallel
  - Do more at once later...
- Cut Down Length of Critical Path
  - Inserting registers (pipelining) to break up CL



#### **Gotta Do Laundry**

- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away
  - Washer takes 30 minutes
  - Dryer takes 30 minutes
  - "Folder" takes 30 minutes
  - "Stasher" takes 30 minutes to put clothes into drawers











**Sequential Laundry** 



**Pipelined Laundry** 



- Latency: time to completely execute a certain task (delay)
  - for example, time to read a sector from disk is disk access time or disk latency
- Throughput: amount of work that can be done over a period of time (rate)



## **Pipelining Lessons (1/2)**



- Pipelining doesn't help <u>latency</u> of a single task, it helps <u>throughput</u> of the entire workload
- <u>Multiple</u> tasks operating simultaneously using different resources
- Potential speedup = <u>Number pipe stages</u>
- Time to "<u>fill</u>" pipeline and time to "<u>drain</u>" it reduces speedup: 2.3X v. 4X in this example
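
The 2.3X figure follows from the laundry numbers: four loads, four 30-minute stages. A small sketch of the arithmetic, assuming every stage takes the same time (the function names are illustrative, not from the lecture):

```python
def sequential_minutes(loads, stages, stage_minutes):
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    """The first load needs `stages` steps (fill); each later load
    finishes one step after the previous one."""
    return (stages + loads - 1) * stage_minutes


seq = sequential_minutes(4, 4, 30)
pipe = pipelined_minutes(4, 4, 30)
print(seq, pipe, round(seq / pipe, 1))  # 480 210 2.3
```

With many more loads the fill and drain time matters less, and the speedup approaches the number of stages (4).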

#### **Pipelining Lessons (2/2)**



- Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?
- Pipeline rate limited by <u>slowest</u> pipeline stage
- Unbalanced lengths of pipe stages reduces speedup
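
One way to check the effect of the faster washer and stasher is to simulate the laundry room. The sketch below assumes a load enters a stage as soon as both the load and the machine are free, and that loads can wait between machines; `completion_times` is a hypothetical helper, not part of the lecture.

```python
def completion_times(stage_minutes, loads):
    """Return the finish time of each load, in minutes."""
    stage_free = [0] * len(stage_minutes)  # when each machine is next free
    done = []
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(t, stage_free[s])
            t = start + minutes
            stage_free[s] = t
        done.append(t)
    return done


balanced = completion_times([30, 30, 30, 30], 4)
unbalanced = completion_times([20, 30, 30, 20], 4)
print(balanced)    # [120, 150, 180, 210]
print(unbalanced)  # [100, 130, 160, 190]
```

In both cases a load finishes every 30 minutes: the rate is set by the slowest stage, and only the fill and drain time shrinks. In a clocked processor pipeline all stages advance together, so even that small gain disappears unless the clock period itself gets shorter.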



### **Steps in Executing MIPS**

- 1) <u>IFtch</u>: Instruction <u>Fetch</u>, Increment PC
- 2) <u>Dcd</u>: Instruction <u>Decode</u>, Read Registers
- 3) <u>Exec</u>:
  - Mem-ref: Calculate Address
  - Arith-log: Perform Operation
- 4) <u>Mem</u>:
  - Load: Read Data from Memory
  - Store: Write Data to Memory
- 5) <u>WB</u>: <u>Write Data Back to Register</u>



#### **Pipelined Execution Representation**



- Every instruction must take the same number of steps, also called pipeline "<u>stages</u>", so some stages will sit idle for some instructions
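
A text version of the staggered diagram can be generated as follows. This is only a sketch: it assumes an ideal pipeline with no stalls, and `pipeline_diagram` is an illustrative name. Each row is one instruction, each column one clock cycle.

```python
STAGES = ["IFtch", "Dcd", "Exec", "Mem", "WB"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one text row per instruction; '.' marks a cycle in which
    the instruction is not in the pipeline."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        cells = ["."] * total_cycles
        for s, name in enumerate(stages):
            cells[i + s] = name  # instruction i enters one cycle after i-1
        rows.append(" ".join(f"{cell:<5}" for cell in cells).rstrip())
    return rows


for row in pipeline_diagram(3):
    print(row)
```

Reading down any column shows which stage each instruction occupies in that cycle; in the steady state all five stages are busy at once.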

#### **Review: Datapath for MIPS**








#### **Graphical Pipeline Representation**

(In Reg, right half highlighted = read, left half highlighted = write; the horizontal axis is time in clock cycles)




#### Example

- Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write; compute instr rate
- Nonpipelined Execution:
  - lw : IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns
  - add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns (recall 8ns for single-cycle processor)
- Pipelined Execution:
  - Max(IF,Read Reg,ALU,Memory,Write Reg) = 2 ns
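
The numbers above can be turned into a quick comparison of total run time. The sketch assumes the nonpipelined machine is the single-cycle processor (8 ns for every instruction) and that the pipeline never stalls; the function names are illustrative.

```python
STAGE_NS = {"IF": 2, "Read Reg": 1, "ALU": 2, "Memory": 2, "Write Reg": 1}


def nonpipelined_ns(n):
    """Single-cycle: each instruction takes the full lw time."""
    return n * sum(STAGE_NS.values())


def pipelined_ns(n):
    """Every stage is stretched to the slowest stage's time; the first
    instruction fills the pipe, then one finishes per cycle."""
    cycle = max(STAGE_NS.values())
    return (len(STAGE_NS) + n - 1) * cycle


for n in (1, 3, 1000):
    print(n, nonpipelined_ns(n), pipelined_ns(n))
```

Note that a single instruction is slower in the pipeline (10 ns instead of 8 ns) because each of the five stages now takes a full 2 ns cycle; the gain is in throughput, which approaches one instruction every 2 ns.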



#### Administrivia

- Assignments
  - HW7 due 8/2
  - Proj3 due 8/5
- Midterm Regrades due Wed 8/1
- Logisim in lab is now 2.1.6
- Valerie's OH on Thursday moved to 10-11 for this week



#### **Pipeline Hazard: Matching socks in later load**



A depends on D; stall since folder tied up


#### **Problems for Pipelining CPUs**

- Limits to pipelining: <u>Hazards</u> prevent next instruction from executing during its designated clock cycle
  - <u>Structural hazards</u>: HW cannot support some combination of instructions (single person to fold and put clothes away)
  - <u>Control hazards</u>: Pipelining of branches causes later instruction fetches to wait for the result of the branch
  - <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline (missing sock)
- These might result in pipeline stalls or "bubbles" in the pipeline.



#### Structural Hazard #1: Single Memory (1/2)



## **Structural Hazard #1: Single Memory (2/2)**

Solution:

- Infeasible and inefficient to create a second memory
  - (We'll learn about this more next week)
- So simulate this by having two Level 1 <u>Caches</u> (a temporary, smaller copy of memory, usually holding the most recently used data)
- Have both an L1 Instruction Cache and an L1 Data Cache
- Need more complex hardware to control when both caches miss



#### **Structural Hazard #2: Registers (1/2)**



Can we read and write to registers simultaneously?

#### **Structural Hazard #2: Registers (2/2)**

- Two different solutions have been used:
  - 1) RegFile access is *VERY* fast: takes less than half the time of ALU stage
    - Write to Registers during first half of each clock cycle
    - Read from Registers during second half of each clock cycle
  - 2) Build RegFile with independent read and write ports
- Result: can perform Read and Write during same clock cycle



#### Data Hazards (1/2)

Consider the following sequence of instructions:

- add <u>\$t0</u>, \$t1, \$t2
- sub \$t4, <u>\$t0</u>, \$t3
- and \$t5, <u>\$t0</u>, \$t6
- or \$t7, <u>\$t0</u>, \$t8
- xor \$t9, <u>\$t0</u>, \$t10
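
The dependences in this sequence can be found mechanically. The sketch below is not the lecture's method: it assumes a 5-stage pipeline whose register file is written in the first half of a cycle and read in the second half, so only consumers one or two instructions after the producer read a stale value. The tuple encoding and `raw_hazards` are illustrative.

```python
# (opcode, destination register, source registers)
PROGRAM = [
    ("add", "$t0", ["$t1", "$t2"]),
    ("sub", "$t4", ["$t0", "$t3"]),
    ("and", "$t5", ["$t0", "$t6"]),
    ("or",  "$t7", ["$t0", "$t8"]),
    ("xor", "$t9", ["$t0", "$t10"]),
]


def raw_hazards(program, window=2):
    """Return (producer, consumer, register) index triples where the
    consumer reads a register whose most recent writer is at most
    `window` instructions earlier."""
    hazards = []
    for j, (_, _, srcs) in enumerate(program):
        for reg in srcs:
            # Walk backward from the nearest earlier instruction.
            for i in range(j - 1, max(-1, j - window - 1), -1):
                if program[i][1] == reg:
                    hazards.append((i, j, reg))
                    break
    return hazards


print(raw_hazards(PROGRAM))  # [(0, 1, '$t0'), (0, 2, '$t0')]
```

Under these assumptions `sub` and `and` need help from the hardware, while `or` and `xor` read `$t0` after it has been written.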

#### Data Hazards (2/2)

Data flow backward in time is a hazard (horizontal axis: time in clock cycles)




#### **Data Hazard Solution: Forwarding**

Forward result from one stage to another



"or" hazard solved by register hardware


#### Data Hazard: Loads (1/4)

Data flow backward in time is a hazard



- Can't solve all cases with forwarding
- Must stall instruction dependent on load, then forward (more hardware)



#### Data Hazard: Loads (2/4)

- Hardware stalls pipeline
- Called "interlock"



#### Data Hazard: Loads (3/4)

- Instruction slot after a load is called "load delay slot"
- If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle.
- If the compiler puts an unrelated instruction in that slot, then no stall
- Letting the hardware stall the instruction in the delay slot is equivalent to putting a nop in the slot (except the latter uses more code space)
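
The effect of filling the load delay slot can be illustrated with a small stall counter. This is a sketch under stated assumptions: forwarding handles every other data hazard, each load-use pair costs exactly one stall cycle, and the two instruction sequences are made-up examples rather than code from the lecture.

```python
def count_load_use_stalls(program):
    """program: list of (opcode, dest, sources). Count one stall for each
    lw whose immediately following instruction reads the loaded register."""
    stalls = 0
    for (op, dest, _), (_, _, next_srcs) in zip(program, program[1:]):
        if op == "lw" and dest in next_srcs:
            stalls += 1
    return stalls


# Hypothetical sequence: sub uses $t0 right after the load.
naive = [
    ("lw",  "$t0", ["$t1"]),
    ("sub", "$t3", ["$t0", "$t2"]),
    ("and", "$t5", ["$t4", "$t6"]),
]

# Same work, with the unrelated `and` moved into the load delay slot.
reordered = [naive[0], naive[2], naive[1]]

print(count_load_use_stalls(naive), count_load_use_stalls(reordered))  # 1 0
```

The reordering is legal here because `and` neither reads nor writes any register that `sub` or `lw` touches.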





- First MIPS design did not interlock and stall on load-use data hazard
- Real reason behind the name MIPS: Microprocessor without Interlocked Pipeline Stages
  - Word Play on acronym for Millions of Instructions Per Second, also called MIPS





#### **Peer Instruction**

- A. Thanks to pipelining, I have <u>reduced the time</u> it took me to wash my shirt.
- B. Longer pipelines are <u>always a win</u> (since less work per stage & a faster clock).
- C. We can <u>rely on compilers</u> to help us avoid data hazards by reordering instrs.

|    | ABC |
|----|-----|
| 0: | FFF |
| 1: | FFT |
| 2: | FTF |
| 3: | FTT |
| 4: | TFF |
| 5: | TFT |
| 6: | TTF |
| 7: | TTT |

#### **Peer Instruction Answer**

- A. False: <u>throughput</u> is better, not the execution time of a single task.
- B. False: "...longer pipelines do usually mean faster clock, but branches cause problems!"
- C. False: "they happen too often & delay too long." <u>Forwarding!</u> (e.g., Mem ⇒ ALU)




#### **Things to Remember**

- Optimal Pipeline
  - Each stage is executing part of an instruction each clock cycle.
  - One instruction finishes during each clock cycle.
  - On average, execute far more quickly.
- What makes this work?
  - Similarities between instructions allow us to use same stages for all instructions (generally).
  - Each stage takes about the same amount of time as all others: little wasted time.

