CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: Pipeline Parallelism 1

Lecturer:

Alan Christopher

# **Boolean Exprs for Controller**



Op 0-5 are really Instruction bits 26-31 Func 0-5 are really Instruction bits 0-5

```
rtype = \sim op_5 \bullet \sim op_4 \bullet \sim op_3 \bullet \sim op_2 \bullet \sim op_1 \bullet \sim op_0,

ori = \sim op_5 \bullet \sim op_4 \bullet op_3 \bullet op_2 \bullet \sim op_1 \bullet op_0

lw = op_5 \bullet \sim op_4 \bullet \sim op_3 \bullet \sim op_2 \bullet op_1 \bullet op_0

sw = op_5 \bullet \sim op_4 \bullet op_3 \bullet \sim op_2 \bullet op_1 \bullet op_0

beq = \sim op_5 \bullet \sim op_4 \bullet \sim op_3 \bullet op_2 \bullet \sim op_1 \bullet \sim op_0

jump = \sim op_5 \bullet \sim op_4 \bullet \sim op_3 \bullet \sim op_2 \bullet op_1 \bullet \sim op_0

add = rtype \bullet func<sub>5</sub> \bullet \sim func_4 \bullet \sim func_3 \bullet \sim func_2 \bullet \sim func_1 \bullet \sim func_0

sub = rtype \bullet func<sub>5</sub> \bullet \sim func_4 \bullet \sim func_3 \bullet \sim func_2 \bullet func_1 \bullet \sim func_0
```

How do we implement this in gates?

# **Boolean Exprs for Controller**

```
RegDst = add + sub
ALUSrc = ori + Iw + sw
MemtoReg = Iw
RegWrite = add + sub + ori + Iw
MemWrite = sw
nPCsel = beq
Jump = jump
ExtOp = Iw + sw
ALUctr[0] = sub + beq
ALUctr[1] = ori
(assume ALUctr is 00 ADD, 01 SUB, 10 OR)
```

How do we implement this in gates?

# **Controller Implementation**



#### Call home, we've made HW/SW contact!

```
High Level Language
       Program (e.g., C)
                Compiler
    Assembly Language
       Program (e.g., MIPS)
                 Assembler
     Machine Language
        Program (MIPS)
Machine
Interpretation
Hardware Architecture Description
  (e.g., block diagrams)
Architecture
Implementation
  Logic Circuit Description
     (Circuit Schematic Diagrams)
```

```
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;
    $t0, 0($2)
    $t1, 4($2)
sw $t1, 0($2)
    $t0, 4($2)
SW
0000 1001 1100 0110 1010 1111 0101 1000
1010 1111 0101 1000 0000 1001 1100 0110
     0110 1010 1111 0101 1000 0000 1001
0101 1000 0000 1001 1100 0110 1010 1111
          Register File
             ALU
```



# Administrivia/Clicker

How many hours of fun from proj3 so far?

- a) 0 <= F <= 4
- b) 4 < F <= 8
- c) 8 < F <= 12
- d) 12 < F <= 16
- e) 16 < F

# Administrivia/Clicker

How many Gflop/s right now?

- a) 0 <= F <= 4
- b) 4 < F <= 8
- c) 8 < F <= 12
- d) 12 < F <= 16
- e) 16 < F

## Review: Single-cycle Processor

- Five steps to design a processor:
  - 1. Analyze instruction set → datapath requirements
  - Select set of datapath components & establish clock methodology
  - 3. Assemble datapath meeting the requirements
  - 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
  - 5. Assemble the control logic
    - Formulate Logic Equations
    - Design Circuits



# Single Cycle Performance

- Assume time for actions are
  - 100ps for register read or write; 200ps for other events
- Clock rate is?

| Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time |
|----------|-------------|---------------|--------|---------------|----------------|------------|
| lw       | 200ps       | 100 ps        | 200ps  | 200ps         | 100 ps         | 800ps      |
| SW       | 200ps       | 100 ps        | 200ps  | 200ps         |                | 700ps      |
| R-format | 200ps       | 100 ps        | 200ps  |               | 100 ps         | 600ps      |
| beq      | 200ps       | 100 ps        | 200ps  |               |                | 500ps      |

- What can we do to improve clock rate?
- Will this improve performance as well?
   Want increased clock rate to mean faster programs

## Single Cycle Performance

- Assume time for actions are
  - 100ps for register read or write; 200ps for other events
- Clock rate is?

| Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time |
|----------|-------------|---------------|--------|---------------|----------------|------------|
| lw       | 200ps       | 100 ps        | 200ps  | 200ps         | 100 ps         | 800ps      |
| SW       | 200ps       | 100 ps        | 200ps  | 200ps         |                | 700ps      |
| R-format | 200ps       | 100 ps        | 200ps  |               | 100 ps         | 600ps      |
| beq      | 200ps       | 100 ps        | 200ps  |               |                | 500ps      |

- What can we do to improve clock rate?
- Will this improve performance as well?
   Want increased clock rate to mean faster programs

### Gotta Do Laundry

 Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away



Washer takes 30 minutes



- Dryer takes 30 minutes
- "Folder" takes 30 minutes
- "Stasher" takes 30 minutes to put clothes into drawers







# Sequential Laundry



Sequential laundry takes
 8 hours for 4 loads

# Pipelined Laundry



# Pipelining Lessons (1/2)



- Pipelining doesn't help <u>latency</u> of single task, it helps <u>throughput</u> of entire workload
- <u>Multiple</u> tasks operating simultaneously using different resources
- Potential speedup = <u>Number</u> <u>pipe stages</u>
- Time to "fill" pipeline and time to "drain" it reduces speedup:
   2.3X v. 4X in this example

# Pipelining Lessons (2/2)



- Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?
- Pipeline rate limited by <u>slowest</u> pipeline stage
- Unbalanced lengths of pipe stages reduces speedup

### Steps in Executing MIPS

- 1) <a href="IFtch">IFtch</a>: <a href="Instruction">Instruction</a> <a href="Fetch">Fetch</a>, <a href="Instruction">Increment</a> <a href="PC">PC</a>
- 2) <u>Dcd</u>: Instruction <u>Decode</u>, Read Registers
- 3) <u>Exec</u>:

Mem-ref: Calculate Address

**Arith-log: Perform Operation** 

#### 4) <u>Mem</u>:

Load: Read Data from Memory

Store: Write Data to Memory

5) WB: Write Data Back to Register

# Single Cycle Datapath



# Pipeline registers



- Need registers between stages
  - To hold information produced in previous cycle

# More Detailed Pipeline



# IF for Load, Store, ...





# ID for Load, Store, ...





### **EX for Load**





#### **MEM** for Load



# WB for Load – Oops!



# Corrected Datapath for Load



#### So, in conclusion

- You now know how to implement the control logic for the single-cycle CPU.
  - (actually, you already knew it!)
- Pipelining improves performance by increasing instruction throughput: exploits ILP
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Next: hazards in pipelining:
  - Structure, data, control