Lecture 14: RISC-V Part 2
Announcements

- Virtual Front Row today, 3/4:
  - Tony Kam
  - Ben Tait
  - Robin Chu
  - Neil Kulkarni
  - Robert Puccinelli
- HW6 posted (due Monday)
- Midterm Reminder
- Format TBD
- No HW next week

<table>
<thead>
<tr>
<th>Date</th>
<th>Lecture Topic 1</th>
<th>Lecture Topic 2</th>
<th>Discussion</th>
</tr>
</thead>
<tbody>
<tr>
<td>2/25</td>
<td>Circuit Timing Part 2 (slides)</td>
<td>(video)</td>
<td></td>
</tr>
<tr>
<td>3/2</td>
<td>RISC-V Microarchitecture and Implementation</td>
<td></td>
<td>Discussion 7</td>
</tr>
<tr>
<td>3/4</td>
<td>RISC-V Part 2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3/9</td>
<td>Exam 1 Review</td>
<td></td>
<td>Discussion 8</td>
</tr>
<tr>
<td>3/11</td>
<td>No Class - Exam 6-9PM</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3/16</td>
<td>Power and Energy</td>
<td></td>
<td>Discussion 9</td>
</tr>
<tr>
<td>3/18</td>
<td>Memory Blocks 1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3/23</td>
<td>Spring Recess</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3/25</td>
<td>Spring Recess</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Implementing Branches

Uses the “B-type” instruction format

RISC-V assembly instruction format:

beq rs1, rs2, label

if rs1==rs2 pc ← pc + offset // offset computed by compiler/assembler and stored in the immediate field(s)

example:

beq x1, x2, L1

The B-format is mostly the same as the S-format, with two register sources (rs1/rs2) and a 12-bit immediate.

But now the immediate represents byte offsets from -4096 to +4094 in 2-byte increments.

The 12 immediate bits encode even 13-bit signed byte offsets (the lowest bit of the offset is always zero, so there is no need to store it).
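The range and the dropped low bit can be checked with a small behavioral sketch. This is Python rather than hardware, and the helper name `split_branch_offset` is an assumption made for illustration. It splits a byte offset into the four immediate fields that a B-type instruction stores.

```python
def split_branch_offset(offset):
    """Split a branch byte offset into the stored B-type immediate fields.

    Hypothetical helper for illustration only.
    """
    if offset % 2 != 0:
        raise ValueError("branch offsets must be even (bit 0 is not stored)")
    if not -4096 <= offset <= 4094:
        raise ValueError("offset does not fit in a 13-bit signed value")
    u = offset & 0x1FFF  # 13-bit two's complement view of the offset
    return {
        "imm[12]": (u >> 12) & 0x1,
        "imm[11]": (u >> 11) & 0x1,
        "imm[10:5]": (u >> 5) & 0x3F,
        "imm[4:1]": (u >> 1) & 0xF,  # bit 0 is dropped: it is always zero
    }


if __name__ == "__main__":
    for off in (8, -4, -4096, 4094):
        print(off, split_branch_offset(off))
```

The four fields add up to 1 + 1 + 6 + 4 = 12 stored bits, which is why a 12-bit immediate field can cover a 13-bit offset.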
Review: Adding sw to datapath

[Figure: single-cycle datapath with sw added. Labels visible in the figure: IMEM, ALU, DMEM, wb, pc+4, +4, inst[31:0].]
Adding branches to datapath
Branch Comparator

- BrEq = 1, if A=B
- BrLT = 1, if A < B
- BrUn = 1 selects unsigned comparison for BrLT, 0=signed

- BGE branch: A >= B, if !(A<B)
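A behavioral sketch of the comparator, assuming 32-bit register values, is shown below. The function names `branch_comp` and `branch_taken` are hypothetical. The second function shows how the six RV32I branch conditions can be derived from just BrEq and BrLT, as the BGE line above suggests.

```python
def to_signed32(x):
    """Interpret the low 32 bits of x as a two's complement value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def branch_comp(a, b, br_un):
    """Return (BrEq, BrLT) for 32-bit values a and b."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    br_eq = a == b
    br_lt = a < b if br_un else to_signed32(a) < to_signed32(b)
    return br_eq, br_lt


def branch_taken(op, a, b):
    """Derive the branch decision from BrEq/BrLT (op is an RV32I mnemonic)."""
    br_eq, br_lt = branch_comp(a, b, br_un=op in ("bltu", "bgeu"))
    return {
        "beq": br_eq,
        "bne": not br_eq,
        "blt": br_lt,
        "bge": not br_lt,   # A >= B is !(A < B)
        "bltu": br_lt,
        "bgeu": not br_lt,
    }[op]
```

For example, with A = 0xFFFFFFFF and B = 1, the signed comparison sees -1 < 1 while the unsigned comparison sees a very large A, so `blt` is taken and `bltu` is not.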
## RISC-V Immediate Encoding

### Instruction Encodings, inst[31:0]

<table>
<thead>
<tr>
<th>Format</th>
<th>inst[31:25]</th>
<th>inst[24:20]</th>
<th>inst[19:15]</th>
<th>inst[14:12]</th>
<th>inst[11:7]</th>
<th>inst[6:0]</th>
</tr>
</thead>
<tbody>
<tr>
<td>R-type</td>
<td>funct7</td>
<td>rs2</td>
<td>rs1</td>
<td>funct3</td>
<td>rd</td>
<td>opcode</td>
</tr>
<tr>
<td>I-type</td>
<td colspan="2">imm[11:0]</td>
<td>rs1</td>
<td>funct3</td>
<td>rd</td>
<td>opcode</td>
</tr>
<tr>
<td>S-type</td>
<td>imm[11:5]</td>
<td>rs2</td>
<td>rs1</td>
<td>funct3</td>
<td>imm[4:0]</td>
<td>opcode</td>
</tr>
<tr>
<td>B-type</td>
<td>imm[12|10:5]</td>
<td>rs2</td>
<td>rs1</td>
<td>funct3</td>
<td>imm[4:1|11]</td>
<td>opcode</td>
</tr>
</tbody>
</table>

### 32-bit immediates produced, imm[31:0]

<table>
<thead>
<tr>
<th>Immediate</th>
<th>imm[31:12]</th>
<th>imm[11]</th>
<th>imm[10:5]</th>
<th>imm[4:1]</th>
<th>imm[0]</th>
</tr>
</thead>
<tbody>
<tr>
<td>I-immediate</td>
<td>inst[31] (sign extension)</td>
<td>inst[31]</td>
<td>inst[30:25]</td>
<td>inst[24:21]</td>
<td>inst[20]</td>
</tr>
<tr>
<td>S-immediate</td>
<td>inst[31] (sign extension)</td>
<td>inst[31]</td>
<td>inst[30:25]</td>
<td>inst[11:8]</td>
<td>inst[7]</td>
</tr>
<tr>
<td>B-immediate</td>
<td>inst[31] (sign extension)</td>
<td>inst[7]</td>
<td>inst[30:25]</td>
<td>inst[11:8]</td>
<td>0</td>
</tr>
</tbody>
</table>

- Upper bits sign-extended from inst[31] always
- Only bit 7 of instruction changes role in immediate between S and B
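The immediate generator for these three formats can be sketched behaviorally as follows. This is Python, not the hardware description. The names `sign_extend` and `imm_gen` are assumptions, and the bit positions follow the RV32I instruction formats.

```python
def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def imm_gen(inst, fmt):
    """Return the signed immediate of a 32-bit instruction word.

    fmt is "I", "S" or "B" (hypothetical selector, standing in for ImmSel).
    """
    if fmt == "I":
        return sign_extend(inst >> 20, 12)                 # inst[31:20]
    if fmt == "S":
        imm = ((inst >> 25) & 0x7F) << 5 | ((inst >> 7) & 0x1F)
        return sign_extend(imm, 12)                        # inst[31:25], inst[11:7]
    if fmt == "B":
        imm = (((inst >> 31) & 0x1) << 12                  # inst[31]    -> imm[12]
               | ((inst >> 7) & 0x1) << 11                 # inst[7]     -> imm[11]
               | ((inst >> 25) & 0x3F) << 5                # inst[30:25] -> imm[10:5]
               | ((inst >> 8) & 0xF) << 1)                 # inst[11:8]  -> imm[4:1]
        return sign_extend(imm, 13)
    raise ValueError("unsupported format: " + fmt)


if __name__ == "__main__":
    print(imm_gen(0xFFF00093, "I"))  # addi x1, x0, -1
    print(imm_gen(0x0020A423, "S"))  # sw x2, 8(x1)
    print(imm_gen(0x00208463, "B"))  # beq x1, x2, +8
```

Comparing the S and B branches of the function shows the point of the encoding: inst[30:25] and inst[11:8] land in the same immediate positions in both formats, and only inst[7] moves (imm[0] for S, imm[11] for B).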
Implementing **JALR** Instruction (I-Format)

- JALR rd, rs1, immediate
  - Writes PC+4 to Reg[rd] (return address)
  - Sets PC = Reg[rs1] + immediate
  - Uses same immediates as arithmetic and loads
    - *no* multiplication by 2 bytes
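A minimal behavioral sketch of these steps is given below, assuming a 32-entry register list and a hypothetical function name `jalr`. Clearing bit 0 of the target follows the RISC-V ISA specification and is not shown on the slide.

```python
def jalr(pc, regs, rd, rs1, imm):
    """Execute jalr on a register list and return the new PC.

    imm is the already sign-extended 12-bit I-immediate, in bytes.
    """
    # Read rs1 before writing rd so that rd == rs1 still works.
    target = (regs[rs1] + imm) & 0xFFFFFFFE  # ISA spec: clear bit 0 of the target
    if rd != 0:                              # x0 is hardwired to zero
        regs[rd] = (pc + 4) & 0xFFFFFFFF     # return address
    return target


if __name__ == "__main__":
    regs = [0] * 32
    regs[1] = 0x1000
    new_pc = jalr(0x2000, regs, rd=1, rs1=1, imm=8)
    print(hex(new_pc), hex(regs[1]))
```

A function return (`jalr x0, x1, 0`) uses the same path with rd = x0, so nothing is written back.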
Review: Adding branches to datapath
Adding `jalr` to datapath
Implementing jal Instruction

Uses the “J-type” instruction format

- JAL saves PC+4 in Reg[rd] (the return address)
- Set PC = PC + offset (PC-relative jump)
- Target somewhere within ±2¹⁹ locations, 2 bytes apart
  - ±2¹⁸ 32-bit instructions
- Immediate encoding optimized similarly to branch instruction to reduce hardware cost
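The J-type immediate can be decoded with a sketch like the one below. The field placement (imm[20|10:1|11|19:12] in inst[31:12]) comes from the RV32I specification. The function names `j_imm` and `jal` are assumptions for illustration.

```python
def j_imm(inst):
    """Return the signed byte offset encoded in a J-type instruction word."""
    imm = (((inst >> 31) & 0x1) << 20      # inst[31]    -> imm[20]
           | ((inst >> 12) & 0xFF) << 12   # inst[19:12] -> imm[19:12]
           | ((inst >> 20) & 0x1) << 11    # inst[20]    -> imm[11]
           | ((inst >> 21) & 0x3FF) << 1)  # inst[30:21] -> imm[10:1]
    return imm - (1 << 21) if imm & (1 << 20) else imm


def jal(pc, regs, inst):
    """Execute jal: save PC+4 in Reg[rd], return PC + offset."""
    rd = (inst >> 7) & 0x1F
    if rd != 0:
        regs[rd] = (pc + 4) & 0xFFFFFFFF
    return (pc + j_imm(inst)) & 0xFFFFFFFF


if __name__ == "__main__":
    print(j_imm(0x008000EF))  # jal x1, +8
    print(j_imm(0xFFDFF06F))  # jal x0, -4
```

The 20 stored bits plus the implicit zero in bit 0 give a 21-bit signed offset, that is, -2^20 to 2^20 - 2 bytes, which matches the ±2¹⁹ two-byte locations stated above.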
Adding jal to datapath
Single-Cycle RISC-V RV32I Datapath
Controller Implementation:

Control logic works really well as a case statement...

```verilog
case (op)  // op = inst[6:0]; RISC-V opcodes are 7 bits wide
  7'b0110011: begin reg_write = 1; ... end  // R-type
...
```
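The same case-statement idea can be modeled as a lookup table. The sketch below is a Python behavioral model, not the controller itself. The opcode values are the RV32I major opcodes, while the control signal names other than `reg_write` (`mem_write`, `is_branch`, `is_jump`) are hypothetical and only cover a small subset of what a real controller drives.

```python
# RV32I major opcodes, taken from inst[6:0].
OPCODES = {
    0b0110011: "R-type",
    0b0010011: "I-type ALU",
    0b0000011: "load",
    0b0100011: "store",
    0b1100011: "branch",
    0b1101111: "jal",
    0b1100111: "jalr",
}

# Hypothetical control bundle per instruction class (illustrative subset).
CONTROL = {
    "R-type":     dict(reg_write=1, mem_write=0, is_branch=0, is_jump=0),
    "I-type ALU": dict(reg_write=1, mem_write=0, is_branch=0, is_jump=0),
    "load":       dict(reg_write=1, mem_write=0, is_branch=0, is_jump=0),
    "store":      dict(reg_write=0, mem_write=1, is_branch=0, is_jump=0),
    "branch":     dict(reg_write=0, mem_write=0, is_branch=1, is_jump=0),
    "jal":        dict(reg_write=1, mem_write=0, is_branch=0, is_jump=1),
    "jalr":       dict(reg_write=1, mem_write=0, is_branch=0, is_jump=1),
}


def control(inst):
    """Return (instruction class, control signals) for a 32-bit instruction."""
    kind = OPCODES[inst & 0x7F]
    return kind, CONTROL[kind]
```

Each dictionary entry plays the role of one `case` arm: the opcode selects a fixed set of signal values for the whole cycle.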
Processor Pipelining
Program Execution Time

\[ = (\# \text{ instructions}) (\text{cycles/instruction}) (\text{seconds/cycle}) \]

\[ = \# \text{ instructions} \times \text{CPI} \times T_C \]
Single-Cycle Performance

- $T_C$ is limited by the critical path (`lw`)
- **Single-cycle critical path:**
  \[ T_C = t_{q_{PC}} + t_{mem} + \max(t_{RFread}, t_{sext} + t_{mux}) + t_{ALU} + t_{mem} + t_{mux} + t_{RFsetup} \]

- **In most implementations, the limiting paths are memory, ALU, and register file:**
  \[ T_C = t_{q_{PC}} + 2t_{mem} + t_{RFread} + t_{mux} + t_{ALU} + t_{RFsetup} \]
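To make the formula concrete, the sketch below plugs in a set of element delays and then applies the execution-time equation. All delay values, the instruction count, and the function names are illustrative assumptions, not measurements from this course.

```python
# Illustrative element delays in picoseconds (assumed values).
DELAYS_PS = {
    "t_q_PC": 30,      # PC register clock-to-Q
    "t_mem": 250,      # IMEM or DMEM read
    "t_RFread": 150,   # register file read
    "t_sext": 35,      # immediate sign extension
    "t_mux": 25,       # multiplexer
    "t_ALU": 200,      # ALU
    "t_RFsetup": 20,   # register file setup
}


def single_cycle_tc(d):
    """Critical path of the single-cycle datapath, following the lw path."""
    return (d["t_q_PC"] + d["t_mem"]
            + max(d["t_RFread"], d["t_sext"] + d["t_mux"])
            + d["t_ALU"] + d["t_mem"] + d["t_mux"] + d["t_RFsetup"])


def execution_time_s(n_instructions, cpi, tc_ps):
    """Execution time = # instructions x CPI x T_C, in seconds."""
    return n_instructions * cpi * tc_ps * 1e-12


if __name__ == "__main__":
    tc = single_cycle_tc(DELAYS_PS)
    print(tc, "ps per cycle")
    print(execution_time_s(100e9, 1, tc), "s for 100 billion instructions")
```

With these numbers the register file read (150 ps) dominates the sign-extend path (60 ps), so the full expression reduces to the simplified one: 30 + 2(250) + 150 + 25 + 200 + 20 = 925 ps.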
**Pipelined Processor**

- Use *temporal parallelism*
- Divide single-cycle processor into 5 stages:
  - Fetch
  - Decode
  - Execute
  - Memory
  - Writeback
- Add pipeline registers between stages
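The effect of splitting the datapath into stages can be estimated with a simple model. The sketch below assumes hypothetical stage delays and a hypothetical pipeline register overhead `t_reg`, and it ignores hazards entirely, so it gives a best-case figure.

```python
def single_cycle_time(n, stage_delays):
    """Total time when each instruction takes the sum of all stage delays."""
    return n * sum(stage_delays)


def pipelined_time(n, stage_delays, t_reg):
    """Ideal pipelined time: clock set by the slowest stage plus register overhead.

    The first instruction needs len(stage_delays) cycles; each later one
    completes one cycle after its predecessor (no hazards assumed).
    """
    t_c = max(stage_delays) + t_reg
    return (n + len(stage_delays) - 1) * t_c


if __name__ == "__main__":
    # Assumed delays (ps) for Fetch, Decode, Execute, Memory, Writeback.
    stages = [250, 150, 200, 250, 100]
    for n in (3, 1000):
        print(n, single_cycle_time(n, stages), pipelined_time(n, stages, t_reg=50))
```

With these assumed delays the speedup for a long instruction stream approaches 950 / 300, or about 3.2, rather than 5, because the stages are unbalanced and the pipeline registers add delay to every stage.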
Single-Cycle vs. Pipelined Performance

[Figure: timing diagram comparing single-cycle and pipelined execution of instructions 1, 2, and 3. Each instruction passes through Fetch Instruction, Decode / Read Reg, Execute ALU, Memory Read / Write, and Write Reg. The horizontal axis is Time (ps).]
Single-Cycle and Pipelined Datapath
Corrected Pipelined Datapath

- WriteReg must arrive at the same time as Result
Pipelined Control

Same control unit as single-cycle processor

Control delayed to proper pipeline stage
Pipeline Hazards

- A hazard occurs when an instruction depends on a result from a previous instruction that has not yet completed.
- Types of hazards:
  - **Data hazard**: register value not written back to register file yet
  - **Control hazard**: next instruction not decided yet (caused by branches)

We need to design ways to avoid hazards, else we pay the price in CPI (cycles per instruction) and processor performance suffers.
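A data hazard of this kind can be detected by comparing each instruction's source registers with the destination registers of the instructions just ahead of it. The sketch below is a simplified model: the `Instr` tuple, the function name `raw_hazards`, and the `distance` parameter (how many earlier instructions are still in flight) are all assumptions for illustration.

```python
from collections import namedtuple

# Registers are plain integers; use None for a missing field.
Instr = namedtuple("Instr", "op rd rs1 rs2")


def raw_hazards(program, distance=1):
    """List (producer index, consumer index, register) read-after-write pairs.

    An instruction is a consumer if it reads a register that one of the
    previous `distance` instructions writes. Writes to x0 never matter.
    """
    hazards = []
    for i, consumer in enumerate(program):
        for j in range(max(0, i - distance), i):
            rd = program[j].rd
            if rd not in (None, 0) and rd in (consumer.rs1, consumer.rs2):
                hazards.append((j, i, rd))
    return hazards


if __name__ == "__main__":
    program = [
        Instr("add", 5, 3, 4),   # add x5, x3, x4
        Instr("add", 6, 1, 2),   # add x6, x1, x2
        Instr("sub", 7, 6, 5),   # sub x7, x6, x5
    ]
    print(raw_hazards(program, distance=1))
    print(raw_hazards(program, distance=2))
```

A deeper pipeline corresponds to a larger `distance`, which is one way to see why deeper pipelines expose more hazards.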
Deeper pipeline example.

Deeper pipelines => less logic per stage => higher clock rate.

But

Deeper pipelines* => more hazards => more cost and/or higher CPI.

Cycles per instruction might go up because of unresolvable hazards.

Remember, execution time = # instructions × CPI / Frequency_{clk}, so for a fixed program, performance is proportional to Frequency_{clk} / CPI.

*Many designs included pipelines as long as 7, 10 and even 20 stages (like in the Intel Pentium 4). The later "Prescott" and "Cedar Mill" Pentium 4 cores (and their Pentium D derivatives) had a 31-stage pipeline.

How about shorter pipelines ... Less cost, less performance (but higher cost efficiency)
3-Stage Pipeline
The blocks in the datapath with the greatest delay are: IMEM, ALU, and DMEM. Allocate one pipeline stage to each:

```
I    X    M
```

- I stage: Use the PC register as the address to IMEM and retrieve the next instruction. The instruction gets stored in a pipeline register, also called the “instruction register” in this case.
- X stage: Use the ALU to compute a result, memory address, or branch target address.
- M stage: Access data memory or an I/O device for a load or store. Allow for setup time for the register file write.

Most details you will need to work out for yourself. Some details to follow ... In particular, let’s look at hazards.
The fix for ALU data hazards:

Selectively forward the ALU result back to the input of the ALU.

- Need to add a mux at the input to the ALU and add control logic to sense when to activate it. Check reference for details.
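The control logic for the forwarding mux reduces to a register-number comparison. The sketch below models one ALU input, and the function and parameter names are hypothetical. A real design needs the same check on both ALU operands.

```python
def alu_operand(rs, regfile_value, prev_rd, prev_reg_write, prev_alu_result):
    """Select one ALU input: forwarded result or register file value.

    Forward when the previous instruction writes a register (other than x0)
    that this instruction reads.
    """
    forward = prev_reg_write and prev_rd != 0 and prev_rd == rs
    return prev_alu_result if forward else regfile_value


if __name__ == "__main__":
    # add x5, x3, x4 followed by sub x7, x6, x5: operand x5 must be forwarded.
    stale_x5 = 111   # value still in the register file
    fresh_x5 = 42    # result just produced by the ALU
    print(alu_operand(5, stale_x5, prev_rd=5, prev_reg_write=True,
                      prev_alu_result=fresh_x5))
```

The `prev_reg_write` term matters because stores and branches carry bits in the rd field position that are not a destination register.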
Load Hazard

The fix: Delay the dependent instruction by one cycle to allow the load to complete, then send the result of the load directly to the ALU (and to the regfile). No delay if not dependent!
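The stall condition can be sketched as a check on adjacent instruction pairs. The `Instr` tuple and the function name `load_use_stalls` are assumptions, and only `lw` is treated as a load to keep the example short.

```python
from collections import namedtuple

# Registers are plain integers; use None for a missing field.
Instr = namedtuple("Instr", "op rd rs1 rs2")


def load_use_stalls(program):
    """Count one-cycle stalls: a load immediately followed by a dependent instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.rd != 0 and prev.rd in (cur.rs1, cur.rs2):
            stalls += 1
    return stalls


if __name__ == "__main__":
    dependent = [Instr("lw", 5, 1, None), Instr("add", 6, 5, 2)]
    independent = [Instr("lw", 5, 1, None), Instr("add", 6, 3, 2)]
    print(load_use_stalls(dependent), load_use_stalls(independent))
```

Only the instruction directly after the load can stall in this model: once it has been delayed by one cycle, later instructions see the loaded value through the normal paths.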
### Control Hazard

#### 3-stage Pipeline

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Cycle 1</th>
<th>Cycle 2</th>
<th>Cycle 3</th>
<th>Cycle 4</th>
<th>Cycle 5</th>
<th>Cycle 6</th>
</tr>
</thead>
<tbody>
<tr>
<td>beq x1, x2, L1</td>
<td>I</td>
<td>X</td>
<td>M</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>add x5, x3, x4</td>
<td></td>
<td>I</td>
<td>X</td>
<td>M</td>
<td></td>
<td></td>
</tr>
<tr>
<td>add x6, x1, x2</td>
<td></td>
<td></td>
<td>I</td>
<td>X</td>
<td>M</td>
<td></td>
</tr>
<tr>
<td>L1: sub x7, x6, x5</td>
<td></td>
<td></td>
<td></td>
<td>I</td>
<td>X</td>
<td></td>
</tr>
</tbody>
</table>

*The branch address is ready at the end of the X stage of `beq` (end of cycle 2), but it is needed at the start of cycle 2 to fetch the instruction that follows the branch.*

**The fix:** several possibilities:

1. Always delay fetch of the instruction after a branch. Simple, but all branches now take 2 cycles (lowers performance).
2. Assume branch “not taken”, continue with the instruction at PC+4, and correct later if wrong. Simple, and only some branches take 2 cycles (better performance).
3. Predict branch taken or not based on history (state) and correct later if wrong. Complex, but very few branches take 2 cycles (best performance).

*MIPS defines “branch delay slot”, RISC-V doesn’t*
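The three options can be compared with a rough CPI estimate, where a "2-cycle branch" adds one penalty cycle. The branch frequency, taken rate, and misprediction rate below are assumed values chosen only for illustration, and the model ignores all other hazards.

```python
def cpi_always_delay(branch_frac):
    """Option 1: every branch costs one extra cycle."""
    return 1 + branch_frac * 1


def cpi_predict_not_taken(branch_frac, taken_frac):
    """Option 2: only taken branches cost one extra cycle."""
    return 1 + branch_frac * taken_frac


def cpi_history_predictor(branch_frac, mispredict_frac):
    """Option 3: only mispredicted branches cost one extra cycle."""
    return 1 + branch_frac * mispredict_frac


if __name__ == "__main__":
    branch_frac = 0.2       # assumed: 20% of instructions are branches
    taken_frac = 0.6        # assumed: 60% of branches are taken
    mispredict_frac = 0.1   # assumed: predictor is wrong 10% of the time
    print(cpi_always_delay(branch_frac))
    print(cpi_predict_not_taken(branch_frac, taken_frac))
    print(cpi_history_predictor(branch_frac, mispredict_frac))
```

Under these assumptions the ordering matches the list above: option 1 gives the highest CPI and option 3 the lowest, at the cost of extra prediction hardware.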
Predict “not taken”

Control Hazard

Branch address ready at end of X stage:
- If branch “not taken”, do nothing.
- If branch “taken”, then kill instruction in I stage (about to enter X stage) and fetch at new target address (PC)

Branch not taken (`bne x1, x1, L1` compares a register with itself, so it never branches):

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Cycle 1</th>
<th>Cycle 2</th>
<th>Cycle 3</th>
<th>Cycle 4</th>
<th>Cycle 5</th>
<th>Cycle 6</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>bne</code> x1, x1, L1</td>
<td>I</td>
<td>X</td>
<td>M</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>add</code> x5, x3, x4</td>
<td></td>
<td>I</td>
<td>X</td>
<td>M</td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>add</code> x6, x1, x2</td>
<td></td>
<td></td>
<td>I</td>
<td>X</td>
<td>M</td>
<td></td>
</tr>
<tr>
<td><code>L1: sub</code> x7, x6, x5</td>
<td></td>
<td></td>
<td></td>
<td>I</td>
<td>X</td>
<td></td>
</tr>
</tbody>
</table>

Branch taken (`beq x1, x1, L1` always branches, so the `add` already fetched is killed and becomes a nop):

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Cycle 1</th>
<th>Cycle 2</th>
<th>Cycle 3</th>
<th>Cycle 4</th>
<th>Cycle 5</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>beq</code> x1, x1, L1</td>
<td>I</td>
<td>X</td>
<td>M</td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>add</code> x5, x3, x4</td>
<td></td>
<td>I</td>
<td>nop</td>
<td>nop</td>
<td></td>
</tr>
<tr>
<td><code>L1: sub</code> x7, x6, x5</td>
<td></td>
<td></td>
<td>I</td>
<td>X</td>
<td>M</td>
</tr>
</tbody>
</table>
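The kill behavior can be reproduced with a small simulation of the 3-stage pipeline. The sketch below is a simplified model: the program format (a list of `(text, taken_target)` pairs) and the function name `simulate` are assumptions, only forward branches are supported, and data hazards are ignored.

```python
def simulate(program):
    """Return (text, fetch cycle, stages) rows under predict "not taken".

    program is a list of (text, taken_target) pairs. taken_target is the index
    of the branch target if the instruction is a taken branch, else None.
    """
    rows = []
    cycle = 1
    pc = 0
    while pc < len(program):
        text, target = program[pc]
        rows.append((text, cycle, ["I", "X", "M"]))
        cycle += 1
        if target is None:
            pc += 1
            continue
        if target <= pc:
            raise ValueError("this sketch only handles forward branches")
        # The branch resolves at the end of its X stage. The sequential
        # instruction fetched during that cycle is killed (turned into nops).
        if pc + 1 < len(program):
            rows.append((program[pc + 1][0], cycle, ["I", "nop", "nop"]))
            cycle += 1
        pc = target
    return rows


if __name__ == "__main__":
    taken = [
        ("beq x1, x1, L1", 3),
        ("add x5, x3, x4", None),
        ("add x6, x1, x2", None),
        ("L1: sub x7, x6, x5", None),
    ]
    for row in simulate(taken):
        print(row)
```

The fetch cycle in each row is the column where its `I` appears in a pipeline diagram, so a taken branch shows up as one killed row between the branch and its target.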
Pipeline rules:
- Writes/reads to/from DMem are clocked on the leading edge of the clock in the “M” stage
- Writes to RegFile at the end of the “M” stage
- Instruction Decode and Register File access is up to you.

- Branch: predict “not-taken”
- Load: 1 cycle delay/stall on dependent instruction
- Bypass ALU for data hazards
- More details in upcoming spec
End of Lecture 14