CS152 Computer Architecture and Engineering
ISAs, Microprogramming and Pipelining

SOLUTION

Assigned
01/23/2023
Problem Set #1, Version (1.2)
Due February 2
@ 11:59:59PT

http://inst.eecs.berkeley.edu/~cs152/sp23

The problem sets are intended to help you learn the material, and we encourage you to collaborate with other students and to ask questions in discussion sections and office hours to understand the problems. However, each student must turn in their own solution to the problems.

The problem sets also provide essential background material for the exam and the midterms. The problem sets will be graded primarily on an effort basis, but if you do not work through the problem sets yourself you are unlikely to succeed on the exam or midterms! We will distribute solutions to the problem set on the day after the deadline to give you feedback.

Assignments must be submitted through Gradescope by 11:59:59pm PT on the specified due date. Box all solutions that don’t involving filling in a figure/table. Only boxed solutions and filled in figures/tables will be considered for grading. See the course website for the policy on slip days (late submissions).

Name: _______________________________

SID: _______________________________
Problem 1: CISC, RISC, Accumulator, and Stack: Comparing ISAs

In this problem, your task is to compare four different ISAs: x86 (a CISC architecture with variable-length instructions), RISC-V (a load-store, RISC architecture with 32-bit instructions in its base form), a stack-based ISA, and an accumulator-based ISA.

Problem 1.A  CISC

Let us begin by considering the following C code, which (inefficiently) rotates the bits in a 32-bit value by \( n \) times.

```c
unsigned int rotate(unsigned int x, unsigned int n) {
    unsigned int msb;
    while (n != 0) {
        msb = x >> 31;
        x = (x << 1) | msb;
        n--;
    }
    return x;
}
```

Using gcc and objdump on an x86 machine, we see that the above loop compiles to the following x86 instruction sequence. On entry to this code, register \( %eax \) contains \( x \) and register \( %ecx \) contains \( n \). Throughout parts (a–d), we will ignore what happens in the `done` label and return statement.

```
loop:    test %ecx,%ecx
         jz  done
         mov %eax,%ebx
         shr $31,%ebx
         shl $1,%eax
         or  %ebx,%eax
         dec %ecx
         jmp loop
done: ...
```

The meanings and instruction lengths of the instructions used above are given in the following table. Registers are denoted with \( R_{SUBSCRIPT} \), register contents with \(<R_{SUBSCRIPT}>\).

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Operation</th>
<th>Length</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>mov $src, $dest</code></td>
<td>(&lt;R_{dest}&gt; = &lt;R_{src}&gt;)</td>
<td>2 bytes</td>
</tr>
</tbody>
</table>
| `test $src1, $src2` | temp = \(<R_{src1}> \& <R_{src2}>\)  
Set flags based on value of temp | 2 bytes|
| `dec $dest`    | \(<R_{dest}> = <R_{dest}> - 1\)                    | 1 byte |
| `shl $imm8, $dest` | \(<R_{dest}> = <R_{dest}> << \text{imm8}\)         | 2 bytes|
Notice that the jump instruction jz (jump if zero) depends on ZF, which is a status flag. Status flags are set by the instruction preceding the jump, based on the result of the computation. Some instructions, like the test instruction, perform a computation and set status flags, but do not return any result. The meanings of the status flags are given in the following table:

<table>
<thead>
<tr>
<th>Name</th>
<th>Purpose</th>
<th>Condition Reported</th>
</tr>
</thead>
<tbody>
<tr>
<td>ZF</td>
<td>Zero</td>
<td>Result is zero</td>
</tr>
</tbody>
</table>

How many bytes is the program? For the above x86 assembly code, how many bytes of instructions need to be fetched if \( x = 0x01020304 \) and \( n = 5 \)? Assuming 32-bit data values, how many bytes of data memory need to be loaded? Stored?

Bytes in program = 7 * 2 bytes + 1 byte = 15 bytes  
Instruction bytes fetched = 5 * 15 bytes (loop iterations) + 4 bytes (final test and jz) = 79 bytes  
Date loaded/stored = 0 bytes

### Problem 1.B RISC

Translate each of the x86 instructions in the following table into zero, one or more RISC-V instructions. Place the loop label where appropriate. You should use the minimum number of instructions needed to translate each x86 instruction. (You are allowed to replace multiple x86 instructions with a single RISC-V instructions). Assume that \( x1 \) contains \( x \) upon entry, and \( x2 \) should receive \( n \). If needed, use \( x4 \) as a condition register, and \( x6, x7, \) etc., for temporaries. You should not need to use any floating-point registers or instructions in your code. A description of the RISC-V instruction set architecture can be found on the class website resources page.

**Note**: It is possible to replace the loop in 1.A with an O(1) non-loop-based solution. For this problem, we want you to use the more inefficient O(N) loop-based solution.

<table>
<thead>
<tr>
<th>x86 instruction</th>
<th>Label</th>
<th>RISC-V instruction sequence</th>
</tr>
</thead>
<tbody>
<tr>
<td>test %ecx,%ecx</td>
<td></td>
<td></td>
</tr>
<tr>
<td>jz done</td>
<td></td>
<td>beq x2, x0 done</td>
</tr>
<tr>
<td>mov $eax,%ebx</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>


<table>
<thead>
<tr>
<th>Instruction</th>
<th>Direct Translation</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>shr $31,%ebx</code></td>
<td><code>srl x6, x1, 31</code></td>
</tr>
<tr>
<td><code>shl $1,%eax</code></td>
<td><code>slli x2, x2, 1</code></td>
</tr>
<tr>
<td><code>or %ebx,%eax</code></td>
<td><code>or x1, x6, x1</code></td>
</tr>
<tr>
<td><code>dec %ecx</code></td>
<td><code>addi x2, x2, -1</code></td>
</tr>
<tr>
<td><code>jmp loop</code></td>
<td><code>jal x0, loop</code></td>
</tr>
</tbody>
</table>

How many bytes is the RISC-V program using your direct translation? How many bytes of RISC-V instructions need to be fetched for \(x = 0x01020304\) and \(n = 5\) with your direct translation? Assuming 32-bit data values, how many bytes of data memory need to be loaded? Stored?

The answer for this question may be slightly different, depending on how exactly students translated their programs.

**Bytes in program** = \(6 \times 4 \text{ bytes} = 24 \text{ bytes}\)

**Instruction bytes fetched** = \(5 \times 24 \text{ bytes (loop iterations)} + 4 \text{ bytes (final beq)} = 124 \text{ bytes}\)

**Bytes loaded/stored** = 0 bytes

---

**Problem 1.C**  
**Stack**

In a stack architecture, all operations occur on top of the stack. Only push and pop access memory, and all other instructions remove their operands from the stack and replace them with the result. The hardware implementation we will assume for this problem set uses stack registers for the topmost two entries; accesses that involve deeper stack positions (e.g., pushing or popping something when the stack has more than two entries) use an extra memory reference. Assume each instruction occupies three bytes if it takes an address or label; other instructions occupy one byte.
### Instruction | Definition
---|---
PUSH addr | load value at addr; push value onto stack
POP addr | pop stack; store value to addr
OR | pop two values from the stack; OR them; push result onto stack
SHL | pop value from top of stack; shift left by 1; push result onto stack
SIGN | pop value from top of stack; shift right by 31; push result onto stack
DEC | pop value from top of stack; decrement value by 1; push result onto stack
BEQZ label | pop value from stack; if it’s zero, branch to label; else, continue with next instruction
BNEZ label | pop value from stack; if it’s not zero, branch to label; else, continue with next instruction
JUMP label | continue execution at location label

Translate the *rotate* loop to the stack ISA. You are permitted to change the sequence of instructions from (a) and (b). Assume that when we reach the loop, \( n \) is at the top of the stack and \( x \) is underneath it. At the end of the loop, the stack should contain only \( x \) at the top. Assume that byte-addressable memory starting at address 0x8000 is available to use as temporary storage. Assume that data values are 32-bits wide.

How many bytes is your program? How many bytes of instructions need to be fetched for \( x = 0x01020304 \) and \( n = 5 \) with your translation? How many bytes of data memory need to be loaded? Stored? Would the number of bytes loaded and stored change if the stack could fit 8 entries in registers?

Answers may be slightly different depending on how students translated their programs.

```
pop 0x8000 # [n,x]
pop 0x8004 # [x]; n is at 0x8000; x is at 0x8004
push 0x8000 # []

loop:
    beqz done # [n]
push 0x8004 # []
shl # [x]
push 0x8004 # [x<<1]
sign # [x, x<<1]
or # [msb(x), x<<1]

pop 0x8004 # [x']
push 0x8000 # []
dec # [n]
jump loop # [n-1]

done:
    push 0x8004 # []
```
# or if the `done:` label needs nothing underneath it you can modify the example in the following way:
- change `done` label everywhere to `tempdone`
- add a label at the end of the program (after the `push 0x8004`) called `done`:

Bytes in program = 10 * 3 bytes (instructions with addresses/labels) + 4 * 1 byte (instructions *without* addresses/labels) = 34 bytes

Instruction bytes fetched = 9 bytes (prologue) + 5 * 22 bytes (loop iterations) + 3 bytes (final `beqz`) + 3 bytes (epilogue) = 125 bytes

Bytes loaded = 1 push * 4 bytes (prologue) + 5 * 3 pushes * 4 bytes (loop iterations) + 1 push * 4 bytes (epilogue) = 68 bytes

Bytes stored = 2 pops * 4 bytes (prologue) + 5 * 1 pop * 4 bytes (loop iterations) = 28 bytes

The solution above doesn't use more than two stack entries, so there would be no change if more stack entries were added. But if your solution does use more than two stack entries, then adding more registers may eliminate implicit loads.

<table>
<thead>
<tr>
<th>Problem 1.D</th>
<th>Accumulator</th>
</tr>
</thead>
</table>

6
In an accumulator ISA, one operand is implicitly a specific register (the same for all instructions), called the accumulator. To make programming easier, we will consider a modified architecture that has a secondary accumulator to hold an additional value. Assume each instruction occupies three bytes if it takes an address or label; other instructions occupy one byte.

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>LOAD addr</td>
<td>load value at addr into the primary accumulator</td>
</tr>
<tr>
<td>STORE addr</td>
<td>store the primary accumulator’s value to addr</td>
</tr>
<tr>
<td>OR addr</td>
<td>OR the value at addr with the value in the primary accumulator</td>
</tr>
<tr>
<td>SHL</td>
<td>left-shift the value in the primary accumulator by one bit</td>
</tr>
<tr>
<td>SIGN</td>
<td>logical right-shift the value in the primary accumulator by 31 bits</td>
</tr>
<tr>
<td>INC</td>
<td>increment the primary accumulator by 1</td>
</tr>
<tr>
<td>DEC</td>
<td>decrement the primary accumulator by 1</td>
</tr>
<tr>
<td>SWAP</td>
<td>swap the values in the primary and secondary accumulators</td>
</tr>
<tr>
<td>ZERO</td>
<td>zero the value in the primary accumulator</td>
</tr>
<tr>
<td>BEQZ label</td>
<td>branch to label if the primary accumulator holds a zero value</td>
</tr>
<tr>
<td>BNEZ label</td>
<td>branch to label if the primary accumulator holds a non-zero value</td>
</tr>
<tr>
<td>JUMP label</td>
<td>continue execution at location label</td>
</tr>
</tbody>
</table>

Notice that all instructions operate on the primary accumulator. Also note that there are no register specifiers in this architecture; addr and label represent memory addresses. Translate the rotate loop to use this ISA. Assume that $x$ initially held at address 0x8000, and $n$ is initially held at address 0x8004. You are permitted to write temporary variables to any addresses above 0x8000. You should return $x$ in the primary accumulator.

How many bytes is your program? How many bytes of instructions need to be fetched for $x = 0x01020304$ and $n = 5$ with your translation? Assuming 32-bit data values, how many bytes of data memory need to be loaded? Stored?

```
LOAD 0x8000 # (p: ?, s: ?) 1
SWAP # (p: x, s: ?) 2
LOAD 0x8004 # (p: ?, s: x) 3
loop:   BEQZ done # (p: n, s: x) 4
DEC # (p: n, s: x) 5
SWAP # (p: n', s: x) 6
SHL # (p: x, s: n-1) 7
STORE 0x8008 # (p: x<<1, s: n') 8
LOAD 0x8000 # (p: x<<1, s: n) 9
SIGN # (p: x, s: n-1) 10
OR 0x8008 # (p: msb(x), s: n') 11
```
STORE 0x8000 # (p: x', s: n') 7

SWAP # (p: x', s: n)
JUMP loop # (p: n', s: x') 8

done: SWAP # (p: n, s: x)

Bytes in program = 8 * 3 bytes (instructions with addresses/labels) + 7 * 1 byte (instructions without addresses/labels) = 31 bytes

Instruction bytes fetched = 7 bytes (prologue) + 5 * 23 bytes (loop iterations) + 3 bytes (final BEQZ) + 1 byte (epilogue) = 126 bytes

Data bytes loaded = 2 * 4 bytes (prologue) + 5 * 2 * 4 bytes (loop iterations) = 48 bytes
Data bytes stored = 5 * 2 * 4 (loop iterations) = 40 bytes

Problem 1.E

Conclusions
In just a few sentences, compare the four ISAs you have studied with respect to code size, number of instructions fetched, and data memory traffic. Which one would you choose if you were to build a specialized processor to execute the code in this program, and why?

- Static code size: CISC < RISC < (Stack ≈ Accumulator)
- Dynamic code size: CISC < (RISC ≈ Stack ≈ Accumulator)
- Data memory traffic: (CISC ≈ RISC) < Accumulator < Stack
  - If your code is not well-matched for a stack machine, even accumulator machines can be more efficient
- We would choose CISC if we wanted to minimize bandwidth and memory storage requirements
  - Another ISA choice is also acceptable if the student provides a reasonable explanation

**Problem 1.F**

<table>
<thead>
<tr>
<th>Optimization</th>
</tr>
</thead>
<tbody>
<tr>
<td>To get more practice with RISC-V, optimize the code from part B so that fewer dynamic instructions are executed on average and the frequency of taken branches is minimized. There are solutions more efficient than simply translating each individual x86 instruction as you did in part (b). Your solution should contain commented assembly code, a brief explanation of your optimizations, and a short analysis of the savings you obtained.</td>
</tr>
</tbody>
</table>

*Note:* It is possible to replace the loop in 1.A with an O(1) non-loop-based solution. For this problem, we want you to use the more inefficient O(N) loop-based solution.

**Common optimizations may include:**
- Loop unrolling
  - Reduces the loop overhead
- Loop inversion: translating the while loop to a do-while loop
  - Eliminates the unconditional jump
Problem 2: Microprogramming and Bus-based Architectures

In this problem, we explore microprogramming by writing microcode for the bus-based implementation of the RISC-V machine described in Handout #1 (Bus-Based RISC-V Implementation). Read the instruction fetch microcode in Table H1-3 of Handout #1. Make sure that you understand how different types of data and control transfers are achieved by setting the appropriate control signals before attempting this problem.

The final solution should be as elegant and efficient as possible with respect to the number of microinstructions used.

Problem 2.A Implementing SUBLEQ

For this problem, you are to implement a new kind of arithmetic instruction, MODULOM. The new instruction has the following format:

MODULOM rd, rs1, rs2

MODULOM performs the following operation: The memory word at the address in rs1 is divided by the memory work at the address in rs2, and the remainder is stored in address in the memory word at the address in rd.

\[ M[rd] \leftarrow M[rs1] \% M[rs2] \]

Your CPU's ALU does not have support for a remainder or a division operation. Fortunately, the modulo operation can also be implemented as a loop, as illustrated below:

```c
unsigned int modulo(unsigned int x, unsigned int y) {
    while (x >= y) x -= y;
    return x;
}
```

This loop is realizable with the microcode of your CPU.

Fill in Worksheet 2.A with the microcode for MODULOM. Use don’t cares (*) for fields where it is safe to use don’t cares. Study the hardware description well, and make sure all your microinstructions are legal.

Please comment your code clearly. If the pseudo-code for a line does not fit in the space provided, or if you have additional comments, you may write in the margins so long as you do it neatly. Your code should exhibit “clean” behavior and not modify rd, rs1, rs2, or other general-purpose architectural registers while executing the instruction.

Finally, make sure that the instruction fetches the next instruction (i.e., by doing a microbranch to FETCH0 as discussed above) once the result has been saved to M[rd].

You may want to consult the microcode found in the micro-coded processor provided in Lab 1, which can be viewed at lab1/generators/riscv-sodor/src/main/scala/
rv32_ucode/microcode.scala for guidance. **Warning:** While that microcode passes all provided assembly tests and benchmarks, no guarantees to the optimality of that code are assured, and there may still be bugs in the provided implementation.

We will accept any reasonable solution, even if it is different from the one on the next page.
<table>
<thead>
<tr>
<th>State</th>
<th>PseudoCode</th>
<th>ldIR</th>
<th>Reg Sel</th>
<th>Reg Wr</th>
<th>en Reg</th>
<th>Id A</th>
<th>Id B</th>
<th>ALUOp</th>
<th>en ALU</th>
<th>Id MA</th>
<th>Mem Wr</th>
<th>en Mem</th>
<th>Im m Sel</th>
<th>en Imm</th>
<th>îBr</th>
<th>Next State</th>
</tr>
</thead>
<tbody>
<tr>
<td>FETCH0:</td>
<td>MA &lt;- PC;</td>
<td>*</td>
<td>PC</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>N</td>
<td>*</td>
<td></td>
</tr>
<tr>
<td></td>
<td>A &lt;- PC</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>IR &lt;- Mem</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>S</td>
<td></td>
</tr>
<tr>
<td></td>
<td>PC &lt;- A+4</td>
<td>0</td>
<td>PC</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>INC_A_4</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>D</td>
<td></td>
</tr>
<tr>
<td>NOP0:</td>
<td>microbranch back to FETCH0</td>
<td></td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>J</td>
<td>FETCH0</td>
<td></td>
<td></td>
</tr>
<tr>
<td>MODULOM 0:</td>
<td>MA &lt;- R[rs1]</td>
<td>0</td>
<td>rs1</td>
<td>0</td>
<td>1</td>
<td>*</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td></td>
<td>A&lt;-Mem</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>S</td>
<td></td>
</tr>
<tr>
<td></td>
<td>MA &lt;- R[rs2]</td>
<td>0</td>
<td>rs2</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td></td>
<td>B&lt;-Mem</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>S</td>
<td></td>
</tr>
<tr>
<td>LOOP</td>
<td>if (A &lt; B) goto DONE</td>
<td>0</td>
<td>rd</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>SLTU</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>N</td>
<td>DONE</td>
</tr>
<tr>
<td></td>
<td>MA &lt;- R[rd]</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>A&lt;-A–B</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>SUB</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>J</td>
<td>LOOP</td>
</tr>
<tr>
<td></td>
<td>goto LOOP</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>DONE</td>
<td>Mem&lt;-A</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>COPY_A</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>S</td>
<td></td>
</tr>
<tr>
<td></td>
<td>goto FETCH0</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>J</td>
<td>FETCH0</td>
</tr>
</tbody>
</table>

Worksheet 2.A
In this question we ask you to implement a useful vector instruction to find the *smallest number in a vector of unsigned integers*. This instruction has the same format as other arithmetic (R-type) instructions in RISC-V:

```
MINV rd, rs1, rs2
```

The MINV instruction takes a pointer to the beginning of a vector in memory (*rs1*) and a pointer to the end of a vector in memory (*rs2*), and it returns in register *rd* the smallest number in that vector. Your code is permitted to modify register *rs1* during the execution of this instruction.

For this problem, each vector element will be a 32-bit unsigned number. You can assume that the address in *rs2* is larger than the address in *rs1*.

Your task is to fill out Worksheet 2.B for the MINV instruction. You should try to optimize your implementation for the minimal number of cycles necessary and for which signals can be set to don’t-cares.

We will accept any reasonable solution, even if it is different from the one on the next page. We will accept solutions for which *rs2* points to the last value in the vector, as well as solutions for which *rs2* points *after* the last value in the vector.
<table>
<thead>
<tr>
<th>State</th>
<th>PseudoCode</th>
<th>ldIR</th>
<th>Reg Sel</th>
<th>Reg Wr</th>
<th>en Reg</th>
<th>ld A</th>
<th>ldB</th>
<th>ALUOp</th>
<th>en ALU</th>
<th>ld MA</th>
<th>Mem Wr</th>
<th>en Mem</th>
<th>Im mSel</th>
<th>en Imm</th>
<th>BR</th>
<th>Next State</th>
</tr>
</thead>
<tbody>
<tr>
<td>FETCH0:</td>
<td>MA &lt;- PC;</td>
<td>*</td>
<td>PC</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>N</td>
<td>*</td>
<td></td>
</tr>
<tr>
<td></td>
<td>A &lt;- PC</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>IR &lt;- Mem</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>S</td>
<td>*</td>
<td></td>
</tr>
<tr>
<td></td>
<td>PC &lt;- A+4</td>
<td>0</td>
<td>PC</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>INC_A_4</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>D</td>
<td></td>
</tr>
<tr>
<td>...</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>NOP0:</td>
<td>microbranch back to FETCH0</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>J</td>
<td>FETCH0</td>
<td></td>
</tr>
<tr>
<td>MINV0:</td>
<td>A &lt;- B</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>COPY_B</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td></td>
<td>A &lt;- A-B</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>SUB</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td></td>
<td>R[rd] &lt;- A-1 (== MAX)</td>
<td>0</td>
<td>rd</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>DEC_A_1</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td>LOOP</td>
<td>A, MA &lt;- R[rs1]</td>
<td>0</td>
<td>rs1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>*</td>
<td></td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td></td>
<td>B &lt;- R[rs2]</td>
<td>0</td>
<td>rs2</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td></td>
<td>If NOT (A &lt; B) goto FETCH0</td>
<td>0</td>
<td>rd</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>SLTU</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>EZ FETCH0</td>
</tr>
<tr>
<td></td>
<td>B &lt;- R[rd]</td>
<td>0</td>
<td>rd</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>SLTU</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>EZ FETCH0</td>
</tr>
<tr>
<td></td>
<td>R[rs1]&lt;-A+4</td>
<td>0</td>
<td>rs1</td>
<td>1</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>INC_A_4</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>N</td>
<td></td>
</tr>
<tr>
<td></td>
<td>A&lt;-MEM</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>*</td>
<td>0</td>
<td>S</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>If NOT (A &lt; B) goto LOOP</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>SLTU</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>EZ LOOP</td>
</tr>
<tr>
<td></td>
<td>R[rd] &lt;- A</td>
<td>0</td>
<td>rd</td>
<td>1</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>*</td>
<td>0</td>
<td>J</td>
<td>LOOP</td>
</tr>
<tr>
<td></td>
<td>goto LOOP</td>
<td>0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Problem 2.C

How many cycles does it take to execute the following instructions on the microcoded RISC-V implementation? Use the states and control signals from Handout #1 (or Lab 1, in lab1/generators/riscv-sodor/src/main/scala/rv32_ucode/microcode.scala) and assume that memory does not assert its busy signal.

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Cycles</th>
<th>Summary (not including fetch and dispatch)</th>
</tr>
</thead>
<tbody>
<tr>
<td>SUB x3,x2,x1</td>
<td>3+3=6</td>
<td>1) A←R[x2]; 2) B←R[x1]; 3) R[x3]←A−B</td>
</tr>
<tr>
<td>ANDI x2,x1,#4</td>
<td>3+3=6</td>
<td>1) A←R[x1]; 2) B←Imm; 3) R[x2]←A&amp;1 ∧ B</td>
</tr>
<tr>
<td>LW x1,0(x2)</td>
<td>3+5=8</td>
<td>1) A←R[x1]; 2) B←Imm; 3) A⟩B</td>
</tr>
<tr>
<td>BNE x1,x2,label #(x1 == x2)</td>
<td>3+4=7</td>
<td>1) A←R[x1]; 2) B←R[x2]; 3) MA←A+B; 4) R[x1] ← Mem; 5) μBr J FETCH0^2</td>
</tr>
<tr>
<td>BNE x1,x2,label #(x1 != x2)</td>
<td>3+3+4 = 10</td>
<td>4) A←PC; 5) A←A−4; 6) PC←A+B</td>
</tr>
<tr>
<td>BEQ x1,x2,label #(x1 != x2)</td>
<td>3+4=7</td>
<td>1) A←R[x1]; 2) B←R[x2]; 3) A - B; μBr EZ FETCH0; B ← Imm^4</td>
</tr>
<tr>
<td>BEQ x1,x2,label #(x1 == x2)</td>
<td>3+3+4 = 10</td>
<td>4) A←PC; 5) A←A−4; 6) PC←A+B</td>
</tr>
<tr>
<td>J label</td>
<td>3+6=9</td>
<td>Same as above</td>
</tr>
<tr>
<td>JAL label</td>
<td>3+6=9</td>
<td>Same as above</td>
</tr>
<tr>
<td>JALR x1</td>
<td>3+6=9</td>
<td>Same as above</td>
</tr>
<tr>
<td>AUIPC x1, #128</td>
<td>3+4=7</td>
<td>Same as above</td>
</tr>
</tbody>
</table>
Terminal microinstructions are assumed to have a $\mu$Br J back to FETCH0 unless stated otherwise.

These operations were not provided in the handout, but it is reasonable to assume they can occur in a single ALU op.

The wording of the question is ambiguous as to whether “memory does not assert its busy signal” implies that the machine has guaranteed single-cycle memory versus you can exclude stall cycles for purposes of the cycle accounting. The former would permit the $\mu$Br to occur in parallel with the load. The safer, intended answer assumes the latter case and separates the $\mu$Br and the memory op.

The A register contains PC after fetch, whereas PC is speculatively set to PC+4. Thus, we reuse A to speed up AUIPC and JAL instructions, eliding the need to load the PC into A and decrement it by 4. (Conversely, this cannot be avoided in taken conditional branches.)

Speculatively load the branch offset into B to shave a cycle off a taken conditional branch. This will just be discarded in fetch if the branch is not taken.

J is encoded as JAL with $rd=x0$.

Which instruction takes the most cycles to execute? Which instruction takes the fewest cycles to execute?

Most cycles: Taken branch (BEQ, BNE)
Fewest cycles: Arithmetic operations (SUB, ANDI, etc.)
Problem 3: 6-Stage Pipeline

In this problem, we consider a modification to the fully bypassed 5-stage RISC-V processor pipeline originally presented in Lecture 3 and further expanded on in Handout #1 (RV32I 5-Stage Pipeline Diagram). Our new processor has a data cache with a two-cycle latency. To accommodate this cache, the memory stage is pipelined into two stages, M1 and M2, as shown in Figure 1-A. Additional bypasses are added to keep the pipeline fully bypassed.

Suppose we are implementing this 6-stage pipeline in a technology in which register file ports are inexpensive, but bypasses are costly. We wish to reduce cost by removing some of the bypass paths, but without increasing CPI. The proposal is for all integer arithmetic instructions to write their results to the register file at the end of the Execute stage, rather than waiting until the Writeback stage. A second register file write port is added for this purpose. Remember that register file writes occur on each rising clock edge, and values can be read in the next clock cycle. The proposed change is shown in Figure 1-B.

In this problem, assume that the only exceptions that can occur in this pipeline are illegal opcodes (detected in the Decode stage) and invalid memory address (detected at the start of the M2 stage). Additionally, assume that the control logic is optimized to stall only when necessary. You may ignore branch and jump instructions in this problem.

![Figure 1-A](image1.png)
Figure 1-A. 6-stage pipeline. For clarity, bypass paths are not shown. Handout #1 (RV32I 5-Stage Pipeline Diagram) shows the full pipeline diagram.

![Figure 1-B](image2.png)
Figure 1-B. 6-stage pipeline with proposed additional write port.
Problem 3.A

The second write port allows some bypass paths to be removed without adding stalls in the decode stage. Explain how the second write port improves performance by eliminating such stalls and give a short code sequence that would have required an interlock to execute correctly with only a single write port and with the same bypass paths removed.

The second write port improves performance by resolving some RAW hazards earlier than they would be if ALU operations had to wait until writeback to provide their results to subsequent dependent instructions. It would help with the following instruction sequence:

```
add x1, x2, x3
add x4, x5, x6
add x7, x1, x9
```

The important insight is that the second write port cannot resolve data hazards for immediately back-to-back instructions. An arithmetic instruction in the EX stage writes back as it leaves the EX stage; therefore, the bypass path is necessary if the next instruction has a RAW dependency and is allowed to leave the ID stage.

Problem 3.B

After the second write port is added, which bypass paths can be removed in this new pipeline without introducing additional stalls? List each removed bypass individually. Are any new hazards added to the pipeline due to the earlier writeback of arithmetic instructions?

The bypass path from the end of M1 to the end of ID can be removed. (Credit was also given for the bypass path from the beginning of M2 to the beginning of EX, since these are equivalent.) Additionally, ALU results no longer must be bypassed from the end of M2 or the end of WB, but these bypass paths are still used to forward load results to earlier stages.

There are multiple potential WAW hazards that must be appropriately addressed by the control logic. The two instructions writing at the same time must be appropriately prioritized. Also, if an arithmetic instruction is in M1 and a load with the same destination register is in M2, the write of the earlier load can clobber the result of the older instruction, leading to an incorrect architectural state. The control logic needs to be modified to handle these situations by suppressing the writes of older instructions when they conflict with the writes of newer instructions.
Without further modifications, this pipeline may not support precise exceptions. Briefly explain why and provide a minimal code sequence that will result in an imprecise exception.

Illegal address exceptions are not detected until the start of the M2 stage. Since writebacks can occur at the end of the EX stage, it is possible for an arithmetic instruction following a memory access to an illegal address to have written its value back before the exception is detected, resulting in an imprecise exception. For example:

```
```
lw x1, -1(x0) # address -1 is misaligned
add x2, x3, x4 # x2 will be overwritten, but last instruction has faulted!
```

Describe how precise exceptions can be implemented by adding a new interlock. Provide a minimal code sequence that would engage this interlock. Qualitatively, what is the performance impact of this solution?

Stall any ALU op in the ID stage if the instruction in the EX stage is a load or a store. The instruction sequence above engages this interlock. Loads and stores account for about 1/3 of dynamic instructions. Assuming that the instruction following a load or store is an arithmetic instruction 2/3 of the time, and ignoring the existing load-use delay, this solution will increase the CPI by (1/3)*(2/3) = 2/9. However, only a qualitative explanation was necessary for credit.

Suppose you are additionally given the budget to add a new register file read port. Propose an alternative solution to implement precise exceptions in this pipeline without requiring any new interlocks.

In addition to writing an arithmetic instruction’s destination register in the EX stage, also read its previous value and carry it down the pipeline. If an early writeback occurs before a preceding exception was detected, then the old value of rd is preserved in the M1 pipeline register and can be restored to the register file, maintaining precise state. Note: It is better to read the previous value as late as possible, otherwise this read of rd might need an extra bypass path for the following instruction sequence:

```
ld x1, 0(x8)
```
```
ld x2, -1(x8) # misaligned
addi x1, x1, 4
```

This also depends on the interlocks used to resolve the WAW hazard mentioned in 3.B.
Problem 4: CISC vs RISC

For each of the following questions, select either *CISC* or *RISC*, depending on which ISA you feel would be best suited for the situation described. Also, briefly *explain your reasoning*.

### Problem 4.A  
Lack of Good Compilers I

Assume that compiler technology is poor, and therefore your users are far more apt to write all their code in assembly. A _____ ISA would be best appreciated by these programmers.

**CISC**

CISC ISAs provided more complex, higher-level instructions such as string manipulation instructions and special addressing modes convenient for indexing tables (say for your company’s payroll application). Two example CISC instructions: “DBcc: Test Condition, Decrement, and Branch” and “CMP2: Compare Register against Upper and Lower Bounds”. This made life easy if you stared at assembly all day and could not hide behind convenient software abstractions/subroutines!

OR

**RISC**

A streamlined RISC ISA is far simpler for an assembly programmer to fully understand and reason about than all the idiosyncrasies that CISC ISAs tend to have, such as the variety of complex instructions for narrow use cases and the myriad addressing modes.

### Problem 4.B  
Lack of Good Compilers II

You desire to make compilers better at targeting your *yet-to-be-designed* machine. Therefore, you choose a _____ ISA, as it would be easiest for a compiler to target, thus allowing your users to write code in higher-level languages like C and Fortran and raise their productivity.

**CISC**

Compilers had difficulty targeting CISC ISAs in part because the complicated instructions have many difficult and hard to analyze side-effects. A load-store/register-register RISC ISA which limits side-effects to a single register or memory location per instruction is relatively easy for a compiler to understand, analyze, and schedule code for.

### Problem 4.C  
Fast Logic, Slow Memory
Assume that CPU logic is fast, very fast, while instruction fetch accesses are at least 10x slower (suppose you are the lead architect of the “709”). Which ISA style do you choose as a best match for the hardware’s limitations?

<table>
<thead>
<tr>
<th>CISC</th>
<th>RISC</th>
</tr>
</thead>
</table>

When instruction fetch takes 10x longer than a CPU logic operation, you are going to want to push as much compute as you can into each instruction! Certain especially complex CISC instructions can encode tens, even hundreds of equivalent RISC instructions. For example, a CISC instruction which performs a single expensive, multi-cycle string routine in hardware would be considerably faster than even an optimized RISC implementation that would need a loop with a series of loads, stores, and arithmetic instructions in the loop body.

**Problem 4.D**

Starting with a clean slate in the year 2023 (area/logic/memory is cheap), you think that a _____ ISA that would lend itself best to a very high-performance processor (e.g., high frequency, highly pipelined).

<table>
<thead>
<tr>
<th>CISC</th>
<th>RISC</th>
</tr>
</thead>
</table>

Because RISC instructions tend to have simple, easy to analyze side-effects, they lend themselves more readily to pipelined micro-architectures which dynamically check for dependencies between instructions and interlock or bypass when dependencies arise. And because little work needs to be performed in each stage, the pipeline can be clocked at very high frequencies.

This advantage is evident in modern micro-architectures of old CISC ISAs: The frontend of the processor typically has a decoder which translates CISC instructions (e.g., x86 instructions) into RISC “micro-ops”, which a high-performance pipeline can then dynamically schedule for maximum performance.

For these CISC architectures such as x86 and IBM S/360, they are still around for legacy reasons. But if you had a chance at a clean slate, you would probably prefer a clean RISC implementation with a direct translation to the micro-architecture instead of using area and power on a CISC decoder front-end (not to mention the additional complexity forced on your memory system to handle the odd CISC addressing modes).
**Problem 5: Iron Law of Processor Performance**

Mark whether the following modifications will cause each of the *first three* categories to **increase**, **decrease**, or whether the modification will have **no effect**. Explain your reasoning in the within the box provided.

For the final column “Overall Performance”, mark whether the following modifications **increase**, **decrease**, have **no effect**, or whether the modification will have an **ambiguous** effect. Explain your reasoning. If the modification has an ambiguous effect, describe the tradeoff in which it would be a beneficial modification or in which it would a detrimental modification (i.e., as an engineer would you suggest using the modification or not and why?).

<table>
<thead>
<tr>
<th></th>
<th>Instructions / Program</th>
<th>Cycles / Instruction</th>
<th>Seconds / Cycle</th>
<th>Overall Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>a) Adding a branch delay slot</td>
<td>Increase: NOPs must be inserted when the branch delay slot cannot be usefully filled</td>
<td>Decrease: Some control hazards are eliminated; also additional NOPs execute quickly because they have no data hazards.</td>
<td>No effect: will not meaningfully change the pipeline. ALSO ACCEPT: Decrease because no branch kill</td>
<td>Ambiguous: Depends on the program and how often the delay slot can be filled with useful work</td>
</tr>
<tr>
<td>b) Adding a complex instruction</td>
<td>Decrease: if the added instruction can replace a sequence of instructions.</td>
<td>Increase: implementing the instruction can mean adding stages or making stages have more complex control logic.</td>
<td>Increase: more control logic and interlocks will often increase the critical path. ALSO ACCEPT: No effect</td>
<td>Ambiguous: if the program can take advantage of the new instruction, it can be worth the cost. This is a hard decision for an ISA designer to make!</td>
</tr>
<tr>
<td>c) Reduce number of registers in the ISA</td>
<td>Increase: Values will more frequently be spilled to the stack, increasing number of loads and stores</td>
<td>Increase: more loads followed by dependent instructions will cause more stalls. Memory latency is hard to schedule around.</td>
<td>Decrease: fewer registers means shorter register file access time</td>
<td>Ambiguous: if the program uses few registers and thus spills rarely to memory, the faster reg. access times may win out. Also, your instructions may be able to be shorter, improving amongst other things code density</td>
</tr>
<tr>
<td></td>
<td>d) Improving memory access speed</td>
<td>No effect: since instructions make no assumption about memory speed.</td>
<td>Decrease: programs will spend less time stalled waiting for memory</td>
<td>Decrease: if memory access is on the critical path or memory was 1 cycle. ALSO ACCEPT: No effect: if memory is pipelined and just takes less cycles.</td>
</tr>
<tr>
<td></td>
<td>e) Adding 16-bit versions of the most common instructions in RISC-V (normally 32 bits in length) to the ISA (i.e., make RISC-V a variable-length ISA)</td>
<td>No effect: The actual number of instructions is unchanged.</td>
<td>Decrease: since code size has shrunk, there will be fewer instruction cache (I$) misses and less time spent waiting to fetch</td>
<td>Increase: decode becomes more complex with more formats, and instruction fetch must deal with misalignment.</td>
</tr>
<tr>
<td></td>
<td>f) For a given CISC ISA, changing the implementation of the micro-architecture from a microcoded engine to a RISC pipeline (with a CISC-to-RISC decoder on the frontend)</td>
<td>No effect: Since the ISA is not changing, the binary does not change, and thus there is no change to Inst/Program.</td>
<td>Decrease: Microcoded machines take several clock cycles to execute an instruction, while the RISC pipeline should have a CPI near 1 (thanks to pipelining).</td>
<td>No effect: the amount of work done in one pipeline stage and one microcode cycle are about the same. ALSO ACCEPT: Increase: the RISC pipeline introduces longer control paths and adds bypasses, which are likely to be on the critical path.</td>
</tr>
</tbody>
</table>