Directed Questions
Background
The RISC-V Instruction Set Architecture
The processor cores featured in this lab implement the RISC-V ISA, developed at UC Berkeley for use in education, research, and industry.
A complete software toolchain is pre-installed on the lab machines. Note that the GNU utilities are prefixed with the target triplet[1] (riscv32-unknown-elf)[2] but otherwise function similarly to the native binutils and gcc counterparts that may be familiar to you. The components most relevant to this lab are:
- riscv32-unknown-elf-gcc: GNU cross-compiler for C
- riscv32-unknown-elf-objdump: GNU disassembler for RISC-V machine code
- spike: Functional ISA simulator which serves as the de facto golden reference for the RISC-V ISA. Since it is not a cycle-accurate model, it cannot be relied on for performance measurements, but it can execute software much more quickly than an RTL simulator to verify correctness.
Chipyard
This lab, as well as subsequent CS 152 labs, is based on the Chipyard framework being actively developed at UC Berkeley.
Chipyard is an integrated design, simulation, and implementation framework for agile development of systems-on-chip (SoCs). It combines Chisel, the Rocket Chip generator, and other Berkeley projects to produce a full-featured RISC-V SoC from a rich library of processor cores, accelerators, memory system components, and I/O peripherals. Chipyard supports several hardware development flows, including software RTL simulation, FPGA-accelerated simulation (FireSim), and automated VLSI methodologies (Hammer).
Chisel
Chisel is a hardware design language developed at UC Berkeley that facilitates advanced circuit generation and design reuse for digital logic designs.
Chisel adds hardware construction primitives to the Scala programming language, providing designers with higher-level features such as object orientation, functional programming, parameterized types, and type inference to write complex, parameterizable hardware generators that produce synthesizable Verilog. This generator methodology enables the creation of re-usable components and libraries, raising the level of abstraction in design while retaining fine-grained control. A Chisel design is essentially a legal Scala program whose execution emits low-level RTL code, which can then be mapped to ASICs, FPGAs, or cycle-accurate software simulators such as VCS and Verilator.
Chisel in This Lab
The “Sodor” instructional cores in this lab are implemented using the Chisel HDL according to the generator design methodology. In this lab, you will compile these Chisel-based processors into software simulators using Verilator and run cycle-accurate experiments on instruction mixes and pipeline hazards.
Students will not be required to write Chisel code as part of this lab, beyond adding and modifying parameters as directed.
Directed Portion (30% of lab grade)
Terminology and Conventions
Throughout this course, the term host refers to the machine on which the simulation runs, while target refers to the machine being simulated. For this lab, an instructional server will act as the host, and the RISC-V processors will be the target machines.
Setup
To complete this lab, ssh into an instructional server with the instructional computing account provided to you.[3] The lab infrastructure has been set up to run on the eda-{1..4}.eecs.berkeley.edu machines (eda-1.eecs.berkeley.edu, etc.). Check https://hivemind.eecs.berkeley.edu/ to log in to a machine with a lower load.
Run these commands to install conda. When the bash command prompts you, press Enter and answer 'yes'.
wget -O Miniforge3.sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3.sh -p ~/conda
source ~/conda/etc/profile.d/conda.sh
Run these commands to install and build Chipyard. The build-setup.sh script clones all the git submodules of the various Chipyard components. Make sure to copy the ./build-setup.sh command as one line. This step is expected to take several minutes. We recommend you run this step in tmux[4] or screen (both already installed on the eda machines) and find something else to do as Chipyard builds. For the remainder of the lab, it will be beneficial to use tmux or screen, as some processor builds and simulations can take a few minutes to run.
git clone https://github.com/ucb-bar/chipyard
cd chipyard
git checkout main
git reset --hard 150f888
ln -s /home/ff/cs152/sp26/chipyard/.conda-env .conda-env
conda activate .conda-env/
./build-setup.sh riscv-tools --skip-conda --skip-toolchain --skip-circt --skip-firesim --skip-marshal
Lastly, run these commands to set up your shell environment.
CHIPYARDROOT=$PWD
BMARKS=$CHIPYARDROOT/generators/riscv-sodor/riscv-bmarks
SCRIPTS=$CHIPYARDROOT/generators/riscv-sodor/scripts
source ./env.sh
You have now set up your environment for Lab 1!
Every time you open a new shell, you must run these commands (you do not need to go through the full setup):
cd ~/$USER
source conda/etc/profile.d/conda.sh
cd chipyard
conda activate .conda-env/
CHIPYARDROOT=$PWD
BMARKS=$CHIPYARDROOT/generators/riscv-sodor/riscv-bmarks
SCRIPTS=$CHIPYARDROOT/generators/riscv-sodor/scripts
source ./env.sh
The remainder of this exercise will use ${CHIPYARDROOT} to denote the path of the working tree. Its directory structure is outlined below:
${CHIPYARDROOT}/
├── generators/ # Chisel source code for cores/caches/peripherals/etc.
│ └── riscv-sodor/ # Sodor sources and utilities
│ ├── src/main/scala/sodor/
│ │ ├── common/ # Common source code shared between all Sodor cores
│ │ ├── rv32_1stage/ # Source code for the 1-stage core
│ │ ├── rv32_2stage/ # Source code for the 2-stage core
│ │ ├── rv32_3stage/ # Source code for the 3-stage core
│ │ ├── rv32_5stage/ # Source code for the 5-stage core
│ │ └── rv32_ucode/ # Source code for the microcoded core
│ ├── riscv-bmarks/ # Pre-compiled benchmark binaries
│ └── scripts/ # Python scripts for analyzing Sodor traces
├── test/
│ ├── custom-tests/ # Stub for open-ended question 4.2
│ └── custom-bmarks/ # Stub for open-ended question 4.1
├── scripts/ # Contains repo initialization script
└── sims/
└── verilator/ # Verilator simulation directory
├── generated-src/ # Generated Verilog after Chisel elaboration
└── output/ # Simulation traces are logged here
Of particular note is that the Chisel source code for the processors can be found under ${CHIPYARDROOT}/generators/riscv-sodor/ (see the directory structure above). While you do not need to understand the code to do this assignment, it may be interesting to examine the internals of a processor. Although it is not recommended that you alter any of the processors while collecting data from them in the directed lab portion (except as instructed), feel free in your own time (or perhaps as part of the open-ended portion) to modify the processors as you see fit.
First Steps: Building and Simulating the 1-Stage Processor
The lab repository contains five different cores: 1/2/3/5-stage pipelines and a microcoded processor.
Building the 1-stage Processor
Now we will build a 1-stage RISC-V processor. The first run of sbt may take some time since it must fetch various Scala dependencies. Run the following commands to build the 1-stage processor.[5]
cd ${CHIPYARDROOT}/sims/verilator
make CONFIG=Sodor1StageConfig
The make command orchestrates the following steps:
- Start sbt (the Scala Build Tool), select the Sodor1StageConfig config, and compile and run the Chisel code, which generates a Verilog RTL description of the processor. The generated Verilog code can be found in the generated-src/ directory (see the directory structure above).
- Run verilator, an open-source tool that converts Verilog into a C++ cycle-accurate simulation model.
- Compile the Verilator-generated C++ code into an x86 executable.
Simulating the 1-stage Processor
The RISC-V frontend server (fesvr) reads a RISC-V ELF binary from the host filesystem, starts the target system simulator, and populates the target system memory with the given ELF program segments. Once fesvr finishes loading the binary, it releases the target system from reset, and the simulated processor then begins execution at the reset vector PC. Here, the test protocol is the standard RISC-V debug module interface. Run the following commands to run a simulation of the Sodor 1-stage processor running the Towers of Hanoi benchmark.
cd ${CHIPYARDROOT}/sims/verilator
make CONFIG=Sodor1StageConfig run-binary BINARY=${BMARKS}/towers.riscv
The simulation should print the cycle count (mcycle) and instruction count (minstret) upon completion. If any benchmarks fail to complete and print mcycle and minstret, verify that you are running on a recommended instructional machine. Otherwise, contact your TA.
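CPI and IPC follow directly from these two counters. A minimal Python sketch, pre-filled with the towers example counts shown later in this handout (substitute the numbers your own simulation prints):

```python
# CPI/IPC from the two counters the simulation prints at completion.
# These counts are the towers example values from this handout;
# replace them with your own mcycle/minstret output.
mcycle = 24180    # cycles elapsed (mcycle CSR)
minstret = 24180  # instructions retired (minstret CSR)

cpi = mcycle / minstret
ipc = minstret / mcycle
print(f"CPI = {cpi:.3f}, IPC = {ipc:.3f}")
```

You will repeat this calculation for every processor/benchmark pair in the sections below.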
Building Other Processors
To select a different processor design point, simply change the CONFIG= key of the make command. Valid options are listed in Table 1.
| Config |
|---|
| Sodor1StageConfig |
| Sodor2StageConfig |
| Sodor3StageConfig |
| Sodor5StageConfig |
| SodorUCodeConfig |

Table 1: The configs available in this lab.
Dumping Waveforms for Debugging
In the very unlikely scenario that you need to debug what you suspect to be an RTL bug, VCD-formatted waveforms can be obtained by running make run-binary-debug instead of the usual make run-binary command. Open the resulting files in a waveform viewer such as GTKWave (http://gtkwave.sourceforge.net/).
Tracing Instruction Mixes Using the 1-Stage Processor
For this section of the lab, you will look at the instruction mixes of several RISC-V benchmark programs provided to you. In the $BMARKS directory, there are a variety of RISC-V benchmarks: dhrystone, median, multiply, qsort, rsort, towers, and vvadd. Make sure you read each section carefully so that you only run the required benchmarks. We wish to avoid overloading the eda servers with unnecessary simulations.
Using your method of choice, inspect the output files generated by make run-binary after running each of these benchmarks. You can view the output of the towers benchmark you ran earlier using vim like so:
cd ${CHIPYARDROOT}/sims/verilator
vim output/chipyard.harness.TestHarness.Sodor1StageConfig/towers.out
The processor commit state is logged to the output trace file on every cycle. riscv-sodor provides Python scripts in ${SCRIPTS} that analyze the contents of the emitted trace file and generate basic statistics. However, you will need to fix some bugs to get accurate statistics.
Line 914 of instructions.py
ARITH_OPCODES = {LUI, AUIPC, ADDI, SLLI_RV32, SLTI, SLTIU, XORI, SRLI_RV32, SRAI_RV32, ORI, ANDI,
Modified:
ARITH_OPCODES = {LUI, AUIPC, ADDI, SLLI, SLTI, SLTIU, XORI, SRLI, SRAI, ORI, ANDI,
Line 146 of tracer.py
n_cycles = cycle - start_cycle
Modified:
n_cycles = cycle - start_cycle + 1
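The reason for the + 1 is that the trace spans cycles start_cycle through cycle inclusive, so the raw difference under-counts by one. A tiny illustration:

```python
# If commits are logged on cycles 10, 11, 12, 13, and 14, the trace
# covers five cycles, but the raw difference reports only four.
start_cycle, cycle = 10, 14
n_cycles_buggy = cycle - start_cycle      # misses one endpoint
n_cycles_fixed = cycle - start_cycle + 1  # counts both endpoints
print(n_cycles_buggy, n_cycles_fixed)
```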
You will also be modifying tracer.py in Problem 3.8, so while you run the rest of your simulations, take some time to review the structure of instructions.py and tracer.py.
Run the following commands to view the statistics.
cd ${CHIPYARDROOT}/sims/verilator
${SCRIPTS}/tracer.py output/chipyard.harness.TestHarness.Sodor1StageConfig/towers.out
Example Stats for towers benchmark:
Stats:
CPI : 1.000
IPC : 1.000
Cycles : 24180
Instructions : 24180
Bubbles : 0
Instruction Breakdown:
% Arithmetic : 38.209 %
% Ld/St : 42.994 %
% Branch/Jump : 18.218 %
% Misc. : 0.579 %
Run the simulations and record the statistics and instruction mix for the following benchmarks only: median, multiply, towers, vvadd. (Remember: Do not provide raw dumps. A good way to visualize this kind of data would be a bar graph.)
Then, answer the following questions:
- Which benchmark has the highest arithmetic intensity?
- Which benchmark seems most likely to be memory bound?
- Which benchmark seems most likely to be dependent on branch predictor performance?
CPI Analysis Using the 1-Stage Processor
Consider the results gathered from the RV32 1-stage processor. Suppose you were to design a new machine such that the average CPI of loads and stores is 2 cycles, integer arithmetic instructions take 1 cycle, and other instructions take 1.5 cycles on average. What is the overall CPI of the machine for each benchmark?
What is the relative performance for each benchmark if loads/stores are sped up to have an average CPI of 1? Is this still a worthwhile modification if it means that the cycle time increases 30%? Is it worthwhile for all benchmarks or only a subset? Explain.
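As a sketch of the arithmetic involved, here is the weighted-CPI calculation applied to the example towers instruction mix shown earlier. The percentages come from the sample statistics above; plug in your own measured mix for each benchmark.

```python
# Weighted CPI: sum over instruction classes of (fraction x class CPI).
# Mix fractions are the example towers statistics from this handout.
mix = {"arith": 0.38209, "ldst": 0.42994, "branch_jump": 0.18218, "misc": 0.00579}

def overall_cpi(cpi_ldst):
    # Arithmetic takes 1 cycle, loads/stores take cpi_ldst cycles,
    # and all other instructions average 1.5 cycles.
    return (mix["arith"] * 1.0
            + mix["ldst"] * cpi_ldst
            + (mix["branch_jump"] + mix["misc"]) * 1.5)

cpi_base = overall_cpi(2.0)  # proposed machine: 2-cycle loads/stores
cpi_fast = overall_cpi(1.0)  # sped-up machine: 1-cycle loads/stores
speedup_same_clock = cpi_base / cpi_fast
speedup_slower_clock = speedup_same_clock / 1.30  # 30% longer cycle time
print(f"{cpi_base:.3f} {cpi_fast:.3f} "
      f"{speedup_same_clock:.3f} {speedup_slower_clock:.3f}")
```

The modification remains worthwhile for a benchmark only if the speedup after the 30% cycle-time penalty stays above 1; load/store-heavy mixes benefit the most.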
CPI Analysis Using the 5-Stage Processor
For this section, we will analyze the effects of branching and bypassing in a 5-stage processor.[6]
The 5-stage processor has been parameterized to support both fully-bypassed (though it must still stall for load-use hazards) and fully-interlocked configurations. The fully-interlocked variant performs no bypassing and instead must stall (interlock) the instruction fetch and decode stages until all hazards have been resolved.
First, we verify that full bypassing is enabled in the design. Navigate to the Chisel source code:
cd ${CHIPYARDROOT}/generators/riscv-sodor/src/main/scala/sodor/rv32_5stage
vim consts.scala
The consts.scala file defines constants and compile-time parameters for the processor. Observe that the parameter on line 21 is val USE_FULL_BYPASSING = true. You can see how this parameter changes the pipeline by referring to the data path (dpath.scala, lines 269-301) and the control path (cpath.scala, lines 226-245). The data path instantiates the bypass muxes when full bypassing is activated. The control path contains the stall logic, which must account for more situations when no bypassing is selected.
As with the 1-stage processor, build the processor with the default behavior of bypassing enabled.
cd ${CHIPYARDROOT}/sims/verilator
make CONFIG=Sodor5StageConfig
Run the simulations (make sure to use the Sodor5StageConfig) and record the instruction mix for the following benchmarks only: median, multiply, towers, vvadd.
Before proceeding, move the benchmark .out files into a new folder, as they will be overwritten in the next step. You will need the bypassed output files for Problem 3.8.
cd ${CHIPYARDROOT}/sims/verilator
mv output/chipyard.harness.TestHarness.Sodor5StageConfig/ output/chipyard.harness.TestHarness.Sodor5StageConfigBypassed/
Now disable full bypassing in consts.scala, and re-run the build (check that your Chisel code recompiles).
Record the new CPI values for the benchmarks. How does full bypassing perform compared to full interlocking? If adding full bypassing would hurt the cycle time of the processor by 25%, would it be worth it? Argue your case quantitatively.
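One way to frame the quantitative argument is to compare total execution time, since time = instructions x CPI x cycle time and the instruction count is identical for both builds. The CPI values below are hypothetical placeholders, not measured results; substitute the numbers from your two builds.

```python
# Execution time = instructions x CPI x cycle time; the binary (and hence
# the instruction count) is the same for both builds, so it cancels out.
# CPI values here are HYPOTHETICAL placeholders for your measurements.
cpi_bypassed = 1.20     # placeholder: measured CPI with full bypassing
cpi_interlocked = 1.60  # placeholder: measured CPI with interlocks only

t_interlocked = cpi_interlocked * 1.00  # baseline cycle time
t_bypassed = cpi_bypassed * 1.25        # bypassing lengthens the cycle by 25%
print("bypassing worthwhile:", t_bypassed < t_interlocked)
```

With these placeholder numbers bypassing still wins, but the conclusion may differ per benchmark once you plug in real CPIs.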
Before proceeding to the next step, make sure to reset val USE_FULL_BYPASSING = true.
Bypass Path Analysis
In this section, we will analyze the effect of bypass paths on CPI. As mentioned in the previous section, the code for enabling bypassing is located in the scala files at ${CHIPYARDROOT}/generators/riscv-sodor/src/main/scala/sodor/rv32_5stage.
You will need to modify both the data path in dpath.scala (lines 269-301) and the control path in cpath.scala (lines 226-245). After modifying the pipeline, ensure that your processor passes the assembly tests!
To run assembly tests, you can replace make run-binary with make run-asm-tests. For example, to run the assembly tests on the Sodor 5-stage configuration, you can run make run-asm-tests CONFIG=Sodor5StageConfig. This may take a while, so you can also run commands that only run one assembly test.
Analyze the CPI effects of removing the exe_alu_out $\rightarrow$ dec_op1_data bypass path and the exe_alu_out $\rightarrow$ dec_op2_data bypass path. After modifying the scala code to remove each of the bypass paths, make sure to run the following benchmarks: median, multiply, towers, vvadd.
After collecting the CPI data, answer the following questions:
- Which bypass path had the worst impact on CPI performance when removed? Compare it to the other bypass path.
- Why do you think this bypass path causes the bigger impact on CPI performance? Think about the order of instructions in common operations in RISC-V assembly code.
- What may be the advantages and disadvantages of removing a bypass path in the real world?
Design Problem Using the 5-Stage Processor
Imagine that you are being asked by your employer to evaluate a potential modification to the design of a 5-stage RISC-V pipeline. The proposed modification is that the Execute / Address Calculation stage and the Memory Access stage be merged into a single pipeline stage. In this combined stage, the ALU and Memory will operate in parallel. Data access instructions will use memory while leaving the ALU idle, and arithmetic instructions will use the ALU while leaving memory idle. These changes are beneficial in terms of area and power efficiency. Think to yourself why this is the case, and if you are still unsure, ask about it in discussion section or office hours.
In RISC-V, the effective address of a load or store is calculated by summing the contents of one register (rs1) with an immediate value (imm).
The problem with the new design is that there is now no way to perform any address calculation in the middle of a load or store instruction since loads and stores do not get to access the ALU. Proponents of the new design advocate changing the ISA to allow only one addressing mode: register direct addressing. Only one source register is used, and the value it contains is the memory address to be accessed. No offset can be specified.
In RISC-V, the only way to perform register direct addressing is register-immediate address calculation with imm == 0.
With the proposed design, any load or store instruction that uses register-immediate addressing with imm != 0 will take two instructions. First, the register and immediate values must be summed with an add instruction, and then this calculated address can be loaded from or stored to in the next instruction. Load and store instructions which currently use an offset of zero will not require extra instructions on the new design.
Your job is to determine the percentage increase in the total number of instructions that would have to be executed under the new design. This will require a more detailed analysis of the different types of loads and stores executed by our benchmark codes.
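The shape of that calculation can be sketched as follows. The counts here are hypothetical; your modified tracer in the next step will produce the real numbers for each benchmark.

```python
# Each load/store with a nonzero offset costs one extra add instruction
# under the proposed design; zero-offset loads/stores are unaffected.
# Counts below are HYPOTHETICAL; use your tracer's output.
total_instructions = 100_000
ldst_nonzero_offset = 20_000  # loads/stores with imm != 0

new_total = total_instructions + ldst_nonzero_offset
pct_increase = 100.0 * ldst_nonzero_offset / total_instructions
print(f"instruction count grows by {pct_increase:.1f}%")
```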
In order to track more specific statistics about the instructions being executed, you will need to modify the Python script at ${CHIPYARDROOT}/generators/riscv-sodor/scripts/tracer.py.
Modify the tracer to detect the percentage of instructions that are loads and stores with non-zero offsets. Follow the existing framework in the script to accomplish this task. There is existing code which you can adapt for your modifications.
Consult the RISC-V Green Card (https://inst.eecs.berkeley.edu/~cs152/sp25/resources/#risc-v) to determine which instruction bits correspond to which fields.
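As a starting point, the offset fields can be pulled straight out of the 32-bit instruction word. Per the RV32I base encoding, loads are I-type (opcode 0000011) with imm[11:0] in bits 31:20, and stores are S-type (opcode 0100011) with imm[11:5] in bits 31:25 and imm[4:0] in bits 11:7. A standalone sketch (not drop-in tracer.py code):

```python
# Detect loads/stores and extract their offset fields (RV32I encoding).
# Sign extension is unnecessary for a zero/nonzero test: the immediate
# is zero exactly when all of its raw bits are zero.
LOAD_OPCODE, STORE_OPCODE = 0b0000011, 0b0100011

def ldst_offset(inst):
    """Return the raw 12-bit offset field, or None if not a load/store."""
    opcode = inst & 0x7F
    if opcode == LOAD_OPCODE:    # I-type: imm[11:0] = inst[31:20]
        return (inst >> 20) & 0xFFF
    if opcode == STORE_OPCODE:   # S-type: imm[11:5]=inst[31:25], imm[4:0]=inst[11:7]
        return (((inst >> 25) & 0x7F) << 5) | ((inst >> 7) & 0x1F)
    return None

# Hand-assembled examples: lw x5, 8(x10) and sw x5, 0(x10)
lw = (8 << 20) | (10 << 15) | (0b010 << 12) | (5 << 7) | LOAD_OPCODE
sw = (5 << 20) | (10 << 15) | (0b010 << 12) | STORE_OPCODE
print(ldst_offset(lw), ldst_offset(sw))  # nonzero vs. zero offset
```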
After modifying tracer.py, re-run the tracer on all of the output files from the bypassed 5-stage simulations to gather results.
cd ${CHIPYARDROOT}/sims/verilator/output/chipyard.harness.TestHarness.Sodor5StageConfigBypassed
${SCRIPTS}/tracer.py towers.out
What percentages of the instruction mix do the various types of load and store instructions make up? Evaluate the new design in terms of the percentage increase in the number of instructions that will have to be executed. Which design would you advise your employer to adopt? Justify your position quantitatively.
Feedback Portion
In order to improve the labs for the next offering of this course, we would like your feedback. Please append your feedback to your individual report for the directed portion. These questions are also placed at the bottom of the open-ended part of the lab for your convenience.
- How many hours did the directed portion take you?
- How many hours did you spend on the open-ended portion?
- Was this lab boring?
- What did you learn?
- Is there anything that you would change?
Feel free to write as much or as little as you prefer (a point will be deducted only if left completely empty).
Team Feedback
In addition to feedback on the lab itself, please answer a few questions about your team:
- In one short paragraph, describe your contributions to the project.
- Describe the contribution of each of your team members.
- Do you think that every member of the team contributed fairly? If not, why?
-
A canonical name for the system type that follows the nomenclature cpu-vendor-os ↩
-
You will need to modify the PATH variable to access the riscv tools. See Problem 4.1 ↩
-
Create a CS152-specific instructional account through the WebAcct service: http://inst.eecs.berkeley.edu/webacct/ ↩
-
Guide on how to use
tmux: https://www.redhat.com/en/blog/introduction-tmux-linux. ↩ -
Should you encounter a
java.lang.OutOfMemoryErrorexception, repeat themakecommand. ↩ -
The 2-stage and 3-stage processors will not be explicitly used in this lab, but they exist to demonstrate how pipelining in a relatively simple microarchitecture is implemented. ↩