Background

As with Lab 1, this lab is based on the Chipyard framework being actively developed at UC Berkeley. However, we will be exploring more sophisticated hardware designs than the rudimentary Sodor processors from the previous lab.

Chipyard

Chipyard is an integrated design, simulation, and implementation framework for agile development of systems-on-chip (SoCs). It combines Chisel, the Rocket Chip generator, and other Berkeley projects to produce a full-featured RISC-V SoC from a rich library of processor cores, accelerators, memory system components, and I/O peripherals. Chipyard supports several hardware development flows, including software RTL simulation, FPGA-accelerated simulation (FireSim), and automated VLSI methodologies (Hammer).

Rocket Chip

Rocket Chip is an open-source SoC generator originally developed at UC Berkeley. It leverages Chisel to compose a library of highly parameterized generators for cores, caches, and interconnects into an integrated SoC. It has been the basis of numerous silicon-proven designs in both research and industry.

Rocket Chip can generate a practically unbounded space of instances, including many parameter sets that are impractical or suboptimal. In this lab, we will examine a variety of design points, each with a different memory hierarchy, to explore the concepts described in class. All Rocket Chip instances used in this lab have three major components: processor cores, a cache hierarchy, and an outer memory system.

Rocket Microarchitecture

Rocket Chip derives its name from the Rocket core that it instantiates by default: a 5-stage, single-issue, in-order RISC-V processor. The instances of Rocket used in this lab implement the RV64IMAFDC instruction set variant1, which refers to the 64-bit RISC-V base ISA (RV64I) along with a set of useful extensions: M for integer multiply/divide, A for atomic memory operations, F and D for single- and double-precision floating-point, and C for 16-bit compressed representations of common instructions.

Rocket also supports the RISC-V privileged architecture with machine, supervisor, hypervisor, and user modes. It has an MMU that implements the Sv39 virtual memory scheme, which provides 39-bit virtual address spaces with 4 KiB pages. As such, these designs are capable of booting mainstream operating systems such as Linux.

Rocket pipeline

Rocket has been extensively optimized for efficient ASIC implementation, resulting in specific microarchitectural adaptations that differentiate it from the classic 5-stage RISC pipeline normally seen in educational settings. In particular, the overall design is mainly concerned with minimizing high-fanout stall signals and restructuring pipeline logic to cope with the long clock-to-Q delays of compiler-generated SRAMs.

Several factors contribute to shorter critical paths compared to more naive approaches:

  1. Instructions are not permitted to stall except in the ID stage for data and known structural hazards.

  2. Most hazards that arise in EX or later stages are handled by replaying (re-fetching and re-executing) the instruction upon reaching WB (not unlike how exceptions propagate down the pipeline). One notable case is load-hit speculation, in which an instruction that depends on a load result can be issued before it is known whether the load is a cache hit.

  3. Branch conditions are resolved in EX, but the PC is redirected in MEM. The 3-cycle mispredict penalty is mitigated by branch prediction provided by a configurable branch target buffer (BTB), branch history table (BHT), and a return address stack (RAS).

  4. Bypass muxes are moved into EX with the selects precomputed in ID; bypass data comes directly from pipeline registers to the extent possible.

  5. Some variable-latency operations (e.g., L1D miss, divide) use a scoreboard to track pending register writes. This enables instructions to complete out of program order so that a long-latency operation does not halt the pipeline for subsequent instructions. Consequently, with a non-blocking L1 data cache, multiple misses can be serviced simultaneously.

Cache Hierarchy

A generic Rocket Chip instance

The basic unit of replication for a core in Rocket Chip is a tile.2 Each tile consists of one core (Rocket) and the portion of the inner cache hierarchy that is private to that core:

  • L1 instruction cache (L1I)

  • L1 data cache (L1D) of either a blocking or a non-blocking design

  • fully-associative L1 instruction and data TLBs

  • optional unified direct-mapped L2 TLB

  • hardware page table walker

SoC instances can optionally be configured with a unified, inclusive, multi-banked L2 cache as a last-level cache shared between tiles. If an L2 cache is not present, an L2 broadcast hub is instantiated in its place to maintain coherence between the L1 caches. Each of these structures exposes various parameters such as capacity, associativity, replacement policy, and cache line size, which are set through a Scala-based configuration system at elaboration time.

Outer Memory System

The L2 coherence agent (either the L2 cache or broadcast hub) makes requests to an outer memory system through an AXI4 master port. This top-level port would typically interface with a DRAM controller, but since an actual DRAM controller implementation is not openly available, we instead attach a model that simulates the functional and timing behaviors of a DDR3 memory system. The default SoC configuration presents a single memory channel, but the system can be configured to use multiple channels for greater bandwidth.

Directed Portion

Terminology and Conventions

Throughout this course, the term host refers to the machine on which the simulation runs, while target refers to the machine being simulated. For this lab, an instructional server will act as the host, and the RISC-V processors will be the target machines.

Setup

This lab is set up to run either on the instructional servers or on a non-macOS Linux machine. Use the setup instructions corresponding to the machine you are running on. The default setup is to ssh into an instructional server using the instructional computing account provided to you.

Instructional Servers

The lab infrastructure has been set up to run on the eda{1..4}.eecs.berkeley.edu machines (eda-1.eecs, eda-2.eecs, etc.).

cd ~ # go to your home directory
rm -rf chipyard # save tmp space, remove lab1 chipyard directory
source ~/conda/etc/profile.d/conda.sh
conda activate .conda-env/
git clone https://github.com/cs152-teach/chipyard-cs152.git cs152-lab2-sp26
cd cs152-lab2-sp26
git checkout sp26-lab2
./build-setup.sh riscv-tools --skip-conda --skip-toolchain --skip-circt --skip-firesim --skip-marshal

After Chipyard builds successfully, you will find a newly generated env.sh file. Open that file and add the following lines:

export LAB2ROOT=$PWD
export TESTDIR=${LAB2ROOT}/lab
export SIMDIR=${LAB2ROOT}/sims/verilator

Then, run: source env.sh

Post-Setup

After setting up the repository, you must set up the Chipyard environment in every terminal that is opened.

cd ~ # go to your home directory
conda activate .conda-env/
cd cs152-lab2-sp26
source env.sh

The remainder of this lab will use ${LAB2ROOT} to denote the path of the working tree.

Matrix Transposition Case Study

The directed portion will lead you through a simple case study of a matrix transposition kernel with these objectives:

  • Illustrate some basic cache optimization techniques

  • Conduct a brief design-space exploration of cache configurations using the Rocket Chip parameterization system

  • Familiarize you with the RTL simulation flow

We begin with a naive implementation of matrix transposition in ${TESTDIR}/directed/transpose.c that is derived directly from the mathematical definition. Take a moment to understand the source code. Note that both the $256{\times}64$ input matrix and $64{\times}256$ output matrix are stored in row-major order. The matrix elements are 64-bit integers.

Compile it into a bare-metal binary:

cd ${TESTDIR}/directed
make

Next, navigate to the Verilator directory and build the simulator. Notice that the CONFIG variable selects the top-level SoC design to generate – what exactly this means will be described in the next section. The Chisel design is elaborated into Verilog RTL, which is then compiled into a cycle-accurate simulator.

cd ${SIMDIR}
make CONFIG=CS152RocketConfig -j4

Next, run the naive matrix transposition kernel on the simulator:

make CONFIG=CS152RocketConfig run-binary-hex BINARY=${TESTDIR}/directed/transpose.riscv

This will involve several minutes of waiting, as the entire program takes approximately 2 million target cycles to execute.

The program prints a snapshot of several hardware performance counters in the processor.3 Use this information to answer the following questions:

How many cycles does the transpose operation take? What is the miss rate of the L1 data cache? Why does the naive transpose code result in non-ideal cache performance? Which memory access in the code incurs the most misses, and why?

Cache Blocking

Rewrite the transpose code to employ cache blocking (loop tiling) using $B{\times}B$ blocks. Experiment with a few values of $B$ to determine which factor yields the best performance for the $256{\times}64$ input matrix. You do not need to structure your code for a general $N{\times}M$ matrix; handling $256{\times}64$ is sufficient. To maximize $B$, you may also find it necessary to apply a simple loop interchange within a block. We highly recommend generating .riscv binaries for a range of $B$ values and writing a shell script to run those simulations in sequence (once you have verified functional correctness with spike), so that you can launch the script and come back to the data in 20 to 30 minutes instead of running a command every 5 minutes.

You can first run your code in a software ISA simulator to more quickly test for correctness, before running in Verilator simulation to gather performance data.

timeout --foreground 1 spike ${TESTDIR}/directed/transpose.riscv

This is normally sufficient for debugging software. However, in the unlikely case that a bug manifests only in Verilator, a verbose simulation trace4 can be found at transpose.out.

Once you have run the simulations on various $B$ sizes and collected your data, answer the following questions:

For the given matrix dimensions, what is the optimal blocking factor $B$? What is the performance improvement using cache blocking over the naive code? It turns out that the block size $B{\times}B$ which yields the lowest miss rate is much smaller than what one might expect based solely on the 4 KiB capacity of the L1 data cache. What is the reason for this? (Hint: Consider the access pattern within a block, particularly how the rows of the rectangular matrices map to cache sets and which type of cache misses dominate for larger $B$.)

Cache Parameters

Navigate to CS152Configs.scala and examine the definition of CS152RocketConfig.

In Rocket Chip, a Config is a Scala class that sets one or more generator parameters to specific values. Configs are additive, can override each other, and can be composed of other Configs. CS152RocketConfig is an example of a Config that combines other Configs through the ++ operator. The constituent Configs are applied from right to left (or from bottom to top) in the chain, in order of increasing precedence: a Config appearing to the left of (or above) another Config overrides any parameters previously set by the latter. For more information on the Rocket Chip parameter system, read through the Chipyard documentation.5

In CS152RocketConfig, change the associativity of the L1 data cache to 2 by modifying the WithL1DCacheWays parameter, while also adjusting WithL1DCacheSets to keep the overall cache size constant. Rebuild the simulator and re-run your blocked matrix transpose version (with your best block size). Repeat this with 4 and 8 ways.

Tip: For fast simulation, try using FastCS152RocketConfig. Simply replace
CONFIG=CS152RocketConfig in the make command with CONFIG=FastCS152RocketConfig.

How do the performance and miss rate change as associativity is increased? Explain why higher associativity is or is not beneficial for this particular kernel.

Feedback Portion

In order to improve the labs for the next offering of this course, we would like your feedback. Please append your feedback to your individual report for the directed portion. These questions are also placed at the bottom of the open-ended part of the lab for your convenience.

  • How many hours did the directed portion take you?

  • How many hours did you spend on the open-ended portion?

  • Was this lab boring?

  • What did you learn?

  • Is there anything that you would change?

Feel free to write as much or as little as you prefer (a point will be deducted only if left completely empty).

Team Feedback

In addition to feedback on the lab itself, please answer a few questions about your team:

  • In one short paragraph, describe your contributions to the project.

  • Describe the contribution of each of your team members.

  • Do you think that every member of the team contributed fairly? If not, why?

  1. Also known as RV64GC, with G (“general-purpose”) being the canonical shorthand for “IMAFD” 

  2. Although Rocket Chip can generate multi-core instances, this lab will feature only single-tile instances. 

  3. Note that in-flight and recently retired instructions may or may not be reflected when reading the performance counters. 

  4. These prints show signals from Rocket’s writeback stage each cycle; refer to line 980 of $ to identify each field. 

  5. https://chipyard.readthedocs.io/en/latest/Chipyard-Basics/Configs-Parameters-Mixins.html