Open-ended Portion (70%)
Select one of the following questions per team. The open-ended portion is worth a considerable fraction of the lab grade, and your score depends on how thorough the process leading to your conclusions is.
Validation and Reverse Engineering of Memory Hierarchies
In this problem, we will try to infer fundamental parameters of a memory system by running user code and measuring execution latency. This is useful for a variety of reasons:
- To help guide application optimizations when the underlying microarchitecture is unknown or undisclosed.
- To validate memory system performance before tape-out. Some of the most insidious bugs in computer system design are performance bugs, since applications still execute correctly, only more slowly. We would like to catch these bugs before committing a design to silicon, but without a performance model of the target machine, they may go undiscovered.
- Using the same principle as above, to help validate simulation models. The common approach of split timing and functional modeling makes it possible to build highly complex cache and memory models – but it is very easy to write “correct” yet fundamentally broken timing models.¹
We have provided a mystery Rocket Chip configuration for you to characterize, aptly named CS152RocketMysteryConfig.
Cache Sizes and Access Latency
We will first run the `caches` micro-benchmark, which comes from the ccbench suite developed by Christopher Celio, to determine the cache sizes and access latency at each level of the cache hierarchy.
`caches` executes a single-threaded pointer chase on an array of a given size, in which each 4-byte array element yields the index of the next element to access.
```c
start_cycles = get_cycles();
for (uint32_t k = 0; k < g_num_iterations; k++) {
  idx = arr_n_ptr[idx];
}
end_cycles = get_cycles();
```
As the array exceeds the size of a cache, the sharp increase in misses is observable from the longer time that the benchmark takes to run. To accurately isolate the load latency of an individual element, both spatial locality and memory prefetching must be defeated. This involves striding the indices to point to different cache lines and randomly sorting the indices within a virtual page (so TLB locality is still maintained).
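As a concrete illustration of that setup, here is a sketch (not the actual ccbench code) of how such a randomized pointer-chase array could be constructed. The 64-byte line and 4 KiB page sizes are assumptions to adjust for the machine under test:

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Build a pointer-chase array in which each 4-byte element holds the index
// of the next element to visit. One index is chosen per cache line, and the
// indices are shuffled within each virtual page, so spatial locality and
// prefetching are defeated while TLB locality is preserved.
std::vector<uint32_t> build_chase(size_t num_elems,
                                  size_t line_bytes = 64,
                                  size_t page_bytes = 4096) {
    const size_t elems_per_line = line_bytes / sizeof(uint32_t);
    const size_t lines_per_page = page_bytes / line_bytes;

    std::vector<uint32_t> order;
    std::mt19937 rng(0xCAFE);  // fixed seed for reproducible runs
    for (size_t page = 0; page * page_bytes < num_elems * sizeof(uint32_t); ++page) {
        std::vector<uint32_t> in_page;
        for (size_t l = 0; l < lines_per_page; ++l) {
            size_t idx = (page * lines_per_page + l) * elems_per_line;
            if (idx < num_elems) in_page.push_back(static_cast<uint32_t>(idx));
        }
        std::shuffle(in_page.begin(), in_page.end(), rng);  // defeat prefetching
        order.insert(order.end(), in_page.begin(), in_page.end());
    }

    // Link the chosen line-start indices into one closed loop.
    std::vector<uint32_t> arr(num_elems, 0);
    for (size_t i = 0; i < order.size(); ++i)
        arr[order[i]] = order[(i + 1) % order.size()];
    return arr;
}
```

Chasing `idx = arr[idx]` through this array touches every cache line exactly once per lap, so the average cycles per iteration approximates the load-use latency at the level of the hierarchy that fits the array.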
We have provided a Makefile to automate running ccbench in simulation and visualizing the result. First, sweep across a predefined range of array sizes:
```shell
cd ${TESTDIR}/open1
make sim -j4
make ccbench-sweep -j4
```
This should take approximately an hour to finish when using a couple of parallel jobs. Each benchmark run outputs a log statement in the form:
```
App:[caches],NumThreads:[0],AppSize:[1024],Time:[4.01507],TimeUnits:[Cycles Per Iteration],NumIterations:[30000],RunType:[0]
```
“AppSize” records the array size in terms of 4-byte elements, and “Time” records the average cycles spent per iteration.
Run the following to extract these lines into a consolidated report file and invoke the plotting script from ccbench:
```shell
./fix_ccbench.sh # installs plotting library
make ccbench-plot
```
This generates a plot of cycles per iteration versus array size. Open ccbench/caches/plots/plot-CS152RocketMysteryConfig.png and answer the following:
- L1 D cache size
- L1 D cache latency
- L2 cache size
- L2 cache latency
- DRAM latency
Save the plot for your report.
Other Parameters
Finally, try to empirically deduce some more subtle parameters:
- L1 D cache associativity
- L1 D cache line size
- L1 D cache replacement policy
Optionally, you can try to determine some of these more advanced parameters:
- L1 I cache size
- L1 I cache associativity
- L1 I cache replacement policy
- L1 I TLB reach
- L1 D TLB reach
- L2 TLB reach
- L2 TLB hit latency
- DRAM page policy (open or closed)
- Aggregate DRAM page size (ranks $\times$ banks)
- Number of DRAM ranks (you can assume there are 8 banks)
- DRAM Column Address Strobe latency (CAS)
- DRAM Row Address to Column Address delay (RCD)
- DRAM Row Precharge time (RP)
In the `test` directory, two templates are provided to help you begin writing your own micro-benchmarks:
- `bmark-p.c`: This executes in a “bare-metal” environment with physical addressing. It is usually the quicker option, since bare-metal programs can be loaded directly into the simulation without the initialization overhead of user mode.
- `bmark-v.c`: This executes in a user-mode environment with virtual addressing. An initial supervisor program called the proxy kernel (`pk`) is required to load the user program and set up paging.
In general, you should avoid accessing arbitrary memory locations that have not been properly allocated, either statically or dynamically with malloc() or mmap().
For a rudimentary timer, the templates also define a function that returns the value of the cycle CSR, which counts the number of cycles after reset:
```c
static inline unsigned long rdcycle(void) {
  unsigned long cycles;
  __asm__ __volatile__ ("rdcycle %0" : "=r" (cycles));
  return cycles;
}
```
`printf()` and other C `stdio.h` functions can be used to print results to stdout/stderr.²
Edit the Makefile to add other programs of your own. To build all programs:
```shell
cd ${TESTDIR}/open1/test
make
```
To run the bmark-p.riscv program:
```shell
cd ${SIMDIR}
make CONFIG=CS152RocketMysteryConfig run-binary-hex BINARY="${TESTDIR}/open1/test/bmark-p.riscv"
```
To run the bmark-v.riscv program:
```shell
make CONFIG=CS152RocketMysteryConfig run-binary BINARY=$RISCV/riscv64-unknown-elf/bin/pk BINARY_ARGS=${TESTDIR}/open1/test/bmark-v.riscv LOADMEM=1
```
Feel free to test your code on other known configurations, such as the one used in the directed portion or any new ones that you define.
Submission
For the given CS152RocketMysteryConfig instance, report the 5 parameters related to the cache capacities and access latencies as indicated by the caches benchmark. (Include the plots from ccbench.)
Then provide your best estimate for the D cache associativity, line size, and replacement policy. For each of these 3 parameters, explain how you measured it and provide relevant snippets of your code (the code does not count towards the page limit). If you are not certain that you can accurately determine a parameter, still provide your code and explain what you tried. More credit will be awarded for a measured and analyzed result that is incorrect than for an ill-justified guess that happens to be correct by coincidence. If you have data or plots showing that your code works on known instances, include them in your justification.
Feel free to reach out to your GSI if you need help understanding ccbench, Rocket Chip, or anything else regarding this problem.
Design Your Own Hardware Prefetcher
In this problem, we will build a hardware prefetcher (in either Chisel or C++) in hopes of improving the performance on various benchmarks.
We will use a Rocket Chip instance that has a 16 KiB 4-way set-associative L1 data cache and 64-byte cache lines. Unlike the directed portion, this configuration uses the non-blocking data cache, so that Rocket can take advantage of hit-under-miss while a prefetch is being serviced.
Interfaces
Partly for convenience and modularity, the prefetcher is integrated with the tile as a separate module from the cache itself. It can be considered another client of the data cache, like the core.
Start by navigating to L1Prefetcher.scala. This Chisel file contains our generator framework for creating prefetchers. All prefetcher modules inherit from the L1Prefetcher base class, which specifies a common set of I/O ports (grouped together in the L1PrefetcherIO bundle).
To observe the stream of memory requests and misses, the prefetcher snoops on the HellaCacheIO interface between the core and L1 data cache. A curated subset of signals from the HellaCacheReq bundle is presented to the prefetcher:
| Input Signal | Description |
|---|---|
| io.cpu.req.valid | Asserted when Rocket’s execute stage sends a request to the L1D cache |
| io.cpu.req.bits.addr | Virtual address of the access |
| io.cpu.req.bits.cmd | Memory operation type (e.g., 0=load, 1=store, etc.) |
| io.cpu.req.bits.size | Logarithm of the access size (e.g., 0=1 byte, 3=8 bytes) |
| io.cpu.miss | Asserted when a cache miss is being reported to Rocket’s writeback stage (Note: This is delayed by two cycles after the original request) |
The prefetcher has a simplified outgoing interface through which it can inject prefetch requests into the L1D. This is a decoupled interface that uses ready/valid handshaking³ to coordinate the source and sink: both io.dmem.req.valid and io.dmem.req.ready must be high during the same cycle to initiate a prefetch.
| Input Signal | Description |
|---|---|
| io.dmem.req.ready | Asserted when the L1D can accept a request |
| io.dmem.nack | Asserted when a prefetch request from two cycles ago is rejected, either because all MSHRs are occupied or the request is a secondary miss (Note: This could be used to replay a request or throttle the prefetcher) |

| Output Signal | Description |
|---|---|
| io.dmem.req.valid | Indicates that the request is valid |
| io.dmem.req.bits.addr | Virtual address to prefetch (Note: This must be aligned to an 8-byte boundary) |
| io.dmem.req.bits.write | Indicates intent to write |
The single port to the L1D is arbitrated between the core, prefetcher, and the page table walker, with the prefetcher being given the lowest priority so as to avoid blocking actual memory requests. However, for more aggressive prefetching schemes, it may be desirable to implement some form of throttling to ensure that prefetches do not excessively occupy the MSHRs (miss status handling registers). For example, prefetches could be rate-limited based on a fixed interval or a feedback loop that adapts to miss rate.
For reference, the ExampleL1Prefetcher module is provided as a demonstration of how to use the L1PrefetcherIO interface described above. It is a naive implementation of the one-block prefetch-on-miss scheme from lecture, but its simplistic design actually turns out to be quite ineffectual, causing a moderate performance degradation more often than not. Hopefully yours is a superior solution!
C++ Modeling
As an alternative to writing Chisel, you also have the option of implementing your prefetcher in C++ as a software model that is co-simulated with Rocket Chip. This method uses SystemVerilog DPI (Direct Programming Interface) to enable a Verilog wrapper module to call C++ functions.
Start by navigating to L1Prefetcher.cc. This C++ file contains two functions to work with:
| Function | Description |
|---|---|
| prefetcher_init() | Called at the beginning of simulation; can be used to initialize global state. |
| prefetcher_tick() | Called every clock cycle; this is where the bulk of your prefetcher logic will reside. The function signature matches the I/O ports described above. Values for output signals (dmem_req_*) are assigned by dereferencing the argument pointers. Read the important note below about output signal timing. |
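To make the tick-based structure concrete, here is a hypothetical sketch of the one-block prefetch-on-miss scheme written as a C++ model. The real DPI signature is defined by L1Prefetcher.cc; the argument names below are assumptions based on the signal tables above. Because io.cpu.miss lags the request by two cycles, a small history of request addresses is kept so the missing line can be recovered when the miss is reported:

```cpp
#include <cstdint>

static uint64_t addr_hist[2] = {0, 0};  // request addresses from 1 and 2 cycles ago
static uint64_t pending_addr = 0;       // prefetch waiting to be accepted
static bool     pending      = false;

void prefetcher_tick(bool cpu_req_valid, uint64_t cpu_req_addr,
                     bool cpu_miss, bool dmem_req_ready,
                     bool* dmem_req_valid, uint64_t* dmem_req_addr,
                     bool* dmem_req_write) {
    const uint64_t LINE_BYTES = 64;
    if (cpu_miss) {
        // The miss corresponds to the request from two cycles ago:
        // target the next cache line after the missing one.
        pending_addr = (addr_hist[1] & ~(LINE_BYTES - 1)) + LINE_BYTES;
        pending = true;
    }
    *dmem_req_valid = pending;
    *dmem_req_addr  = pending_addr;  // line-aligned, hence 8-byte aligned
    *dmem_req_write = false;         // prefetch with read intent
    if (pending && dmem_req_ready)   // ready/valid handshake fired
        pending = false;
    addr_hist[1] = addr_hist[0];     // age the request history
    addr_hist[0] = cpu_req_valid ? cpu_req_addr : addr_hist[0];
}
```

A real model would also watch dmem_nack to replay rejected prefetches, and the two-cycle history shown here is a simplification of Rocket's actual pipeline timing.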
Building
To use your custom hardware prefetcher, first modify the WithL1Prefetcher Config defined in CS152Configs.scala. Replace the default ExampleL1Prefetcher instantiation with the appropriate module, either CustomL1Prefetcher (Chisel) or ModelL1Prefetcher (C++), like so:
```scala
class WithL1Prefetcher extends Config((site, here, up) => {
  case BuildL1Prefetcher =>
    Some((p: Parameters) => Module(new CustomL1Prefetcher()(p)))
})
```
A specific top-level configuration for this problem (CS152RocketPrefetchConfig) has already been prepared for you, which includes the WithL1Prefetcher Config.
To build the simulator:
```shell
cd ${SIMDIR}
make CONFIG=CS152RocketPrefetchConfig -j4
```
Simulating
First see how your prefetcher performs on the matrix transposition kernel from the directed portion. You may choose to use the naive code or your L1 cache-blocked version.
```shell
cd ${SIMDIR}
make CONFIG=CS152RocketPrefetchConfig run-binary-hex BINARY="${TESTDIR}/directed/transpose.riscv"
```
Next, test your prefetcher on the suite of benchmarks from Lab 1.⁴ These also print a snapshot of the hardware performance counters – note that L1D misses from regular and prefetch requests are counted separately.
```shell
make CONFIG=CS152RocketPrefetchConfig run-bmark-tests
```
Lastly, as an example of a more complex application, we have also included the Graph Algorithm Performance Benchmark Suite (GAPBS), which consists of portable, high-performance implementations of six fundamental graph algorithms, developed by Scott Beamer. Specifically, our focus is on the direction-optimizing variant of Breadth-First Search (BFS). Smaller inputs (Kronecker graphs with $2^{10}$ vertices) are used here, since the reference inputs, such as real social network graphs, demand too much memory and simulation time.
```shell
cd ${TESTDIR}/open2
make
cd ${SIMDIR}
make CONFIG=CS152RocketPrefetchConfig run-binary BINARY=${TESTDIR}/open2/gapbs/bfs LOADMEM=1 BINARY_ARGS="-n 8 -g 1"
```
Note: if building the BFS binary results in a `__dso_handle` not defined error, you need to provide a minimal definition for it:
- Create a file `dso.cc` with the following content: `void *__dso_handle = 0;`
- Compile it by running `riscv64-unknown-elf-g++ -c dso.cc -o dso.o`
- Add the resulting object file to your build by appending it to CXXFLAGS (in the `Makefile`) before running make: `CXXFLAGS += -specs=htif.specs -specs=htif_argv.specs -Wl,--defsym=__heap_size=8M $(bmarks_dir)/dso.o`
As in the directed portion, the simulation traces can be found at `output/<design_name>/<testbench>.out` based on the benchmark name.
Submission
Report performance metrics and cache statistics from running the various benchmarks with the prefetcher enabled. Compare these to results gathered from running on the baseline CS152RocketNoPrefetchConfig system, which omits the prefetcher but is otherwise identical. Include the source code for your implementation in an appendix (not counted towards the page limit).
In your report, describe your design and any implementation challenges in detail. Here are some suggestions to consider in your evaluation:
- What memory access patterns or instances of locality were you targeting?
- Explain your design rationale and the various approaches that you considered. What worked and what did not?
- Analyze the impact on miss rate and CPI. Were any results surprising?
- Optionally, see if you can characterize your prefetcher on the set of metrics introduced in Lecture 7:
  - accuracy = useful prefetches / total prefetches
  - coverage = total prefetches / total unique accesses
  - timeliness = number of prefetches arriving on time / total prefetches

  It may be useful to instrument your prefetcher with Chisel printfs⁵ or C++ `std::cout` statements to log certain events and parse the trace with a script.
- A negative result is perfectly acceptable so long as you reason about why the outcomes differed from your expectations. (Designing an effective prefetcher is a non-trivial task! This exercise is partly meant to underscore the challenges of prototyping an idea.)
Feel free to reach out to your GSI if you need help understanding Chisel, Rocket Chip, or anything else regarding this problem.
Design Your Own Replacement Policy and Victim Cache
For this problem, we would like to investigate whether a different cache replacement policy, combined with a victim cache, would improve the performance of five selected SPEC benchmarks compared to random replacement.
Assume you are designing for a 16 KiB 4-way set-associative L1 data cache, where the backside is connected to DRAM. You will model your cache modifications in spike, a functional ISA simulator for RISC-V that has been extended with a basic cache model. The simulator feeds memory addresses through a simulated cache (with a given size, associativity, and block size) to compute the number of accesses, hits, and misses. While spike does not model microarchitectural timings and is therefore not cycle-accurate, its speed lets us execute much longer programs that would be infeasible to run in RTL simulation.
SPEC CPU2006
The SPEC CPU2006 package is a former⁶ industry-standard benchmark suite for evaluating general-purpose processors, memory systems, and compilers. You will be running five benchmarks from SPECint (integer) and SPECfp (floating-point) on smaller test inputs.⁷ Brief descriptions of them, taken from the SPEC documentation, follow:
- `401.bzip2` is based on bzip2 version 1.0.3, modified to perform compression and decompression in memory instead of file I/O.
- `429.mcf` is derived from MCF, a program used for single-depot vehicle scheduling in public mass transportation. It features a specialized version of the simplex algorithm for network flow problems.
- `450.soplex` is based on SoPlex 1.2.1. It solves a linear program using a simplex algorithm and sparse linear algebra.
- `458.sjeng` is based on Sjeng 11.2, a program that plays chess and several chess variants. It attempts to find the best move via a combination of alpha-beta or priority proof number tree searches, advanced move ordering, positional evaluation, and heuristic forward pruning.
- `470.lbm` implements the “Lattice Boltzmann Method” to simulate incompressible fluids in 3D.
To build spike and simulate all benchmarks using its cache model:
```shell
cd ${TESTDIR}/open3
make spike -j4
make run
```
The benchmarks should take around 15 minutes in total to execute when run serially. For quicker testing, individual programs can be re-run with `make run-X`, where `X` is the name of the benchmark without the numerical prefix (e.g., `run-bzip2`, `run-mcf`, etc.).
The simulation output is recorded in `CPU2006/build.riscv/*.out`, and the compiled SPEC binaries can also be found in that same directory. Removing these files forces make to re-run the simulations later:
```shell
make clean-run
```
A full rebuild can be triggered by purging all generated files:
```shell
make clean
```
Modifying the Cache Simulator
Navigate to `riscv-isa-sim/riscv/cachesim.cc`, where you will find the definition of the `cache_sim_t` C++ class instantiated in spike. After taking some time to understand how the current cache model operates, modify the `cache_sim_t::victimize()` function to implement your custom replacement policy.
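As a point of reference, a true-LRU policy could look like the following standalone sketch. The `cache_sim_t` interface differs from this, so treat it as pseudocode for the policy rather than a drop-in replacement; the 4-way geometry matches the L1 D cache in this problem:

```cpp
#include <cstddef>
#include <cstdint>

// LRU state for one cache set: a counter per way. On an access, the
// touched way's counter resets to zero and every other way ages; the
// victim is the way with the largest counter (least recently used).
struct LruSet {
    static const size_t WAYS = 4;
    uint32_t age[WAYS] = {0, 0, 0, 0};

    void touch(size_t way) {            // call on every hit or fill
        for (size_t w = 0; w < WAYS; ++w)
            if (w != way) ++age[w];     // everyone else gets older
        age[way] = 0;                   // this way is now most recent
    }

    size_t victimize() const {          // pick the least recently used way
        size_t victim = 0;
        for (size_t w = 1; w < WAYS; ++w)
            if (age[w] > age[victim]) victim = w;
        return victim;
    }
};
```

When estimating hardware cost later, note that exact LRU for a 4-way set can be encoded in far fewer bits than one counter per way (e.g., a permutation or pseudo-LRU tree), so this software-friendly representation is not the cheapest hardware realization.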
To rebuild the simulator without running any benchmarks:
```shell
cd ${TESTDIR}/open3
make spike
```
Adding a Victim Cache
Generally, the associativity of a cache (number of ways) presents a trade-off between access time and conflict misses. In order to reduce conflict misses without affecting access times, N. Jouppi proposed victim caching in which a small fully-associative secondary cache, called a victim cache, is added to a direct-mapped L1 cache to hold recently evicted cache lines.
We have provided you with an implementation of a victim cache. You will analyze the impact of this victim cache in conjunction with your custom replacement policy.
```shell
cd ${TESTDIR}/open3/riscv-isa-sim
git apply ../victim_cache.patch
```
Pay close attention to the `VICTIM_CACHE_LINES` parameter in `cachesim.cc`, which configures the capacity of the victim cache. Observe what happens to the benchmark outputs as you increase the size of this cache. It is sufficient to stick to powers of 2 for this analysis.
Rebuild the simulator and rerun the benchmarks with:
```shell
cd ${TESTDIR}/open3
make spike -j4
make run
```
Submission
In your report, describe how your cache replacement policy and victim cache work, using visual aids (e.g., block diagrams) where appropriate to illustrate their operation. Include a diff of your modifications to the cache model in an appendix (not counted towards the page limit).
Report cache statistics from running the SPEC benchmark suite on your modified cache, and compare them to the original cache. Try to explain your results as best you can. Here are some suggestions to consider in your analysis on the effectiveness of your design:
- Based on the number of memory accesses, misses, and instructions retired, what effect on AMAT and CPI do you think your new cache would have?
- Estimate the cost in resources if you were to implement your design in hardware. How much additional state would be required in terms of bits?
- For the given set of benchmarks, how problematic are conflict misses compared to compulsory or capacity misses? Would the addition of a victim cache be justifiable?
- What is the minimum number of victim cache entries that you would recommend? Are there diminishing returns to increasing victim cache size?
- Program behavior can sometimes be characterized by distinct execution “phases”. Does the miss rate vary over time within a benchmark?
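For the last question, one lightweight approach is to instrument the cache model with a windowed miss-rate counter and plot the per-window rates over time. A sketch (the `record` hook and the default window size are arbitrary choices, not part of the provided simulator):

```cpp
#include <cstddef>
#include <vector>

// Accumulate hits and misses in fixed-size windows of accesses, so the
// per-window miss rate can be dumped at the end of simulation and
// plotted to reveal execution phases.
class MissRateProfile {
public:
    explicit MissRateProfile(size_t window = 100000) : window_(window) {}

    void record(bool miss) {   // call once per cache access
        if (miss) ++misses_;
        if (++accesses_ == window_) {  // window complete: log and reset
            rates_.push_back(static_cast<double>(misses_) / window_);
            accesses_ = 0;
            misses_ = 0;
        }
    }

    const std::vector<double>& rates() const { return rates_; }

private:
    size_t window_;
    size_t accesses_ = 0;
    size_t misses_ = 0;
    std::vector<double> rates_;
};
```

Hooking `record()` into the access path of `cache_sim_t` and printing `rates()` at exit gives a simple time series of miss rate versus instruction progress.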
Feel free to reach out to your GSI if you need help understanding the ISA simulator, the cache model, or anything else regarding this problem.
Feedback Portion
In order to improve the labs for the next offering of this course, we would like your feedback. Please append your feedback to your individual report for the directed portion. These questions are also placed at the bottom of the open-ended part of the lab for your convenience.
- How many hours did the directed portion take you?
- How many hours did you spend on the open-ended portion?
- Was this lab boring?
- What did you learn?
- Is there anything that you would change?
Feel free to write as much or as little as you prefer (a point will be deducted only if left completely empty).
Team Feedback
In addition to feedback on the lab itself, please answer a few questions about your team:
- In one short paragraph, describe your contributions to the project.
- Describe the contribution of each of your team members.
- Do you think that every member of the team contributed fairly? If not, why?
1. For example, in one FASED bug, a write address was mistakenly being used for a read access to the tag array of the last-level cache model. While this would cause a real cache to return incorrect data, it simply manifested as a small timing aberration in the model. ↩
2. Formatting of floating-point numbers is supported by newlib, the embedded C library. ↩
3. https://inst.eecs.berkeley.edu/~cs150/Documents/Interfaces.pdf ↩
4. You can run the benchmarks in parallel by adding the `-j N` flag to the `make` command, but refrain from spawning an excessive number of jobs so as to be fair to other users. N = 4 is probably acceptable. ↩
6. It has since been replaced by SPEC CPU2017; however, we still opt to use SPEC CPU2006 because it is much simpler to cross-compile for RISC-V. ↩
7. Each benchmark with reference inputs generally requires a day to run on a typical Rocket Chip instance mapped to an FPGA. In this case, we are more interested in stressing the caches than in reporting valid benchmark scores, so minor adjustments have been made to limit simulation time. ↩