CS61C Fall 2018 Lab 6 - Caches!



Obtaining the Lab Files

Similar to previous weeks, cd to your lab repository and then get the lab 6 starter files via:

$ git fetch starter
$ git merge starter/master

Tools for Cache Visualization in Venus

To understand how caches work is typically one of the hardest tasks for students in 61C at first.

This exercise will use some cool cache visualization tools to get you more familiar with cache behavior and performance terminology with the help of the file cache.s.

At this point, read through cache.s to get a rough idea of what the program does. Make sure you go through what the pseudocode does and what the argument registers hold before you proceed to analyze cache configurations on it.

Here's the magic link of cache.s

For each of the scenarios below, you'll be repeating THESE steps:

  1. Open the designated file in Venus. (cache.s, magic link)
  2. In the code for cache.s, set the appropriate Program Parameters as indicated at the beginning of each scenario (by changing the immediates of the commented li instructions in main)
  3. Simulator-->Cache.
  4. Set the appropriate Cache Parameters as indicated at the beginning of each scenario.
  5. As you execute code in Venus, any DATA memory access (load or store) will show up (instruction fetches not shown because instructions are loaded into a separate instruction cache which is not shown in Venus).

The Cache Simulator will show the state of your data cache. If you reset your code, you will also reset the cache hit/miss rate as well!

IMPORTANT: If you run the code all at once, you will get the final state of the cache and hit rate. You will probably benefit the most from setting a breakpoint in the loop wordLoop right before or after each memory access to see exactly where the hits and misses are coming from.

Review: Hit and Miss Policies

The Venus cache simulator currently simulates a write-through, write-allocate cache. Here's a reminder about the three different cache hit policies you should know about:

There are also two miss policies you should know about:

The important takeaway concerning Venus here: in a write-through cache (like in Venus), even though you are updating memory on writes, because we also write to the cache, we consider writes to blocks we have in the cache to be write hits.

Common question(s):

  1. Don't we usually pair write-back with write-allocate and write-through with no write-allocate? Yes, we learned in class that the ordinary pairing of hit policy/miss policy is write-back/write-allocate and write-through/no write allocate. However, with respect to the cache, write-through and write-back caches behave similarly on hits (both write to the cache), so the hit/miss patterns you see in the Venus cache simulator would be the same even if Venus simulated a write-back cache.

Exercise 1: A Couple of Memory Access Scenarios

Task: Simulate the following scenarios and record the final cache hit rates with the program in cache.s. Try to reason out what the hit rate will be BEFORE running the code. After running each simulation, make sure you understand WHY you see what you see (the TAs will be asking)!

Do not hesitate to ask questions if you feel confused! This is perfectly normal and the staff is there to help you out!

THE FOLLOWING are good questions to ask yourself as you do these exercises (not checkoff questions):

Scenario 1:

Cache Parameters: (set these in the Cache tab)

Program Parameters: (set these by initializing the a registers in the code)

Tip: If it's hard for you to visualize what's getting pulled into the cache on each memory access just from staring at the code, try getting out some paper and a pencil. Write down what the tag:index:offset breakdown of the 32-bit addresses would be, figure out which memory addresses map to which set in the cache with the index bits, and see if that helps.


  1. What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: "Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.")
  2. What is our hit rate if we increase Rep Count arbitrarily? Why?
  3. How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: "Change [parameter] to [value]." Note: we don't care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.

Scenario 2:

Cache Parameters:

Program Parameters:


  1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It's not 1.
  2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
  3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
  4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why?
  5. You should have noticed that our hit rate was pretty high for this scenario, and your answer to the previous question should give you a good understanding of why. IF YOU ARE NOT SURE WHY, consider the size of the array and compare it to the size of the cache. Now, consider the following:

    Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep).

    QUESTION: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others AKA it doesn't matter if Rep k is applied to element arr[i] before Rep k is applied to element arr[i+1], etc.

    HINT: You do not want to iterate through the entire array at once because it's much bigger than your cache. Doing so would reduce the amount of temporal locality your program exhibits, which makes cache hit rate suffer. We want to exhibit more locality so that our caches can take advantage of our predictable behavior. SO, instead, we should try to access __________ of the array at a time and apply all of the _________ to that __________ so we can be completely done with it before moving on, thereby keeping that _________ hot in the cache and not having to circle back to it later on! (The 1st, 3rd, and 4th blanks should be the same. It's not some vocabulary term you should use to fill them in. It's more of an idea that you should have.)

Checkoff [1/3]

Exercise 2: Loop Ordering and Matrix Multiplication

If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];

Fact: Matrix multiplication operations are at the heart of many linear algebra algorithms, and efficient matrix multiplication is critical for many applications within the applied sciences.

In the above code, note that the loops are ordered i, j, k. If we examine the innermost loop (the one that increments k), we see that it...

REMEMBER: To compute the matrix multiplication correctly, the loop order doesn't matter.

BUT, the order in which we choose to access the elements of the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of spatial and temporal locality, utilizing blocks already contained within our cache. Optimizing a program's memory access patterns is essential to obtaining good performance from the memory hierarchy.

Take a glance at matrixMultiply.c. You'll notice that the file contains multiple implementations of matrix multiply with 3 nested loops.

Task: Think about what the strides are for the nested loops in the other five implementations.

Note that the compilation command in the Makefile uses the '-O3' flag. It is important here that we use the '-O3' flag to turn on compiler optimizations. Compile and run this code with the following command, and then answer the questions below:

$ make ex2

This will run some matrix multiplications according to the six different implementations in the file, and it will tell you the speed at which each implementation executed the operation. The unit "Gflops/s" reads, "Giga-floating-point-operations per second." THE BIGGER THE NUMBER THE FASTER IT IS RUNNING!

Checkoff [2/3]

Exercise 3: Cache Blocking and Matrix Transposition

For this exercise, using the hive machines is recommended! If you are in one of the Soda Lab 27- rooms, you can still ssh from the instructional machines into the hive machines. This is because the hive machines have historically run the tests twice as fast.

Matrix Transposition

Sometimes, we wish to swap the rows and columns of a matrix. This operation is called a "transposition" and an efficient implementation can be quite helpful while performing more complicated linear algebra operations. The transpose of matrix A is often denoted as AT.

Cache Blocking

In the above code for matrix multiplication, note that we are striding across the entire A and B matrices to compute a single value of C. As such, we are constantly accessing new values from memory and obtain very little reuse of cached data! We can improve the amount of data reuse in the caches by implementing a technique called cache blocking. More formally, cache blocking is a technique that attempts to reduce the cache miss rate by further improving the temporal and/or spatial locality of memory accesses. In the case of matrix transposition we consider performing the transposition one block at a time.

Things to note: In the above image, we transpose each submatrix Aij of matrix A into its final location in the output matrix, one submatrix at a time. It is important to note that transposing each individual subsection is equivalent to tranposing the entire matrix.

Since we operate on and finish transposing each submatrix one at a time, we consolidate our memory accesses to that smaller chunk of memory when transposing that particular submatrix, which increases the degree of temporal (and spatial) locality that we exhibit, which makes our cache performance better, which makes our program run faster.

This (if implemented correctly) will result in a substantial improvement in performance. For this lab, you will implement a cache blocking scheme for matrix transposition and analyze its performance.

Your task is to implement cache blocking in the transpose_blocking() function inside transpose.c. By default, the function does nothing, so the benchmark function will report an error. You may NOT assume that the matrix width (n) is a multiple of the blocksize. After you have implemented cache blocking, you can run your code by typing:

$ make ex3
$ ./transpose <n> <blocksize>

Where n, the width of the matrix, and blocksize are parameters that you will specify. You can verify that your code is working by setting n=10000 and blocksize=33. The blocked version should finish significantly faster.

The following section is meant to serve as a guideline for if you have no idea how to start. If you think you know how to use the parameter blocksize, then just jump right in and get started.

Some tips to get started:

Start by looking at the transpose_naive function included in the file. Notice that the index y strides vertically across the WHOLE src matrix in one iteration of the outer loop before resetting to 0. Another way to say this is that the index x only updates after y is done going from 0 all the way to n. This is the behavior which we want to change. We want to step not stride across the array indices.

TL;DR: fill out dst square chunks at a time, where each square chunk is of dimension blocksize by blocksize.

Instead of updating x only when y goes through ALL of 0 through n, we want to jump down to the next row of dst after we stride across the width (horizontal axis) of just a single block. How big is a block? Exactly the number of integers specified by the parameter blocksize.
In addition, we only want to stride vertically through the height (vertical axis) of a block before we move on to the next block. We don't want to make x stride all the way down n rows of dst before we move on to the next block.

Hint: A standard solution needs 4 (four) for loops.

Finally, since we can't assume that n is a multiple of the blocksize, the final block column for each block row will be a little bit cut-off AKA it won't be a full blocksize-by-blocksize square. In addition, the final block row will all be truncated. To fix this problem, you can do the exercise assuming that n is a multiple of the blocksize and then add in a special case somewhere to do nothing when your indices reach out of bounds of the array.

Once your code is working, complete the following exercises and record your answers (we will ask about it during checkoff).

Part 1: Changing Array Sizes

Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000. At what point does cache blocked version of transpose become faster than the non-cache blocked version? Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?

(Sanity check: the blocked version isn't faster than the naive version until the matrix size is sufficiently big.)

Part 2: Changing Blocksize

Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000. How does performance change as the blocksize increases? Why is this the case?

(Sanity check: as you increase the blocksize, the amount of speedup should change in one direction, then change in the other direction.)

Notice that in neither of the last two exercises did we actually know the cache parameters of our machine. We just made the code exhibit a higher degree of locality, and this magically made things go faster! This tells us that caches, regardless of their specific parameters, will always benefit from operating on code which exhibits a high degree of locality.

Checkoff [3/3]