Lab 7: Data Locality and Cache Blocking

  1. Setup
        Copy the lab files to your home directory with:

                                cp -r ~cs61c/labs/sp11/07 ./lab7

  2. Background

     1. Matrix Multiplication

        If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops:

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];

assuming that matrices A, B and C are all n-by-n and stored in one-dimensional column-major arrays. Matrix multiplication operations are at the heart of many linear algebra algorithms, and efficient matrix multiplication is critical for many applications within the applied sciences.
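
        To make the column-major convention concrete: the (i, j) entry of an n-by-n matrix is stored at index i + j*n, so consecutive entries of a column sit next to each other in memory. A minimal, self-contained sketch (the 3-by-3 values are purely illustrative):

#include <stdio.h>

int main(void) {
    int n = 3;
    /* Column-major storage: each group of n consecutive values is one column. */
    double A[9] = {1, 2, 3,   /* column 0 */
                   4, 5, 6,   /* column 1 */
                   7, 8, 9};  /* column 2 */
    int i = 1, j = 2;
    /* Entry (i, j) lives at index i + j*n; here that is A[7] = 8. */
    printf("A(%d,%d) = %g\n", i, j, A[i + j*n]);
    return 0;
}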

        In the above code, note that the loops are ordered i, j, k. Considering the innermost k loop, we therefore move through B with stride 1, through A with stride n, and through C with stride 0. The order of the loops has no effect on the correctness of the result. However, how we choose to stride through the matrices can have a large impact on performance, since a poor choice breaks the spatial locality that is critical for good cache performance.
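
        To see how the strides change with a different ordering, here is one of the other five orderings, shown only for illustration (Exercise 1 below asks you to compare all six):

/* The (i, k, j) ordering; the loop body is unchanged, so the result is the
 * same. In the innermost j loop, C[i+j*n] and B[k+j*n] are both accessed
 * with stride n, while A[i+k*n] stays fixed (stride 0). */
for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];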

     2. Matrix Transposition

        Sometimes we wish to swap the rows and columns of a matrix. This operation is called a “transposition”, and an efficient implementation can be quite helpful when performing more complicated linear algebra operations. The transpose of a matrix A is often denoted A^T.
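
        As a sketch of the naive operation, assuming the same n-by-n, column-major convention as above (the function name is a placeholder; the actual code in transpose.c may be organized differently):

/* Naive transpose sketch: entry (i, j) of src, at index i + j*n, becomes
 * entry (j, i) of dst, at index j + i*n. */
void naive_transpose(int n, double *dst, double *src) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            dst[j + i*n] = src[i + j*n];
}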

     3. Cache Blocking

        In the above code for matrix multiplication, note that we stride across the entire matrices to compute a single value of C. As a result, we are constantly fetching new values from memory and get very little reuse of cached data! We can improve the amount of data reuse in the cache with a technique called cache blocking. More formally, cache blocking attempts to reduce the cache miss rate by improving the temporal and/or spatial locality of memory accesses. In the case of matrix transposition, we consider performing the transposition one block at a time.

        In the figure above, each block A_ij of matrix A is transposed into its final location in the output matrix. With this scheme, we significantly reduce the size of the working set in the cache at any one point in time. If implemented correctly, this will result in a substantial improvement in performance. For this lab, you will implement a cache blocking scheme for matrix transposition and analyze its performance. As a side note, you will be required to implement several levels of cache blocking for matrix multiplication in Project 3.
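
        To give a feel for the loop structure, here is a rough sketch of one level of blocking (the function name and blocksize parameter are placeholders, and the clamped loop bounds are only one possible way to handle the fringe cases you will meet in Exercise 2):

/* Sketch of a singly-blocked transpose, same column-major convention as
 * above. The matrix is walked one blocksize-by-blocksize tile at a time so
 * that the source tile and its destination tile both stay resident in the
 * cache while they are being copied. */
void blocked_transpose(int n, int blocksize, double *dst, double *src) {
    for (int jj = 0; jj < n; jj += blocksize)
        for (int ii = 0; ii < n; ii += blocksize) {
            /* Clamp the tile bounds so fringe tiles (when n is not a
             * multiple of blocksize) are still handled correctly. */
            int jmax = (jj + blocksize < n) ? jj + blocksize : n;
            int imax = (ii + blocksize < n) ? ii + blocksize : n;
            for (int j = jj; j < jmax; j++)
                for (int i = ii; i < imax; i++)
                    dst[j + i*n] = src[i + j*n];
        }
}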

  3. Exercise 1: Matrix Multiply

If you glance at matrixMultiply.c, you’ll notice that the file contains an implementation of matrix multiplication performed with 3 simple nested loops. Compile and run this code. (Note: Make sure to use the “-O3” flag to turn on compiler optimizations.)
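
If the lab files do not already include a Makefile, a command along these lines should work (the -std flag and output file name here are assumptions, not taken from the lab):

                                gcc -std=gnu99 -O3 -o matrixMultiply matrixMultiply.c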

  1. Why does performance drop for large values of n?
  2. Modify matrixMultiply.c to try all 6 different loop orderings. Which ordering(s) perform best for 1000-by-1000 matrices? Which ordering is the worst? In both cases, how are we striding through the matrices with respect to the innermost loop?

  4. Exercise 2: Matrix Transposition

Compile and run the naive matrix transposition implemented in transpose.c (Again, make sure to use the “-O3” flag when you compile!).

  1. Note the time required to perform the naive transposition for a matrix of size 2000-by-2000.
  2. Modify the function called “transpose” in transpose.c to implement a single level of cache blocking, i.e., loop over all matrix blocks and transpose each one into the destination matrix. (Hint: Make sure to handle the fringe cases of the transposition; for example, what happens if we try to transpose the 5-by-5 matrix above with a blocksize of 2?)
  3. Try block sizes of 2-by-2, 100-by-100, 200-by-200, 400-by-400 and 1000-by-1000. Which performs best on the 2000-by-2000 matrix? Which performs worst?