## Setup

Copy the lab skeleton into your workspace:

`$ cp -r ~cs61c/labs/06 ~/lab06`

## Background Information

### Matrix Multiplication

If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:

```c
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```

Matrix multiplication operations are at the heart of many linear algebra algorithms, and efficient matrix multiplication is critical for many applications within the applied sciences.

In the above code, note that the loops are ordered i, j, k. Examining the innermost loop (k), we see that it moves through B with stride 1, through A with stride n, and through C with stride 0 (the element C[i+j*n] does not depend on k). The loop ordering does not affect the correctness of the result, but the order in which we access the elements of the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses exhibit spatial and temporal locality, so optimizing a program's memory access patterns is essential to obtaining good performance from the memory hierarchy.
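To make the effect of loop ordering concrete, here is a sketch of the same multiplication with the loops reordered as j, k, i. The function name `matmul_jki` is illustrative (it is not from matrixMultiply.c); with column-major storage, the innermost loop (i) now walks both A and C with stride 1:

```c
/* Sketch: n-by-n column-major matrix multiply with j, k, i loop order.
   The innermost loop (i) accesses A and C with stride 1, and B with
   stride 0, so consecutive iterations touch adjacent memory. */
static void matmul_jki(int n, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                C[i + j * n] += A[i + k * n] * B[k + j * n];
}
```

Every ordering computes the same sums; only the stride pattern of the memory accesses changes.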

### Matrix Transposition

Sometimes, we wish to swap the rows and columns of a matrix. This operation
is called a "transposition" and an efficient implementation can be quite
helpful while performing more complicated linear algebra operations. The
transpose of matrix A is often denoted as A^{T}.
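In index terms, element (i, j) of A becomes element (j, i) of A^{T}. A minimal sketch for a column-major n-by-n array (the function name and signature here are illustrative, not taken from transpose.c):

```c
/* Sketch of a naive transpose for an n-by-n column-major matrix:
   element (i, j) of src, stored at src[i + j*n], is written to
   element (j, i) of dst, stored at dst[j + i*n]. */
static void transpose_naive(int n, const double *src, double *dst) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            dst[j + i * n] = src[i + j * n];
}
```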

### Cache Blocking

In the above code for matrix multiplication, note that we are striding
across the entire A and B matrices to compute a single value of C. As such, we
are constantly accessing new values from memory and obtain very little reuse of
cached data! We can improve the amount of data reuse in the caches by
implementing a technique called cache blocking. More formally, **cache
blocking is a technique that attempts to reduce the cache miss rate by
improving the temporal and/or spatial locality of memory accesses**. In the
case of matrix transposition we consider performing the transposition one block
at a time.

With this scheme, we transpose each block A_{ij} of matrix A into
its final location in the output matrix, one block at a time. This
significantly reduces the size of the working set in the cache while any
one block is being processed, and (if implemented correctly) results in a
substantial improvement in performance. For this lab, you will
implement a cache blocking scheme for matrix transposition and analyze its
performance. You may also find this technique useful for Project 3.
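As a concrete illustration of the blocked loop structure, here is one possible shape for a cache-blocked transpose (a sketch under assumed names, not the required lab implementation). The outer loops step from block to block; the inner loops stay within one block, and the `< n` guards handle the partial edge blocks that arise when n is not a multiple of the block size:

```c
/* Illustrative sketch: cache-blocked transpose of an n-by-n
   column-major matrix with square blocks of side bsize.
   The "&& i < n" / "&& j < n" guards clip the edge blocks
   when n is not evenly divisible by bsize. */
static void transpose_blocked(int n, int bsize,
                              const double *src, double *dst) {
    for (int jb = 0; jb < n; jb += bsize)          /* block column */
        for (int ib = 0; ib < n; ib += bsize)      /* block row */
            for (int j = jb; j < jb + bsize && j < n; j++)
                for (int i = ib; i < ib + bsize && i < n; i++)
                    dst[j + i * n] = src[i + j * n];
}
```

The inner pair of loops touches only one bsize-by-bsize tile of the source and one of the destination at a time, which is what keeps the working set small.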

## Exercise 1: Matrix multiply

Take a glance at matrixMultiply.c. You'll notice that the file contains multiple implementations of matrix multiply with 3 nested loops.

Compile and run this code with the following command:

`$ make ex1`

Note that the compilation command in the Makefile uses the `-O3` flag. *It
is important here that we use the `-O3` flag to turn on compiler
optimizations.* The Makefile will run matrixMultiply twice. Copy the results somewhere so that
you do not have to run this program again, and use them to help you answer the
following questions:

- **Why does the Gflop/s performance drop for large values of n?**
- **The second run of matrixMultiply runs all 6 different loop orderings. Which ordering(s) perform best for 1000-by-1000 matrices? Which ordering(s) perform the worst?**
- **How does the way we stride through the matrices in the innermost loop affect performance?**

**Checkoff:** Be prepared to explain your responses to the above
questions to your TA.

## Exercise 2: Matrix transposition

Compile and run the naive matrix transposition implemented in transpose.c. You can do so with the command:

`$ make ex2`

Next, complete the following exercises.

- Note the time required to perform the naive transposition for a matrix of size 2000-by-2000.
- Modify the function called `transpose` in transpose.c to implement a single level of cache blocking. That is, loop over all matrix blocks and transpose each into the destination matrix. **Make sure to handle the special cases of the transposition:** what happens if we try to transpose a 5-by-5 matrix with a block size of 2?
- Try block sizes of 2-by-2, 100-by-100, 200-by-200, 400-by-400, and 1000-by-1000. **Which performs best on the 2000-by-2000 matrix? Which performs worst?**

**Checkoff:** Show and explain your code to your TA. Answer the questions
from the last part above, and then run your code with a block size of 33-by-33 to verify
that your algorithm handles odd block widths correctly.

## Exercise 3: Mid-Semester Survey

Please take the time to fill out the mid-semester survey. The survey is anonymous, but completing this IS a part of your checkoff for this lab. Make sure to stay on the submission complete page or take a screenshot to show your TA.

**Checkoff:** Show your submission verification to your TA.