CS61C Project 3: Optimize Matrix Multiply

TAs: Andrew Gearhart and Vasily Volkov

You must work in groups of two for this project! For each part of the project, submit only one report per group. Make sure the names and logins of both partners are on the report submission!

  1. Background

A researcher in the UCB Parlab has been working on a video processing application that separates foreground and background information in video data. Unfortunately, he has little background in high-performance computing and needs your domain expertise to optimize the computational bottleneck of his code: a matrix multiply. Recall that matrix multiply is of this form:

    C = A * B, where C[i][j] = sum over k of A[i][k] * B[k][j]

Unfortunately for the researcher, his matrix multiply has certain characteristics that make it unsuitable for a canned library routine. In particular, the matrix is single precision, with 1000 <= M <= 10000 rows and N <= 100 columns. Each column represents a single video frame in grayscale, and the algorithm attempts to detect the background as pixels that do not change greatly from frame to frame. Furthermore, the matrix multiply is of the form A * A’, where A’ is the transpose of A (i.e. A’[i][j] = A[j][i]). Thus, the matrix multiply is of this form:

    C = A * A’, where A is M-by-N, A’ is N-by-M, and C is M-by-M

The form above labels the matrices with their numbers of rows and columns, and it is a good way to visualize the problem.
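
For reference, here is a minimal sketch of the computation itself, assuming column-major storage (the provided benchmark.c and sgemm-naive.c define the actual interface and storage convention; the function name here is illustrative only):

/* Naive single-precision C = A * A'.  A is m-by-n, C is m-by-m.
 * Column-major storage assumed: element (i,k) of A lives at A[i + k*m]. */
void sgemm_naive_sketch(int m, int n, const float *A, float *C)
{
    for (int j = 0; j < m; j++)
        for (int i = 0; i < m; i++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i + k*m] * A[j + k*m];   /* A[i][k] * A'[k][j] */
            C[i + j*m] = sum;
        }
}

Note that M*M*N multiply-adds (2*M*M*N flops) are performed regardless of loop order; this is the flop count commonly used when reporting Gflop/s.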

Your objective is to parallelize and optimize the matrix multiply operation described above!!! Awesome, right?

  2. Given

     1. Copy the project materials to your account:

cp -R ~cs61c/proj/03/ proj3

     2. You will be provided with several files to help you with the optimization process:
        1. benchmark.c: benchmarking and timing code. Edit this file to change the size of matrix A.
        2. sgemm-naive.c: naive matrix multiply
        3. Makefile
  3. Deliverables

     1. Part 1: DUE 3/27

        1. Code: Should compile and run with the given Makefile and benchmark.c!
           1. Use the command “submit proj3-1” to submit these files:
              1. sgemm-small.c
              2. report1.pdf
           2. Develop on the machines in 200SD.
           3. Optimize a single-precision (i.e. using the 32-bit float type) AA’ matrix multiply for a matrix of size 64x64, with a level of blocking in the style of Lab 7. (In this case, your block sizes will be small because you’ll be “register blocking” rather than “cache blocking”; see below.) Also add a fringe case to handle the portions of the matrices that can’t be evenly divided by your block size. Use SSE intrinsics (3/1 Lecture and Lab 8) and loop unrolling to optimize the computation of your small matrix blocks. (A sketch of one possible kernel appears after this list.)
              1. Tune the matrix multiply for a matrix A of size 64x64, but your code should also run correctly for all other matrix sizes.
              2. It is beneficial to think of the blocking as occurring at the register level: i.e. you are implementing “register blocking” that uses small blocks in the hope of increasing the amount of register reuse accomplished by the algorithm. (This should be a hint as to the size of the small blocks you are looking to SIMDize.)
              3. The performance benefits of this optimization will be much more noticeable for smaller matrix sizes, so don’t worry about performance dropping off for larger matrices. We’ll tune for larger matrices in Part 2.
              4. For this part of the assignment, do not implement any optimizations other than the ones specified above. This will be checked, so that the performance portion of the grade can be calculated fairly.
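
To make the register-blocking idea concrete, here is a minimal sketch of one possible SIMD inner kernel, again assuming column-major storage. The function name and the 4x1 block shape are illustrative only; a competitive submission would unroll further and choose its own block shape:

#include <xmmintrin.h>  /* SSE intrinsics */

/* Compute C[i..i+3][j], vectorizing over four consecutive rows of A.
 * Column-major storage assumed: element (i,k) of A lives at A[i + k*m]. */
static void kernel_4x1_sketch(int m, int n, const float *A, float *C,
                              int i, int j)
{
    __m128 c = _mm_setzero_ps();
    for (int k = 0; k < n; k++) {
        __m128 a = _mm_loadu_ps(&A[i + k*m]);  /* load A[i..i+3][k]  */
        __m128 b = _mm_set1_ps(A[j + k*m]);    /* broadcast A'[k][j] */
        c = _mm_add_ps(c, _mm_mul_ps(a, b));   /* c += a * A[j][k]   */
    }
    _mm_storeu_ps(&C[i + j*m], c);             /* write C[i..i+3][j] */
}

Rows past the last multiple of the vector width (the fringe) would be handled by an ordinary scalar loop of the same form.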
        2. Report:
           1. In addition to the code file, submit a 1-2 page report in PDF form called “report1.pdf” which includes:
              1. A plot of the speedup of sgemm-small.c over sgemm-naive.c
              2. Questions (for all questions, make sure to explain how you arrived at your answer!):
                 1. From the x86 assembly, determine how many xmm registers are used in the code of your Part 1 submission.
                 2. What portion of the floating-point operations occur in fringe cases (i.e. non-SIMD)? Give the formula for a general matrix of size m-by-n.
        3. Notes:
           1. You should be able to reach around 7 Gflop/s of performance for this portion of the assignment.
           2. Include and test code for the fringe case even if your register block size evenly divides the 64x64 matrix. We will certainly use this fringe code for Part 2.
           3. Also, make sure to label your plots and correctly plot speedup. Many people lost points last semester for these reasons!
     2. Part 2: DUE 4/3

        1. Code: Should compile and run with the given Makefile and benchmark.c!
           1. Use the command “submit proj3-2” to submit these files:
              1. sgemm-all.c (code from Part 1 with some cache blocking)
              2. sgemm-openmp.c (code from sgemm-all.c with OpenMP)
              3. report2.pdf
           2. Initially develop on the 200SD machines. We are working on implementing a batch queue system on the “hive” machines in 330 Soda to improve load balancing for OpenMP jobs. If we want you to collect final OpenMP results from these machines, the staff will publish a spec on how to do so by 3/27.
           3. Improve the performance of your Part 1 code for more general tall-and-skinny matrices via one or two levels of cache blocking. You may wish to try copying blocks into contiguous buffers and padding fringe cases with zeros to improve your performance for larger matrix sizes (i.e. if you pad fringes with zeros, you can use your already-tuned code to handle those cases).
           4. Use at least one OpenMP pragma to parallelize your matrix multiply implementation. (A sketch of the combined blocked/OpenMP structure appears after this list.)
              1. This should be fairly straightforward; try reordering loops and playing with the “schedule” directive to improve scaling and overall performance.
           5. For this part of the assignment, do not implement any optimizations other than the ones specified above. This will be checked, so that the performance portion of the grade can be calculated fairly.
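
To make the structure concrete, here is a minimal sketch of one level of cache blocking combined with a single OpenMP pragma, again assuming column-major storage. BLOCK is a hypothetical tuning parameter you would sweep experimentally, and inner_kernel_sketch is a plain scalar stand-in for your tuned Part 1 SIMD code:

/* Scalar stand-in for the tuned Part 1 kernel. */
static void inner_kernel_sketch(int m, int n, const float *A, float *C,
                                int i0, int i1, int j0, int j1)
{
    for (int j = j0; j < j1; j++)
        for (int i = i0; i < i1; i++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i + k*m] * A[j + k*m];
            C[i + j*m] = sum;
        }
}

#define BLOCK 256   /* hypothetical block size -- tune it! */

void sgemm_blocked_sketch(int m, int n, const float *A, float *C)
{
    /* Each thread owns whole column panels of C, so no two threads
     * ever write the same element of C. */
    #pragma omp parallel for schedule(dynamic)
    for (int jb = 0; jb < m; jb += BLOCK) {
        int jend = (jb + BLOCK < m) ? jb + BLOCK : m;
        for (int ib = 0; ib < m; ib += BLOCK) {
            int iend = (ib + BLOCK < m) ? ib + BLOCK : m;
            inner_kernel_sketch(m, n, A, C, ib, iend, jb, jend);
        }
    }
}

Compile with -fopenmp under gcc. Copying each block of A into a contiguous, zero-padded buffer before calling the kernel is where the buffer-copy and padding suggestions above would slot in.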
        2. Report:
           1. In addition to the code files, submit a 1-2 page report in PDF form called “report2.pdf” which includes:
              1. A plot of the speedup of sgemm-all.c over sgemm-naive.c
              2. A weak scaling plot of your sgemm-openmp.c vs. sgemm-all.c (3/3 Lecture, Slide 15)
              3. A strong scaling plot of your sgemm-openmp.c vs. sgemm-all.c (3/3 Lecture, Slide 15)
              4. Questions (for all questions, make sure to explain how you arrived at your answer!):
                 1. What effect (on Gflop/s measurements) does cache blocking have when considering a range of matrix sizes? In other words, is there another useful metric to consider in addition to your code’s peak performance on a single problem size?
                 2. What portions of matrix C are shared between threads in your OpenMP code when running with 8 threads?
                 3. What portions of A are shared between which threads in your OpenMP code when running with 8 threads?
                 4. For a matrix of size 3000x100, does your OpenMP code have more fringe cases than your SIMD code? Does every OpenMP thread have the same number of fringe cases?
        3. Notes:
           1. You should be able to reach around 42 Gflop/s of performance for this portion of the assignment.
           2. If needed, assume that matrix A is of size 3000x100 for speedup plots.
           3. For Part 2, we will grade your average performance over a range of tall-and-skinny matrices with 1000-10000 rows and fewer than 100 columns. We will try strange matrix sizes, such as powers of 2 and primes... so be prepared.
  4. Grading

     1. You will be graded according to three parameters:
        1. Correctness (1/3 of grade): Does the code implement the optimizations in a way that accurately calculates the matrix multiplication? Have you implemented the required optimization techniques for each part (i.e. cache blocking, OpenMP, etc.)?
        2. Performance (1/3 of grade): Graded on a curve based upon all class submissions.
        3. Report (1/3 of grade): Have you formatted the plots correctly? Answered the required questions?
     2. Each part of the project will be ½ of the overall grade. For Part 2, the performance and correctness scores will be determined by averaging the scores of sgemm-all.c and sgemm-openmp.c.
     3. You may submit multiple times until the deadline. We will only evaluate your final submission for the performance portions of this project.
  5. Extra Credit: DUE 4/24

     1. Once the performance results from Part 2 have been submitted and analyzed, top performance results for various optimizations will be posted.
     2. As an extra credit assignment, we will have a class competition to determine which groups can code the fastest serial and parallel matrix multiply algorithms. For this portion of the assignment, you may implement any optimization that correctly calculates AA’!
     3. Do not just resubmit your Part 2 submission to the extra credit contest. You need to implement at least one non-trivial optimization.
     4. Other optimizations to consider (using these in Parts 1 and 2 is not allowed!); one of them is sketched after this list:
        1. aligned loads for SSE
        2. prefetching
        3. autotuning of cache/register blocking parameters
        4. optimizations specific to the fact that the multiply is of the form AA’
        5. cache-oblivious algorithms
           1. space-filling curve layout in memory (Z-Morton, Hilbert, etc.)
        6. Strassen’s algorithm
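
As one concrete example of an AA’-specific optimization: C = A * A’ is symmetric (C[i][j] = C[j][i]), so roughly half the flops can be skipped by computing only one triangle of C and mirroring it. A minimal sketch, assuming column-major storage and an illustrative function name:

/* Exploit symmetry of C = A * A': compute each dot product once
 * (for j <= i) and write it to both halves of C. */
void sgemm_symmetric_sketch(int m, int n, const float *A, float *C)
{
    for (int j = 0; j < m; j++)
        for (int i = j; i < m; i++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i + k*m] * A[j + k*m];
            C[i + j*m] = sum;   /* lower triangle  */
            C[j + i*m] = sum;   /* mirrored upper  */
        }
}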
     5. You will be given several extra weeks for this task, and winners will achieve everlasting fame by having their performance results publicly posted during one of the final class lectures.
     6. Submission:
        1. Use the command “submit proj3-ec” to submit these files:
           1. sgemm-ec-serial.c
           2. sgemm-ec-parallel.c
           3. report-ec.pdf
     7. Report: report-ec.pdf
        1. Please submit a PDF file! Plot the speedup of sgemm-ec-serial.c vs. sgemm-all.c for increasing matrix size. For parallel code submissions, plot weak scaling and strong scaling of sgemm-ec-parallel.c and sgemm-openmp.c (two lines on the same plot). If a plot requires a fixed problem size, use matrix A = 3000x100.
        2. Create a list of implemented optimizations, and in a couple of sentences for each, evaluate its effectiveness.
     8. Extra credit will be provided to the following groups:
        1. The top 3 serial codes and the top 3 parallel codes (on 8 threads).
           1. Note: Codes within 100 Mflop/s of each other will be considered “tied”, and all will be awarded extra credit.
        2. Groups that achieve over a 20% improvement over the performance of their Part 2 sgemm-all.c or sgemm-openmp.c submissions.
        3. Partial credit for thorough implementations of advanced concepts that might not achieve speedup (e.g. you correctly implement Strassen, but it runs slower). Realize that to obtain extra credit via this option, a significant amount of work and effort must be dedicated.
  6. References

     1. Goto, K., and van de Geijn, R. A. 2008. Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software 34, 3, Article 12.
     2. Chellappa, S., Franchetti, F., and Püschel, M. 2008. How To Write Fast Numerical Code: A Small Introduction, Lecture Notes in Computer Science 5235, 196–259.
     3. Bilmes, et al. The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply.
     4. Lam, M. S., Rothberg, E. E., and Wolf, M. E. 1991. The Cache Performance and Optimization of Blocked Algorithms, ASPLOS'91, 63–74.
     5. Intel Intrinsics Guide (see Lab 8)
     6. Intel® 64 and IA-32 Architectures Optimization Reference Manual
     7. OpenMP reference sheet
     8. OpenMP Reference Page from LLNL