CS61C Project 3

Optimizing Matrix Multiplication

Part 1 due Sunday, March 18, 2012 @ 23:59:59

Part 2 due Sunday, April 1, 2012 @ 23:59:59

TA: Rimas Avizienis

Summary

In this project, you will implement an optimized single-precision (32-bit) floating point matrix-matrix multiplication (MMM) routine for square (N x N) matrices.  In Part 1, you will use SIMD instructions to implement an MMM kernel tuned for performance on small matrices.  In Part 2, you will apply cache blocking to your MMM kernel and tune it to perform well on larger matrices.  Finally, you will parallelize your code to run on multiple cores and see how well its performance scales.

For each part of the project, you will submit source code and a writeup through your class github repository.  Part of your grade for each part will be based on your writeup, in which you will briefly describe your implementation and analyze its performance. The other part will be based on our evaluation of your code’s performance and correctness.

Background

Matrix multiplication is a fundamental linear algebra operation that is at the core of many important numerical algorithms.  As such, performing MMM efficiently is of interest to people working in a wide range of domains.  To refresh your memory, if A, B, and C are N x N matrices, then C = AB is also an N x N matrix, and the value of each element in C is defined as:

        C_ij = Σ_{k=1..N} A_ik * B_kj

The subscripts denote the position (row and column) of a particular entry in a matrix.  For this project, your code should implement the computation C = C + AB.

For the purposes of this project, matrices will be stored in memory in column-major order.  This means that the element in row i, column j of a matrix will be offset by i + j*N entries relative to the first element of the matrix.  Each entry in C is the dot product of a row of A and a column of B, and requires N multiplications and additions to compute.  It takes a total of 2N^3 floating point operations to compute C.  Each matrix occupies 4N^2 bytes of memory, and every element of the A and B matrices is used N times over the course of the computation.
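For concreteness, here is a minimal sketch of the naive computation under this storage convention (essentially what sgemm-naive.c does; the function and variable names are illustrative):

        /* Naive C = C + A*B for n x n column-major matrices.
         * Element (i, j) of a matrix M is stored at M[i + j*n]. */
        void sgemm_naive_sketch(int n, float *A, float *B, float *C)
        {
            for (int j = 0; j < n; j++)             /* column of C and B */
                for (int k = 0; k < n; k++)         /* column of A, row of B */
                    for (int i = 0; i < n; i++)     /* row of C and A */
                        C[i + j*n] += A[i + k*n] * B[k + j*n];
        }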

As you have seen in labs 7 & 8, the simplest way to implement MMM is by using three nested loops.  To significantly improve upon the performance of the naive MMM implementation, your code will need to do two things:

  1. Use the X86 instruction set’s SIMD (SSE) instructions

To take full advantage of the compute resources available in modern x86 CPUs, your code must use SSE instructions to perform multiple floating point operations in parallel whenever possible.  Recall that SSE registers are 128 bits wide, so a single SSE instruction can compute four single-precision floating point results at once.  Each core has 16 SSE registers (named xmm0, xmm1, etc.) and (under ideal circumstances) can issue both an SSE multiply and an SSE add instruction during a single cycle.  The hive machines have two 2.4 GHz quad-core Intel Xeon E5620 CPUs, one in each of the two sockets on the motherboard.  We can use this information to calculate the theoretical peak single-precision floating point performance of these machines:

        2 instructions (1 multiplication, 1 addition) per cycle *
        4 single-precision floating point operations (flops) per instruction *
        2.4×10^9 cycles per second =
        19.2 GFlop/s per core

        19.2 GFlop/s per core * 8 cores = 153.6 GFlop/s

In practice, performance can be limited either by the available compute resources (CPU bound), or by the rate at which the CPU can access memory (memory bound). Which factor ultimately limits the performance of a particular implementation of an algorithm depends on its arithmetic intensity (the ratio of arithmetic operations performed to bytes of memory accessed) and the target platform’s compute resources and memory hierarchy.   For more on this topic, see the discussion of the “roofline model” in the textbook (look it up in the index) and the second paper in the references section of this document.  

Parallelizing an algorithm to run on multiple cores incurs overheads related to communication and synchronization.  As a result, performance for parallel codes rarely scales linearly as the number of cores increases beyond a certain point.

  2. Use blocking to improve cache performance

As you saw in Lab 7, the order in which a program accesses memory can have a drastic impact on how long it takes to run.  Your MMM code should use at least two levels of blocking (for the cache and SSE registers) to obtain good performance from the memory hierarchy.  For an in-depth discussion of blocking strategies for MMM, take a look at the first paper listed in the references section.  The processors you'll be using have three levels of cache.  Each of the four cores in a socket has private L1 instruction & data caches (32KB, 8-way set-associative) and a private unified L2 cache (256KB, 8-way set-associative).  The outermost level of the memory hierarchy consists of a 12MB L3 cache shared by all four cores in a socket, and three memory controllers that communicate with external banks of DRAM.
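As a concrete (and deliberately unoptimized) illustration, here is one level of cache blocking applied to the naive loops; the block size and names are tuning knobs you would pick yourself, not values mandated by the project:

        /* One level of cache blocking: multiply BLOCK x BLOCK tiles so the
         * working set of three tiles fits in cache. MIN() handles the fringe
         * when n is not a multiple of BLOCK. Column-major storage, as above. */
        #define BLOCK 32
        #define MIN(a, b) ((a) < (b) ? (a) : (b))

        void sgemm_blocked_sketch(int n, float *A, float *B, float *C)
        {
            for (int jb = 0; jb < n; jb += BLOCK)
                for (int kb = 0; kb < n; kb += BLOCK)
                    for (int ib = 0; ib < n; ib += BLOCK)
                        /* C(ib.., jb..) += A(ib.., kb..) * B(kb.., jb..) */
                        for (int j = jb; j < MIN(jb + BLOCK, n); j++)
                            for (int k = kb; k < MIN(kb + BLOCK, n); k++) {
                                float b = B[k + j*n];
                                for (int i = ib; i < MIN(ib + BLOCK, n); i++)
                                    C[i + j*n] += A[i + k*n] * b;
                            }
        }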

Getting Started

Copy the project materials to a directory in your github tree:

                cp -R ~cs61c/proj/sp12_03 proj3

We provide you with the following files:

  1. benchmark.c – a “test harness” which measures the performance and verifies the output of your MMM implementations
  2. sgemm-naive.c – a naive, 3 nested loop implementation of MMM
  3. Makefile

To build the naive MMM implementation, run “make bench-naive” in your project directory.  Running “./bench-naive” will measure its performance on a set of input sizes in the range 64 <= N <= 1024.  Place your register blocked MMM kernel implementation in a file called sgemm-small.c.  You can create a template to start from by making a copy of sgemm-naive.c (cp sgemm-naive.c sgemm-small.c).  Be sure to add the necessary #include directives before trying to call any SSE intrinsic functions.  Running “make bench-small” and then “./bench-small” will benchmark and verify the MMM kernel in sgemm-small.c.

You do not need to modify or submit any of the provided files (except possibly the Makefile, if you choose to).  Each part of the project will require you to submit several new files:

 

 Part 1:        sgemm-small.c        part1.pdf        Makefile (optional)

 Part 2:        sgemm-all.c        sgemm-openmp.c        part2.pdf        Makefile (optional)

 

You are not allowed to show your code to other students (except your partner), TAs, professors, or anyone else.  Copying code you find on the internet is also strictly prohibited and will result in a failing grade on the project and other dire consequences.

 

Assignment

 

Part 1 (due 3/18/2012 @ 23:59:59) (40 points)

 

Implement a single-precision MMM kernel for square matrices that is optimized for performance on inputs with N=64.  Here is a suggested order in which to apply various optimizations:

  1. Use SSE Instructions

Your code will need to use SSE instructions to get good performance on the processors you are using.  Your code should definitely use the _mm_add_ps, _mm_mul_ps, _mm_loadu_ps intrinsics as well as some subset of:

_mm_load1_ps, _mm_load_ss, _mm_storeu_ps, _mm_store_ss

_mm_shuffle_ps, _mm_hadd_ps

Depending on how you choose to arrange data in registers and structure your computation, you may not need to use all of these intrinsics (such as _mm_hadd_ps or _mm_shuffle_ps).  There are multiple ways to implement MMM using SSE instructions that perform acceptably well.  You probably don’t want to use some of the newer SSE instructions, such as those that calculate dot products (though you are welcome to try!).  You will need to figure out how to handle matrices for which N isn’t divisible by 4 (the SSE register width in 32-bit floats).
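As one concrete possibility, here is a sketch of the broadcast approach using only intrinsics listed above.  It assumes N is a multiple of 4 and is deliberately unoptimized; fringe handling and register reuse are left to you:

        #include <xmmintrin.h>  /* SSE intrinsics */

        /* Broadcast one element of B, then update four consecutive elements
         * of a column of C per SSE instruction. Column-major, n divisible by 4. */
        void sgemm_sse_sketch(int n, float *A, float *B, float *C)
        {
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++) {
                    __m128 b = _mm_load1_ps(&B[k + j*n]);      /* B(k,j) in all 4 lanes */
                    for (int i = 0; i < n; i += 4) {
                        __m128 a = _mm_loadu_ps(&A[i + k*n]);  /* A(i..i+3, k) */
                        __m128 c = _mm_loadu_ps(&C[i + j*n]);  /* C(i..i+3, j) */
                        c = _mm_add_ps(c, _mm_mul_ps(a, b));   /* c += a * B(k,j) */
                        _mm_storeu_ps(&C[i + j*n], c);
                    }
                }
        }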

You should always use unaligned loads and stores (_mm_loadu_ps and _mm_storeu_ps).  On the processors you are using for this project, these instructions perform as fast as their aligned counterparts (_mm_load_ps and _mm_store_ps) when operating on aligned addresses.  The only reason you might want to use the aligned versions is to check whether you are accessing memory using aligned addresses (your program will die with a “segmentation fault” error if you aren’t).  Accessing memory aligned on 128-bit boundaries is significantly faster, so you should do it when you can.

  2. Re-use values in registers (“register blocking”)

Your code should re-use values once they have been loaded into registers as much as possible.  By reusing values loaded from the A or B matrices in multiple calculations, you can reduce the total number of times each value is loaded from memory in the course of computing all the entries in C.  To ensure that a value gets loaded into a register and reused instead of being loaded from memory repeatedly, you should assign it to a local variable and refer to it using that variable.
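For example, here is a sketch of a register-blocked variant of the SSE loop above: it holds a 4x4 tile of C in four xmm registers for the entire k loop, so each load from A is reused four times and each element of C is loaded and stored exactly once (again assuming N is a multiple of 4; names are illustrative):

        #include <xmmintrin.h>

        void sgemm_regblock_sketch(int n, float *A, float *B, float *C)
        {
            for (int j = 0; j < n; j += 4)
                for (int i = 0; i < n; i += 4) {
                    /* accumulate the 4x4 tile C(i..i+3, j..j+3) in registers */
                    __m128 c0 = _mm_loadu_ps(&C[i + (j+0)*n]);
                    __m128 c1 = _mm_loadu_ps(&C[i + (j+1)*n]);
                    __m128 c2 = _mm_loadu_ps(&C[i + (j+2)*n]);
                    __m128 c3 = _mm_loadu_ps(&C[i + (j+3)*n]);
                    for (int k = 0; k < n; k++) {
                        __m128 a = _mm_loadu_ps(&A[i + k*n]);  /* loaded once, used 4 times */
                        c0 = _mm_add_ps(c0, _mm_mul_ps(a, _mm_load1_ps(&B[k + (j+0)*n])));
                        c1 = _mm_add_ps(c1, _mm_mul_ps(a, _mm_load1_ps(&B[k + (j+1)*n])));
                        c2 = _mm_add_ps(c2, _mm_mul_ps(a, _mm_load1_ps(&B[k + (j+2)*n])));
                        c3 = _mm_add_ps(c3, _mm_mul_ps(a, _mm_load1_ps(&B[k + (j+3)*n])));
                    }
                    _mm_storeu_ps(&C[i + (j+0)*n], c0);
                    _mm_storeu_ps(&C[i + (j+1)*n], c1);
                    _mm_storeu_ps(&C[i + (j+2)*n], c2);
                    _mm_storeu_ps(&C[i + (j+3)*n], c3);
                }
        }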

  3. Loop unrolling and instruction scheduling

        Unroll the inner loop of your code to improve your utilization of the pipelined SSE multiplier and adder units, and also to reduce the overhead of address calculation and loop condition checking.  You can reduce the amount of address calculation necessary in the inner loop of your code by accessing memory using constant offsets whenever possible.

The processors you are using can issue a load, a multiply and an add instruction during a single cycle (assuming all 3 are ready to execute and appear reasonably close to one another in the instruction stream).  For your enlightenment, here are the latencies (how many cycles it takes to produce a result) for the functional units you’ll be using:

        SSE loads (that hit in the L1 Cache)                 :        2 cycles

        SSE single precision adds and multiplies        :        3 cycles

You can experiment with changing the ordering of instructions (i.e. interleaving independent loads, stores and multiplies) to try and maximize your utilization of the processor.  Keep in mind that the compiler is trying to do this for you to some extent, so depending on the particulars of your code, this may or may not have much of an effect.  You can look at the compiler generated assembly code to see what effect your C source code manipulations have had.
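To make this concrete, here is a sketch of an unrolled inner loop for one 4x1 strip of C, using two independent accumulators so consecutive adds don’t wait on each other’s 3-cycle latency.  It assumes n is a multiple of 4 and is purely illustrative:

        #include <xmmintrin.h>

        /* C(i..i+3, j) += sum over k of A(i..i+3, k) * B(k, j), k unrolled by 2.
         * acc0 and acc1 form independent dependence chains the CPU can overlap;
         * offsets like (k+1)*n are constants the compiler can fold into addresses. */
        static void strip_4x1(int n, int i, int j, float *A, float *B, float *C)
        {
            __m128 acc0 = _mm_setzero_ps();
            __m128 acc1 = _mm_setzero_ps();
            for (int k = 0; k < n; k += 2) {
                __m128 a0 = _mm_loadu_ps(&A[i + (k+0)*n]);
                __m128 a1 = _mm_loadu_ps(&A[i + (k+1)*n]);
                acc0 = _mm_add_ps(acc0, _mm_mul_ps(a0, _mm_load1_ps(&B[(k+0) + j*n])));
                acc1 = _mm_add_ps(acc1, _mm_mul_ps(a1, _mm_load1_ps(&B[(k+1) + j*n])));
            }
            __m128 c = _mm_loadu_ps(&C[i + j*n]);
            _mm_storeu_ps(&C[i + j*n], _mm_add_ps(c, _mm_add_ps(acc0, acc1)));
        }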

  4. Memory access

        It’s generally a good idea to access memory in sequential chunks whenever possible.  You may find that it is possible to achieve better performance by including a “preprocessing” step where you transpose one of the matrices into a local buffer, so you can access both the A and B matrices using sequential addresses.  To handle matrix sizes that aren’t a multiple of 4, you can try copying them into larger local buffers and padding the edges with zeros.  This could allow you to reuse your already optimized SSE code.  Another reason you may want to try copying data into a local buffer is so that you can guarantee that all memory accesses will be aligned.
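One way to do the copy-and-pad step is sketched below.  posix_memalign is one POSIX route to 16-byte-aligned memory on the hive machines; the names and the choice to pad to a multiple of 4 are illustrative:

        #include <stdlib.h>
        #include <string.h>

        /* Copy an n x n column-major matrix into a zero-padded np x np buffer,
         * where np is n rounded up to a multiple of 4 and the buffer is
         * 16-byte aligned. Caller frees the result; returns NULL on failure. */
        static float *pad_matrix(int n, const float *M, int *padded_n)
        {
            int np = (n + 3) & ~3;  /* round n up to a multiple of 4 */
            float *buf;
            if (posix_memalign((void **)&buf, 16, (size_t)np * np * sizeof(float)))
                return NULL;
            memset(buf, 0, (size_t)np * np * sizeof(float));
            for (int j = 0; j < n; j++)  /* copy one column at a time */
                memcpy(&buf[j * np], &M[j * n], n * sizeof(float));
            *padded_n = np;
            return buf;
        }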

 

Grading

        Your grade for part 1 will depend on 3 things:

  1. Correctness                (10 points)

Your code must correctly compute the output for input matrices with N between 64 and 1024.  To get these points, your code must implement at least some of the above-mentioned optimizations; you can’t just submit the same naive code we gave you and expect to get these points.

  2. Performance        (20 points)

        To get all available points, your code must achieve at least 10 GFlop/s of performance on 64x64 input matrices, as measured on the hive machines.  For reference, a highly tuned library implementation of MMM (from GotoBLAS) achieves ~14.5 GFlop/s for such matrices on the same machines.  We will compile your code using the Makefile we provided, unless you provide your own.  You are welcome to experiment with different compiler optimization flags; however, we highly recommend that you not use -O3 while developing your code, as it often produces slower code than -O2.  You may not use any OpenMP directives in your submission for part 1.

  3. Writeup                (10 points)

In addition to your code, submit a 1-2 page report (in PDF form) called “part1.pdf”.  Your writeup should:

  1. Describe how you use SSE registers in the innermost loop of your code (max 200 words)
  2. Describe how you deal with the fringes of matrices when N isn’t evenly divisible by the SSE register width (max 100 words)
  3. Include a plot comparing the performance of your code to the code in sgemm-naive.c
  4. Look at the compiler generated x86 assembly code and answer the following questions:
       a. How many XMM registers does your code use?
       b. Are any values being spilled to the stack during the innermost loop?
       c. How many scalar floating point instructions does your code use, and why?

        

Once you have finished Part 1, push your submission (two files: part1.pdf and sgemm-small.c) to the proj3/ directory of your class account github repo, and tag the commit with the label “proj3-1”.  If you worked with a partner, make sure both your names and class accounts appear at the top of both files (writeup and source code).  Failure to do so will cost you points.  If you require a TA’s help to submit your project through github, and the TA determines that you could have dealt with the situation on your own, you will be penalized.  Likewise, if you tag the wrong commit you will also be subject to a penalty.  We sincerely hope that you’ve figured out how git works by now - if you still find it confusing, please refer to one of the many online resources on the subject.  Remember that you can check which files are present in the commit that a tag points to through github’s web interface.  Don’t worry if the commit includes more than just the proj3/ directory - we will only look in that directory for files related to this project.

Part 2 (due 4/1/2012 @ 23:59:59) (40 points)

Using the MMM kernel you wrote for part 1 as a starting point, add one or more levels of cache blocking to improve its performance for larger matrix sizes.  Measure how your code performs on inputs with odd sizes (e.g. 65x65, 127x127, etc.) and optimize your fringe handling if necessary.  To improve performance on large matrices, you may want to experiment with copying sub-blocks of your input matrices into contiguous regions of memory.  Put your improved code into a file called “sgemm-all.c”.

Once your serial implementation performs to your satisfaction, use at least one OpenMP #pragma to parallelize it.  This should be fairly straightforward.  To improve the performance and scalability of your parallelized MMM code, you should try reordering loops and playing with the “schedule” directive.  Place your parallelized code in a file called “sgemm-openmp.c”.
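As a starting point, the simplest parallelization splits the columns of C across threads: each thread writes a disjoint set of columns, so no synchronization is needed inside the loop.  A minimal sketch (assuming your Makefile compiles with -fopenmp; the schedule clause is a tuning knob, not a required value):

        /* Parallelize over columns of C: iterations of the j loop are independent. */
        void sgemm_openmp_sketch(int n, float *A, float *B, float *C)
        {
            #pragma omp parallel for schedule(static)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++) {
                    float b = B[k + j*n];
                    for (int i = 0; i < n; i++)
                        C[i + j*n] += A[i + k*n] * b;
                }
        }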

Batch Queueing System

To aid you in measuring the performance of your parallelized code, we have set up a batch queueing system on the hive cluster.  One machine (hive1) has been designated as the “head” node, and four machines (hive25-28) have been configured as “compute” nodes.  The compute nodes have been set up such that remote logins are not permitted, so jobs that are run on those machines are guaranteed to be free of interference from other processes running on the machines.

To submit a job to run on one of the compute servers, you will need to make a shell script that looks like this:

        #!/bin/bash
        #PBS -N CS61C
        #PBS -V
        #PBS -l nodes=1
        #PBS -q batch
        cd $PBS_O_WORKDIR
        # workaround to fix a thread affinity problem in Linux
        export GOTOBLAS_MAIN_FREE=1
        # name of the file to execute
        ./bench-openmp

Save this script as openmp-test.sh in your working directory (the same one that contains the bench-openmp executable you’d like to test).  To submit your job to the batch queue system, execute the following command:

        qsub openmp-test.sh

To check the status of jobs currently in the batch queue system, use the qstat command:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
25.hive1                  CS61C            cs61c-ta               0 R batch

The “R” indicates that the job is running, “C” means completed, and “Q” means queued (if all 4 compute nodes are busy).

Calling qstat with the "-n" option will show you which node a task is running on:

Job ID               Username Queue    Jobname          SessID  NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
25.hive1.cs.berk     cs61c-ta batch    CS61C              6891     1   1     --  01:00 C 00:00
   hive25/0

which shows that this job ran on hive25.  The output from your job will be in a file called CS61C.o<num>, where <num> is replaced by the job ID.

If you have any questions about the batch queue system, please post them on piazza.  

NOTE:  When measuring the performance of your sgemm-openmp code, we will run it on an unloaded hive machine but we will not use the batch queue system.  

Grading

Your grade for part 2 will depend on your code’s performance and a writeup.

  1. Performance        (30 points)

Half of the performance component of your grade for part 2 will be based on the performance of your sgemm-all code, and the other half will be based on the performance of your sgemm-openmp code.   In both cases, we will be measuring the average performance of your code over a range of matrix sizes.

For your sgemm-all code, we will take the average over all the matrix sizes included in the test set in benchmark.c.  An average of 10 GFlop/s will earn you 80% of the available points, and an average of 12 GFlop/s will earn you 100% of the available points.

We will only consider the performance of your openmp code for matrices where N is between 512 and 1024.  Average performance of 40 GFlop/s will earn you 80% of the available points, and an average over 50 GFlop/s will earn you 100% of the available points.

  2. Writeup                (10 points)

Your writeup for part 2 should include:

  1. A brief description of any changes you made to your code from part 1 in order to get it to run well on a range of matrix sizes (max 150 words)
  2. A brief description of how you used OpenMP pragmas to parallelize your code in sgemm-openmp.c (max 150 words).
  3. A plot showing the speedup of sgemm-all.c over sgemm-naive.c for values of N between 64 and 1024
  4. A weak scaling plot of the performance of your sgemm-openmp.c code (use your sgemm-all.c code as the baseline for the single threaded case)
  5. A strong scaling plot of the performance of your sgemm-openmp.c code

Once you have completed part 2, push the required files to the proj3/ directory of your class account github repo, and tag the commit with the label “proj3-2”.  

Extra Credit (due 4/22/2012 @ 23:59:59)

        We will be having a class competition to determine which person or team can code the fastest serial and parallel implementations of MMM for square matrices.  Submissions are due Sunday, April 22 at 23:59:59 and must be submitted through github.  Do not just resubmit your part 2 submission to the extra credit contest - you must implement at least one non-trivial optimization.  Winners will be announced in class on Tuesday, April 24th.  The top 3 teams in each category (serial and parallel) will be awarded extra credit.  Results within 100 MFlop/s of each other will be considered tied, and all tied submissions will be awarded extra credit.  Performance for both the serial and parallel codes will be measured on a randomly selected subset of matrix sizes within the range 512 < N < 1024 (not the same subset used in part 2 of the assignment).

Submission:

Create a proj3-ec/ directory in your class github repository that contains the following files:

  1. sgemm-ec-serial.c (serial implementation)
  2. sgemm-ec-parallel.c (parallel implementation)
  3. report-ec.pdf (writeup)

        Tag the commit with the label “proj3-ec”.

        

        You don’t need to submit both a serial and a parallel implementation, but you are welcome to.  Your writeup should include performance graphs and a description of the optimizations you attempted.  If your submission includes a serial implementation, plot the performance of your extra credit submission (sgemm-ec-serial.c) against your part 2 submission (sgemm-all.c).  For submissions that include a parallel implementation, include strong and weak scaling plots for your part 2 code (sgemm-openmp.c) and your extra credit submission (sgemm-ec-parallel.c) on the same plot (two graphs total - two lines per graph).  For both serial and parallel implementations, include a list of optimizations you attempted and summarize the effectiveness of each one in a few sentences.

References

  1. Goto, K., and van de Geijn, R. A. 2008. Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software 34, 3, Article 12.
  2. Chellappa, S., Franchetti, F., and Püschel, M. 2008. How To Write Fast Numerical Code: A Small Introduction, Lecture Notes in Computer Science 5235, 196–259.
  3. Bilmes, et al. The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply.
  4. Lam, M. S., Rothberg, E. E, and Wolf, M. E. 1991. The Cache Performance and Optimization of Blocked Algorithms, ASPLOS'91, 63–74.
  5. Intel Intrinsics Guide (see Lab 8)
  6. Intel® 64 and IA-32 Architectures Optimization Reference Manual
  7. OpenMP reference sheet
  8. OpenMP Reference Page from LLNL