Lab 8: SSE Intrinsics and Loop Unrolling

Setup

Pull the contents of ~cs61c/labs/fa12/08

$ cd ~/work
$ git pull ~cs61c/labs/fa12/08 master

Background

The x86 instruction set includes a large (and ever-growing) number of SIMD instructions, which Intel calls SSE (Streaming SIMD Extensions) instructions. SSE instructions operate on values stored in 128-bit-wide registers (the XMM registers) and perform the same operation on multiple data elements in parallel. The number of elements that a single instruction can process is determined by the width of the element's type: for example, a single SSE instruction can process four single-precision (32-bit) or two double-precision (64-bit) floating point values.

Each SSE instruction has a corresponding C function (known as an "intrinsic function") that causes the compiler to emit the desired SSE instruction in the assembly code it produces. Intrinsic functions make it much easier to use SSE instructions in C programs - without them, you would need to write assembly code and deal with details like register allocation, calling conventions, inline assembly syntax, etc. When you use intrinsic functions, the compiler does that dirty work for you. Another benefit of using intrinsics is that the compiler can reorder the SSE instructions if it determines that doing so could improve performance (while maintaining correctness).

Exercise 1 — What's up with SIMD?

There are hundreds of SSE instructions in the x86 ISA these days, and finding the best instruction(s) to use for a particular task may require a bit of research. Intel has published a document called the "Intel Intrinsics Guide" which describes every intrinsic function - Google it and download the PDF.

Your assignment:

Find the names of the intrinsic functions that perform the following SIMD operations:

∗    Four single precision floating point divisions (type float)
∗    Sixteen "max" operations across signed 8-bit integers (type char)
∗    Arithmetic right shifts of eight signed 16-bit integers (type short)

Record the names of these functions in a text file to show your TA.

Exercise 2 — So what does SIMD look like, anyway?

This exercise demonstrates one way of vectorizing double-precision 2x2 matrix multiplication using SSE intrinsics. For this lab (and project 3), matrices will be stored in memory in column-major order (as opposed to row-major order). In column-major order, sequential addresses point to elements in adjacent rows of the matrix, so the element in row i and column j is offset from the first entry in the matrix by i+j*n entries (where n is the number of rows in the matrix). If you find this confusing, read Wikipedia's article on the subject. Which ordering to use is pretty much an arbitrary decision, not unlike whether bytes within a word are stored in memory in little or big endian order. In C, 2D arrays (e.g. int x[4][4]) are stored in row-major order, whereas in Fortran they are stored in column-major order.

Multiplying the 2x2 matrices A and B and adding the result into a third matrix C (C = C + A*B) requires performing the following arithmetic operations:

    C[0] += A[0]*B[0] + A[2]*B[1];
    C[1] += A[1]*B[0] + A[3]*B[1];
    C[2] += A[0]*B[2] + A[2]*B[3];
    C[3] += A[1]*B[2] + A[3]*B[3];

Look at the C code in sseTest.c - it uses the following intrinsic functions to implement the above computation:

__m128d _mm_loadu_pd( double *p )              returns vector (p[0], p[1])
__m128d _mm_load1_pd( double *p )              returns vector (p[0], p[0])
__m128d _mm_add_pd( __m128d a, __m128d b )     returns vector (a0+b0, a1+b1)
__m128d _mm_mul_pd( __m128d a, __m128d b )     returns vector (a0*b0, a1*b1)
void _mm_storeu_pd( double *p, __m128d a )     stores p[0]=a0, p[1]=a1

Your assignment:
1. Compile the code by running "gcc -g0 -O2 -S sseTest.c". The resulting assembly code is in the file sseTest.s.
(NOTE: You can assemble and link the .s file into an executable binary by running "gcc sseTest.s -o sseTest".)

2. Locate the instructions in sseTest.s that correspond to the body of the for loop, and identify the SSE instructions generated by each call to an intrinsic function. Add comments next to the instructions inside the loop to demonstrate to your TA that you understand what the code is doing.

Note that the x86 assembly code that GCC produces uses what's called AT&T syntax, which follows these conventions:
∗    uses "%" in front of register names (not $ as in MIPS assembly)
∗    the destination of an instruction is always the right-most operand (e.g. movupd source, destination)
∗    an operand with parentheses represents a memory location:
       –    (%rax) uses %rax as a pointer
       –    8(%rax) uses %rax as a base and 8 as an offset
       –    (%rax,%rbx) adds %rax and %rbx to form a memory address
       –    (%rax,%rbx,4) adds %rax and 4 times %rbx to form a memory address
∗    "leaq 8(%rax), %rbx" loads address of 8(%rax), i.e. %rax+8, into %rbx
∗    %rsp is the stack pointer; %rbp is the frame pointer
∗    %rXX registers are 64-bit; the %eXX names refer to their lower 32 bits
∗    addq and addl are 64- and 32-bit versions of add respectively
∗    when calling a function, the first several integer arguments are passed in registers (%rdi, %rsi, %rdx, %rcx, %r8, %r9 under the x86-64 calling convention), and any remaining arguments are pushed onto the stack (whereas in MIPS the first arguments are placed in $a0, $a1, etc. registers)
∗    local variables declared inside a function are also allocated space on the stack.

Exercise 3 — Welcome to the world of SIMD!

In this exercise, you will use SSE intrinsic functions to vectorize a piece of code, thereby harnessing the power of SIMD! Done correctly, this will make your code run about 4x faster than the naive implementation presented below:

    int sum_naive( int n, int *a )
    {
        int sum = 0;
        for( int i = 0; i < n; i++ )
            sum += a[i];
        return sum;
    }

You will likely find the following intrinsics useful:

__m128i _mm_setzero_si128( )                      returns 128-bit zero vector
__m128i _mm_loadu_si128( __m128i *p )             returns 128-bit vector stored at pointer p
__m128i _mm_add_epi32( __m128i a, __m128i b )     returns vector (a0+b0, a1+b1, a2+b2, a3+b3)
void _mm_storeu_si128( __m128i *p, __m128i a )    stores 128-bit vector a at pointer p

Your assignment:
1. Start with sum.c. Use SSE intrinsics to implement the sum_vectorized() function. (Make sure to compile with the -O2 flag!!!!)

2. Show your TA your working code and report how it performs relative to the original version.

Exercise 4 — Let's unroll some loops!

Loop unrolling is a code optimization that can provide a significant performance boost to certain kinds of code. SIMD (or vector) code often falls into this category. Loop unrolling involves replicating the code in the body of a loop N times, updating all calculations involving loop variables appropriately, and (if necessary) handling edge cases where the number of loop iterations isn't divisible by N. Unrolling the loop in the SIMD code you wrote for the previous exercise will improve its performance significantly.

Here's an example of a loop that has been unrolled:

    int sum_unrolled( int n, int *a )
    {
        int sum = 0;

        /* do the body of the work in a faster unrolled loop */
        for( int i = 0; i < n/4*4; i += 4 )
        {
            sum += a[i+0];
            sum += a[i+1];
            sum += a[i+2];
            sum += a[i+3];
        }

        /* handle the small tail in a usual way */
        for( int i = n/4*4; i < n; i++ )   
            sum += a[i];

        return sum;
    }

Take a look at Wikipedia's article on loop unrolling for more detailed information about loop unrolling, including explanations of why it can improve performance so drastically (hint: there's more to it than just reducing the overhead of updating and checking the loop index).

Your assignment:
1. Make a copy of your sum_vectorized() function called sum_vectorized_unrolled(). Unroll the loop by a factor of 4, so that one iteration of the loop performs 4 times as much work as it did originally. (Make sure to compile with the -O2 flag!!!!)

2. Show your TA your unrolled code, report the performance gain you observed, and give some reasons why unrolling the loop made the code run faster.