CS61C $ummer 2017 Lab 9 - SIMD Intrinsics and Unrolling the Strands of Fate which bind for loops together

Question: Why do I always like to use the word nontrivial?

Answer: I once went to Professor Josh Hug's office hours, and I swear he used the word every 4 sentences. I thought it was a really nice word to use to describe stuff that you think is important but you don't want to use the dry word important for. When I say something is nontrivial, I'm saying that it's something worth your attention; that requires a significant NONTRIVIAL amount of energy to process and understand. I challenge every one of you to always do nontrivial work in your lives.

IMPORTANT: Part of today's exercises (3-5) were made possible by TA Nikhil!

UPDATE: PLEASE COPY AGAIN

If you were in Thursday 7/20's 11-1 labs you may have copied a buggy version of the starter code for Exercise 3. Please run the following commands again to get the updated starter code. In particular, the following changes were made:

The Lab Files

Copy the lab files from the instructional servers to your lab account with

$ cp -r ~cs61c/labs/09/ ~/labs/09/

Alternatively, secure-copy (scp) them from the instructional servers to your own laptop with

$ scp -r cs61c-xxx@hive10.cs.berkeley.edu:~cs61c/labs/09/ ~/YOUR_FILEPATH

And if you want to secure-copy them back to your class account:

$ scp -r ~/YOUR_FILEPATH/09 cs61c-xxx@hive10.cs.berkeley.edu:~/YOUR_LABACCT_FILEPATH

Note that all code using SSE instructions is only guaranteed to work on the hive machines.


Many newer processors support SSE intrinsics, so it is certainly possible that your machine will be sufficient, but you may not see accurate speedups. Ideally, you should ssh into one of the hive machines to run this lab.

Exercises

Exercise 1: Familiarize Yourself with the SIMD Functions

Given the large number of available SIMD intrinsics we want you to learn how to find the ones that you'll need in your application.

Intel hosts a variety of tools related to intrinsics, which you can find here (but these are not necessary for this lab).

The one that we're particularly interested in is the Intel Intrinsics Guide. Open this page and once there, click the checkboxes for everything that begins with "SSE" (SSE all the way to SSE4.2).

Do your best to interpret the new syntax and terminology. (There are certainly an overwhelming amount of m's...) Find the 128-bit intrinsics for the following SIMD operations (one for each):

HINT: If you're overwhelmed, just try clicking on a function whose name you understand! Maybe try one of the "add" functions. Now read the "Description" and "Operation" sections in the drop-down menu opened by clicking the instruction. If you clicked an "add" function like I suggested, you should've read something like, "Add packed X-bit integers in a and b, and store the results in dst." Then you should realize that the value "X" also appears in the function name! The pattern you see here can be described as follows:

__m128i _mm_add_epiX adds together (128/X) packed X-bit integers.

Another hint: Things that say "epi" or "pi" deal with integers, and those that say "ps" or "pd" deal with single precision and double precision floats.

Checkoff [1/5] (uh oh... 5 exercises...)

Exercise 2: Reading SIMD Code

In this exercise you will consider the vectorization of 2-by-2 matrix multiplication in double precision (the code we looked at in the SIMD lecture):

The math in the above image amounts to the following arithmetic operations:

    C[0] += A[0]*B[0] + A[2]*B[1];
    C[1] += A[1]*B[0] + A[3]*B[1];
    C[2] += A[0]*B[2] + A[2]*B[3];
    C[3] += A[1]*B[2] + A[3]*B[3];

You are given the code sseTest.c that implements these operations in a SIMD manner.
The following intrinsics are used:

__m128d _mm_loadu_pd( double *p ) returns vector (p[0], p[1])
__m128d _mm_load1_pd( double *p ) returns vector (p[0], p[0])
__m128d _mm_add_pd( __m128d a, __m128d b ) returns vector (a0+b0, a1+b1)
__m128d _mm_mul_pd( __m128d a, __m128d b ) returns vector (a0b0, a1b1)
   void _mm_storeu_pd( double *p, __m128d a ) stores p[0]=a0, p[1]=a1

Compile sseTest.c into x86 assembly (not MIPS!) by running:

make sseTest.s

Now, observe the general structure of the code in sseTest.s. See if you can find the for-loop in sseTest.s (hint: it's a trick question, see exercise 4) and see if you can identify which instructions are performing SIMD operations. Be prepared to describe to your TA what is happening in general, but you do not need to spend too much time on this section (recall that we are not interested in x86 assembly in this class). We don't expect you to tell us exactly what the code is doing line by line. If you're stuck with decipering the x86 assembly, refer to the original sseTest.c file that contains the matching C code.

HINT: Usually loops in assembly code require some kind of jumping or branching... can you find any in this code?

Checkoff [2/5]

Exercise 3: Writing SIMD Code

COMMON LITTLE MISTAKES

The following are bugs that the staff have noticed were preventing students from passing the tests:

NOTE: We'll be changing directories now! Run the command:

$  cd lab09-branch_prediction

What the heck is going on in here? Allow me to tell you.

We've got one header file common.h that has some code to sum the elements of a really big array. It's a minor detail that it randomly does this 1 << 16 times... but you don't need to worry about that. We also pincer the execution of the code between two timestamps (that's what the clock() function does) to measure how fast it runs! The file randomized.c is the one which will have a main function to run the sum functions. The file sorted.c does the same, but with an extra detail which we'll only worry about in Exercise 5.

GOT IT?? GOOD.

For Exercise 3, you will vectorize/SIMDize the following code in common.h to speed up the naive implementation shown here:

 long long int sum(unsigned int vals[NUM_ELEMS]) {
  long long int sum = 0;
  
  for(unsigned int w = 0; w < OUTER_ITERATIONS; w++) {
    for(unsigned int i = 0; i < NUM_ELEMS; i++) {
      if(vals[i] >= 128) {
        sum += vals[i];
      }
    } 
  }
  return sum;

Note: you need only vectorize the inner loop.

You might find the following intrinsics useful... as in: You're going to need all of these functions:

__m128i _mm_setzero_si128( ) returns 128-bit zero vector Maybe to initialize something
__m128i _mm_loadu_si128( __m128i *p ) returns 128-bit vector stored at pointer p You've got an array vals... how do you get some elements in vector (__m128i) form?
__m128i _mm_add_epi32( __m128i a, __m128i b ) returns vector (a0+b0, a1+b1, a2+b2, a3+b3) This is a summing function after all! You'll definitely need to add some stuff together!
void _mm_storeu_si128( __m128i *p, __m128i a ) stores 128-bit vector a at pointer p Load goes from pointer to vector, this goes from vector to pointer!
_m128i _mm_cmpgt_epi32( __m128i a, __m128i b ) returns the vector (ai>bi ? 0xffffffff : 0x0 for i from 0 to 3)
AKA a 32-bit all-1s mask if ai > bi and a 32-bit all-0s mask otherwise
cmpgt is how SSE says, "compare greater than." How do you use this to implement the if statement?
_m128i _mm_and_si128( __m128i a, __m128i b ) returns vector (a0&b0, a1&b1, a2&b2, a3&b3) Mask stuff

I'm REALLY SORRY that this table is kinda crappy... If you have any questions on how these functions work, feel free to ask a staff member. You could also find them on that SSE Intrinsics Guide you used in Exercise 1!

Task: Start with sum, and use SSE instrinsics to implement the sum_simd() function, which needs to be vectorized and achieve a significant amount of speedup.

How do I even do this? What do we mean when we say to SIMDize a piece of code? Well, recall that the SSE intrinsics are basically functions which perform operations on multiple pieces of data in a vector in parallel. This turns out to be faster than running through a for loop and applying the operation once for each element in the vector.

In our sum function, we've got a basic structure of iterating through an array. On every iteration, we add an array element to a running sum. Here's my question for you. Why don't we add a few array elements to a sum vector in parallel and then consolidate the individual values of the sum vector into our desired sum at the end? Doing so is your task, and I hope you now have some ideas about how to use the SSE functions to help you.

Hint 1: See all those __m128i things? Just call them vectors to make your life easier. They're basically 128B vectors which can be interpreted as any 128B chunk of memory. We'll be using them to encode 4 (four) 32-bit ints.

Hint 2: We've left you a vector called _127 which contains four copies of the number 127. You should use this to compare with some stuff when you implement the condition within the sum loop.

Hint 3: DON'T use the store function (_mm_storeu_si128) until the very end! It turns out that storing is very costly and performing a store in every iteration will actually cause your code to slow down.

Hint 4: Why you want to storeu in the first place: It's bad practice to index into the __m128i vector things like they are arrays, so you want to store them into arrays first with the storeu function, and then access the integers elementwise by indexing into the array.

Hint 5: READ the function declarations in the above table carefully! You'll notice that the loadu and storeu take __m128i* type arguments. But shouldn't we be loading from arrays and storing into them? Answer: yes. You can just cast an int array to a _m128i pointer, like we did in the "floating-point bit-ops" lab so we can interpret 128B of the array as a _m128i* type input. Alternatively, you could skip the typecast, but the compiler will throw a bunch of warnings at you... which is really annoying.

To compile and run your code, run the following commands:

$ cd lab09-branch_prediction
$ make
$ ./randomized

Sanity check: The naive version runs at about 25 seconds on the hive machines, and your SIMDized version should run in about 5-6 seconds.

Checkoff [3/5]

You already know what time it is. Exercise 3.5: TA Alex's Sandbox

I'm going to be honest with you: I think we peaked with the Liz Climo comics from last time. Literally nothing is going to beat them, and I probably should have saved them for last... or at least like the 2nd to last or 3rd to last lab.

Question: Where do you get all this cute stuff?
Answer: It's all the stuff I've upvoted on Reddit. I'm very selective with what I upvote and what I like.

Here's today's image of the day.

I really like the way Liz's comics show up on the lab page, embedded, so I think I'll leave another one here.

Exercise 4: Loop Unrolling

Fact: You can obtain even more performance improvement! Carefully unroll the SIMD vector sum code that you created in the previous exercise. This should get you a little more increase in performance from sum_simd (a few fractions of a second). As an example of loop unrolling, consider the supplied function sum_unrolled():

long long int sum_unrolled(unsigned int vals[NUM_ELEMS]) {
  long long int sum = 0;

  for(unsigned int w = 0; w < OUTER_ITERATIONS; w++) { 
    for(unsigned int i = 0; i < NUM_ELEMS / 4 * 4; i += 4) {
      if(vals[i] >= 128) sum += vals[i];
      if(vals[i + 1] >= 128) sum += vals[i + 1];
      if(vals[i + 2] >= 128) sum += vals[i + 2];
      if(vals[i + 3] >= 128) sum += vals[i + 3];
    }

    // This is what we call the TAIL CASE
    for(unsigned int i = NUM_ELEMS / 4 * 4; i < NUM_ELEMS; i++) {
      if (vals[i] >= 128) {
        sum += vals[i];
      }
    }
  }
  return sum;
}

Task: Consider why unrolling the loop like this is useful at all.
Hint: What part of the for loop do we have to do LESS of now that we do four operations at a time?

You can do a Google search for "loop unrolling" for some inspiration...

Big task: Within common.h, copy your sum_simd() code into sum_simd_unrolled() and unroll it 4 (four) times.

To compile your code, run the following command:

make

To run your code, run the following command:

./randomized

Checkoff [4/5]

Exercise 5: Performance Comparisons

You might have noticed that we've only been running the randomized executable so far. That's because now, we want to compare it to the code in sorted.c! Recall that executing the code in randomized.c just sums up the elements of a huge array of random ints. The only difference with sorted.c is that the array is sorted before running the sum functions! Let's check out the performance by running:

$ ./sorted

Task NUMBER 1: WHOA! You should've noticed that the naive sum function ran a lot faster than it did when the array was randomized. Your task is to think about WHY.

Recall that there is a check within our sum function which tells us to only add to the running total if the current element is greater than or equal to 128. Question: what kind of assembly instruction would need to be used to implement this? From your answer to the preceding question, recall that these kinds of instructions introduce hazards in the datapath pipeline, and that one technique to deal with them was to have a __________ predictor that chooses to always execute the first instruction of one of the outcomes of the _________.

It turns out that much like caches, these predictors fare better when our code is more predictable. They make their predictions based on what's been happening most recently at the ________ in question. In particular, if our code seems to make the same decision at the __________ many times in a row, it will probably predict that that particular outcome will happen again... Why might this explain how sorting the array speeds up the execution of the program?

Task NUMBER 2: OK, cool! It's neat that sorting the array gets us such a high degree of speedup, buuuuuut why didn't the sum_simd or sum_simd_unrolled speed up? To think about this, you need to think about how exactly the original condition is checked in the SIMDized versions that you wrote. What is the important SSE function that you used?

It turns out that this function is implemented without the use of branch instructions at the assembly level! What does this fact tell you about the advantages of sorting the array in the SIMDized code?

Task NUMBER 3: YOU ARE ALMOST DONE! The last observation we should make is that out of the 4 versions of the sum function running on a sorted array, the one that performs best is actually sum_unrolled as opposed to sum_simd_unrolled! If you ran randomized however, the sum_simd version is actually better than sum_unrolled Question: What gives? More formally, what does this tell you about the advantages of successful branch prediction (oopsie) in comparison to data-level parallelism in this special case where the data is sorted?

Checkoff [5/5]