Lab 8: Data-Level Parallelism

Deadline: Friday, July 30, 11:59:59 PM PT

Objectives

  • Learn about and use various SIMD functions to perform data level parallelism
  • Write code to SIMD-ize certain functions
  • Learn about loop-unrolling and why it works

Setup

Pull the lab files from the starter:

git pull starter main
We strongly recommend working on the Hive machines for this lab. Many older processors don't support SSE intrinsics. Even if your processor does support SSE intrinsics, it may produce different results on Hive and your computer due to differing characteristics (cache size, clock speed, etc.).

Exercise 1: Familiarize Yourself with the SIMD Functions

Given the large number of available SIMD intrinsics we want you to learn how to find the ones that you'll need in your application.

For this mini-exercise, we ask you to look at the Intrinsics Naming and Usage documentation. Then, visit the Intel Intrinsics Guide. The Hive machines support SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, and AVX2, so you can check those boxes in the filters list. Some of the other instruction sets are also supported, but we can ignore those for the purposes of this lab.

There isn't an autograded task for this exercise, but try looking through the various instructions in the guide and familiarize yourself with the syntax used. As an example, __m128 _mm_add_ps (__m128 a, __m128 b) adds 128-bit vectors of floats (ps suffix), processing 128 / 32 = 4 floats per operation.

Exercise 2: Loop Unrolling Example

The sum() function in simd.c is an un-optimized implementation of the sum the elements of a really big array (roughly 2^16 elements). This doesn't take long enough to take decent speedup measurements, so we use an outer loop to repeat the sum OUTER_ITERATIONS (roughly 2^14) times. We also time the execution of the code by finding the difference between the start and end timestamps (using clock()). The file test_simd.c is the one which will have a main function to run the various sum functions.

Let's look at sum_unrolled(). The inner loop processes 4 elements per iteration, whereas the inner loop in sum() processes 1 element per iteration. By performing more operations per iteration of the for loop, we have to loop less and not have to waste as many cycles -- remember the effects and hazards related to pipelining and branching! Note the extra loop after the primary loop -- since the primary loop advances through the array in groups of 4 elements, we need a tail case loop to handle arrays with lengths that are not multiples of 4.

Code Example: RISC-V visualization of unrolling a sum function

For example, consider this very simple example that adds together the first n elements of an array arr:

int total = 0;
for (int i = 0; i < n; i++) {
  total += arr[i];
}

The corresponding assembly code might look something like this:

    add t0, x0, x0
    add t1, x0, x0 // Initialize loop counter
loop:   beq t0, a1, end // Assume register a1 contains the size n of the array
    slli t2, t1, 2
    add t2, t1, a0 // Assume register a0 contains a pointer to the beginning of the array
    lw t3, 0(t2) // Load arr[i] into t3
    add t0, t3, t0 // total += arr[i]
    addi t1, t1, 1 // Increment the loop counter
    jal x0, loop
end:  ...

If we unroll the loop 4 times, this would be our equivalent code, with a tail case for the situations where n is not a multiple of 4:

int total = 0;
for (int i = 0; i < n / 4 * 4; i+=4) {
  total += arr[i];
  total += arr[i + 1];
  total += arr[i + 2];
  total += arr[i + 3];
}

for (i = n / 4 * 4; i < n; i++) {
  total += arr[i];
}

For the unrolled code, the corresponding assembly code might look something like this:

      add t0, x0, x0
      add t1, a1, x0 // Assume register a1 contains the size n of the array
      srli t1, t1, 2
      slli t1, t1, 2 // Find largest of multiple 4 <= n
      add t2, x0, x0 // Initialize loop counter
loop: beq t2, t1, tail
      slli t3, t2, 2
      add t3, t3, a0 // Assume register a0 contains a pointer to the beginning of the array
      lw t4, 0(t3) // Load arr[i] into t4
      add t0, t4, t0 // total += arr[i]
      lw t4, 4(t3) // Load arr[i + 1] into t4
      add t0, t4, t0
      lw t4, 8(t3), t0 // Load arr[i + 2] into t4
      add t0, t4, t0
      lw t4, 12(t3), // Load arr[i + 3] into t4
      add t0, t4, t0
      addi t2, t2, 4 // Increment the loop counter
      jal x0, loop
tail: beq t2, a1, end
      slli t3, t2, 2
      lw t4, 0(t3)
      add t0, t4, t0
      addi t2, t2, 1
end: ...

Try compiling and running the code:

$ make simd
$ ./simd

The unrolled function should be slightly faster, although not by much. But faster programs are always nice to have!

Question: if loop unrolling helps, why don't we unroll everything?

  • Loop unrolling means more instructions, which means larger programs and potentially worse caching behavior!
  • Our simplified examples in simd.c use a known array size. If you don't know the size of the array you're working on, your unrolled loop might not be a good fit for the array!
  • The unrolled code is harder to read and write. Unless you plan to never look at the code again, code readability may outweigh the benefits of loop unrolling!
  • Sometimes, the compiler will automatically unroll your naive loops for you! Emphasis on sometimes -- it can be difficult to figure out what magic tricks a modern compiler performs (see Godbolt in the next paragraph). For demonstration purposes, we've disabled compiler optimizations in this lab.

Optional: you can visualize how the vectors and the different functions work together by inputting your code into the code environment at this link! Another interesting tool that might help you understand the behavior of SIMD instructions is the Godbolt Compiler Explorer project. It can also provide a lot of insights when you need to optimize any code in the future.

Exercise 3: Writing SIMD Code

General Advice

Some general advice on working with SIMD instructions:

  • Be cautious of memory alignment. For example, _m256d _mm256_load_pd (double const * mem_addr) would not work with unaligned data -- you would need _m256d _mm256_loadu_pd. Meanwhile, if you have control over memory allocation, is almost always desireable to keep your data aligned (can be achieved using special memory allocation APIs). Aligned loads can be folded into other operations as a memory operand which reduces code size and throughput slightly. Modern CPUs have very good support for unaligned loads, but there's still a significant performance hit when a load crosses a cache-line boundary.
  • Recall various CPU pipeline hazards you have learned earlier this semester. Data hazards can drastically hurt performance. That being said, you may want to check data dependencies in adjacent SIMD operations if not getting the desired performance.

Additionally, below are common bugs that the staff have noticed in implementations for this exercise. Some of it may not make sense yet, but feel free to refer back as you're working on the code:

  • Forgetting the CONDITIONAL in the tail case: what condition have we been checking before adding something to the sum?
  • Adding to an UNINITIALIZED array: if you add stuff to your result array without initializing it, you are adding stuff to garbage, which makes the array still garbage! Using storeu before adding stuff is okay though.
  • Re-initializing your sum vector: make sure you are not creating a new sum vector for every iteration of the inner loop!
  • Trying to store your sum vector into a long long int array: use an int array. The return value of this function is indeed a long long int, but that's because an int isn't big enough to hold the sum of all the values across all iterations of the outer loop. long long int and int have different bit widths, so storing an int array into a long long int will produce different numbers!

Action Item

Now, let's implement sum_simd(), a vectorized version of the naive sum() implementation!

You only need to vectorize the inner loop with SIMD. Implementation can be done with the following intrinsics:

  • __m128i _mm_setzero_si128() - returns a 128-bit zero vector
  • __m128i _mm_loadu_si128(__m128i *p) - returns 128-bit vector stored at pointer p
  • __m128i _mm_add_epi32(__m128i a, __m128i b) - returns vector (a_0 + b_0, a_1 + b_1, a_2 + b_2, a_3 + b_3)
  • void _mm_storeu_si128(__m128i *p, __m128i a) - stores 128-bit vector a into pointer p
  • __m128i _mm_cmpgt_epi32(__m128i a, __m128i b) - returns the vector (a_i > b_i ? 0xffffffff : 0x0 for i from 0 to 3). AKA a 32-bit all-1s mask if a_i > b_i and a 32-bit all-0s mask otherwise
  • __m128i _mm_and_si128(__m128i a, __m128i b) - returns vector (a_0 & b_0, a_1 & b_1, a_2 & b_2, a_3 & b_3), where & represents the bit-wise and operator
Code Example: vectorizing the sum of an 8-element array

Note that the following examples demonstrate how to approach vectorizing functions with SIMD. The array is too short for SIMD performance benefits

Consider an 8-element integer array. We can add its elements without SIMD (ignoring the existence of for loops):

int arr = {3, 1, 4, 1, 5, 9, 2, 6};
int sum = 0;
sum += arr[0];
sum += arr[1];
sum += arr[2];
sum += arr[3];
sum += arr[4];
sum += arr[5];
sum += arr[6];
sum += arr[7];

Let's break it down a little more:

int arr = {3, 1, 4, 1, 5, 9, 2, 6};
int sum = 0;

// Group 1
sum = sum + arr[0];
sum = sum + arr[1];
sum = sum + arr[2];
sum = sum + arr[3];

// Group 2
sum = sum + arr[4];
sum = sum + arr[5];
sum = sum + arr[6];
sum = sum + arr[7];

With SIMD:

int arr[8] = {3, 1, 4, 1, 5, 9, 2, 6};
// Initialize sum vector of {0, 0, 0, 0}
__m128i sum_vec = _mm_setzero_si128();
__m128i tmp;

// Group 1
// Load array elements 0-3 into a temporary vector register
tmp = _mm_loadu_si128((__m128i *) arr);
// Add to existing sum vector
sum_vec = _mm_add_epi32(sum_vec, tmp);
// sum_vec = {3, 1, 4, 1}

// Group 2
// Load array elements 4-7 into a temporary vector register
tmp = _mm_loadu_si128((__m128i *) (arr + 4));
// Add to existing sum vector
sum_vec = _mm_add_epi32(sum_vec, tmp);
// sum_vec = {3 + 5, 1 + 9, 4 + 2, 1 + 6}

// Create temporary array to hold values from sum_vec
int tmp_arr[4];
_mm_storeu_si128((__m128i *) tmp_arr, sum_vec);
// Collect values from sum_vec in a single integer
int sum = tmp_arr[0] + tmp_arr[1] + tmp_arr[2] + tmp_arr[3];

Let's use sum_unrolled() as a reference, and use SSE intrinsics to re-implement the inner loop. Recall that the vector instructions perform operations on multiple pieces of data (in a vector) in parallel. This turns out to be faster than running through a for loop and performing one operation on one piece of data per iteration.

In sum_unrolled(), we process 4 array elements in each iteration. Similarly, the vector functions listed above allow you to perform one operation on 4 integers at once, so the 4 sets of operations can be merged into 1 set of SIMD operations! When implementing sum_simd(), you should add a few array elements to a sum vector in parallel and then consolidate the individual values of the sum vector into our desired sum at the end.

  • Hint 1: __m128i is the data type for Intel's special 128-bit vector. We'll be using them to encode 4 (four) 32-bit ints.
  • Hint 2: We've left you a vector called _127 which contains four copies of the number 127. You should use this to compare with some stuff when you implement the condition within the sum loop.
  • Hint 3: DON'T use the store function (_mm_storeu_si128) until after completing the inner loop! It turns out that storing is very costly and performing a store in every iteration will actually cause your code to slow down. However, if you wait until after the outer loop completes you may have overflow issues.
  • Hint 4: It's bad practice to index into the __m128i vector like they are arrays. You should store them into arrays first with the storeu function, and then access the integers elementwise by indexing into the array.
  • Hint 5: READ the function declarations in the above table carefully! You'll notice that the loadu and storeu take __m128i* type arguments. You can just cast an int array to a __m128i pointer.

To compile and run your code, run the following commands:

$ make simd
$ ./simd

Sanity check: The naive version runs at about 7 seconds on the Hive machines, and your SIMDized version should run in about 1-2 seconds.

Exercise 4: Unrolling Loops

To obtain even more performance improvement, carefully unroll the SIMD vector sum code that you created in the previous exercise to create sum_simd_unrolled(). This should get you a little more increase in performance from sum_simd (a few fractions of a second). As an example of loop unrolling, consider the difference between sum() and sum_unrolled().

Task

Within simd.c, copy your sum_simd() code into sum_simd_unrolled() and unroll it 4 (four) times. Don't forget about your tail case!

To compile and run your code, run the following commands:

$ make simd
$ ./simd

Submission

Please submit to the Lab 8 autograder on Gradescope. The autograder tests are similar to those in test_simd.c, but with potentially different constants (NUM_ELEMS and OUTER_ITERATIONS) and reduced speedup requirements (to compensate for more variability in autograder resources).