# Lab 8: SIMD Instructions

Deadline: Friday, November 4, 11:59:59 PM PT

Lab Section Slides

## Setup

Warning: We strongly recommend working on the hive machines for this lab. Many older processors don't support SSE intrinsics, and your local machine may perform differently due to having different hardware (CPU cache size, memory speed, etc.).

In your `labs` directory, pull the files for this lab with:

```
git pull starter main
```

## Exercise 1: SIMD Functions

This exercise consists of reading documentation to familiarize yourself with Intel's SIMD instructions. This exercise is not graded, but is needed to understand the remaining exercises.

Read over the Intel Intrinsics Guide to learn about the available SIMD instructions (an intrinsic function is a function whose implementation is handled by the compiler). The Intrinsics Naming and Usage documentation will be helpful in understanding the documentation.

The hive machines support SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, and AVX2, so you can check those boxes in the filters list. Some of the other instruction sets are also supported, but we can ignore those for the purposes of this lab.

## Exercise 2: Loop Unrolling Example

The `sum()` function in `simd.c` is an unoptimized implementation that sums the elements whose values are >= `128` in a really big array (roughly `2^16` elements). We use an outer loop to repeat the sum `OUTER_ITERATIONS` (roughly `2^14`) times to increase the code runtime so we can take decent speedup measurements. We time the execution of the code by taking the difference between the start and end timestamps (using `clock()`). The file `test_simd.c` contains the `main` function that runs the various `sum` functions.

Let's look at `sum_unrolled()`. The inner loop processes 4 elements per iteration, whereas the inner loop in `sum()` processes 1 element per iteration. Note the extra loop after the primary loop -- since the primary loop advances through the array in groups of 4 elements, we need a tail case loop to handle arrays with lengths that are not multiples of 4.

Try compiling and running the code:

```
make simd
./simd
```

The unrolled function should be slightly faster, although not by much. But faster programs are always nice to have!

Question: if loop unrolling helps, why don't we unroll everything?

• The unrolled code is harder to read and write. Unless you plan to never look at the code again, code readability may outweigh the benefits of loop unrolling!
• Sometimes, the compiler will automatically unroll your naive loops for you! Emphasis on sometimes -- it can be difficult to figure out what magic tricks a modern compiler performs (see Godbolt in the next paragraph). For demonstration purposes, we've disabled compiler optimizations in this lab.
• Loop unrolling means more instructions, which means larger programs and potentially worse caching behavior!
• Our simplified examples in `simd.c` use a known array size. If you don't know the size of the array you're working on, your unrolled loop might not be a good fit for the array!

Optional: you can visualize how the vectors and the different functions work together by inputting your code into the code environment at this link! Another interesting tool that might help you understand the behavior of SIMD instructions is the Godbolt Compiler Explorer project. It can also provide a lot of insights when you need to optimize any code in the future.

## Exercise 3: Writing SIMD Code

The following code demonstrates how to add together an 8-element integer array using SIMD instructions. Our registers in this example are 128-bits and integers are 32 bits. This means that we can fit four integers into one register.

```c
#include <emmintrin.h>  // SSE2 intrinsics

int arr[8] = {3, 1, 4, 1, 5, 9, 2, 6};
// Initialize sum vector to {0, 0, 0, 0}
__m128i sum_vec = _mm_setzero_si128();

// Load array elements 0-3 into a temporary vector register
__m128i tmp = _mm_loadu_si128((__m128i *) arr);
// Add to existing sum vector
sum_vec = _mm_add_epi32(sum_vec, tmp);
// sum_vec = {3, 1, 4, 1}

// Load array elements 4-7 into a temporary vector register
tmp = _mm_loadu_si128((__m128i *) (arr + 4));
// Add to existing sum vector
sum_vec = _mm_add_epi32(sum_vec, tmp);
// sum_vec = {3 + 5, 1 + 9, 4 + 2, 1 + 6}

// Create temporary array to hold values from sum_vec
// We must store the vector into an array in order to access the individual values (as seen below)
int tmp_arr[4];
_mm_storeu_si128((__m128i *) tmp_arr, sum_vec);
// Collect values from sum_vec in a single integer
int sum = tmp_arr[0] + tmp_arr[1] + tmp_arr[2] + tmp_arr[3];
```

This is a lot of work for adding together 8 elements. However, this process greatly improves the performance of summing together the elements of large arrays.

### Action Item

Let's implement `sum_simd()`, a vectorized version of the naive `sum()` implementation!

You only need to vectorize the inner loop with SIMD. Implementation can be done with the following intrinsics:

• `__m128i _mm_setzero_si128()` - returns a 128-bit zero vector
• `__m128i _mm_loadu_si128(__m128i *p)` - returns 128-bit vector stored at pointer p
• `__m128i _mm_add_epi32(__m128i a, __m128i b)` - returns vector (a_0 + b_0, a_1 + b_1, a_2 + b_2, a_3 + b_3)
• `void _mm_storeu_si128(__m128i *p, __m128i a)` - stores 128-bit vector a into pointer p
• `__m128i _mm_cmpgt_epi32(__m128i a, __m128i b)` - returns the vector (a_i > b_i ? `0xFFFFFFFF` : `0x0` for `i` from 0 to 3). In other words, a 32-bit all-1s mask if a_i > b_i and a 32-bit all-0s mask otherwise
• `__m128i _mm_and_si128(__m128i a, __m128i b)` - returns vector (a_0 & b_0, a_1 & b_1, a_2 & b_2, a_3 & b_3), where & represents the bit-wise and operator

### Tips

• DON'T use the store function (`_mm_storeu_si128`) until after completing the inner loop! It turns out that storing is very costly and performing a store in every iteration will actually cause your code to slow down. However, if you wait until after the outer loop completes you may have overflow issues.
• Read the function declarations in the above list carefully! You'll notice that `_mm_loadu_si128` and `_mm_storeu_si128` take `__m128i *` arguments. You can just cast an `int` array to a `__m128i` pointer.

### Testing

To compile and run your code, run the following commands:

```
make simd
./simd
```

The naive version takes about 7 seconds on the hive machines, and your SIMDized version should run in about 1-2 seconds.

### Common Bugs

Below are common bugs that the staff have noticed in implementations for this exercise.

• Forgetting the conditional in the tail case: what condition have we been checking before adding something to the sum?
• Adding to an uninitialized array: if you add stuff to your result array without initializing it, you are adding stuff to garbage, which makes the array still garbage!
• Re-initializing your sum vector: make sure you are not creating a new sum vector for every iteration of the inner loop!
• Trying to store your sum vector into a `long long int` array: use an `int` array. The return value of this function is indeed a `long long int`, but that's because an `int` isn't big enough to hold the sum of all the values across all iterations of the outer loop. `long long int` and `int` have different bit widths, so storing the vector into a `long long int` array will produce different numbers!

## Exercise 4: Unrolling SIMD Loop

To obtain even more performance improvement, carefully unroll the SIMD vector sum code that you created in the previous exercise to create `sum_simd_unrolled()`. This should give you a modest performance increase over `sum_simd` (a few fractions of a second).

### Action Item

Within `simd.c`, copy your `sum_simd()` code into `sum_simd_unrolled()` and unroll it 4 (four) times. Don't forget about your tail case!

### Testing

To compile and run your code, run the following commands:

```
make simd
./simd
```

Some general advice on working with SIMD instructions:

• Be cautious of memory alignment. For example, `__m256d _mm256_load_pd(double const *mem_addr)` does not work with unaligned data -- you would need `__m256d _mm256_loadu_pd` instead. Meanwhile, if you have control over memory allocation, it is almost always desirable to keep your data aligned (this can be achieved with special memory allocation APIs). Aligned loads can be folded into other operations as a memory operand, which slightly reduces code size and improves throughput. Modern CPUs have very good support for unaligned loads, but there's still a significant performance hit when a load crosses a cache-line boundary.
• Recall the CPU pipeline hazards you learned about earlier this semester. Data hazards can drastically hurt performance, so you may want to check for data dependencies in adjacent SIMD operations if you're not getting the desired performance.

## Feedback Form

We are working to improve the labs for next semester, so please fill out this survey to tell us about your experience with Lab 8. The survey will be collecting your email to verify that you have submitted it, but your responses will be anonymized before the data is analyzed. Thank you!

## Submission

Save, commit, and push your work, then submit to the Lab 8 assignment on Gradescope. The autograder tests are similar to those in `test_simd.c`, but with potentially different constants (`NUM_ELEMS` and `OUTER_ITERATIONS`) and reduced speedup requirements (to compensate for more variability in autograder resources).