# CS61C Fall 2018 Lab 9: Data- and Thread-Level Parallelism

Congrats on finishing a difficult midterm! We're going to dive a deeper into practical parallelism this week, which should get you excited for Project 2. This lab covers a lot of very important ground, so please don't hesitate to ask for help! CS61C TA helping their student (Cartoon from Liz Climo)

## The Lab Files

Pull the Lab 9 files from the lab starter repository with

`\$ git pull starter master`

## Part 1: SIMD

### SIMD Reference Material

NOTE: For this part of the lab, we'll be working in the simd directory:

```\$  cd simd
```

### Exercise 1: Writing SIMD Code

We've got one header file common.h that has some code to sum the elements of a really big array. Be sure to take a careful look at the code--does it sum every single one of the elements? It's a minor detail that it randomly does this 1 << 16 times... but you don't need to worry about that. We also pincer the execution of the code between two timestamps (that's what the clock() function does) to measure how fast it runs! The file simd.c is the one which will have a main function to run the sum functions.

For Exercise 1, you will vectorize/SIMDize the following code in common.h to speed up the naive implementation shown here:

``` long long int sum(unsigned int vals[NUM_ELEMS]) {
long long int sum = 0;

for(unsigned int w = 0; w < OUTER_ITERATIONS; w++) {
for(unsigned int i = 0; i < NUM_ELEMS; i++) {
if(vals[i] >= 128) {
sum += vals[i];
}
}
}
return sum;
```

Note: you need only vectorize the inner loop.

You might find the following intrinsics useful (Hint: You're going to need all of these functions):

 __m128i _mm_setzero_si128( ) returns 128-bit zero vector Maybe to initialize something __m128i _mm_loadu_si128( __m128i *p ) returns 128-bit vector stored at pointer p You've got an array vals... how do you get some elements in vector (__m128i) form? __m128i _mm_add_epi32( __m128i a, __m128i b ) returns vector (a0+b0, a1+b1, a2+b2, a3+b3) This is a summing function after all! You'll definitely need to add some stuff together! void _mm_storeu_si128( __m128i *p, __m128i a ) stores 128-bit vector a at pointer p Load goes from pointer to vector, this goes from vector to pointer! __m128i _mm_cmpgt_epi32( __m128i a, __m128i b ) returns the vector (ai>bi ? 0xffffffff : 0x0 for i from 0 to 3) AKA a 32-bit all-1s mask if ai > bi and a 32-bit all-0s mask otherwise cmpgt is how SSE says, "compare greater than." How do you use this to implement the if statement? __m128i _mm_and_si128( __m128i a, __m128i b ) returns vector (a0&b0, a1&b1, a2&b2, a3&b3) Mask stuff

Task: Start with sum, and use SSE instrinsics to implement the sum_simd() function, which needs to be vectorized and achieve a significant amount of speedup.

How do I do this?

Recall that the SSE intrinsics are basically functions which perform operations on multiple pieces of data in a vector in parallel. This turns out to be faster than running through a for loop and applying the operation once for each element in the vector.

In our sum function, we've got a basic structure of iterating through an array. On every iteration, we add an array element to a running sum. To vectorize, you should add a few array elements to a sum vector in parallel and then consolidate the individual values of the sum vector into our desired sum at the end.

Hint 1: __m128i is the data type for 128-Bit vectors which will be handled by Intel's special 128-bit vector. We'll be using them to encode 4 (four) 32-bit ints.

Hint 2: We've left you a vector called _127 which contains four copies of the number 127. You should use this to compare with some stuff when you implement the condition within the sum loop.

Hint 3: DON'T use the store function (_mm_storeu_si128) until after completing the inner loop! It turns out that storing is very costly and performing a store in every iteration will actually cause your code to slow down. However, if you wait until after the outer loop completes you may have overflow issues.

Hint 4: It's bad practice to index into the __m128i vector like they are arrays. You should store them into arrays first with the storeu function, and then access the integers elementwise by indexing into the array.

Hint 5: READ the function declarations in the above table carefully! You'll notice that the loadu and storeu take __m128i* type arguments. You can just cast an int array to a _m128i pointer. Alternatively, you could skip the typecast at the cost of a bunch of compiler warnings.

#### COMMON LITTLE MISTAKES

The following are bugs that the staff have noticed were preventing students from passing the tests (bold text is what you should not do):

• Trying to store your sum vector into a long long int array. Use an int array. Side note: why?? The return value of this function is indeed a long long int, but that's because an int isn't big enough to hold the sum of all the values across all iterations of the outer loop. However, it is big enough to hold the sum of all the value accross a single iteration of the outer loop. This means you'll want to store your sum vector into an int array after every iteration of the outer loop and add the total sum to the final result result.
• Re-initializing your sum vector. Make sure when you add to your running sum vector, you aren't declaring a new sum vector!!
• Forgetting the CONDITIONAL in the tail case!
• Adding to an UNINITIALIZED array! If you add stuff to your result array without initializing it, you are adding stuff to garbage, which makes the array still garbage! Using storeu before adding stuff is okay though.

To compile your code, run the following command:

```\$  make simd
```

To run your code, run the following command:

```\$ ./simd
```

Sanity check: The naive version runs at about 25 seconds on the hive machines, and your SIMDized version should run in about 5-6 seconds. The naive function may take a long time to run! Do not try commenting it out, since we rely on that result for comparing against a reference; sometimes code can take a long time to run and this is one of those cases.

#### Checkoff [1/5]

• Show the TA or AI checking you off your working code and performance improvement from sum to sum_simd.

### Exercise 2: Loop Unrolling

In this exercise, we'll get an additional performance boost for our code with a technique called loop unrolling. Loop unrolling involves reducing the number of iterations a for loop has to go through by doing more work in each iteration of the for loop. For example, consider this very simple example that adds togther the first n elements of an array arr:

```int total = 0;
for (int i = 0; i < n; i++) {
total += arr[i];
}
```

The corresponding assembly code might look something like this:

```      add t0, x0, x0
add t1, x0, x0 // Initialize loop counter
loop: beq t0, a1, end // Assume register a1 contains the size n of the array
slli t2, t1, 2
add t2, t1, a0 // Assume register a0 contains a pointer to the beginning of the array
lw t3, 0(t2) // Load arr[i] into t3
add t0, t3, t0 // total += arr[i]
addi t1, t1, 1 // Increment the loop counter
jal x0, loop
end: ...
```

If we unroll the loop 4 times, this would be our equivalent code, with a tail case for the situations where n is not a multiple of 4:

```int total = 0;
for (int i = 0; i < n / 4 * 4; i+=4) {
total += arr[i];
total += arr[i + 1];
total += arr[i + 2];
total += arr[i + 3];
}

for (i = n / 4 * 4; i < n; i++) {
total += arr[i];
}
```

For the unrolled code, the corresponding assembly code might look something like this:

```      add t0, x0, x0
add t1, a1, x0 // Assume register a1 contains the size n of the array
srli t1, t1, 2
slli t1, t1, 2 // Find largest of multiple 4 <= n
add t2, x0, x0 // Initialize loop counter
loop: beq t2, t1, tail
slli t3, t2, 2
add t3, t3, a0 // Assume register a0 contains a pointer to the beginning of the array
lw t4, 0(t3) // Load arr[i] into t3
add t0, t4, t0 // total += arr[i]
lw t4, 4(t3) // Load arr[i + 1] into t3
lw t4, 8(t3), t0 // Load arr[i + 2] into t3
lw t4, 12(t3), // Load arr[i + 3] into t3
addi t2, t2, 4 // Increment the loop counter
jal x0, loop
tail: beq t2, a1, end
slli t3, t2, 2
lw t4, 0(t3)
end: ...
```

Why does loop unrolling give us a speedup? In the code above, if N was 100, in the "rolled up" code, we would call the instruction beq t0, a1, end 100 times. In the unrolled code, we would call the equivalent instruction beq t2, t1, tail only 25 times. When we cover how the datapath is implemented, we'll learn that branch instructions often take longer to execute than other instructions, because they cause something called a control hazard. Loop unrolling can significantly cut down on the number of branch instructions in our compiled code, leading to a performance improvement. However, don't go crazy and unroll all the way! We need to strike a balance when unrolling our loops; if we increase the number of instructions we have too much, we will start getting misses in our instruction cache.

Carefully unroll the SIMD vector sum code that you created in the previous exercise. This should get you a little more increase in performance from sum_simd (a few fractions of a second). As an example of loop unrolling, consider the supplied function sum_unrolled():

```long long int sum_unrolled(unsigned int vals[NUM_ELEMS]) {
long long int sum = 0;

for(unsigned int w = 0; w < OUTER_ITERATIONS; w++) {
for(unsigned int i = 0; i < NUM_ELEMS / 4 * 4; i += 4) {
if(vals[i] >= 128) sum += vals[i];
if(vals[i + 1] >= 128) sum += vals[i + 1];
if(vals[i + 2] >= 128) sum += vals[i + 2];
if(vals[i + 3] >= 128) sum += vals[i + 3];
}

// This is what we call the TAIL CASE
for(unsigned int i = NUM_ELEMS / 4 * 4; i < NUM_ELEMS; i++) {
if (vals[i] >= 128) {
sum += vals[i];
}
}
}
return sum;
}
```

Task: Within common.h, copy your sum_simd() code into sum_simd_unrolled() and unroll it 4 (four) times.

To compile your code, run the following command:

```make simd
```

To run your code, run the following command:

```./simd
```

#### Checkoff [2/5]

• Show the TA or AI checking you off your working code and performance improvement from sum_simd to sum_simd_unrolled.
• Give a short explanation of why unrolling a loop leads to a performance improvement

## Part 2: OpenMP

NOTE: For this part of the lab, we'll be working in the openmp directory:

```\$  cd openmp
```

### Hello World Example

For this lab, we will use C to leverage our prior programming experience with it. OpenMP is a framework with a C interface, and it is not a built-in part of the language. Most OpenMP features are actually directives to the compiler. Consider the following implementation of Hello World (`hello.c`):

```      int main() {
#pragma omp parallel
{
}
}```

This program will fork off the default number of threads and each thread will print out "hello world" in addition to which thread number it is. You can change the number of OpenMP threads by setting the environment variable OMP_NUM_THREADS or by using the `omp_set_num_threads` function in your program. The `#pragma` tells the compiler that the rest of the line is a directive, and in this case it is `omp parallel`. `omp` declares that it is for OpenMP and `parallel` says the following code block (what is contained in { }) can be executed in parallel. Give it a try:

```    \$ make hello
\$ ./hello```

Task: Run the command ./hello a few times. Notice how the numbers are not necessarily in numerical order and not in the same order if you run `hello` multiple times!! This is because within a `omp parallel` region, the programmer guarantees that the operations can be done in parallel, and there is no ordering between the threads. It is also worth noting that the variable `thread_ID` is local to each thread. In general with OpenMP, variables declared outside an `omp parallel` block have only one copy and are shared amongst all threads, while variables declared within a `omp parallel` block have a private copy for each thread.

## Exercises

Vector addition is a naturally parallel computation, so it makes for a good first exercise. The `v_add` function inside `v_add.c` will return the array that is the cell-by-cell addition of its inputs `x` and `y`. A first attempt at this might look like:

```    void v_add(double* x, double* y, double* z) {
#pragma omp parallel
{
for(int i=0; i<ARRAY_SIZE; i++)
z[i] = x[i] + y[i];
}
}```

You can run this (`make v_add` followed by `./v_add`) and the testing framework will automatically time it and vary the number of threads. You will see that this actually seems to do worse as we increase the number of threads. The issue is that each thread is executing all of the code within the `omp parallel` block, meaning if we have 8 threads, we will actually be adding the vectors 8 times. DON'T DO THIS! To get speedup when increasing the number of threads, we need each thread to do less work, not the same amount as before. Your task is to modify `v_add.c` so there is some speedup (speedup may plateau as the number of threads continues to increase). The best way to do this is to decrease the amount of work each thread does. To aid you in this process, two useful OpenMP functions are:

```      int omp_get_num_threads();

The function `omp_get_num_threads()` will return how many threads there are in a `omp parallel` block, and `omp_get_thread_num()` will return the thread ID.

Divide up the work for each thread through two different methods (write different code for each of these methods):

• Second task, chunking: if there are N threads, break the vectors into N contiguous chunks, and have each thread only add that chunk (like the figure above).

Hint 1: Use the two functions we listed above somehow in the for loop to choose which elements each thread handles.

Hint 2: You may need a special case to prevent going out of bounds for v_add_optimized_chunks. Don't be afraid to write one.

Hint 3:As you're working on this exercise, you should be thinking about false sharing--read more here and here. Fact: For this exercise, we are asking you to manually split the work amongst threads. Since this is such a common pattern for software, the designers of OpenMP actually made the `for` directive to automatically split up independent work. Here is the function rewritten using it. You may NOT use this directive in your solution to this exercise.

```void v_add(double* x, double* y, double* z) {
#pragma omp parallel
{
#pragma omp for
for(int i=0; i<ARRAY_SIZE; i++)
z[i] = x[i] + y[i];
}
}```

If both of your optimized versions of v_add are surprisingly slow, it's probably due to the hive machine you're using being overloaded. Use the 'who' command to see how many users you are sharing your hive machine with. To give your code a try, use the following commands:

```    \$ make v_add
```

#### Checkoff [3/5]

• Show the TA or AI checking you off your code for both optimized versions of `v_add` that manually splits up the work. Remember, you should not have used #pragma omp for here.
• Run your code to show that it gets parallel speedup.
• Which version of your code runs faster, chunks or adjacent? What do you think the reason for this is? Explain to the person checking you off.

### Exercise 4: Dot Product

The next interesting computation we want to compute is the dot product of two vectors. At first glance, implementing this might seem not too different from `v_add`, but the challenge is how to sum up all of the products into the same variable (reduction). A sloppy handling of reduction may lead to data races: all the threads are trying to read and write to the same address simultaneously. One solution is to use a critical section. The code in a critical section can only be executed by a single thread at any given time. Thus, having a critical section naturally prevents multiple threads from reading and writing to the same data, a problem that would otherwise lead to data races. A naive implementation would protect the sum with a critical section, like (`dotp.c`):

```double dotp(double* x, double* y) {
double global_sum = 0.0;
#pragma omp parallel
{
#pragma omp for
for(int i=0; i<ARRAY_SIZE; i++)
#pragma omp critical
global_sum += x[i] * y[i];
}
return global_sum;
}```

Try out the code (`make dotp` and `./dotp`). Notice how the performance gets much worse as the number of threads goes up? By putting all of the work of reduction in a critical section, we have flattened the parallelism and made it so only one thread can do useful work at a time (not exactly the idea behind thread-level parallelism). This contention is problematic; each thread is constantly fighting for the critical section and only one is making any progress at any given time. As the number of threads goes up, so does the contention, and the performance pays the price. Can you fix this performance bottleneck? Hint: REDUCE the number of times that each thread needs to use a critical section!

First task: try fixing this code without using the OpenMP Reduction keyword. Hint: Try reducing the number of times a thread must add to the shared `global_sum` variable.

Second task: fix the problem using OpenMP's built-in Reduction keyword (Google for more information on it). Is your performance any better than in the case of the manual fix? Why or why not?

Note: The exact syntax for using these directives can be kind of confusing. Here are the key things to remember:

• A #pragma omp parallel section should be specified with curly braces around the block of code to be parallelized.
• A #pragma omp for section should not be accompanied with extra curly braces. Just stick that directive directly above a for loop.

Hint: You'll need to type the '+' operator somewhere when using reduction.

Sanity check: If you've used the reduction keyword correctly, you shouldn't have a critical section anymore.

To give your code a try, use the following commands:

```    \$ make dotp
\$ ./dotp
```

#### Checkoff [4/5]

• Show the TA or AI checking you off your manual fix to `dotp.c` that gets speedup over the single threaded case.
• Show the TA or AI checking you off your Reduction keyword fix for `dotp.c`, and explain the difference in performance, if there is any.
• Run your code to show the speedup.

## Part 3: Survey #### Checkoff [5/5]

• Show the TA or AI checking you off your completed survey screen. If you haven't completed it yet, here is the link