**Question**: Why do I always like to use the word **nontrivial**?

**Answer**: I once went to Professor Josh Hug's office hours, and I swear he used the word every 4 sentences. I thought it
was a really nice word for describing stuff that you think is *important* without using the dry word *important*. When I
say something is nontrivial, I'm saying that it's something **worth** your attention; it requires a significant NONTRIVIAL amount of energy
to process and understand. I challenge every one of you to always do **nontrivial** work in your lives.

**IMPORTANT**: Part of today's exercises (3-5) was made possible by TA **Nikhil**!

## UPDATE: PLEASE COPY AGAIN

**If you were in Thursday 7/20's 11-1 labs** you may have copied a buggy version of the starter code for Exercise 3.
Please run the following commands again to get the updated starter code. In particular, the following changes were made:

- Added a comment telling you not to write code above the outer loop.
- Put parentheses around the `NUM_ELEMS` macro.

## The Lab Files

Copy the lab files from the instructional servers to your lab account with

`$ cp -r ~cs61c/labs/09/ ~/labs/09/`

Alternatively, secure-copy (`scp`) them from the instructional servers to your own laptop with

`$ scp -r cs61c-xxx@hive10.cs.berkeley.edu:~cs61c/labs/09/ ~/YOUR_FILEPATH`

And if you want to secure-copy them back to your class account:

`$ scp -r ~/YOUR_FILEPATH/09 cs61c-xxx@hive10.cs.berkeley.edu:~/YOUR_LABACCT_FILEPATH`

**Note that all code using SSE instructions is only guaranteed to work on the hive machines.**

Many newer processors support SSE intrinsics, so it is certainly possible that your own machine will be sufficient, but you may not see the expected speedups.

**Ideally**, you should ssh into one of the hive machines to run this lab.

## Exercises

## Exercise 1: Familiarize Yourself with the SIMD Functions

Given the large number of available SIMD intrinsics, we want you to learn how to find the ones that you'll need in your application.

Intel hosts a variety of tools related to intrinsics, which you can find here (**but these are not necessary for this lab**).

The one that we're particularly interested in is the Intel Intrinsics Guide. Open this page and once there, click the checkboxes for everything that begins with "SSE" (SSE all the way to SSE4.2).

Do your best to interpret the new syntax and terminology. (There is certainly an overwhelming number of m's...) Find the 128-bit intrinsics for the following SIMD operations (one for each):

- Four floating point divisions in single precision (i.e. `float`)
- Sixteen max operations over signed 8-bit integers (i.e. `char`)
- Arithmetic shift right of eight signed 16-bit integers (i.e. `short`)

**HINT**: If you're overwhelmed, just try clicking on a function whose name you understand! Maybe try one of the "add" functions.
Now read the "Description" and "Operation" sections in the drop-down menu opened by clicking the instruction. If you clicked an "add" function
like I suggested, you should've read something like, "Add packed X-bit integers in `a` and `b`, and store the results in `dst`."
Then you should realize that the value "X" also appears in the function name! The pattern you see here can be described as follows:

`__m128i _mm_add_epiX` adds together (128/X) packed X-bit integers.

Another hint: Things that say "epi" or "pi" deal with integers, and those that say "**ps**" or "**pd**" deal with **s**ingle **p**recision
and **d**ouble **p**recision floats.
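To make the naming scheme concrete, here's a tiny standalone sketch (not part of the lab files; the function name is made up for illustration). The `epi32` suffix means the 128-bit vector is treated as four packed 32-bit integers, so 128/32 = 4 additions happen per intrinsic call:

```c
#include <emmintrin.h>  // SSE2 intrinsics

// Illustrative only: add four packed 32-bit integers with one intrinsic.
void add_four_ints(const int a[4], const int b[4], int out[4]) {
    __m128i va = _mm_loadu_si128((const __m128i *) a);
    __m128i vb = _mm_loadu_si128((const __m128i *) b);
    // One call adds all four lanes at once: (a0+b0, a1+b1, a2+b2, a3+b3)
    _mm_storeu_si128((__m128i *) out, _mm_add_epi32(va, vb));
}
```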

#### Checkoff [1/5] (uh oh... 5 exercises...)

- Record the 3 (three) intrinsics we specified above in a text file to show your TA... or however you want to remember the answers.

## Exercise 2: Reading SIMD Code

In this exercise you will consider the vectorization of 2-by-2 matrix multiplication in double precision (the code we looked at in the SIMD lecture):

This matrix multiplication (with 2-by-2 matrices stored column-major as length-4 arrays) amounts to the following arithmetic operations:

```c
C[0] += A[0]*B[0] + A[2]*B[1];
C[1] += A[1]*B[0] + A[3]*B[1];
C[2] += A[0]*B[2] + A[2]*B[3];
C[3] += A[1]*B[2] + A[3]*B[3];
```

You are given the code `sseTest.c` that implements these operations in a SIMD manner.

The following intrinsics are used:

| Intrinsic | Effect |
| --- | --- |
| `__m128d _mm_loadu_pd( double *p )` | returns vector (p[0], p[1]) |
| `__m128d _mm_load1_pd( double *p )` | returns vector (p[0], p[0]) |
| `__m128d _mm_add_pd( __m128d a, __m128d b )` | returns vector (a_0+b_0, a_1+b_1) |
| `__m128d _mm_mul_pd( __m128d a, __m128d b )` | returns vector (a_0*b_0, a_1*b_1) |
| `void _mm_storeu_pd( double *p, __m128d a )` | stores p[0]=a_0, p[1]=a_1 |
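Putting these five intrinsics together, here is a hedged sketch of how the 2-by-2 column-major multiply *might* be vectorized (the actual `sseTest.c` may be organized differently; the function name here is illustrative). Each column of C is kept in one `__m128d`, and each column of A is scaled by a broadcast element of B:

```c
#include <emmintrin.h>

// Sketch: C += A*B for 2x2 column-major matrices (illustrative, not
// necessarily identical to sseTest.c).
void dgemm_2x2(double *A, double *B, double *C) {
    __m128d c1 = _mm_loadu_pd(C + 0);   // column 0 of C: (C[0], C[1])
    __m128d c2 = _mm_loadu_pd(C + 2);   // column 1 of C: (C[2], C[3])
    for (int i = 0; i < 2; i++) {
        __m128d a  = _mm_loadu_pd(A + i * 2);   // column i of A
        __m128d b1 = _mm_load1_pd(B + i);       // broadcast B[i]
        __m128d b2 = _mm_load1_pd(B + 2 + i);   // broadcast B[2+i]
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1)); // accumulate into column 0
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2)); // accumulate into column 1
    }
    _mm_storeu_pd(C + 0, c1);
    _mm_storeu_pd(C + 2, c2);
}
```

Tracing i = 0 and i = 1 for `c1` reproduces exactly `C[0] += A[0]*B[0] + A[2]*B[1]` and `C[1] += A[1]*B[0] + A[3]*B[1]` from above.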

Compile sseTest.c into **x86 assembly** (not MIPS!) by running:

```
make sseTest.s
```

**Now, observe the general structure of the code in sseTest.s. See if you can find the for-loop in sseTest.s (hint: it's a trick question, see Exercise 4) and see if you can identify which instructions are performing SIMD operations.** Be prepared to describe to your TA what is happening in general, but you do not need to spend too much time on this section (recall that we are not interested in x86 assembly in this class). We **don't** expect you to tell us exactly what the code is doing line by line. If you're stuck deciphering the x86 assembly, refer to the original `sseTest.c` file that contains the matching C code.

**HINT**: Usually loops in assembly code require some kind of jumping or branching... can you find any in this code?

#### Checkoff [2/5]

- Describe where the loop from `sseTest.c` went in `sseTest.s`, and point out some of the SIMD instructions.

## Exercise 3: Writing SIMD Code

### COMMON LITTLE MISTAKES

The following are bugs that the staff have noticed were preventing students from passing the tests:

- **Trying to store your sum vector into a `long long int` array.** Use an `int` array. **Side note: why??** The return value of this function is indeed a `long long int`, but that's because an `int` isn't big enough to hold the sum of all the values across all iterations of the __outer loop__. However, it **is** big enough to hold the sum of all the values across a __single__ iteration of the outer loop. This means you'll want to store your sum vector into an `int` array after every iteration of the outer loop and add its total to the final result `result`.
- **Re-initializing your sum vector.** Make sure when you add to your running sum vector, you aren't **declaring** a new sum vector!!
- **Forgetting the CONDITIONAL in the tail case!**
- **Adding to an UNINITIALIZED array!** If you add stuff to your result array without initializing it, you are adding stuff to garbage, which makes the array still garbage! Using `storeu` before adding stuff is okay though.

**NOTE**: We'll be changing directories now! Run the command:

```
$ cd lab09-branch_prediction
```

**What the heck is going on in here?** Allow me to tell you.

We've got one header file `common.h` that has some code to sum the elements of a *really big* array.
It's a minor detail that it (rather arbitrarily) does this `1 << 16` times... but you don't need to worry about that. We also pincer the execution of the code between two timestamps (that's what the `clock()` function does) to measure how fast it runs! The file `randomized.c` is the one which will have a `main` function to run the `sum` functions. The file `sorted.c` does the same, but with an extra detail which we'll only worry about in Exercise 5.

**GOT IT??** **GOOD.**

For Exercise 3, you will vectorize/SIMDize the following code in `common.h` to speed up the naive implementation shown here:

```c
long long int sum(unsigned int vals[NUM_ELEMS]) {
    long long int sum = 0;
    for(unsigned int w = 0; w < OUTER_ITERATIONS; w++) {
        for(unsigned int i = 0; i < NUM_ELEMS; i++) {
            if(vals[i] >= 128) {
                sum += vals[i];
            }
        }
    }
    return sum;
}
```

**Note**: you need **only** vectorize the inner loop.

You might find the following intrinsics useful... as in: **You're going to need all of these functions**:

| Intrinsic | Effect | Hint |
| --- | --- | --- |
| `__m128i _mm_setzero_si128( )` | returns 128-bit zero vector | Maybe to initialize something |
| `__m128i _mm_loadu_si128( __m128i *p )` | returns 128-bit vector stored at pointer p | You've got an array vals... how do you get some elements in vector (`__m128i`) form? |
| `__m128i _mm_add_epi32( __m128i a, __m128i b )` | returns vector (a_0+b_0, a_1+b_1, a_2+b_2, a_3+b_3) | This is a summing function after all! You'll definitely need to add some stuff together! |
| `void _mm_storeu_si128( __m128i *p, __m128i a )` | stores 128-bit vector a at pointer p | Load goes from pointer to vector, this goes from vector to pointer! |
| `__m128i _mm_cmpgt_epi32( __m128i a, __m128i b )` | returns vector (a_i > b_i ? 0xffffffff : 0x0, for i from 0 to 3), AKA a 32-bit all-1s mask where a_i > b_i and a 32-bit all-0s mask otherwise | cmpgt is how SSE says "compare greater than." How do you use this to implement the if statement? |
| `__m128i _mm_and_si128( __m128i a, __m128i b )` | returns vector (a_0&b_0, a_1&b_1, a_2&b_2, a_3&b_3) | Mask stuff |

**I'm REALLY SORRY** that this table is kinda crappy... If you have any questions on how these functions work, feel free to
ask a staff member. You could **also** find them on that SSE Intrinsics Guide you used in Exercise 1!
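To see how the last two rows of the table combine, here's a sketch of just the masking trick in isolation (illustrative names, and deliberately *not* the full exercise solution). Note that `v >= 128` for these values is the same as `v > 127`, and that `cmpgt` compares as signed ints, so this assumes the values fit in a signed 32-bit range as in the lab:

```c
#include <emmintrin.h>

// Sketch (not the lab solution): keep each element > 127 and zero the rest,
// with no branch instructions at all.
void keep_if_ge_128(const unsigned int in[4], unsigned int out[4]) {
    __m128i _127 = _mm_set1_epi32(127);           // (127, 127, 127, 127)
    __m128i v    = _mm_loadu_si128((const __m128i *) in);
    __m128i mask = _mm_cmpgt_epi32(v, _127);      // all-1s lane where v_i > 127
    _mm_storeu_si128((__m128i *) out, _mm_and_si128(v, mask));
}
```

ANDing with the mask leaves a lane unchanged where the condition held and zeroes it where it didn't, so adding the result to a running sum vector has the same effect as the `if` statement.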

**Task**: Start with `sum`, and use SSE intrinsics to implement the `sum_simd()`
function, which needs to be vectorized and achieve a significant amount of speedup.

*How do I even do this?* What do we mean when we say to SIMDize a piece of code? Well,
recall that the SSE intrinsics are basically functions which perform operations on
*multiple* pieces of data in a vector **in parallel**. This turns out to be faster than running through a
`for` loop and applying the operation once for each element in the vector.

In our `sum` function, we've got a basic structure of iterating through an array. On every iteration, we add an array element to a running `sum`.

**Here's my question for you**: *Why don't we add a **few** array elements to a sum **vector** in **parallel**, and then consolidate the individual values of the sum vector into our desired sum at the end?* Doing so is your **task**, and I hope you now have some ideas about how to use the SSE functions to help you.

**Hint 1**: See all those `__m128i` things? Just call them **vectors** to make your life easier.
They're basically 128-bit values which can be interpreted as any 128-bit chunk of memory. We'll be using them to encode
4 (four) 32-bit ints.

**Hint 2**: We've left you a vector called `_127` which contains four copies of the number 127. You should
use this to compare with some stuff when you implement the condition within the sum loop.

**Hint 3**: DON'T use the store function (`_mm_storeu_si128`) until the very end! It turns out that storing
is very costly and performing a store in every iteration will actually cause your code to slow down.

**Hint 4**: Why you want to `storeu` in the first place: It's bad practice to index into the
`__m128i` vector things like they are arrays, so you want to store them into arrays first with the `storeu`
function, and then access the integers elementwise by indexing into the array.

**Hint 5**: READ the function declarations in the above table carefully! You'll notice that `loadu`
and `storeu` take `__m128i*` type arguments. *But shouldn't we be loading from arrays and storing
into them?* Answer: **yes**. You can just cast an int array to a `__m128i` pointer, like we did
in the "floating-point bit-ops" lab, so we can interpret 128 bits of the array as a `__m128i*` type input. Alternatively,
you could skip the typecast, but the compiler will throw a bunch of warnings at you... which is really annoying.
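Hints 3-5 together describe one load/store pattern, sketched below with illustrative names (this is the shape of the pattern, not the exercise solution):

```c
#include <emmintrin.h>

// Sketch of the casting pattern from Hints 3-5: load through a cast pointer,
// do the vector work, then storeu ONCE at the end and index a plain array.
long long consolidate(unsigned int vals[4]) {
    // Cast the int array to __m128i* so loadu grabs 128 bits (4 ints) at once.
    __m128i v = _mm_loadu_si128((__m128i *) vals);

    // ... all the vector work would happen here, saving storeu for the end ...

    unsigned int tmp[4];
    _mm_storeu_si128((__m128i *) tmp, v);   // vector -> plain array
    return (long long) tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```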

To compile and run your code, run the following commands:

```
$ cd lab09-branch_prediction
$ make
$ ./randomized
```

**Sanity check**: The naive version runs in about 25 seconds on the hive machines, and your SIMDized version should
run in about 5-6 seconds.

#### Checkoff [3/5]

- Show your TA your working code and performance improvement from `sum` to `sum_simd`.

## You already know what time it is. Exercise 3.5: TA Alex's Sandbox

**I'm going to be honest with you**: I think we peaked with the Liz Climo comics from last time.
Literally nothing is going to beat them, and I probably should have saved them for last... or at least
like the 2nd to last or 3rd to last lab.

**Question**: Where do you get all this cute stuff?

**Answer**: It's all the stuff I've upvoted on Reddit. I'm
very selective with what I upvote and what I like.

Here's today's image of the day.

I really like the way Liz's comics show up on the lab page, embedded, so I think I'll leave another one here.

## Exercise 4: Loop Unrolling

**Fact**: You can obtain even more performance improvement!
Carefully **unroll** the SIMD vector sum code that you created in the previous exercise.
This should get you a bit more performance than `sum_simd` (a few fractions of a second).
As an example of loop unrolling, consider the supplied function `sum_unrolled()`:

```c
long long int sum_unrolled(unsigned int vals[NUM_ELEMS]) {
    long long int sum = 0;
    for(unsigned int w = 0; w < OUTER_ITERATIONS; w++) {
        for(unsigned int i = 0; i < NUM_ELEMS / 4 * 4; i += 4) {
            if(vals[i] >= 128) sum += vals[i];
            if(vals[i + 1] >= 128) sum += vals[i + 1];
            if(vals[i + 2] >= 128) sum += vals[i + 2];
            if(vals[i + 3] >= 128) sum += vals[i + 3];
        }

        // This is what we call the TAIL CASE
        for(unsigned int i = NUM_ELEMS / 4 * 4; i < NUM_ELEMS; i++) {
            if (vals[i] >= 128) {
                sum += vals[i];
            }
        }
    }
    return sum;
}
```

**Task**: Consider why unrolling the loop like this is useful at all.

**Hint**: What part of the for loop do we have to do **LESS** of now that we do four operations at a time?

You can do a Google search for "loop unrolling" for some inspiration...

**Big task**: Within `common.h`, copy your `sum_simd()` code into `sum_simd_unrolled()` and unroll it 4 (four) times.

To compile your code, run the following command:

```
make
```

To run your code, run the following command:

```
./randomized
```

#### Checkoff [4/5]

- Show your TA your working code and performance improvement from `sum_simd` to `sum_simd_unrolled`.
- Why does loop unrolling get you a speedup?

## Exercise 5: Performance Comparisons

You might have noticed that we've only been running the `randomized` executable so far.
That's because now, we want to compare it to the code in `sorted.c`! Recall that executing the code
in `randomized.c` just sums up the elements of a huge array of random `int`s.
The *only* difference with `sorted.c` is that the array is sorted before running the `sum` functions! Let's check out the performance by running:

```
$ ./sorted
```

**Task NUMBER 1**: **WHOA**! You should've noticed that the *naive* `sum` function ran a **lot**
faster than it did when the array was randomized. Your **task** is to think about **WHY**.

Recall that there is a check within our `sum` function which tells us to only add to the running total
if the current element is greater than or equal to 128. **Question**: what kind of assembly instruction
would need to be used to implement this? From your answer to the preceding question, recall that these kinds
of instructions introduce **hazards** in the datapath pipeline, and that one technique to deal with them was
to have a __________ predictor that chooses to always execute the first instruction of one of the outcomes of the
_________.

It turns out that much like caches, these predictors fare better when our code is more predictable. They
make their predictions based on what's been happening *most recently* at the ________ in question. In particular,
if our code seems to make the same decision at the __________ many times in a row, it will probably predict that
that particular outcome will happen *again*... Why might this explain how
sorting the array speeds up the execution of the program?

**Task NUMBER 2**: OK, cool! It's neat that sorting the array gets us such a high degree of speedup,
buuuuuut **why didn't `sum_simd` or `sum_simd_unrolled` speed up?** To think about this,
you need to think about *how exactly* the original condition is checked in the SIMDized versions that you wrote. What is the important SSE function that you used?

It turns out that this function is implemented without the use of branch instructions at the assembly level! What does this fact tell you about the advantages of sorting the array in the SIMDized code?

**Task NUMBER 3**: YOU ARE ALMOST DONE! The last observation we should make is that out of the 4 versions
of the `sum` function running on a sorted array, the one that performs best is actually `sum_unrolled`,
as opposed to `sum_simd_unrolled`! If you run `randomized`, however, the `sum_simd` version
is actually better than `sum_unrolled`. **Question**: What gives? More formally, what does this tell you about
the advantages of successful branch prediction (oopsie) *in comparison* to data-level parallelism in this
special case where the data is sorted?

#### Checkoff [5/5]

- Why does sorting the array make `sum` faster?
- Why does sorting the array NOT make `sum_simd` or `sum_simd_unrolled` much faster?
- [If this doesn't apply to you, ignore it.] Why does `sum_unrolled` outperform `sum_simd` when the array is sorted? Why DOESN'T it do so when the array is random?