## The Lab Files

Copy the lab files from the instructional servers to your lab account with

`$ cp -r ~cs61c/labs/11/ ~/labs/11/`

Alternatively, secure-copy (`scp`) them from the instructional servers to your own laptop with

`$ scp -r cs61c-xxx@hive10.cs.berkeley.edu:~cs61c/labs/11/ ~/YOUR_FILEPATH`

And if you want to secure-copy them back to your class account:

`$ scp -r ~/YOUR_FILEPATH/11 cs61c-xxx@hive10.cs.berkeley.edu:~/YOUR_LABACCT_FILEPATH`

### Full Disclosure

This is one of those new labs that we're testing out... kind of like labs 5 and 9 (exercises 3-5). It may work out, or it may go horribly. We'll see. Bear with us, please.

## The Exercise

Your single task for this lab is to speed up the code given in calcDepthNaive.c using the techniques you practiced in the previous two labs (9 and 10). In particular, you'll be implementing:

- Loop Unrolling
- SSE Intrinsics
- OpenMP Multithreading

### What does this code do, anyway?

**Answer**: Given two images `left` and `right`, encoded as 2D matrices (encoded as C arrays),
populate a `depth` 2D matrix (AKA C array) with the *depth* of each pixel in the image.

The `left` and `right` images are supposed to be what a person might see out of his or her left and right eyes, respectively.
They will be slightly displaced from each other, and our task is to use the information of **how much they're displaced** to calculate
the relative **depth** of the pixel from the point of view of the person. A smaller displacement indicates a greater depth (if you look at
an object far away with your eyes individually, they will appear less displaced than if the object were close to you! (Try it out: hold your fist out
right in front of your nose, and alternate closing one eye and opening the other. Your field of vision should shift by a noticeable amount. Now look
at something far away and do the same. You'll notice that your field of vision doesn't shift much).

**How (in English):**

The function will scan through the `left` image. For every pixel, it will consider a **search space** box around that same pixel (the same coordinates) in the `right` image. Within that search space, it will look at every **feature**, which is a square box smaller than the
search space. The function will figure out which feature within the search space has the **minimum Euclidean distance of the pixel intensities** from the **feature** centered around the original pixel in the `left` image. In the `depth` array, at the index of the original pixel, it will save the **displacement** AKA Euclidean distance from the original pixel's coordinates in the `left` image of the pixel for which there was a minimum *pixel intensity Euclidean distance*.

**How (in pseudocode):**

For each pixel in the `left` image:

For every pixel in the `(2 * maximumDisplacement + 1)`-by-`(2 * maximumDisplacement + 1)` **search space** box centered around the coordinates of the `left` image pixel in the ** right** image (which doesn't go out of bounds of the image):

For every

`(2 * featureHeight + 1)`-by-

`(2 * featureWidth + 1)`

__box__in the search space (which doesn't go outside of the search space):

Calculate the Euclidean norm of the difference between that

__box__in the

`right`image and the identically-sized box in the

`left`image.

Save the

**displacement**AKA distance from the original pixel (in the

`left`image) where the Euclidean norm was minimized.

If there were multiple minimums, save the

**smallest displacement**from the original pixel out of the ones with the minimum Euclidean norm.

Populate the entry in the

`depth`array with the same index as the original pixel in the

`left`image with that displacement.

There are 6 for loops in the code. You should consider them in pairs. Thus, there are three (3) pairs of for loops to understand.

- The first pair (
`y`from 0 to`imageHeight`,`x`from 0 to`imageWidth`) iterate over each pixel in the`left`image. - The second pair (
`dy`from`-maximumDisplacement`to`maximumDisplacement`,`dx`from`-maximumDisplacement`to`maximumDisplacement`) iterate over possible**displacements**from the original (x, y) coordinate. In other words, they iterate over possible centers of features in the**search space**. - In this pair of loops, we save the (dx, dy) which had the minimum Euclidean distance (
`minimumSquaredDifference`) in the variables`minimumDx`and`minimumDy`. Then,`displacementNaive(dx, dy)`gets saved to`depth[y * imageWidth + x]`. - The third pair (
`boxY`from`-featureHeight`to`featureHeight`,`boxX`from`-featureWidth`to`featureWidth`are meant to be further displacements which, when combined with`(x + dx, y + dy)`, iterate over the features in the search space.- Here, we calculate a sample
`squaredDifference`, which represents the Euclidean distance of a feature in the`right`image with the base feature centered at (x, y) in the`left`image. - Notice that we apply the
`dx`and`dy`offsets to`right`but not to`left`. That's because we scan through multiple features in`right`(within the search space), but we always**compare**to the same base feature in`left`.

- Here, we calculate a sample

### The Euclidean norm

**Question**: What do I mean when I say the Euclidean norm of the difference between two features?

**Answer**: I mean the sum of the squares of the elementwise differences (subtraction) of the pixel intensities in the `left` and `right` images. The distance from the origin (0, 0) is implemented for you in the function `displacementNaive`, which takes in integers `dx` and `dy` and returns `sqrt(dx*dx + dy*dy)`.

For example, let's say I had a 5x5 feature centered at coordinates (x, y) in `left`, and I'm considering a 5x5 feature centered at coordinates (x+1, y+1) in `right`. Then, the Euclidean norm of the difference of those two features would be the square root of the sum of following expression...

`(left[(y + boxY) * imageWidth + x + boxX] - right[(y + boxY + 1) * imageWidth + x + boxX + 1])^2`

...evaluated for boxY from -2 to 2, for boxX from -2 to 2.

For another example, given two sets of two 2×2 images below:

← Squared Euclidean distance is (1-1)^{2}+(5-5)^{2}+(4-4)^{2}+(6-6)^{2} = 0 →

← Squared Euclidean distance is (1-3)^{2}+(5-5)^{2}+(4-4)^{2}+(6-6)^{2} = 4 →

Take a look at the following animation:

The following are some **facts**:

- The bright square in the middle is just part of both images. It isn't anything special.
- The left image is the
`left`image. The right image is the`right`image. - The green pixel in the left image is the current pixel we're processing.
- The red box in the left image is the
**feature**surrounding that pixel. Here,`featureWidth`and`featureHeight`are both**2**, resulting in a 5x5 feature. - The green box in the right image is the
**search space**. It is centered around a the stable green pixel in the right image, which has the same coordinates as the green pixel in the left image. - The
**moving**red box in the right image indicates the**feature**in the right image that we're considering at a given time. The red box moves through the green box (search space), indicating that we consider every**feature**within the bounds of the**search space**. - The
**flashing green pixel**that shows up at the end of the animation is the pixel for which the Euclidean norm of the difference between the left feature (red box in left image) and the right feature (moving red box in right image, centered at flashing green pixel) was**minimized**.**You can intuitively understand this because it is the pixel at the bottom right-hand corner of the bright square, which is exactly what the original pixel (green in the left image) was.** - The
**displacement**that would be saved in the`depth`array at the index corresponding to the original pixel (green in the left image) would be`sqrt(1^2 + 2^2)`, because that's how far the pixel with the minimized Euclidean norm is from the center of the**search space**, which has the coordinates of the original pixel in the left image.

### Your Job

**Now** that we have a good understanding of what `calcDepthNaive.c` does, we can talk about how to speed it up.

### Comparing against the naive version

Run the following commands within your lab 11 folder to compare your code in `calcDepthOptimized.c` to the code in `calcDepthNaive.c`.

$ make $ ./benchmark

Do this periodically (after each task) to make sure that you've actually improved your execution speed (the "speedup ratio" should go up!)

Additionally, you must check that your code is still consistent with the functionality of the naive version:

$ make $ ./check

#### Task 1: Distribute the work amongst threads

What tool are we talking about? **Hint**: it starts with a `#pragma omp` and ends with a `parallel for`. Where should you declare the code to run in parallel amongst the threads? Recall that it's actually inefficient to have **nested** forking and joining of threads. When making use of multithreading, you should make a **single** division of labor for the whole task. This divides up the work evenly without having to fork and join multiple times.

**Sanity check**: At this point, you should've wrapped some of the code inside a parallel block.

**Taking it further**: There's one more thing we can do to get a nontrivial amount of speedup. We've got a few variables that are declared outside the `parallel` block, namely `imageHeight`, `imageWidth`, etc., which are **shared** amongst the threads that you spawned by creating the parallel block. It turns out that instead of keeping them as shared variables, **making a private copy of each** for each thread nets us some speedup!

**Task**: Use the OpenMP keyword for declaring a private copy of some variables to declare private copies of `imageHeight/Width`, `featureHeight/Width`, and `maximumDisplacement` for each thread. **Make sure you use the one** which keeps their initial values!

**Sanity check**: You should've edited the line `#pragma omp parallel for` to make it longer. The `private` keyword initializes each private copy to garbage. You want to use a different keyword which keeps their initial values from outside the parallel section.

**Hint hint**: click click

So why does firstprivate help things? Let's think about what it does: it creates a local copy -- initialized -- of each variable for each thread. This has benefits (since our variables are small and this a one time cost only):

- We have multiple copies of variables across each thread. This means, more importantly, that all of these variables will have their own locations, meaning that we don't have to share them across threads. If we kept the variables shared and we have an invalidate on write model and we write to anything on the same block as that variable, we will be forced to evict the entire block with the variable to follow the MOESI protocol. So we might get a lot of false sharing problems, essentially bottle-necking our performance on the reads to the shared variables.
- Even if we don't write to the same block as the variables, by having the variables in multiple places, for a small memory fee (think maybe 32 bytes (at most) * 8 threads < 300 bytes), we will have thread local copies of all of our shared variables. This means that we will have a much smaller chance of all the shared variables (now private) being evicted from all of our cores' caches. Meanwhile, if we had kept the variables shared and we had evicted the block where the variables were located even once, it would mean that we would have to remove it from every cache and go back to main memory again when we have to retrieve it. This is an expensive operation that we can minimize with the use of firstprivate.
- This is not the biggest problem in the world for this code, which is why firstprivate speeds things up a little bit, but it is something you should always consider when performance programming.

#### Task 2: Compute multiple things with one instruction

Now it's time to SIMDize some things again. Recall that in the two innermost loops, we iterate through a **feature** of the `right` image and evaluate its squared Euclidean distance with the base **feature** of the `left` image. At its core, this operation is just taking the sum of a bunch of subtractions, followed by a bunch of squarings (multiplications). Sounds like a job for the good ol' SSE intrinsics! (The ones with a lot of underscores and m's).

Here's a table with some SSE Intrinsics you'll need.

__m128 _mm_setzero_ps( ) |
returns 128-bit float zero vector |
Maybe to initialize something |

__m128 _mm_loadu_ps( float *p ) |
returns 128-bit vector stored at pointer p (4 floats) | You've got arrays left and right... This is how you get some elements in vector (__m128i) form. |

__m128 _mm_sub_ps( __m128 a, __m128 b ) |
returns FP vector (a_{0}-b_{0}, a_{1}-b_{1}, a_{2}-b_{2}, a_{3}-b_{3}) |
You're accumulating some squared differences! |

__m128 _mm_mul_ps( __m128 a, __m128 b ) |
returns FP vector (a_{0}*b_{0}, a_{1}*b_{1}, a_{2}*b_{2}, a_{3}*b_{3}) |
You're accumulating some squared differences! |

__m128 _mm_add_ps( __m128 a, __m128 b ) |
returns FP vector (a_{0}+b_{0}, a_{1}+b_{1}, a_{2}+b_{2}, a_{3}+b_{3}) |
You're accumulating some squared differences! |

void _mm_store_ps( float *p, __m128 a ) |
stores 128-bit vector a at pointer p | Load goes from pointer to vector, this goes from vector to pointer! |

**Right now** in the naive version, we are computing a single squared difference at a time (1 pixel at a time), and accumulating them as a sum in the variable `squaredDifference`. Let's instead accumulate them in a **vector** (a `__m128i` thingy) and consolidate the vector once at the end into `squaredDifference`. You could call the vector something like "squaredDifference_vector" or you could call it something like "doggy". Either way, using the `setzero` function. You have two options: declaring a single copy outside both `boxY` and `boxX` loops.
Or you could declare a new copy in every iteration of the inner `boxX` loop.

Now, we want to start adding to it, four (4) squared differences at a time. **Right now** in the naive version, we're loading one pixel value at a time from each image. Let's load **four (4)** at a time from each, just like we did in lab 9. Then, use the SIMD arithmetic functions detailed above to compute a vector of squared differences, which you can add to the running total!

** The tail case**: Choosing the loop bounds for this particular assignment is a little tricky. Recall from lab 9 that when we process 4 elements of an array at a time using SSE Intrinsics, we want to stop at the biggest multiple of 4 less than or equal to the array size.

**However**, since

`boxX`isn't a 0-index into the array (rather, it's a displacement value), going up to

`featureWidth / 4 * 4`isn't actually going to work. You have a few options here:

**FIRST OF ALL**keep in mind that originally,`boxX`goes from`-featureWidth`**up to and including**`featureWidth`! (note the`<=`operator). So, there are exactly`2 * featureWidth + 1`iterations.- Modify the start and end indicies of
`boxX`so that you treat it like an index into an array. This will let you go up to`something / 4 * 4`. However, you'll have to modify exactly how you use it to index into`left`and`right`. - Realize that your goal is to stop when there are less than 4 elements left in the row (i.e. when
`_____ + 4 > ___________`). Using this stopping condition will make your starting index for the tail case tricky to calculate, however. Try some examples:- If
`featureWidth = 3`(i.e. boxX goes from -3 to 3), what value should`boxX`start at in the tail case? - If
`featureWidth = 4`(i.e. boxX goes from -4 to 4), what value should`boxX`start at in the tail case?

- If
- Compute an explicit formula involving
`featureWidth`for your stopping index. It should take the form of`featureWidth - _____`, where the blank is some constant that might depend on the parity of`featureWidth`. This is a little harder than option 2, but it would give you exactly the starting index for the tail case that you were looking for!

**HINT/WARNING**: If you find that calculating the loop bounds is taking a really long time, raise your hand and ask a TA! This can be very frustrating. You'll have to repeat this again when you do the next task (loop unrolling).

**Sanity check**: Once you're done SIMDizing this loop, you should've used all the SSE Intrinsics listed above. Additionally, you should've only changed the loop bounds on the variable `boxX`, not `boxY`.

**Sanity check2**: Your tail case can either go inside the `boxY` loop, in which case it would be executed once for each row of the image, or it could go outside the `boxY` loop, in which case it would have nested `for` loops and be executed once at the very end.

## Random Assortment of *Stuff* of the day!

Hello! This is approximately the halfway mark of the lab. Here are some happy things that make me smile:

Here's today's image of the day!

Here's a Liz Climo comic:

#### Task 3: 2x Loop Unrolling

**Task**: Unroll your SIMDized version of the `boxX` code **twice**.

**Question**: Why not 4 times?

**Answer**: Because the `featureWidth`s that we test with will be in the ranges of 4 to 8. If we have a small `featureWidth` (i.e. 7 or less), we can't stride by 16 elements, and everything will be deferred to the tail case... which makes our SIMD code useless!

You should **notice** that by changing your stride to 8 elements, you need to recalculate your loop bounds!

**Hint**: If you computed an explicit formula for your end index, you should've noticed that it depended on the parity of `featureWidth`. That was because we were striding by 4 elements. Now that we're striding by 8, the constant you subtract will NOT depend on the value of `featureWidth % 2`. Rather, it will depend on the value of `featureWidth % 4`.

#### Task 4 (last one!): Removing some conditionals

On line 30 of `calcDepthNaive.c`, you'll notice that we have a special case for when we're on a pixel which we can't draw a **full feature box** around. In this case, we just set the `depth` of that pixel to 0!

**Task**: Instead of having to check this condition on every iteration of the outermost two loops (`x` and `y`), why don't we modify some more loop boundaries? **Think**: The edges of the image are the places where we can't fit a full feature box around a pixel. Instead of starting at the edges, why don't we start **a little bit inwards from the borders**? In particular, if we want to fit a feature box around every pixel we iterate through, and a feature box extends `featureWidth` left and `featureWidth` right of the pixel as well as `featureHeight` up and `featureHeight` down... that means we should start the outermost loops at:

`x = ________, y = ________`

and stop when`x + _________ > __________, y + _________ > __________`

**One more thing**: Because we've changed the loop boundaries, we no longer have zeros saved in the border indices of `depth`. How should we initialize `depth` to account for this?

#### Checkoff [1/1]

- Show off your sped-up code to the person checking you off.