CS61C $ummer 2017 Lab 11 - Make this piece of code go SUPER FAST

The Lab Files

Copy the lab files from the instructional servers to your lab account with

$ cp -r ~cs61c/labs/11/ ~/labs/11/

Alternatively, secure-copy (scp) them from the instructional servers to your own laptop with

$ scp -r cs61c-xxx@hive10.cs.berkeley.edu:~cs61c/labs/11/ ~/YOUR_FILEPATH

And if you want to secure-copy them back to your class account:

$ scp -r ~/YOUR_FILEPATH/11 cs61c-xxx@hive10.cs.berkeley.edu:~/YOUR_LABACCT_FILEPATH

Full Disclosure

This is one of those new labs that we're testing out... kind of like labs 5 and 9 (exercises 3-5). It may work out, or it may go horribly. We'll see. Bear with us, please.

The Exercise

Your single task for this lab is to speed up the code given in calcDepthNaive.c using the techniques you practiced in the previous two labs (9 and 10). In particular, you'll be implementing:

What does this code do, anyway?

Answer: Given two images left and right, encoded as 2D matrices (encoded as C arrays), populate a depth 2D matrix (AKA C array) with the depth of each pixel in the image.

The left and right images are supposed to be what a person might see out of his or her left and right eyes, respectively. They will be slightly displaced from each other, and our task is to use the information of how much they're displaced to calculate the relative depth of the pixel from the point of view of the person. A smaller displacement indicates a greater depth (if you look at an object far away with your eyes individually, they will appear less displaced than if the object were close to you! (Try it out: hold your fist out right in front of your nose, and alternate closing one eye and opening the other. Your field of vision should shift by a noticeable amount. Now look at something far away and do the same. You'll notice that your field of vision doesn't shift much).

How (in English):

The function will scan through the left image. For every pixel, it will consider a search space box around that same pixel (the same coordinates) in the right image. Within that search space, it will look at every feature, which is a square box smaller than the search space. The function will figure out which feature within the search space has the minimum Euclidean distance of the pixel intensities from the feature centered around the original pixel in the left image. In the depth array, at the index of the original pixel, it will save the displacement AKA Euclidean distance from the original pixel's coordinates in the left image of the pixel for which there was a minimum pixel intensity Euclidean distance.

How (in pseudocode):

For each pixel in the left image:
 For every pixel in the (2 * maximumDisplacement + 1)-by-(2 * maximumDisplacement + 1) search space box centered around the coordinates of the left image pixel in the right image (which doesn't go out of bounds of the image):
  For every (2 * featureHeight + 1)-by-(2 * featureWidth + 1) box in the search space (which doesn't go outside of the search space):
   Calculate the Euclidean norm of the difference between that box in the right image and the identically-sized box in the left image.
   Save the displacement AKA distance from the original pixel (in the left image) where the Euclidean norm was minimized.
   If there were multiple minimums, save the smallest displacement from the original pixel out of the ones with the minimum Euclidean norm.
   Populate the entry in the depth array with the same index as the original pixel in the left image with that displacement.

There are 6 for loops in the code. You should consider them in pairs. Thus, there are three (3) pairs of for loops to understand.

  1. The first pair (y from 0 to imageHeight, x from 0 to imageWidth) iterate over each pixel in the left image.
  2. The second pair (dy from -maximumDisplacement to maximumDisplacement, dx from -maximumDisplacement to maximumDisplacement) iterate over possible displacements from the original (x, y) coordinate. In other words, they iterate over possible centers of features in the search space.
  3. The third pair (boxY from -featureHeight to featureHeight, boxX from -featureWidth to featureWidth are meant to be further displacements which, when combined with (x + dx, y + dy), iterate over the features in the search space.
    • Here, we calculate a sample squaredDifference, which represents the Euclidean distance of a feature in the right image with the base feature centered at (x, y) in the left image.
    • Notice that we apply the dx and dy offsets to right but not to left. That's because we scan through multiple features in right (within the search space), but we always compare to the same base feature in left.

The Euclidean norm

Question: What do I mean when I say the Euclidean norm of the difference between two features?
Answer: I mean the sum of the squares of the elementwise differences (subtraction) of the pixel intensities in the left and right images. The distance from the origin (0, 0) is implemented for you in the function displacementNaive, which takes in integers dx and dy and returns sqrt(dx*dx + dy*dy).

For example, let's say I had a 5x5 feature centered at coordinates (x, y) in left, and I'm considering a 5x5 feature centered at coordinates (x+1, y+1) in right. Then, the Euclidean norm of the difference of those two features would be the square root of the sum of following expression...

(left[(y + boxY) * imageWidth + x + boxX] - right[(y + boxY + 1) * imageWidth + x + boxX + 1])^2

...evaluated for boxY from -2 to 2, for boxX from -2 to 2.

For another example, given two sets of two 2×2 images below:

← Squared Euclidean distance is (1-1)2+(5-5)2+(4-4)2+(6-6)2 = 0 →

← Squared Euclidean distance is (1-3)2+(5-5)2+(4-4)2+(6-6)2 = 4 →

Take a look at the following animation:

The following are some facts:

Your Job

Now that we have a good understanding of what calcDepthNaive.c does, we can talk about how to speed it up.

Comparing against the naive version

Run the following commands within your lab 11 folder to compare your code in calcDepthOptimized.c to the code in calcDepthNaive.c.

$ make
$ ./benchmark

Do this periodically (after each task) to make sure that you've actually improved your execution speed (the "speedup ratio" should go up!)

Additionally, you must check that your code is still consistent with the functionality of the naive version:

$ make
$ ./check

Task 1: Distribute the work amongst threads

What tool are we talking about? Hint: it starts with a #pragma omp and ends with a parallel for. Where should you declare the code to run in parallel amongst the threads? Recall that it's actually inefficient to have nested forking and joining of threads. When making use of multithreading, you should make a single division of labor for the whole task. This divides up the work evenly without having to fork and join multiple times.

Sanity check: At this point, you should've wrapped some of the code inside a parallel block.

Taking it further: There's one more thing we can do to get a nontrivial amount of speedup. We've got a few variables that are declared outside the parallel block, namely imageHeight, imageWidth, etc., which are shared amongst the threads that you spawned by creating the parallel block. It turns out that instead of keeping them as shared variables, making a private copy of each for each thread nets us some speedup!

Task: Use the OpenMP keyword for declaring a private copy of some variables to declare private copies of imageHeight/Width, featureHeight/Width, and maximumDisplacement for each thread. Make sure you use the one which keeps their initial values!

Sanity check: You should've edited the line #pragma omp parallel for to make it longer. The private keyword initializes each private copy to garbage. You want to use a different keyword which keeps their initial values from outside the parallel section.

Hint hint: click click

So why does firstprivate help things? Let's think about what it does: it creates a local copy -- initialized -- of each variable for each thread. This has benefits (since our variables are small and this a one time cost only):

  1. We have multiple copies of variables across each thread. This means, more importantly, that all of these variables will have their own locations, meaning that we don't have to share them across threads. If we kept the variables shared and we have an invalidate on write model and we write to anything on the same block as that variable, we will be forced to evict the entire block with the variable to follow the MOESI protocol. So we might get a lot of false sharing problems, essentially bottle-necking our performance on the reads to the shared variables.
  2. Even if we don't write to the same block as the variables, by having the variables in multiple places, for a small memory fee (think maybe 32 bytes (at most) * 8 threads < 300 bytes), we will have thread local copies of all of our shared variables. This means that we will have a much smaller chance of all the shared variables (now private) being evicted from all of our cores' caches. Meanwhile, if we had kept the variables shared and we had evicted the block where the variables were located even once, it would mean that we would have to remove it from every cache and go back to main memory again when we have to retrieve it. This is an expensive operation that we can minimize with the use of firstprivate.
  3. This is not the biggest problem in the world for this code, which is why firstprivate speeds things up a little bit, but it is something you should always consider when performance programming.

Task 2: Compute multiple things with one instruction

Now it's time to SIMDize some things again. Recall that in the two innermost loops, we iterate through a feature of the right image and evaluate its squared Euclidean distance with the base feature of the left image. At its core, this operation is just taking the sum of a bunch of subtractions, followed by a bunch of squarings (multiplications). Sounds like a job for the good ol' SSE intrinsics! (The ones with a lot of underscores and m's).

Here's a table with some SSE Intrinsics you'll need.

__m128 _mm_setzero_ps( ) returns 128-bit float zero vector Maybe to initialize something
__m128 _mm_loadu_ps( float *p ) returns 128-bit vector stored at pointer p (4 floats) You've got arrays left and right... This is how you get some elements in vector (__m128i) form.
__m128 _mm_sub_ps( __m128 a, __m128 b ) returns FP vector (a0-b0, a1-b1, a2-b2, a3-b3) You're accumulating some squared differences!
__m128 _mm_mul_ps( __m128 a, __m128 b ) returns FP vector (a0*b0, a1*b1, a2*b2, a3*b3) You're accumulating some squared differences!
__m128 _mm_add_ps( __m128 a, __m128 b ) returns FP vector (a0+b0, a1+b1, a2+b2, a3+b3) You're accumulating some squared differences!
void _mm_store_ps( float *p, __m128 a ) stores 128-bit vector a at pointer p Load goes from pointer to vector, this goes from vector to pointer!

Right now in the naive version, we are computing a single squared difference at a time (1 pixel at a time), and accumulating them as a sum in the variable squaredDifference. Let's instead accumulate them in a vector (a __m128i thingy) and consolidate the vector once at the end into squaredDifference. You could call the vector something like "squaredDifference_vector" or you could call it something like "doggy". Either way, using the setzero function. You have two options: declaring a single copy outside both boxY and boxX loops. Or you could declare a new copy in every iteration of the inner boxX loop.

Now, we want to start adding to it, four (4) squared differences at a time. Right now in the naive version, we're loading one pixel value at a time from each image. Let's load four (4) at a time from each, just like we did in lab 9. Then, use the SIMD arithmetic functions detailed above to compute a vector of squared differences, which you can add to the running total!

The tail case: Choosing the loop bounds for this particular assignment is a little tricky. Recall from lab 9 that when we process 4 elements of an array at a time using SSE Intrinsics, we want to stop at the biggest multiple of 4 less than or equal to the array size. However, since boxX isn't a 0-index into the array (rather, it's a displacement value), going up to featureWidth / 4 * 4 isn't actually going to work. You have a few options here:

HINT/WARNING: If you find that calculating the loop bounds is taking a really long time, raise your hand and ask a TA! This can be very frustrating. You'll have to repeat this again when you do the next task (loop unrolling).

Sanity check: Once you're done SIMDizing this loop, you should've used all the SSE Intrinsics listed above. Additionally, you should've only changed the loop bounds on the variable boxX, not boxY.
Sanity check2: Your tail case can either go inside the boxY loop, in which case it would be executed once for each row of the image, or it could go outside the boxY loop, in which case it would have nested for loops and be executed once at the very end.

Random Assortment of Stuff of the day!

Hello! This is approximately the halfway mark of the lab. Here are some happy things that make me smile:

Here's today's image of the day!

Here's a Liz Climo comic:

Task 3: 2x Loop Unrolling

Task: Unroll your SIMDized version of the boxX code twice.

Question: Why not 4 times?
Answer: Because the featureWidths that we test with will be in the ranges of 4 to 8. If we have a small featureWidth (i.e. 7 or less), we can't stride by 16 elements, and everything will be deferred to the tail case... which makes our SIMD code useless!

You should notice that by changing your stride to 8 elements, you need to recalculate your loop bounds!

Hint: If you computed an explicit formula for your end index, you should've noticed that it depended on the parity of featureWidth. That was because we were striding by 4 elements. Now that we're striding by 8, the constant you subtract will NOT depend on the value of featureWidth % 2. Rather, it will depend on the value of featureWidth % 4.

Task 4 (last one!): Removing some conditionals

On line 30 of calcDepthNaive.c, you'll notice that we have a special case for when we're on a pixel which we can't draw a full feature box around. In this case, we just set the depth of that pixel to 0!

Task: Instead of having to check this condition on every iteration of the outermost two loops (x and y), why don't we modify some more loop boundaries? Think: The edges of the image are the places where we can't fit a full feature box around a pixel. Instead of starting at the edges, why don't we start a little bit inwards from the borders? In particular, if we want to fit a feature box around every pixel we iterate through, and a feature box extends featureWidth left and featureWidth right of the pixel as well as featureHeight up and featureHeight down... that means we should start the outermost loops at:

x = ________, y = ________
and stop when
x + _________ > __________, y + _________ > __________

One more thing: Because we've changed the loop boundaries, we no longer have zeros saved in the border indices of depth. How should we initialize depth to account for this?

Checkoff [1/1]