CS 61C Lab 9

Thread Level Parallelism with OpenMP

Background

Goals

This lab will give you a brief introduction to shared memory parallel programming using OpenMP. We will be using the hive cluster machines in 330 Soda, which feature:

  • 2 x Quad-core 2.40 GHz Intel Xeon processors (8 physical cores total)

Partners

Working with a partner is highly recommended. If you work with a partner, make sure that both partners understand all aspects of your solution.

Setup

Pull the contents of ~cs61c/labs/fa12/09.

$ cd ~/work
$ git pull ~cs61c/labs/fa12/09 master

Introduction to OpenMP

Basics

All semester we have been using multicore computers, and in this lab we will finally write some code that takes advantage of multiple cores. OpenMP is a parallel programming framework for C/C++ and Fortran. It has gained quite a bit of traction in recent years, primarily due to the fact that it is easy to use and can provide good performance in many situations. In this lab we will only be learning about a small fraction of its features and capabilities. Check out the links in the Additional References section if you'd like to learn more.

There are many types of parallelism and patterns for exploiting it, and OpenMP implements what's called a "nested fork-join" model. By default, an OpenMP program runs like a standard sequential program, except for regions that the programmer explicitly declares should be executed in parallel. In the parallel regions, the framework creates (forks) some number of threads. Typically these threads all execute the same code, but operate on different portions of a data set. At the end of the parallel region, the framework waits for all threads to complete (join) before it leaves that region and resumes executing the program sequentially.

OpenMP uses shared memory, meaning all threads run in the same address space. The alternative to this is distributed memory, which is prevalent on clusters and requires that data must be explicitly moved between address spaces. Many programmers find that shared memory systems are easier to program since they do not have to worry about moving their data. However, it is difficult to build the hardware that implements shared memory systems in a scalable way. Later in the lab we will declare some memory to be thread local (accessible only by the thread that created it) for performance reasons, but the programming framework provides the ability for threads to share memory without requiring special effort on the part of the programmer.

Hello World Example

For this lab, we will be using C. OpenMP is a framework with a C interface, but OpenMP is not a standard part of the C language. Most OpenMP commands are actually directives to the compiler. Consider the following implementation of Hello World (hello.c):

#include <stdio.h>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    int thread_ID = omp_get_thread_num();
    printf(" hello world %d\n", thread_ID);
  }
  return 0;
}

This program will fork off the default number of threads, and each thread will print out "hello world" along with a number that uniquely identifies it. The #pragma tells the compiler that the rest of the line is a directive, and in this case the directive is omp parallel. omp identifies it as an OpenMP directive, and parallel indicates that the following code block (what is contained in { }) should be executed in parallel by multiple threads. Give it a try:

$ make hello
$ ./hello

Notice how the numbers printed out aren't necessarily in sequential order, and their order may vary when you run hello multiple times. OpenMP doesn't provide any guarantees about the order in which threads will run the code inside an omp parallel block. The programmer must guarantee that the code inside an omp parallel region will behave correctly, regardless of the order in which the forked threads execute it. It is also worth noting that the variable thread_ID is local to each thread. In general with OpenMP, variables declared outside an omp parallel block exist in a single memory location and are shared amongst all threads, while variables declared inside an omp parallel block exist in multiple locations - a private copy is created for each thread.
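
To see the shared/private distinction in action, here is a minimal standalone sketch (not part of the lab files; the variable names are our own). Every thread prints the same address for the variable declared outside the parallel block, but a different address for the variable declared inside it.

#include <stdio.h>
#include <omp.h>

int main() {
  int shared_var = 0;                        /* declared outside the parallel block: one copy, shared by all threads */
  #pragma omp parallel
  {
    int private_var = omp_get_thread_num();  /* declared inside: each thread gets its own copy */
    printf("thread %d: &shared_var = %p, &private_var = %p\n",
           private_var, (void *) &shared_var, (void *) &private_var);
  }
  return 0;
}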

OpenMP Programming Exercises

Exercise 1: Vector Addition

Vector addition is inherently a very parallel computation and as such, it makes for a good first exercise. The v_add function inside v_add.c performs an element-by-element addition of two vectors x and y and places the result in a third vector z, as shown below:

void v_add(double* x, double* y, double* z) {
  # pragma omp parallel
  {
    for(int i=0; i<ARRAY_SIZE; i++)
      z[i] = x[i] + y[i];
  }
}

You can run this (make v_add followed by ./v_add) and the testing framework we have provided will measure its execution time while varying the number of threads used. You will see that performance decreases as we increase the number of threads. What's going on here is that each thread is executing all of the code within the omp parallel block, meaning if we use 8 threads, we will actually be redundantly adding the vectors 8 times. To realize a speedup when using multiple threads, we need each thread to do less work, not the same amount as before.

Your task is to modify v_add so there is some speedup (speedup may plateau as the number of threads continues to increase). The best way to accomplish this is by reducing the amount of work done by each thread. Two useful OpenMP functions that can aid you in this process are:

int omp_get_num_threads();
int omp_get_thread_num();

The function omp_get_num_threads() will return how many threads are executing an omp parallel block, and omp_get_thread_num() will return the current thread's unique ID.
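
If you want to see what these functions return before modifying v_add, a small standalone probe like the one below can help (this is only an illustration, not part of the lab files, and it does not implement either splitting method).

#include <stdio.h>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    int num_threads = omp_get_num_threads();  /* total threads in this parallel region */
    int thread_id   = omp_get_thread_num();   /* this thread's ID, in 0 .. num_threads-1 */
    printf("I am thread %d of %d\n", thread_id, num_threads);
  }
  return 0;
}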

Divide up the work for each thread using two different methods:

  • First, have each thread handle interleaved elements: Thread 0 adds the elements at indices 0, N, 2N, ..., Thread 1 adds the elements at indices 1, N+1, 2N+1, ..., and so on, where N is the number of threads. This method will not be very efficient; it will exhibit the behavior known as false sharing.
  • Second, if there are N threads, break the vectors into N contiguous chunks, and have each thread add only the elements in its own chunk.

For this exercise, we are asking you to manually split the work amongst threads. Since this is such a common pattern, the designers of OpenMP created the for directive to automatically split up independent loop iterations. You are NOT to use the for directive in this exercise, but for future reference, here is the function rewritten using it:

void v_add(double* x, double* y, double* z) {
  #pragma omp parallel
  {
    #pragma omp for
    for(int i=0; i<ARRAY_SIZE; i++)
      z[i] = x[i] + y[i];
  }
}

Checkoff

Show the TA your code for v_add that manually splits up the work (both methods)  
Run your code to show that it gets parallel speedup (second method)  

Exercise 2: Dot Product

Next, we will consider a piece of code that computes the dot product of two vectors. At first glance, this might seem similar to what we did in v_add, but the challenge is in how to combine the results of all the additions into a single variable. This process is known as a reduction. Improper handling of the reduction may lead to data races: if all the threads read from and write to the same memory location at the same time, they might overwrite each other's results and arrive at the wrong answer. One way to handle this situation is to use what's called a critical section. The code in a critical section can only be executed by a single thread at any given point in time. We can use a critical section to prevent different threads from accessing the same memory location in an unsafe manner, and thereby avoid data races. A naive dot-product implementation might protect the sum with a critical section, as shown below (dotp.c):

double dotp(double* x, double* y) {
  double global_sum = 0.0;
  #pragma omp parallel
  {
    #pragma omp for
    for(int i=0; i<ARRAY_SIZE; i++) {
      #pragma omp critical
      global_sum += x[i] * y[i];
    }
  }
  return global_sum;
}

Try running the code (make dotp and ./dotp). Notice how the performance gets much worse as the number of threads goes up. (Feel free to press Ctrl-C if the test starts to take forever.) By putting all of the work of the reduction in a critical section, we have flattened the parallelism and made it so only one thread can do useful work at a time, which is not exactly the point of thread-level parallelism. This contention is problematic: each thread is constantly trying to get into the critical section, and only one is making any progress at any given time. As the number of threads goes up, so does the amount of contention, and performance drops accordingly. Can you alleviate this performance bottleneck?

First, try to improve the performance of this code without using the OpenMP reduction clause. Hint: Try to minimize the number of times each thread must add to the shared global_sum variable.

Next, try using the reduction clause with the omp parallel for pragma. For documentation, see slide 33 of Hands On Introduction to OpenMP. Does this code perform better than your code from the previous step? If so, why?
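
For reference, the reduction clause syntax looks like the toy sketch below, which sums a simple counter rather than computing the dot product (this example is our own and is not in the lab files). Each thread accumulates into a private copy of total, and OpenMP combines the copies with + when the loop ends; you still need to apply the clause to dotp yourself.

#include <stdio.h>
#include <omp.h>

int main() {
  int total = 0;
  /* reduction(+:total): each thread gets a private total initialized to 0,
     and the private copies are summed into the shared total at the end of the loop */
  #pragma omp parallel for reduction(+:total)
  for(int i=0; i<1000; i++)
    total += 1;
  printf("total = %d (expected 1000)\n", total);
  return 0;
}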

Checkoff

Show the TA your manual fix to dotp.c that achieves a speedup relative to the single-threaded case.
Show the TA your reduction clause fix for dotp.c, and explain the difference in performance (if there is one).