CS61C Summer 2013 Lab 8 - Thread Level Parallelism with OpenMP

Goals

Get a glimpse of shared-memory parallel programming using OpenMP.
Learn the basics of synchronization and common parallelization errors.

Background Information

Orchard machines in 200 Sutardja Dai:

2 x Quad-core 2.27 GHz Intel Xeon E5520 (8 physical/16 logical cores)
12 GiB DDR3 RAM

Hive machines in 330 Soda:

2 x Quad-core 2.40 GHz Intel Xeon E5620 (8 physical/16 logical cores)
6 GB DDR3 RAM

Additional OpenMP References:

Setup

Copy the contents of ~cs61c/labs/08 to a suitable location in your home directory.

Note: If you're working on the orchard machines and encounter an error that says "Abort Trap 6" while working through this lab, please ssh to a hive machine and work there.

Introduction to OpenMP

Basics

In this lab we finally take advantage of the multiple cores on the lab machines! OpenMP is a parallel programming framework for C/C++ and Fortran. It has gained quite a bit of traction in recent years, primarily due to its simplicity while still providing good performance. In this lab we will be taking a quick peek at a small fraction of its features, but the links in the Background Information section can provide more information and tutorials for the curious/interested.

There are many types of parallelism and patterns for exploiting it, and OpenMP chooses to use a nested fork-join model. By default, an OpenMP program is a normal sequential program, except for regions that the programmer explicitly declares to be executed in parallel. On entering a parallel region, the framework creates (forks) a set number of threads. Typically these threads all execute the same instructions, just on different portions of the data. At the end of the parallel region, the framework waits for all threads to complete (joins them) before it leaves that region and continues sequentially.

OpenMP uses shared memory, meaning all threads can access the same address space. The alternative to this is distributed memory, which is prevalent on clusters where data must be explicitly moved between address spaces. Many programmers find shared memory easier to program since they do not have to worry about moving their data, but it is usually harder to implement in hardware in a scalable way. Later in the lab we will declare some memory to be thread local (accessible only by the thread that created it) for performance reasons, but the programming framework provides the flexibility for threads to share memory without programmer effort.

Hello World Example

For this lab, we will use C to leverage our prior programming experience with it. OpenMP is a framework with a C interface, and it is not a built-in part of the language. Most OpenMP commands are actually directives to the compiler. Consider the following implementation of Hello World (found in hello.c):

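#include <stdio.h>
#include <omp.h>

int main() {
  // Fork a team of threads; each thread runs the block below.
  #pragma omp parallel
  {
    int thread_ID = omp_get_thread_num();  // private to each thread
    printf(" hello world %d\n", thread_ID);
  }
  return 0;
}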

This program will fork off the default number of threads and each thread will print out "hello world" in addition to which thread number it is. The #pragma tells the compiler that the rest of the line is a directive, and in this case it is omp parallel. omp declares that it is for OpenMP and parallel says the following code block (what is contained in { }) can be executed in parallel. Give it a try:

$ make hello
$ ./hello

Notice how the numbers are not necessarily in numerical order and not in the same order if you run hello multiple times. This is because within an omp parallel region, the programmer guarantees that the operations can be done in parallel, and there is no ordering between the threads. It is also worth noting that the variable thread_ID is local to each thread. In general with OpenMP, variables declared outside an omp parallel block have only one copy and are shared amongst all threads, while variables declared within an omp parallel block have a private copy for each thread.

Exercises

Exercise 1: Vector Addition

Vector addition is a highly parallel computation and it makes for a good first exercise. The v_add() function inside v_add.c stores in z the element-by-element sum of its inputs x and y. A first attempt at this might look like:

void v_add(double *x, double *y, double *z) {
  #pragma omp parallel
  {
    for(int i=0; i<ARRAY_SIZE; i++)
      z[i] = x[i] + y[i];
  }
}

You can run this by typing:

$ make v_add
$ ./v_add

and the testing framework will automatically time it and vary the number of threads. You will see that this actually seems to do worse as we increase the number of threads. The issue is that each thread is executing all of the code within the omp parallel block, meaning if we have 8 threads, we will actually be adding the vectors 8 times. To get speedup when increasing the number of threads, we need each thread to do less work, not the same amount as before.

Your task is to modify v_add() so there is some speedup (speedup may plateau as the number of threads continues to increase). The best way to do this is to decrease the amount of work each thread does. To aid you in this process, two useful OpenMP functions are: int omp_get_num_threads() and int omp_get_thread_num().

The function omp_get_num_threads() will return how many threads there are in an omp parallel block, and omp_get_thread_num() will return the thread ID.

Divide up the work for each thread through three different methods:

Give each thread alternating elements (i.e. for N threads, thread i sums every N-th element starting at index i). This method will not be very efficient because it encounters the problem known as false sharing.
Give each thread a contiguous chunk of elements. This should achieve much faster speeds.
Use OpenMP's built-in parallel for combined directive.

There are two suggested methods for implementing multiple methods in the same file: (1) put everything in v_add() and use block comments (/* ... */) to comment out the ones you're not using OR (2) put each in a separate function, then rename them so that the one you want to use is called v_add().

Check-off

Show your TA your code for all three methods described above as well as the execution speeds.
For the first method, what happens to the speed as you increase the number of threads? Explain exactly how false sharing hurts your performance in this case.
How does the manual splitting (method 2) compare against the automatic splitting (method 3)?

Exercise 2: Dot Product

Next we compute the dot product of two vectors. At first glance this looks much like v_add(), but the challenge is summing all of the products into a single variable (a reduction). Careless handling of the reduction leads to data races: all the threads try to read and write the same address simultaneously. One solution is to use a critical section. The code in a critical section can be executed by only one thread at any given time, so a critical section prevents multiple threads from reading and writing the same data at once. A naive implementation would protect the sum with a critical section (as found in dotp.c):

double dotp(double* x, double* y) {
  double global_sum = 0.0;
  #pragma omp parallel
  {
    #pragma omp for
    for(int i=0; i<ARRAY_SIZE; i++)
      #pragma omp critical
        global_sum += x[i] * y[i];
  }
  return global_sum;
}

Try out the code using:

$ make dotp
$ ./dotp

Notice how the performance gets much worse as the number of threads goes up? By putting all of the work of reduction in a critical section, we have flattened the parallelism and made it so only one thread can do useful work at a time (not exactly the idea behind thread-level parallelism). This contention is problematic; each thread is constantly fighting for the critical section and only one is making any progress at any given time. As the number of threads goes up, so does the contention, and the performance pays the price. Can you fix this performance bottleneck?

Fix this code by reducing the number of times a thread must add to the shared global_sum variable.
Fix this code using the built-in reduction keyword.

Check-off

Show your TA your code for both fixes and their performances.
Compare the two methods and explain why the second method is slightly faster. (Hint: what does reduction actually do?)
