Lab 8: Parallelism II: OpenMP, Open MPI

Deadline: Monday, April 17, 11:59:59 PM PT

Setup

You must complete this lab on the hive machines (not your local machine). See Lab 0 if you need to set up the hive machines again.

In your labs directory on the hive machine, pull any changes you may have made in past labs:

git pull origin main

Still in your labs directory on the hive machine, pull the files for this lab with:

git pull starter main

If you run into any git errors, please check out the common errors page.

OpenMP Directives

Usage for OpenMP directives can be found on the OpenMP summary card.

For

In the last lab, you manually assigned iterations within a for loop to individual threads. OpenMP can take care of this automatically with the for directive! You can either add a #pragma omp for to an existing for loop within a #pragma omp parallel, or use #pragma omp parallel for by itself.
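
For example, here is a minimal sketch of the combined directive (the function and array names are made up for illustration and are not part of the lab's starter code):

#include <omp.h>

// Double every element of a; OpenMP splits the loop iterations across threads.
void scale_all(double *a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] *= 2.0;
    }
}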

Critical

The code in a critical section can only be executed by a single thread at any given time. Thus, a critical section naturally prevents multiple threads from reading and writing the same data at the same time, a problem that would otherwise lead to data races. OpenMP provides the critical directive to allow you to perform computations within a critical section.
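
As a rough, self-contained sketch (not code from the lab), each thread below computes a value in parallel but updates the shared total inside a critical section:

#include <omp.h>

// Sums the thread IDs of all threads; the shared update is serialized.
double sum_thread_ids(void) {
    double total = 0.0;
    #pragma omp parallel
    {
        double piece = (double) omp_get_thread_num();  // per-thread work (trivial here)
        #pragma omp critical
        {
            total += piece;  // only one thread at a time executes this block
        }
    }
    return total;
}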

Reduction

Critical sections may be slow, so OpenMP has a built-in way to reduce this problem through the use of the reduction keyword. The reduction keyword increases the parallelism of your code: each thread accumulates into its own private copy of the variable, and OpenMP combines the private copies at the end, so only that final combining step needs to be serialized. To use this keyword, you must specify the variable that should be reduced and the operation being performed on that variable.
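
For example, a sketch of summing an array with a reduction (again, the function and array names are illustrative, not from the lab):

#include <omp.h>

// Each thread accumulates into a private copy of sum; OpenMP combines the
// private copies with + once the loop finishes.
double sum_array(double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}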

Exercise 1: OpenMP Dot Product

At first glance, implementing a dot product might not seem too different from v_add in Lab 7, since we now perform elementwise multiplication instead of addition. The challenge is how to add the products into one variable (a.k.a. a reduction) to get our final answer. If multiple threads try to add their results to the same variable at the same time, we get a data race and the result may be incorrect.
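
To see the hazard concretely, here is a sketch of a naively parallelized dot product (the function name dotp_racy and the array names are made up for illustration; they are not functions from the lab):

#include <omp.h>

// BROKEN: multiple threads read and write global_sum at the same time, so
// updates can be lost and the result is both wrong and nondeterministic.
double dotp_racy(double *x, double *y, int n) {
    double global_sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        global_sum += x[i] * y[i];
    }
    return global_sum;
}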

Try running the tests now; only the naive solution should pass, since the other optimizations are not yet implemented.

make ex1
./ex1
  1. Implement dotp_critical using OpenMP and a critical section.
    • Notice how the performance gets much worse as the number of threads goes up? By putting all of the work of reduction in a critical section, we have flattened the parallelism and made it so only one thread can do useful work at a time (not exactly the idea behind thread-level parallelism).
    • This contention is problematic: each thread is constantly fighting for the critical section, and only one makes any progress at any given time. As the number of threads goes up, so does the contention, and the performance pays the price.
  2. Implement dotp_reduction using OpenMP and a reduction statement.
    • Hint: the variable that should only be accessed by one thread at a time is global_sum and the operation being performed on it is addition (+).
  3. Implement dotp_manual_reduction using OpenMP and the idea of reduction, but do not use the reduction keyword.
    • Hint: create variables for each thread and only add to the final sum when absolutely necessary.

Open MPI

The Open MPI project provides a way of writing programs that run on multiple processes. We can use its C library by calling its functions. Then, when we run the program, Open MPI will create a number of processes and run a copy of the code on each process. Here is a list of the most important functions for this class:

  • int MPI_Init(int* argc, char*** argv) should be called at the start of the program, passing in the addresses of argc and argv.
  • int MPI_Finalize() should be called at the end of the program.
  • int MPI_Comm_size(MPI_Comm comm, int *size) gets the total number of processes running the program and puts it in size. For this class, pass MPI_COMM_WORLD as comm.
  • int MPI_Comm_rank(MPI_Comm comm, int *rank) gets the ID of the current process (from 0 to the total number of processes minus 1) and puts it in rank. Again, pass MPI_COMM_WORLD as comm.
  • int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) (Man page)
    • buf is where the message to send should be stored
    • count is the number of elements within the message to send
    • datatype is the datatype of each element within the message
    • dest is the process ID of the recipient of the message
    • tag is the tag of the message
      • For the scope of this lab, feel free to use 0.
    • comm is the communication context of the message
      • For the scope of this class, use MPI_COMM_WORLD
  • int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) (Man page)
    • buf is used to store the received message(s)
    • count is the maximum number of elements to receive
    • datatype is the datatype of each element within the received message
    • source is the source where the message should be from
      • If you want to receive a message from any source, set the source to be MPI_ANY_SOURCE.
      • The source of the message can be found in the MPI_SOURCE field of the returned status struct.
    • tag is the tag that the message must be sent with
      • If you want to receive a message with any tag, set the tag to be MPI_ANY_TAG.
    • comm is the communication context that the message must be sent within
      • For the scope of this class, use MPI_COMM_WORLD
    • status will store a status struct with additional information
      • If you don’t need the information in the status struct (e.g. because you already know the source of the message), set the status address to MPI_STATUS_IGNORE.

For Open MPI, make sure you use mpicc (instead of gcc) to compile and mpirun to run your programs.
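
To see how these functions fit together, here is a small sketch (this is not the lab's ex2.c; the message contents and file name are arbitrary) in which process 0 sends one integer to every other process:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's ID

    if (rank == 0) {
        // Process 0 sends one int to every other process.
        for (int dest = 1; dest < size; dest++) {
            int msg = dest * 10;
            MPI_Send(&msg, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        }
    } else {
        // Every other process receives its int from process 0 and prints it.
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d of %d received %d\n", rank, size, msg);
    }

    MPI_Finalize();
    return 0;
}

Assuming it were saved as sketch.c, it could be compiled with mpicc sketch.c -o sketch and run with mpirun -np 4 ./sketch.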

Exercise 2: Hello Open MPI

  1. In ex2.c, use the above functions to write a program that prints Hello world on multiple processes. For this exercise, since there's only one type of message, feel free to use 0 as the tag and MPI_COMM_WORLD as comm.
  2. Compile the program with mpicc ex2.c -o ex2
  3. Run the program with mpirun ./ex2 [num tasks], where [num tasks] is the number of tasks to run.
    • You can set the number of processes to use with mpirun -np [num processes] ./ex2 [num tasks]
    • If you choose to run with more processes than physical cores (4 for the hive machines), you must pass the --oversubscribe flag (e.g. mpirun --oversubscribe -np 100 ./ex2 100)
  4. Play around with different numbers of processes and tasks and see how the output changes!

Feel free to refer to discussion 10 for an example of writing Open MPI code! The code structure will be extremely similar.


Submission

Save, commit, and push your work, then submit to the Lab 8 assignment on Gradescope.