CS61C Fall 2018 Lab 8 - Midterm Prep

About

This lab is meant to help you study for the upcoming midterm. We have included some of next week's lab reading to ensure you have enough exposure to parallelism ideas. There will be no checkoff or coding exercises for this lab. Good luck on your midterm!

Midterm Resources

A few important Piazza links:

Midterm Information

Topical Midterm Guide

Exam Review Resources

The midterm covers content up to and including the lecture on Wednesday, October 17, though we will emphasize material that has been covered in homeworks, projects, lab, and discussion.

We highly recommend looking through the guide to build a strong base understanding of the material. The guide should give you some intuition for the important 61C concepts and, more importantly, examples of the ways in which you want to think about them. This should then allow you to study more effectively and approach problems in the right way. Once you feel comfortable with the material, we have compiled exam resources in a Google Drive folder (Piazza post linked above). Not all of the problems are relevant for this semester's exam, so make sure you know what is in scope!

Readings

The following readings are important for understanding the fundamentals of parallel programming. While you won't be doing any coding exercises, read through this lab carefully to make sure you understand this material before the exam.

Exercise 1: Familiarize Yourself with the SIMD Functions

Given the large number of available SIMD intrinsics, we want you to learn how to find the ones you'll need in your application.

Intel hosts a variety of tools related to intrinsics, which you can find here (but these are not necessary for the coming lab).

The one that we're particularly interested in is the Intel Intrinsics Guide. Open this page and once there, click the checkboxes for everything that begins with "SSE" (SSE all the way to SSE4.2).

Do your best to interpret the new syntax and terminology. Find the 128-bit intrinsics for the following SIMD operations (one for each):

HINT: If you're overwhelmed, just try clicking on a function whose name you understand! Maybe try one of the "add" functions. Now read the "Description" and "Operation" sections in the drop-down menu opened by clicking the instruction. If you clicked an "add" function, you should've read something like, "Add packed X-bit integers in a and b, and store the results in dst." Then you should realize that the value "X" also appears in the function name! The pattern you see here can be described as follows:

__m128i _mm_add_epiX adds together (128/X) packed X-bit integers.

Another hint: Things that say "epi" or "pi" deal with integers, and those that say "ps" or "pd" deal with single precision and double precision floats.
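For instance, here is a minimal sketch (the array values and variable names are ours, chosen just for illustration) that uses _mm_add_epi32, which, following the pattern above, adds 128/32 = 4 packed 32-bit integers at a time:

#include <stdio.h>
#include <emmintrin.h>  // SSE2 integer intrinsics

int main() {
  int a[4] = {1, 2, 3, 4};
  int b[4] = {10, 20, 30, 40};
  int sum[4];

  // Load four 32-bit integers from each array into 128-bit vectors
  __m128i va = _mm_loadu_si128((__m128i *) a);
  __m128i vb = _mm_loadu_si128((__m128i *) b);

  // _mm_add_epi32 adds 128/32 = 4 packed 32-bit integers at once
  __m128i vsum = _mm_add_epi32(va, vb);

  // Store the four sums back to memory
  _mm_storeu_si128((__m128i *) sum, vsum);
  printf("%d %d %d %d\n", sum[0], sum[1], sum[2], sum[3]);
  return 0;
}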

Exercise 2: Reading SIMD Code

In this exercise you will consider (not implement) the vectorization of 2-by-2 matrix multiplication in double precision.

The computation C = C + A*B, with all three 2-by-2 matrices stored in column-major order, amounts to the following arithmetic operations:

    C[0] += A[0]*B[0] + A[2]*B[1];
    C[1] += A[1]*B[0] + A[3]*B[1];
    C[2] += A[0]*B[2] + A[2]*B[3];
    C[3] += A[1]*B[2] + A[3]*B[3];

You are given the code sseTest.c that implements these operations in a SIMD manner.
The following intrinsics are used:

__m128d _mm_loadu_pd( double *p ) returns vector (p[0], p[1])
__m128d _mm_load1_pd( double *p ) returns vector (p[0], p[0])
__m128d _mm_add_pd( __m128d a, __m128d b ) returns vector (a0+b0, a1+b1)
__m128d _mm_mul_pd( __m128d a, __m128d b ) returns vector (a0*b0, a1*b1)
   void _mm_storeu_pd( double *p, __m128d a ) stores p[0]=a0, p[1]=a1

See how these intrinsics are used by viewing the code here.
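To give a sense of how these intrinsics fit together, here is a rough sketch of the 2-by-2 multiply in the same spirit as sseTest.c (the variable names and sample values are ours, not necessarily those in the actual file):

#include <stdio.h>
#include <emmintrin.h>  // SSE2 double-precision intrinsics

int main() {
  // 2-by-2 matrices stored in column-major order
  double A[4] = {1, 2, 3, 4};
  double B[4] = {5, 6, 7, 8};
  double C[4] = {0, 0, 0, 0};

  __m128d c1 = _mm_loadu_pd(C + 0);  // (C[0], C[1]): first column of C
  __m128d c2 = _mm_loadu_pd(C + 2);  // (C[2], C[3]): second column of C

  for (int i = 0; i < 2; i++) {
    __m128d a  = _mm_loadu_pd(A + 2 * i);  // i-th column of A: (A[2i], A[2i+1])
    __m128d b1 = _mm_load1_pd(B + i);      // (B[i],   B[i])
    __m128d b2 = _mm_load1_pd(B + i + 2);  // (B[i+2], B[i+2])
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));  // accumulate into first column of C
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));  // accumulate into second column of C
  }

  _mm_storeu_pd(C + 0, c1);
  _mm_storeu_pd(C + 2, c2);

  printf("C = [%g %g; %g %g]\n", C[0], C[2], C[1], C[3]);
  return 0;
}

After the loop, c1 holds (C[0] + A[0]*B[0] + A[2]*B[1], C[1] + A[1]*B[0] + A[3]*B[1]) and c2 holds the second column, matching the four scalar statements above.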

Introduction to OpenMP

OpenMP is a parallel programming framework for C/C++ and Fortran. It has gained quite a bit of traction in recent years, primarily due to its simplicity and good performance. In future labs we will be taking a quick peek at a small fraction of its features.

There are many types of parallelism and patterns for exploiting it. OpenMP chooses to use a nested fork-join model. By default, an OpenMP program is a normal sequential program, except for regions that the programmer explicitly declares to be executed in parallel. In the parallel region, the framework creates (forks) a set number of threads. Typically these threads all execute the same instructions, just on different portions of the data. At the end of the parallel region, the framework waits for all threads to complete (join) before it leaves that region and continues sequentially.

OpenMP uses shared memory, meaning all threads can access the same address space. The alternative to this is distributed memory, which is prevalent on clusters where data must be explicitly moved between address spaces. Many programmers find shared memory easier to program since they do not have to worry about moving their data, but it is usually harder to implement in hardware in a scalable way. Later we will talk about declaring some memory to be "thread local" (accessible only by the thread that created it) for performance reasons, but the programming framework provides the flexibility for threads to share memory without programmer effort.

Hello World Example

For future labs, we will use C to leverage our prior programming experience with it. OpenMP is a framework with a C interface, and it is not a built-in part of the language. Most OpenMP features are actually directives to the compiler. Consider the basic implementation of Hello World (hello.c):

#include <stdio.h>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    int thread_ID = omp_get_thread_num();
    printf(" hello world %d\n", thread_ID);
  }
}

This program will fork off the default number of threads and each thread will print out "hello world" in addition to which thread number it is. You can change the number of OpenMP threads, but this will be discussed later in lab. The #pragma tells the compiler that the rest of the line is a directive, and in this case it is omp parallel. omp declares that it is for OpenMP and parallel says the following code block (what is contained in { }) can be executed in parallel.
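As a rough sketch of the fork-join behavior described above (omp_set_num_threads and omp_get_num_threads are standard OpenMP runtime calls, though the thread count of 4 is just an assumption for illustration), the variation below also reports how many threads were forked:

#include <stdio.h>
#include <omp.h>

int main() {
  // Request 4 threads for the next parallel region (illustrative choice)
  omp_set_num_threads(4);

  #pragma omp parallel
  {
    int thread_ID = omp_get_thread_num();     // which thread am I?
    int num_threads = omp_get_num_threads();  // how many threads were forked?
    printf("hello from thread %d of %d\n", thread_ID, num_threads);
  }
  // Implicit join here: execution continues sequentially
  return 0;
}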