Performance Programming

Deadline: Wednesday, December 4th at 11:59 pm

Objectives

  • TSW use the performance programming techniques learned in class to quickly generate the Mandelbrot set.
  • TSW focus on major speed improvements first before attempting micro-optimizations.

Getting Started

Please accept the GitHub Classroom assignment by clicking this link.

After the repository has been created, make sure to fill out this required form so that we can identify your submission.

Once you’ve created your repository, you’ll need to clone it to your instructional account and/or your local machine. You’ll also need to add the starter code remote with the following command:

$ git remote add starter https://github.com/61c-teach/fa19-HWperf-starter

If we publish changes to the starter code, retrieve them using git pull starter master.

Overview

For this homework, you will apply the performance optimization techniques you learned in class to speed up a reference implementation of code that generates the Mandelbrot set. Rather than generating a set of images for each zoom level, we want to instead see what fraction of pixels are actually inside the Mandelbrot set. For example, if we use the same parameters as testB2 from project 1, we will report the following:

Frame 0, Scale 2.000000:  3772/40401 pixels in set
Frame 1, Scale 0.029907:   144/40401 pixels in set
Frame 2, Scale 0.000447:     0/40401 pixels in set
Frame 3, Scale 0.000007:   579/40401 pixels in set
Frame 4, Scale 0.000000:  9227/40401 pixels in set

Step 1: Understanding the Code

When you pull the starter files from the repository created by GitHub Classroom, you should see the following files:

Makefile
benchmark.c
mandelbrot.c
mandelbrot.h
mandelbrot_baseline.c
parameters.h

You should only modify mandelbrot.c for this homework. Any changes to any other files will be ignored when we are grading your submission. For this homework, we will also require that you do not add any new header files to your code. For instance, since the starter code does not include math.h, we will require that your implementation also does not include math.h. In other words, these are the expected headers that we will check for in mandelbrot.c:

// Include SSE intrinsics
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#include <x86intrin.h>
#endif

// Include OpenMP
#include <omp.h>

#include "mandelbrot.h"
#include "parameters.h"

For convenience, we have provided the mandelbrot_baseline.c file which is exactly the same as mandelbrot.c. This file should not be modified. It is there so that you can

  1. Measure your speedup and
  2. Reference a fully-correct (albeit slow) implementation.

Mandelbrot Implementation

If you take a look at mandelbrot.h, notice that it only contains one method header:

void mandelbrot(struct parameters, double, int32_t *);

Header files are almost like interfaces in Java; they are a sort of contract that the implementation (contained in the corresponding .c file) will have a method with the exact same function signature as the one defined in the header file. In other words, you can add as many or as few helper functions as you like to mandelbrot.c, but there must be a function named mandelbrot that takes in the above arguments and returns nothing.

We have a reference implementation of mandelbrot, along with a helper function iterations. This is very similar to how one may have implemented project 1 with a few key differences.

The double complex Type

Rather than using a custom complex number implementation, this implementation uses the double complex datatype. This type has built-in + and * operations that performs complex number addition and multiplication operations, and since it is very much like the other primitive types in C, there’s no need to explicitly manage memory for complex types. The complex.h header also defines I, which one can multiply with a double to create an imaginary number. The following snippet of code demonstrates all of these features:

double complex point = (params.center +
        j * scale / params.resolution +
        i * scale / params.resolution * I);

The complex.h header also defines several utility functions for dealing with complex numbers. For instance, the reference implementation uses creal and cimag to get the real and imaginary parts of a complex number. For more information, check out the manual page for the complex number type with man complex.

The Parameters Struct

To shorten the length of function headers, we pass around a struct parameters type that contains all of the information we might need at any point during the Mandelbrot set computation. The definition of the struct is in parameters.h:

struct parameters {
    double threshold;
    int maxiters;
    int numframes;
    int resolution;
    double complex center;
    double initialscale;
    double finalscale;
};

Counting Pixels in Set

Recall that the main purpose of this code is to count the number of pixels in the Mandelbrot set at multiple zoom levels. As such, we no longer want to fill an array with iteration counts for each pixel—we want to see how many pixels remained in the threshold after max iterations. In the reference implementation, this corresponds to all pixels for which the iterations helper function returned 0. So, the third argument to mandelbrot is a pointer to a single int32_t type which we only update once in the reference implementation:

*num_pixels_in_set = num_zero_pixels;

Usage

You can simply run make to compile everything in this folder and create two executables: benchmark (which uses your mandelbrot implementation) and benchmark_baseline (which uses the reference implementation).

The main entry-point for this program is benchmark.c which parses command line arguments and runs your mandelbrot implementation for every zoom level. After compiling everything with make, execute ./benchmark to see the following usage message:

Usage:
        benchmark -t <threshold> -m <maxiterations> -n <numframes> -r <resolution>
                  -R <creal> -I <cimag> -i <initialscale> -f <finalscale>
Options:
        -t, threshold
                The threshold used to determine what's in the Mandelbrot set.
        -m, maxiters
                The maximum number of iterations to determine if a point is in the Mandelbrot set.
        -n, numframes
                The number of frames to generate.
        -r, resolution
                The width and height (in pixel) of the image.
        -R, center_real
                The real component of the center of the image.
        -I, center_imag
                The imaginary component of the center of the image.
        -i, initialscale
                The starting width and height (in the complex plane) of the image.
        -f, finalscale
                The ending width and height (in the complex plane) of the image.

Notice that we use command line flags (e.g. the -a in ls -a) to pass in arguments to benchmark. For example, to run benchmark with the same parameters as testB2 from project 1, we run the following command:

$ ./benchmark --threshold 2 \
              --maxiters 1536 \
              --numframes 5 \
              --resolution 100 \
              --center_real -0.561397233777 \
              --center_imag -0.643059076016 \
              --initialscale 2 \
              --finalscale 1e-7

For convenience, there are also one-letter versions of each argument above (specified in the usage string). The same command using the abbreviated arguments looks like:

$ ./benchmark -t 2 -m 1536 -r 100 -n 5 -R -0.561397233777 -I -0.643059076016 -i 2 -f 1e-7

If we run this command on a Hive machine, we get the following output:

$ ./benchmark -t 2 -m 1536 -r 100 -n 5 -R -0.561397233777 -I -0.643059076016 -i 2 -f 1e-7
209884 microseconds
Frame 0, Scale 2.000000:  3772/40401 pixels in set
Frame 1, Scale 0.029907:   144/40401 pixels in set
Frame 2, Scale 0.000447:     0/40401 pixels in set
Frame 3, Scale 0.000007:   579/40401 pixels in set
Frame 4, Scale 0.000000:  9226/40401 pixels in set

The first line shows how much time it took to compute all of the requested frames; for the above run, it took 209884 microseconds which is about 0.2 seconds. Note that there can be significant variations in the reported runtime depending on the load of the Hive machine.

The next n lines, where n is the number of requested frames, show how many pixels are actually in the set relative to the number of pixels that were evaluated. Note that the denominator is tied to the resolution and should not change for different frames.

Step 2: Unrolling and Other Optimizations

You should first try to speed up the computation by trying to apply conventional code optimizations (i.e. without using SSE or OpenMP). While we won’t tell you the exact steps, here are some hints that should help you get started:

  1. Function calls are expensive since they involve setting up a stack frame and jumping to a different part of code. See if there are any functions that are frequently called that don’t necessarily need to be.
  2. Are there any places where you could do manual loop unrolling?
  3. Is there any unnecessary computation being done?

Note that the above hints relate to general optimization practices. You do not necessarily need to do all of these to achieve a good speedup, and in fact, not all of these may apply to the reference implementation. Recall that the mandelbrot function is calculating something fundamentally different in this homework than in project 1, so make sure to account for that in your speedup strategy.

Once you have improved performance using these optimizations, you can start applying vectorization and parallelism to make the program even faster. Note that you have considerable freedom to apply any of these optimizations, and there is more than one correct solution. Try to experiment with different approaches and see which one gives you the best performance.

Step 3: SIMD Instructions

In lab, you learned how to apply SIMD instructions to improve performance. The processors in the Hive machines support the Intel AVX extensions, which allow you to do SIMD operations on 256 bit values (not just 128 bit, as we have seen in the lab). You should use these extensions to perform four operations in parallel since all of the floating point values in the reference implementation are doubles, which are 64 bits in size.

As a reminder, you can use the Intel Intrinsics Guide as a reference to look up the relevant instructions. You will have to use the __m256d type to hold 4 doubles in a YMM register, and then use the _mm256_* intrinsics to operate on them.

Step 4: OpenMP

Finally you should use OpenMP to parallelize computation. Note that you will need to make sure that none of the different threads overwrites each others’ data. Just adding a #pragma omp parallel for may cause errors.

Note that the Hive machines have 4 cores with two hyperthreads each. This means that you should expect a speed-up of 4-8x (note that hyperthreads mean that two different threads execute on the same physical core at the same time; they will therefore compete for processor resources, and as a result, you will not get the same performance as if you were running on two completely separate cores).

Testing for Correctness

The default make command compiles both your code and the reference implementation. To make sure your implementation is correct, run both your code (with ./benchmark) and the baseline code (with ./benchmark_baseline) and make sure the values reported by both are almost the same. For example:

$ ./benchmark -t 2 -m 1536 -r 100 -n 5 -R -0.561397233777 -I -0.643059076016 -i 2 -f 1e-7
_________ microseconds
Frame 0, Scale 2.000000:  3772/40401 pixels in set
Frame 1, Scale 0.029907:   144/40401 pixels in set
Frame 2, Scale 0.000447:     0/40401 pixels in set
Frame 3, Scale 0.000007:   579/40401 pixels in set
Frame 4, Scale 0.000000:  9226/40401 pixels in set

$ ./benchmark_baseline -t 2 -m 1536 -r 100 -n 5 -R -0.561397233777 -I -0.643059076016 -i 2 -f 1e-7
_________ microseconds
Frame 0, Scale 2.000000:  3772/40401 pixels in set
Frame 1, Scale 0.029907:   144/40401 pixels in set
Frame 2, Scale 0.000447:     0/40401 pixels in set
Frame 3, Scale 0.000007:   579/40401 pixels in set
Frame 4, Scale 0.000000:  9226/40401 pixels in set

Due to floating point error, there might be some slight variation between the two. Anything beyond a small variation is indicative of an incorrect implementation.

Measuring Your Speedup

To measure your speedup, simply divide the runtime of benchmark_baseline with the runtime of benchmark when run on the same arguments. Because there are many more 61C students than Hive machines, you will likely share resources with your classmates. This might affect the measurement of your code’s speedup. We encourage you to use Hivemind to help balance the load amongst the Hive machines.

Grading

In this homework, your grade will be determined based upon how fast your code runs relative to the reference solution given to you. We have also provided a minimal Gradescope autograder that measures the correctness of your solution. If your code does not pass the Gradescope autograder, it will receive a 0 for the homework.

We will be using the following function to determine your grade:

double grade(double speedup) {
    if (0 <= speedup && speedup < 1) {
        return 0;
    } else if (1 <= speedup && speedup <= 14) {
        return 1 - pow(speedup - 14, 2) / 169;
    } else {
        return 1;
    }
}

For reference, here are the grades for some the speedup values:

Speedup Grade
14x 100%
12x ~98%
10x ~91%
8x ~79%
6x ~62%
4x ~41%
2x ~15%
<1x 0%

We will test your program’s speed on the Hive machines. You should test on the Hive machines at a minimum to ensure everything is correct and working properly, but note that because of the size of the course the peak hours of the project right before it is due may not be an accurate reflection of your speedup if the Hive machines get too busy. As a result, starting early is your friend and similarly your tests will likely be more accurate at 3am than noon. Also, make sure to measure speedups on sufficiently large resolutions and frame counts; it should take the reference solution ~1 minute to finish executing on a set of parameters to be sure that overhead is not a large percentage of the execution time.

Submission

There is a Gradescope submission with a few correctness tests configured. Because of the hardware-specific nature of this homework, we will not be measuring speedup on Gradescope. Instead, we will periodically pull your latest submission from GitHub and run them on some reserved Hive machines. If you ever want to revert to a previous submission, make sure to create a new submission with the same code as the older one. There is no way for us to measure the speedup of anything besides the most recent submission.