Deadline: Wednesday, December 4th at 11:59 pm
Please accept the GitHub Classroom assignment by clicking this link.
After the repository has been created, make sure to fill out this required form so that we can identify your submission.
Once you’ve created your repository, you’ll need to clone it to your instructional account and/or your local machine. You’ll also need to add the starter code remote with the following command:
$ git remote add starter https://github.com/61c-teach/fa19-HWperf-starter
If we publish changes to the starter code, retrieve them using git pull starter master
.
For this homework, you will apply the performance optimization techniques you learned in
class to speed up a reference implementation of code that generates the Mandelbrot set.
Rather than generating a set of images for each zoom level, we want to instead see what
fraction of pixels are actually inside the Mandelbrot set. For example, if we use the same
parameters as testB2
from project 1, we will report the following:
Frame 0, Scale 2.000000: 3772/40401 pixels in set
Frame 1, Scale 0.029907: 144/40401 pixels in set
Frame 2, Scale 0.000447: 0/40401 pixels in set
Frame 3, Scale 0.000007: 579/40401 pixels in set
Frame 4, Scale 0.000000: 9227/40401 pixels in set
When you pull the starter files from the repository created by GitHub Classroom, you should see the following files:
Makefile
benchmark.c
mandelbrot.c
mandelbrot.h
mandelbrot_baseline.c
parameters.h
You should only modify mandelbrot.c
for this homework. Any changes to any other files
will be ignored when we are grading your submission. For this homework, we will also
require that you do not add any new header files to your code. For instance, since the
starter code does not include math.h
, we will require that your implementation also does
not include math.h
. In other words, these are the expected headers that we will check
for in mandelbrot.c
:
// Include SSE intrinsics
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#include <x86intrin.h>
#endif
// Include OpenMP
#include <omp.h>
#include "mandelbrot.h"
#include "parameters.h"
For convenience, we have provided the mandelbrot_baseline.c
file which is exactly the
same as mandelbrot.c
. This file should not be modified. It is there so that you can
If you take a look at mandelbrot.h
, notice that it only contains one method header:
void mandelbrot(struct parameters, double, int32_t *);
Header files are almost like interfaces in Java; they are a sort of contract that the
implementation (contained in the corresponding .c
file) will have a method with the exact
same function signature as the one defined in the header file. In other words, you can add
as many or as few helper functions as you like to mandelbrot.c
, but there must be a
function named mandelbrot
that takes in the above arguments and returns nothing.
We have a reference implementation of mandelbrot
, along with a helper function
iterations
. This is very similar to how one may have implemented project 1 with a few
key differences.
double complex
TypeRather than using a custom complex number implementation, this implementation uses the
double complex
datatype. This type has built-in +
and *
operations that performs
complex number addition and multiplication operations, and since it is very much like
the other primitive types in C, there’s no need to explicitly manage memory for complex
types. The complex.h
header also defines I
, which one can multiply with a double to
create an imaginary number. The following snippet of code demonstrates all of these
features:
double complex point = (params.center +
j * scale / params.resolution +
i * scale / params.resolution * I);
The complex.h
header also defines several utility functions for dealing with complex
numbers. For instance, the reference implementation uses creal
and cimag
to get the
real and imaginary parts of a complex number. For more information, check out the manual
page for the complex number type with man complex
.
To shorten the length of function headers, we pass around a struct parameters
type that
contains all of the information we might need at any point during the Mandelbrot set
computation. The definition of the struct is in parameters.h
:
struct parameters {
double threshold;
int maxiters;
int numframes;
int resolution;
double complex center;
double initialscale;
double finalscale;
};
Recall that the main purpose of this code is to count the number of pixels in the Mandelbrot
set at multiple zoom levels. As such, we no longer want to fill an array with iteration
counts for each pixel—we want to see how many pixels remained in the threshold after
max iterations. In the reference implementation, this corresponds to all pixels for which
the iterations
helper function returned 0
. So, the third argument to mandelbrot
is
a pointer to a single int32_t
type which we only update once in the reference implementation:
*num_pixels_in_set = num_zero_pixels;
You can simply run make
to compile everything in this folder and create two executables:
benchmark
(which uses your mandelbrot implementation) and benchmark_baseline
(which
uses the reference implementation).
The main entry-point for this program is benchmark.c
which parses command line arguments
and runs your mandelbrot
implementation for every zoom level. After compiling everything
with make
, execute ./benchmark
to see the following usage message:
Usage:
benchmark -t <threshold> -m <maxiterations> -n <numframes> -r <resolution>
-R <creal> -I <cimag> -i <initialscale> -f <finalscale>
Options:
-t, threshold
The threshold used to determine what's in the Mandelbrot set.
-m, maxiters
The maximum number of iterations to determine if a point is in the Mandelbrot set.
-n, numframes
The number of frames to generate.
-r, resolution
The width and height (in pixel) of the image.
-R, center_real
The real component of the center of the image.
-I, center_imag
The imaginary component of the center of the image.
-i, initialscale
The starting width and height (in the complex plane) of the image.
-f, finalscale
The ending width and height (in the complex plane) of the image.
Notice that we use command line flags (e.g. the -a
in ls -a
) to pass in arguments to
benchmark
. For example, to run benchmark
with the same parameters as testB2
from
project 1, we run the following command:
$ ./benchmark --threshold 2 \
--maxiters 1536 \
--numframes 5 \
--resolution 100 \
--center_real -0.561397233777 \
--center_imag -0.643059076016 \
--initialscale 2 \
--finalscale 1e-7
For convenience, there are also one-letter versions of each argument above (specified in the usage string). The same command using the abbreviated arguments looks like:
$ ./benchmark -t 2 -m 1536 -r 100 -n 5 -R -0.561397233777 -I -0.643059076016 -i 2 -f 1e-7
If we run this command on a Hive machine, we get the following output:
$ ./benchmark -t 2 -m 1536 -r 100 -n 5 -R -0.561397233777 -I -0.643059076016 -i 2 -f 1e-7
209884 microseconds
Frame 0, Scale 2.000000: 3772/40401 pixels in set
Frame 1, Scale 0.029907: 144/40401 pixels in set
Frame 2, Scale 0.000447: 0/40401 pixels in set
Frame 3, Scale 0.000007: 579/40401 pixels in set
Frame 4, Scale 0.000000: 9226/40401 pixels in set
The first line shows how much time it took to compute all of the requested frames; for
the above run, it took 209884 microseconds
which is about 0.2 seconds
. Note that there
can be significant variations in the reported runtime depending on the load of the Hive
machine.
The next n
lines, where n
is the number of requested frames, show how many pixels are
actually in the set relative to the number of pixels that were evaluated. Note that the
denominator is tied to the resolution and should not change for different frames.
You should first try to speed up the computation by trying to apply conventional code optimizations (i.e. without using SSE or OpenMP). While we won’t tell you the exact steps, here are some hints that should help you get started:
Note that the above hints relate to general optimization practices. You do not necessarily
need to do all of these to achieve a good speedup, and in fact, not all of these may apply
to the reference implementation. Recall that the mandelbrot
function is calculating
something fundamentally different in this homework than in project 1, so make sure to
account for that in your speedup strategy.
Once you have improved performance using these optimizations, you can start applying vectorization and parallelism to make the program even faster. Note that you have considerable freedom to apply any of these optimizations, and there is more than one correct solution. Try to experiment with different approaches and see which one gives you the best performance.
In lab, you learned how to apply SIMD instructions to improve performance. The processors in the Hive machines support the Intel AVX extensions, which allow you to do SIMD operations on 256 bit values (not just 128 bit, as we have seen in the lab). You should use these extensions to perform four operations in parallel since all of the floating point values in the reference implementation are doubles, which are 64 bits in size.
As a reminder, you can use the Intel Intrinsics Guide
as a reference to look up the relevant instructions. You will have to use the __m256d
type to hold 4 doubles in a YMM
register,
and then use the _mm256_*
intrinsics to operate on them.
Finally you should use OpenMP to parallelize computation. Note that you will need to make
sure that none of the different threads overwrites each others’ data. Just adding a
#pragma omp parallel for
may cause errors.
Note that the Hive machines have 4 cores with two hyperthreads each. This means that you should expect a speed-up of 4-8x (note that hyperthreads mean that two different threads execute on the same physical core at the same time; they will therefore compete for processor resources, and as a result, you will not get the same performance as if you were running on two completely separate cores).
The default make
command compiles both your code and the reference implementation. To
make sure your implementation is correct, run both your code (with ./benchmark
) and the
baseline code (with ./benchmark_baseline
) and make sure the values reported by both are
almost the same. For example:
$ ./benchmark -t 2 -m 1536 -r 100 -n 5 -R -0.561397233777 -I -0.643059076016 -i 2 -f 1e-7
_________ microseconds
Frame 0, Scale 2.000000: 3772/40401 pixels in set
Frame 1, Scale 0.029907: 144/40401 pixels in set
Frame 2, Scale 0.000447: 0/40401 pixels in set
Frame 3, Scale 0.000007: 579/40401 pixels in set
Frame 4, Scale 0.000000: 9226/40401 pixels in set
$ ./benchmark_baseline -t 2 -m 1536 -r 100 -n 5 -R -0.561397233777 -I -0.643059076016 -i 2 -f 1e-7
_________ microseconds
Frame 0, Scale 2.000000: 3772/40401 pixels in set
Frame 1, Scale 0.029907: 144/40401 pixels in set
Frame 2, Scale 0.000447: 0/40401 pixels in set
Frame 3, Scale 0.000007: 579/40401 pixels in set
Frame 4, Scale 0.000000: 9226/40401 pixels in set
Due to floating point error, there might be some slight variation between the two. Anything beyond a small variation is indicative of an incorrect implementation.
To measure your speedup, simply divide the runtime of benchmark_baseline
with the runtime
of benchmark
when run on the same arguments. Because there are many more 61C students
than Hive machines, you will likely share resources with your classmates. This might affect
the measurement of your code’s speedup. We encourage you to use Hivemind
to help balance the load amongst the Hive machines.
In this homework, your grade will be determined based upon how fast your code runs relative to the reference solution given to you. We have also provided a minimal Gradescope autograder that measures the correctness of your solution. If your code does not pass the Gradescope autograder, it will receive a 0 for the homework.
We will be using the following function to determine your grade:
double grade(double speedup) {
if (0 <= speedup && speedup < 1) {
return 0;
} else if (1 <= speedup && speedup <= 14) {
return 1 - pow(speedup - 14, 2) / 169;
} else {
return 1;
}
}
For reference, here are the grades for some the speedup values:
Speedup | Grade |
---|---|
14x | 100% |
12x | ~98% |
10x | ~91% |
8x | ~79% |
6x | ~62% |
4x | ~41% |
2x | ~15% |
<1x | 0% |
We will test your program’s speed on the Hive machines. You should test on the Hive machines at a minimum to ensure everything is correct and working properly, but note that because of the size of the course the peak hours of the project right before it is due may not be an accurate reflection of your speedup if the Hive machines get too busy. As a result, starting early is your friend and similarly your tests will likely be more accurate at 3am than noon. Also, make sure to measure speedups on sufficiently large resolutions and frame counts; it should take the reference solution ~1 minute to finish executing on a set of parameters to be sure that overhead is not a large percentage of the execution time.
There is a Gradescope submission with a few correctness tests configured. Because of the hardware-specific nature of this homework, we will not be measuring speedup on Gradescope. Instead, we will periodically pull your latest submission from GitHub and run them on some reserved Hive machines. If you ever want to revert to a previous submission, make sure to create a new submission with the same code as the older one. There is no way for us to measure the speedup of anything besides the most recent submission.