CS61C Fall 2014 Project 3 Part 1: Performance Optimization - SSE/SIMD

TAs: Andrew Luo, David Adams, Fred Hong
Part 1: Due 11/18 @ 23:59:59

Updates

Clarifications/Reminders

Goals

In Project 1, you wrote a program that would generate a depth map from stereo images. However, the Project 1 code was fairly slow, making it impractical to use on larger images (or in real time, due to latency), such as those you would see in the real world or in production. This project will let you apply the performance optimization techniques you learned to write an optimized version of calcDepth that should be able to handle much larger images in more reasonable running times.

Background

Your objective is to parallelize and optimize a depth map generator very similar to the one you wrote for Project 1. While the algorithm is largely the same, there are a few key differences to be aware of:

Architecture

Your code will be tested on the hive machines, so keep that in mind while choosing the values to use for certain optimizations. The following is detailed information about the hive workstations; you may or may not find it useful.

They are Dell Precision T5500 machines in a Xeon DP configuration on an Intel 5520 chipset. They are packing not one, but two Intel Xeon E5620 CPUs. Each has 4 cores, for a total of 8 processors, and each core runs at a clock rate of 2.40 GHz. This system also features 6GB of 1333MHz DDR3 ECC Registered memory, in a 6x1GB hex-channel configuration (note that the Intel 5520 is NUMA so 3x1GB is local to one physical processor and the other 3x1GB is local to the second physical processor).

Each core comes with 16 general purpose integer registers (though a few, such as the stack pointer, are reserved) and 8 floating point registers. Each core also has 16 XMM registers for use with the SIMD intrinsics.

All caches deal with block sizes of 64 bytes. Each core has an L1 instruction and L1 data cache, both of 32 kibibytes. A core also has a unified L2 cache (same cache for instructions and data) of 256 kibibytes. The 4 cores on a microprocessor share an L3 cache of 8 mebibytes. The L1 instruction, L1 data, unified L2, and shared L3 caches are 4-way, 8-way, 8-way, and 16-way set associative, respectively.

Measuring Performance

For this project, we will be measuring performance in GFlop/s (billions of floating point operations per second), an industry-standard metric. Two values are needed to compute it: elapsed time and the number of floating point operations performed by the algorithm. Time is easily obtained by taking the difference between timestamps before and after the algorithm runs (in this case we actually use the microprocessor's time stamp counter, which is more accurate than the system clock, but this detail is irrelevant). Counting the number of floating point operations is slightly trickier. For this project, the count is tallied manually by calcDepthNaive.c and stored through a pointer that benchmark.c uses to calculate the GFlop/s for your optimized code.

Part 1 (Due 11/18 @ 23:59:59)

Optimize your depth map generator using any of the techniques you learned, except OpenMP or any other threading library (such as pthreads) or anything else that uses multiple cores. For full credit, the program needs to achieve an average performance of 5.5 GFlop/s across varying image sizes. This requirement is worth 35 pts. Your output must remain correct, and we have provided utilities for you to check the correctness of your code. Please remember, however, that the provided checks are not a guarantee of correctness, and we expect you to test your code yourself. Code that is incorrect will be considered to run at 0 GFlop/s for grading purposes.

Your code will also be tested with varying parameters (nothing too small) and must achieve an average performance of 4.0 GFlop/s. This requirement is worth 15 pts.

As a reminder, in Part 1 you are not permitted to use OpenMP or any other threading library (or otherwise use multiple cores).  Code that does any of the aforementioned will receive no credit.

Getting started

Copy the files in the directory ~cs61c/proj/03/part1/ to your proj3/part1 directory, by entering the following command:

mkdir -p ~/proj3/part1
cp -r ~cs61c/proj/03/part1/* ~/proj3/part1

The only file you need to modify and submit is calcDepthOptimized.c. It is your job to implement an optimized version of calcDepthNaive in calcDepthOptimized.

The skeleton includes a naive (unoptimized) solution for calcDepth in calcDepthNaive.c. You may find it useful to start with this file as a base OR to start with a clean slate and use the naive solution as a reference (feel free to use your own project 1 code - however you will need to make some changes to get it to work). If you want a refresher on the details of the algorithm we are implementing or how the edge cases are handled, refer to the Project 1 website.

The rest of the files are part of the framework. It may be helpful to look at and modify some of the other files to more thoroughly test your code. A description of these files is provided below:

You are free to define and implement additional helper functions, but if you want them to be run when we grade your project, you must include them in calcDepthOptimized.c. Changes you make to any other files will be overwritten when we grade your project.

Test your changes by compiling your files with make and running the benchmark or check executable.

make
./check
./benchmark

Scoring

Here is the breakdown of how points will be awarded for partial credit. Note that we will not necessarily be using the same sizes provided in the benchmark; however, we will be using similar sizes (nothing far bigger or smaller). We will also ensure that the feature width/height and maximum displacement are at least 4 for benchmarking purposes (note that this does not exempt you from having working code for smaller images, features, and maximum displacements; for these cases we will not be measuring performance, only checking correctness).

GFlop/s      1.5   2.0   3.0   4.0   4.5   5.0   5.5
Point Value    0     5    14    23    28    32    35

Intermediate GFlop/s values are linearly interpolated between the point values of the two nearest benchmark entries and then rounded to the nearest integer. Note that heavy emphasis is placed on the middle GFlop/s values, with the last couple being worth relatively few points. So don't sweat it if you are having trouble squeezing those last few floating point operations out of the machine!

Here are some suggestions for improving the performance of your code:

Debugging and Testing

We have added a debug target to the makefile to assist in debugging. Optimized (release) code is difficult for GDB to work with, as the compiler is permitted to restructure your code. While adding print statements can be helpful, GDB and valgrind can make debugging a lot quicker. To use the debug target, run the following command (followed by GDB, valgrind, or whatever debugging tools you find helpful):

make debug

While you are working on the project, we encourage you to keep your code under version control. However, if you are using a hosted repository, please make sure that it is private, or else it could be viewed as a violation of academic integrity.

Before you submit, make sure you test your code on the hive machines. We will be grading your code there. IF YOUR PROGRAM FAILS TO COMPILE, YOU WILL AUTOMATICALLY GET A ZERO. There is no excuse not to test your code.

Submission

The full proj3-1 is due Tuesday 11/18 @ 23:59:59. Please make sure only ONE partner submits. To submit proj3-1, enter in the following:

submit proj3-1

Your changes should be limited to calcDepthOptimized.c. If you have any issues submitting please e-mail your TA. Good luck!

Portable Batch System (PBS)

For this project we have set up a batch queue system to ensure you can get an accurate measure of your code's performance even while the hive machines are bogged down with traffic. A batch queue system (with a scheduler) load-balances a set of machines so that computational resources are used efficiently. In this case, we installed the PBS scheduler on hive26, hive27, and hive28.

Important: These machines have been restricted to just cs61c account logins and they are to be used only for testing code. In order to keep these machines completely free for benchmarking, you are forbidden from developing on hive26, hive27, and hive28 throughout the duration of Project 3. Only log on to these servers when you are ready to submit a job to the queue.

You will submit jobs to the queue by running a short shell script we have provided. Read the following instructions carefully to make sure you can properly test your program.

  1. Log into hive26, hive27, or hive28 via ssh.
  2. Type make clean, then recompile your project using the Makefile. This step may be unnecessary if you were developing on the other hive machines.
  3. Edit the file qsubScript.sh with your information. There are only 2 lines you need to edit:
    • At the top of the file, fill in your e-mail here to get a notification when the job begins and completes: #PBS -M [email]@berkeley.edu
    • At the bottom of the file, put whatever command you want your job to run: The default line is ./benchmark > bench-output.txt which will run the benchmark executable and place the output in a text file named bench-output.txt.
  4. You can now submit your job to the PBS.

    qsub qsubScript.sh

    This command will respond with the job number followed by the hostname of the PBS server machine. For example, "16.hive26.cs.berkeley.edu" means that job number 16 has been submitted to the PBS server running on hive26.cs.berkeley.edu.

  5. Once you have submitted your job, you can check on the status by using the "qstat" command. The output will look something like this:

    qstat
    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    12.hive26                 CS61C            cs61c-ti        00:00:14 C batch
    16.hive26                 CS61C            cs61c-ti        00:00:27 C batch
    17.hive26                 CS61C            cs61c-ti               0 R batch

    This output indicates that the job id 12.hive26 was submitted by "cs61c-ti" and is now complete "C". Other useful state indicators are "Q" (queued) and "R" (running). So job 17.hive26 is still running in the example above.

  6. Once your job is complete, the output of your job will be found in the file you specified in qsubScript.sh (by default, bench-output.txt).

Competition

There will be a performance competition for this project. The winners of this competition will receive recognition and EPA, and will be announced in class on the last day. The details will be announced later. The competition will be due near the end of the class (in December).