CS61C Fall 2014 Project 3

Updates

11/18: Cleaning up unused header files. Files changed: calcDepthNaive.c, check.c. To update to the latest files, use the following command:

 \cp ~cs61c/proj/03/part2/{benchmark.c,calcDepthNaive.c,calcDepthNaive.h,calcDepthOptimized.h,check.c,depthMap.c,Makefile,qsubScript.sh,utils.c,utils.h} .

11/17: Updated correctness check. Files changed: utils.c.
11/16: Project 3 Part 2 release.

Clarifications/Reminders

Start Early!
This project should be done on the hive machines. The code will not compile on non-x86 machines and will likely perform differently on other machines not configured similarly to the hive machines.
You are required to work with one partner for this project. You may share code with your partner, but ONLY your partner. The submit command will ask you for your partner's name, so only one partner needs to submit.
Make sure you read through the project spec before starting.
Please refrain from using hive26, hive27, or hive28 for anything other than PBS scheduling.

Goals

In Project 1, you wrote a program that would generate a depth map from stereo images. You then optimized this depth map generator in Project 3 Part 1. However, your Project 3 Part 1 code only improved the performance using SSE/SIMD instructions. In Part 2, you will further optimize the performance of your code by parallelizing your code using OpenMP.

Background

Refer to the Project 3 Part 1 website and the Project 1 website.

Architecture

Your code will be tested on the hive machines, so keep that in mind while choosing the values to use for certain optimizations. The following is detailed information about the hive workstations; you may or may not find it useful.

They are Dell Precision T5500 machines in a Xeon DP configuration on an Intel 5520 chipset. They are packing not one, but two Intel Xeon E5620 CPUs. Each has 4 cores, for a total of 8 processors, and each core runs at a clock rate of 2.40 GHz. This system also features 6GB of 1333MHz DDR3 ECC Registered memory, in a 6x1GB hex-channel configuration (note that the Intel 5520 is NUMA so 3x1GB is local to one physical processor and the other 3x1GB is local to the second physical processor).

Each core comes with 16 general purpose integer registers (though a few are reserved such as the stack pointer) and 8 floating point registers. They also have 16 XMM registers per core for use with the SIMD intrinsics.

All caches deal with block sizes of 64 bytes. Each core has an L1 instruction and L1 data cache, both of 32 Kibibytes. A core also has a unified L2 cache (same cache for instructions and data) of 256 Kibibytes. The 4 cores on a microprocessor share an L3 cache of 8 Mibibytes. The associativities for the L1 instruction, L1 data, unified L2, and shared L3 caches are 4-way, 8-way, 8-way, and 16-way set associative, respectively.

Part 2 (Due 11/23 @ 23:59:59)

Optimize your depth map generator using any of the techniques you learned including OpenMP. For full credit, the program needs to achieve an average performance of 20 GFlop/s for varying sizes of images. This requirement is worth 35 pts. Your output should remain correct, and we have provided utilities for your to check the correctness of your code. Please remember, however, that the code provided is not a guarantee of correctness and we expect you to test your code yourself. Code that is incorrect will be considered to run at 0 GFlop/s for grading purposes.

Your code will also be tested with varying parameters (nothing too small) and must achieve an average performance of 15 GFlop/s. This requirement is worth 15 pts.

Getting started

Copy the files in the directory ~cs61c/proj/03/part2/ to your proj3/part2 directory, by entering the following command:

mkdir -p ~/proj3/part2
cp -r ~cs61c/proj/03/part2/* ~/proj3/part2
cp -r ~/proj3/part1/calcDepthOptimized.c ~/proj3/part2

You will need to add the following OpenMP headers to your calcDepthOptimized.c to enable OpenMP support:

// include OpenMP
#if !defined(_MSC_VER)
#include <pthread.h>
#endif
#include <omp.h>

The only file you need to modify and submit is calcDepthOptimized.c. It is your job to implement an optimized version of calcDepthNaive in calcDepthOptimized.

You should copy your code from part 1 and replace the provided calcDepthOptimized.c using the commands above.

The rest of the files are part of the framework. It may be helpful to look at and modify some of the other files to more thoroughly test your code. A description of these files is provided below:

Makefile: Defines all the compilation commands.
check.c: Tests calcDepthOptimized on various size randomly generated inputs and compares the results with calcDepthNaive
benchmark.c: Tests calcDepthOptimized on increasingly large images and reports the performance level in GFlop/s.
depthMap.c: Loads bitmap images and calls calcDepthOptimized() to quickly calculate the depth map.
utils.c: Defines bitmap loading, printing, and saving functions along with other utilities.
qsubScript.sh: Script for submitting jobs to PBS queue.
test/: Contains the a set of images for testing. You can run your code on these images with depthMap.

You are free to define and implement additional helper functions, but if you want them to be run when we grade your project, you must include them in calcDepthOptimized.c. Changes you make to any other files will be overwritten when we grade your project.

Test your changes by compiling your files with make and running the benchmark or check executable.

make
./check
./benchmark

Scoring

Here is the breakdown of how points will be awarded for partial credit. Note that we will not necessarily be using the same sizes provided in the benchmark, however, we will be using similar sizes (nothing far bigger or smaller). Note that despite the fact that we will be benchmarking with similar sizes, this does not exempt you from having working code for smaller images, features, and maximum displacements - for these cases we will not be measuring performance but checking correctness only.

Gflop/s	5.5	7.5	10	12.5	15	17.5	20
Point Value	0	5	14	23	28	32	35

Intermediate Glop/s point values are linearly interpolated based on the point values of the two determined neighbors. Note that heavy emphasis is placed on the middle Gflop/s, with the last couple being worth relatively few points. So don't sweat it if you are having trouble squeezing those last few floating point operations out of the machine!

Here are some suggestions for improving the performance of your code:

Factor out common calculations and keep the result.
Avoid recalculating values you have already calculated.
Unroll loops so that you can use SSE and perform 4 operations at a time.
Don't be afraid to restructure the loops or change the ordering of some operations.
Padding the matrices may give better performance for odd sized matrices.
Use OpenMP to multithread your code.
Avoid forking and joining too many times, and avoid critical sections, barriers, and single sections.
Avoid false sharing and data dependencies between processors.

Debugging and Testing

We have added a debug target to the makefile to assist in debugging. Release/optimized code is difficult for GDB to debug as the compiler is permitted to restructure your source code. While adding in print statements can be helpful, GDB and valgrind can make debugging a lot quicker. To use the debug target, you can run the following commands (followed by running GDB, valgrind, or whatever debugging tools you may find helpful):

make debug

While you are working on the project, we encourage you to keep your code under version control. However, if you are using a hosted repository, please make sure that it is private, or else it could be viewed as a violation of academic integrity.

Before you submit, make sure you test your code on the hive machines. We will be grading your code there. IF YOUR PROGRAM FAILS TO COMPILE, YOU WILL AUTOMATICALLY GET A ZERO . There is no excuse not to test your code.

Submission

The full proj3-2 is due Sunday 11/23 @ 23:59:59. Please make sure only ONE partner submits. To submit proj3-2, enter in the following:

submit proj3-2

Your changes should be limited to calcDepthOptimized.c. If you have any issues submitting please e-mail your TA. Good luck!

Portable Batch System (PBS)

For this project we have implemented a batch queue system to ensure you can get an accurate measure of your codes performance even while the hive machines are bogged down with traffic. A batch queue system (with a scheduler) attempts to load-balance a set of machines so that the computational resources are balanced efficiently. In this case, we installed the PBS scheduler on hive26, hive27, and hive28.

Important: These machines have been restricted to just cs61c account logins and they are to be used only for testing code. In order to keep these machines completely free for benchmarking, you are forbidden from developing on hive26, hive27, and hive28 throughout the duration of Project 3. Only log on to these servers when you are ready to submit a job to the queue.

You will submit jobs to the queue by running a short shell script we have provided. Read the following instructions carefully to make sure you can properly test your program.

Log into hive26, hive27, or hive28 via ssh.
Type make clean, then recompile your project using the Makefile. This step may be unnecessary if you were developing on the other hive machines.
Edit the file qsubScript.sh with your information. There are only 2 lines you need to edit:
- At the top of the file, fill in your e-mail here to get a notification when the job begins and completes: #PBS -M [email]@berkeley.edu
- At the bottom of the file, put whatever command you want your job to run: The default line is ./benchmark > bench-output.txt which will run the benchmark executable and place the output in a text file named bench-output.txt.
You can now submit your job to the PBS.
```
qsub qsubScript.sh
```
This command will respond with the job number followed by the hostname of the PBS server machine. "16.hive26.cs.berkeley.edu" means that job number 16 has been submitted to the PBS server running on hive26.cs.berkeley.edu.

Once you have submitted your job, you can check on the status by using the "qstat" command. The output will look something like this:

qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
12.hive26                 CS61C            cs61c-ti        00:00:14 C batch
16.hive26                 CS61C            cs61c-ti        00:00:27 C batch
17.hive26                 CS61C            cs61c-ti               0 R batch

This output indicates that the job id 12.hive26 was submitted by "cs61c-ti" and is now complete "C". Other useful state indicators are "Q" (queued) and "R" (running). So job 17.hive26 is still running in the example above.

Once your job is complete, the output of your job will be found in the file you specified in qsubScript.sh (default bench-output.txt)

Competition

There will be a performance competition for this project. The winners of this competition will receive recognition and EPA, and will be announced in class on the last day. The details will be announced later. The competition will be due near the end of the class (in December).

CS61C Fall 2014 Project 3 Part 2: Performance Optimization - OpenMP