Updates
- 11/18: Cleaning up unused header files. Files changed: calcDepthNaive.c,
check.c. To update to the latest files, use the following command:
\cp ~cs61c/proj/03/part2/{benchmark.c,calcDepthNaive.c,calcDepthNaive.h,calcDepthOptimized.h,check.c,depthMap.c,Makefile,qsubScript.sh,utils.c,utils.h} .
- 11/17: Updated correctness check. Files changed: utils.c.
- 11/16: Project 3 Part 2 release.
Clarifications/Reminders
- Start Early!
- This project should be done on the hive machines. The code will not compile on non-x86 machines and will likely perform differently on other machines not configured similarly to the hive machines.
- You are required to work with one partner for this project. You may share code with your partner, but ONLY your partner. The submit command will ask you for your partner's name, so only one partner needs to submit.
- Make sure you read through the project spec before starting.
- Please refrain from using hive26, hive27, or hive28 for anything other than PBS scheduling.
Goals
In Project 1, you wrote a program that would generate a depth map from stereo images. You then optimized this depth map generator in Project 3 Part 1. However, your Project 3 Part 1 code only improved the performance using SSE/SIMD instructions. In Part 2, you will further optimize the performance of your code by parallelizing your code using OpenMP.
Background
Refer to the Project 3 Part 1 website and the Project 1 website.
Architecture
Your code will be tested on the hive machines, so keep that in mind while choosing the values to use for certain optimizations. The following is detailed information about the hive workstations; you may or may not find it useful.
They are Dell Precision T5500 machines in a Xeon DP configuration on an Intel 5520 chipset. They are packing not one, but two Intel Xeon E5620 CPUs. Each CPU has 4 cores, for a total of 8 cores, and each core runs at a clock rate of 2.40 GHz. This system also features 6GB of 1333MHz DDR3 ECC Registered memory in a 6x1GB hex-channel configuration (note that the Intel 5520 is NUMA, so 3x1GB is local to one physical processor and the other 3x1GB is local to the second physical processor).
Each core comes with 16 general purpose integer registers (though a few are reserved such as the stack pointer) and 8 floating point registers. They also have 16 XMM registers per core for use with the SIMD intrinsics.
All caches deal with block sizes of 64 bytes. Each core has an L1 instruction and L1 data cache, both of 32 kibibytes. A core also has a unified L2 cache (same cache for instructions and data) of 256 kibibytes. The 4 cores on a microprocessor share an L3 cache of 8 mebibytes. The associativities for the L1 instruction, L1 data, unified L2, and shared L3 caches are 4-way, 8-way, 8-way, and 16-way set associative, respectively.
Part 2 (Due 11/23 @ 23:59:59)
Optimize your depth map generator using any of the techniques you have learned, including OpenMP. For full credit, the program needs to achieve an average performance of 20 GFlop/s for varying sizes of images. This requirement is worth 35 pts. Your output should remain correct, and we have provided utilities for you to check the correctness of your code. Please remember, however, that the provided check is not a guarantee of correctness, and we expect you to test your code yourself. Code that is incorrect will be considered to run at 0 GFlop/s for grading purposes.
Your code will also be tested with varying parameters (nothing too small) and must achieve an average performance of 15 GFlop/s. This requirement is worth 15 pts.
Getting started
Copy the files in the directory ~cs61c/proj/03/part2/
to your proj3/part2
directory, by entering the following command:
mkdir -p ~/proj3/part2
cp -r ~cs61c/proj/03/part2/* ~/proj3/part2
cp -r ~/proj3/part1/calcDepthOptimized.c ~/proj3/part2
You will need to add the following OpenMP headers to your calcDepthOptimized.c
to enable OpenMP support:
// include OpenMP
#if !defined(_MSC_VER)
#include <pthread.h>
#endif
#include <omp.h>
The only file you need to modify and submit is calcDepthOptimized.c. It is your job to implement an optimized version of calcDepthNaive in calcDepthOptimized. You should copy your code from part 1 and replace the provided calcDepthOptimized.c using the commands above.
The rest of the files are part of the framework. It may be helpful to look at and modify some of the other files to more thoroughly test your code. A description of these files is provided below:
- Makefile: Defines all the compilation commands.
- check.c: Tests calcDepthOptimized on randomly generated inputs of various sizes and compares the results with calcDepthNaive.
- benchmark.c: Tests calcDepthOptimized on increasingly large images and reports the performance level in GFlop/s.
- depthMap.c: Loads bitmap images and calls calcDepthOptimized() to quickly calculate the depth map.
- utils.c: Defines bitmap loading, printing, and saving functions along with other utilities.
- qsubScript.sh: Script for submitting jobs to the PBS queue.
- test/: Contains a set of images for testing. You can run your code on these images with depthMap.
You are free to define and implement additional helper functions, but if you want them to be run when we grade your project, you must include them in calcDepthOptimized.c
. Changes you make to any other files will be overwritten when we grade your project.
Test your changes by compiling your files with make and running the check or benchmark executable:

make
./check
./benchmark
Scoring
Here is the breakdown of how points will be awarded for partial credit. Note that we will not necessarily be using the same sizes provided in the benchmark, however, we will be using similar sizes (nothing far bigger or smaller). Note that despite the fact that we will be benchmarking with similar sizes, this does not exempt you from having working code for smaller images, features, and maximum displacements - for these cases we will not be measuring performance but checking correctness only.
Gflop/s     | 5.5 | 7.5 | 10 | 12.5 | 15 | 17.5 | 20 |
Point Value | 0   | 5   | 14 | 23   | 28 | 32   | 35 |
Intermediate Gflop/s point values are linearly interpolated based on the point values of the two neighboring columns. Note that heavy emphasis is placed on the middle Gflop/s values, with the last couple being worth relatively few points. So don't sweat it if you are having trouble squeezing those last few floating point operations out of the machine!
Here are some suggestions for improving the performance of your code:
- Factor out common calculations and keep the result.
- Avoid recalculating values you have already calculated.
- Unroll loops so that you can use SSE and perform 4 operations at a time.
- Don't be afraid to restructure the loops or change the ordering of some operations.
- Padding the matrices may give better performance for odd sized matrices.
- Use OpenMP to multithread your code.
- Avoid forking and joining too many times, and avoid critical sections, barriers, and single sections.
- Avoid false sharing and data dependencies between processors.
Debugging and Testing
We have added a debug target to the makefile to assist in debugging. Release/optimized code is difficult for GDB to debug as the compiler is permitted to restructure your source code. While adding in print statements can be helpful, GDB and valgrind can make debugging a lot quicker. To use the debug target, you can run the following commands (followed by running GDB, valgrind, or whatever debugging tools you may find helpful):
make debug
While you are working on the project, we encourage you to keep your code under version control. However, if you are using a hosted repository, please make sure that it is private, or else it could be viewed as a violation of academic integrity.
Before you submit, make sure you test your code on the hive machines. We will be grading your code there. IF YOUR PROGRAM FAILS TO COMPILE, YOU WILL AUTOMATICALLY GET A ZERO. There is no excuse not to test your code.
Submission
The full proj3-2 is due Sunday 11/23 @ 23:59:59. Please make sure only ONE partner submits. To submit proj3-2, enter in the following:
submit proj3-2
Your changes should be limited to calcDepthOptimized.c
. If you have any issues submitting please e-mail your TA. Good luck!
Portable Batch System (PBS)
For this project we have implemented a batch queue system to ensure you can get an accurate measure of your code's performance even while the hive machines are bogged down with traffic. A batch queue system (with a scheduler) attempts to load-balance a set of machines so that computational resources are used efficiently. In this case, we installed the PBS scheduler on hive26, hive27, and hive28.
Important: These machines have been restricted to just cs61c account logins and they are to be used only for testing code. In order to keep these machines completely free for benchmarking, you are forbidden from developing on hive26, hive27, and hive28 throughout the duration of Project 3. Only log on to these servers when you are ready to submit a job to the queue.
You will submit jobs to the queue by running a short shell script we have provided. Read the following instructions carefully to make sure you can properly test your program.
- Log into hive26, hive27, or hive28 via ssh.
- Type make clean, then recompile your project using the Makefile. This step may be unnecessary if you were developing on the other hive machines.
- Edit the file qsubScript.sh with your information. There are only 2 lines you need to edit:
  - At the top of the file, fill in your e-mail here to get a notification when the job begins and completes:
    #PBS -M [email]@berkeley.edu
  - At the bottom of the file, put whatever command you want your job to run. The default line is ./benchmark > bench-output.txt, which will run the benchmark executable and place the output in a text file named bench-output.txt.
- You can now submit your job to the PBS scheduler:
  qsub qsubScript.sh
This command will respond with the job number followed by the hostname of the PBS server machine. "16.hive26.cs.berkeley.edu" means that job number 16 has been submitted to the PBS server running on hive26.cs.berkeley.edu.
- Once you have submitted your job, you can check on its status by using the "qstat" command. The output will look something like this:

  qstat
  Job id                    Name             User            Time Use S Queue
  ------------------------- ---------------- --------------- -------- - -----
  12.hive26                 CS61C            cs61c-ti        00:00:14 C batch
  16.hive26                 CS61C            cs61c-ti        00:00:27 C batch
  17.hive26                 CS61C            cs61c-ti        0        R batch
This output indicates that job 12.hive26 was submitted by "cs61c-ti" and is now complete ("C"). Other useful state indicators are "Q" (queued) and "R" (running), so job 17.hive26 is still running in the example above.
- Once your job is complete, the output of your job will be found in the file you specified in qsubScript.sh (default bench-output.txt).
Competition
There will be a performance competition for this project. The winners of this competition will receive recognition and EPA, and will be announced in class on the last day. The details will be announced later. The competition will be due near the end of the class (in December).