CS61C Fall 2018 Project 2

Clarifications/Reminders

Start Early!
Add features incrementally, check, and benchmark frequently!
Make sure you read through the project spec before starting.
This project should be done on the hive machines. The code will not compile on non-x86 machines and will likely perform differently on other machines not configured similarly to the hive machines.
We will provide support for SSE SIMD instructions. You are welcome to use AVX instructions though not all will work on the hive machines.
You are required to work with one partner for this project. You may share code with your partner, but ONLY your partner. The submit command will ask you for your partner's name, so only one partner needs to submit.
Remember to take a look at labs 6, 7, 9 and the corresponding discussions and lectures for ideas on obtaining higher speedup ratios.
The starter code we provide is correct.

Goals

In Project 1, you wrote a program that would generate a depth map from stereo images. However, the code for project 1 was fairly slow, and thus was impossible to practically use on larger images (or in real time due to latency), such as those you would see in the real world or in production. This project will let you apply the performance optimization techniques you learned to write an optimized version of calc_depth that should be able to handle much larger images in more reasonable running times.

Background

Your objective is to parallelize and optimize a depth map generator very similar to the one you wrote for Project 1. While the algorithm is largely the exact same there are a few key differences to be aware of:

The first difference is that the way we represent grayscale images. For this project, we will use floats on the interval [0, 255] instead of the chars we used in Project 1. This representation is advantageous because it allows you to use float SSE intrinsics that don't have char counterparts. Note that the input images to depth_map are still .bmp files. This is because we handle the conversion from unsigned char to and from float values for you.
The other difference is the way we calculate displacement. We are no longer limited by the range of an unsigned char so we will use absolute displacement instead of normalized displacement. Absolute displacement is given by sqrt(dx * dx + dy * dy).

Architecture

Your code will be tested on the hive machines, so keep that in mind while choosing the values to use for certain optimizations. The following is detailed information about the hive workstations; you may or may not find it useful.

The hive machines are Dell Optiplex 9020 workstations equipped with the Intel quad-core i7-4770 CPU (HT 3.4GHz turbo, 8MB, w/ HD graphics 4600). Each CPU has 4 physical cores, but since the processor is also equipped with hyperthreading so the operating system is presented with 8 virtual (logical) cores. The processors also support a wide range of SIMD instruction sets including SSE and AVX. This system also features 32GB RAM (4x8GB).

More CPU details such as the cache configuration are available at 7-CPU.

Measuring Performance

For this project, we will measuring performance in terms of the average net speedup between the naive version of the code and the optimised version of the code. The time that the algorithms take is easily calculated by measuring the difference between timestamps before and after the algorithm is ran (in this case we actually use the microprocessor's time stamp counter, which is more accurate than the system clock, but this detail is irrelevant). Once the time it takes for both naive and optimised have been calculated, we simply take the ratio between the two to get the net speedup ratio.

Part 1 (Due 10/30 @ 23:59:59)

Optimize your depth map generator using any of the techniques you learned except OpenMP (or any other threading library, such as pthreads, or otherwise use multiple cores). Your output should remain correct, and we have provided utilities for your to check the correctness of your code. Code that is incorrect will receive no credit.

Getting Started

As with project 1, we will be creating a new project repository on GitHub Classroom and pulling the starter code from GitHub. This project, however, requires a partner. If you don't have anyone in mind, talk to your TA and they can help match you with another student.

Once you've found a project partner, complete the Project 2 Repository Registration Form which will guide you through the necessary steps to create a team repository for this project.

Once you've cloned your repository, make sure to setup the starter code remote repository and pull to make sure all changes are up to date.

$ git remote add starter https://github.com/61c-teach/fa18-proj2-starter.git
$ git pull starter master

The only file you need to modify and submit is calc_depth_optimized.c. It is your job to implement an optimized version of calc_depth_naive in calc_depth_optimized.

The skeleton includes a naive (unoptimized) solution for calc_depth_naive in calc_depth_naive.c. You may find it useful to start with this file as a base, or to start with a clean slate and use the naive solution as a reference. (Feel free to use your own project 1 code since you're more familiar with its implementation; however you will need to make some changes to get it to work).

The rest of the files are part of the framework. It may be helpful to look at and modify some of the other files to more thoroughly test your code. A description of these files is provided below:

Makefile: Defines all the compilation commands.
check.c: Tests calc_depth_optimized on various size randomly generated inputs and compares the results with calc_depth_naive
benchmark.c: Tests calc_depth_optimized on increasingly large images and reports the speedup ratio.
depth_map.c: Loads bitmap images and calls calc_depth_optimized() to quickly calculate the depth map.
utils.c: Defines bitmap loading, printing, and saving functions along with other utilities.
test: Contains the a set of images for testing. You can run your code on these images with depth_map.

You are free to define and implement additional helper functions, but if you want them to be run when we grade your project, you must include them in calc_depth_optimized.c. Changes you make to any other files will be overwritten when we grade your project.

Test your changes by compiling your files with make and running the benchmark or check executable.

$ make
$ ./check
$ ./benchmark

If you encounter any errors during the compilation process, make clean will help empty your directory. Then, run make again.

Scoring

Your score for part 1 of the project will be linearly interpolated based on speedup ratio between 1x (no credit) and 2.25x (full credit) on the non-odd cases. This can be achieved using SSE instructions only; AVX instructions are not necessary.

Note that we will not necessarily be using the same sizes provided in the benchmark, however, we will be using similar sizes (nothing far bigger or smaller). Despite the fact that we will be benchmarking with similar sizes, this does not exempt you from having working code for smaller images, features, and maximum displacements. For these cases we will not be measuring performance but checking correctness only.

Here are some suggestions for improving the performance of your code:

Factor out common calculations and keep the result.
Avoid recalculating values you have already calculated.
Avoid expensive operations like divisons or square roots, when unnecessary.
Unroll loops so that you can use SSE and perform 4 operations at a time.
Restructure the loops or change the order of some operations.
Padding matrices may give better performance for odd sized matrices.

Debugging and Testing

We have added a debug target to the makefile to assist in debugging. Release/optimized code is difficult for GDB to debug as the compiler is permitted to restructure your source code. While adding in print statements can be helpful, GDB and valgrind can make debugging a lot quicker. To use the debug target, you can run the following commands (followed by running GDB, valgrind, or whatever debugging tools you may find helpful):

$ make debug

Before you submit, make sure you test your code on the hive machines. We will be grading your code there. IF YOUR PROGRAM FAILS TO COMPILE, YOU WILL AUTOMATICALLY GET A ZERO. There is no excuse not to test your code.

Portable Batch System

For this project we have implemented a batch queue system to ensure you can get an accurate measure of your codes performance even while the hive machines are bogged down with traffic. A batch queue system (with a scheduler) attempts to load-balance a set of machines so that the computational resources are balanced efficiently. In this case, we installed the PBS scheduler on hive28, hive29 and hive30.

Important: These machines have been restricted to just cs61c account logins and they are to be used only for testing code. In order to keep these machines completely free for benchmarking, you are forbidden from developing on hive28, hive29, and hive30 throughout the duration of Project 2. Only log on to these servers when you are ready to submit a job to the queue. Long-running jobs may be killed without additional forewarning.

Follow the instructions below to submit your job to the queue.

Login to hive30 with your cs61c account.
Fill in your email address in the queue.sh script on the line, #PBS -M [email]@berkeley.edu.
Submit your job to the queue by running make queue. You'll receive an email from PBS to notify you of the new job, and another email when the job finihes execution. At any time, you can check your job status using the qstat command.
Once your job finishes execution, the output that would have been printed to the terminal will be written to the result.out file.

Submission

Submitting is a two step process. You will need to submit on both the hive machine through glookup (where we will actually grade the submission) and tag your submission on GitHub in case any issues arise. The full proj2-1 is due Tuesday 10/30 @ 23:59:59. Please make sure only ONE partner submits.

Your changes should be limited to calc_depth_optimized.c.

To submit the full proj2-1, enter in the following on the hive machine:

$ submit proj2-1

To tag the commit of your submission on GitHub, run the following commands.

$ cd ~/fa18-proj2-[TEAM NAME]
$ git add -u # should add all unmodified files in proj2 directory
$ git commit -m "Project 2-1 submission" # or any other commit msg
$ git tag -f "proj2-1-sub" # The tag MUST be "proj2-1-sub". Failure to do so will result in a loss of credit.
$ git push origin master --tags # Note the --tags must be included to push tags to github

Part 2 (Due 11/6 @ 23:59:59)

Optimize your depth map generator using any of the techniques you learned including OpenMP. Your output should remain correct, and we have provided utilities for you to check the correctness of your code. Code that is incorrect will receive no credit.

Getting Started

We'll be using the same code as in part 1. However, you will need to add the following OpenMP headers to your calc_depth_optimized.c to enable OpenMP support:

#if !defined(_MSC_VER)
#include <pthread.h>
#endif
#include <omp.h>

Then, modify the CFLAGS and LFLAGS in the Makefile to enable OpenMP support.

CFLAGS = -O3 -DNDEBUG -g0 -std=c99 -Wall -march=haswell -fopenmp -pthread
LFLAGS = -lm -lpthread

The only file you need to modify and submit is calc_depth_optimized.c. It is your job to implement an optimized version of calc_depth_naive in calc_depth_optimized.

Scoring

Your score for part 2 of the project will be linearly interpolated based on speedup ratio between 2.25x (no credit) and 6x (full credit) on the non-odd cases.

Here are some suggestions for improving the performance of your code:

Use OpenMP to multithread your code.
Avoid forking and joining too many times, and avoid critical sections, barriers, and single sections.
Avoid false sharing and data dependencies between processors.

Submission

proj2-2 is due Tuesday 11/6 @ 23:59:59. Please make sure only ONE partner submits. To submit proj2-2, enter in the following:

Your changes should be limited to calc_depth_optimized.c.

$ submit proj2-2

Likewise, you will need to tag proj2-2.

$ git tag -f "proj2-2-sub" # The tag MUST be "proj2-2-sub". Failure to do so will result in a loss of credit.
$ git push origin master --tags # Note the --tags must be included to push tags to github

Optional Competition (Due 11/27 @ 11:59:59 PM)

Optimize your depth map generator using any techniques you know.

The submissions with the highest performance will receive extra credit and winners will be annnounced in class. We will be running your code on a variety of images and features of various different sizes, which are not guaranteed to be similar to those in benchmark.c.

We may announce additional tiers based on the strategies implemented. Besides simply awarding the most efficient submissions overall, the most efficient submissions which use only techniques discussed in the class, for example, might also receive recognition.

Getting Started

As with part 1 and part 2, the only file you need to modify and submit is calc_depth_optimized.c.

Submission

Competition submissions are due Tuesday 11/27 @ 11:59:59 PM. Late submissions will not be accepted, and please make sure only ONE partner submits. To submit proj2-competition, enter in the following:

$ submit proj2-competition

Likewise, you will need to tag proj2-competition.

$ git tag -f "proj2-competition-sub" # The tag MUST be "proj2-competition-sub". Failure to do so will result in a loss of credit.
$ git push origin master --tags # Note the --tags must be included to push tags to github

CS61C Fall 2018 Project 2: Performance Optimization

Clarifications/Reminders

Goals

Background

Architecture

Measuring Performance

Part 1 (Due 10/30 @ 23:59:59)

Getting Started

Scoring

Debugging and Testing

Portable Batch System

Submission

Part 2 (Due 11/6 @ 23:59:59)

Getting Started

Scoring

Submission

Optional Competition (Due 11/27 @ 11:59:59 PM)

Getting Started

Submission