Project 4: CS61kaChow
Deadline: Friday, April 28, 11:59:59 PM PT
For this project, remember that the verb form of convolution is "convolve" not "convolute". Convolute means to complicate things, but in case you forget, here's a helpful quote:
"Convolute is what I do on exams, convolve is what you do on exams."
Babak Ayazifar, EE120 Professor
Content in scope for this project: Lectures 2428, Discussion 910, Homework 7, Labs 78
Setup: Git
This assignment can be done alone or with a partner.
You must complete this project on the hive machines (not your local machine). See Lab 0 if you need to set up the hive machines again.
Warning: Once you create a Github repo, you will not be able to change (add, remove, or swap) partners for this project, so please be sure of your partner before starting the project. You must add your partner on both galloc
and to every Gradescope submission.
If there are extenuating circumstances that require a partner switch (e.g. your partner drops the class, your partner is unresponsive), please reach out to us privately.

Visit Galloc. Log in and start the Project 4 assignment. This will create a GitHub repository for your work. This will create a GitHub repository for your work. If you have a partner, one partner should create a repo and invite the other partner to that repo. The other partner should accept the invite without creating their own repo.

Clone the repository on a hive machine.
(replace username
with your GitHub username)

Navigate to your repository:

Add the starter repository as a remote:
If you run into git
issues, please check out the common errors page.
Setup: Testing
The starter code does not come with any provided tests. To download the staff tests, run the following command:
Background
This section is a bit long, so feel free to skip it for now, but please refer to them as needed since they provide helpful background information for the tasks in this project.
Vectors
In this project, a vector is represented as an array of int32_t
s, or a int32_t *
.
Matrices
In this project, we provide a type matrix_t
defined as follows:
typedef struct matrix_t;
In matrix_t
, rows
represents the number of rows in the matrix, cols
represents the number of columns in the matrix, and data
is a 1D array of the matrix stored in rowmajor format (similar to project 2). For example, the matrix [[1, 2, 3], [4, 5, 6]]
would be stored as [1, 2, 3, 4, 5, 6]
.
bin
files
Matrices are stored in .bin
files as a consecutive sequence of 4byte integers (identical to project 2). The first and second integers in the file indicate the number of rows and columns in the matrix, respectively. The rest of the integers store the elements in the matrix in rowmajor order.
To view matrix files, you can run xxd e matrix_file.bin
, replacing matrix_file.bin
with the matrix file you want to read. The output should look something like this:
00000000: 00000003 00000003 00000001 00000002 ................
00000010: 00000003 00000004 00000005 00000006 ................
00000020: 00000007 00000008 00000009 ............
The leftmost column indexes the bytes in the file (e.g. the third row starts at the 0x20
th byte of the file). The dots on the right display the bytes in the file as ASCII, but these bytes don't correspond to printable ASCII characters so only dot placeholders appear.
The actual contents of the file are listed in 4byte blocks, 4 per row. The first row has the numbers 3 (row count), 3 (column count), 1 (first element), and 2 (second element). This is a 3x3 matrix with elements [1, 2, 3, 4, 5, 6, 7, 8, 9].
Task
In this project, we provide a type task_t
defined as follows:
typedef struct task_t;
Each task represents a convolution operation (we'll go into what convolution is later), and is uniquely identified by its path
. The path
member of the task_t
struct is the relative path to the folder containing the task.
Testing Framework
Tests in this project are located in the tests
directory. The starter code does not contain any tests, but it contains a script create_tests.py
for generating tests. If you would like to make a custom test, please add it to tools/custom_tests.py
. Feel free to use the tests we provide in staff_tests.py
as examples.
Once you define a custom test in custom_tests.py
, you can run the test using the make
commands provided in the task that you're currently working on, and the test will be generated for you based on the parameters you specify.
Directory Structure
Here's what folder should look like after running the example above:
61cproj4/ # Your root project 4 folder
src/ # Where your source files are located
tests/ # Where your test files are located
my_custom_test/ # Where a specific test is located
input.txt # The input file to pass to your program
task0/ # The first task
a.bin # Matrix A used in this task
b.bin # Matrix B used in this task
ref.bin # Reference output matrix generated by our script
out.bin # Output matrix generated by your program
task1/ # The second task, with the same folder structure
...
task9/ # The tenth (last) task
tools/ # Where some utility scripts are stored
Makefile # Makefile for compilation and running the program.
The path
member of the task_t
struct for task0
would be would be tests/my_custom_test/task0
.
Testing
This project uses a Makefile
for running tests. The Makefile
accepts one variables: TEST
. TEST
should be a path a folder in tests/
that contains an input.txt
. We'll provide more guidance on this later in the project.
For example, if you would like the run the test generated above and you're working on task 1, the command would be
Debugging
While cgdb
and valgrind
may not be as helpful as project 1, they are still helpful for fixing any errors you may encounter. When you run a test through make
, it prints out the actual command used to run the test. For example, the make
command in the above section would print out the following:
Command: ./convolve_naive_naive tests/my_custom_test/input.txt
If you would like to use cgdb
, add cgdb
before the command printed. Similarly, for valgrind
, add valgrind
before the command printed. For example, the commands would be
Task 1: Naive Convolutions
In this project, you will implement and optimize 2D convolutions, which is a mathematical operation that has a wide range of applications. Don't worry if you've never seen convolutions before, it can be simplified to a series of dot products (more on this in task 1.3).
Task 1.1 (Optional): Vector Dot Product
Note: This task is NOT required but you may find it useful to implement as convolutions involve calculating dot products.
Helpful resources: Background: Vectors
Recall from project 2, Dot product is a way of multiplying vectors together to determine how well they align. Dot product multiplies the corresponding entries of vectors together and returns the sum of all the products.
Implement the int dot(uint32_t n, int32_t* vec1, int32_t* vec2)
function in compute_naive.c
. This function calculates and returns the dot product of vectors A and B.
dot 

Arguments  uint32_t n 
The length of the vectors we are taking a dot product of. 
int32_t* vec1 
Pointer to the first vector.  
int32_t* vec2 
Pointer to the second vector.  
Return values  int 
The result of the dot product. 
Task 1.2 (Optional): 1D Convolutions
Note: This task is NOT required but you may find it useful to implement 1D convolutions before 2D convolutions. 1D convolution can be treated as a special case of 2D convolution, where the number of rows is always 1.
Conceptual Overview: 1D Convolutions
Note: For this project, we will only be covering discrete convolutions since the input vectors and matrices only have values at discrete indexes. This is in contrast to continuous convolutions where the input would have values for all real numbers.
1D convolution is a special way of multiplying two vectors together in order to determine how well they overlap. This leads to many different applications that you'll explore in this project, but first, here are the mechanics for how a 1D convolution is done:
1D convolution is when you want to convolve two vectors together, vector A and vector B. You begin by flipping the elements in vector B horizontally, making the last elements first and the first elements last. Once flipped, overlap vector A and B where vector B is the furthest left possible while still completely overlapping with vector A. Then perform a dot product of the overlapped elements, and this is the first element in your resultant vector. Slide vector B to the right by 1, and repeat this process. This continues until any part of vector B no longer overlaps with vector A. You have now convolved vector A and B.
You can assume that the width of vector B is less than or equal to the width of vector A.
Note: The output matrix has different dimensions. We'd recommend working out some examples to see how the dimensions of the output matrix is related to the input matrices.
Implementation
Helpful resources: Background: Matrices, Background: Task
Implement convolve
in compute_naive.c
for 1D convolutions. You may assume that both a_matrix
and b_matrix
have exactly 1 row and the number of columns in b_matrix
will smaller or equal to the number of columns in a_matrix
(that is, a_matrix
is 1
by m
and b_matrix
is 1
by n
, and n <= m
).
convolve 

Arguments  matrix_t* a_matrix 
Pointer to the first vector. 
matrix_t* b_matrix 
Pointer to the second vector.  
matrix_t** output_matrix 
Set *output_matrix to be a pointer to the resulting matrix. You must allocate memory for the resulting matrix in convolve 

Return values  int 
0 if successful, 1 if there are any errors 
Testing and Debugging
Helpful resources: Background: Testing Framework, Background: Directory Structure, Background: Testing, Background: Debugging
To run the staff provided tests for this task:
If you'd like to create additional tests, please refer to the testing framework section for creating tests as well as the testing and debugging sections for instructions on running tests.
Make sure to create tests of different sizes and dimensions! The autograder will test a wide range of sizes for both correctness and speedup.
Task 1.3: Convolutions
Conceptual Overview: 2D Convolutions
Note: For this project, we will only be covering discrete convolutions since the input vectors and matrices only have values at discrete indexes. This is in contrast to continuous convolutions where the input would have values for all real numbers.
Convolution is a special way of multiplying two vectors or two matrices together in order to determine how well they overlap. This leads to many different applications that you'll explore in this project, but first, here are the mechanics for how convolution is done:
A convolution is when you want to convolve two vectors or matrices together, matrix A and matrix B. You begin by flipping matrix B in both dimensions. Once flipped horizontally and vertically, overlap matrix B in the top left corner of matrix A. Perform an elementwise multiplication of where the matrices overlap and then add all of the results together to get a single value. This is the top left entry in your resultant matrix. Slide matrix B to the right by 1 and repeat this process. This continues until any part of matrix B no longer overlaps with matrix A. When this happens, move matrix B first column of matrix A and down by 1 row. Repeat the entire process until reaching the bottom right corner of matrix A. You have now convolved matrix A and B.
You can assume that the height and width of matrix B are less than or equal to the height and width of matrix A.
Note: Flipping matrix B in both dimensions is NOT the same as transposing the matrix. Flipping an MxN
matrix results in an MxN
matrix. Transpose results in an NxM
matrix.
Note: The output matrix has different dimensions. We'd recommend working out some examples to see how the dimensions of the output matrix is related to the input matrices.
Implementation
Helpful resources: Background: Matrices, Background: Task
Implement (or modify your existing) convolve
in compute_naive.c
. You may assume that b_matrix
will smaller or equal to a_matrix
(that is, if a_matrix
is m
by n
and b_matrix
is k
by l
, then k < m
and l < n
).
convolve 

Arguments  matrix_t* a_matrix 
Pointer to the first vector. 
matrix_t* b_matrix 
Pointer to the second vector.  
matrix_t** output_matrix 
Set *output_matrix to be a pointer to the resulting matrix. You must allocate memory for the resulting matrix in convolve 

Return values  int 
0 if successful, 1 if there are any errors 
Testing and Debugging
Helpful resources: Background: Testing Framework, Background: Directory Structure, Background: Testing, Background: Debugging
To run the staff provided tests for this task:
If you'd like to create additional tests, please refer to the testing framework section for creating tests as well as the testing and debugging sections for instructions on running tests.
Make sure to create tests of different sizes and dimensions! The autograder will test a wide range of sizes for both correctness and speedup.
Task 2: Optimization
At this point, you should have a complete implementation in compute_naive.c
. You will implement all optimizations in compute_optimized.c
.
Task 2.1: SIMD
Helpful resources: Lab 7, Discussion 9
Optimize your naive solution using SIMD instructions in compute_optimized.c
. Not all functions can be optimized using SIMD instructions. For this project, we're using 32bit integers, so each 256 bit AVX vector can store 8 integers and perform 8 operations at once.
As a reminder, you can use the Intel Intrinsics Guide as a reference to look up the relevant instructions. You will have to use the __m256i
type to hold 8 integers in a YMM register, and then use the _mm256_*
intrinsics to operate on them.
Task 2.2: OpenMP
Helpful resources: Lab 7, Lab 8, Discussion 9, Discussion 10
Optimize your solution from task 2.1 using OpenMP directives in compute_optimized.c
. Not all functions can be optimized using OpenMP directives.
Task 2.3: Testing and Debugging
Helpful resources: Background: Testing Framework, Background: Directory Structure, Background: Testing, Background: Debugging
To run the staff provided tests for this task (note: for this task, you should use make task_2
instead of make task_1
):
If you'd like to create additional tests, please refer to the testing framework section for creating tests as well as the testing and debugging sections for instructions on running tests.
Task 3: Open MPI
Helpful resources: Lab 8, Discussion 10
Up until this point, you've been using the provided coordinator_naive.c
as the coordinator with a small number of tasks. However, coordinator_naive.c
does not have any PLP capabilities and we can optimize it for a workload that includes a larger number of tasks.
 Implement an Open MPI coordinator in
coordinator_mpi.c
.
 The structure of the code will be similar to the Open MPI coordinator you wrote in lab 8 and discussion 10.
 Copy your existing compute functions in
compute_optimized.c
tocompute_optimized_mpi.c
, and make any changes necessary to further improve the speedup.
The execute_task
function may be helpful here:
execute_task 

Arguments  task_t *task 
A task_t struct representing the task to execute. 
Return values  int 
0 if successful, 1 if there are any errors 
Testing and Debugging
Helpful resources: Background: Testing Framework, Background: Directory Structure, Background: Testing, Background: Debugging
To run the staff provided tests for this task (note: for this task, you should use make task_3
instead of make task_1
or make task_2
):
If you'd like to create additional tests, please refer to the testing framework section for creating tests as well as the testing and debugging sections for instructions on running tests.
Unfortunately, gdb
is not extremely helpful for parallel programs. We recommend using printf
s throughout your program to verify that variables contain values you'd expect. We also recommend first testing your optimized compute with the naive coordinator first since it is easier to debug.
Task 4: Partner/Feedback Form
Congratulations on finishing the project! This is a brand new project, so we'd love to hear your feedback on what can be improved for future semesters.
Please fill out this short form, where you can offer your thoughts on the project and (if applicable) your partnership. Any feedback you provide won't affect your grade, so feel free to be honest and constructive.
Benchmarks
There are two types of benchmarks:
 Optimized: uses your
compute_optimized.c
and the staffcoordinator_naive.c
 These benchmarks are run with a limit of 4 threads for OpenMP.
 Open MPI: uses your
compute_optimized_mpi.c
and yourcoordinator_mpi.c
 These benchmarks are run with a limit of 1 thread for OpenMP and configures Open MPI to use 4 processes.
Speedup Requirements
Speedup Requirements  
Name  Folder Name  Type  Speedup 
Random  test_ag_random 
Optimized  7.75x 
Open MPI  5.30x  
Increasing  test_ag_increasing 
Optimized  8.05x 
Open MPI  3.15x  
Decreasing  test_ag_decreasing 
Optimized  8.05x 
Open MPI  3.70x  
Big and Small  test_ag_big_and_small 
Open MPI  2.10x 
Performance Score Calculation
The score for the performance portion of the autograder is calculated using the followed equation: score = log(x) / log(t)
, where x
is the speedup a submission achieves on a specific benchmark, and t
is the target speedup for that benchmark.
Submission and Grading
Submit your code to the Project 4 Gradescope assignment. Make sure that you have only modified compute_naive.c
, compute_optimized.c
, compute_optimized_mpi.c
and coordinator_mpi.c
. You can submit to Gradescope as many times as you want, and the score you see on Gradescope will be your final score for this project.
Appendix: Helper Functions in io.o
We've provided a few helper functions in io.o
. However, we've only provided the object file that will be linked with your program at link time. We are not providing the source code for these functions since it may give away solutions for other projects.
read_tasks 

Arguments  char *input_file 
Path to the input file (the input.txt file) 
int *num_tasks 
Set *num_tasks to the number of tasks in the input file 

task_t ***tasks 
Set *tasks as an array of task_t * s, each pointing to a task_t struct representing a task. 

Return values  int 
0 if successful, 1 if there are any errors 
get_a_matrix_path 

Arguments  task_t *task 
Pointer to the task. 
Return values  char * 
Matrix A's path for the given task. 
get_b_matrix_path 

Arguments  task_t *task 
Pointer to the task. 
Return values  char * 
Matrix B's path for the given task. 
get_output_matrix_path 

Arguments  task_t *task 
Pointer to the task. 
Return values  char * 
The output matrix's path for the given task. 
read_matrix 

Arguments  char *path 
Path to read the matrix from. 
matrix_t **matrix 
Sets *matrix to the matrix stored at path . 

Return values  int 
0 if successful, 1 if there are any errors 
write_matrix 

Arguments  char *path 
Path to read the matrix from. 
matrix_t *matrix 
The matrix to write to path . 

Return values  int 
0 if successful, 1 if there are any errors 
Appendix: (Optional) Convolutions
The content in this appendix is not in scope for this course.
This appendix will only give an overview of the 1D discrete time convolution. Discrete time means that a function is only defined at distinct, equally spaced integer indexes. Also, the explanations below are very fast and can be complicated, so it is ok to not understand convolutions for this project. For a much deeper understanding of convolutions and how they interact with signals, be sure to take EE120.
With that, let’s jump in 😈
Mathematically, a convolution is a type of function multiplication that determines how much the two overlap. In real world applications though, a convolution is how a system responds to a signal. A signal can be any mathematical function, and a system can be anything that takes in this function and provides an output. This output is also in the form of a function.
We can describe signals and systems like so:
x(t) > H > y(t)
Where x(t) and y(t) are functions, and H is the system.
Important qualities of a system that we’ll need for convolution are linearity and timeinvariance.
Linearity: One aspect of linearity is that if you feed a signal into a system that has been multiplied by some constant, it will respond with the original output signal multiplied by the same constant.
αx(t) = x̂(t) > H > ŷ(t) = αy(t)
Additionally, linearity means that if you pass in a sum of signals, the system will output a sum of their corresponding outputs.
x_1(t) + x_2(t) = x̂(t) > H > ŷ(t) = y_1(t) + y_2(t)
Putting the two together:
αx_1(t) + βx_2(t) = x̂(t) > H > ŷ(t) = αy_1(t) + βy_2(t)
TimeInvariance: TimeInvariance means that if you feed a time shifted input into a system, it will respond with an equally time shifted output.
$αx(t  T) = x̂(t) > H > ŷ(t) = αy(t  T)$
Going forward, the systems we examine will only be linear and timeinvariant (LTI).
In order to determine how a system will respond to any signal, an impulse can be passed into the system. The impulse function used is known as the Kronecker Delta, and it is defined like so:
x(t) = 1 iff t = 0, else x(t) = 0 ∀ t  t ≠ 0
Going forward, this function will be denoted as: 𝛿(t).
An important property of 𝛿(t) is that when it is multiplied by another function x(t), the result is a new function that is 0 everywhere except for t = 0 where it equals x(0). This is due to 𝛿(t) being 0 everywhere except for t = 0 where 𝛿(t) = 1. This can be written as:
x(t)𝛿(t) = x(0)𝛿(t)
This property still holds when 𝛿(t) is shifted by some value T, so more generally:
x(t)𝛿(t  T) = x(T)𝛿(tT)
This means that we can collect a single value of a function at any integer time value t just by multiplying the function with a shifted 𝛿(t). Knowing this, we can reconstruct x(t) by adding together the functions containing a single point. That means x(t) can be represented as a sum of scaled delta functions like so:
∑x(T)𝛿(tT) from T = ∞ to ∞ = … + x(1)𝛿(t + 1) + x(0)𝛿(t) + x(1)𝛿(t  1) + …
This is a very useful property of the delta function because instead of passing x(t) itself into a system, we can pass in this sum and gain more insight about the output, but first we must investigate how a system responds to 𝛿(t).
When 𝛿(x) is fed into the system, the output is known as the impulse response.
𝛿(t) > H > h(t)
There are a few methods for finding h(t), one of which being physically testing the system with an impulse and measuring the output. Knowing this, let’s see what happens when we pass in the scaled sum of delta functions. We know that because the system is LTI, passing in a sum of scaled inputs will result in a sum of corresponding scaled outputs. Additionally, inputs shifted in time will give the corresponding output time shifted by the same value. This means that:
x(t) = … + x(1)𝛿(t + 1) + x(0)𝛿(t) + x(1)𝛿(t  1) + … > H > … + x(1)h(t + 1) + x(0)h(t) + x(1)h(t  1) + … = y(t)
The output can then be expressed as ∑x(T)h(tT) from T = ∞ to ∞, and this is defined as the convolution of x(t) and h(t) and the operator is denoted with the symbol *. The system responded to x(t) by convolving it with the impulse response. Convolution is also a commutative operation, although the proof of this is left as an exercise for the reader. The sum also shows why one of the vectors in exercise 1.2 must be flipped to directly multiply the vectors together. x(T) begins evaluating at ∞ and is multiplied with h(tT) which begins at +∞, so without flipping, you must have two pointers, one starting at the beginning of the first vector and the other at the end of the second vector, progressing in the opposite direction. By flipping h(t  T) and getting h(t + T), it is also evaluated from ∞ to +∞, allowing for the two vectors to be directly multiplied to find the output at t. Since convolution is commutative, it doesn’t matter which vector is flipped.
As a final point, notice that the input sum of ∑x(T)𝛿(tT) from T = ∞ to ∞ is also convolution, one between x(t) and 𝛿(t), and it equals x(t). That means that 𝛿(t) is the identity for convolution, and x(t)*𝛿(t) = x(t) for any x(t).