CS 194 - Computational Photography


Oleksii Volkovskyi


Project 5

Part 1: Nose Tip Detection

Data samples





Note: there is non-trivial noise in the keypoint labels, which will affect the quality of models trained on this dataset.

Training Results

Varying the learning rate.


The best results were achieved with learning rates of 1e-3 and 5e-4.
Larger learning rates did not provide stable training or converged to a worse validation/training loss.
Varying kernel size of the convolutional layers.


The best training performance was achieved with a kernel size of 5.
The validation performance was surprisingly similar across the four kernel sizes, with size 9 performing the worst.

The training curve for the final set of hyperparameters (LR = 5e-4, kernel size = 5).
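As a concrete reference, the final settings can be dropped into a short PyTorch training step. This is only a sketch: `NoseNet`, its channel widths, and the 60x80 input size are illustrative stand-ins rather than the exact network used above; only the learning rate (5e-4), kernel size (5), and MSE loss match the experiments.

```python
import torch
import torch.nn as nn

# Sketch of a nose-tip regressor. Channel widths and the 60x80 input
# size are illustrative; LR, kernel size, and MSE loss match the
# hyperparameters chosen above.
class NoseNet(nn.Module):
    def __init__(self, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size, padding=pad), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size, padding=pad), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, kernel_size, padding=pad), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),  # 60x80 input -> 7x10 after 3 pools
            nn.Linear(128, 2),                       # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))

model = NoseNet(kernel_size=5)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.MSELoss()

# One training step on a dummy batch of 60x80 grayscale crops.
imgs = torch.rand(4, 1, 60, 80)
targets = torch.rand(4, 2)  # normalized (x, y) labels
optimizer.zero_grad()
loss = criterion(model(imgs), targets)
loss.backward()
optimizer.step()
```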

Final Model Results

Red keypoint is the true label, yellow is the model prediction.
Success cases:


Failure cases:


All of the above images are taken from the validation set, meaning the model was not trained on these images.
The model fails to generalize to turned faces or to photos with distracting objects (such as hands/arms), as seen by the larger prediction error in the two photos above.
For the top left image, I would argue that the prediction is better than the actual label, which is encouraging given that the labels are noisy.

Part 2: Full Facial Keypoints Detection

Data samples







I added data augmentation: ColorJitter with a maximum brightness delta of 0.05, and a random rotation between -15 and 15 degrees.
I manually tuned these numbers so that the images still looked somewhat natural (large brightness jitters were the main offender).

Model architecture


I chose learning rate and kernel size hyperparameters from the previous experiment.
However, to speed up model training, I decreased the kernel size in the middle layers, taking inspiration from AlexNet.
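A sketch of this idea: a 5x5 first layer (the best kernel size from Part 1) followed by cheaper 3x3 middle layers, in the spirit of AlexNet's large-then-small kernel progression. The channel widths, pooling, and head sizes here are illustrative placeholders, not the exact architecture used in this project.

```python
import torch
import torch.nn as nn

# Illustrative sketch: large first kernel, smaller middle kernels.
class KeypointNet(nn.Module):
    def __init__(self, num_keypoints):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 2 * num_keypoints),  # (x, y) per keypoint
        )

    def forward(self, x):
        # Reshape the flat output into one (x, y) pair per keypoint.
        return self.head(self.features(x)).view(x.shape[0], -1, 2)
```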

Training Plots


Learning rate = 5e-4, chosen by validation in the previous experiment.

Model Predictions

Success cases:




Failure cases:




All of the above images come from the test (validation) set.
The model works well on images that are not strongly transformed and on faces looking directly at the camera.
It generalizes somewhat to in-plane rotations (although not perfectly, so these still count as failure cases).
It completely fails on faces turned to the side in an unfamiliar way.

Learned Filter Visualization

Layer 1 and Layer 2:


Layer 6:

Layers 1 and 2 seem to learn sharp edge boundaries, since the filters all contain sharp edges.
The bottom right filter in layer 1 clearly learns a vertical edge, while the top right filter is very hard to interpret.
Layer 6 is essentially unintelligible since its representations sit so deep in the network; better visualization techniques would be needed.
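For reference, first-layer filter grids like the ones above can be produced by rendering the layer's weights directly, since each filter over a grayscale input is a single 2-D kernel (this is also why the same trick fails for deeper layers, whose filters mix many input channels). A minimal matplotlib sketch; `plot_first_layer_filters` is a hypothetical helper name:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

def plot_first_layer_filters(conv, path="filters.png"):
    """Render each filter of a single-channel first conv layer as a grid."""
    weights = conv.weight.detach().cpu()  # shape: (out_ch, in_ch, k, k)
    n = weights.shape[0]
    cols = min(n, 8)
    rows = (n + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(2 * cols, 2 * rows),
                             squeeze=False)
    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i < n:
            ax.imshow(weights[i, 0], cmap="gray")  # one 2-D kernel per filter
    fig.savefig(path)
    plt.close(fig)
```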

Part 3: Train With Larger Dataset

Data samples





The quality of samples in this dataset is much lower due to bad face bounding boxes.
I added padding to each side of every box in the dataset; however, that created a lot of black space, which negatively impacts data quality.
Overall, I found working properly with this data to be much more difficult.
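The padding step can be sketched as follows. The helper name `crop_padded_box`, the (x, y, w, h) box format, and the 0.2 padding fraction are illustrative assumptions; the zero fill is exactly the black space mentioned above.

```python
import numpy as np

def crop_padded_box(img, box, pad_frac=0.2):
    """Crop a face box expanded by pad_frac on each side.

    Regions of the padded box falling outside the image are filled
    with black (zeros).
    img: (H, W) grayscale array; box: (x, y, w, h) in pixels.
    """
    x, y, w, h = box
    px, py = int(w * pad_frac), int(h * pad_frac)
    x0, y0 = x - px, y - py
    x1, y1 = x + w + px, y + h + py
    out = np.zeros((y1 - y0, x1 - x0), dtype=img.dtype)
    # Intersection of the padded box with the image bounds.
    sx0, sy0 = max(x0, 0), max(y0, 0)
    sx1, sy1 = min(x1, img.shape[1]), min(y1, img.shape[0])
    out[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = img[sy0:sy1, sx0:sx1]
    return out
```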

Kaggle Competition

My team name is Oleksii Volkovskyi, and the score is 70.

Model Architecture

I used a pretrained ResNet18, replacing the last layer with a three-layer dense head and replacing the first convolutional layer to support grayscale (single-channel) inputs.
Model:


Training Plots

Training took approximately 2 hours.

Model Predictions

Success cases:




Failure cases:


The model works decently well on front-facing images; however, performance quickly drops off when the face is not centered or is partially out of the frame.

Personal Photos







The model generalizes surprisingly well to my personal images. This is most likely because all of the images are front-facing face-centered photos.