CS 194: Project 5

Theodora Worledge

Part 1: Nose Tip Detection

First, I wrote a custom dataset for the nose tip data. Below are several examples from the dataloader, with the ground-truth keypoint in red.

nt_tf1.jpg nt_tf2.jpg nt_tf3.jpg
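To give a sense of the shape of that dataset class, here is a minimal sketch of what it might look like. The 80x60 input size, the [-0.5, 0.5] normalization, and all of the names here are assumptions for illustration, not my exact code:

```python
import cv2
import torch
from torch.utils.data import Dataset

class NoseTipDataset(Dataset):
    """Grayscale face images paired with one normalized (x, y) nose-tip point."""

    def __init__(self, image_paths, nose_points, out_wh=(80, 60)):
        self.image_paths = image_paths
        self.nose_points = nose_points  # (x, y) pairs in [0, 1] coordinates
        self.out_wh = out_wh            # cv2.resize expects (width, height)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx], cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, self.out_wh)
        img = img.astype("float32") / 255.0 - 0.5   # roughly [-0.5, 0.5]
        img = torch.from_numpy(img).unsqueeze(0)    # shape (1, H, W)
        pt = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return img, pt
```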

I used the following architecture with 3 convolutional layers and 2 fully connected layers.

nt_net.jpg
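A sketch of a network in this spirit is below. The channel counts, kernel sizes, and hidden width are illustrative guesses, and the batch norms reflect the architecture change described at the end of this part:

```python
import torch.nn as nn

class NoseNet(nn.Module):
    """3 conv layers followed by 2 fully connected layers; outputs one (x, y)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # channel counts and kernel sizes are illustrative guesses
            nn.Conv2d(1, 12, 3, padding=1), nn.BatchNorm2d(12), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 24, 3, padding=1), nn.BatchNorm2d(24), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            # an 80x60 input shrinks to a 24 x 7 x 10 feature map after 3 poolings
            nn.Linear(24 * 7 * 10, 128), nn.ReLU(),
            nn.Linear(128, 2),  # normalized (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))
```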

Using Adam, I ran my model for 20 epochs with a batch size of 16. My training and validation MSE loss curves are below. My model achieved a training loss of 0.0001491 and a validation loss of 0.0001552.
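The training loop itself might look roughly like the following sketch (the learning rate and the loader setup are assumptions; the Adam/MSE/20-epoch settings mirror what I described above):

```python
import torch
import torch.nn as nn

def train(model, train_loader, valid_loader, epochs=20, lr=1e-3):
    """Train with Adam on MSE loss, recording validation loss once per epoch."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    train_losses, valid_losses = [], []
    for epoch in range(epochs):
        model.train()
        for imgs, pts in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(imgs), pts)
            loss.backward()
            opt.step()
            train_losses.append(loss.item())
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(i), p).item()
                      for i, p in valid_loader) / len(valid_loader)
        valid_losses.append(val)
    return train_losses, valid_losses
```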

Nose tip training MSE loss during training:

nt_train_loss.jpg

Nose tip validation MSE loss during training:

nt_valid_loss.jpg

Below are some examples of successful nose tip detection from the validation set. Here, blue is the ground-truth and red is the prediction:

nt_successful

In cases where the person is not facing the camera straight-on, the model fails to identify the nose tip, likely because it has only learned to find the center of a straight-on face. Since 4 out of every 6 images in the dataset face the camera straight-on, the model minimizes its loss by handling those well rather than by learning the turned cases. Below are some examples of unsuccessful nose tip detection from the validation set. Here, blue is the ground-truth and red is the prediction:

nt_unsuccessful

I tried varying both the batch size and the number of epochs. For my model, increasing the batch size to 64 increased the training and validation losses to 0.0008601 and 0.0007067, respectively. Decreasing the number of epochs to 15 also led to a higher training loss (0.0003423), but not necessarily to a higher validation loss (0.0001247). I additionally experimented with the architecture; adding a batch norm before the ReLU in each convolutional layer and going from 2 to 3 convolutional layers improved my accuracy significantly.

Part 2: Full Facial Keypoints Detection

I implemented data augmentations for the training data: left-right flips, translations (by at most 10 pixels up/down and left/right), rotations (-15 to 15 degrees), and color jitter (applied before converting the image to grayscale). Below I visualize these augmentations with their correspondingly transformed labels in red, in addition to a sample with no transformations.

tf1.jpg tf2.jpg tf3.jpg tf4.jpg
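The important detail is that every geometric transform must also be applied to the labels. A sketch of how that might be done (the helper and its parameters are illustrative, not my exact code):

```python
import cv2
import numpy as np

def augment(img, pts, max_shift=10, max_angle=15):
    """Randomly flip, translate, and rotate a grayscale image and its keypoints.

    img: (H, W) array; pts: (N, 2) array of (x, y) in [0, 1] coordinates.
    """
    h, w = img.shape
    # left-right flip: mirror the x coordinates
    # (for full-face labels this should also swap left/right landmark
    # indices, e.g. left eye <-> right eye; omitted here for brevity)
    if np.random.rand() < 0.5:
        img = img[:, ::-1].copy()
        pts = np.column_stack([1.0 - pts[:, 0], pts[:, 1]])

    # translation by up to max_shift pixels in each direction
    tx, ty = np.random.randint(-max_shift, max_shift + 1, size=2)
    M = np.float32([[1, 0, tx], [0, 1, ty]])
    img = cv2.warpAffine(img, M, (w, h))
    pts = pts + [tx / w, ty / h]

    # rotation about the image center, applied identically to the labels
    angle = np.random.uniform(-max_angle, max_angle)
    R = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, R, (w, h))
    xy = pts * [w, h]                    # to pixel coordinates
    xy = xy @ R[:, :2].T + R[:, 2]       # same affine map as the image
    return img, xy / [w, h]
```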

I trained the following architecture with a batch size of 6 for 30 epochs.

face_net

Face training MSE loss during training:

face_train_loss.jpg

Face validation MSE loss during training:

face_valid_loss.jpg

Below are some examples of successful full-face keypoint detection from the validation set. Here, blue is the ground-truth and red is the prediction:

face_successful1a.jpg face_successful2a.jpg

Once again, my model struggles more with cases where the person is not facing the camera straight-on. This is most likely because straight-on faces outnumber turned faces in the training data, so the model maximizes its performance on the straight-on cases. Below are some examples of unsuccessful full-face keypoint detection from the validation set. Here, blue is the ground-truth and red is the prediction:

face_unsuccessful1a.jpg face_unsuccessful2a.jpg

Below, I've displayed the filters my model learned in the first layer. I find these filters difficult to interpret. However, the two filters in the middle of the bottom row may detect diagonal edges (from top left to bottom right and from bottom left to top right, respectively), and the last filter of the first row looks as though it might help detect horizontal edges.

face_filters1.jpg
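Pulling these filters out for plotting is straightforward; a sketch, assuming matplotlib and a model whose first layer is a `Conv2d` (the layer name in the usage line is hypothetical):

```python
import matplotlib.pyplot as plt

def show_filters(conv_layer, cols=4):
    """Plot every kernel of a Conv2d layer (first input channel) in grayscale."""
    w = conv_layer.weight.data.cpu().numpy()      # (out_ch, in_ch, kH, kW)
    rows = (w.shape[0] + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, squeeze=False,
                             figsize=(2 * cols, 2 * rows))
    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i < w.shape[0]:
            ax.imshow(w[i, 0], cmap="gray")
    plt.show()

# e.g. show_filters(model.features[0])  -- hypothetical layer name
```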

Part 3: Train With Larger Dataset

First, I wrote the dataloader. I cropped images to their bounding boxes and rescaled them to 244x244. For data augmentation, I then applied rotations (at most 7 degrees clockwise or counter-clockwise) and shifts (at most 5 pixels left/right or up/down). As in Part 2, I also combined these augmentations. Below I visualize the augmentations with their correspondingly transformed labels in red on some instances from the training data.

tf1c_tw.jpg tf2c_tw.jpg tf3c_tw.jpg tf4c_tw.jpg
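A sketch of the crop-and-rescale step with the matching label update (the (x, y, w, h) bounding-box format and all names are assumptions):

```python
import cv2
import numpy as np

def crop_and_resize(img, pts, bbox, out_size=244):
    """Crop a face to its bounding box and rescale to out_size x out_size.

    bbox: (x, y, w, h) in pixels; pts: (68, 2) absolute pixel coordinates.
    Returns the crop plus keypoints normalized to [0, 1] within the crop.
    """
    x, y, w, h = bbox
    # clamp the box in case it spills past the image border
    x0, y0 = max(int(x), 0), max(int(y), 0)
    x1 = min(int(x + w), img.shape[1])
    y1 = min(int(y + h), img.shape[0])
    crop = cv2.resize(img[y0:y1, x0:x1], (out_size, out_size))
    pts = (np.asarray(pts, dtype=np.float64) - [x0, y0]) / [x1 - x0, y1 - y0]
    return crop, pts
```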

I used a modified version of the ResNet-18 architecture for this part. I modified the model to take in 1 channel rather than 3, and to output 68*2 = 136 values, the (x, y) coordinates of all 68 facial keypoints. I left the rest of the ResNet-18 architecture at its original values. I trained for 18 epochs with a batch size of 4, using a learning rate of 0.0001 with an exponential decay of gamma=0.99 per epoch. I achieved a loss of 9.26115 on Kaggle under the name Tahnby.
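Those two changes amount to swapping out the first convolution and the final fully connected layer of torchvision's ResNet-18; a sketch under those assumptions:

```python
import torch
import torch.nn as nn
import torchvision

def make_keypoint_resnet18():
    model = torchvision.models.resnet18()
    # 1 input channel (grayscale) instead of 3; other conv1 params unchanged
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # 68 keypoints * 2 coordinates = 136 regression outputs
    model.fc = nn.Linear(model.fc.in_features, 68 * 2)
    return model

# optimizer and per-epoch decay matching the settings described above
model = make_keypoint_resnet18()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)  # step once per epoch
```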

Below are visualizations of the training and validation losses. The validation loss appears much smoother than the training loss partially because I recorded it less frequently.

Face MSE loss over all 18 epochs:

face_lossc1_tw.jpg

Face MSE loss zoomed in:

face_lossc2_tw.jpg

Below are some examples of my model's full-face keypoint detection on images from the test set. Here, red is the prediction:

face_successful1c_tw.jpg face_successful2c_tw.jpg face_successful3c_tw.jpg

Below are some examples of my model's full-face keypoint detection on images from my own collection. My model works well on the first three faces! However, the extreme tilt in the fourth face is too much for the model to handle; this is likely because I did not include larger rotations in my data augmentation. Here, red is the prediction:

my_face1c_tw.jpg my_face2c_tw.jpg my_face3c_tw.jpg my_face4c_tw.jpg

Below, I've displayed the filters my model learned in the first layer. I notice a lot of diagonal, horizontal, and vertical lines in the filters.

face_filters1c_tw.jpg

In this project, I enjoyed training neural nets from scratch and learning how to work in Google Colab; these are skills I am excited to use again in the future! I struggled at first with finding the right batch size: I trained a network to convergence with a batch size of 128, and it performed quite badly. I should have investigated other batch sizes before investing in that model, a good lesson for the future. At one point, I thought I would need a ResNet-50 to achieve a better loss, because my ResNet-18 loss seemed to have converged after a few epochs. This turned out not to be the case: after continuing to train the ResNet-18 for many more epochs, I achieved a lower loss without overfitting.