CS194-26 Project 5

5: Facial Keypoint Detection with Neural Networks

Part 1: Nose Tip Detection

I trained a CNN to detect nose tips on a given dataset with pre-marked facial keypoints. While processing the data inside the DataLoader, I loaded each image with PIL, rescaled it to 80x60, normalized it, and then converted it to a tensor.
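Below is a minimal sketch of what this per-sample processing could look like. The class name, the pre-parsed image_paths/nose_points inputs, the grayscale conversion, and the exact normalization range are placeholder assumptions for illustration, not my exact code.

```python
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image

class NoseTipDataset(Dataset):
    """Hypothetical sketch: load an image, rescale to 80x60, normalize, return tensors."""

    def __init__(self, image_paths, nose_points):
        # nose_points: one (x, y) nose tip per image, normalized to [0, 1] (assumed convention)
        self.image_paths = image_paths
        self.nose_points = nose_points

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert('L')    # grayscale conversion (assumed)
        img = img.resize((80, 60))                              # width x height
        arr = np.asarray(img, dtype=np.float32) / 255.0 - 0.5   # normalize (assumed range)
        image = torch.from_numpy(arr).unsqueeze(0)              # shape (1, 60, 80)
        keypoint = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return image, keypoint
```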

Here are 3 samples of the images loaded by the DataLoader, marked with their nose tip (in pink).


Loaded Image 1

Loaded Image 2

Loaded Image 3


After training the neural net with 4 convolutional layers (each followed by a ReLU and a maxpool) and 2 fully connected layers (the first of which was followed by a ReLU), I ended up with this graph of the training and validation losses. I used a batch size of 6.
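A sketch of this architecture is below; the channel counts, kernel sizes, and hidden-layer width are illustrative guesses, and only the layer structure described above is from my actual net.

```python
import torch.nn as nn

class NoseNet(nn.Module):
    """Sketch of the 4-conv-layer nose tip net described above."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 24, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # a 60x80 input halved four times leaves a 3x5 spatial map
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 3 * 5, 128), nn.ReLU(),
            nn.Linear(128, 2),  # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```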

Train and Validation Loss

I decided to tune two hyperparameters: the learning rate and the number of convolutional layers (3 vs. 4). There is definitely an element of randomness to the losses on every run, but in general, since a 3-layer net produces a less complex model than a 4-layer net, it made sense that the 3-layer losses seemed to show a bit of underfitting and plateaued very quickly. Among the three learning rates of 1e-2, 1e-3, and 1e-4, the fastest learning rate converged first. The middle rate showed a more expected trend, with both losses decreasing gradually over time; the axis shows that this graph actually reached a very small loss as well. Finally, the slowest learning rate converged last, taking around 10 epochs.
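The sweep itself can be driven by a simple nested loop, sketched below. Here make_model(num_layers) is a hypothetical factory that builds the 3- or 4-layer variant, the loaders are the nose tip train/validation DataLoaders, and the Adam optimizer and MSE loss are assumptions for illustration.

```python
import torch
import torch.nn as nn

def run_sweep(make_model, train_loader, val_loader, epochs=25):
    """Hypothetical sweep over depth and learning rate; records loss curves."""
    criterion = nn.MSELoss()
    results = {}
    for num_layers in (3, 4):
        for lr in (1e-2, 1e-3, 1e-4):
            model = make_model(num_layers)
            optimizer = torch.optim.Adam(model.parameters(), lr=lr)
            train_hist, val_hist = [], []
            for _ in range(epochs):
                model.train()
                total = 0.0
                for images, keypoints in train_loader:
                    optimizer.zero_grad()
                    loss = criterion(model(images), keypoints)
                    loss.backward()
                    optimizer.step()
                    total += loss.item()
                train_hist.append(total / len(train_loader))

                model.eval()
                with torch.no_grad():
                    val = sum(criterion(model(x), y).item() for x, y in val_loader)
                val_hist.append(val / len(val_loader))
            results[(num_layers, lr)] = (train_hist, val_hist)
    return results
```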


3 layers & LR=0.01

3 layers & LR=0.001

3 layers & LR=0.0001


4 layers & LR=0.01

4 layers & LR=0.001

4 layers & LR=0.0001

Here are some of the final predicted keypoints. Red is the ground truth, and blue is my prediction. The predictions definitely seemed to struggle with faces that were not facing forward; my guess is that most of the faces in the training dataset were facing forward.
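A small matplotlib sketch of how overlays like these could be drawn, assuming the image tensor and normalized (x, y) convention from the dataset sketch above (the exact plotting code differed):

```python
import matplotlib.pyplot as plt

def show_prediction(image, true_pt, pred_pt):
    """Sketch: overlay ground truth (red) and prediction (blue) on a (1, H, W) image."""
    _, h, w = image.shape
    plt.imshow(image.squeeze(0).numpy(), cmap='gray')
    plt.scatter([true_pt[0] * w], [true_pt[1] * h], c='red', label='ground truth')
    plt.scatter([pred_pt[0] * w], [pred_pt[1] * h], c='blue', label='prediction')
    plt.legend()
    plt.show()
```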


Successful Prediction 1

Successful Prediction 2


Failed Prediction 1

Failed Prediction 2


Part 2: Full Facial Keypoints Detection

Instead of just predicting the nose tips, I now tried to predict all of the facial keypoints. Since the dataset was not that big, I performed some random data augmentation inside my DataLoader: I randomly ColorJittered the images' brightness and saturation, and I also added a random crop after rescaling.
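A sketch of the augmentation step, assuming the image is still an RGB PIL image at this point and the keypoints are pixel coordinates; the jitter ranges and crop size are illustrative, not the exact values I used.

```python
import random
import numpy as np
from torchvision import transforms

def augment(img, keypoints, crop_size=(160, 120)):
    """Sketch: jitter brightness/saturation, then random-crop and shift keypoints."""
    # Color jitter only changes pixel values, so the keypoints are untouched.
    jitter = transforms.ColorJitter(brightness=0.3, saturation=0.3)
    img = jitter(img)

    # Random crop: pick an offset, crop the image, and shift the keypoints to match.
    crop_w, crop_h = crop_size
    left = random.randint(0, img.width - crop_w)
    top = random.randint(0, img.height - crop_h)
    img = img.crop((left, top, left + crop_w, top + crop_h))
    keypoints = keypoints - np.array([left, top])
    return img, keypoints
```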

Here are 3 samples of the images loaded by the DataLoader marked with their facial keypoints (blue).


Loaded Image 1

Loaded Image 2

Loaded Image 3


Neural Net Architecture

For this part, I created a neural net with 5 convolutional layers, each followed by a ReLU; all but the third were also followed by a maxpool of size 2. I then flattened the output and followed up with 2 fully connected layers (the first of which was followed by a ReLU). I used a batch size of 8, trained for 20 epochs, and ended up with this graph of the training and validation losses.
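A sketch of this architecture; the channel counts and kernel sizes are illustrative, the keypoint count is left as a parameter, and nn.LazyLinear is used here only so the sketch doesn't need to hard-code the flattened size.

```python
import torch.nn as nn

class FaceNet(nn.Module):
    """Sketch of the Part 2 net: 5 conv layers, maxpool after all but the third."""

    def __init__(self, num_keypoints):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 24, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(),                   # no pool after the 3rd layer
            nn.Conv2d(32, 48, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(48, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),             # infers the flattened size at first forward
            nn.Linear(256, num_keypoints * 2),         # (x, y) per keypoint
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```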

Losses


Train and Validation Loss


Here are some of the final predicted keypoints. Red is the ground truth, and blue is my prediction. As with the first part, the model seemed to struggle with faces that were not facing forward. I would still guess that the training data had a larger proportion of forward-facing faces than not.


Successful Prediction 1

Successful Prediction 2


Failed Prediction 1

Failed Prediction 2

Learned convolutional filters

Layer 1 Conv Weight Filters after Training
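A rough sketch of how these first-layer filters could be pulled out and plotted, assuming the FaceNet layout sketched above where features[0] is the first Conv2d:

```python
import matplotlib.pyplot as plt

def show_conv1_filters(model):
    """Sketch: plot each first-layer convolution kernel as a small grayscale image."""
    weights = model.features[0].weight.data.cpu()      # shape (out_channels, 1, k, k)
    n = weights.shape[0]
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
    for i, ax in enumerate(axes):
        ax.imshow(weights[i, 0], cmap='gray')
        ax.axis('off')
    plt.show()
```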



Part 3: Train With Larger Dataset

In this part I used the starter code to download and unzip the dataset. I trained a model on the training set and then used it to submit my predicted test results to Kaggle. My MAE was 14.43906.

Neural Net Architecture

I used the pretrained PyTorch model ResNet18 to make my predictions. I changed the first convolutional layer to take 1 input channel, and I changed the final layer to output 68*2 = 136 values. Since there were a lot of images, I used a larger batch size of 128 and a learning rate of 0.005 over 15 epochs.
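A sketch of the model setup; the layer swaps and learning rate are from my actual setup, while the optimizer choice below is an assumption for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from pretrained ResNet18, then swap the first conv to accept
# 1-channel (grayscale) input and the final layer to output 68*2 = 136 values.
model = models.resnet18(pretrained=True)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 68 * 2)

optimizer = torch.optim.Adam(model.parameters(), lr=0.005)   # optimizer choice assumed
```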

Losses


Train and Validation Loss


Visualized Results

Here are some of the final results on the validation set. The ground truth is plotted in blue, while my predictions are plotted in orange.


Prediction 1

Prediction 2

Prediction 3


Here are some of the predictions on the test set. I think overall, the points look how I would expect them to.


Test Prediction 1

Test Prediction 2

Test Prediction 3


3 Custom Photos

To be honest, I think the model did not predict any of the faces that well, but it definitely did better on the second image than on the first and the third. My guess is that the second face has stronger features and more color contrast, so the facial features are easier to detect.


Image 1

Image 2

Image 3