CS 194-26, Fall 2020

Facial Keypoint Detection with Neural Networks

Charley Huang, CS194-26-acg



Overview

In this project, I created convolutional neural networks using PyTorch and trained them to automatically predict landmark points on faces. For the first two parts, we used the Danish computer scientist dataset from the previous project.

Part 1: Nose Tip Detection

To get accustomed to using CNNs for auto-detection, we started by predicting only the nose point. First, I wrote a Dataset class that stores each (slightly transformed) image along with its corresponding nose point, wrapped it in a DataLoader, and used it to extract and visualize the ground truth nose points.
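Below is a minimal sketch of what that Dataset and DataLoader setup looks like; the names train_images and train_points are hypothetical stand-ins for the loaded images and annotations, and the exact preprocessing in my code differs slightly.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NoseDataset(Dataset):
    """Pairs each (grayscale, resized) face image with its ground truth nose point."""
    def __init__(self, images, nose_points):
        self.images = images            # list of H x W float arrays
        self.nose_points = nose_points  # list of (x, y) coords, normalized to [0, 1]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = torch.as_tensor(self.images[idx], dtype=torch.float32).unsqueeze(0)  # add channel dim
        point = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return image, point

# DataLoader over the training portion of the dataset
train_loader = DataLoader(NoseDataset(train_images, train_points), batch_size=1, shuffle=True)
```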

Next, I created a convolutional neural network with 3 convolutional layers (each followed by a max pool) and two fully connected layers, using ReLU as the non-linearity. I trained this model with optim.Adam and a learning rate of 1e-3, using all angles of the first 32 danes in the dataset, and ran the training for 25 epochs. I also set aside a validation dataset of the next 8 faces so I could measure validation loss at each epoch. Here, I'm showing the MSE loss during training and validation for each epoch. As you can see, both losses consistently decrease as training runs, which means the model is successfully learning.
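For concreteness, here is a sketch of that kind of architecture and training loop, reusing the train_loader from the sketch above. The channel counts and the flattened feature size (which assumes a 1 x 60 x 80 input) are illustrative and may not match my actual network exactly.

```python
import torch.nn as nn
import torch.optim as optim

class NoseNet(nn.Module):
    """3 conv layers (each followed by a max pool) and 2 fully connected layers, with ReLU."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 24, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),  # 7 x 10 spatial size assumes a 60 x 80 input
            nn.Linear(128, 2),                       # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.regressor(self.features(x))

model = NoseNet()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(25):
    model.train()
    for images, points in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), points)
        loss.backward()
        optimizer.step()
    # ... evaluate on the validation loader and record both losses ...
```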

Here, I'm showing the ground truth nose points in green and the predicted nose points in red. I've shown a few images where the model predicts well and a few where it doesn't. I suspect the bad predictions are due to head angle and shadowing, since two of the failed predictions had angled faces and the other, straight-on one had heavier shadowing than the successfully predicted straight-on faces.

Good Prediction
Good Prediction
Good Prediction
Bad Prediction
Bad Prediction
Bad Prediction

Part 2: Full Facial Keypoints Detection

In this section, I built upon the code from part 1 by expanding the CNN and predicting all the facial keypoints rather than just the nose point. First, I adjusted my Dataset class to extract and read all the facial keypoints. After passing it into my DataLoader, I was able to visualize these ground truth facial keypoints.

For my neural net, I added 2 more convolutional layers and 1 more max pool. I used optim.Adam again, but this time with a learning rate of 1e-4. I also added some transformations (color jitter, random rotation, and random translation) to perform data augmentation. Data augmentation helps prevent the model from overfitting: randomly adjusting the images before each epoch keeps the model from training on the exact same set of images too many times, which would hurt its validation loss.

In this section, I chose to train in batches, using a batch size of 8 for the training dataset and 4 for the validation dataset, and again used all angles of the first 32 danes. My convolutional layers used a kernel of size 7 and doubled the number of filters at each layer. Here, I've displayed my network structure and the training and validation MSE losses at each epoch. Again, the losses consistently decrease over the epochs, which means the model is training properly.
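Below is a sketch of the kind of augmentation step I mean, assuming a recent torchvision where the functional transforms accept tensors. The jitter and shift ranges are illustrative, and my actual code also applies a small random rotation, which additionally requires rotating the keypoints about the image center.

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(image, keypoints):
    """Randomly jitter brightness, then shift the image and its keypoints together.

    `image` is a 1 x H x W tensor and `keypoints` an (N, 2) tensor of (x, y) pixel coords.
    """
    image = TF.adjust_brightness(image, random.uniform(0.7, 1.3))
    dx, dy = random.randint(-10, 10), random.randint(-10, 10)
    image = TF.affine(image, angle=0.0, translate=(dx, dy), scale=1.0, shear=0.0)
    # Keypoints must move with the image; the sign convention here assumes TF.affine
    # shifts the image content by (+dx, +dy).
    keypoints = keypoints + torch.tensor([dx, dy], dtype=keypoints.dtype)
    return image, keypoints
```

Because the geometric part of the augmentation moves the face within the frame, applying the same shift (and rotation) to the keypoint coordinates is what keeps the labels consistent with the transformed image.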

Here, I'm showing the ground truth facial keypoints in green and the predicted facial keypoints in red. I've shown a few images where the model predicts well and a few where it doesn't. A pattern in the three bad predictions is that the faces were all looking to their right (our left), while the well-predicted ones were looking to their left (our right) or straight ahead. The lack of success may come from the model not having trained on enough faces turned to their right (our left) to be as accurate on them.

Good Prediction
Good Prediction
Good Prediction
Bad Prediction
Bad Prediction
Bad Prediction

Here, I visualized the filters that my neural net learned throughout the training process. I've only visualized the filters for a few of the layers; a sketch of the visualization code is included after the figures below.

Layer 1
Layer 2
Layer 3
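
The visualization itself is straightforward; here is a rough sketch of the kind of code I mean, assuming the convolutional layers are accessible as modules of the model (the indexing in the usage comment is hypothetical).

```python
import matplotlib.pyplot as plt

def show_filters(conv_layer, ncols=8):
    """Plot each learned filter of a conv layer as a small grayscale image (first input channel only)."""
    weights = conv_layer.weight.data.cpu()  # shape: (out_channels, in_channels, k, k)
    nfilters = weights.shape[0]
    nrows = (nfilters + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(2 * ncols, 2 * nrows))
    for i, ax in enumerate(axes.flat):
        ax.axis('off')
        if i < nfilters:
            ax.imshow(weights[i, 0], cmap='gray')
    plt.show()

# e.g. show_filters(model.features[0]) for the first convolutional layer
```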

Part 3: Train with Larger Dataset

In this section, I further built upon the code from part 2, moving to Google Colab so I could train on a much larger dataset. To get rid of excess background in the images, I cropped each image to 1.5x its provided facial bounding box (transforming the keypoints accordingly) before resizing to a 224x224 square. I've displayed a few images and their keypoints after the bounding-box crop and resizing were applied.
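Here is a sketch of that cropping and rescaling step. I'm assuming the bounding box is given as (x, y, w, h) in pixel coordinates and the keypoints as an (N, 2) array of (x, y); the actual formats provided with the dataset may differ.

```python
import numpy as np
from skimage.transform import resize

def crop_and_resize(image, keypoints, bbox, scale=1.5, out_size=224):
    """Crop to `scale` times the provided bounding box, resize to out_size x out_size,
    and map the keypoints into the new coordinate frame."""
    keypoints = np.asarray(keypoints, dtype=float)
    x, y, w, h = bbox
    cx, cy = x + w / 2, y + h / 2
    left   = int(max(cx - scale * w / 2, 0))
    top    = int(max(cy - scale * h / 2, 0))
    right  = int(min(cx + scale * w / 2, image.shape[1]))
    bottom = int(min(cy + scale * h / 2, image.shape[0]))

    crop = image[top:bottom, left:right]
    # Shift the keypoints into the crop's frame, then scale them to the resized square.
    new_keypoints = (keypoints - [left, top]) * [out_size / crop.shape[1], out_size / crop.shape[0]]
    crop = resize(crop, (out_size, out_size))
    return crop, new_keypoints
```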

I replaced my homemade neural network with a modified version of the pre-made torchvision model ResNet-18. My version takes a 1-channel input and produces 136 outputs, the x and y values of the 68 facial keypoints of each image. I've displayed my modified ResNet-18 architecture below.
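Concretely, the modification amounts to swapping out the first conv layer and the final fully connected layer of torchvision's ResNet-18. Whether to start from pretrained weights is a separate choice; the sketch below builds the model from scratch.

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=False)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 1-channel (grayscale) input
model.fc = nn.Linear(model.fc.in_features, 136)                                  # 68 keypoints x 2 coordinates
```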

When training, I used a GPU to speed up the model, since training was very slow. Because of that I trained for fewer epochs than before, but kept my data augmentation from part 2 to prevent overfitting and strengthen the predictions. I used optim.Adam as my optimizer, ran the training for 20 epochs, and used a learning rate of 1e-3. My batch size for the training set (80% of the images) was 16 and my batch size for the validation set (the other 20% of the images) was 8. After training for 20 epochs, each of which took around 15-20 minutes, I ended up with these training and validation MSE losses over the epochs.
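A sketch of the GPU training loop is below; train_loader and val_loader are hypothetical names for the 80/20 DataLoaders described above, and I'm assuming the ground truth keypoints get flattened to a 136-vector to match the network output.

```python
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(20):
    model.train()
    train_loss = 0.0
    for images, keypoints in train_loader:
        images, keypoints = images.to(device), keypoints.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), keypoints.view(keypoints.size(0), -1))
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for images, keypoints in val_loader:
            images, keypoints = images.to(device), keypoints.to(device)
            val_loss += criterion(model(images), keypoints.view(keypoints.size(0), -1)).item()

    print(f'epoch {epoch}: train MSE {train_loss / len(train_loader):.4f}, '
          f'val MSE {val_loss / len(val_loader):.4f}')
```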

Here, I'm showing the ground truth facial keypoints in green and the predicted facial keypoints in red. I've shown a few training images where the model predicts well and a few where it doesn't predict as well. I'm guessing the bad predictions have to do with features being cut off by the cropped bounding regions.

Good Prediction
Good Prediction
Good Prediction
Bad Prediction
Bad Prediction
Bad Prediction

Here, I'm showing the predicted keypoints that my model output for a few images in the testing set. As you can see, it performs quite well on these faces.

Here, I used my model to predict the keypoints on some images of my choosing. It did quite well on all of them. There is some error in some of the eye predictions and the face shapes, but overall the results are much better than simply averaging the points.

Avengers
Oscars Group Picture
Me
Me and my Dad

Bells and Whistles: Anti-aliased Max Pool

For my bells and whistles, I switched out the ResNet-18 for a CNN that uses anti-aliased max pooling (created by Richard Zhang) so I could compare the resulting predictions. I reran my training for 10 epochs with the same learning rate of 1e-3 and the same batch sizes: 16 for the training set and 8 for the validation set. Here, I've displayed the losses over the 10 epochs of training.
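Zhang's anti-aliasing recipe is to max-pool densely (stride 1) and then blur before downsampling. A sketch of one such block is below, assuming the antialiased_cnns package (pip install antialiased-cnns) and its BlurPool layer; the channel counts are illustrative rather than my exact architecture.

```python
import torch.nn as nn
import antialiased_cnns  # Richard Zhang's antialiased-cnns package (assumed installed)

def antialiased_pool(channels):
    """Replace a strided max pool with dense max pooling followed by a blurred downsample."""
    return nn.Sequential(
        nn.MaxPool2d(kernel_size=2, stride=1),
        antialiased_cnns.BlurPool(channels, stride=2),
    )

# Example conv block using the anti-aliased pool in place of a plain MaxPool2d(2).
block = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=7, padding=3),
    nn.ReLU(),
    antialiased_pool(16),
)
```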

Here, I'm showing the ground truth facial keypoints in green and the predicted facial keypoints in red for a few training images. So far, this CNN structure seems to predict very well.

Finally, I'm showing the predicted keypoints that my model output for a few images in the testing set. It performs very well on these images. In conclusion, training this net for only 10 epochs seems to work just as well as the original ResNet-18 did over 20 epochs, which saves a lot of training time.