Project 4

Facial Keypoint Detection with Neural Networks by Amy Hung

CS 194-26: Image Manipulation and Computational Photography, Fall 2020

Overview

In this project, I use neural networks to automatically detect facial keypoints.

Part 1: Nose Tip Detection

For the first part, we're using the IMM Face Database from this website. I converted the images to grayscale, normalized the pixel values to fall within [-0.5, 0.5], and resized them to 80x60 pixels. Then, from the provided landmarks, I extracted the nose tip point. Here are some images with the ground-truth nose point displayed:
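A minimal sketch of this preprocessing step, assuming `landmarks` is a (58, 2) array of (x, y) coordinates already loaded from the IMM annotation file and normalized to [0, 1], and `nose_idx` is the index of the nose-tip landmark (look it up in the dataset's annotation scheme):

import numpy as np
from PIL import Image

def load_example(img_path, landmarks, nose_idx):
    img = Image.open(img_path).convert('L')                    # grayscale
    img = img.resize((80, 60))                                 # PIL takes (width, height)
    pixels = np.asarray(img, dtype=np.float32) / 255.0 - 0.5   # values in [-0.5, 0.5]
    nose = landmarks[nose_idx]                                 # relative (x, y) of the nose tip
    return pixels, nose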

From there, I wrote a CNN with the following architecture:

NoseNet(
  (conv1): Conv2d(1, 12, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(12, 22, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(22, 32, kernel_size=(7, 7), stride=(1, 1))
  (fc1): Linear(in_features=93600, out_features=900, bias=True)
  (fc2): Linear(in_features=900, out_features=116, bias=True)
)

I trained the model for 25 epochs with a learning rate of 1e-3 and a batch size of 10, saving the best model over all epochs. Here is a plot of the model's MSE loss at each epoch:
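For reference, training follows a standard loop. The sketch below assumes `train_loader` and `val_loader` wrap the dataset described above, that "best" means lowest validation loss, and that the optimizer is Adam (the write-up only specifies the learning rate):

import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    best_val = float('inf')
    for epoch in range(epochs):
        model.train()
        for imgs, pts in train_loader:               # batches of size 10
            optimizer.zero_grad()
            loss = criterion(model(imgs), pts)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():                        # average validation loss this epoch
            val = sum(criterion(model(i), p).item() for i, p in val_loader) / len(val_loader)
        if val < best_val:                           # keep the best model over all epochs
            best_val = val
            torch.save(model.state_dict(), 'nosenet_best.pt')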

After training, the model was able to accurately predict the nose position in most of the images. It tended to be more accurate on images where the subject was facing forward and the lighting and contrast were relatively high. In some predictions, we can see that the model struggled with images where the contrast was low, the lighting differed from the rest of the dataset, and/or parts of the face were obscured or warped by the subject turning.

Predictions on dataset:

Part 2: Full Facial Keypoints Detection

Now we're moving on to detecting keypoints for the entire face on the same dataset, loading all of the keypoints rather than just the nose point. Since this is a small dataset, I also performed data augmentation: randomly rotating by up to 15 degrees in either direction, randomly shifting by up to 10 pixels horizontally and vertically, and randomly changing the brightness, saturation, and hue via the ColorJitter transformation.
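A sketch of this augmentation step, assuming `img` is a PIL image and `pts` is an (N, 2) float array of pixel keypoint coordinates. The rotation and shift are mirrored onto the keypoints, while ColorJitter only changes pixel values; the jitter strengths below are illustrative, not the exact values used:

import numpy as np
import torchvision.transforms.functional as TF
from torchvision import transforms

jitter = transforms.ColorJitter(brightness=0.3, saturation=0.3, hue=0.1)  # illustrative strengths

def augment(img, pts):
    angle = np.random.uniform(-15, 15)               # up to 15 degrees either direction
    tx, ty = np.random.randint(-10, 11, size=2)      # up to 10 px shift in x and y

    img = TF.rotate(img, angle)                      # rotates content counter-clockwise
    img = TF.affine(img, angle=0, translate=(int(tx), int(ty)), scale=1.0, shear=0)
    img = jitter(img)

    # apply the same rotation (about the image center, y axis pointing down) and shift to the keypoints
    cx, cy = img.width / 2, img.height / 2
    theta = np.deg2rad(angle)
    x, y = pts[:, 0] - cx, pts[:, 1] - cy
    pts = np.stack([cx + x * np.cos(theta) + y * np.sin(theta) + tx,
                    cy - x * np.sin(theta) + y * np.cos(theta) + ty], axis=1)
    return img, pts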

Some examples of what the images look like after these transformations, with ground truth face keypoints:

From there, I wrote a CNN with the following architecture:

FaceNet(
  (conv1): Conv2d(1, 12, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(12, 12, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(12, 20, kernel_size=(5, 5), stride=(1, 1))
  (conv4): Conv2d(20, 20, kernel_size=(5, 5), stride=(1, 1))
  (conv5): Conv2d(20, 32, kernel_size=(5, 5), stride=(1, 1))
  (conv6): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=11040, out_features=800, bias=True)
  (fc2): Linear(in_features=800, out_features=116, bias=True)
)

I trained the model for 50 epochs with a learning rate of 1e-3 and a batch size of 10, saving the best model over all epochs. Here is a plot of the model's MSE loss at each epoch:
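The printout above only lists the learnable layers, so the sketch below fills in the rest with assumptions: ReLU activations, max pooling after every other conv, and 240x180 grayscale input (one set of choices that happens to reproduce the fc1 input size of 11040). The flattened size is computed with a dummy forward pass rather than hard-coded:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceNet(nn.Module):
    def __init__(self, in_shape=(1, 180, 240), n_points=58):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, 5)
        self.conv2 = nn.Conv2d(12, 12, 5)
        self.conv3 = nn.Conv2d(12, 20, 5)
        self.conv4 = nn.Conv2d(20, 20, 5)
        self.conv5 = nn.Conv2d(20, 32, 5)
        self.conv6 = nn.Conv2d(32, 32, 5)
        with torch.no_grad():                            # infer the flattened feature size
            n_flat = self._features(torch.zeros(1, *in_shape)).shape[1]
        self.fc1 = nn.Linear(n_flat, 800)
        self.fc2 = nn.Linear(800, n_points * 2)          # 58 points -> 116 outputs

    def _features(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)       # pooling placement is an assumption
        x = F.relu(self.conv3(x))
        x = F.max_pool2d(F.relu(self.conv4(x)), 2)
        x = F.relu(self.conv5(x))
        x = F.max_pool2d(F.relu(self.conv6(x)), 2)
        return x.flatten(1)

    def forward(self, x):
        x = F.relu(self.fc1(self._features(x)))
        return self.fc2(x)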

After training, the model was able to accurately predict most subjects' facial keypoints. Overall, the model still performs best on images where the subject is facing forward and well-lit. It is also able to handle turned and tilted faces much better. However, the model is unable to predict keypoints for unusual facial expressions that aren't represented in the training set.

Predictions on dataset:

In particular, the model didn't do well with this subject's set of images, as their face and chin are smaller and narrower than the average face in the training dataset.

These are the model's learned filters after training:

Layer 0:

Layer 1:
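These visualizations can be produced by displaying each learned kernel as a small grayscale image; a short sketch, assuming `model` is the trained FaceNet from above:

import matplotlib.pyplot as plt

def show_filters(conv_layer):
    weights = conv_layer.weight.data.cpu()               # shape (out_ch, in_ch, k, k)
    fig, axes = plt.subplots(1, weights.shape[0], figsize=(2 * weights.shape[0], 2))
    for i, ax in enumerate(axes):
        ax.imshow(weights[i, 0].numpy(), cmap='gray')    # first input channel of filter i
        ax.axis('off')
    plt.show()

show_filters(model.conv1)   # "Layer 0"
show_filters(model.conv2)   # "Layer 1"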

Part 3: Train with Larger Dataset

Now, we're working with a larger dataset, the iBUG face-in-the-wild dataset, to train a facial keypoint detector. The images have much higher variance, and each comes with a bounding box that localizes the face. Examples of the input data, with the provided bounding boxes drawn in red:

From there, I process the input data by cropping the images to their bounding boxes:
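A sketch of this cropping step, assuming `img` is a PIL image, `bbox` is (x, y, w, h) in pixels, and `pts` is a (68, 2) array of keypoints. The padding ratio and the 224x224 output resolution are assumptions (the write-up doesn't pin them down); the keypoints are shifted and scaled into the resized crop's coordinate frame:

import numpy as np

def crop_face(img, bbox, pts, out_size=224, pad=0.2):
    x, y, w, h = bbox
    # pad the box a little so the whole face (forehead, chin) stays inside
    x0 = max(int(x - pad * w), 0)
    y0 = max(int(y - pad * h), 0)
    x1 = min(int(x + (1 + pad) * w), img.width)
    y1 = min(int(y + (1 + pad) * h), img.height)
    crop = img.crop((x0, y0, x1, y1)).resize((out_size, out_size))

    # express keypoints in the resized crop's coordinate frame
    scale = np.array([out_size / (x1 - x0), out_size / (y1 - y0)])
    pts = (pts - np.array([x0, y0])) * scale
    return crop, pts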

I then apply the same data augmentation as in Part 2 (random rotation, shifting, and brightness/saturation/hue jitter):

With this preprocessed input data, I then used a ResNet18 model from PyTorch, modified so that the first convolutional layer takes 1 input channel (with 64 output channels and a kernel_size of 5) and the final layer outputs 136 values (68 points * 2). I trained the model with a batch size of 10 and a learning rate of 1e-3, intending to run for 50 epochs. Unfortunately, I didn't have enough time/GPU to let the model finish training, but it was able to achieve these results on the training set after 7 epochs:
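Roughly, the ResNet18 modification amounts to swapping out two layers; the stride and padding of the replacement first conv below are assumptions (they aren't specified above), as is the 224x224 input resolution:

import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18()                           # whether pretrained weights were used isn't stated above
model.conv1 = nn.Conv2d(1, 64, kernel_size=5, stride=2, padding=2, bias=False)  # grayscale input
model.fc = nn.Linear(model.fc.in_features, 136)     # 68 keypoints * 2 coordinates

out = model(torch.zeros(10, 1, 224, 224))           # batch of 10 grayscale face crops
print(out.shape)                                    # torch.Size([10, 136])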

Plot of MSE across these 7 epochs:

The model is able to perform pretty well on the training data, and I'm confident it would have improved with more training time and GPU access. However, I was unable to figure out how to properly rescale the predicted keypoints back to the original image's size. Here are some results on the test dataset that show how the predictions were not properly re-scaled: