CS294-26 Project 5: Facial Keypoint Detection with Neural Networks

Jimmy Xu

 

Overview

The goal of the project is to train a neural network to detect facial keypoints (also called face landmarks) in two datasets.

 

Nose Tip Detection

In this part, I trained a neural network to detect the nose tip in the IMM Face Database.

Dataset

The dataset has 240 images of 40 persons. Each image has 58 facial keypoints. The project specs require me to use the first 32 persons as the training set and the last 8 as the validation set. Here are some sample pictures with the nose tip annotation visualized:

For preprocessing, I converted every image to grayscale, scaled it to (80x60), and normalized the pixel values to [-0.5, 0.5]. I also applied random rotation and random color jittering. Here are some examples of the preprocessing applied to the same image, with the nose tip annotated.
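For concreteness, here is a minimal sketch of this preprocessing pipeline using torchvision transforms (not the exact project code; the jitter strengths are illustrative, and the random rotation is omitted here because it must also be applied to the keypoint coordinates, as sketched in the full-keypoints section below):

```python
# Hedged sketch of the preprocessing described above; jitter strengths are illustrative.
import torchvision.transforms as T

preprocess = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2),   # random color jittering (illustrative values)
    T.Grayscale(num_output_channels=1),            # convert to grayscale
    T.Resize((60, 80)),                            # torchvision expects (height, width)
    T.ToTensor(),                                  # pixel values in [0, 1]
    T.Lambda(lambda x: x - 0.5),                   # shift to [-0.5, 0.5]
])
```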

Network architecture

The following is the architecture of my neural network.
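As a rough, illustrative PyTorch sketch of a small CNN for this task (the channel widths and layer counts here are assumptions, not necessarily the exact network listed):

```python
# Hedged sketch of a nose-tip regression CNN; widths/depths are illustrative.
import torch.nn as nn

class NoseTipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 24, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # the input is 1x60x80, so after three 2x downsamplings the map is 32x7x10
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),
            nn.Linear(128, 2),            # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.regressor(self.features(x))
```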

Loss function and optimizer

I used mean squared error (MSE) as the loss function and Adam as the optimizer with a learning rate of 0.0001. I trained for 15 epochs with a training batch size of 32.
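A sketch of this training setup, assuming the NoseTipNet sketch above and DataLoaders `train_loader`/`val_loader` that yield (image, keypoint) batches (with keypoint coordinates in whatever normalized form the loss is computed on):

```python
# Hedged sketch of the training loop: MSE loss, Adam with lr=1e-4, 15 epochs.
import torch

model = NoseTipNet()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(15):
    model.train()
    for images, keypoints in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), keypoints)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)
    print(f"epoch {epoch}: val loss {val_loss:.4f}")
```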

Results

With the above configurations, I was able to achieve a loss of 0.0056 on the validation set. Here's the training loss and validation loss across 15 epochs.

I tried increasing the capacity of the network by adding layers as well as varying the filter size; neither change significantly affected the loss. Here are the results:

Extra conv+ReLU

By adding two extra layers ((4) and (5) below), I was able to achieve a validation loss of 0.0057, which is very close to that of the net without the extra layers.

Larger filter size

By changing every filter size from (3, 3) to (7, 7), I was able to achieve a validation loss of 0.0068, which is slightly higher than previous results.

Good predictions

Here are two examples that the network performs well on. The green dot is the ground truth while the red dot is the prediction.

Bad predictions

Here are two examples on which the network does not perform well.

I think there are many possible reasons why my network fails on these examples, and the most important one is the scarcity of images. Despite the data augmentation, there are simply not enough images to train on, especially side views. This would explain why the network performs reasonably well on most frontal images but poorly on side views.

 

Full Facial Keypoints Detection

In this part, I trained a neural network to detect the 58 facial keypoints in the IMM Face Database.

Dataset

The dataset has 240 images of 40 persons. Each image has 58 facial keypoints. The project specs require me to use the first 32 persons as the training set and the last 8 as the validation set. Here are some sample pictures with all the facial keypoints visualized:

For preprocessing, I scaled every image to (240x180) and normalized the pixel values to [-0.5, 0.5]. In this part, I also implemented various data augmentation techniques (and applied them retroactively to the previous part): color jittering (brightness, contrast, saturation, and hue), random rotation, random cropping, and center cropping (not used).
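Of these, random rotation is the one that also has to transform the keypoints. Here is a hedged sketch of how that can be done (it assumes a PIL image, counterclockwise rotation about the image center, and y-down pixel coordinates; sign conventions depend on the library):

```python
# Hedged sketch of keypoint-aware rotation augmentation.
# `image` is a PIL image; `keypoints` is an (N, 2) array of (x, y) pixel coordinates.
import numpy as np
from PIL import Image

def random_rotate(image, keypoints, max_deg=15):
    angle = np.random.uniform(-max_deg, max_deg)
    rotated = image.rotate(angle, resample=Image.BILINEAR)   # CCW about the center

    # Rotate the keypoints about the image center by the same angle.
    # With y pointing down, a CCW rotation of the content corresponds to this matrix.
    theta = np.deg2rad(angle)
    rot = np.array([[np.cos(theta),  np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])
    center = np.array(image.size) / 2.0                      # (width/2, height/2)
    new_keypoints = (np.asarray(keypoints) - center) @ rot.T + center
    return rotated, new_keypoints
```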

Network architecture

The following is the architecture of my neural network.

Loss function and optimizer

I used mean squared error (MSE) as the loss function and Adam as the optimizer. The learning rate is 0.001 with a weight decay of 0.00001. The training batch size is 32.
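Compared to Part 1, the main change to the optimizer is the added weight decay; as a sketch (values from the text, with `model` being the full-keypoint network):

```python
import torch

# Adam with the learning rate and weight decay stated above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
```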

Results

With the above configurations, I was able to achieve a loss of 0.0035 on the validation set. Here's the training loss and validation loss across 30 epochs.

I tried increasing the capacity of the network by adding layers as well as varying the filter size; neither change significantly affected the loss.

Extra layer

By adding two extra layers ((8) and (9) below), I was able to achieve a validation loss of 0.0041, which is very close to that of the net without the extra layers.

Larger filter

By changing every filter size from (3, 3) to (7, 7) or (5, 5), I was able to achieve a validation loss of 0.0041, which is pretty similar to the net with 3x3 filters.

Good predictions

Bad predictions

I think the reason it performs poorly on some images is largely the same as in Part 1: despite the data augmentation, there are simply not enough images to train on, especially side views.

Visualize learned filters

Here are the filters of the first conv layer of my network. I don't really know what they represent...
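For reference, a sketch of how the first-layer filters can be pulled out and displayed (the grid layout is illustrative):

```python
# Sketch: grab the first Conv2d in the model and plot each filter's first input channel.
import matplotlib.pyplot as plt
import torch.nn as nn

first_conv = next(m for m in model.modules() if isinstance(m, nn.Conv2d))
filters = first_conv.weight.detach().cpu().numpy()    # (out_channels, in_channels, kH, kW)

fig, axes = plt.subplots(1, filters.shape[0], figsize=(1.5 * filters.shape[0], 1.5))
for ax, f in zip(axes, filters):
    ax.imshow(f[0], cmap="gray")                       # show the filter's first input channel
    ax.axis("off")
plt.show()
```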

Train with Larger Dataset

In this part, I trained a neural network to detect the 68 face landmarks in the iBUG Face dataset.

Dataset

The dataset has 6666 images. Each image has 68 facial keypoints as well as a bounding box of the face. I used the same data augmentation as in the previous part. In addition, I used the provided bounding box to crop out the region containing the face before preprocessing each image. The crops are then resized to (224, 224) to fit the pretrained model.
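A hedged sketch of the cropping step (the (left, top, width, height) bounding-box format here is an assumption; the actual iBUG annotation format may differ):

```python
# Hedged sketch of cropping the face region and rescaling the keypoints to match.
import numpy as np
from PIL import Image

def crop_face(image, keypoints, bbox, out_size=224):
    left, top, w, h = bbox
    face = image.crop((left, top, left + w, top + h)).resize((out_size, out_size))
    # Shift keypoints into the crop frame, then scale them to the resized image.
    kp = (np.asarray(keypoints) - np.array([left, top])) * np.array([out_size / w, out_size / h])
    return face, kp
```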

Network architecture

I used a pre-trained (on ImageNet) ResNet-50 and modified the first layer (conv1) so that it can take in a grayscale image, and the last layer so that it can output 68 points. Here's the architecture:
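The two modifications can be sketched as follows, assuming torchvision's ResNet-50:

```python
# Sketch of the described changes to a pretrained ResNet-50.
import torch.nn as nn
import torchvision.models as models

model = models.resnet50(pretrained=True)
# Replace conv1 so the network accepts a single-channel (grayscale) input.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Replace the final fully connected layer to regress 68 (x, y) keypoints.
model.fc = nn.Linear(model.fc.in_features, 68 * 2)
```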

Loss function and optimizer

I used mean squared error (MSE) as the loss function and Adam as the optimizer. The learning rate is 0.0001 with a weight decay of 0.00001. The training batch size is 64 and the validation batch size is 8. I used 6000 images as the training set and 666 images as the validation set.
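A sketch of the split and loaders described above, assuming the samples live in a single indexable `dataset`:

```python
# Sketch of the 6000/666 train/validation split with the stated batch sizes.
from torch.utils.data import DataLoader, Subset

train_set = Subset(dataset, range(6000))
val_set = Subset(dataset, range(6000, 6666))
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8)
```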

Results

I trained the net for 50 epochs. With the above configurations, I was able to achieve a mean absolute error of 8.18925 on the test set. Here's the training loss and validation loss.

Here are some results from the test set.

Here are some results from images I gathered from previous projects or the Internet.

One thing I noticed is that I have to define a bounding box of the face for the neural net to work. Otherwise, the landmarks will be all over the place.

As you can see, the network performs fairly well on frontal images. However, for an animated character, it doesn't perform very well: the eyes and eyebrows are off. This is understandable as animated characters have rather different facial features from real people. I was actually surprised that it could estimate the rough positions of the facial features of this character.

 

Bells and Whistles

Automatic Face Morphing

To truly automate the face morphing process, we need a face detector (to generate the bounding box) in addition to a face landmark detector. I used the dlib library to automate this step. Here's the result.
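For completeness, here is a hedged sketch of the dlib detection step that produces the bounding box (the input filename is hypothetical; the morphing itself, from the earlier project, is not shown):

```python
# Sketch: detect a face with dlib and convert the rectangle to a bounding box
# that can be fed to the cropping / keypoint-prediction pipeline above.
import dlib
import numpy as np
from PIL import Image

detector = dlib.get_frontal_face_detector()

image = np.array(Image.open("portrait.jpg").convert("RGB"))   # hypothetical input file
rects = detector(image, 1)            # upsample once to help find smaller faces
if rects:
    r = rects[0]
    bbox = (r.left(), r.top(), r.right() - r.left(), r.bottom() - r.top())
```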