Facial Keypoint Detection

A catalogue of steps I took leading up to detecting in-the-wild facial keypoints using deep learning, for Berkeley's CS194-26. See the spec that this project is based on here.

Jazz Singh // December 2020


1. Nose Tip Detection (Toy)

I first recast the facial keypoint detection problem as a toy one, nose tip detection, so that I could temporarily focus on building up infrastructure. For this, I used the toy IMM Face Database, which consists of 240 facial images.

Covering the less interesting part first -- to prepare the input data, in line with the spec, I converted the images to grayscale, normalized the values to floats between -0.5 and 0.5, and rescaled the images by 1/8 so that the dimensions were 60 x 80 and thus faster to work with for toy problem purposes. I utilized nose tip keypoints normalized relative to the image dimensions. Below are two sample images (of the same individual) along with their ground truth nose tips (in blue), sampled from the dataloader:
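As a rough sketch of how that preprocessing could look as a PyTorch Dataset (the class name, file handling, and keypoint format here are illustrative assumptions, not the exact code used):

```python
import torch
from torch.utils.data import Dataset
from skimage import io, color, transform

class NoseTipDataset(Dataset):
    """Toy dataset: grayscale, normalized 60x80 images with one nose-tip keypoint."""

    def __init__(self, image_paths, nose_tips):
        # nose_tips: array of (x, y) keypoints already normalized to [0, 1]
        self.image_paths = image_paths
        self.nose_tips = nose_tips

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = io.imread(self.image_paths[idx])
        img = color.rgb2gray(img)                  # grayscale, floats in [0, 1]
        img = transform.resize(img, (60, 80))      # rescale by ~1/8
        img = img.astype('float32') - 0.5          # shift values to [-0.5, 0.5]
        keypoint = torch.tensor(self.nose_tips[idx], dtype=torch.float32)
        return torch.from_numpy(img).unsqueeze(0), keypoint   # (1, 60, 80), (2,)
```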

To take a first stab at the problem, I wrote a network with a few convolutional and fully connected layers (3 conv-relu-maxpool blocks, a linear layer, relu, and a final linear layer). I picked a kernel size of 5 and kept the network small, with 12 channels for the first layer, 24 for the next, and 32 for the final layer.
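Concretely, a sketch of that toy architecture might look like the following (the lack of padding and the 48-unit hidden layer are assumptions for illustration):

```python
import torch.nn as nn

class NoseTipNet(nn.Module):
    """3 conv-relu-maxpool blocks (kernel 5; 12, 24, 32 channels), then 2 linear layers."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 24, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        # With an unpadded 60x80 input the feature map ends up 32 x 4 x 6 = 768 values.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 6, 48), nn.ReLU(),
            nn.Linear(48, 2),          # (x, y) of the nose tip, normalized
        )

    def forward(self, x):
        return self.head(self.features(x))
```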

Below are a few strategies that improved results (a sketch of the resulting training setup follows the list). For all experiments, I used the Adam optimizer, trained for <= 25 epochs, and used simple MSE loss between the predicted and ground truth keypoints.

  • Initializing the last linear layer's bias term to 0.5, since (normalized) predictions are expected to be around that, and we don't want the net to spend all its time learning the bias.
  • Exponential learning rate decay of 0.99, to help the net find a good optimum.
  • Tuning the learning rate and batch size (1e-3, 8).
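Putting those pieces together, the training setup looks roughly like this (NoseTipNet and train_dataset refer to the hypothetical sketches above):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = NoseTipNet()
model.head[-1].bias.data.fill_(0.5)    # start predictions near the expected normalized keypoint

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)   # tuned batch size
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)              # tuned learning rate
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
criterion = nn.MSELoss()

for epoch in range(25):
    for images, keypoints in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), keypoints)
        loss.backward()
        optimizer.step()
    scheduler.step()   # exponential learning rate decay once per epoch
```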

There were plenty of experiments that didn't work out -- below are a few of the unhelpful additions I tried.

  • Batch normalization. I experimented with adding it before the relus, as well as after the max pools. Even though batch normalization slightly improved overall validation error, it worsened the network's performance on angled faces specifically compared to versions without batch-norm, so I chose not to keep it for the final model.
  • Dropout and weight decay. Both of these regularization techniques worsened results, likely because the network was already quite small, which provided enough of a regularizing effect on its own.

The final model had a training loss of 0.0000510, and a validation loss of 0.00104. More readably, that's a final validation error of about 2.3 pixels (for image size 60x80). Here's a plot of the training and validation curves leading to this result:

And here are a few images illustrating both failure and success cases. Ground truth keypoints are in gold, and predicted keypoints are in purple.

The network has more trouble with angled faces than with non-angled faces. I also suspect that side lighting (as opposed to overhead lighting) impacts the model's results; this might be because directional lighting influences shadows and thus impacts the model's sense of where features relevant to nose tip prediction are (e.g. nostrils, sides of nose).

2. Facial Keypoint Detection

In this section, I predict 58 face landmarks (instead of just the nose tip) using the same dataset as in the previous section.

In terms of preparing data, I modified the previous section's pipeline in two key ways. Firstly, I added a few different types of data augmentation to counteract overfitting (using mirror fill rather than black padding, to avoid potential overfitting to the orientation of padded edges):

  • random brightness and contrast jitter,
  • random horizontal flipping,
  • random rotation between -15 and 15 degrees,
  • and random shifting up to 10% of the image's largest input dimension.

Secondly, I used a higher resolution (input image size 180x240 instead of 60x80).

I sampled a few images with all random augmentations enabled (probability of 0.5 for each independently) and displayed them, alongside their ground truth keypoints, below. Note that I did not end up utilizing all of the augmentations in the final model.
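As an illustration of how the keypoint-aware augmentation works, here's a sketch of the random shift (the jitter, flip, and rotation augmentations follow the same pattern of transforming the image and its normalized keypoints together; the helper below is my own illustration, not the exact code used):

```python
import random
import numpy as np
from scipy.ndimage import shift as nd_shift

def random_shift(image, keypoints, max_frac=0.1, p=0.5):
    """Shift a grayscale image and its normalized (x, y) keypoints together,
    by up to max_frac of the largest image dimension."""
    if random.random() > p:
        return image, keypoints
    h, w = image.shape
    max_px = int(max_frac * max(h, w))
    dy = random.randint(-max_px, max_px)
    dx = random.randint(-max_px, max_px)
    shifted = nd_shift(image, (dy, dx), mode='reflect')   # mirror fill at the edges
    kps = keypoints.copy()
    kps[:, 0] += dx / w    # x coordinates are normalized by width
    kps[:, 1] += dy / h    # y coordinates are normalized by height
    return shifted, kps
```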

I started off with a network similar to the previous section's, but with a few differences intended to increase the net's expressive power while keeping it small. I used 5 convolutional layers (instead of 3), a kernel size of 3 (instead of 5), and more hidden units in the first fully connected layer (768 instead of 48).

For training, I again used the Adam optimizer and MSE loss between the predicted and ground truth face keypoints, but for this section I trained for <= 30 epochs (rather than <= 25 epochs).

Below are a few of the strategies that improved results:

  • Increasing the kernel size to 4 and the first linear hidden layer size to 1000, for more expressive power
  • Random shift augmentation, so the network is more capable of handling non-centered faces
  • Like the previous section, initializing the bias of the last fully connected layer to 0.5 (since that's around where normalized keypoints are expected to be, and we don't want to spend unnecessary training time learning the bias)
  • Like the previous section, tuning learning rate and batch size (of course!)

Below are a few of the strategies that didn't work so well:

  • Switching to AlexNet. I wanted to help the net fit better to the training data, but AlexNet ended up arriving at a worse fit.
  • Increasing the channels in the convolutional layers. I tried 4x'ing the number of channels at each subsequent layer starting from 12 channels in the first layer, and I tried 3x'ing them, but neither yielded an improvement.
  • Color jitter, random rotation, and horizontal flip augmentation. Although each of these (and combinations of them) did improve the training and validation accuracies relative to no augmentations at all, no combination improved them more than random shift augmentation alone.

I also spent a bit of time speeding up run-time for this section, including transferring the model and data to the GPU, pinning memory, and experimenting with the number of dataloader workers.
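Those speed-ups amount to a few device and dataloader settings, roughly like this sketch (train_dataset is again a placeholder for the dataset described above):

```python
import torch
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    pin_memory=True,   # page-locked host memory speeds up host-to-GPU copies
    num_workers=2,     # worker count is something to tune per machine
)

for images, keypoints in train_loader:
    images = images.to(device, non_blocking=True)
    keypoints = keypoints.to(device, non_blocking=True)
    # ... forward/backward pass as before ...
```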

Here's a detailed summary of the hyperparameters, additions, and architecture of the final model for this section (a code sketch of the architecture follows the list):

  • Random shift augmentation up to 10% of the largest image dimension (with padding mode reflect)
  • 5 convolutional layers, with channels 12, 24, 48, 96, and 192, and kernel size 4
  • 2 final fully connected layers, with hidden layer size 1000
  • Model architecture conv-relu-maxpool (repeated 5 times), linear layer, relu, linear layer
  • Batch size 4
  • Optimizer Adam
  • Learning rate 1e-3
  • Num epochs 30
  • Initializing last fully connected layer's bias terms to 0.5
  • No batch-norm, dropout, weight decay, or learning rate decay
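Here's a sketch of that final architecture (the padding choices are assumptions; LazyLinear is used so the flattened feature size doesn't have to be hardcoded):

```python
import torch.nn as nn

class FaceKeypointNet(nn.Module):
    """5 conv-relu-maxpool blocks (kernel 4; 12, 24, 48, 96, 192 channels), then 2 linear layers."""

    def __init__(self, num_keypoints=58):
        super().__init__()
        blocks = []
        channels = [1, 12, 24, 48, 96, 192]
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1000), nn.ReLU(),        # hidden layer size 1000
            nn.Linear(1000, 2 * num_keypoints),    # (x, y) per keypoint, normalized
        )
        self.head[-1].bias.data.fill_(0.5)         # last layer's bias initialized to 0.5

    def forward(self, x):
        return self.head(self.features(x))

# Note: run one dummy forward pass before creating the optimizer so that
# LazyLinear materializes its weights.
```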

The final model had a training loss of 0.000196, and a validation loss of 0.000361. More readably, that's a final validation error of about 4 pixels (for image size 180x240). Here's a plot of the training and validation curves leading to this result:

And here are a few images illustrating both failure and success cases. Ground truth keypoints are in orange, and predicted keypoints are in blue.

In terms of the failure cases, one issue the model seems to have is that if the jawline isn't obvious enough (e.g. no sharp angular edges, or a beard obscuring the line), then it's harder to accurately predict keypoints along this soft jawline. Faces that are extremely shifted (or extremely angled, for that matter) also remain harder to accurately predict keypoints for.

Here are a few filters from the first convolutional layer, visualized:
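For reference, pulling these kernels out of the model only takes a few lines (this assumes the FaceKeypointNet sketch above, where features[0] is the first Conv2d):

```python
import matplotlib.pyplot as plt

# Visualize the learned 4x4 kernels of the first convolutional layer.
weights = model.features[0].weight.detach().cpu()   # shape (12, 1, 4, 4)

fig, axes = plt.subplots(2, 6, figsize=(9, 3))
for ax, kernel in zip(axes.flat, weights):
    ax.imshow(kernel[0], cmap='gray')
    ax.axis('off')
plt.show()
```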


3. In-the-Wild Facial Keypoint Detection

In this section, I utilize the in-the-wild IBUG dataset of 6666 facial images to train a facial keypoint detector for difficult images.

Data preparation for this section is very similar to the previous section, except for one pre-processing step -- I crop out square bounding boxes containing all landmarks (1.8 times the annotated bboxes in the dataset) and resize these to 224x224 (updating the landmarks accordingly), since many images in the dataset contain multiple faces and/or have faces far from the camera.
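A sketch of that cropping step (assuming the annotated bounding boxes come as (left, top, width, height) in pixels and that landmarks start out in pixel coordinates):

```python
import numpy as np
from skimage import transform

def crop_and_resize(image, landmarks, bbox, scale=1.8, out_size=224):
    """Crop a square region `scale` times the annotated bbox, resize it to
    out_size x out_size, and re-normalize the landmarks to the crop."""
    left, top, width, height = bbox
    cx, cy = left + width / 2, top + height / 2
    half = scale * max(width, height) / 2
    x0, y0 = int(max(cx - half, 0)), int(max(cy - half, 0))
    x1, y1 = int(min(cx + half, image.shape[1])), int(min(cy + half, image.shape[0]))

    crop = transform.resize(image[y0:y1, x0:x1], (out_size, out_size))

    new_landmarks = landmarks.copy().astype('float32')
    new_landmarks[:, 0] = (new_landmarks[:, 0] - x0) / (x1 - x0)   # normalized x in the crop
    new_landmarks[:, 1] = (new_landmarks[:, 1] - y0) / (y1 - y0)   # normalized y in the crop
    return crop, new_landmarks
```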

Below are a few sample images from the dataloader, with only random shifting augmentation turned on (this is the data that the model for this section will be trained on):

I modified ResNet-18 to accept single-channel grayscale images and predict 68 keypoints for this section. Other than that, I used the exact same network initialization, training, and other hyperparameters as the previous section.
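The ResNet-18 modifications are small; here's a sketch (whether to start from pretrained weights isn't specified in this write-up, so starting from scratch below is an assumption):

```python
import torch.nn as nn
import torchvision

# Standard ResNet-18 (no pretrained weights), adapted for grayscale input
# and 68 (x, y) keypoint outputs.
model = torchvision.models.resnet18()
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 68 * 2)
model.fc.bias.data.fill_(0.5)   # same bias initialization as the previous section
```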

The final model had a training loss of 0.000150, and a validation loss of 0.00307. More readably, that's a final validation error of about 11 pixels (for image size 180x240). Here's a plot of the training and validation curves leading to this result:

Here are a few sample images from the test set, along with the keypoints the model predicted for each:

I ran this model on 3 images of my own choosing to see how it would do -- images of Alexandria Ocasio-Cortez, in situations with progressively more occlusion (e.g. glasses, mask) and progressively more angling of her head. Notice how the model predictions progressively worsen!