Facial Keypoint Detection with Neural Networks

Sarah Feng, sarah.f@berkeley.edu, Fall 2021


Part 1: Nose Tip Detection

For the first part, we use the IMM Face Database to detect the nose tip keypoint. Pixel values are shifted to the range [-0.5, 0.5] and images are resized to 80 x 60. The last 8 x 6 = 48 images are held out as the test set.
Below is a sample of faces with ground truth nose keypoints from the database.

[missing image: sample faces with ground-truth nose keypoints]
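
For reference, here is a minimal sketch of the preprocessing described above. The class name, field names, and use of OpenCV for loading and resizing are my own assumptions; the keypoints are assumed to be stored as (x, y) ratios in [0, 1].

    import cv2
    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class NoseDataset(Dataset):
        """Grayscale IMM images as floats in [-0.5, 0.5], resized to 80 x 60."""
        def __init__(self, image_paths, nose_points):
            self.image_paths = image_paths    # list of image file paths
            self.nose_points = nose_points    # (N, 2) array of (x, y) ratios in [0, 1]

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            img = cv2.imread(self.image_paths[idx], cv2.IMREAD_GRAYSCALE)
            img = cv2.resize(img, (80, 60)).astype(np.float32) / 255.0 - 0.5  # shift to [-0.5, 0.5]
            img = torch.from_numpy(img).unsqueeze(0)                          # (1, 60, 80)
            pt = torch.tensor(self.nose_points[idx], dtype=torch.float32)     # normalized (x, y)
            return img, pt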

We then train our convolutional neural networks. For hyperparameters, I tried varying both the learning rate [1e-3, 1e-4, 1e-6] and number of layers [3, 4].
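
The exact channel counts, kernel sizes, and hidden sizes aren't listed here, so the following is only a sketch of what the 3-layer variant could look like (3x3 convolutions, max pooling after each conv, and two fully connected layers ending in the 2 nose coordinates):

    import torch.nn as nn

    class NoseNet(nn.Module):
        """3 conv layers -> 2 fully connected layers -> normalized (x, y) of the nose tip."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 60x80 -> 30x40
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 30x40 -> 15x20
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 15x20 -> 7x10
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 7 * 10, 128), nn.ReLU(),
                nn.Linear(128, 2),
            )

        def forward(self, x):
            return self.head(self.features(x))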

Here is a comparison of the losses for training and testing.

[missing plot: lr = 1e-3, 3 layers]
[missing plot: lr = 1e-4, 3 layers]
[missing plot: lr = 1e-6, 3 layers]
[missing plot: lr = 1e-3, 4 layers]
[missing plot: lr = 1e-4, 4 layers]
[missing plot: lr = 1e-6, 4 layers]

It seems that 1e-4 is the learning rate that works best, giving convergence and a smooth decrease in both training and test loss.
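
For context, the curves above come from a fairly standard training loop; the sketch below assumes MSE loss on the normalized coordinates and the Adam optimizer (the actual optimizer and epoch count may differ).

    import torch

    def train(model, train_loader, test_loader, lr, epochs=25, device="cpu"):
        model.to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = torch.nn.MSELoss()
        train_losses, test_losses = [], []
        for epoch in range(epochs):
            model.train()
            total = 0.0
            for imgs, pts in train_loader:
                imgs, pts = imgs.to(device), pts.to(device)
                loss = criterion(model(imgs), pts)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                total += loss.item()
            train_losses.append(total / len(train_loader))

            model.eval()
            with torch.no_grad():
                total = sum(criterion(model(i.to(device)), p.to(device)).item()
                            for i, p in test_loader)
            test_losses.append(total / len(test_loader))
        return train_losses, test_losses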

Running the trained networks on the test data, I visualized the two worst ("incorrect") predictions and the two best ("correct") predictions to see how the network does. Ground truth points in green, predicted in red.

[missing image: lr = 1e-3, 3 layers, correct]
[missing image: lr = 1e-3, 3 layers, incorrect]
[missing image: lr = 1e-4, 3 layers, correct]
[missing image: lr = 1e-4, 3 layers, incorrect]
[missing image: lr = 1e-6, 3 layers, correct]
[missing image: lr = 1e-6, 3 layers, incorrect]
[missing image: lr = 1e-3, 4 layers, correct]
[missing image: lr = 1e-3, 4 layers, incorrect]
[missing image: lr = 1e-4, 4 layers, correct]
[missing image: lr = 1e-4, 4 layers, incorrect]
[missing image: lr = 1e-6, 4 layers, correct]
[missing image: lr = 1e-6, 4 layers, incorrect]

The 1e-6 learning rate simply does not work well, as we saw from the loss plots. For the other "bad" predictions, it seems like the network learns that the nose will usually be in some "center" range of the photo, from about [35, 45] in x and [30, 40] in y. When the nose position deviates strongly from this (i.e., the face is translated or turned at an angle), the network is unable to change its prediction correspondingly.


Part 2: Full Facial Keypoints Detection

For this part, we want to predict the full set of 58 keypoints. This requires increasing the image size from 80 x 60 to 240 x 180. The dataset is also augmented with the following transforms: ColorJitter, RandomRotation (between -15 and 15 degrees), and translations (RandomAffine with translation between 5 and 10% of the image's width).
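
Since the geometric transforms move the face, the keypoints have to be transformed together with the image (ColorJitter only touches pixel values, so it leaves the keypoints alone). Here is a rough sketch of the rotation part; the helper name and conventions are my own, the keypoints are assumed to be in pixel coordinates, and the sign of the keypoint rotation is worth verifying by overlaying the points on the rotated image.

    import math
    import torch
    import torchvision.transforms.functional as TF

    def random_rotate(img, keypoints, max_deg=15):
        """Rotate a (C, H, W) image and its (K, 2) pixel keypoints by the same random angle."""
        angle = (torch.rand(1).item() * 2 - 1) * max_deg   # uniform in [-max_deg, max_deg]
        img = TF.rotate(img, angle)                        # counter-clockwise rotation of the image
        h, w = img.shape[-2:]
        center = torch.tensor([(w - 1) / 2, (h - 1) / 2])
        theta = math.radians(angle)
        # counter-clockwise rotation about the image center in (x right, y down) pixel coordinates
        rot = torch.tensor([[math.cos(theta),  math.sin(theta)],
                            [-math.sin(theta), math.cos(theta)]])
        return img, (keypoints - center) @ rot.T + center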

Here are some sampled data (including augmented samples).

[missing image: sampled training data, including augmented samples]

For hyperparameters, I tried varying both the learning rate [1e-3, 1e-4] and number of layers [5, 6].

Here is a comparison of the losses for training and testing.

[missing plot: lr = 1e-3, 5 layers]
[missing plot: lr = 1e-4, 5 layers]
[missing plot: lr = 1e-3, 6 layers]
[missing plot: lr = 1e-4, 6 layers]

It seems that using 5 layers and lr = 1e-4 is best. I think adding more layers doesn't necessarily help because the max pooling after each layer shrinks the [h, w] of the output too much even as the number of channels increases, so the flattened feature size going into the fully connected layer ends up too small.
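
To make the size argument concrete, here is the spatial size of the last feature map for a 240 x 180 input after repeated 2x2 max pooling (ignoring any size change from the convolutions themselves):

    h, w = 180, 240
    for n in (5, 6):
        print(n, "poolings ->", h // 2**n, "x", w // 2**n)
    # 5 poolings -> 5 x 7, 6 poolings -> 2 x 3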

Here are the "correct" and "incorrect" predictions for the test set.


[missing image: lr = 1e-3, 5 layers, correct]
[missing image: lr = 1e-3, 5 layers, incorrect]
[missing image: lr = 1e-4, 5 layers, correct]
[missing image: lr = 1e-4, 5 layers, incorrect]
[missing image: lr = 1e-3, 6 layers, correct]
[missing image: lr = 1e-3, 6 layers, incorrect]
[missing image: lr = 1e-4, 6 layers, correct]
[missing image: lr = 1e-4, 6 layers, incorrect]

Again, it seems like the CNN is able to capture "slight" variances in the keypoint positions but not major changes in the angle of the head. I computed my loss on keypoints expressed as ratios in [0, 1], but seeing this issue makes me think maybe I should have scaled those values by the image height and width before computing the loss, so as to penalize larger pixel distances between predicted and ground-truth keypoints more.
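
The fix I have in mind would just rescale the predictions and targets from [0, 1] ratios to pixel units before taking the MSE; a sketch (assuming keypoints shaped (batch, K, 2) with x first):

    import torch

    def pixel_mse(pred, target, h=180, w=240):
        """MSE in pixel units: scale normalized (x, y) keypoints by image width/height first."""
        scale = torch.tensor([w, h], dtype=pred.dtype, device=pred.device)
        return torch.nn.functional.mse_loss(pred * scale, target * scale)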

I visualized some of the learned filters at each layer of the "best" (1e-4, 5 layer) network.

[missing image: layer 1 filters]
[missing image: layer 2 filters]
[missing image: layer 3 filters]
[missing image: layer 4 filters]
[missing image: layer 5 filters]
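
For reference, the filters can be pulled out of the trained model and displayed roughly like this (a sketch; it assumes the conv layers are individually accessible, e.g. as entries of an nn.Sequential, and shows only the first input channel of each kernel):

    import matplotlib.pyplot as plt

    def show_filters(conv_layer, ncols=8):
        """Plot each output channel's kernel (first input channel only) as a small grayscale image."""
        weights = conv_layer.weight.data.cpu()            # (out_ch, in_ch, kH, kW)
        nrows = (weights.shape[0] + ncols - 1) // ncols
        fig, axes = plt.subplots(nrows, ncols, figsize=(ncols, nrows))
        for ax in axes.flat:
            ax.axis("off")
        for ax, w in zip(axes.flat, weights):
            ax.imshow(w[0], cmap="gray")
        plt.show()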

The learned filters, especially the 3x3 ones, are not very interpretable (they have no obvious structure), which makes me think I should have used larger filter sizes in the network architecture.


Part 3: Train With Larger Dataset

In this part, we train a network on the ibug face in the wild dataset. Images are cropped to the face bounding boxes and then resized to [224, 224].

For the network I used the ResNet18 architecture with two small changes: the first layer takes 1 input channel (grayscale) instead of 3 (RGB), and the final fully connected layer outputs 136 values (two coordinates for each of the 68 keypoints).
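
Concretely, the two changes amount to swapping out the first and last layers of the torchvision model (whether to start from pretrained weights is a separate choice; a fresh model is shown here):

    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet18()
    # first conv takes 1 grayscale channel instead of 3 RGB channels
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # final fully connected layer outputs 136 values: (x, y) for each of the 68 keypoints
    model.fc = nn.Linear(model.fc.in_features, 136)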

The batch size is 256 and the learning rate is 1e-3.

[missing image: ibug faces with augmentation]
[missing image: ibug face test set, predicted points in green]

It seems there is a similar issue where the CNN learns to predict the same "face" and can only adapt to slight changes in position or scale.

[missing plot: training loss]

There is no validation loss in this case, since the test set has no labels and I did not hold out any "fold" of the training data.

Results on data from my collection.

[missing image: my faces]

Clearly the network is not predicting correctly here. I think there may be a bug somewhere, perhaps in the coordinate handling or in how the loss is computed.