Project 5: Facial Keypoint Detection with Neural Networks

Simona Aksman

Part 1: Nose Tip Detection

To start, I loaded data from the IMM Face Database. I used the first 192 images (32 people * 6 poses) for the training set, the last 48 images (8 people * 6 poses) for the validation set, and the nose tip point as the label. Then I created a custom dataloader that applies several transformations to the training and validation sets: it converts the images to grayscale, normalizes pixel values to the range [-0.5, 0.5], resizes the images to 80 x 60 pixels, and converts the images and labels to tensors. See below for several sampled training set images and ground-truth labels in red.
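Below is a minimal sketch of what this dataset class might look like in PyTorch. The class name, the in-memory (images, nose_points) arguments, and the use of OpenCV for grayscale conversion and resizing are illustrative assumptions rather than the exact implementation; the nose tip is assumed to be stored as a fractional (x, y) coordinate so that it survives the resize unchanged.

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import cv2


class NoseTipDataset(Dataset):
    def __init__(self, images, nose_points, out_size=(80, 60)):
        self.images = images            # list of H x W x 3 uint8 arrays
        self.nose_points = nose_points  # list of (x, y) fractions in [0, 1]
        self.out_size = out_size        # (width, height)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        # Grayscale, then scale pixel values into [-0.5, 0.5].
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY).astype(np.float32)
        gray = gray / 255.0 - 0.5
        # Resize to 80 x 60 and convert image and label to tensors.
        gray = cv2.resize(gray, self.out_size)
        img_t = torch.from_numpy(gray).unsqueeze(0)   # shape: 1 x 60 x 80
        label_t = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return img_t, label_t


# train_loader = DataLoader(NoseTipDataset(train_imgs, train_pts),
#                           batch_size=8, shuffle=True)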


Next I constructed a convolutional neural network by following some of the best practices outlined in class. In particular, I used 3x3 kernels for each of the convolutional layers and followed each convolutional layer with a ReLU activation and a 2x2 max pool. My network contains 3 convolutional layers, whose output is flattened and fed into 2 fully connected (FC) layers. A ReLU activation sits between the FC layers, but no activation is placed at the end of the network, since it outputs continuous coordinate values. The 3 convolutional layers have 12, 24, and 32 output channels, respectively. The final FC layer predicts the (x, y) coordinates of the nose tip and therefore produces 2 outputs.
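A sketch of this architecture in PyTorch is below, assuming 3x3 convolutions with padding 1 and a 1 x 60 x 80 grayscale input; the hidden FC width of 128 is an assumption, since it isn't stated above.

import torch.nn as nn


class NoseTipNet(nn.Module):
    """3 conv layers (12, 24, 32 channels), each followed by ReLU and a 2x2
    max pool, then 2 FC layers with no activation on the final output."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 24, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # A 1 x 60 x 80 input leaves a 32 x 7 x 10 feature map after pooling.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),
            nn.Linear(128, 2),  # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.regressor(self.features(x))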


With my dataset and CNN architecture defined, I was ready to train a model. I used MSE loss and an Adam optimizer, starting with a learning rate of 1e-3 and training for 25 epochs. I then varied the learning rate from 1e-3 to 1e-2 and also tested adding more convolutional layers. See the figures below for the results of these experiments. I achieved slightly better performance with the smaller learning rate of 1e-3 and a larger network with 4 convolutional layers.

learning rate = 1e-3, 3 conv layers: final train loss 0.00083, validation loss 0.00199
learning rate = 1e-2, 3 conv layers: final train loss 0.00422, validation loss 0.00532
learning rate = 1e-3, 4 conv layers: final train loss 0.00057, validation loss 0.00177
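The training loop itself is a standard regression loop. A sketch is below, assuming the dataset and model sketches above; train_loader and val_loader are the corresponding DataLoaders.

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = NoseTipNet().to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(25):
    # Training pass
    model.train()
    train_loss = 0.0
    for imgs, pts in train_loader:
        imgs, pts = imgs.to(device), pts.to(device)
        optimizer.zero_grad()
        loss = criterion(model(imgs), pts)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * imgs.size(0)

    # Validation pass
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for imgs, pts in val_loader:
            imgs, pts = imgs.to(device), pts.to(device)
            val_loss += criterion(model(imgs), pts).item() * imgs.size(0)

    print(f"epoch {epoch + 1}: "
          f"train {train_loss / len(train_loader.dataset):.5f}, "
          f"val {val_loss / len(val_loader.dataset):.5f}")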

I used the best model, the 4-layer network trained with a learning rate of 1e-3, to make predictions on the validation set. See below for some good and some bad predictions; ground-truth labels are red and predicted labels are teal. The model picked up some of the head-on and side-view nose points when facial shifts were minimal, but struggled with larger shifts. It also struggled with some people's faces more than others, regardless of pose. Data augmentation, such as facial rotations, might have helped the model learn more complex facial tilts, and training on more data could help it identify noses on a broader set of faces.

Good predictions

Bad predictions

Part 2: Full Facial Keypoints Detection

Next I used the IMM dataset to train a model that predicts the (x, y) coordinates of all 58 facial keypoints. After resizing images to 240 x 180, I applied data augmentations to the training set to help prevent the model from overfitting to certain facial positions, a problem encountered in part 1. In particular, I applied random x and y translations of between -20 and 20 pixels and random rotations of between -15 and 15 degrees. A sample from the training set with ground-truth keypoints in red is provided below.
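Since the keypoints have to move with the image, each augmentation applies the same affine transform to both. One way to do this with OpenCV, assuming the keypoints are an (N, 2) array of pixel coordinates, is sketched below.

import numpy as np
import cv2


def augment(image, keypoints, max_shift=20, max_angle=15):
    """Randomly rotate and translate an image, applying the same affine
    transform to its keypoints."""
    h, w = image.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    tx, ty = np.random.uniform(-max_shift, max_shift, size=2)

    # Rotation about the image center plus a translation.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    M[0, 2] += tx
    M[1, 2] += ty

    aug_img = cv2.warpAffine(image, M, (w, h))
    ones = np.ones((len(keypoints), 1))
    aug_pts = np.hstack([keypoints, ones]) @ M.T  # transform each (x, y)
    return aug_img, aug_pts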

Next I designed a CNN. This time I used 5 convolutional layers with more channels per layer, doubling the channel count as the network got deeper. The final FC layer outputs 116 predictions (58 keypoints * 2 (x, y) coordinates). See the figure below for more details on the network's architecture.
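A sketch of this deeper network is below. The starting channel count of 16 (so the widths run 16, 32, 64, 128, 256), the padding-1 convolutions, and the hidden FC width are assumptions; the doubling pattern, the 5 conv layers, and the 116 outputs follow the description above for a 1 x 180 x 240 input.

import torch.nn as nn


class FaceKeypointNet(nn.Module):
    """5 conv layers with doubling channel counts, each followed by ReLU and
    a 2x2 max pool, then 2 FC layers producing 116 outputs (58 points * 2)."""

    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 64, 128, 256]  # starting width of 16 is assumed
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*layers)
        # A 1 x 180 x 240 input leaves a 256 x 5 x 7 feature map.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 5 * 7, 256), nn.ReLU(),
            nn.Linear(256, 116),
        )

    def forward(self, x):
        return self.regressor(self.features(x))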

To train the model, I started with hyperparameters that worked well in part 1: a learning rate of 1e-3, a batch size of 8, and 25 epochs. MSE loss and an Adam optimizer were used again. See below for the results of my best run.

This time the network appears to have learned to handle facial shifts, probably thanks to the data augmentation: when I first trained the network without the translations, it did worse. The model still struggled with certain faces more than others; it seemed to do better on more "average", symmetric faces, even when those faces were tilted.

Good predictions

Bad predictions


I also visualized the 3x3 kernels learned in the first 3 convolutional layers.
conv1
conv2
conv3
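These grids can be produced by pulling the learned weights out of each convolutional layer and plotting them. A small helper along these lines (plotting only each filter's first input channel, and assuming a model laid out like the sketch above) might look like:

import matplotlib.pyplot as plt


def show_kernels(conv_layer, title):
    """Plot each 3x3 kernel of a convolutional layer as a grayscale image."""
    weights = conv_layer.weight.detach().cpu().numpy()  # (out_ch, in_ch, 3, 3)
    n = weights.shape[0]
    fig, axes = plt.subplots(1, n, figsize=(1.2 * n, 1.5))
    for i, ax in enumerate(axes):
        ax.imshow(weights[i, 0], cmap="gray")  # first input channel only
        ax.axis("off")
    fig.suptitle(title)
    plt.show()


# e.g. show_kernels(model.features[0], "conv1")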



Part 3: Train with Larger Dataset

Next I worked with a larger dataset, the ibug Faces in the Wild dataset, to train a more powerful facial keypoint detector. Each image in the dataset comes with 68 annotated keypoints and a bounding box defining the face's location. I used 95% of the 6,666 images for training and the remaining 5% for validation. To prepare the training data, I first cropped the images using the bounding boxes, enlarging them to 1.6x their original size to prevent faces from being cut off. I also did some extra preprocessing to handle empty bounding boxes and to fix poorly defined ones (i.e. those that were not square or did not have their origin at the top left). Then I normalized the images, rescaled them to 224 x 224, and applied data augmentations: random rotations between -20 and 20 degrees and random translations between -20 and 20 pixels. See below for some samples from the training set with ground-truth labels in red.
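A sketch of the bounding-box cropping step is below; the (x, y, w, h) box format, the square crop, and the clamping to the image bounds are assumptions about how this preprocessing might be written, not the exact code.

import numpy as np
import cv2


def crop_with_bbox(image, keypoints, bbox, scale=1.6, out_size=224):
    """Enlarge a face bounding box, crop the image, and rescale the image
    and keypoints into the cropped frame. bbox is assumed to be (x, y, w, h)
    in pixel coordinates."""
    x, y, w, h = bbox
    cx, cy = x + w / 2, y + h / 2
    side = max(w, h) * scale  # enlarge and force a square crop

    # Clamp the enlarged box to the image bounds.
    x0 = int(max(cx - side / 2, 0))
    y0 = int(max(cy - side / 2, 0))
    x1 = int(min(cx + side / 2, image.shape[1]))
    y1 = int(min(cy + side / 2, image.shape[0]))

    crop = cv2.resize(image[y0:y1, x0:x1], (out_size, out_size))

    # Shift and rescale keypoints into the cropped, resized frame.
    pts = keypoints.astype(np.float32)
    pts[:, 0] = (pts[:, 0] - x0) * out_size / (x1 - x0)
    pts[:, 1] = (pts[:, 1] - y0) * out_size / (y1 - y0)
    return crop, pts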


For the model, I used a standard ResNet18 architecture with a couple of minor modifications: I changed the first convolutional layer to take 1 input channel, since the input images are grayscale, and changed the final fully connected layer to produce 136 outputs, so that the model predicts 68 * 2 (x, y) coordinates. See below for a detailed description of the network architecture.
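In torchvision these two changes amount to swapping out the first convolution and the final fully connected layer; a sketch is below (whether pretrained weights were used isn't shown here, so this sketch creates the network without them).

import torch.nn as nn
from torchvision.models import resnet18

model = resnet18()  # no pretrained weights
# 1 input channel instead of 3, since the images are grayscale.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# 136 outputs: 68 keypoints * 2 (x, y) coordinates.
model.fc = nn.Linear(model.fc.in_features, 136)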



Next I trained the model. I ended up running it for about 18 epochs in total (before my Colab notebook died and I lost my checkpoint); the first 15 epochs are shown in the plot below. I saved the model and then ran a few additional epochs to slightly improve its performance. For the first run, I used a learning rate of 1e-3, an Adam optimizer, MSE loss, and a batch size of 4. For the later run of a few more epochs, I used the same parameters except that I decreased the learning rate to 1e-4 to help the model fit the training data better, since it looked like it might be underfitting in the first run. For validation, I evaluated on 1 batch of validation set images to save time. Notice that the validation error is much higher than the training error; this is likely because I did not crop the validation set images with the bounding boxes as I did for the training set. Despite this, the model did well on the out-of-sample test set images (which I did crop using bounding boxes). See below for some example predictions on the test set. I achieved an MAE of 8.94191 in the Kaggle competition with this model. Looking back, I probably should have constructed the validation set differently, since it doesn't really reflect the performance of the model on the test set.
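The save-and-resume step between the two runs looks roughly like the sketch below; the checkpoint file name is just a placeholder.

import torch

# At the end of the first run: save model and optimizer state.
torch.save({
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, "keypoint_resnet18.pt")  # placeholder file name

# In a later session: restore the weights and continue with a lower
# learning rate (1e-4) for a few more epochs.
checkpoint = torch.load("keypoint_resnet18.pt")
model.load_state_dict(checkpoint["model_state"])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)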

Performance in first 15 epochs:


Performance in additional 3 epochs:



Some test set predictions:


I also applied the model to images of myself that I used to create face morphs in project 3. It did well on images where my face was close up, but poorly when my face was further away. It seems the model did not learn to handle scale variation well, because faces in the training set were typically tightly cropped by their bounding boxes. To get the network to work well on these new images, I probably should have cropped them tightly around the face as well.

Good predictions

Bad predictions




Bells & Whistles: Automatic Face Morphing

Finally, the part of the project I was most excited about: automatic face morphing! Using the 3 best results from part 3, I created an abbreviated version of the "aging" sequence video I made for project 3. I think the manual version turned out better for various reasons (the keypoint matches were a bit better, and the images were higher resolution), but overall I'm impressed by how well a neural network can identify facial keypoints.