Project 5: Facial Keypoint Detection with Neural Networks

Nicholas Ha

Part 1: Nose Tip Detection

For this part, I wrote a custom dataloader for the images in the IMM Face Database, processing only the nose tip keypoint rather than the full set of facial keypoints.
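
As a rough sketch, a dataloader of the kind described could look like the following; the nose keypoint index, the 60x80 input size, and the parse_asf helper are illustrative assumptions rather than my exact code.

```python
import glob
import numpy as np
import torch
from torch.utils.data import Dataset
from skimage import io, transform

NOSE_IDX = 52  # assumed 0-based index of the nose tip among the 58 keypoints

class NoseDataset(Dataset):
    """Loads IMM images as grayscale and returns the nose keypoint as
    (x, y) coordinates normalized to [0, 1]."""

    def __init__(self, root):
        self.img_paths = sorted(glob.glob(f"{root}/*.jpg"))

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        img = io.imread(self.img_paths[idx], as_gray=True)
        img = transform.resize(img, (60, 80)).astype(np.float32) - 0.5

        # The matching .asf file stores all 58 (x, y) pairs, already
        # normalized to the image size; parse_asf is a hypothetical helper
        # returning a (58, 2) float32 array.
        pts = parse_asf(self.img_paths[idx].replace(".jpg", ".asf"))
        nose = pts[NOSE_IDX]

        return torch.from_numpy(img).unsqueeze(0), torch.from_numpy(nose)
```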

Here is an example from the dataloader with the ground-truth nose keypoint:

[image missing]

I then wrote a CNN to predict the nose keypoint. Here is the architecture:

[image missing]
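
As a rough illustration, a small CNN for this task could be built from a few conv layers followed by fully connected layers regressing the (x, y) coordinate; the layer counts and channel widths below are assumptions, not my exact architecture.

```python
import torch.nn as nn

class NoseNet(nn.Module):
    """Takes a 1x60x80 grayscale image, outputs an (x, y) nose coordinate.
    Layer counts and channel widths are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 12, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 24, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),  # 60x80 pooled 3x -> 7x10
            nn.Linear(128, 2),
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```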

I used a learning rate of 1e-4 and trained the model for 25 epochs. Here are the training and validation MSE curves over the course of training:

[image missing]
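
For reference, a minimal training loop matching this setup could look like the sketch below; the choice of Adam as the optimizer is an assumption.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-4, device="cpu"):
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # Adam is an assumption
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        # One pass over the training set with gradient updates.
        model.train()
        train_loss = 0.0
        for imgs, pts in train_loader:
            imgs, pts = imgs.to(device), pts.to(device)
            opt.zero_grad()
            loss = loss_fn(model(imgs), pts)
            loss.backward()
            opt.step()
            train_loss += loss.item() * imgs.size(0)

        # Evaluate on the held-out validation set.
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for imgs, pts in val_loader:
                imgs, pts = imgs.to(device), pts.to(device)
                val_loss += loss_fn(model(imgs), pts).item() * imgs.size(0)

        print(f"epoch {epoch}: "
              f"train MSE {train_loss / len(train_loader.dataset):.5f}, "
              f"val MSE {val_loss / len(val_loader.dataset):.5f}")
```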

Here are 2 facial images in which the network detects the nose correctly:

[images missing]

Here are 2 images where the nose is detected incorrectly:

[images missing]

Based on these examples, the model has likely not yet learned to detect the nose when the subject is facing to the side. More training images of people facing in different directions would probably help it learn this.

Part 2: Full Facial Keypoints Detection

For this part, I wrote another custom dataloader for the same IMM dataset as in the previous part, but this time including all 58 facial keypoints instead of just the nose. I also incorporated data augmentation, in the form of random rotations and translations of the original images, during training.

Here is a sampled image from the dataloader visualized with ground-truth keypoints:

[image missing]

Here is an example of an augmented image:

[image missing]
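
A sketch of the augmentation step: the same random rotation and translation must be applied to both the image and its keypoints, which is easiest to keep consistent by building one explicit transform and applying it to both. The rotation and shift ranges below are assumptions.

```python
import random
import numpy as np
from skimage import transform

def augment(img, pts):
    """img: (H, W) float array; pts: (N, 2) array of (x, y) pixel coords.
    Applies the same random rotation (about the image center) and
    translation to both the image and the keypoints."""
    h, w = img.shape
    angle = np.deg2rad(random.uniform(-15, 15))    # range is an assumption
    tx, ty = random.randint(-10, 10), random.randint(-10, 10)

    center = np.array([w / 2, h / 2])
    # Forward transform: shift to origin, rotate, shift back, then translate.
    tform = (
        transform.SimilarityTransform(translation=-center)
        + transform.SimilarityTransform(rotation=angle)
        + transform.SimilarityTransform(translation=center + [tx, ty])
    )
    # warp expects the inverse map (output -> input coordinates),
    # while the keypoints take the forward map, so the two stay in sync.
    aug_img = transform.warp(img, tform.inverse)
    aug_pts = tform(pts)
    return aug_img, aug_pts
```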

I also wrote a CNN to predict the facial keypoints of an image. I trained the model for 35 epochs and used a learning rate of 1e-4. Here is the architecture of the model:

[image missing]
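
As a rough illustration, a model for this part differs from the Part 1 sketch mainly in its input size and its output head, which regresses all 58 points at once; the channel widths and the 120x160 input size below are assumptions.

```python
import torch.nn as nn

class FaceNet(nn.Module):
    """Takes a 1x120x160 grayscale image, outputs all 58 (x, y) keypoints.
    Channel widths and input size are illustrative assumptions."""

    def __init__(self, n_points=58):
        super().__init__()
        self.n_points = n_points
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 10, 256), nn.ReLU(),  # 120x160 pooled 4x -> 7x10
            nn.Linear(256, n_points * 2),
        )

    def forward(self, x):
        # Reshape to (batch, 58, 2) so predictions align with the labels.
        return self.fc(self.conv(x)).view(-1, self.n_points, 2)
```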

Here are 2 images where the facial keypoints are detected correctly:

[images missing]

Here are 2 images where the facial keypoints are detected incorrectly:

[images missing]

These predictions are likely off because the model has not yet fully learned to handle faces that are turned or tilted to one side. More data, with augmentations that produce such examples, would likely help.

Here are the visualized learned filters of the first convolutional layer:

[image missing]
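
For reference, these visualizations can be produced by reading the weights straight out of the first convolutional layer; a minimal matplotlib sketch, assuming the layer is reachable as model.conv[0] as in the sketch above:

```python
import matplotlib.pyplot as plt

def show_first_layer_filters(model):
    # Weights of the first conv layer, shape (out_channels, 1, kH, kW).
    weights = model.conv[0].weight.data.cpu()
    n = weights.shape[0]
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
    for i, ax in enumerate(axes):
        ax.imshow(weights[i, 0].numpy(), cmap="gray")
        ax.axis("off")
    plt.show()
```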

Part 3: Train with Larger Dataset

In this part, I load the data from the larger iBUG face-in-the-wild dataset. Since the provided bounding boxes were sometimes out of bounds, or the keypoints did not fit inside them, I wrote a function that builds a custom bounding box for each image from its keypoints. I also used the same data augmentation techniques (rotations and translations) during training.
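
A sketch of how such a bounding-box function can work: take the min and max of the keypoint coordinates, pad the box so the whole face fits, and clip it to the image bounds. The padding ratio below is an assumption.

```python
import numpy as np

def bbox_from_keypoints(pts, img_w, img_h, pad=0.25):
    """pts: (68, 2) array of (x, y) pixel coordinates.
    Returns (left, top, right, bottom) containing all keypoints,
    expanded by `pad` on each side and clipped to the image."""
    left, top = pts.min(axis=0)
    right, bottom = pts.max(axis=0)
    w, h = right - left, bottom - top
    left = max(0, int(left - pad * w))
    top = max(0, int(top - pad * h))
    right = min(img_w, int(right + pad * w))
    bottom = min(img_h, int(bottom + pad * h))
    return left, top, right, bottom
```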

I used ResNet18 as my CNN model, with the necessary modifications to the first and last layers. I trained the model for 20 epochs with a learning rate of 1e-4. Here is the architecture of the model:

[image missing]
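
The modifications amount to swapping the input convolution and the final fully connected layer; a sketch, assuming grayscale (single-channel) inputs and the 68 iBUG keypoints:

```python
import torch.nn as nn
import torchvision.models as models

def make_keypoint_resnet(n_points=68):
    model = models.resnet18()
    # First layer: accept a 1-channel grayscale input instead of RGB.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                            padding=3, bias=False)
    # Last layer: regress (x, y) for all 68 keypoints.
    model.fc = nn.Linear(model.fc.in_features, n_points * 2)
    return model
```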

Here is the training and validation loss across iterations:

[image missing]

Here are some test-set images with the predicted keypoints:

[images missing]

Here are the trained model's predictions on 3 photos from my collection:

[images missing]

For each of these 3 images, the model doesn't get the keypoints exactly right, but they are reasonably close. I think training the model for many more epochs would make it more accurate, and it would also be useful to try more data augmentations.
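
For reference, running the model on a new photo just repeats the training-time preprocessing and maps the predictions back to the original resolution; a sketch, assuming a grayscale 224x224 input and predictions in the resized image's pixel coordinates:

```python
import numpy as np
import torch
from skimage import io, transform

def predict_keypoints(model, path, input_size=224, device="cpu"):
    img = io.imread(path, as_gray=True)
    h, w = img.shape
    # Same preprocessing as training: resize, center pixel values around 0.
    x = transform.resize(img, (input_size, input_size)).astype(np.float32) - 0.5
    x = torch.from_numpy(x).unsqueeze(0).unsqueeze(0).to(device)  # 1x1xHxW
    model.eval()
    with torch.no_grad():
        pts = model(x).view(-1, 2).cpu().numpy()
    # Rescale from resized-image coordinates to the original resolution.
    pts[:, 0] *= w / input_size
    pts[:, 1] *= h / input_size
    return pts
```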

My Kaggle username is NH. The submission has a loss of 47.42550.