CS 194-26 Project 4: Facial Keypoint Detection with Neural Networks

Overview

In this project I trained convolutional neural networks (CNNs) to automatically detect facial keypoints. Part 1 detects only the nose tip, while Part 2 detects the full set of facial keypoints.

Part 1: Nose Tip Detection

In this section, I trained a CNN to detect the nose tip keypoint in images from the IMM Face Database. I trained on 192 images and used the remaining 48 images to validate my model. Below are some example images with their ground truth nose keypoint labeled by a green dot.
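To make the setup concrete, here is a minimal PyTorch Dataset sketch of one way to load this data. The 80x60 grayscale resolution, the [-0.5, 0.5] pixel normalization, and keypoints pre-normalized to [0, 1] are assumptions of the sketch, not necessarily the exact preprocessing used here.

    import cv2
    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class NoseDataset(Dataset):
        # image_paths and keypoints are parallel lists; each keypoint is an
        # (x, y) pair normalized to [0, 1] relative to the image size.
        def __init__(self, image_paths, keypoints, out_size=(80, 60)):
            self.image_paths = image_paths
            self.keypoints = keypoints
            self.out_size = out_size  # (width, height); an assumed working size

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            img = cv2.imread(self.image_paths[idx], cv2.IMREAD_GRAYSCALE)
            img = cv2.resize(img, self.out_size)
            # Zero-center pixels to [-0.5, 0.5] so the network trains stably.
            img = img.astype(np.float32) / 255.0 - 0.5
            img = torch.from_numpy(img).unsqueeze(0)  # shape (1, H, W)
            kp = torch.tensor(self.keypoints[idx], dtype=torch.float32)
            return img, kp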



After experimenting with different kernel sizes, output sizes, and numbers of convolutional layers, I found the following CNN structure worked best. I trained my CNN with a learning rate of 0.0005 for 25 epochs with a batch size of 1.
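For a concrete picture, below is a minimal PyTorch sketch of a network in this style. The channel counts, kernel sizes, and 80x60 input resolution are illustrative assumptions; the exact structure I used is the one shown above.

    import torch.nn as nn

    class NoseNet(nn.Module):
        # Illustrative 3-conv-layer regressor for a single (x, y) keypoint.
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 12, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(12, 16, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.regressor = nn.Sequential(
                nn.Flatten(),
                # 32 * 5 * 8 is the flattened feature size for a 60x80 input.
                nn.Linear(32 * 5 * 8, 128), nn.ReLU(),
                nn.Linear(128, 2),  # (x, y) nose-tip prediction in [0, 1]
            )

        def forward(self, x):
            return self.regressor(self.features(x))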

Below are some of the validation outputs of my model. The green dots represent the ground truth keypoints and the red dots represent my model's predictions.





One reason my model had trouble with the two faces above is that the subjects are not looking directly at the camera. Most of the training data has the subject looking straight at the camera (as seen in the success cases); my model therefore didn't see enough examples of people looking in different directions or tilting their faces, which resulted in incorrect predictions in these cases.

Below is a graph of my training and validation losses.
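For reference, here is a hedged sketch of a training loop that records per-epoch curves like these. MSE loss on the normalized coordinates and the Adam optimizer are assumptions of the sketch.

    import matplotlib.pyplot as plt
    import torch
    import torch.nn as nn

    def train_and_track(model, train_loader, val_loader, epochs=25, lr=5e-4):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        train_losses, val_losses = [], []
        for _ in range(epochs):
            model.train()
            running = 0.0
            for img, kp in train_loader:
                opt.zero_grad()
                loss = loss_fn(model(img), kp)
                loss.backward()
                opt.step()
                running += loss.item()
            train_losses.append(running / len(train_loader))

            model.eval()
            with torch.no_grad():
                running = sum(loss_fn(model(img), kp).item()
                              for img, kp in val_loader)
            val_losses.append(running / len(val_loader))
        return train_losses, val_losses

    # Plot both curves on one axis:
    # tr, va = train_and_track(NoseNet(), train_loader, val_loader)
    # plt.plot(tr, label="train"); plt.plot(va, label="validation")
    # plt.xlabel("epoch"); plt.ylabel("MSE loss"); plt.legend(); plt.show()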

Part 2: Full Facial Keypoints Detection

In this section, I trained a CNN to detect the full set of facial keypoints in images from the IMM Face Database. To prevent overfitting to the relatively small training set, I first augmented the data by randomly adjusting the brightness of each image and adding the result to my set, effectively doubling the size of my training and validation sets. I trained on 384 images and used the remaining 96 images to validate my model. Below are some example images with their ground truth keypoints labeled by green dots.
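A minimal sketch of this brightness augmentation, assuming images stored as float tensors in [0, 1] before zero-centering; the jitter range [0.7, 1.3] is an assumed choice. Since a brightness change moves no pixels, the keypoints are reused unchanged.

    import random
    import torch

    def random_brightness(img, lo=0.7, hi=1.3):
        # Scale all pixels by a random factor and clamp back to [0, 1].
        factor = random.uniform(lo, hi)
        return (img * factor).clamp(0.0, 1.0)

    # Double the dataset: keep each original plus one jittered copy.
    # base_dataset is any iterable of (image, keypoints) pairs, e.g. a
    # Dataset like the one sketched in Part 1.
    augmented = []
    for img, kp in base_dataset:
        augmented.append((img, kp))
        augmented.append((random_brightness(img), kp))  # keypoints unchanged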



After experimenting with different kernel sizes, output sizes, and numbers of convolutional layers, I found the following CNN structure worked best. I trained my CNN with a learning rate of 0.0015 for 20 epochs with a batch size of 1.
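The main structural change from Part 1 is the output head: instead of one (x, y) pair, the network regresses all 58 IMM landmarks at once. A hedged sketch with illustrative layer sizes (nn.LazyLinear simply infers the flattened feature size on the first forward pass):

    import torch.nn as nn

    NUM_POINTS = 58  # the IMM annotation has 58 landmarks per face

    class FaceNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.regressor = nn.Sequential(
                nn.Flatten(),
                nn.LazyLinear(256), nn.ReLU(),
                nn.Linear(256, 2 * NUM_POINTS),  # 58 (x, y) pairs
            )

        def forward(self, x):
            # Reshape to (batch, 58, 2) so the loss compares point-by-point.
            return self.regressor(self.features(x)).view(-1, NUM_POINTS, 2)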

Below are some of the validation outputs of my model. The green dots represent the ground truth keypoints and the red dots represent my model's predictions.





My model had trouble with the first failure face because this person's face is much narrower than the other subjects' faces. I believe my model did not see enough narrow faces during training, so it had trouble placing the points properly for this face. My model struggled with the second failure face because this person isn't looking at the camera, but is facing slightly to the left. Most of the training data has the subject looking straight at the camera (as seen in the success cases); I believe my model therefore didn't have enough training data for people looking in different directions or tilting their faces, which resulted in incorrect predictions in these cases.

Below is a graph of my training and validation losses.

Below is a visualization of some of the filters from my trained model. I have included the first 5 filters from each of the first 3 convolutional layers (in order). Note: the filter images look blurred because they have been upscaled to a larger size for display. To see the actual filters, see my iPython Notebook.
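For completeness, here is a short sketch of how filters like these can be pulled from a trained model and displayed, assuming the conv layers live in a features Sequential as in the earlier sketches. Using nearest-neighbor interpolation when displaying avoids exactly the blurring-on-upscale effect noted above.

    import matplotlib.pyplot as plt

    def show_filters(conv_layer, n=5):
        # Weight tensor has shape (out_channels, in_channels, kH, kW);
        # show the first input channel of the first n kernels.
        weights = conv_layer.weight.data.cpu()
        fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
        for i, ax in enumerate(axes):
            ax.imshow(weights[i, 0].numpy(), cmap="gray",
                      interpolation="nearest")
            ax.axis("off")
        plt.show()

    # e.g. show_filters(model.features[0]) for the first conv layer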





Parts 1 and 2 took me a lot longer than expected, so I unfortunately did not have time to implement Part 3 :(