CS194-26

Proj 4: Facial Keypoint Detection with Neural Networks

Overview

In this project, we experiment with using neural networks to detect facial keypoints. Specifically, we train CNNs using PyTorch, the IMM Face dataset, and the iBUG 300-W Face Landmark dataset.

Part 1: Nose Tip Detection

I wrote a torch Dataset (wrapped in a DataLoader) to load images from the IMM Face dataset. For each image, I convert it to grayscale, normalize the pixel values to the range [-0.5, 0.5], resize it to 80x60, and then divide each pixel value by the standard deviation of the image. Below I plot a sample of images, along with the corresponding nose tips.
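A minimal sketch of this loading and preprocessing step is below. It assumes the images are read with skimage and that the nose tip coordinates have already been parsed into a normalized (N, 2) array; the class and variable names are hypothetical, not the exact project code.

```python
# Sketch of the Part 1 data pipeline (file layout and helper names are assumptions).
import numpy as np
import skimage.color
import skimage.io as skio
import skimage.transform
import torch
from torch.utils.data import Dataset, DataLoader

class NoseTipDataset(Dataset):
    def __init__(self, image_paths, nose_tips):
        self.image_paths = image_paths   # list of IMM image paths
        self.nose_tips = nose_tips       # (N, 2) array of nose tips, normalized to [0, 1]

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = skio.imread(self.image_paths[idx])
        img = skimage.color.rgb2gray(img)                 # grayscale, values in [0, 1]
        img = img.astype(np.float32) - 0.5                # shift to [-0.5, 0.5]
        img = skimage.transform.resize(img, (60, 80))     # resize to 80x60 (width x height)
        img = img / img.std()                             # divide by the image's std
        img = torch.from_numpy(img).float().unsqueeze(0)  # add channel dim -> (1, 60, 80)
        label = torch.from_numpy(self.nose_tips[idx]).float()
        return img, label

# loader = DataLoader(NoseTipDataset(paths, tips), batch_size=8, shuffle=True)
```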

Training a simple CNN on the dataset for 25 iterations (batch size of 8, learning rate of 0.0005, MSE loss) resulted in the following training / validation losses. The final iteration achieved a validation loss of 0.00186.
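The training loop itself is standard; a minimal sketch is below. The write-up does not specify the optimizer, so Adam is assumed here, and the model argument stands in for the simple CNN.

```python
# Hedged sketch of the training/validation loop described above.
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=5e-4):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is an assumption
    train_losses, val_losses = [], []
    for epoch in range(epochs):
        model.train()
        total = 0.0
        for imgs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), labels)
            loss.backward()
            optimizer.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))

        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(imgs), labels).item()
                      for imgs, labels in val_loader) / len(val_loader)
        val_losses.append(val)
    return train_losses, val_losses
```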

Below are a few results from evaluation on a batch from the validation set. The ground-truth labels are displayed as the green dots, and the model prediction is indicated by the red dot. The model performs reasonably well for many of the images; however, for a few (e.g., images 1, 3, and 6), the predictions are much further off. For image 1, the error may be due to the raised hands, and for image 6, the low contrast between the face and the background, along with the turned, off-center head, may have caused the model to predict incorrectly.

Part 2: Full Facial Keypoints Detection

I augmented the dataset for this part with the following transformations:

  • Random rotation in the range [-15, 15] degrees
  • Random horizontal translation in the range [-10, 10] pixels

I reflected the image at the borders to reduce the harshness of the edges introduced by the rotations and translations. As before, below is a random batch of the augmented faces, plotted alongside the facial keypoints.
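A sketch of one way to implement these augmentations with skimage is below. It assumes the keypoints are stored as (x, y) pixel coordinates; the function name is hypothetical, and skimage's transform objects are used so the image and keypoints stay consistent.

```python
# Hedged sketch of the rotation + horizontal-translation augmentation with reflected borders.
import numpy as np
from skimage.transform import AffineTransform, warp

def augment(image, keypoints, max_angle=15, max_shift=10):
    """image: (H, W) float array; keypoints: (K, 2) array of (x, y) pixel coordinates."""
    h, w = image.shape[:2]
    angle = np.deg2rad(np.random.uniform(-max_angle, max_angle))
    tx = np.random.uniform(-max_shift, max_shift)

    center = np.array([(w - 1) / 2, (h - 1) / 2])
    # rotate about the image center, then translate horizontally
    tform = (AffineTransform(translation=-center)
             + AffineTransform(rotation=angle)
             + AffineTransform(translation=center)
             + AffineTransform(translation=(tx, 0)))

    # warp() takes the inverse mapping (output coords -> input coords);
    # mode='reflect' mirrors the image at the borders instead of filling with black
    warped = warp(image, tform.inverse, mode='reflect')
    new_keypoints = tform(keypoints)  # apply the same forward transform to the keypoints
    return warped, new_keypoints
```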

The model architecture I used for this part is as follows:

  • 5x Conv layers: each with 12 filters, kernel size = (3, 3)
  • Linear layer with input size = 180, output size = 256
  • Linear layer with input size = 256, output size = 116

After each convolutional layer there is a ReLU activation and a max pooling layer. I also pass the output of the first linear layer through a ReLU before the second linear layer.
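A hedged sketch of this architecture is below. The flattened feature size of 180 is consistent with roughly 180x240 grayscale inputs passed through unpadded 3x3 convolutions and 2x2 max pooling, but the exact input resolution is an assumption.

```python
# Sketch of the Part 2 model described above; layer names are hypothetical.
import torch.nn as nn
import torch.nn.functional as F

class KeypointNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 5 conv layers, each with 12 filters and 3x3 kernels
        self.convs = nn.ModuleList([
            nn.Conv2d(1 if i == 0 else 12, 12, kernel_size=3) for i in range(5)
        ])
        self.fc1 = nn.Linear(180, 256)
        self.fc2 = nn.Linear(256, 116)  # 58 keypoints * 2 coordinates

    def forward(self, x):
        for conv in self.convs:
            x = F.max_pool2d(F.relu(conv(x)), 2)  # ReLU then 2x2 max pool after each conv
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```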

Training the model for 100 iterations with a batch size of 8 and a learning rate of 0.001 yields the following training losses (final validation loss of 0.00197):

Below are a few results from evaluation on a batch from the validation set. As before, the ground-truth labels are displayed as the green dots, and the model prediction is indicated by the red dot.

Images 1 and 7 highlight a couple of failure cases for the model. For image 1, the shadows along the bottom-left part of the chin may have made it difficult for the model to accurately label the keypoints representing those facial landmarks. The failures in image 7 reflect the model's weakness in accurately predicting landmarks for images with significant rotation or turned heads.

Below I visualize the filters learned for each layer of the model.
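The visualizations can be produced by plotting each convolutional layer's weights directly; a small sketch (following the layer names in the architecture sketch above) is:

```python
# Sketch of visualizing a conv layer's learned filters with matplotlib.
import matplotlib.pyplot as plt

def show_filters(conv_layer):
    weights = conv_layer.weight.data.cpu().numpy()  # (out_channels, in_channels, 3, 3)
    fig, axes = plt.subplots(1, weights.shape[0], figsize=(2 * weights.shape[0], 2))
    for i, ax in enumerate(axes):
        ax.imshow(weights[i, 0], cmap='gray')       # show each filter's first input channel
        ax.axis('off')
    plt.show()
```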

Part 3: Train With Larger Dataset

Using the iBUG 300-W dataset, I adjust the images (and keypoints) by:

  • Increasing the size of the facial bounding boxes by 1.3x on each side
  • Cropping the images by the adjusted bounding box
  • Augmenting the data using the same augmentations as used in Part 2
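A sketch of the bounding-box expansion and crop is below; the (x, y, w, h) box format and the normalization of the keypoints to [0, 1] are assumptions about the loader, and the function name is hypothetical.

```python
# Hedged sketch of expanding the face box by 1.3x and cropping image + keypoints.
import numpy as np

def expand_and_crop(image, keypoints, box, scale=1.3):
    """box: (x, y, w, h) face bounding box; keypoints: (68, 2) array of (x, y) pixel coords."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    new_w, new_h = w * scale, h * scale

    # expanded box, clipped to the image bounds
    x0 = int(max(cx - new_w / 2, 0))
    y0 = int(max(cy - new_h / 2, 0))
    x1 = int(min(cx + new_w / 2, image.shape[1]))
    y1 = int(min(cy + new_h / 2, image.shape[0]))

    crop = image[y0:y1, x0:x1]
    # shift keypoints into the crop's coordinate frame, then normalize to [0, 1]
    kp = (keypoints - np.array([x0, y0])) / np.array([x1 - x0, y1 - y0])
    return crop, kp
```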

For this part, I simply used the ResNet18 implementation provided in PyTorch (torchvision). I replaced the final fully-connected layer of ResNet18 with a linear layer with an output size of 136 (68 keypoints × 2 coordinates) to match the dataset. All other parts of the model were kept as-is.
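This swap is a one-liner in torchvision; whether pretrained weights were used is not stated above, so the default untrained weights are shown.

```python
# Replace ResNet18's final fully-connected layer with a 136-output layer
# (68 keypoints * 2 coordinates), as described above.
import torch.nn as nn
import torchvision.models as models

model = models.resnet18()  # untrained weights by default; pretrained usage is not specified
model.fc = nn.Linear(model.fc.in_features, 136)  # in_features is 512 for ResNet18
```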

I trained the model for 30 epochs with a batch size of 32 and a learning rate of 0.001. The training and validation losses flatten out quite quickly (the final iteration had a validation loss of 0.000973):

The mean absolute error for this architecture was 11.84516. Below are some of the model's results on the test set:
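For reference, a hedged sketch of how the reported mean absolute error could be computed, assuming predictions are mapped back to pixel coordinates in the original (uncropped) images; the exact evaluation convention and helper names are assumptions.

```python
# Hypothetical scoring helpers: map normalized predictions back to original-image pixel
# coordinates, then average the absolute error over all keypoints and images.
import numpy as np

def to_original_coords(pred_normalized, crop_box):
    """pred_normalized: (68, 2) keypoints in [0, 1] relative to the crop;
    crop_box: (x0, y0, x1, y1) of the crop within the original image."""
    x0, y0, x1, y1 = crop_box
    scale = np.array([x1 - x0, y1 - y0])
    return pred_normalized * scale + np.array([x0, y0])

def mean_absolute_error(preds, targets):
    """preds, targets: (N, 68, 2) arrays of keypoint coordinates in pixels."""
    return np.abs(preds - targets).mean()
```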

Below I evaluate the model on some images outside of the dataset. The model performs fairly poorly on the images of Obama and of the baby, but manages to do okay on Morpheus. For the first two images, the face cropping is not as tight as the cropping in the iBUG dataset, which could be one reason for the model's worse performance. Even on Morpheus, since he is wearing dark glasses over his eyes and the training data lacks images like these, the model isn't able to predict his eye keypoints as accurately as the rest of his face.