CS 194 Fall 2020

Project 4: Facial Keypoint Detection with Neural Networks!


Leon Ming

Overview

In this project, we attempt to detect the positions of a set of facial keypoints with neural networks in PyTorch.

  1. Part 1: Nose Tip Detection
  2. Part 2: Full Facial Keypoints Detection
  3. Part 3: Train With Larger Dataset

Part 1: Nose Tip Detection

After writing the data loading logic, I want to quickly check that it's working. Below is a sample of the dataset with the ground-truth nose keypoint.
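To make this concrete, here is a rough sketch of what the data loading looks like, along with the one-sample plot used for the sanity check. The details (keypoints pre-parsed into normalized (x, y) pairs, an 80x60 input size) are just for illustration rather than an exact record of my code.

```python
# Minimal sketch of the nose-tip Dataset plus a one-sample sanity-check plot.
# The (image_path, (x, y)) sample format and the 80x60 size are assumptions.
import torch
import matplotlib.pyplot as plt
from skimage import io, transform
from torch.utils.data import Dataset


class NoseTipDataset(Dataset):
    def __init__(self, samples, out_size=(60, 80)):
        # samples: list of (image_path, (x, y)) with x, y normalized to [0, 1]
        self.samples = samples
        self.out_size = out_size  # (height, width)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, (x, y) = self.samples[idx]
        img = io.imread(path, as_gray=True).astype("float32") - 0.5  # roughly [-0.5, 0.5]
        img = transform.resize(img, self.out_size)
        img = torch.from_numpy(img).unsqueeze(0).float()   # shape (1, H, W)
        kpt = torch.tensor([x, y], dtype=torch.float32)     # normalized nose tip
        return img, kpt


def show_sample(dataset, idx=0):
    # Draw one image with its ground-truth nose keypoint on top.
    img, kpt = dataset[idx]
    h, w = img.shape[1:]
    plt.imshow(img.squeeze(0), cmap="gray")
    plt.scatter([kpt[0] * w], [kpt[1] * h], c="lime", s=30)
    plt.show()
```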


Many iterations of training later, I observe that my training and validation loss have both reached a level I'm satisfied with. I could probably have trained for more epochs to achieve more stability.
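For reference, training follows the standard PyTorch recipe: MSE loss on the predicted coordinates and an Adam optimizer. A minimal sketch is below; the learning rate and epoch count shown there are placeholders rather than the exact values I settled on.

```python
# Minimal sketch of the training/validation loop: MSE loss on keypoint
# coordinates, gradients enabled only when an optimizer is passed in.
import torch
import torch.nn as nn


def run_epoch(model, loader, optimizer=None, device="cpu"):
    criterion = nn.MSELoss()
    training = optimizer is not None
    model.train(training)
    total = 0.0
    with torch.set_grad_enabled(training):
        for imgs, kpts in loader:
            imgs, kpts = imgs.to(device), kpts.to(device)
            preds = model(imgs)
            loss = criterion(preds, kpts)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * imgs.size(0)
    return total / len(loader.dataset)


# Placeholder usage (learning rate and epoch count are illustrative):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for epoch in range(25):
#     train_loss = run_epoch(model, train_loader, optimizer)
#     val_loss = run_epoch(model, val_loader)
```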


Below is a visualization of a few samples from the validation set. As you can see, the model's prediction is quite close in most cases, especially when the subject's face is front-facing. One extreme example of an inaccurate prediction can be found in the third row, left image. I suspect the error here is at least partially due to the subject's lack of a hairline, as well as the angled position of their head.


Part 2: Full Facial Keypoints Detection

Like before, I want to quickly check that data loading is working. Below is a sample of the dataset with the ground-truth facial keypoints. Note that this dataset is augmented, with 5 differently angled versions of each original image.
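The augmentation is a rotation applied consistently to both the image and its keypoints. A rough sketch of the idea is below; the specific angles and the normalized-coordinate convention are just for illustration, not a record of my exact pipeline.

```python
# Rough sketch of rotation augmentation: rotate the image, then apply the same
# rotation (about the image center, in pixel coordinates) to the keypoints.
# Keypoints are assumed to be normalized (x, y) in [0, 1]; angles are placeholders.
import math
import torch
import torchvision.transforms.functional as TF


def rotate_sample(img, kpts, angle_deg):
    # img: (1, H, W) tensor; kpts: (N, 2) tensor of normalized (x, y) points.
    _, h, w = img.shape
    rotated_img = TF.rotate(img, angle_deg)  # counter-clockwise, about the center

    theta = math.radians(angle_deg)
    cx, cy = w / 2, h / 2
    x = kpts[:, 0] * w - cx
    y = kpts[:, 1] * h - cy
    # Counter-clockwise rotation in image coordinates (y axis points down).
    new_x = math.cos(theta) * x + math.sin(theta) * y + cx
    new_y = -math.sin(theta) * x + math.cos(theta) * y + cy
    return rotated_img, torch.stack([new_x / w, new_y / h], dim=1)


# e.g. a few angled copies per original (angles here are placeholders):
# for angle in (-15, -10, 5, 10, 15):
#     aug_img, aug_kpts = rotate_sample(img, kpts, angle)
```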


After 20 epochs, I observe that my training and validation loss are both relatively stable. The model architecture is as follows:

  • A conv layer of 32 filters with height/width of 7x7, depth of 1, stride of 1x1
  • A conv layer of 64 filters with height/width of 7x7, depth of 32, stride of 1x1
  • A conv layer of 128 filters with height/width of 5x5, depth of 64, stride of 1x1
  • A conv layer of 256 filters with height/width of 5x5, depth of 128, stride of 1x1
  • A conv layer of 512 filters with height/width of 3x3, depth of 256, stride of 1x1
  • A dense layer with 20480 input features and 256 output features
  • A dense layer with 256 input features and 116 output features (two coordinates for each of the 58 keypoints)
  • Each of the conv layers is followed by a relu layer and a 2x2 max pool layer, except for the 5th conv layer, which has no max pool. Additionally, the first dense layer is followed by a relu layer and a dropout layer with p=0.4; the dropout helps prevent overfitting, and it worked quite well here.

Also, I used a batch size of 8 and a learning rate of 0.0001 while training.
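A sketch of this architecture in PyTorch is below. The channel counts, kernel sizes, dropout, and the 20480 → 256 → 116 head match the list above; the padding choice and the input resolution that yields the 20480-dimensional flattened vector are filled in just for illustration.

```python
# Sketch of the part 2 architecture described above. The 'same'-style padding
# and the input resolution that produces 20480 flattened features are assumptions.
import torch
import torch.nn as nn


class FaceKeypointNet(nn.Module):
    def __init__(self):
        super().__init__()

        def block(c_in, c_out, k, pool=True):
            layers = [nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool2d(2))
            return layers

        self.features = nn.Sequential(
            *block(1, 32, 7),                  # conv1: 32 filters, 7x7
            *block(32, 64, 7),                 # conv2: 64 filters, 7x7
            *block(64, 128, 5),                # conv3: 128 filters, 5x5
            *block(128, 256, 5),               # conv4: 256 filters, 5x5
            *block(256, 512, 3, pool=False),   # conv5: 512 filters, 3x3, no max pool
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(20480, 256),  # 20480 = 512 * (spatial size after the conv stack)
            nn.ReLU(),
            nn.Dropout(p=0.4),      # helps prevent overfitting
            nn.Linear(256, 116),    # 58 keypoints * 2 coordinates
        )

    def forward(self, x):
        return self.head(self.features(x))


# Training setup from above: batch size 8, learning rate 1e-4.
# loader = torch.utils.data.DataLoader(train_set, batch_size=8, shuffle=True)
# optimizer = torch.optim.Adam(FaceKeypointNet().parameters(), lr=1e-4)
```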


Below is a visualization of a few samples from the validation set. Similar to part 1, the model result is most accurate visually when the subject's face is front-facing. Interestingly, the same bald subject (row 4, right image) also has the worst results in this part. As before, I suspect the error could be due to the hairline, as the data does not represent his baldness very well. Another failure case is the second-to-last row, left image. This person's mouth was open, which I suspect is partially why the predictions for the jawline are completely off.


Now, let's see how the filters look in the first conv layer.
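Plotting them only requires pulling the weight tensor out of the first conv layer and drawing each 7x7 kernel as a small image, roughly like this (assuming the layer layout from the sketch above, where the first conv sits at model.features[0]):

```python
# Quick sketch of the filter visualization: show each of the 32 first-layer
# 7x7 kernels as a small grayscale image.
import matplotlib.pyplot as plt

weights = model.features[0].weight.detach().cpu()  # shape (32, 1, 7, 7)
fig, axes = plt.subplots(4, 8, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(weights[i, 0], cmap="gray")
    ax.axis("off")
plt.show()
```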


Part 3: Train With Larger Dataset

As before, I quickly check that data loading is working. This dataset is augmented less than the dataset in part 2: for each original image, we create 3 different rotations.


After 9 epochs, my training and validation loss both seem to flatten out. The model architecture is essentially resnet18, with small modifications to fit my input data and output size. Resnet18 consists of a 7x7 stem conv followed by a series of conv layers whose filter counts range from 64 to 512, with 3x3 kernels (and 1x1 kernels for the downsampling shortcuts) and strides of 1x1 or 2x2. Conv layers are followed by batch norm layers and occasionally relu layers. Right before the dense (last) layer, there is an average pool layer. The changes I made are the following, sketched in code after this list:

  • Change the top-most conv layer to use filters of depth 1 instead of 3 (since our images are grayscale).
  • Change the dense layer to output 68*2 = 136 features, since we want 68 (x, y) coordinate pairs.
  • I used a batch size of 16 and a learning rate of 0.0001 during training.
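In code, these modifications amount to a couple of lines on top of torchvision's resnet18, roughly as follows (whether to start from pretrained weights isn't specified above, so treat that flag as a placeholder):

```python
# Sketch of the part 3 model: torchvision's resnet18 with a 1-channel stem conv
# (grayscale input) and a 68*2-output dense layer.
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=False)  # pretrained flag is a placeholder
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 68 * 2)

# Training setup from above: batch size 16, learning rate 1e-4.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```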


Some visualizations from the validation set, with both ground-truth and predicted keypoints.


Some visualizations from the test set, with predicted keypoints.


Some visualizations from my own photo album. The model is fairly accurate in these photos, the exception being the photo where I am wearing a mask. This is to be expected, as the model was likely not trained on any photos of faces with masks. I also tried it on an image of my friend, and it worked well there too.