CS194-26 FA21 Project 5: Facial Keypoint Detection with Neural Networks

By Austin Patel

Overview

The goal of this project is to detect facial keypoints using convolutional neural networks. I start with nose keypoint detection only, then move to detecting the full set of 58 facial keypoints in the IMM dataset. Finally, I apply a modified ResNet to do facial keypoint detection (68 keypoints) on the Ibug dataset.

Results

Part 1

I load images from the IMM Face dataset and annotate them with just the nose keypoint. Here are some samples from the training dataset. Note that for all results shown below, green points correspond to ground truth and red points are my predictions from the neural nets. Also note that the training and validation losses in the figures are normalized by the number of samples in the training and validation sets, which makes it easy to compare train and validation error even when the two sets have different sizes.
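Concretely, this normalization might be computed like the following sketch (assuming an MSE objective; the loader and model names are placeholders, not the project's exact code):

import torch

# Per-sample loss normalization as described above: sum the squared error
# over a whole split, then divide by the split size, so training and
# validation losses are directly comparable.
criterion = torch.nn.MSELoss(reduction="sum")

def per_sample_loss(model, loader):
    model.eval()
    total = 0.0
    with torch.no_grad():
        for images, keypoints in loader:
            preds = model(images)
            total += criterion(preds, keypoints.reshape(preds.shape)).item()
    return total / len(loader.dataset)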

I then built a convolutional neural net. I tried three different designs and present the results below. Here are the results for the first net. The training and validation losses converge very quickly (within 5 epochs) and then plateau without much further improvement. Perhaps decreasing the learning rate would have slowed convergence (and hopefully led to better results). For all conv layers across all three designs, I apply a ReLU followed by a maxpool (size 2, stride 1) to the outputs; see the forward-pass sketch after the printout below.

Neural Net design for the following figures:

Learning rate: 0.001

NoseKeypointNet1(
  (conv1): Conv2d(1, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=102144, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=2, bias=True)
)
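Since the printout above omits the activations, here is a sketch of how the forward pass might be wired. The 60x80 grayscale input size and the ReLU between the fully connected layers are assumptions; the input size is at least consistent with fc1's 102144 = 24 * 56 * 76 input features, given that each size-2, stride-1 maxpool shrinks height and width by one pixel.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 24, 3, padding=1)
        self.conv2 = nn.Conv2d(24, 24, 3, padding=1)
        self.conv3 = nn.Conv2d(24, 24, 3, padding=1)
        self.conv4 = nn.Conv2d(24, 24, 3, padding=1)
        self.fc1 = nn.Linear(24 * 56 * 76, 128)  # 102144 features, per the printout
        self.fc2 = nn.Linear(128, 2)             # (x, y) of the nose keypoint

    def forward(self, x):  # x: (B, 1, 60, 80) grayscale, an assumed input size
        for conv in (self.conv1, self.conv2, self.conv3, self.conv4):
            # conv -> ReLU -> size-2, stride-1 maxpool, per the description above
            x = F.max_pool2d(F.relu(conv(x)), kernel_size=2, stride=1)
        x = x.flatten(1)
        return self.fc2(F.relu(self.fc1(x)))  # ReLU between fc layers is an assumption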

Performance on the training data is not that great... the net tends to predict the center of the image as the nose keypoint location. This works well when people are looking forward, but not as well when they look to the side. I think this is because we are not doing data augmentation and all of the faces in the training data are centered.

Now for results on the validation data: the net performed poorly on the first row of results (all the same person). Perhaps there were not many similar-looking people in the training set, so the network generalized poorly here. Row 3 Col 1 and Row 5 Col 1 are examples where the network performed well!

Now for the second net. Instead of giving all convolutions a kernel size of 3, I changed the sizes to 7, 5, 5, 3, hoping that larger filters would give better performance. Here are the results:

Neural Net design for the following figures:

Learning rate: 0.001

NoseKeypointNet2(
  (conv1): Conv2d(1, 24, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
  (conv2): Conv2d(24, 24, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(24, 24, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv4): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=102144, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=2, bias=True)
)

Compared to the first net, this net seems better at predicting the nose location when the head is turned to the side (Row 6 Col 2, Row 3 Col 2).

For the third net I changed the convolution kernels back to 3, 3, 3, 3 but increased the channel count from 24 to 32 for all convolutions. The results are pretty similar to the first two networks (perhaps a minor improvement, but the validation loss is about the same).

Neural Net design for the following figures:

Learning rate: 0.001

NoseKeypointNet3(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=136192, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=2, bias=True)
)

Part 2

Now I want to detect all 58 facial keypoints, not just the nose. I also do data augmentation so the network does not overfit to the training data and sees a wider variety of poses during training. I apply +/- 15 degree rotation, +/- 10 pixel translation in the x/y directions, and color jitter (brightness=0.5). Some examples of augmented training data are shown below.
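Here is a minimal sketch of such an augmentation, applying one affine transform to both the image and its keypoints so they stay aligned (my reconstruction with OpenCV, not the project's exact code):

import cv2
import numpy as np

def augment(image, keypoints, max_rot=15, max_shift=10, brightness=0.5):
    """Sketch of the augmentation described above. `image` is an HxW float
    array in [0, 1]; `keypoints` is (N, 2) in (x, y) pixel coordinates."""
    h, w = image.shape[:2]
    angle = np.random.uniform(-max_rot, max_rot)
    tx, ty = np.random.uniform(-max_shift, max_shift, size=2)

    # One forward affine matrix: rotation about the image center plus a shift.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    M[0, 2] += tx
    M[1, 2] += ty

    # warpAffine applies M to the pixels; applying the same M to the
    # keypoints keeps the annotations aligned with the transformed image.
    image = cv2.warpAffine(image, M, (w, h))
    ones = np.ones((len(keypoints), 1))
    keypoints = (M @ np.hstack([keypoints, ones]).T).T

    # Brightness jitter, in the spirit of ColorJitter(brightness=0.5).
    image = np.clip(image * np.random.uniform(1 - brightness, 1 + brightness), 0, 1)
    return image, keypoints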

Here are the results for my first network design. For all designs in Part 2, I apply a ReLU followed by a maxpool of size 2 and stride 1 after each conv layer.

Neural Net design for the following figures:

Learning rate: 0.001

AllKeypointNet1(
  (conv1): Conv2d(1, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=427800, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=116, bias=True)
)

My net does poorly on the first row of the validation results; it seems like scale augmentation could have helped here, since the prediction is smaller than the head shape (see the hypothetical extension below). The net performs relatively better on the third row.
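If scale augmentation were added, it could be folded into the same affine matrix used in the augmentation sketch above. A hypothetical helper, with a guessed scale range:

import cv2
import numpy as np

# Hypothetical extension of the augment() sketch from earlier: fold a random
# scale into the affine matrix so the keypoints scale along with the pixels.
def random_rotation_scale_matrix(w, h, angle, scale_range=(0.8, 1.2)):
    scale = np.random.uniform(*scale_range)  # range is a guess
    return cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)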

For the second net design I increased the kernel size of every convolution from 3 to 7.

Neural Net design for the following figures:

Learning rate: 0.001

AllKeypointNet2(
  (conv1): Conv2d(1, 24, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
  (conv2): Conv2d(24, 24, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
  (conv3): Conv2d(24, 24, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
  (conv4): Conv2d(24, 24, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
  (conv5): Conv2d(24, 24, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
  (fc1): Linear(in_features=427800, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=116, bias=True)
)

This net seems to keep predicting very similar keypoint locations regardless of input (predictions aligned with a person looking straight ahead).

Predictions were not working as well as I hoped, so I changed more parameters. I set the convolution kernel sizes from 3, 3, 3, 3, 3 to 3, 5, 7, 5, 3 and increased the channel count to 32 for all layers.

Neural Net design for the following figures:

Learning rate: 0.001

AllKeypointNet3(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(32, 32, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
  (conv4): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=570400, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=116, bias=True)
)

Performance got better! Row 3 and Row 6 show pretty good results for the validation data below!

Part 3

Now I move to the Ibug dataset and apply the same data augmentation as in Part 2.

I use a pretrained ResNet-18, modifying the input layer to accept 1 channel (grayscale images) and the output layer to produce 68*2 values (since there are 68 keypoints, each with an x and y coordinate). Results look really good on the training, validation, and test datasets! A sketch of this modification follows the printout below.

Neural Net design for the following figures:

Learning rate: 0.001

BigKeypointNet(
  (model): ResNet(
    (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (layer1): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (layer2): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (layer3): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (layer4): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
    (fc): Linear(in_features=512, out_features=136, bias=True)
  )
)
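A minimal sketch of the modification described above, assuming torchvision's resnet18 (which matches the printed architecture):

import torch.nn as nn
import torchvision

# Start from an ImageNet-pretrained ResNet-18 (matches the printout above).
model = torchvision.models.resnet18(pretrained=True)

# Swap the 3-channel input conv for a 1-channel one for grayscale input.
# The pretrained conv1 weights are simply discarded here; averaging the
# RGB filters into one channel would be a reasonable alternative.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Replace the 1000-way ImageNet classifier with a 68*2 keypoint regressor.
model.fc = nn.Linear(model.fc.in_features, 68 * 2)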

The training and validation losses converge nicely over my 15-epoch training run!

Some results from test images in the Ibug dataset are shown below. I submitted my results on the test set to Kaggle under the name AustinPatel and received a score of 8.14450 (mean absolute error).

Results on photos of me: the left and right images look good, but the center image is slightly misaligned. I think this is because my head is tilted at a steep angle, and even with data augmentation, most of the training data does not cover angles this steep.

Conclusion

I enjoyed this project and thought it was cool to try out different hyperparameters and see the impact each change had on overall prediction performance!