CS194-26 Project 4 - Facial Keypoint Detection with Neural Networks

Shreyas Patankar

Part I: Nose Tip Detection

I began the project by writing a custom PyTorch Dataset & DataLoader to load the input images. These images were converted to grayscale, normalized, and scaled to a size of 80x60. The nose tip keypoint was identified in each one. Some examples of the original images & the images the DataLoader produces are shown below.

Original 1
Dataloader Image 1
Original 2
Dataloader Image 2
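
For illustration, here is a minimal sketch of such a Dataset, assuming OpenCV for image I/O; the class name, argument names, and the normalization to [-0.5, 0.5] are my own choices, not necessarily the exact ones used in the project.

import cv2
import torch
from torch.utils.data import Dataset, DataLoader

class NoseTipDataset(Dataset):
    def __init__(self, image_paths, nose_points, size=(80, 60)):
        self.image_paths = image_paths  # list of image file paths
        self.nose_points = nose_points  # list of (x, y) nose tip coordinates
        self.size = size                # (width, height) passed to cv2.resize

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Grayscale, resize, and normalize, as described above.
        img = cv2.imread(self.image_paths[idx], cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, self.size).astype('float32') / 255.0 - 0.5
        point = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return torch.from_numpy(img).unsqueeze(0), point  # image: (1, 60, 80)

# Unbatched loading, matching the training setup described below:
# loader = DataLoader(NoseTipDataset(paths, points), batch_size=1, shuffle=True)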

The next step was to write a convolutional neural network in order to learn the locations of the nose keypoints. The network was constructed with the following layers:

                  conv1   conv2   conv3   fc1    fc2
input channels    1       12      22      1280   64
output channels   12      22      32      64     2
filter            3x3     3x3     3x3     n/a    n/a

Each convolutional layer was followed first by a ReLU, then a Max Pool of size 2x2. The first fully connected layer was followed with a ReLU.
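
The table translates directly into PyTorch. Below is a minimal sketch; with a 60x80 grayscale input, the three conv/pool stages leave a 32x5x8 volume, which is where fc1's 1280 input features come from.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, 3)
        self.conv2 = nn.Conv2d(12, 22, 3)
        self.conv3 = nn.Conv2d(22, 32, 3)
        self.fc1 = nn.Linear(1280, 64)  # 32 channels * 5 * 8 spatial positions
        self.fc2 = nn.Linear(64, 2)     # (x, y) of the nose tip

    def forward(self, x):                           # x: (N, 1, 60, 80)
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # -> (N, 12, 29, 39)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # -> (N, 22, 13, 18)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)  # -> (N, 32, 5, 8)
        x = torch.flatten(x, 1)
        return self.fc2(F.relu(self.fc1(x)))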

I split the data into train and test sets: the test set was 20% of the original data, and the remaining 80% was the training data. I trained my net with a learning rate of 1e-3, unbatched (batch size 1), for 25 epochs. Below are the averaged training & validation losses per epoch, as well as some success & failure cases. The blue point represents the predicted result, while the red point is the ground truth keypoint.

Training & Validation MSE Loss
Success Case 1
Success Case 2
Failure Case 1
Failure Case 2
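
For reference, here is a minimal sketch of the training loop described above, assuming MSE loss (as plotted) and an Adam optimizer; the optimizer choice and the loader names are my assumptions.

import torch

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for images, keypoints in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), keypoints)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for images, keypoints in val_loader:
                val_loss += criterion(model(images), keypoints).item()
        # Averaged per-epoch losses, as plotted above.
        print(f"epoch {epoch}: train {train_loss / len(train_loader):.4f}, "
              f"val {val_loss / len(val_loader):.4f}")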

The most obvious possible reason for the failure cases is simply the lack of training data. There were under 200 images in the training set to begin with, and only a fraction of those faced any given orientation. Neural networks tend to perform significantly better when there is a plethora of training data. Moreover, it is worth noting that in the failure cases the faces are generally not facing straight forward, though I believe this is a much subtler reason for failure.

Finally, below is a visualization of the learned filters from the convolutional layers.

Learned Filters from conv1
Learned Filters from conv2

I have omitted the filters of the 3rd convolutional layer because the number of filters grows significantly from layer to layer, as evidenced by conv1 and conv2.
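
For completeness, here is a minimal sketch of how such filter grids can be rendered with matplotlib; averaging each kernel over its input channels is purely a display choice of mine.

import matplotlib.pyplot as plt

def show_filters(conv, ncols=6):
    # (out_channels, in_channels, 3, 3) -> one 3x3 image per output channel
    kernels = conv.weight.detach().cpu().numpy().mean(axis=1)
    nrows = -(-len(kernels) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(2 * ncols, 2 * nrows))
    for ax in axes.flat:
        ax.axis('off')
    for ax, kernel in zip(axes.flat, kernels):
        ax.imshow(kernel, cmap='gray')
    plt.show()

show_filters(net.conv1)  # net being a trained NoseNet from the sketch above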

Part II: Full Facial Keypoints Detection

In this part, the goal was again to identify keypoints, but this time for the full face. Below are sampled images from a new DataLoader with their ground truth keypoints. The images were scaled to 160x120, converted to grayscale, and normalized.

Original 1
Dataloader Image 1
Original 2
Dataloader Image 2

As in the previous part, the next step was to write a convolutional neural network, this time to learn the locations of the full set of facial keypoints. The network had additional convolutional layers to account for the larger image size:

                  conv1   conv2   conv3   conv4   conv5   fc1   fc2
input channels    1       12      20      24      30      96    256
output channels   12      20      24      30      32      256   116
filter            3x3     3x3     3x3     3x3     3x3     n/a   n/a

Each convolutional layer was followed first by a ReLU, then a Max Pool of size 2x2. The first fully connected layer was followed with a ReLU.
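
As in Part I, the table maps directly onto a PyTorch module. A minimal sketch follows; with a 120x160 input, the five conv/pool stages leave a 32x1x3 volume, matching fc1's 96 input features, and the 116 outputs correspond to the 58 keypoints x 2 coordinates.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(c_in, c_out, 3)
            for c_in, c_out in [(1, 12), (12, 20), (20, 24), (24, 30), (30, 32)]
        ])
        self.fc1 = nn.Linear(96, 256)   # 32 channels * 1 * 3 spatial positions
        self.fc2 = nn.Linear(256, 116)  # 58 keypoints * 2 coordinates

    def forward(self, x):               # x: (N, 1, 120, 160)
        for conv in self.convs:
            x = F.max_pool2d(F.relu(conv(x)), 2)
        x = torch.flatten(x, 1)         # -> (N, 96)
        return self.fc2(F.relu(self.fc1(x)))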

Exactly as before, I split the data into train and test sets: the test set was 20% of the original data, and the remaining 80% was the training data. I trained my net with a learning rate of 1e-3, unbatched, for 30 epochs. Below are the averaged training & validation losses per epoch, as well as some success & failure cases. The blue points represent the predicted results, while the red points are the ground truth keypoints.

Training & Validation MSE Loss
Success Case 1
Success Case 2
Failure Case 1
Failure Case 2

The main reason for the failure cases is, again, the lack of training data. As before, there were under 200 images in the training set to begin with, and only a fraction of those faced any given orientation. Neural networks tend to perform significantly better when there is a plethora of training data. Moreover, the vast majority of the faces were photographed from roughly the same distance; when there is variation in scale, the network does not respond very well. This could explain the strange shapes predicted in the two failure cases. To improve this, I could have applied additional data augmentation, as sketched below, in order to learn better features.
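
As one example, here is a minimal sketch of a keypoint-aware augmentation: a random shift applied jointly to the image and its keypoints. The helper name and parameters are hypothetical; note that np.roll wraps pixels around the border, which is tolerable only for small shifts.

import numpy as np

def random_shift(image, keypoints, max_shift=10):
    # image: (H, W) array; keypoints: (N, 2) array of (x, y) pixel coords
    dx, dy = np.random.randint(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    return shifted, keypoints + np.array([dx, dy])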

Finally, below is a visualization of the learned filters from the convolutional layers.

Learned Filters from conv1
Learned Filters from conv2

I have omitted the remaining layers of filters because the number of filters grows significantly from layer to layer, as evidenced by conv1 and conv2.

Part III: Train With Larger Dataset

The final section of the project was to detect facial keypoints using a larger training set. For our purposes, we used the ibug faces dataset. My network was based on a pre-existing architecture, ResNet-18, with its first convolutional layer changed to accept single-channel (grayscale) input and its final fully connected layer changed to output 136 values (68 keypoints x 2 coordinates). The architecture of the network is shown below:

ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)
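
For reference, a model with exactly this printout can be obtained by lightly modifying torchvision's resnet18; this is a sketch assuming torchvision is available.

import torch.nn as nn
import torchvision.models as models

model = models.resnet18()
# Accept single-channel (grayscale) input instead of RGB.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Regress 136 values: 68 keypoints * 2 coordinates.
model.fc = nn.Linear(model.fc.in_features, 136)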

The network was trained with a learning rate of 1e-3, as before, using a batch size of 32, for 5 epochs. I would have run for more, but Colab kept crashing for an unknown reason after ~7 epochs. Below are my training and validation errors over the epochs.

Training & Validation MSE Loss

Here are the final results on some of the testing data.

Success 1
Success 2
Success 3
Success 4
Failure 1
Failure 2
Failure 3
Failure 4