Project 5: Facial Keypoint Detection with Neural Networks

CS 194-26 Fall 2021

Bhuvan Basireddy



Nose Tip Detection

I divided the 240 images into two dataloaders: one for training and one for validation. I converted the images to grayscale and normalized the pixel values to the range [-0.5, 0.5]. Then I resized each image to 80x60, transformed the keypoints accordingly, and fed the result into the CNN to predict the nose tip point.
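The preprocessing above can be sketched as a single function; the nearest-neighbour resize here is just a stand-in for whatever resize the real pipeline uses, and `preprocess` is a hypothetical helper name:

```python
import numpy as np

def preprocess(image, keypoints, out_w=80, out_h=60):
    """Normalize a grayscale image to [-0.5, 0.5] and rescale its keypoints.

    image: (H, W) array with intensities in [0, 255]
    keypoints: (N, 2) array of (x, y) pixel coordinates in the original image
    """
    h, w = image.shape
    # Normalize intensities from [0, 255] to [-0.5, 0.5]
    norm = image.astype(np.float32) / 255.0 - 0.5
    # Nearest-neighbour resize to out_h x out_w (stand-in for a real resize)
    rows = (np.arange(out_h) * h / out_h).astype(int)
    cols = (np.arange(out_w) * w / out_w).astype(int)
    resized = norm[np.ix_(rows, cols)]
    # Keypoints scale by the same per-axis factors as the image
    scaled = keypoints * np.array([out_w / w, out_h / h])
    return resized, scaled
```

The key point is that the keypoint coordinates must be scaled by exactly the same per-axis factors as the image, or the ground truth no longer lines up with the resized pixels.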
Below are the images sampled from the dataloader with the ground truth keypoints:
Ground Truth Keypoints
I trained this for 20 epochs with a learning rate of 1e-3 and a batch size of 32.
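A minimal sketch of the training loop, assuming Adam as the optimizer (the text gives only the epoch count, learning rate, and MSE loss, so the optimizer choice and loss reduction are assumptions):

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=20, lr=1e-3):
    """Train with MSE loss on predicted keypoint coordinates and
    record per-epoch train/validation losses."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    train_losses, val_losses = [], []
    for _ in range(epochs):
        model.train()
        total = 0.0
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        val_losses.append(val / len(val_loader))
    return train_losses, val_losses
```

The returned per-epoch losses are what the plots below are drawn from.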
Here's the plot of the train and validation MSE loss:
I varied the learning rate, trying 1e-2, 1e-3, and 1e-4, and the filter size, trying 3x3, 5x5, and 7x7. The loss curves are shown below. I found a learning rate of 1e-3 and a filter size of 3x3 to work best for the CNN, since these gave the most stable training.
Learning Rate = 1e-2
Learning Rate = 1e-3
Learning Rate = 1e-4
Filter Size = 3x3
Filter Size = 5x5
Filter Size = 7x7
For the nose detection, many points are predicted correctly, but some are not. This is probably due to differences in lighting, image rotation, facial expressions, and other variations; our CNN is too small to generalize well across them.
Here are some images showing the nose detection:
Good
Good
Bad
Bad

Full Facial Keypoints Detection

I follow the same process as before, but now with all the keypoints. I resized the images to 240x180. I added data augmentation: random rotation between -15 and 15 degrees, random translation by up to 5% of the image size, and random changes of brightness and saturation by up to 10%.
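The tricky part of augmentation here is that the keypoints must follow the same geometric transform as the image. A sketch of the keypoint side of a random rotation, plus the parameter sampling described above (the helper names are hypothetical, and the corresponding image would be rotated with the same angle, e.g. via `torchvision.transforms.functional.rotate`):

```python
import numpy as np

def rotate_keypoints(keypoints, angle_deg, w, h):
    """Rotate (x, y) keypoints about the image center to match an
    image rotation of angle_deg."""
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    center = np.array([w / 2.0, h / 2.0])
    shifted = keypoints - center
    # Standard 2D rotation (note: image coordinates have y pointing down)
    rot = np.stack([c * shifted[:, 0] - s * shifted[:, 1],
                    s * shifted[:, 0] + c * shifted[:, 1]], axis=1)
    return rot + center

def random_augment_params(rng):
    """Sample the augmentation parameters described in the text."""
    angle = rng.uniform(-15, 15)       # degrees of rotation
    tx = rng.uniform(-0.05, 0.05)      # translation, fraction of width
    ty = rng.uniform(-0.05, 0.05)      # translation, fraction of height
    jitter = rng.uniform(0.9, 1.1)     # brightness/saturation factor
    return angle, tx, ty, jitter
```

Translations are even simpler: add the same pixel offset to every keypoint that was applied to the image. Brightness and saturation jitter leave the keypoints untouched.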
Below are the images sampled from the dataloader with the ground truth keypoints:
I ran the model for 200 epochs with a learning rate of 1e-4 and a batch size of 1.
Below is the architecture for my model:
SmallCNN(
    (conv1): Conv2d(1, 8, kernel_size=(7, 7), stride=(1, 1))
    (conv2): Conv2d(8, 14, kernel_size=(5, 5), stride=(1, 1))
    (conv3): Conv2d(14, 20, kernel_size=(3, 3), stride=(1, 1))
    (conv4): Conv2d(20, 27, kernel_size=(5, 5), stride=(1, 1))
    (conv5): Conv2d(27, 35, kernel_size=(3, 3), stride=(1, 1))
    (fc1): Linear(in_features=1575, out_features=512, bias=True)
    (fc2): Linear(in_features=512, out_features=256, bias=True)
    (fc3): Linear(in_features=256, out_features=116, bias=True)
)
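The printed module omits activations and pooling (PyTorch's repr only shows registered layers), so here is one plausible reconstruction. The ReLUs and the max-pools after conv1 through conv4 are assumptions, chosen so that conv5's output flattens to exactly fc1's 1575 input features (35 channels x 5 x 9 for a 180x240 grayscale input):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    """Sketch matching the printed layer list; pooling placement is assumed."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, 7)
        self.conv2 = nn.Conv2d(8, 14, 5)
        self.conv3 = nn.Conv2d(14, 20, 3)
        self.conv4 = nn.Conv2d(20, 27, 5)
        self.conv5 = nn.Conv2d(27, 35, 3)
        self.fc1 = nn.Linear(1575, 512)   # 35 * 5 * 9 = 1575
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 116)    # 58 keypoints x (x, y)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = F.max_pool2d(F.relu(self.conv4(x)), 2)
        x = F.relu(self.conv5(x))
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```

The 116 outputs correspond to 58 (x, y) keypoint pairs.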
Here's the plot of the train and validation MSE loss:
Many images had their keypoints predicted fairly well, but some did not. This is probably because our model is too small to generalize well across all the variations, such as rotations, color jitter, and facial expressions.
Here are some images showing the keypoints detection:
Good
Good
Bad
Bad
Here are the filters of the first conv layer visualized:
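One way to produce such a visualization is to tile the first conv layer's 7x7 filters into a single image; a sketch (the helper name is hypothetical, and the result would be shown with something like `plt.imshow(grid, cmap='gray')`):

```python
import numpy as np
import torch
import torch.nn as nn

def conv1_filter_grid(conv, pad=1):
    """Arrange a conv layer's single-input-channel filters into one
    2D array for display, normalized to [0, 1]."""
    w = conv.weight.detach().numpy()[:, 0]            # (out_ch, k, k)
    w = (w - w.min()) / (w.max() - w.min() + 1e-8)    # normalize for display
    n, k, _ = w.shape
    cols = int(np.ceil(np.sqrt(n)))
    rows = int(np.ceil(n / cols))
    grid = np.ones((rows * (k + pad) - pad, cols * (k + pad) - pad))
    for i, f in enumerate(w):
        r, c = divmod(i, cols)
        grid[r*(k+pad):r*(k+pad)+k, c*(k+pad):c*(k+pad)+k] = f
    return grid
```

Normalizing across all filters jointly (rather than per filter) keeps their relative magnitudes comparable in the display.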

Train with Larger Dataset

The mean absolute error I achieved on Kaggle is 14.11930.
I trained the model for 12 epochs with a learning rate of 1e-3 and a batch size of 64. Below is the architecture for my model:
ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)
Here's the plot of the train and validation MSE loss:
Here are some images showing the keypoints detection on the testing set:
I tried the model on some random images I found. It does fairly well on the real people, giving reasonably accurate keypoints. On the cartoon image, though, the model fails because Shaggy has very exaggerated features and a small face. Here are some images from my collection with the predicted keypoints:

B&W: Auto Face Morphing using ResNet

I used my ResNet on a group of images to automatically get the keypoints for morphing between them. Here's the gif: