CS 194-26 Project 5: Facial Keypoint Detection with Neural Networks

Name: Suhn Hyoung Kim

Project Overview

The goal of this project is to automatically detect keypoints for facial features using neural networks, implemented with PyTorch. We begin by trying to detect the tip of a person's nose and then move on to detecting all of the facial keypoints on a face.

Nose Tip Detection

Dataloader

I wrote a custom dataset and used torch.utils.data.DataLoader to load the images and nose keypoints from the IMM Face Database. Below I have displayed a few of the sample images with the nose keypoint marked in red; a sketch of the dataset class follows them.

Sample 1
Sample 2
Sample 3
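
Here is a rough sketch of that dataset class. The 80x60 resize, the [-0.5, 0.5] normalization, and the helper variables (image_paths, nose_points, assumed to be parsed from the .asf annotation files elsewhere) are assumptions about how I organized the data, not an exact reproduction of my code.

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from skimage import io, transform

class NoseDataset(Dataset):
    """Returns (grayscale image tensor, nose-tip keypoint) pairs."""
    def __init__(self, image_paths, nose_points):
        # nose_points: N x 2 array of (x, y) nose-tip positions stored as
        # ratios of the image width and height.
        self.image_paths = image_paths
        self.nose_points = nose_points

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = io.imread(self.image_paths[idx], as_gray=True)
        img = transform.resize(img, (60, 80)).astype(np.float32) - 0.5  # normalize to [-0.5, 0.5]
        img = torch.from_numpy(img).unsqueeze(0)                        # add channel dim -> 1 x 60 x 80
        return img, torch.tensor(self.nose_points[idx], dtype=torch.float32)

# image_paths / nose_points come from parsing the dataset (not shown here)
loader = DataLoader(NoseDataset(image_paths, nose_points), batch_size=4, shuffle=True)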

Hyperparameter Tuning

After building my preliminary CNN model, I performed some hyperparameter tuning by adjusting the learning rate and the convolutional filter size. I increased the learning rate from 1e-3 to 2e-3, and separately changed the filter size from 5 to 7 to see how each would affect the losses. Neither change made a significant difference in the overall results.
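
For reference, below is a simplified sketch of where these two hyperparameters enter the training setup. The layer sizes, the Adam optimizer, and the epoch count are illustrative assumptions rather than my exact nose network.

import torch
import torch.nn as nn

def make_nose_net(kernel_size=5):
    # Illustrative architecture only; nn.LazyLinear infers the flattened size
    # so the same code works for kernel_size 5 or 7.
    return nn.Sequential(
        nn.Conv2d(1, 12, kernel_size), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(12, 24, kernel_size), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.LazyLinear(128), nn.ReLU(),
        nn.Linear(128, 2),               # (x, y) of the nose tip
    )

def train(model, train_loader, val_loader, lr=1e-3, epochs=25):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    for _ in range(epochs):
        model.train()
        total = 0.0
        for imgs, pts in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), pts)
            loss.backward()
            optimizer.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))
        model.eval()
        with torch.no_grad():
            val_losses.append(sum(criterion(model(imgs), pts).item()
                                  for imgs, pts in val_loader) / len(val_loader))
    return train_losses, val_losses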

Training and Validation MSE Loss

Below are graphs showing the Training and Validation MSE loss during training for the base network, along with the graphs for the adjusted hyperparameters.

Nose Network Losses
Nose Network Losses with Learning Rate 2e-3
Nose Network Losses with Kernel Size 7

Results

Below are two facial images where the network detects the nose correctly. The red point represents the predicted point, and the green point represents the ground truth.

Good Example 1
Good Example 2

Below are two facial images where the network detects the nose incorrectly. Again, the red point represents the predicted point, and the green point represents the ground truth. The network likely failed on one of these cases because the face is turned to the side, so the nose does not appear where the mostly front-facing training examples would suggest. In the other image, the crease near the mouth may have looked similar enough to a nose that the model placed the keypoint too low.

Bad Example 1
Bad Example 2

Full Facial Keypoints Detection

Dataloader

I again used torch.utils.data.DataLoader, this time with a custom dataset that returns all of the facial keypoints. I also performed data augmentation to help prevent the network from overfitting: additional samples were added that were either randomly rotated by -15 to 15 degrees or randomly cropped to a portion of the image, with the keypoints transformed to match. Below are some of the sample images from the dataloader labeled with the ground-truth keypoints; a sketch of the rotation augmentation follows them.

Face Sample 1
Face Sample 2
Face Sample 3
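
Here is a sketch of the rotation augmentation. It assumes the keypoints are stored as (x, y) ratios of the image width and height (the convention I used above), and the cropping augmentation is omitted.

import numpy as np
from scipy import ndimage

def random_rotate(img, pts, max_deg=15):
    """Rotate a grayscale image and its keypoints about the image center.

    img: H x W array; pts: N x 2 array of (x, y) keypoints as ratios in [0, 1].
    """
    deg = np.random.uniform(-max_deg, max_deg)
    rotated = ndimage.rotate(img, deg, reshape=False, mode='nearest')

    # ndimage.rotate with a positive angle turns the image counter-clockwise
    # as displayed; in image coordinates (y pointing down) the keypoints then
    # transform by the rotation matrix for -deg about the center.
    h, w = img.shape
    theta = np.deg2rad(deg)
    rot = np.array([[np.cos(theta),  np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])
    center = np.array([w, h]) / 2.0
    xy = pts * np.array([w, h]) - center        # ratios -> centered pixel coords
    xy = xy @ rot.T                              # rotate each point
    return rotated, (xy + center) / np.array([w, h])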

Model Architecture

Below are the layers for my model architecture. The learning rate was 1e-3 and the batch size was 4.
FaceDetectionCNN(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1))
  (conv4): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
  (conv5): Conv2d(64, 128, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=4608, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=116, bias=True)
)

Training and Validation MSE Loss

Below is a graph showing the Training and Validation MSE loss during the training process.

Face Network Losses

Results

Below are two facial images where the network detects the facial keypoints correctly. The red points represent the predicted points, and the green points represent the ground truth.

Good Example 1
Good Example 2

Below are two facial images where the network detects the facial keypoints incorrectly. Again, the red points represent the predicted points, and the green points represent the ground truth. The network might have failed in these cases because the faces are turned much further to the left than most images in the training set, so the keypoint locations are less consistent with what the network learned. The nose-tip detector had similar trouble with images like these.

Bad Example 1
Bad Example 2

Visualize Learned Filters

Below I have visualized some of the filters from the first convolutional layer of my network.

Face Network Filter Examples
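
For reference, here is a minimal sketch of how these filters can be pulled out and plotted, assuming model is the trained FaceDetectionCNN above and matplotlib is available.

import matplotlib.pyplot as plt

# conv1 weights have shape (out_channels, in_channels, kH, kW); with a
# grayscale input each filter is a single 5x5 kernel.
weights = model.conv1.weight.detach().cpu().numpy()
fig, axes = plt.subplots(1, weights.shape[0], figsize=(2 * weights.shape[0], 2))
for ax, w in zip(axes, weights):
    ax.imshow(w[0], cmap='gray')
    ax.axis('off')
plt.show()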

Train With Larger Dataset

Kaggle

For my Kaggle submission, my username is Sean Kim. At the time of submitting this website, my score was 10.13204.

Dataloader

This dataloader was essentially the same as the one from the previous part, but I also applied color jittering for additional data augmentation. Below are some example images from my dataloader with the keypoints labeled; a sketch of the jittering step follows them.

Face Sample 1
Face Sample 2
Face Sample 3
Face Sample 4
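
Here is a sketch of the jittering step. Photometric jitter only changes pixel intensities, so unlike rotation or cropping it does not require adjusting the keypoints; the jitter strengths below are assumptions rather than my exact values.

from torchvision import transforms

# Random brightness/contrast/saturation jitter applied to the PIL image
# before it is converted to a grayscale tensor.
jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2)

def jitter_image(pil_img):
    """Apply random photometric jitter to one PIL image."""
    return jitter(pil_img)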

Model Architecture

Here is the ResNet model architecture I used to train with the larger dataset. I used the default ResNet18 model, modifying only the first convolutional layer to accept a single grayscale channel and the final fully connected layer to output 136 values (68 keypoints x 2 coordinates). The learning rate was 1e-3 and the batch size was 16.
ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)
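
A short sketch of how the stock torchvision ResNet18 can be adapted this way is below; the two replaced layers match the printout above, while starting from scratch (pretrained=False) and the Adam optimizer are assumptions about my setup.

import torch
import torch.nn as nn
import torchvision.models as models

def make_keypoint_resnet():
    net = models.resnet18(pretrained=False)
    # Grayscale input: 1 channel instead of 3 (matches the conv1 line above).
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # 68 keypoints x 2 coordinates = 136 outputs (matches the fc line above).
    net.fc = nn.Linear(net.fc.in_features, 136)
    return net

model = make_keypoint_resnet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)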

Training and Validation MSE Loss

Below is a graph showing the Training and Validation MSE loss over 10 epochs of training the ResNet18 model.

ResNet18 Losses

Visualize Keypoint Predictions on Test Set

Here are some of the visualized keypoint results from the ResNet18 model on the test set, after training for 50 epochs on the full dataset.

Testset Result 1
Testset Result 2
Testset Result 3
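
Here is a rough sketch of how the test-set predictions are produced. The network predicts coordinates in the resized crop it was trained on, so they have to be scaled back to the original image size before visualization (or submission); the 224x224 input size and pixel-coordinate output convention are assumptions about my own pipeline.

import torch

@torch.no_grad()
def predict_keypoints(model, img_tensor, orig_w, orig_h, in_size=224):
    """Predict 68 keypoints for one test image and map them back to the
    original resolution.

    img_tensor: 1 x 1 x in_size x in_size normalized grayscale tensor.
    """
    model.eval()
    pts = model(img_tensor).reshape(-1, 2).cpu().numpy()   # 68 x 2 in resized coords
    pts[:, 0] *= orig_w / in_size                           # scale x back to original width
    pts[:, 1] *= orig_h / in_size                           # scale y back to original height
    return pts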

Visualize Keypoint Predictions on Own Collection

Here are some of the visualized keypoint results from the same ResNet18 model (trained for 50 epochs on the full dataset) on some images of my choice.

Face Original 1
Face Points 1
Face Original 2
Face Points 2
Face Original 3
Face Points 3

Bells and Whistles

For bells and whistles, I applied the automatic facial keypoint detection from this project to project 3 so that I could automatically morph from one face to another and compute mid-way faces without hand-selecting points. The results are shown below: a morph from Jisoo to Eunwoo, and mid-way faces of Eunwoo with me and with Jisoo.

Eunwoo + Jisoo Midway
Eunwoo + Me Midway
Jisoo to Eunwoo Morph
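
At a high level, the only change to the project 3 pipeline is where the correspondence points come from; a sketch is below. Here predict_fn is the prediction routine from the test-set sketch above, and morph_midway is a stand-in for my project 3 warping code (not reproduced here); the corner-point handling is an assumption.

import numpy as np
from scipy.spatial import Delaunay

def auto_midway(model, img_a, img_b, predict_fn):
    """Compute a mid-way face using network-predicted keypoints."""
    pts_a = predict_fn(model, img_a)
    pts_b = predict_fn(model, img_b)
    # Add the image corners so the warp covers the whole frame.
    h, w = img_a.shape[:2]
    corners = np.array([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]])
    pts_a = np.vstack([pts_a, corners])
    pts_b = np.vstack([pts_b, corners])
    mid_pts = (pts_a + pts_b) / 2.0
    tri = Delaunay(mid_pts)
    # morph_midway: project 3 routine that warps both images to mid_pts over
    # the triangulation and cross-dissolves them.
    return morph_midway(img_a, img_b, pts_a, pts_b, mid_pts, tri)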

What I Learned

I found it really cool to apply deep learning to detecting facial keypoints and making the process automatic. I especially liked plugging the detector into my project 3 code and seeing the morphs come together without having to select points by hand!