Facial Keypoint Detection

with neural networks

Gautam Mittal

Nose Tip Detection

Ground-truth images sampled from the validation set's dataloader. For this task, no data augmentation was used.

Below are results after training a small neural network (see architecture below) with varying hyperparameters. The default training recipe uses 25 epochs, a batch size of 1, and a learning rate of 0.001. A learning curve for the default training recipe is shown below, with both training and validation MSE over time. The x-axis is the number of batches (gradient steps) processed by the network, and each group of validation points is computed at the end of a training epoch (one full pass through all training points).

Training loss (blue) and validation loss (orange) per gradient step.
Columns (left to right; bs = batch size, lr = learning rate): bs: 1, lr: 1e-3 | bs: 4, lr: 1e-3 | bs: 32, lr: 1e-3 | bs: 1, lr: 1e-4 | bs: 1, lr: 1e-2
Table 1. Predictions from the same architecture (see below) trained with different hyperparameters, showing their effect on model performance. Green points are ground-truth labels and red points are model predictions on the validation set images.

Sweeping over batch size and learning rate shows that a smaller batch size places the nose tip correctly on more of the faces sampled from the validation set, and that a learning rate of 1e-3 works best. A likely explanation is that a larger batch size gives a lower-variance gradient, which pulls predictions toward the average face. A learning rate that is too large may overstep a loss minimum, while one that is too small may make learning progress too slowly within the fixed number of epochs. Both degrade performance.
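
The two effects described above can be illustrated on a toy problem. This is a minimal sketch, not the training code used for these experiments: it assumes per-example gradients are independent standard-normal draws, and it uses a hypothetical one-dimensional quadratic loss. The toy learning rates are chosen only to show the three regimes and are unrelated to the values in Table 1.

```python
import random
import statistics

def minibatch_gradient_variance(batch_size, trials=2000, seed=0):
    """Empirical variance of a mini-batch gradient estimate.

    Assumption: each example's gradient is an independent draw from a
    standard normal, and the mini-batch gradient is the mean of those draws.
    """
    rng = random.Random(seed)
    estimates = [
        sum(rng.gauss(0.0, 1.0) for _ in range(batch_size)) / batch_size
        for _ in range(trials)
    ]
    return statistics.pvariance(estimates)

def gradient_descent(lr, steps=25, w0=1.0, curvature=1.0):
    """Run gradient descent on L(w) = 0.5 * curvature * w**2; return final |w|."""
    w = w0
    for _ in range(steps):
        w -= lr * curvature * w  # the gradient of the loss is curvature * w
    return abs(w)

# Variance shrinks roughly as 1 / batch_size.
for bs in (1, 4, 32):
    print(f"batch size {bs:2d}: gradient variance ~ {minibatch_gradient_variance(bs):.3f}")

# Too small: slow progress. Moderate: fast convergence. Too large: divergence.
for lr in (0.01, 0.5, 2.5):
    print(f"lr {lr}: |w| after 25 steps = {gradient_descent(lr):.3g}")
```

On this toy loss each update multiplies w by (1 - lr * curvature), so the iterate shrinks only when that factor has magnitude below 1. Real networks have no single curvature, so this only conveys the intuition.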

Where the network fails (for example, on the bottom image in every column), the cause may be a combination of overfitting, since the dataset is relatively small, and the low resolution of the images, which prevents the network from extracting the features that indicate a face's nose tip.

The following model architecture was used:

NoseTipNet(
    (conv): Sequential(
        (0): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1))
        (1): ReLU()
        (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        (3): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
        (4): ReLU()
        (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        (6): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
        (7): ReLU()
        (8): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (mlp): Sequential(
        (0): Linear(in_features=1280, out_features=512, bias=True)
        (1): ReLU()
        (2): Linear(in_features=512, out_features=2, bias=True)
    )
)    
        

Full Facial Keypoints Detection

Ground-truth images sampled from the training set's dataloader (left). For this task, translation-based augmentation was used: a shift chosen uniformly on [-10, 10] x [-10, 10] is applied each time an image is sampled. Color jitter and rotation-based augmentation were also implemented but found to harm network performance. On the right are model predictions on images sampled from the validation set. Red points are predictions and green points are ground truth for the validation set.
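
A translation augmentation of the kind described above can be sketched as follows. This is an illustrative version, not the code used here: it assumes integer pixel shifts, keypoints stored as (x, y) pixel coordinates, and zero fill for pixels shifted in from outside the frame. The function names `translate` and `random_translate` are hypothetical, and plain nested lists stand in for image tensors.

```python
import random

def translate(image, keypoints, dx, dy, fill=0.0):
    """Shift a 2-D image (list of rows) and its keypoints by (dx, dy) pixels.

    Assumes keypoints are (x, y) pairs in pixel coordinates. Pixels shifted
    in from outside the frame are set to `fill`.
    """
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y, new_x = y + dy, x + dx
            if 0 <= new_y < height and 0 <= new_x < width:
                shifted[new_y][new_x] = image[y][x]
    # The labels must move with the image, or they no longer match the face.
    moved = [(kx + dx, ky + dy) for kx, ky in keypoints]
    return shifted, moved

def random_translate(image, keypoints, max_shift=10, rng=random):
    """Draw integer dx, dy uniformly from [-max_shift, max_shift] and apply them."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return translate(image, keypoints, dx, dy)

# Tiny 3x4 image with one bright pixel at (x=1, y=1), labelled by one keypoint.
image = [[0.0] * 4 for _ in range(3)]
image[1][1] = 1.0
shifted, moved = translate(image, [(1, 1)], dx=2, dy=1)
print(moved)          # [(3, 2)]
print(shifted[2][3])  # 1.0
```

Because the shift is redrawn each time an image is sampled, the network sees a slightly different version of each face every epoch.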

Where the network fails (for example, on the bottom image in every column), the cause is again likely overfitting (perhaps very few faces look like the bottom face), even with the additional augmentation. This experiment may rule out resolution as a cause of the failure, since the images here are at twice the resolution of those in the nose tip detection experiment. Performance is also somewhat worse along the jaw line for forward-facing images, likely because the jaw and neck are hard to distinguish. For side-angled faces, the eyes are harder to predict correctly, possibly because they do not face the camera directly.

The default training recipe uses 30 epochs, a batch size of 2, and a learning rate of 3e-4. A learning curve for the default training recipe is shown below, with both training and validation MSE over time.

Training loss (blue) and validation loss (orange) per gradient step.

The following model architecture was used:

FaceKeypointNet(
    (conv): Sequential(
        (0): Conv2d(1, 16, kernel_size=(5, 5), stride=(1, 1))
        (1): ReLU()
        (2): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1))
        (3): ReLU()
        (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        (5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
        (6): ReLU()
        (7): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
        (8): ReLU()
        (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        (10): Conv2d(32, 24, kernel_size=(3, 3), stride=(1, 1))
        (11): ReLU()
        (12): Conv2d(24, 24, kernel_size=(3, 3), stride=(1, 1))
        (13): ReLU()
        (14): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (mlp): Sequential(
        (0): Linear(in_features=4224, out_features=512, bias=True)
        (1): ReLU()
        (2): Linear(in_features=512, out_features=116, bias=True)
    )
)
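
The `in_features` of the first linear layer in each network follows from the convolution and pooling arithmetic. The sketch below checks that arithmetic. The input sizes (60x80 for NoseTipNet, 120x160 for FaceKeypointNet) are assumptions: they are not stated in this write-up, but they reproduce the printed values of 1280 and 4224 and match the statement that the second experiment uses twice the resolution. Other nearby sizes can give the same counts because pooling rounds down.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_features(height, width, layers, channels):
    """Apply each (kernel, stride) layer to both spatial dims; return channels * H * W."""
    for kernel, stride in layers:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return channels * height * width

CONV3, CONV5, POOL = (3, 1), (5, 1), (2, 2)  # (kernel size, stride), no padding

# NoseTipNet: three (3x3 conv, 2x2 max-pool) stages ending with 32 channels.
nose_layers = [CONV3, POOL] * 3
# FaceKeypointNet: two convs before each max-pool, ending with 24 channels.
face_layers = [CONV5, CONV5, POOL, CONV3, CONV3, POOL, CONV3, CONV3, POOL]

# Input sizes are assumed, not taken from the write-up.
print(flattened_features(60, 80, nose_layers, channels=32))    # 1280
print(flattened_features(120, 160, face_layers, channels=24))  # 4224
```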
        

Additionally, the learned filters for the first convolutional layer of the network are visualized below. Unfortunately, no human-interpretable patterns are evident in these visualizations.

First convolutional layer's 16 5x5 filters visualized. Order is from left to right, top to bottom.

Train With Larger Dataset

For the Kaggle component, the mean absolute error on the test set was 11.51998. The architecture used to achieve this is below (ResNet18 from the torchvision model zoo, with a single-channel first convolution and a 136-output final layer):

ResNet(
    (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (layer1): Sequential(
        (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (1): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
    )
    (layer2): Sequential(
        (0): BasicBlock(
        (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
            (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
            (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        )
        (1): BasicBlock(
        (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
    )
    (layer3): Sequential(
        (0): BasicBlock(
        (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
            (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
            (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        )
        (1): BasicBlock(
        (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
    )
    (layer4): Sequential(
        (0): BasicBlock(
        (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
            (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
            (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        )
        (1): BasicBlock(
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
    )
    (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
    (fc): Linear(in_features=512, out_features=136, bias=True)
)   
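
For reference, the mean absolute error quoted above can be computed as in the sketch below. It assumes the error is averaged over every predicted coordinate. The exact averaging used by the Kaggle scorer is not given in this write-up, and the coordinates in the example are made up for illustration.

```python
def mean_absolute_error(predicted, target):
    """Average of |p - t| over all coordinates of all keypoints."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)

# Hypothetical flattened (x1, y1, x2, y2) coordinates in pixels.
predicted = [100.0, 52.0, 140.0, 55.0]
target = [104.0, 50.0, 137.0, 58.0]
print(mean_absolute_error(predicted, target))  # 3.0
```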
        

The training recipe used 10 epochs, a batch size of 16, and a learning rate of 3e-4. The model was trained on 80% of the training data, with the remaining 20% held out as validation data. A plot of the training/validation loss is below:

Train MSE (blue), validation MSE (orange).
Left: Predictions (red) on some validation images (ground truth in green). Middle: Predictions on the test set (used by Kaggle to compute MAE). Right: Predictions on a set of images from my own collection.
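
An 80/20 train/validation split like the one in the training recipe can be produced as in this minimal sketch. Whether the split here was shuffled or seeded is not stated, so both are assumptions, and the dataset size of 1000 is a placeholder.

```python
import random

def train_val_split(num_examples, val_fraction=0.2, seed=0):
    """Shuffle indices 0..num_examples-1 and split them into (train, validation) lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    num_val = int(round(num_examples * val_fraction))
    return indices[num_val:], indices[:num_val]

train_idx, val_idx = train_val_split(1000)  # 1000 is a placeholder dataset size
print(len(train_idx), len(val_idx))  # 800 200
```

Fixing the seed keeps the validation set identical across runs, so validation losses from different hyperparameter settings stay comparable.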

The model performs much better with the larger training set, even on held-out images from the test set. It struggles most on faces with glasses, likely because the training set contains very few such faces.