
Project 5: Facial Keypoint Detection with Neural Networks

2021 Fall CS 294-026 Xinwei Zhuang


Part 1: Nose Tip Detection

The IMM Face Database is used for automatic nose tip detection. The dataset contains 240 facial images of 40 persons, with 6 images per person taken from different viewpoints. A preview of the dataset is shown below.

Dataset for training


The first 32 persons are used as the training set (32 × 6 = 192 images), and the remaining 8 persons (indices 33-40, 8 × 6 = 48 images) as the validation set.
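
A minimal sketch of the dataset and split, assuming grayscale images normalized to roughly [-0.5, 0.5] and resized to 80 × 60; the array names, batch size, and preprocessing values here are illustrative assumptions, not taken verbatim from my code:

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NoseTipDataset(Dataset):
    """Returns (image, nose_point) pairs; keypoints are (x, y) in [0, 1] coordinates."""

    def __init__(self, images, points):
        self.images = torch.as_tensor(images, dtype=torch.float32)
        self.points = torch.as_tensor(points, dtype=torch.float32)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx].unsqueeze(0), self.points[idx]  # (1, H, W), (2,)

# Stand-in for the preprocessed IMM data (hypothetical: 240 grayscale 60x80 images,
# normalized to roughly [-0.5, 0.5], one nose-tip point per image).
all_images = np.random.rand(240, 60, 80).astype(np.float32) - 0.5
all_points = np.random.rand(240, 2).astype(np.float32)

# First 32 persons -> training (192 images); persons 33-40 -> validation (48 images).
train_set = NoseTipDataset(all_images[:192], all_points[:192])
val_set = NoseTipDataset(all_images[192:], all_points[192:])
train_loader = DataLoader(train_set, batch_size=10, shuffle=True)
val_loader = DataLoader(val_set, batch_size=10)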

Dataset after preprocessing


Then a convolutional neural network is constructed. The layers of the CNN are (a code sketch follows the list):

convolutional layer
ReLU
max pooling
convolutional layer
ReLU
max pooling
convolutional layer
ReLU
max pooling
FC layer
ReLU
FC layer
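
A sketch of a network matching the layer list above, using kernel size 5 (as in the loss plot below); the channel widths, hidden size, and the assumed 60 × 80 input are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    """Three conv + ReLU + max-pool stages followed by two FC layers, as listed above."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=5)   # channel widths are assumptions
        self.conv2 = nn.Conv2d(12, 20, kernel_size=5)
        self.conv3 = nn.Conv2d(20, 28, kernel_size=5)
        # with an assumed 60x80 input, the feature map after the third pool is 4x6
        self.fc1 = nn.Linear(28 * 4 * 6, 128)
        self.fc2 = nn.Linear(128, 2)  # (x, y) of the nose tip

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)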


Training loss and validation loss are shown below for kernel size = 5 with the layer setting above.

Mean Squared Error
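
The curves come from tracking the MSE each epoch. A minimal training/validation loop reusing the model and loaders from the sketches above; the epoch count and optimizer settings for this part are assumptions:

import torch
import torch.nn as nn

net = NoseNet()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # assumed settings

for epoch in range(25):  # assumed epoch count
    net.train()
    train_loss = 0.0
    for imgs, pts in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(imgs), pts)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * imgs.size(0)

    net.eval()
    val_loss = 0.0
    with torch.no_grad():
        for imgs, pts in val_loader:
            val_loss += criterion(net(imgs), pts).item() * imgs.size(0)

    print(f"epoch {epoch}: train MSE {train_loss / len(train_set):.5f}, "
          f"val MSE {val_loss / len(val_set):.5f}")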


Predictions of the trained nose tip detection network on test images are shown below. The red dot is the ground truth, and the green dot is the predicted nose tip.

Success cases

Failure cases



A possible reason is that the filter kernel size is not large enough to capture context from the full picture. In the failure cases the network detects a point along the contour of the face, which might also look like a 'tip', but is not the nose tip.
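
For completeness, the red/green overlays above can be drawn by scattering both points over each image; a sketch using the objects from the snippets above:

import matplotlib.pyplot as plt
import torch

net.eval()
imgs, pts = next(iter(val_loader))
with torch.no_grad():
    preds = net(imgs)

img = imgs[0, 0].numpy()
h, w = img.shape
plt.imshow(img, cmap='gray')
# keypoints are stored in [0, 1] coordinates, so scale back to pixels
plt.scatter(pts[0, 0].item() * w, pts[0, 1].item() * h, c='red', label='ground truth')
plt.scatter(preds[0, 0].item() * w, preds[0, 1].item() * h, c='green', label='prediction')
plt.legend()
plt.show()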

Some observed influences of hyperparameters:

Part 2: Full Facial Keypoints Detection

To scale up from nose tip detection to full facial keypoint detection, the input data is labeled with 58 points instead of 1. Because the dataset is small, image augmentation is performed; the implemented augmentations include random rotation and shifting (a sketch follows the figure below). Sampled data after preprocessing are shown below.

Dataset after preprocessing
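
A sketch of the rotation/shift augmentation, transforming the image and its 58 keypoints together; the angle and shift ranges are assumptions, and keypoints are assumed to be stored in [0, 1] coordinates:

import math
import random
import torch
import torchvision.transforms.functional as TF

def augment(img, pts, max_angle=15.0, max_shift=0.1):
    """img: (1, H, W) tensor; pts: (58, 2) tensor of (x, y) in [0, 1]."""
    _, h, w = img.shape
    angle = random.uniform(-max_angle, max_angle)
    tx = int(random.uniform(-max_shift, max_shift) * w)
    ty = int(random.uniform(-max_shift, max_shift) * h)

    # rotate about the image centre, then translate (torchvision convention:
    # positive angle is counter-clockwise, translation is applied after rotation)
    img = TF.affine(img, angle=angle, translate=(tx, ty), scale=1.0, shear=0.0)

    # keypoints rotate by -angle in (y-down) pixel coordinates, then shift
    theta = math.radians(-angle)
    rot = torch.tensor([[math.cos(theta), -math.sin(theta)],
                        [math.sin(theta),  math.cos(theta)]])
    centre = torch.tensor([0.5, 0.5])
    pts = (pts - centre) @ rot.T + centre + torch.tensor([tx / w, ty / h])
    return img, pts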

Detailed architecture of the CNN:
epoch = 20
batch_size = 10
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

Net(
  (conv1): Conv2d(1, 8, kernel_size=(5, 5), stride=(1, 1))
  (max_pool2d1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(8, 16, kernel_size=(5, 5), stride=(1, 1))
  (max_pool2d2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv3): Conv2d(16, 24, kernel_size=(3, 3), stride=(1, 1))
  (max_pool2d3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv4): Conv2d(24, 32, kernel_size=(5, 5), stride=(1, 1))
  (max_pool2d4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv5): Conv2d(32, 40, kernel_size=(3, 3), stride=(1, 1))
  (conv6): Conv2d(40, 48, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=1728, out_features=500, bias=True)
  (fc2): Linear(in_features=500, out_features=200, bias=True)
  (fc3): Linear(in_features=200, out_features=116, bias=True)
)
Training loss and validation loss are shown below.

Mean Squared Error


Predictions of the trained facial keypoint network on test images are shown below. The red dots are the ground truth, and the green dots are the predicted keypoints. Since the test data is also augmented, even the success cases do not sit exactly on the ground truth, but they are not too far off; the network handles rotation and shifting.

Success cases

Failure cases

The failure cases, however, are messier. Some observations: images with large rotations do not work well, profile photos are not recognized, and the prediction is sometimes rotated even when the photo does not appear to be. The network may also be overfitting to the small dataset.

Learnt feature visualisation

The first index is , and the second index is
kernel 1

kernel 2

kernel 3

kernel 4

kernel 5

kernel 6
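
The kernel images above can be generated by plotting slices of the learned weight tensor; a sketch, where `net.conv1` stands for the first convolutional layer of the trained part-2 model (the layer choice and grid layout are assumptions):

import matplotlib.pyplot as plt
import torch.nn as nn

def show_kernels(conv: nn.Conv2d, in_ch: int = 0):
    """Plot each learned filter of `conv` (input channel `in_ch`) as a grayscale image."""
    weights = conv.weight.detach().cpu()   # shape: (out_channels, in_channels, kH, kW)
    n = weights.shape[0]
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
    for i, ax in enumerate(axes):
        ax.imshow(weights[i, in_ch], cmap='gray')
        ax.set_title(f'kernel {i + 1}')
        ax.axis('off')
    plt.show()

# e.g. show_kernels(net.conv1) for the trained part-2 network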

Part 3: Train With Larger Dataset

Sampled data after preprocessing are shown below.

Dataset after preprocessing

ResNet-18 is used. The detailed architecture is shown below.
criterion = nn.MSELoss()
optimizer = optim.Adam(network.parameters(), lr=0.0001)
num_epochs = 20

Network(
  (model): ResNet(
    (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (layer1): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (layer2): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (layer3): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (layer4): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
    (fc): Linear(in_features=512, out_features=136, bias=True)
  )
)
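
The printout corresponds to a stock torchvision ResNet-18 with two changes: a 1-channel first convolution (grayscale input) and a 136-way output (68 keypoints × 2 coordinates). A sketch of the wrapper; whether pretrained weights were used is not stated here, so this starts from random initialization:

import torch.nn as nn
import torchvision.models as models

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = models.resnet18()
        # grayscale input: 1 channel instead of the stock 3
        self.model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                     padding=3, bias=False)
        # 68 keypoints x 2 coordinates = 136 outputs
        self.model.fc = nn.Linear(512, 136)

    def forward(self, x):
        return self.model(x)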

Training loss and validation loss are shown below.

Mean Squared Error

Predictions on test data.


Selected predictions on Kaggle

The mean absolute error on the Kaggle test set is 8.36320.


Selected predictions on my own photos


The performance is plausible. The network performs better when the input is a frontal face. When the face is obstructed by hair, is not frontal, or is not a human face at all (the doge photo was probably a stretch, since there is no dog data in the training set), performance degrades.

Bells & Whistles

Heat map regression

Sampled data after preprocessing are shown below.

Dataset after preprocessing


A pixel-wise fully convolutional network is used, so facial keypoint extraction becomes a pixel-wise classification problem: instead of regressing coordinates, the network predicts one heatmap per keypoint. The detailed architecture is shown below.
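
A sketch of the heatmap targets and their decoding, assuming each keypoint is encoded as a 2D Gaussian on a fixed-resolution grid; the grid resolution and sigma are assumptions:

import torch

def keypoints_to_heatmaps(pts, h=56, w=56, sigma=2.0):
    """pts: (K, 2) tensor of (x, y) in [0, 1] -> (K, h, w) Gaussian heatmaps."""
    ys = torch.arange(h, dtype=torch.float32).view(h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, w)
    maps = []
    for x, y in pts:
        cx, cy = x * (w - 1), y * (h - 1)
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        maps.append(g)
    return torch.stack(maps)

def heatmaps_to_keypoints(maps):
    """Decode by taking the argmax of each predicted heatmap."""
    k, h, w = maps.shape
    flat = maps.reshape(k, -1).argmax(dim=1)
    return torch.stack([(flat % w).float() / (w - 1),
                        (flat // w).float() / (h - 1)], dim=1)

Training then minimizes a pixel-wise loss between the predicted and target heatmaps, and keypoints are recovered from the predictions with the argmax decoder above.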

What I've learnt

Reference

Dataset
Code Reference