Project 5 Facial Keypoint Detection with Neural Networks
2021 Fall CS 294-026 Xinwei Zhuang
Part 1: Nose Tip Detection
The IMM Face Database is used for automatic nose tip detection. The dataset contains 240 facial images of 40 persons, with 6 images per person taken from different viewpoints. A preview of the dataset is shown below. The first 32 persons are used as the training set (32 x 6 = 192 images) and the remaining 8 persons (index 33-40) as the validation set (8 x 6 = 48 images).
Then a convolutional neural network is constructed. The layers of the CNN, in order, are (a minimal PyTorch sketch follows the list):
- convolutional layer
- ReLU
- max pooling
- convolutional layer
- ReLU
- max pooling
- convolutional layer
- ReLU
- max pooling
- FC layer
- ReLU
- FC layer
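A minimal PyTorch sketch of this architecture is given below. The channel counts, the hidden width of the FC layer, and the assumed 60x80 grayscale input are illustrative choices rather than the exact values used in training; only the layer ordering follows the list above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    """Conv-ReLU-pool x3, then FC-ReLU-FC, as listed above."""
    def __init__(self):
        super().__init__()
        # channel counts are illustrative assumptions
        self.conv1 = nn.Conv2d(1, 8, kernel_size=5)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=5)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=5)
        # 768 = 32 * 4 * 6 for an assumed 60x80 grayscale input
        self.fc1 = nn.Linear(768, 128)
        self.fc2 = nn.Linear(128, 2)   # (x, y) of the nose tip

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)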
Training loss and validation loss are shown below for kernel size 5 with the layer setting above.
Results of the trained nose tip detection network on validation images are shown below. The red dot is the ground truth, and the green dot is the predicted nose tip.
A possible reason for the failure cases is that the filter kernel size is not large enough to take in the full picture. In the failures, the network detects a point along the contour of the face, which is also a 'tip' of sorts, but not the nose tip.
Some observed effects of the hyperparameters (the training-loop sketch after this list shows where each one enters):
- A large learning rate causes training to diverge.
- Increasing the batch size makes the CNN perform better.
- Padding does not noticeably change the performance.
- Enlarging the kernel size leads to quicker convergence.
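For context, here is a sketch of the training loop these hyperparameters plug into. The names net, train_set, and val_set stand for the model and datasets described above, and the concrete values are illustrative.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# net, train_set, val_set are assumed to be defined elsewhere
train_loader = DataLoader(train_set, batch_size=10, shuffle=True)  # batch size knob
val_loader = DataLoader(val_set, batch_size=10)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # learning rate knob

for epoch in range(20):
    net.train()
    for images, keypoints in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(images), keypoints)
        loss.backward()
        optimizer.step()
    # validation loss, averaged over batches
    net.eval()
    with torch.no_grad():
        val_loss = sum(criterion(net(x), y).item() for x, y in val_loader) / len(val_loader)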
Part 2: Full Facial Keypoints Detection
To scale up from nose tip detection to full facial keypoint detection, the input data is labeled with 58 points instead of 1. Because the dataset is small, image augmentation is performed. The implemented augmentations (sketched in code after this list) include:
- randomly changing the brightness, applied 50% of the time
- randomly rotating the face between -15 and 15 degrees
- randomly shifting the face within 10% of the image size
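Below is a sketch of how such joint image-and-keypoint augmentation can be written with torchvision's functional transforms (recent torchvision, tensor images). The brightness factor range and the sign conventions noted in the comments are my assumptions.

import math
import random
import torch
import torchvision.transforms.functional as TF

def augment(image, keypoints):
    """Jointly augment a grayscale tensor image (1, H, W) and its
    (N, 2) keypoints in (x, y) pixel coordinates, per the list above."""
    _, h, w = image.shape
    # brightness change, applied with 50% probability (factor range assumed)
    if random.random() < 0.5:
        image = TF.adjust_brightness(image, random.uniform(0.5, 1.5))
    # rotation in [-15, 15] degrees about the image centre
    angle = random.uniform(-15.0, 15.0)
    image = TF.rotate(image, angle)  # counter-clockwise, PIL convention
    theta = math.radians(angle)
    cx, cy = w / 2.0, h / 2.0
    x, y = keypoints[:, 0] - cx, keypoints[:, 1] - cy
    # rotate points the same way (y axis points down in pixel coordinates)
    keypoints = torch.stack(
        [cx + x * math.cos(theta) + y * math.sin(theta),
         cy - x * math.sin(theta) + y * math.cos(theta)], dim=1)
    # shift within 10% of the image size; same offsets applied to the points
    dx, dy = int(random.uniform(-0.1, 0.1) * w), int(random.uniform(-0.1, 0.1) * h)
    image = TF.affine(image, angle=0.0, translate=[dx, dy], scale=1.0, shear=[0.0])
    keypoints = keypoints + torch.tensor([dx, dy])
    return image, keypoints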
Detailed architecture of the CNN:
epoch = 20
batch size = 10
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
Net(
(conv1): Conv2d(1, 8, kernel_size=(5, 5), stride=(1, 1))
(max_pool2d1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv2): Conv2d(8, 16, kernel_size=(5, 5), stride=(1, 1))
(max_pool2d2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv3): Conv2d(16, 24, kernel_size=(3, 3), stride=(1, 1))
(max_pool2d3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv4): Conv2d(24, 32, kernel_size=(5, 5), stride=(1, 1))
(max_pool2d4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv5): Conv2d(32, 40, kernel_size=(3, 3), stride=(1, 1))
(conv6): Conv2d(40, 48, kernel_size=(3, 3), stride=(1, 1))
(fc1): Linear(in_features=1728, out_features=500, bias=True)
(fc2): Linear(in_features=500, out_features=200, bias=True)
(fc3): Linear(in_features=200, out_features=116, bias=True)
)
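A sketch of the nn.Module that would produce the printout above. The ReLU placement in forward() is my inference, and the 224x224 grayscale input is an assumption, but it is the size consistent with fc1's in_features (48 * 6 * 6 = 1728).

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=5)
        self.max_pool2d1 = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=5)
        self.max_pool2d2 = nn.MaxPool2d(2, 2)
        self.conv3 = nn.Conv2d(16, 24, kernel_size=3)
        self.max_pool2d3 = nn.MaxPool2d(2, 2)
        self.conv4 = nn.Conv2d(24, 32, kernel_size=5)
        self.max_pool2d4 = nn.MaxPool2d(2, 2)
        self.conv5 = nn.Conv2d(32, 40, kernel_size=3)
        self.conv6 = nn.Conv2d(40, 48, kernel_size=3)
        self.fc1 = nn.Linear(1728, 500)  # 1728 = 48 * 6 * 6 for a 224x224 input
        self.fc2 = nn.Linear(500, 200)
        self.fc3 = nn.Linear(200, 116)   # 58 keypoints * 2 coordinates

    def forward(self, x):
        x = self.max_pool2d1(F.relu(self.conv1(x)))
        x = self.max_pool2d2(F.relu(self.conv2(x)))
        x = self.max_pool2d3(F.relu(self.conv3(x)))
        x = self.max_pool2d4(F.relu(self.conv4(x)))
        x = F.relu(self.conv5(x))
        x = F.relu(self.conv6(x))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)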
Training loss and validation loss are shown below.
Results of the trained facial keypoint detection network on test images are shown below. The red dots are the ground truth, and the green dots are the predicted keypoints. I also augmented the test data, so even in the success cases the predictions do not sit exactly on the ground truth, but they are not far off; the network handles rotation and shifting.
Learnt feature visualisation
Part 3: Train With Larger Dataset
Sampled data after preprocessing are shown below. ResNet-18 is used; the detailed architecture is given below.
criterion = nn.MSELoss()
optimizer = optim.Adam(network.parameters(), lr=0.0001)
num_epochs = 20
Network(
(model): ResNet(
(conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer3): Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=512, out_features=136, bias=True)
)
)
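This network can be reproduced by modifying torchvision's stock ResNet-18: replace the first convolution so it accepts one grayscale channel, and the final fully connected layer so it outputs 68 * 2 = 136 coordinates. A sketch (the printout does not record whether pretrained weights were used, so none are loaded here):

import torch.nn as nn
import torchvision.models as models

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = models.resnet18()
        # one input channel for grayscale images
        self.model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                     padding=3, bias=False)
        # 136 outputs = 68 keypoints * 2 coordinates
        self.model.fc = nn.Linear(512, 136)

    def forward(self, x):
        return self.model(x)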
Training loss and validation loss are shown below.
Predictions on test data.
Selected predictions on Kaggle
The mean absolute error on Kaggle is 8.36320.
Selected predictions on my own photos
The performance is reasonable. The network does better when the input is a frontal face; when the face is obstructed by hair, is not frontal, or is not a human face at all (the doge probably shouldn't count, since there is no dog data in the training set), it performs noticeably worse.
Bells & Whistles
Heat map regression
Sampled data after preprocessing are shown below.
A fully convolutional network is used, so facial keypoint extraction becomes a pixel-wise prediction problem over per-keypoint heat maps. The detailed architecture is given below.
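Concretely, one common target construction renders each keypoint as a small 2D Gaussian and has the network regress one such map per keypoint; at test time the coordinate is read off as the argmax of its map. A sketch of this construction, with the heat map resolution and sigma as assumed parameters:

import torch

def keypoints_to_heatmaps(keypoints, height, width, sigma=3.0):
    """Render one Gaussian heat map per keypoint.
    keypoints: (N, 2) tensor of (x, y) pixel coordinates.
    Returns an (N, height, width) tensor peaking at each keypoint."""
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    maps = []
    for x0, y0 in keypoints:
        d2 = (xs - x0) ** 2 + (ys - y0) ** 2   # squared distance grid
        maps.append(torch.exp(-d2 / (2 * sigma ** 2)))
    return torch.stack(maps)

def heatmaps_to_keypoints(heatmaps):
    """Decode predicted maps back to (x, y) via per-map argmax."""
    n, h, w = heatmaps.shape
    flat = heatmaps.view(n, -1).argmax(dim=1)
    return torch.stack([(flat % w).float(),
                        torch.div(flat, w, rounding_mode='floor').float()], dim=1)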
What I've learnt
- Training takes much longer than I expected, and many unexpected errors happen. I will start earlier next time.
- Batch size is highly relevant to training speed, but the training process converges much more quickly than I thought (within only 10 epochs).
Reference
Dataset
- IMM Face Database:
https://web.archive.org/web/20210305094647/http://www2.imm.dtu.dk/~aam/datasets/datasets.html
- 300 Faces In-the-Wild Challenge (300-W):
https://ibug.doc.ic.ac.uk/resources/300-W/
Code Reference
- https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
- https://discuss.pytorch.org/t/visualize-feature-map/29597/14
- https://thecleverprogrammer.com/2020/07/22/face-landmarks-detection/
- https://www.jeremyafisher.com/augmenting-image-landmarks-along-with-images-in-pytorch.html
- https://github.com/princeton-vl/pose-hg-train