Project 5 Facial Keypoint Detection with Neural Networks
2021 Fall CS 294-026 Xinwei Zhuang
Part 1: Nose Tip Detection
The IMM Face Database is used for automatic nose tip detection. The dataset contains 240 facial images of 40 persons, with 6 images per person taken from different viewpoints. A preview of the dataset is shown below. The first 32 persons are used as the training set (32 x 6 = 192 images) and the remaining 8 persons (index 33-40) as the validation set (8 x 6 = 48 images).
Then a convolutional neural network is constructed. The layers of the CNN, in order, are (a minimal PyTorch sketch follows the list):
- convolutional layer
- ReLU
- max pooling
- convolutional layer
- ReLU
- max pooling
- convolutional layer
- ReLU
- max pooling
- FC layer
- ReLU
- FC layer
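A minimal PyTorch sketch of this architecture is given below. The channel counts, the hidden width of the FC layer, and the assumed 60x80 grayscale input are illustrative choices rather than the exact values used in training; only the layer ordering follows the list above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    """Conv-ReLU-pool x3, then FC-ReLU-FC, as listed above."""
    def __init__(self):
        super().__init__()
        # channel counts are illustrative assumptions
        self.conv1 = nn.Conv2d(1, 8, kernel_size=5)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=5)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=5)
        # 768 = 32 * 4 * 6 for an assumed 60x80 grayscale input
        self.fc1 = nn.Linear(768, 128)
        self.fc2 = nn.Linear(128, 2)   # (x, y) of the nose tip

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)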
Training loss and validation loss are shown below for kernel size 5 with the layer setting above.
Results of the trained nose tip detection network on validation images are shown below. The red dot is the ground truth, and the green dot is the predicted nose tip.
A possible reason for the failure cases is that the filter kernel size is not large enough to take in the full picture. In the failures, the network detects a point along the contour of the face, which is also a 'tip' of sorts, but not the nose tip.
Some observed effects of the hyperparameters (the training-loop sketch after this list shows where each one enters):
- A large learning rate causes training to diverge.
- Increasing the batch size makes the CNN perform better.
- Padding does not noticeably change the performance.
- Enlarging the kernel size leads to quicker convergence.
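For context, here is a sketch of the training loop these hyperparameters plug into. The names net, train_set, and val_set stand for the model and datasets described above, and the concrete values are illustrative.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# net, train_set, val_set are assumed to be defined elsewhere
train_loader = DataLoader(train_set, batch_size=10, shuffle=True)  # batch size knob
val_loader = DataLoader(val_set, batch_size=10)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # learning rate knob

for epoch in range(20):
    net.train()
    for images, keypoints in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(images), keypoints)
        loss.backward()
        optimizer.step()
    # validation loss, averaged over batches
    net.eval()
    with torch.no_grad():
        val_loss = sum(criterion(net(x), y).item() for x, y in val_loader) / len(val_loader)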
Part 2: Full Facial Keypoints Detection
To scale up from nose tip detection to full facial keypoint detection, the input data is labeled with 58 points instead of 1. Because the dataset is small, image augmentation is performed. The implemented augmentations (sketched in code after this list) include:
- randomly changing the brightness, applied 50% of the time
- randomly rotating the face between -15 and 15 degrees
- randomly shifting the face within 10% of the image size
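Below is a sketch of how such joint image-and-keypoint augmentation can be written with torchvision's functional transforms (recent torchvision, tensor images). The brightness factor range and the sign conventions noted in the comments are my assumptions.

import math
import random
import torch
import torchvision.transforms.functional as TF

def augment(image, keypoints):
    """Jointly augment a grayscale tensor image (1, H, W) and its
    (N, 2) keypoints in (x, y) pixel coordinates, per the list above."""
    _, h, w = image.shape
    # brightness change, applied with 50% probability (factor range assumed)
    if random.random() < 0.5:
        image = TF.adjust_brightness(image, random.uniform(0.5, 1.5))
    # rotation in [-15, 15] degrees about the image centre
    angle = random.uniform(-15.0, 15.0)
    image = TF.rotate(image, angle)  # counter-clockwise, PIL convention
    theta = math.radians(angle)
    cx, cy = w / 2.0, h / 2.0
    x, y = keypoints[:, 0] - cx, keypoints[:, 1] - cy
    # rotate points the same way (y axis points down in pixel coordinates)
    keypoints = torch.stack(
        [cx + x * math.cos(theta) + y * math.sin(theta),
         cy - x * math.sin(theta) + y * math.cos(theta)], dim=1)
    # shift within 10% of the image size; same offsets applied to the points
    dx, dy = int(random.uniform(-0.1, 0.1) * w), int(random.uniform(-0.1, 0.1) * h)
    image = TF.affine(image, angle=0.0, translate=[dx, dy], scale=1.0, shear=[0.0])
    keypoints = keypoints + torch.tensor([dx, dy])
    return image, keypoints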
Detailed architecture of the CNN:
epoch = 20
batch size = 10
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
Net(
(conv1): Conv2d(1, 8, kernel_size=(5, 5), stride=(1, 1))
(max_pool2d1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv2): Conv2d(8, 16, kernel_size=(5, 5), stride=(1, 1))
(max_pool2d2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv3): Conv2d(16, 24, kernel_size=(3, 3), stride=(1, 1))
(max_pool2d3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv4): Conv2d(24, 32, kernel_size=(5, 5), stride=(1, 1))
(max_pool2d4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv5): Conv2d(32, 40, kernel_size=(3, 3), stride=(1, 1))
(conv6): Conv2d(40, 48, kernel_size=(3, 3), stride=(1, 1))
(fc1): Linear(in_features=1728, out_features=500, bias=True)
(fc2): Linear(in_features=500, out_features=200, bias=True)
(fc3): Linear(in_features=200, out_features=116, bias=True)
)
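A sketch of the nn.Module that would produce the printout above. The ReLU placement in forward() is my inference, and the 224x224 grayscale input is an assumption, but it is the size consistent with fc1's in_features (48 * 6 * 6 = 1728).

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=5)
        self.max_pool2d1 = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=5)
        self.max_pool2d2 = nn.MaxPool2d(2, 2)
        self.conv3 = nn.Conv2d(16, 24, kernel_size=3)
        self.max_pool2d3 = nn.MaxPool2d(2, 2)
        self.conv4 = nn.Conv2d(24, 32, kernel_size=5)
        self.max_pool2d4 = nn.MaxPool2d(2, 2)
        self.conv5 = nn.Conv2d(32, 40, kernel_size=3)
        self.conv6 = nn.Conv2d(40, 48, kernel_size=3)
        self.fc1 = nn.Linear(1728, 500)  # 1728 = 48 * 6 * 6 for a 224x224 input
        self.fc2 = nn.Linear(500, 200)
        self.fc3 = nn.Linear(200, 116)   # 58 keypoints * 2 coordinates

    def forward(self, x):
        x = self.max_pool2d1(F.relu(self.conv1(x)))
        x = self.max_pool2d2(F.relu(self.conv2(x)))
        x = self.max_pool2d3(F.relu(self.conv3(x)))
        x = self.max_pool2d4(F.relu(self.conv4(x)))
        x = F.relu(self.conv5(x))
        x = F.relu(self.conv6(x))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)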
Training loss and validation loss are shown below.
Results of the trained facial keypoint detection network on test images are shown below. The red dots are the ground truth, and the green dots are the predicted keypoints. I also augmented the test data, so even in the success cases the predictions do not sit exactly on the ground truth, but they are not far off; the network handles rotation and shifting.
Learnt feature visualisation
Part 3: Train With Larger Dataset
Sampled data after preprocessing are shown below. ResNet-18 is used; the detailed architecture is given below.
criterion = nn.MSELoss()
optimizer = optim.Adam(network.parameters(), lr=0.0001)
num_epochs = 20
Network(
(model): ResNet(
(conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer3): Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=512, out_features=136, bias=True)
)
)
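This network can be reproduced by modifying torchvision's stock ResNet-18: replace the first convolution so it accepts one grayscale channel, and the final fully connected layer so it outputs 68 * 2 = 136 coordinates. A sketch (the printout does not record whether pretrained weights were used, so none are loaded here):

import torch.nn as nn
import torchvision.models as models

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = models.resnet18()
        # one input channel for grayscale images
        self.model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                     padding=3, bias=False)
        # 136 outputs = 68 keypoints * 2 coordinates
        self.model.fc = nn.Linear(512, 136)

    def forward(self, x):
        return self.model(x)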
Training loss and validation loss are shown below.
Predictions on test data.
Selected predictions on Kaggle
The mean absolute error on Kaggle is 8.36320.
Selected predictions on my own photos
The performance is reasonable. The network does better when the input is a frontal face; when the face is obstructed by hair, is not frontal, or is not a human face at all (the doge probably shouldn't count, since there is no dog data in the training set), it performs noticeably worse.
Bells & Whistles
Heat map regression
Sampled data after preprocessing are shown below.
A fully convolutional network is used, so facial keypoint extraction becomes a pixel-wise prediction problem over per-keypoint heat maps. The detailed architecture is given below.
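Concretely, one common target construction renders each keypoint as a small 2D Gaussian and has the network regress one such map per keypoint; at test time the coordinate is read off as the argmax of its map. A sketch of this construction, with the heat map resolution and sigma as assumed parameters:

import torch

def keypoints_to_heatmaps(keypoints, height, width, sigma=3.0):
    """Render one Gaussian heat map per keypoint.
    keypoints: (N, 2) tensor of (x, y) pixel coordinates.
    Returns an (N, height, width) tensor peaking at each keypoint."""
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    maps = []
    for x0, y0 in keypoints:
        d2 = (xs - x0) ** 2 + (ys - y0) ** 2   # squared distance grid
        maps.append(torch.exp(-d2 / (2 * sigma ** 2)))
    return torch.stack(maps)

def heatmaps_to_keypoints(heatmaps):
    """Decode predicted maps back to (x, y) via per-map argmax."""
    n, h, w = heatmaps.shape
    flat = heatmaps.view(n, -1).argmax(dim=1)
    return torch.stack([(flat % w).float(),
                        torch.div(flat, w, rounding_mode='floor').float()], dim=1)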
What I've learnt
- Training takes much longer than I expected, and many unexpected errors happen. I will start earlier next time.
- Batch size is highly relevant to training speed, but the training process converges much more quickly than I thought (within only 10 epochs).
Reference
Dataset
- IMM Face Database:
https://web.archive.org/web/20210305094647/http://www2.imm.dtu.dk/~aam/datasets/datasets.html
- 300 Faces In-the-Wild Challenge (300-W):
https://ibug.doc.ic.ac.uk/resources/300-W/
Code Reference
- https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
- https://discuss.pytorch.org/t/visualize-feature-map/29597/14
- https://thecleverprogrammer.com/2020/07/22/face-landmarks-detection/
- https://www.jeremyafisher.com/augmenting-image-landmarks-along-with-images-in-pytorch.html
- https://github.com/princeton-vl/pose-hg-train