Facial Key Points Detection with Neural Nets

Daniel Edrisian

Project 5


Overview

In this project, we detect facial key points using neural networks. It is split into three parts: Part 1 trains a small neural network to detect a single nose key point. Part 2 detects the full set of facial key points on the same dataset used in Part 1. Part 3 extends Part 2 with a larger dataset and a more robust neural network architecture.

Part 1 - Nose Detection

Using the Danes' face dataset, we train a neural network to predict one facial key point: the nose. The images are preprocessed (grayscaled and downsized) in the dataloader.
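As a rough sketch of what that dataloader can look like (the class name, the 80x60 input size inferred from the layer summary below, and the normalized-coordinate convention are illustrative assumptions rather than my exact code):

import numpy as np
import torch
from torch.utils.data import Dataset
import skimage.io as skio
import skimage.transform as sktr

class NoseDataset(Dataset):
    """Illustrative dataset: grayscale 80x60 images and one (x, y) nose key point each."""

    def __init__(self, image_paths, nose_points):
        # nose_points are assumed to be in normalized [0, 1] image coordinates
        self.image_paths = image_paths
        self.nose_points = nose_points

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = skio.imread(self.image_paths[idx], as_gray=True).astype(np.float32)
        img = sktr.resize(img, (80, 60))                   # downsize to the network's input size
        img = img - 0.5                                    # shift pixel values to roughly [-0.5, 0.5]
        img = torch.from_numpy(img).float().unsqueeze(0)   # add the channel dimension -> (1, 80, 60)
        point = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return img, point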

Sampled Images from the Dataloader

The following images are sampled from the dataloader and shown with their ground-truth labels (the green dots).


Model Architecture

CNN1(
  (conv1): Conv2d(1, 32, kernel_size=(7, 7), stride=(1, 1))
  (n1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
  (n2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
  (n3): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
  (n4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc1): Linear(in_features=64, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=2, bias=True)
)

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 32, 74, 54]           1,600
       BatchNorm2d-2           [-1, 32, 37, 27]              64
            Conv2d-3           [-1, 32, 33, 23]          25,632
       BatchNorm2d-4           [-1, 32, 16, 11]              64
            Conv2d-5            [-1, 32, 14, 9]           9,248
       BatchNorm2d-6             [-1, 32, 7, 4]              64
            Conv2d-7             [-1, 32, 5, 2]           9,248
       BatchNorm2d-8             [-1, 32, 2, 1]              64
            Linear-9                  [-1, 512]          33,280
           Linear-10                    [-1, 2]           1,026
================================================================
Total params: 80,290
Trainable params: 80,290
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 1.49
Params size (MB): 0.31
Estimated Total Size (MB): 1.82
----------------------------------------------------------------
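Below is a minimal PyTorch sketch that matches the printed architecture and the output shapes in the summary above (80x60 grayscale input). The exact placement of ReLU and 2x2 max pooling in the forward pass is inferred from the shrinking spatial sizes, so treat it as an approximation rather than the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN1(nn.Module):
    """Nose-point regressor: four conv blocks followed by two fully connected layers."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 7)
        self.n1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 32, 5)
        self.n2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, 3)
        self.n3 = nn.BatchNorm2d(32)
        self.conv4 = nn.Conv2d(32, 32, 3)
        self.n4 = nn.BatchNorm2d(32)
        self.fc1 = nn.Linear(64, 512)   # 32 channels * 2 * 1 spatial cells after the last pool
        self.fc2 = nn.Linear(512, 2)    # (x, y) of the nose key point

    def forward(self, x):
        # each block: conv -> relu -> 2x2 max pool -> batch norm
        x = self.n1(F.max_pool2d(F.relu(self.conv1(x)), 2))
        x = self.n2(F.max_pool2d(F.relu(self.conv2(x)), 2))
        x = self.n3(F.max_pool2d(F.relu(self.conv3(x)), 2))
        x = self.n4(F.max_pool2d(F.relu(self.conv4(x)), 2))
        x = x.flatten(1)                # -> (batch, 64)
        x = F.relu(self.fc1(x))
        return self.fc2(x)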

Loss Curves and Hyperparameter Results

During my experiments, I noticed that the higher the learning rate, the faster the loss converged, although I observed some unexpected increases in loss toward the end of training (around epoch 20). Increasing the number of convolution features did not drastically improve performance, only slightly.
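For reference, a condensed sketch of the training loop behind these experiments is shown below; the Adam optimizer, the MSE loss on the predicted coordinates, and the default learning rate are illustrative assumptions rather than my exact settings.

import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    """Train with MSE loss between predicted and ground-truth key point coordinates."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    train_losses, val_losses = [], []
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for images, points in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), points)
            loss.backward()
            optimizer.step()
            running += loss.item()
        train_losses.append(running / len(train_loader))

        # track validation loss after every epoch
        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(x), y).item() for x, y in val_loader)
        val_losses.append(val / len(val_loader))
    return train_losses, val_losses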

Correct Predictions

The following two images were predicted correctly. In both, the subject's face is pointed straight at the camera.

Incorrect Predictions

The incorrect predictions are related to the subjects' angle toward the camera: when the head is turned, the network loses part of the nose information because the full nose isn't visible. The dataset is also too small to cover those cases adequately, and no data augmentation was applied in this part.

Part 2 - Full Facial Keypoints Detection

In this section, I implemented a neural network to predict all of the facial key points. I used data augmentation on the images: resizing, color saturation manipulation, horizontal and vertical flipping, normalization, and rotation.
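Because the key points have to move together with the pixels, the geometric augmentations are applied jointly to the image and its labels. The sketch below shows one way to do that for brightness jitter, rotation, and horizontal flipping (resizing, normalization, and vertical flipping are omitted for brevity); the specific ranges and the helper itself are illustrative, not my exact implementation.

import numpy as np
import skimage.transform as sktr

def augment(img, points, max_angle=10.0):
    """Jointly augment a grayscale image (H x W) and its key points in (x, y) pixel coords."""
    h, w = img.shape

    # brightness jitter (a saturation change has no visible effect on grayscale input,
    # so a simple intensity scale stands in for the color manipulation mentioned above)
    img = np.clip(img * np.random.uniform(0.8, 1.2), 0.0, 1.0)

    # small random rotation about the image center, applied to image and points alike
    angle = np.random.uniform(-max_angle, max_angle)
    img = sktr.rotate(img, angle, mode="edge")
    theta = np.deg2rad(angle)
    # skimage rotates counter-clockwise as displayed; with rows pointing down, this
    # matrix moves the key points to where the underlying pixels end up
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    center = np.array([w - 1, h - 1]) / 2.0
    points = (points - center) @ rot + center

    # random horizontal flip: mirror the image and the x coordinates
    if np.random.rand() < 0.5:
        img = img[:, ::-1].copy()
        points = points.copy()
        points[:, 0] = (w - 1) - points[:, 0]
    return img, points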

Sampled Images from the Dataloader


Model Architecture

CNN2(
  (conv1): Conv2d(1, 16, kernel_size=(7, 7), stride=(1, 1))
  (n1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2): Conv2d(16, 16, kernel_size=(7, 7), stride=(1, 1))
  (n2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv3): Conv2d(16, 16, kernel_size=(5, 5), stride=(1, 1))
  (n3): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
  (n4): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv5): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
  (n5): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc1): Linear(in_features=1056, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=116, bias=True)
)



----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 16, 154, 114]             800
       BatchNorm2d-2           [-1, 16, 77, 57]              32
            Conv2d-3           [-1, 16, 71, 51]          12,560
       BatchNorm2d-4           [-1, 16, 35, 25]              32
            Conv2d-5           [-1, 16, 31, 21]           6,416
       BatchNorm2d-6           [-1, 16, 15, 10]              32
            Conv2d-7            [-1, 16, 13, 8]           2,320
       BatchNorm2d-8            [-1, 16, 13, 8]              32
            Conv2d-9            [-1, 16, 11, 6]           2,320
      BatchNorm2d-10            [-1, 16, 11, 6]              32
           Linear-11                  [-1, 512]         541,184
           Linear-12                  [-1, 116]          59,508
================================================================
Total params: 625,268
Trainable params: 625,268
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.07
Forward/backward pass size (MB): 3.37
Params size (MB): 2.39
Estimated Total Size (MB): 5.83
----------------------------------------------------------------

Training and Validation Loss

Correct Predictions

Incorrect Predictions

Learned Features

Learned filters from Conv1 through Conv5:
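These visualizations come straight from the learned convolution weights. A minimal sketch of how to plot them (assuming a trained model with the layer names printed earlier):

import matplotlib.pyplot as plt

def show_filters(conv_layer, cols=8):
    """Plot each learned kernel of a Conv2d layer as a small grayscale image."""
    weights = conv_layer.weight.data.cpu()   # shape: (out_channels, in_channels, kH, kW)
    kernels = weights[:, 0]                  # visualize the first input channel of each filter
    rows = (len(kernels) + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for ax, kernel in zip(axes.flat, kernels):
        ax.imshow(kernel.numpy(), cmap="gray")
        ax.axis("off")
    plt.show()

show_filters(model.conv1)   # repeat for conv2 ... conv5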

Part 3 - Training With a Larger Dataset

ResNet18 Detailed Architecture

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 64, 224, 224]             640
       BatchNorm2d-2         [-1, 64, 224, 224]             128
              ReLU-3         [-1, 64, 224, 224]               0
         MaxPool2d-4         [-1, 64, 112, 112]               0
            Conv2d-5         [-1, 64, 112, 112]          36,864
       BatchNorm2d-6         [-1, 64, 112, 112]             128
              ReLU-7         [-1, 64, 112, 112]               0
            Conv2d-8         [-1, 64, 112, 112]          36,864
       BatchNorm2d-9         [-1, 64, 112, 112]             128
             ReLU-10         [-1, 64, 112, 112]               0
       BasicBlock-11         [-1, 64, 112, 112]               0
           Conv2d-12         [-1, 64, 112, 112]          36,864
      BatchNorm2d-13         [-1, 64, 112, 112]             128
             ReLU-14         [-1, 64, 112, 112]               0
           Conv2d-15         [-1, 64, 112, 112]          36,864
      BatchNorm2d-16         [-1, 64, 112, 112]             128
             ReLU-17         [-1, 64, 112, 112]               0
       BasicBlock-18         [-1, 64, 112, 112]               0
           Conv2d-19          [-1, 128, 56, 56]          73,728
      BatchNorm2d-20          [-1, 128, 56, 56]             256
             ReLU-21          [-1, 128, 56, 56]               0
           Conv2d-22          [-1, 128, 56, 56]         147,456
      BatchNorm2d-23          [-1, 128, 56, 56]             256
           Conv2d-24          [-1, 128, 56, 56]           8,192
      BatchNorm2d-25          [-1, 128, 56, 56]             256
             ReLU-26          [-1, 128, 56, 56]               0
       BasicBlock-27          [-1, 128, 56, 56]               0
           Conv2d-28          [-1, 128, 56, 56]         147,456
      BatchNorm2d-29          [-1, 128, 56, 56]             256
             ReLU-30          [-1, 128, 56, 56]               0
           Conv2d-31          [-1, 128, 56, 56]         147,456
      BatchNorm2d-32          [-1, 128, 56, 56]             256
             ReLU-33          [-1, 128, 56, 56]               0
       BasicBlock-34          [-1, 128, 56, 56]               0
           Conv2d-35          [-1, 256, 28, 28]         294,912
      BatchNorm2d-36          [-1, 256, 28, 28]             512
             ReLU-37          [-1, 256, 28, 28]               0
           Conv2d-38          [-1, 256, 28, 28]         589,824
      BatchNorm2d-39          [-1, 256, 28, 28]             512
           Conv2d-40          [-1, 256, 28, 28]          32,768
      BatchNorm2d-41          [-1, 256, 28, 28]             512
             ReLU-42          [-1, 256, 28, 28]               0
       BasicBlock-43          [-1, 256, 28, 28]               0
           Conv2d-44          [-1, 256, 28, 28]         589,824
      BatchNorm2d-45          [-1, 256, 28, 28]             512
             ReLU-46          [-1, 256, 28, 28]               0
           Conv2d-47          [-1, 256, 28, 28]         589,824
      BatchNorm2d-48          [-1, 256, 28, 28]             512
             ReLU-49          [-1, 256, 28, 28]               0
       BasicBlock-50          [-1, 256, 28, 28]               0
           Conv2d-51          [-1, 512, 14, 14]       1,179,648
      BatchNorm2d-52          [-1, 512, 14, 14]           1,024
             ReLU-53          [-1, 512, 14, 14]               0
           Conv2d-54          [-1, 512, 14, 14]       2,359,296
      BatchNorm2d-55          [-1, 512, 14, 14]           1,024
           Conv2d-56          [-1, 512, 14, 14]         131,072
      BatchNorm2d-57          [-1, 512, 14, 14]           1,024
             ReLU-58          [-1, 512, 14, 14]               0
       BasicBlock-59          [-1, 512, 14, 14]               0
           Conv2d-60          [-1, 512, 14, 14]       2,359,296
      BatchNorm2d-61          [-1, 512, 14, 14]           1,024
             ReLU-62          [-1, 512, 14, 14]               0
           Conv2d-63          [-1, 512, 14, 14]       2,359,296
      BatchNorm2d-64          [-1, 512, 14, 14]           1,024
             ReLU-65          [-1, 512, 14, 14]               0
       BasicBlock-66          [-1, 512, 14, 14]               0
AdaptiveAvgPool2d-67            [-1, 512, 1, 1]               0
          Dropout-68                  [-1, 512]               0
           Linear-69                  [-1, 136]          69,768
           ResNet-70                  [-1, 136]               0
================================================================
Total params: 11,237,512
Trainable params: 11,237,512
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.19
Forward/backward pass size (MB): 251.13
Params size (MB): 42.87
Estimated Total Size (MB): 294.19
----------------------------------------------------------------
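The summary above corresponds to torchvision's ResNet-18 with two changes: the first convolution is replaced by a single-channel 3x3, stride-1 layer (640 parameters, keeping the full 224x224 resolution), and the classifier is replaced by dropout followed by a 136-output linear layer (68 key points x 2 coordinates). A sketch of that setup, with the dropout probability as an assumption:

import torch.nn as nn
from torchvision import models

def build_keypoint_resnet(dropout_p=0.3):
    """ResNet-18 adapted to regress 68 facial key points (136 values) from grayscale input."""
    model = models.resnet18()
    # accept 1-channel 224x224 input without the usual 7x7, stride-2 downsampling
    model.conv1 = nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1)
    # replace the 1000-way ImageNet classifier with dropout + a 136-way regression head
    model.fc = nn.Sequential(nn.Dropout(dropout_p), nn.Linear(512, 136))
    return model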

Training and Validation Loss

How it Looks on Training Set

The model (expectedly) does well on the training set.

How it Looks on Test Set

Here are some predictions our network made on the test set, which at first glance look great. The model cannot predict key points outside the image boundaries, so it collapses the facial points onto the image edges when the face is cropped by the frame. I personally think this is acceptable. One thing I could try in the future is overlaying two fully framed faces with different overlap ratios and checking whether this effect still exists.

Some full-sized examples:

Trying Out My Own Photos

I tried three of my own photos to test it out. It works fine, except the model seems to be making fun of my eyebrows.
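To run the network on an arbitrary photo, the image just needs the same preprocessing as the training data. The sketch below assumes grayscale 224x224 input, roughly zero-centered pixels, a model kept on the CPU, and key points predicted in normalized [0, 1] coordinates; adjust to match the actual training setup.

import numpy as np
import torch
import skimage.io as skio
import skimage.transform as sktr

def predict_keypoints(model, path):
    """Run the trained ResNet on a photo and return 68 key points in pixel coordinates."""
    img = skio.imread(path, as_gray=True).astype(np.float32)
    h, w = img.shape
    inp = sktr.resize(img, (224, 224)) - 0.5                        # match training preprocessing
    inp = torch.from_numpy(inp).float().unsqueeze(0).unsqueeze(0)   # -> (1, 1, 224, 224)
    model.eval()
    with torch.no_grad():
        points = model(inp).reshape(68, 2).numpy()
    # scale the normalized predictions back to the original image size
    return points * np.array([w, h])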

Bonus Cat

Bonus cat

Conclusion

Thank you for checking out my project 5 submission.