CS 194-26 Fall 2021

Project 5: Facial Keypoint Detection with Neural Networks

Vikranth Srivatsa

Overview

In this project, we use neural networks to detect important keypoints on faces. We first detect the keypoint on the nose, then detect the points around the whole face. Finally, we train and test our code on larger datasets.

Part 1: Nose Tip Detection

The first step to implementing the network was loading the data. The following is an image with all the keypoints and then an image with the nose.



The general Net architecture is 3 convolutional layers, followed by a max pool and then two linear layers.

            =================================================================
            Layer (type:depth-idx)                   Param #
            =================================================================
            ├─Conv2d: 1-1                            312
            ├─Conv2d: 1-2                            8,428
            ├─Conv2d: 1-3                            22,432
            ├─MaxPool2d: 1-4                         --
            ├─Linear: 1-5                            1,147,392
            ├─Linear: 1-6                            1,026
            =================================================================
            Total params: 1,179,590
            Trainable params: 1,179,590
            Non-trainable params: 0
            =================================================================
           
            Net(
                (conv1): Conv2d(1, 12, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
                (conv2): Conv2d(12, 28, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
                (conv3): Conv2d(28, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
                (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1)
                (fc1): Linear(in_features=2240, out_features=512, bias=True)
                (fc2): Linear(in_features=512, out_features=2, bias=True)
              )
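The parameter counts in the summary can be reproduced by hand. The following is a small sketch in plain Python (the helper names `conv2d_params` and `linear_params` are made up for illustration), using only the layer sizes printed above.

```python
# Parameter-count check for the Part 1 network, using the printed layer sizes.

def conv2d_params(in_ch, out_ch, kernel):
    """Weights (in_ch * out_ch * k * k) plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch

def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output feature."""
    return in_features * out_features + out_features

layers = {
    "conv1": conv2d_params(1, 12, 5),
    "conv2": conv2d_params(12, 28, 5),
    "conv3": conv2d_params(28, 32, 5),
    "fc1": linear_params(2240, 512),
    "fc2": linear_params(512, 2),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

The total comes out to 1,179,590, matching the summary. Note that fc1's in_features of 2240 equals 32 x 7 x 10, which would be consistent with a 60x80 input pooled by 2 after each of the three convolutions; the text does not state this, so treat it as an inference.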
           
The model was trained with a learning rate of 1e-3, the Adam optimizer, and a batch size of 16.

valid detection

valid detection

invalid detection

invalid detection

This disparity might be because the model doesn't generalize well when the face is rotated or looking off at an angle. This is addressed in later parts of the project with larger models and data augmentation.

Trying different hyperparameters

Part 2: Full Facial Keypoints Detection

Similar to Part 1, we load the data with the keypoints.

In order to improve generalization, I also implemented a few data augmentation transforms.
Horizontally flipping an image

Randomly shifting an image

Randomly rotating an image

Adding random noise in the transformations
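Each geometric transform has to be applied to the keypoints as well as to the image, otherwise the labels no longer line up with the pixels. The following is a minimal sketch of the keypoint side only, assuming keypoints are stored as (x, y) pixel coordinates; the function names and the shift and angle ranges are hypothetical and not taken from the project code.

```python
import math
import random

def flip_horizontal(points, width):
    """Mirror pixel-coordinate keypoints left-to-right in an image `width` pixels wide."""
    return [(width - 1 - x, y) for x, y in points]

def shift(points, dx, dy):
    """Translate keypoints by the same (dx, dy) offset applied to the image."""
    return [(x + dx, y + dy) for x, y in points]

def rotate(points, angle_deg, center):
    """Rotate keypoints about `center` with the standard 2D rotation matrix."""
    cx, cy = center
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [
        (cx + (x - cx) * cos_a - (y - cy) * sin_a,
         cy + (x - cx) * sin_a + (y - cy) * cos_a)
        for x, y in points
    ]

def random_augment(points, width, height, max_shift=10, max_angle=15, rng=random):
    """Apply a random shift and rotation; the ranges are hypothetical defaults."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    angle = rng.uniform(-max_angle, max_angle)
    center = ((width - 1) / 2, (height - 1) / 2)
    return rotate(shift(points, dx, dy), angle, center)

keypoints = [(100.0, 90.0), (140.0, 90.0), (120.0, 120.0)]
print(flip_horizontal(keypoints, width=240))
print(random_augment(keypoints, width=240, height=180))
```

Two details are easy to get wrong. With the image y-axis pointing down, a positive angle in `rotate` appears clockwise on screen, so the sign must match whatever routine rotates the image. A horizontal flip also exchanges semantic left and right landmarks (the left eye becomes the right eye), so the landmark indices may need remapping; that step is omitted here because it depends on the dataset's landmark ordering.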

For this part, however, I found that applying these transformations caused the model to underfit. I used random shift, rotation, and noise in Part 3 of this project.

The architecture of the model consists of 5 convolutional layers, with the last 2 each followed by max pooling, and then 3 fully connected layers.

Net(
    (conv1): Conv2d(1, 12, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (conv2): Conv2d(12, 24, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (conv3): Conv2d(24, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (conv4): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (conv5): Conv2d(64, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (pool): MaxPool2d(kernel_size=(4, 4), stride=(4, 4), padding=0, dilation=1, ceil_mode=False)
    (fc1): Linear(in_features=21120, out_features=2640, bias=True)
    (fc2): Linear(in_features=2640, out_features=512, bias=True)
    (fc3): Linear(in_features=512, out_features=116, bias=True)
)
            
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
├─Conv2d: 1-1                            [-1, 12, 180, 240]        312
├─Conv2d: 1-2                            [-1, 24, 180, 240]        7,224
├─Conv2d: 1-3                            [-1, 32, 180, 240]        19,232
├─Conv2d: 1-4                            [-1, 64, 180, 240]        51,264
├─MaxPool2d: 1-5                         [-1, 64, 45, 60]          --
├─Conv2d: 1-6                            [-1, 128, 45, 60]         204,928
├─MaxPool2d: 1-7                         [-1, 128, 11, 15]         --
├─Linear: 1-8                            [-1, 2640]                55,759,440
├─Linear: 1-9                            [-1, 512]                 1,352,192
├─Linear: 1-10                           [-1, 116]                 59,508
==========================================================================================
Total params: 57,454,100
Trainable params: 57,454,100
Non-trainable params: 0
Total mult-adds (G): 3.98
==========================================================================================
Input size (MB): 0.16
Forward/backward pass size (MB): 46.17
Params size (MB): 219.17
Estimated Total Size (MB): 265.50
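The output shapes in the summary follow from the usual convolution and pooling size formulas. The following sketch traces a 180x240 input (the size implied by the first output shape) through the five convolutions and the two 4x4 pools; the helper names are illustrative only.

```python
# Trace spatial sizes through the Part 2 network using the printed layer settings.

def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel, stride):
    """Output length of max pooling along one dimension (ceil_mode=False)."""
    return (size - kernel) // stride + 1

h, w = 180, 240
channels = [12, 24, 32, 64, 128]   # output channels of conv1..conv5
pooled_after = {4, 5}              # pooling follows conv4 and conv5

for i, c in enumerate(channels, start=1):
    # 5x5 kernel, stride 1, padding 2 keeps the spatial size unchanged
    h, w = conv_out(h, 5, 1, 2), conv_out(w, 5, 1, 2)
    if i in pooled_after:
        h, w = pool_out(h, 4, 4), pool_out(w, 4, 4)
    print(f"after conv{i}: {c} x {h} x {w}")

flat = channels[-1] * h * w
print("flattened features:", flat)
```

The flattened size is 128 x 11 x 15 = 21120, which matches fc1's in_features. It also shows why fc1 dominates the model: nearly all of the 57.4 million parameters sit in that single 21120-to-2640 layer.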
            

A full model architecture pipeline is provided in the following PDF.

The best run, which is the one used for the following results, trained for 200 epochs with lr=1e-4, a batch size of 64, the Adam optimizer, and default settings otherwise. It reached an MSE loss of 0.00019049581896979362.
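To get a feel for what this loss means, it can be converted into an approximate pixel error. The sketch below assumes the keypoints were normalized to the range [0, 1] by image width and height before computing the loss; the report does not say so, so the pixel figures are only valid under that assumption.

```python
import math

mse = 0.00019049581896979362   # reported MSE loss
height, width = 180, 240       # input size from the model summary

# ASSUMPTION: keypoint coordinates were scaled to [0, 1] before the loss.
rmse = math.sqrt(mse)          # RMS error per coordinate, in normalized units

print(f"normalized RMS error: {rmse:.4f}")
print(f"approx. horizontal error: {rmse * width:.1f} px")
print(f"approx. vertical error:   {rmse * height:.1f} px")
```

Under that assumption the RMS error per coordinate is about 1.4% of the image size, or roughly 3 pixels on a 240-pixel-wide image.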

valid detection

valid detection

invalid detection

invalid detection

This disparity might be because the model isn't large enough to generalize to all points. However, it did pretty well in general.

The following are the 5 conv layers visualized.

Part 3: Training with a large dataset

We load the keypoints from the ibug_300W dataset. Since bounding boxes are provided for the train/test images, we can crop each image to its face. I chose to move the left and top edges of the box out by 50 pixels, clamped at the image border (max(x-50, 0)), and to add 150 pixels to the width and height. This is because in some cases the original box cropped out keypoints, which could make it harder for the model to generalize and predict the correct positions.
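The padded crop can be sketched as follows. This is one reading of the description, not the project's actual code: the function names are hypothetical, the box is assumed to be given as (x, y, width, height) in pixels, and the keypoints are assumed to be re-expressed relative to the crop.

```python
def padded_crop_box(x, y, w, h, img_w, img_h, offset=50, grow=150):
    """Expand a face box: move left/top out by `offset`, enlarge width/height by `grow`.

    All edges are clamped to the image so the crop never leaves the picture.
    """
    left = max(x - offset, 0)
    top = max(y - offset, 0)
    right = min(left + w + grow, img_w)
    bottom = min(top + h + grow, img_h)
    return left, top, right, bottom

def to_crop_coords(points, box):
    """Map absolute pixel keypoints to [0, 1] coordinates inside the crop."""
    left, top, right, bottom = box
    crop_w, crop_h = right - left, bottom - top
    return [((px - left) / crop_w, (py - top) / crop_h) for px, py in points]

# A box near the left border: the left edge clamps to 0 instead of going negative.
box = padded_crop_box(x=30, y=80, w=200, h=200, img_w=640, img_h=480)
print(box)
print(to_crop_coords([(175, 205)], box))
```

Clamping with max(..., 0) matters because a negative index would otherwise wrap around or fail when slicing the image array.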

The following are two faces visualized after bounding box cropping and with the keypoints

I tried several architectures. The first was a pretrained resnet18 with the first convolutional layer changed to take one input channel and the last fully connected layer replaced. I repeated this for resnet50 and resnet101. I also tried weights pretrained on different datasets such as CelebA or vgg-face2.

The image below uses resnet18 with a learning rate of 1e-4 and a batch size of 64 over 50 epochs, with everything else default.

The current best result is a Kaggle score of 11.72269, using a resnet101 model trained for 120 epochs with lr 1e-4 and a batch size of 32.

The following are 4 photos in the test dataset. The keypoints are very close in these images. However, the model doesn't fully work on cartoon images because of the exaggerated features.

The following are 4 photos not in the dataset. Three of them are from "This face does not exist.com". The other is an image of Aang from the Avatar TV show. The keypoints are very close in these images.

Part 4: Bells and Whistles: Video

Video of the Avengers cast, face-morphed with the Project 3 code using the model from Part 3. Link to video: https://youtu.be/cJQsgz6Wnlo



Video of the Presidents cast, face-morphed with the Project 3 code using the model from Part 3.

Part 4: B&W UNet

A UNet predicts a mask that holds the probability of each area being a landmark, rather than regressing the exact coordinate values.

The following are some Gaussian masks created from the dataset.
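A Gaussian mask for one landmark can be built by evaluating a 2D Gaussian centred on the keypoint at every pixel. The following is a minimal sketch in plain Python; the function name, the value of sigma, and the choice of one mask per landmark are assumptions, since the report does not give them.

```python
import math

def gaussian_mask(height, width, cx, cy, sigma=3.0):
    """Return a height x width grid whose value is 1.0 at (cx, cy) and decays with distance."""
    two_sigma_sq = 2.0 * sigma * sigma
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / two_sigma_sq) for x in range(width)]
        for y in range(height)
    ]

# Small grid so the result is easy to inspect; real masks would match the image size.
mask = gaussian_mask(height=9, width=12, cx=4, cy=6)

peak_value = max(max(row) for row in mask)
peak_yx = max(
    ((y, x) for y in range(9) for x in range(12)),
    key=lambda p: mask[p[0]][p[1]],
)
print(peak_value, peak_yx)
```

A larger sigma spreads the target over more pixels, which gives the network a smoother signal to learn from at the cost of a less precise peak.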

The architecture of the UNet first downsamples the input with conv layers and then upsamples it again. There are also skip connections that bring results from the earlier stages to the later stages.

The following is the architecture diagram extracted from TensorBoard.

                | Name                | Type       | Params
                ---------------------------------------------------
                0 | down_sampling_convs | ModuleList | 18.8 M
                1 | upward_convs        | ModuleList | 9.4 M 
                2 | upsampling_convs    | ModuleList | 2.8 M 
                3 | pool                | MaxPool2d  | 0     
                4 | conv_last           | Conv2d     | 65    
                5 | loss                | MSELoss    | 0     
                6 | drop                | Dropout    | 0     
                ---------------------------------------------------
                31.0 M    Trainable params
                0         Non-trainable params
                31.0 M    Total params
            

The training results were not great. The model was run for 8 epochs with lr 1e-4 and reached a loss of 1.7e-5.

Predicted Gaussian mask 1 on top. Image without augmentation in middle. Original Gaussian mask on bottom.

Predicted Gaussian mask 1 on top. Image without augmentation in middle. Original Gaussian mask on bottom.

The predicted images above have their contrast increased for easier viewing. The mask roughly picks up the landmark areas, but it fails by spreading its predictions outside those areas.

Sample loss vs. training diagram across a few runs

Predicted Gaussian mask 1 on top. Image without augmentation in middle. Original Gaussian mask on bottom.

The next step involved running fully connected layers on top of the UNet output to produce the keypoints. Combined with the size of the UNet, the model ended up being too large to train. I tried using AMP mixed precision, but that didn't help. I also tried applying pooling layers, but that didn't help with the loss. This step also fails because of the inaccuracy of the previous step.

The following are the parameters used.

            | Name       | Type      | Params
            -----------------------------------------
            0 | unet_model | UNet      | 31.0 M
            1 | conv1      | Conv2d    | 312   
            2 | fc1        | Linear    | 397 M 
            3 | fc2        | Linear    | 1.4 M 
            4 | fc3        | Linear    | 69.8 K
            5 | pool       | MaxPool2d | 0     
            6 | loss       | MSELoss   | 0     
            -----------------------------------------
            429 M     Trainable params
            0         Non-trainable params
            429 M     Total params
            1,719.421 Total estimated model params size (MB)
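A rough memory estimate shows why this model is hard to train. The sketch below assumes float32 weights, that the reported MB means 10^6 bytes, and that Adam is the optimizer (as in the earlier parts; this step does not state it). It counts only weights, gradients, and Adam's two moment buffers, and ignores activations entirely.

```python
reported_mb = 1719.421    # "Total estimated model params size (MB)" from the summary
bytes_per_param = 4       # ASSUMPTION: float32 parameters

# Recover the parameter count implied by the reported size.
n_params = reported_mb * 1e6 / bytes_per_param

# ASSUMPTION: training with Adam keeps four same-sized tensors per parameter:
# the weight, its gradient, and the first and second moment estimates.
training_copies = 4
training_mb = reported_mb * training_copies

print(f"implied parameter count: {n_params / 1e6:.1f} M")
print(f"weights + grads + Adam state: {training_mb / 1000:.1f} GB (activations excluded)")
```

Under these assumptions the parameter-related memory alone is close to 7 GB, almost all of it from the 397 M parameter fc1 layer. This may also explain why AMP did not help: automatic mixed precision in PyTorch keeps the parameters and optimizer state in float32 and mainly reduces activation memory.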