Project 04: Facial Keypoint Detection with Neural Networks

Vanessa Lin

Overview

In this project, I built convolutional neural networks to detect keypoints on a face, from just the nose tip to the whole face structure (eyes, eyebrows, nose, lips, face outline). To detect these facial keypoints, I first built two toy CNN models and trained them on the Danish computer scientists dataset. Afterwards, I moved on to a larger dataset, the ibug face dataset, to train and detect facial keypoints.

Nose Tip Detection

In this part of the project, I built a simple convolutional neural network that takes an image as input and outputs the coordinates of the nose tip. Before training, I created a PyTorch Dataset that returns samples of the form {'image': image, 'landmarks': landmarks}, where the image is resized down to 80 by 60 and the landmarks are a single point for the nose. Next, I created a DataLoader to iterate through the data during training and validation (a sketch of this setup follows the example images below). Below are some images from the training set with the ground truth labels in red.




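A minimal sketch of what this Dataset and DataLoader setup can look like (class and variable names here are illustrative, not the exact code from the project):

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NoseDataset(Dataset):
    """Returns {'image': image, 'landmarks': landmarks} samples."""
    def __init__(self, images, landmarks):
        self.images = images        # grayscale images already resized to 80x60
        self.landmarks = landmarks  # one (x, y) nose point per image

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = torch.tensor(self.images[idx], dtype=torch.float32).unsqueeze(0)  # (1, 60, 80)
        landmarks = torch.tensor(self.landmarks[idx], dtype=torch.float32)        # (1, 2)
        return {'image': image, 'landmarks': landmarks}

# Dummy arrays just so the sketch runs; in the project these come from the
# dataset's annotation files.
train_images = np.random.rand(8, 60, 80)
train_landmarks = np.random.rand(8, 1, 2)

# Iterate through the data in batches during training (and similarly for validation).
train_loader = DataLoader(NoseDataset(train_images, train_landmarks),
                          batch_size=8, shuffle=True)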
For the neural network, I used 3 convolutional layers and 2 fully connected layers:

  • a convolutional layer with 5x5 filters and 15 output channels followed by a ReLU and a 2x2 max pool
  • a convolutional layer with 5x5 filters and 20 output channels followed by a ReLU and a 2x2 max pool
  • a convolutional layer with 5x5 filters and 25 output channels followed by a ReLU and a 2x2 max pool
  • a fully connected layer with input size of 600 and output size of 200 followed by a ReLU
  • a fully connected layer with input size of 200 and output size of 2
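A sketch of this architecture as a PyTorch module (with 80x60 inputs, the flattened feature size after the third pool works out to 25 * 4 * 6 = 600):

import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 15, kernel_size=5)   # 80x60 -> 76x56, pooled to 38x28
        self.conv2 = nn.Conv2d(15, 20, kernel_size=5)  # -> 34x24, pooled to 17x12
        self.conv3 = nn.Conv2d(20, 25, kernel_size=5)  # -> 13x8, pooled to 6x4
        self.fc1 = nn.Linear(25 * 4 * 6, 200)          # 600 -> 200
        self.fc2 = nn.Linear(200, 2)                   # (x, y) of the nose tip

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)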


I trained the neural network for 25 epochs with an MSE loss function and an Adam optimizer, using a learning rate of 0.001 and a batch size of 8. Below are the training and validation losses across the 25 epochs.
    Training Process of Nose Keypoint Detection
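The loss curves above come from a standard training/validation loop along these lines (a sketch, assuming the NoseNet and DataLoaders sketched earlier):

import torch

model = NoseNet()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train_losses, val_losses = [], []
for epoch in range(25):
    # Training pass
    model.train()
    total = 0.0
    for batch in train_loader:
        optimizer.zero_grad()
        pred = model(batch['image'])
        loss = criterion(pred, batch['landmarks'].view(-1, 2))
        loss.backward()
        optimizer.step()
        total += loss.item()
    train_losses.append(total / len(train_loader))

    # Validation pass (no gradient updates)
    model.eval()
    total = 0.0
    with torch.no_grad():
        for batch in val_loader:
            pred = model(batch['image'])
            total += criterion(pred, batch['landmarks'].view(-1, 2)).item()
    val_losses.append(total / len(val_loader))

The later parts use essentially the same loop with the hyperparameters listed there (20 epochs and a batch size of 4 for the full-face model, and a batch size of 12 for the ResNet).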
Here are a few good examples, where the neural network's prediction matches or nearly overlaps the ground truth label. The predicted point is labeled in red, while the ground truth label is labeled in green. (Note: the points may be hard to see depending on your screen size; zooming in on the webpage makes them easier to spot.)




Here are a few bad examples, where the neural network was completely off on the predicted nose keypoint. As you can see in these images, the faces are either turned well away from the camera or tilted strongly to one side, which makes the nose tip hard to detect accurately because the turn and tilt move it away from its usual position. Also, since the nose point typically sits in the lighter area right below the tip of the nose, the varying lighting around the nose tip may also have caused the network to predict incorrectly.



Full Facial Keypoints Detection

For the full facial keypoints detection, I built a convolutional neural network similar to the nose detection network, but with two more convolutional layers. Also, since the dataset is small, I added data augmentation: randomly rotating the image by one of [-10, -5, 0, 5, 10] degrees and shifting it horizontally by one of [-10, -5, 0, 5, 10] pixels, updating the keypoints accordingly for each transformation (a sketch of these transforms follows the sample images below). The images were also resized to 160 by 120. Below are some of the sampled images created with the data augmentation, with the ground truth labels in red.





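Here is a sketch of how the rotation and shift can be applied so that the keypoints move with the image (the helper names are illustrative; keypoints are assumed to be (x, y) pixel coordinates):

import numpy as np
from skimage.transform import SimilarityTransform, warp

def rotate_with_keypoints(image, keypoints, angle_deg):
    """Rotate the image about its center and rotate the keypoints with it."""
    h, w = image.shape[:2]
    center = np.array([w, h]) / 2.0
    # The map passed to warp() goes from output coordinates to input coordinates.
    inv = (SimilarityTransform(translation=-center)
           + SimilarityTransform(rotation=np.deg2rad(angle_deg))
           + SimilarityTransform(translation=center))
    rotated = warp(image, inv, mode='edge')
    # Keypoints follow the forward map, i.e. the same transform with the angle negated.
    fwd = (SimilarityTransform(translation=-center)
           + SimilarityTransform(rotation=-np.deg2rad(angle_deg))
           + SimilarityTransform(translation=center))
    return rotated, fwd(keypoints)

def shift_with_keypoints(image, keypoints, dx):
    """Shift the image horizontally by dx pixels and shift the keypoints to match."""
    inv = SimilarityTransform(translation=(-dx, 0))  # output -> input map for warp()
    shifted = warp(image, inv, mode='edge')
    return shifted, keypoints + np.array([dx, 0.0])

# Example usage on a dummy 160x120 image with one dummy keypoint.
image = np.zeros((120, 160))
keypoints = np.array([[80.0, 60.0]])
angle = np.random.choice([-10, -5, 0, 5, 10])
dx = np.random.choice([-10, -5, 0, 5, 10])
image, keypoints = rotate_with_keypoints(image, keypoints, angle)
image, keypoints = shift_with_keypoints(image, keypoints, dx)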
    For the neural network, I used 5 convolutional layers and 2 fully connected layers:

  • a convolutional layer with 7x7 filters and 15 output channels followed by a ReLU
  • a convolutional layer with 5x5 filters and 30 output channels followed by a ReLU
  • a convolutional layer with 3x3 filters and 25 output channels followed by a ReLU and a 2x2 max pool
  • a convolutional layer with 7x7 filters and 20 output channels followed by a ReLU
  • a convolutional layer with 5x5 filters and 15 output channels followed by a ReLU and a 2x2 max pool
  • a fully connected layer with input size of 10560 and output size of 5280 followed by a ReLU
  • a fully connected layer with input size of 5280 and output size of 116 (for the 58 facial points)
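A sketch of this architecture in PyTorch (the spatial sizes in the comments follow from 160x120 inputs, which gives the 10560 flattened features):

import torch.nn as nn
import torch.nn.functional as F

class FaceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 15, kernel_size=7)
        self.conv2 = nn.Conv2d(15, 30, kernel_size=5)
        self.conv3 = nn.Conv2d(30, 25, kernel_size=3)
        self.conv4 = nn.Conv2d(25, 20, kernel_size=7)
        self.conv5 = nn.Conv2d(20, 15, kernel_size=5)
        self.fc1 = nn.Linear(15 * 22 * 32, 5280)   # 10560 -> 5280
        self.fc2 = nn.Linear(5280, 116)            # 58 (x, y) keypoints

    def forward(self, x):
        x = F.relu(self.conv1(x))                   # 160x120 -> 154x114
        x = F.relu(self.conv2(x))                   # -> 150x110
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)  # -> 148x108, pooled to 74x54
        x = F.relu(self.conv4(x))                   # -> 68x48
        x = F.max_pool2d(F.relu(self.conv5(x)), 2)  # -> 64x44, pooled to 32x22
        x = x.flatten(1)                            # 15 * 22 * 32 = 10560
        x = F.relu(self.fc1(x))
        return self.fc2(x)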


I trained the neural network for 20 epochs with an MSE loss function and an Adam optimizer, using a learning rate of 0.001 and a batch size of 4. Below are the training and validation losses across the 20 epochs.
    Training Process of Facial Keypoints Detection
Here are a few good examples, where the neural network detected the facial keypoints fairly well, even if not exactly matching the ground truth labels. The predicted points are labeled in red, while the ground truth labels are labeled in green.





Here are a few bad examples, where the neural network failed to mark the keypoints in the right places and produced an outline that does not seem to follow any of the facial features. These images, especially the first and third, probably failed because the subjects are looking far off to the side rather than facing the camera directly, which makes it hard to distinguish the nose keypoints from the chin outline points. Also, the middle photo failed on the eyebrows: the network probably mistook the shadows of his eyelids for eyebrows, since the man is raising his eyebrows higher than in the other photos.



    Here are some learned 7x7 filters from the first convolutional layer.










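The filters above were read directly from the trained network's first convolutional layer; a minimal sketch of that visualization (assuming `model` is the trained FaceNet sketched earlier and matplotlib is available):

import matplotlib.pyplot as plt

# conv1.weight has shape (15, 1, 7, 7): 15 learned 7x7 kernels over 1 input channel.
filters = model.conv1.weight.detach().cpu().numpy()

fig, axes = plt.subplots(3, 5, figsize=(10, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(filters[i, 0], cmap='gray')
    ax.set_title(f'filter {i}')
    ax.axis('off')
plt.show()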
Train With Larger Dataset

Now for training on a larger dataset, the ibug faces-in-the-wild dataset (6666 images, each annotated with 68 facial keypoints), I used the same data augmentation techniques from part 2 during training, and I added another transform that randomly changes the brightness of the photo by rescaling the intensity with skimage.exposure.rescale_intensity. Since the dataset comes with a bounding box for each image, I enlarged each bounding box by a factor of 1.25, cropped the image to that box, and resized the crop to 224 by 224 for training, updating the keypoint labels accordingly (a sketch of this step is below). For the convolutional neural network, I used ResNet-18 from torchvision.models.resnet18(pretrained=False). I modified the first convolutional layer to take 1 input channel (the input shape is (1, 224, 224)), using a 3x3 kernel with stride 1 as shown in the printout below, and I modified the last fully connected layer to an output size of 136, for the 68 facial points.
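A sketch of the cropping and resizing step with the keypoint update (I'm assuming an (x, y, width, height) bounding-box format here; the idea is just to scale the box by 1.25 about its center, crop, resize to 224x224, and rescale the keypoints with the same factors):

import numpy as np
from skimage.transform import resize

def crop_and_resize(image, keypoints, bbox, scale=1.25, out_size=224):
    """Crop `image` to a scaled bounding box and resize to out_size x out_size.
    bbox is assumed to be (x, y, w, h); keypoints is an (N, 2) array of (x, y)."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    w, h = w * scale, h * scale
    # Clip the enlarged box to the image bounds.
    x0 = int(max(cx - w / 2.0, 0))
    y0 = int(max(cy - h / 2.0, 0))
    x1 = int(min(cx + w / 2.0, image.shape[1]))
    y1 = int(min(cy + h / 2.0, image.shape[0]))

    crop = image[y0:y1, x0:x1]
    resized = resize(crop, (out_size, out_size))

    # Shift keypoints into the crop, then rescale them to the resized image.
    new_keypoints = keypoints.astype(float) - np.array([x0, y0])
    new_keypoints *= np.array([out_size / crop.shape[1], out_size / crop.shape[0]])
    return resized, new_keypoints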





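Concretely, the two changes to the stock ResNet-18 look roughly like this (matching the printout below; note that the first layer also ends up as a 3x3, stride-1 convolution rather than the default 7x7, stride-2 one):

import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(pretrained=False)
# Grayscale input: 1 channel instead of 3, with a 3x3 stride-1 stem as in the printout below.
model.conv1 = nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1, bias=False)
# 68 keypoints * 2 coordinates = 136 outputs.
model.fc = nn.Linear(model.fc.in_features, 136)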
Here is the detailed architecture of the model:
ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)


Again, I trained the model for 20 epochs with an MSE loss function and an Adam optimizer, using a learning rate of 0.001 and a batch size of 12. Below are the training and validation losses across the 20 epochs.

    On Kaggle, the mean absolute error (MAE) of my model is 9.73940.
    Training Process of Large Dataset Facial Keypoints Detection
    Here are some images from the testing set with the predicted facial keypoints detection.






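For reference, here is a sketch of how a prediction on one cropped test image can be mapped back into original-image coordinates by undoing the crop-and-resize step (variable names like `crop`, `x0`, `y0`, `x1`, `y1` refer to the cropping sketch above and are illustrative):

import torch

model.eval()
with torch.no_grad():
    # `crop` is the 224x224 grayscale crop produced by crop_and_resize() above.
    inp = torch.tensor(crop, dtype=torch.float32).reshape(1, 1, 224, 224)
    pred = model(inp).view(68, 2).numpy()

# Undo the resize, then the crop offset, to land back in the original image.
crop_h, crop_w = y1 - y0, x1 - x0
pred[:, 0] = pred[:, 0] * (crop_w / 224.0) + x0
pred[:, 1] = pred[:, 1] * (crop_h / 224.0) + y0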
Here are some images from my collection that I ran the model on. Surprisingly, the model did not work as well on the photos of the Asian actors and actresses, like Luo Yunxi, Bai Lu, Park Seo Joon, and Kim Dami; however, it worked best on John Krasinski and reasonably well on Steve Carell and Jenna Fischer (go The Office cast!). One possible reason the model failed on some of the photos is that Bai Lu's and Kim Dami's noses are tilted slightly upwards, so the tip of the nose sits higher than the usual nose tips in the training set. Also, the model completely failed at placing the keypoints in Luo Yunxi's photo because his head is turned too far from the camera, giving mostly a side profile, which is hard for the model to handle; it ends up placing the facial points on his cheek.

    Bai Lu

    Kim Dami

    Park Seo Joon

    Steve Carell

    Jenna Fischer

    John Krasinski

    Park Seo Joon

    Luo Yunxi

Final Thoughts

This project was a lot more interesting than I expected, because I was not looking forward to building the Datasets and DataLoaders and waiting for the models to train (that part is still not fun); however, it was really cool to see my model work on some images and pinpoint the specific facial features correctly. The last model would have been pretty useful for the previous project in morphing everyone's images, but I don't think I would want to train for that long before creating my morphing sequence. Either way, no pain no gain. Also, I had previously only used TensorFlow, so it was quite interesting to use PyTorch instead.