Facial Keypoint Detection with Neural Networks

Nadia Hyder

 

OVERVIEW

In this project, I created convolutional neural networks to automatically detect facial keypoints, using PyTorch as the deep learning framework. This project consists of 3 parts: nose tip detection, full facial keypoint detection, and training with a large dataset.

 

PART 1: NOSE TIP DETECTION

For the first part, I used the IMM Face Database to train an initial model for nose tip detection, treating it as a pixel coordinate regression problem: the input is a grayscale image, and the output is the nose tip position (x, y). The (x, y) values are represented as ratios of the image width and height, so they lie between 0 and 1.

 

DATALOADER

I wrote a custom PyTorch dataloader to load the images and nose tip points, then applied transformations to convert the images to grayscale, normalize pixel values to floats between -0.5 and 0.5, and resize them to 80x60.
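A minimal sketch of what such a dataset class might look like. The class name, and the assumption that images arrive as already-resized 80x60 grayscale arrays with pixel values in 0-255, are mine, not from the project code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NoseTipDataset(Dataset):
    """Grayscale face images with normalized nose-tip (x, y) labels."""

    def __init__(self, images, points):
        # images: sequence of 60x80 grayscale arrays (pixel values 0-255)
        # points: sequence of (x, y) pairs already scaled to [0, 1]
        self.images = images
        self.points = points

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = torch.as_tensor(self.images[idx], dtype=torch.float32)
        img = img / 255.0 - 0.5                  # normalize to [-0.5, 0.5]
        pts = torch.as_tensor(self.points[idx], dtype=torch.float32)
        return img.unsqueeze(0), pts             # add the channel dimension
```

A `DataLoader` wrapped around this dataset then yields `(batch, 1, 60, 80)` image tensors and `(batch, 2)` targets.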

Here are a few sample images and their nose tip points:


 

CONVOLUTIONAL NEURAL NETWORK

After creating the dataloader, I wrote a 3-layer convolutional neural network in PyTorch with the following (sequential) design:

·       Convolutional layer 1: 15 channels with a 5x5 kernel, followed by ReLU and 2x2 max pooling

·       Convolutional layer 2: 20 channels with a 5x5 kernel, followed by ReLU and 2x2 max pooling

·       Convolutional layer 3: 25 channels with a 5x5 kernel, followed by ReLU and 2x2 max pooling

·       Fully connected layer 1, followed by ReLU

·       Fully connected layer 2
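A sketch of this architecture in PyTorch. The hidden fully connected width (128) is a guess, since the write-up does not state it; the flattened size of 25x4x6 follows from an 80x60 input passed through the kernel and pooling sizes above:

```python
import torch
import torch.nn as nn

class NoseNet(nn.Module):
    """3 conv layers + 2 FC layers, regressing the (x, y) nose tip."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 15, 5), nn.ReLU(), nn.MaxPool2d(2),   # -> 15x28x38
            nn.Conv2d(15, 20, 5), nn.ReLU(), nn.MaxPool2d(2),  # -> 20x12x17
            nn.Conv2d(20, 25, 5), nn.ReLU(), nn.MaxPool2d(2),  # -> 25x4x6
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(25 * 4 * 6, 128), nn.ReLU(),  # hidden width is a guess
            nn.Linear(128, 2),                      # (x, y) output
        )

    def forward(self, x):
        return self.fc(self.features(x))
```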

 

LOSS FUNCTION AND OPTIMIZER

Finally, I used mean squared error (MSE) as the loss function and trained the network with the Adam optimizer at a learning rate of 0.001, running the training loop for 25 epochs with a batch size of 8. Over time, the loss converges to about 0.01.
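The training setup described above might look like the following, assuming a dataloader that yields (image, target) batches. The `train` helper is illustrative, not the project's actual code:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=25, lr=1e-3):
    """MSE loss on normalized (x, y) targets, optimized with Adam."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        running = 0.0
        for images, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: loss {running / len(loader):.4f}")
```

Validation loss is computed the same way, inside `torch.no_grad()`, without the optimizer steps.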

The following plot shows the progression of training and validation loss during the training process:


RESULTS

The CNN is ready for use. Below are a few sample images on which the network performed well; the red points are the predicted nose tips and the green points are the ground-truth points.


 

However, there are also a few cases where the CNN failed to predict the nose tip position correctly. Below are a few examples. In each, the CNN failed to account for the orientation of the face when predicting the position of the nose tip.


 


PART 2: FULL FACIAL KEYPOINT DETECTION

For the next part, I used the same database to train a network to detect all 58 facial landmarks.  

 

DATA AUGMENTATION AND DATALOADER

First, I augmented the dataset to prevent overfitting, applying randomized transformations to copies of the images, including color jittering, rotation, and vertical and horizontal shifts. I created a new dataloader to load the images and all 58 landmark points, then applied transformations to convert the images to grayscale, normalize pixel values to floats between -0.5 and 0.5, and resize them to 160x120.

These are a few of the augmented images along with their ground truth points:


 

CONVOLUTIONAL NEURAL NETWORK

After creating the dataloader, I wrote a 5-layer convolutional neural network in PyTorch with the following (sequential) design:

·       Convolutional layer 1: 15 channels with a 7x7 kernel, followed by ReLU

·       Convolutional layer 2: 30 channels with a 5x5 kernel, followed by ReLU and 2x2 max pooling

·       Convolutional layer 3: 25 channels with a 3x3 kernel, followed by ReLU and 2x2 max pooling

·       Convolutional layer 4: 20 channels with a 7x7 kernel, followed by ReLU and 2x2 max pooling

·       Convolutional layer 5: 15 channels with a 5x5 kernel, followed by ReLU and 2x2 max pooling

·       Fully connected layer 1, followed by ReLU

·       Fully connected layer 2
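A sketch of this network, assuming a 1x120x160 grayscale input. Rather than hand-computing the flattened size after five conv layers, a dummy forward pass infers it; the hidden fully connected width (256) is a guess:

```python
import torch
import torch.nn as nn

class FaceNet(nn.Module):
    """5 conv layers + 2 FC layers, regressing 58 (x, y) landmarks."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 15, 7), nn.ReLU(),
            nn.Conv2d(15, 30, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(30, 25, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(25, 20, 7), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 15, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Infer the flattened feature size with a dummy forward pass.
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 1, 120, 160)).numel()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, 256), nn.ReLU(),  # hidden width is a guess
            nn.Linear(256, 58 * 2),             # 58 (x, y) landmark pairs
        )

    def forward(self, x):
        return self.fc(self.features(x))
```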

 

LOSS FUNCTION AND OPTIMIZER

Again, I used mean squared error as the loss function and trained the network with the Adam optimizer at a learning rate of 0.001, running the training loop for 25 epochs with a batch size of 8. Here, too, the loss converges to about 0.01.

The following plot shows the progression of training and validation loss during the training process:


RESULTS

Successes

After training, I tested the neural network. Below are a few examples in which the network performed relatively well; the red points are the predicted landmarks and the green points are the ground-truth points.


 

Failures

Below are a few cases in which the network performed poorly. This happens mainly when the subject's mouth is open or the subject is not facing forward. When the subject faces away from the camera, not all facial landmarks are visible, which makes them more difficult to infer.


 

Here is a visualization of the learned 7x7 filters from the 15 channels in convolutional layer 1:

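Such a filter grid can be produced directly from the first conv layer's weight tensor. This helper assumes the layer has a single input channel and up to 15 output channels; how the layer is reached on the trained model, and the output filename, are up to the caller:

```python
import matplotlib
matplotlib.use("Agg")           # render off-screen and save to a file
import matplotlib.pyplot as plt

def show_filters(conv, path="filters.png"):
    """Plot each output channel's kernel of a 1-input-channel conv layer."""
    w = conv.weight.detach().cpu()              # shape: (out_ch, 1, k, k)
    fig, axes = plt.subplots(3, 5, figsize=(10, 6))
    for i, ax in enumerate(axes.flat):
        if i < w.shape[0]:
            ax.imshow(w[i, 0], cmap="gray")     # one k x k kernel per panel
        ax.axis("off")
    fig.savefig(path)
    plt.close(fig)
```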

 

PART 3: TRAINING WITH A LARGER DATASET

For the final part of the project, I used a larger dataset, the iBUG Faces In-the-Wild dataset, to train a facial keypoint detector. The dataset contains 6666 images of varying sizes, each annotated with 68 facial keypoints. I trained the model on Google Colab with a GPU.

 

DATA AUGMENTATION AND DATALOADER

Because the faces in the dataset may occupy only a small fraction of the entire image, I cropped each image during training and fed only the face portion (resized to 224x224) to the network. I split the data into training and validation sets at a ratio of 9:1. Additionally, I augmented the training and validation data to prevent overfitting.
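The cropping, keypoint remapping, and 9:1 split could be sketched as follows; the bounding-box format (left, top, right, bottom) is an assumption, as is the helper naming:

```python
import torch
from torch.utils.data import random_split

def crop_to_box(img, pts, box):
    """Crop a PIL image to a face bounding box and remap its keypoints.

    box = (left, top, right, bottom) in pixels; pts is a (68, 2) tensor of
    (x, y) pixel coordinates. Resizing to 224x224 rescales the points too.
    """
    left, top, right, bottom = box
    face = img.crop(box).resize((224, 224))
    scale = torch.tensor([224.0 / (right - left), 224.0 / (bottom - top)])
    pts = (pts - torch.tensor([float(left), float(top)])) * scale
    return face, pts

def split_dataset(dataset):
    """Random 9:1 train/validation split of a Dataset object."""
    n_train = int(0.9 * len(dataset))
    return random_split(dataset, [n_train, len(dataset) - n_train])
```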

These are a few images from the dataset and their ground truth points after being cropped to the bounding box:


 

 

CONVOLUTIONAL NEURAL NETWORK

I chose ResNet18 as the CNN model, with a few modifications: the first layer takes a single input channel (because the inputs are grayscale), and the final layer outputs 136 values, the (x, y) coordinates of the 68 landmark points (68 × 2).

ResNet18 has the following architecture:


 

LOSS FUNCTION AND OPTIMIZER

Again, I used mean squared error as the loss function and trained the network with the Adam optimizer at a learning rate of 0.0001. I experimented with the number of epochs and the batch size, and ultimately found the best performance with a batch size of 64.

My best-performing setup trained on augmented copies of each image, so the dataset was rather large. For the sake of time, I ran training for 10 epochs.


 

RESULTS

My model received a score of 10.4 on Kaggle. Here are a few results from running the model on a held-out test set of 1008 images:

 

 

 

 

I also ran the model on a few of my own images. The model did not perform very well on the 3rd and 4th photos: for the 3rd, likely because of the angle of the face; for the 4th, possibly because the glasses obstruct the eyes, the model detected the crow's feet as the eyes and thus shifted all the points downward.

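Running the trained model on a new photo involves cropping the face, normalizing, predicting, and mapping the points back to the original image. The sketch below assumes the model predicts coordinates in the 224x224 crop frame and that a face box is supplied by hand (the project does not describe its exact inference pipeline):

```python
import torch

def predict_keypoints(model, photo, box):
    """Predict 68 (x, y) landmarks on a face crop from a new photo.

    photo: PIL image; box = (left, top, right, bottom) face region.
    Returns the points in the original photo's pixel coordinates.
    """
    left, top, right, bottom = box
    crop = photo.crop(box).convert("L").resize((224, 224))
    x = torch.as_tensor(list(crop.getdata()), dtype=torch.float32)
    x = x.view(1, 1, 224, 224) / 255.0 - 0.5     # same normalization as training
    model.eval()
    with torch.no_grad():
        pts = model(x).view(68, 2)
    # Rescale from the 224x224 crop frame back to the original photo.
    pts[:, 0] = pts[:, 0] * (right - left) / 224 + left
    pts[:, 1] = pts[:, 1] * (bottom - top) / 224 + top
    return pts
```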

 


Overall, I really enjoyed this project. Given more time, I would have trained my model for a greater number of epochs.