FACIAL KEYPOINT DETECTION
WITH NEURAL NETWORKS
Nadia Hyder
OVERVIEW
In this project, I created convolutional neural networks to
automatically detect facial keypoints, using PyTorch as the deep learning
framework. This project consists of 3 parts: nose tip detection, full facial
keypoint detection, and training with a large dataset.
PART 1: NOSE TIP DETECTION
For the first part, I used the IMM Face Database to train an
initial model for nose tip detection, treating it as a pixel coordinate
regression problem: the input is a grayscale image, and the output is the
nose tip position (x, y). The (x, y) coordinates are represented as ratios
of the image width and height, so the values are between 0 and 1.
DATALOADER
I wrote a custom PyTorch dataloader to load the images and
nose tip points, then applied transformations to convert the images to black
and white, convert pixel values to normalized float values between -0.5 and
0.5, and resize to 80x60.
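The loading and preprocessing steps above can be sketched as a custom PyTorch Dataset. This is a minimal sketch: the in-memory image list and point format here are illustrative, not the actual IMM parsing code.

```python
import torch
from torch.utils.data import Dataset

class NoseTipDataset(Dataset):
    """Sketch: grayscale images normalized to [-0.5, 0.5], resized to 80x60,
    paired with a nose tip point given as (x, y) ratios in [0, 1]."""

    def __init__(self, images, nose_points):
        # images: list of HxW uint8 grayscale arrays (placeholder format)
        # nose_points: list of (x, y) ratios of image width/height
        self.images = images
        self.nose_points = nose_points

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = torch.as_tensor(self.images[idx], dtype=torch.float32)
        img = img / 255.0 - 0.5            # normalize pixels to [-0.5, 0.5]
        img = img.unsqueeze(0)             # add channel dim: 1 x H x W
        img = torch.nn.functional.interpolate(
            img.unsqueeze(0), size=(60, 80),   # resize to 80x60 (W x H)
            mode="bilinear", align_corners=False).squeeze(0)
        point = torch.as_tensor(self.nose_points[idx], dtype=torch.float32)
        return img, point
```

A `torch.utils.data.DataLoader` wrapped around this dataset then yields shuffled batches for training.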
Here are a few sample images and their nose tip points:
CONVOLUTIONAL NEURAL NETWORK
After creating the dataloader, I wrote a 3-layer convolutional
neural network in PyTorch with the following (sequential) design:
· Convolutional layer 1: 15 channels, with a 5x5 kernel,
followed by ReLU, and 2x2 max pooling
· Convolutional layer 2: 20 channels, with a 5x5 kernel,
followed by ReLU, and 2x2 max pooling
· Convolutional layer 3: 25 channels, with a 5x5 kernel,
followed by ReLU, and 2x2 max pooling
· Fully connected layer 1, followed by ReLU
· Fully connected layer 2
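A minimal PyTorch sketch of this design, assuming the 1x60x80 grayscale input described earlier; the hidden width of the first fully connected layer (128 here) is not stated above and is illustrative:

```python
import torch
import torch.nn as nn

class NoseTipNet(nn.Module):
    """3-layer CNN sketch: 1x60x80 grayscale input -> (x, y) nose tip ratios."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 15, 5), nn.ReLU(), nn.MaxPool2d(2),   # -> 15 x 28 x 38
            nn.Conv2d(15, 20, 5), nn.ReLU(), nn.MaxPool2d(2),  # -> 20 x 12 x 17
            nn.Conv2d(20, 25, 5), nn.ReLU(), nn.MaxPool2d(2),  # -> 25 x 4 x 6
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(25 * 4 * 6, 128), nn.ReLU(),  # hidden width is illustrative
            nn.Linear(128, 2),                      # (x, y) in [0, 1]
        )

    def forward(self, x):
        return self.regressor(self.features(x))
```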
LOSS FUNCTION AND OPTIMIZER
Finally, I used mean squared error (MSE) as the loss function and
trained the network with the Adam optimizer at a learning rate of 0.001,
running the training loop for 25 epochs with a batch size of 8. Over time,
the loss converges to about 0.01.
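A minimal training loop along these lines (MSE loss, Adam at 0.001) might look like the following sketch:

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    """Train with MSE loss on (x, y) ratio targets using the Adam optimizer.
    Returns per-epoch average training and validation losses."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    for epoch in range(epochs):
        model.train()
        total = 0.0
        for imgs, points in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), points)
            loss.backward()
            optimizer.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))

        model.eval()                      # validation pass, no gradients
        with torch.no_grad():
            total = sum(criterion(model(i), p).item() for i, p in val_loader)
        val_losses.append(total / len(val_loader))
    return train_losses, val_losses
```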
The following plot shows the progression of training and
validation loss during the training process:
RESULTS
After training, the CNN is ready for use. Below are a few sample images
where the network performed well; the red points are the predicted nose
tips and the green points are the ground truth.
However, there are also a few cases where the CNN mispredicted the
nose tip position. Below are a few examples; in these, the network did not
account for the orientation of the face when predicting the nose tip.
PART 2: FULL FACIAL KEYPOINT DETECTION
For the next part, I used the same database to train a
network to detect all 58 facial landmarks.
DATA AUGMENTATION AND DATALOADER
First, I augmented the dataset to prevent overfitting, applying
randomized transformations to copies of the images: color jittering,
rotation, and horizontal and vertical shifts. I created
a new dataloader to load the images and all 58
landmark points, then applied transformations to convert the images to black
and white, convert pixel values to normalized float values between -0.5 and
0.5, and resize to 160x120.
These are a few of the augmented images along with their
ground truth points:
CONVOLUTIONAL NEURAL NETWORK
After creating the dataloader, I
wrote a 5-layer convolutional neural network in PyTorch
with the following (sequential) design:
· Convolutional layer 1: 15 channels, with a 7x7 kernel,
followed by ReLU
· Convolutional layer 2: 30 channels, with a 5x5 kernel,
followed by ReLU, and 2x2 max pooling
· Convolutional layer 3: 25 channels, with a 3x3 kernel,
followed by ReLU, and 2x2 max pooling
· Convolutional layer 4: 20 channels, with a 7x7 kernel,
followed by ReLU, and 2x2 max pooling
· Convolutional layer 5: 15 channels, with a 5x5 kernel,
followed by ReLU, and 2x2 max pooling
· Fully connected layer 1, followed by ReLU
· Fully connected layer 2
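A compact sketch of this five-layer design, assuming the 1x120x160 input described earlier; the hidden width of the first fully connected layer (256 here) is illustrative:

```python
import torch
import torch.nn as nn

# Spatial sizes in the comments assume a 1x120x160 input and no padding.
LandmarkNet = nn.Sequential(
    nn.Conv2d(1, 15, 7), nn.ReLU(),                     # -> 15 x 114 x 154
    nn.Conv2d(15, 30, 5), nn.ReLU(), nn.MaxPool2d(2),   # -> 30 x 55 x 75
    nn.Conv2d(30, 25, 3), nn.ReLU(), nn.MaxPool2d(2),   # -> 25 x 26 x 36
    nn.Conv2d(25, 20, 7), nn.ReLU(), nn.MaxPool2d(2),   # -> 20 x 10 x 15
    nn.Conv2d(20, 15, 5), nn.ReLU(), nn.MaxPool2d(2),   # -> 15 x 3 x 5
    nn.Flatten(),
    nn.Linear(15 * 3 * 5, 256), nn.ReLU(),              # hidden width illustrative
    nn.Linear(256, 58 * 2),                             # 58 (x, y) landmark pairs
)
```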
LOSS FUNCTION AND OPTIMIZER
Again, I used mean squared error as the loss function and
trained the network with the Adam optimizer at a learning rate of 0.001,
running the training loop for 25 epochs with a batch size of 8. Here too,
the loss converges to about 0.01.
The following plot shows the progression of training and
validation loss during the training process:
RESULTS
Successes
After training, I tested the neural network. Below are a few
examples in which the network performed relatively well; the red points
are the predicted landmarks and the green points are the ground truth.
Failures
Below are a few cases in which the network performed poorly. This
happens mainly when the subject’s mouth is open or the face is turned away
from the camera; in that case, not all facial landmarks are visible and they
are therefore harder to infer.
Here is a visualization of the learned 7x7 filters from the
15 channels in convolutional layer 1:
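Such a visualization can be produced by extracting the first convolutional layer's weights and normalizing each filter for display. This sketch runs on a freshly initialized layer; in practice, you would pass in the trained model's first layer:

```python
import torch
import torch.nn as nn

def first_layer_filters(conv):
    """Return conv filters as a list of kH x kW tensors scaled to [0, 1]."""
    w = conv.weight.detach()       # out_channels x in_channels x kH x kW
    filters = []
    for i in range(w.shape[0]):
        f = w[i, 0]                # single input channel (grayscale model)
        f = (f - f.min()) / (f.max() - f.min() + 1e-8)  # per-filter normalize
        filters.append(f)
    return filters

# Each returned tensor can then be shown with matplotlib, e.g.
# plt.imshow(f, cmap="gray"), arranged in a grid of 15 panels.
```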
PART 3: TRAINING WITH A LARGER DATASET
For the final part of the project, I used a larger dataset, the iBUG Faces
In-The-Wild dataset, to train a facial keypoint detector. The dataset contains
6666 images of varying sizes, each annotated with 68 facial keypoints.
I trained the model on Google Colab with a GPU.
DATA AUGMENTATION AND DATALOADER
Because the faces in the dataset may occupy only a small
fraction of the entire image, I cropped the image during training and fed only
the face portion (resized to 224x224) to the network. I split the data into
training and validation sets with a ratio of 9:1. Additionally, I augmented the
training and validation data to prevent overfitting.
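The cropping and keypoint remapping can be sketched as follows; the (left, top, right, bottom) bounding-box format here is an assumption for illustration, not the exact format of the ibug annotations:

```python
import torch

def crop_to_face(img, keypoints, box, out_size=224):
    """Crop img (1xHxW) to a face bounding box and remap keypoints.
    box: (left, top, right, bottom) in pixels; keypoints: Nx2 pixel (x, y).
    Returns the resized crop and keypoints as ratios of the crop size."""
    left, top, right, bottom = box
    crop = img[:, top:bottom, left:right]
    crop = torch.nn.functional.interpolate(
        crop.unsqueeze(0).float(), size=(out_size, out_size),
        mode="bilinear", align_corners=False).squeeze(0)
    kp = keypoints.clone().float()
    kp[:, 0] = (kp[:, 0] - left) / (right - left)   # x ratio within the crop
    kp[:, 1] = (kp[:, 1] - top) / (bottom - top)    # y ratio within the crop
    return crop, kp
```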
These are a few images from the dataset and their ground
truth points after being cropped to the bounding box:
CONVOLUTIONAL NEURAL NETWORK
I chose ResNet18 as the CNN model, with two modifications: the first
layer takes a single input channel (because the inputs are grayscale), and
the final layer outputs 136 values (68 landmark points × 2 coordinates).
ResNet18 has the following architecture:
LOSS FUNCTION AND OPTIMIZER
Again, I used mean squared error as the loss function and
trained the network with the Adam optimizer at a learning rate of 0.0001. I
experimented with the number of epochs and batch size and ultimately found the
best performance with a batch size of 64.
My best-performing setup trained on augmented copies of each image,
so the dataset was rather large. For the sake of time, I ran
training for 10 epochs.
RESULTS
My model received a score of 10.4 on Kaggle. Here are a few
results from running the model on a separate test set of 1008 images:
I also ran the model on a few of my own images. The model did
not perform very well on the third and fourth photos: for the third, likely
because of the angle of the face; for the fourth, possibly because the
glasses obstruct the eyes, the model detected crow’s feet as the eyes
and thus shifted all the points downward.
Overall, I really enjoyed this project. Given more time, I
would have trained my model for a greater number of epochs.