Project 4 -- Facial Keypoint Detection with Neural Networks

By Myles Domingo

Overview

Given a set of reference images and keypoints, I constructed a neural network consisting of several convolutional and ReLU layers to generate predicted keypoints on unfamiliar images.

Part 1: Nose Tip Detection

First, I parsed the IMM Face Database to retrieve the image files and their keypoints, and split the images into training and validation sets. I applied several transforms: scaling images to 80x60 pixels, converting to greyscale, and normalizing pixel values to [-0.5, 0.5]. I fed the transformed images into a dataloader that loads batches on demand.
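As a concrete sketch, the dataset and dataloader might look like this; `NoseDataset` and the variable names are illustrative, and the keypoints are assumed to already be (x, y) pixel coordinates in the resized image:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import skimage.io as skio
import skimage.transform as sktr

class NoseDataset(Dataset):
    """Loads IMM face images with their nose-tip keypoint.
    Assumes keypoints are (x, y) pixel coordinates in the resized image."""
    def __init__(self, image_paths, keypoints):
        self.image_paths = image_paths
        self.keypoints = keypoints

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = skio.imread(self.image_paths[idx], as_gray=True)  # greyscale, values in [0, 1]
        img = sktr.resize(img, (60, 80))                        # scale to 80x60 (W x H)
        img = img.astype(np.float32) - 0.5                      # normalize to [-0.5, 0.5]
        img = torch.from_numpy(img).unsqueeze(0)                # add channel dimension
        pts = torch.tensor(self.keypoints[idx], dtype=torch.float32)
        return img, pts

# train_paths / train_pts come from parsing the IMM database (names are illustrative).
train_set = NoseDataset(train_paths, train_pts)
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
```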

Here are a few sample images with nose keypoints highlighted --

For the CNN, I used PyTorch to create a convolutional neural network within the spec. I used 3 convolution layers, each followed by a ReLU and a max pool, and connected these to two fully connected layers, with a ReLU after the first. After experimenting with different channel counts and kernel sizes, I found that this worked pretty well --
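The exact tuned channel counts and kernel sizes aren't reproduced here, but a PyTorch sketch of this layer structure could look like the following (the hyperparameters shown are placeholders):

```python
import torch.nn as nn

class NoseNet(nn.Module):
    """3 conv layers, each followed by ReLU and max pool, then two
    fully connected layers with a ReLU after the first. Channel counts
    and kernel sizes are illustrative, not the tuned values."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 12, 3, padding=1),  nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 20, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 28, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            # Three 2x2 pools shrink a 60x80 input to a 7x10 feature map.
            nn.Linear(28 * 7 * 10, 128), nn.ReLU(),
            nn.Linear(128, 2),  # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```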

I trained this model for 25 epochs, using an Adam optimizer with a learning rate of 1e-3. I experimented with different learning rates, but found 1e-3 to be the best, as it learns the fastest without diverging or becoming noisy. I calculated the loss for both training and validation sets, as seen here.
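A sketch of the training loop is below; the loss function isn't stated above, so mean squared error on the keypoint coordinates is assumed:

```python
import torch

model = NoseNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()  # assumed loss; the writeup only says "loss"

train_losses, val_losses = [], []
for epoch in range(25):
    # Training pass (train_loader / val_loader as built earlier).
    model.train()
    total = 0.0
    for imgs, pts in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(imgs), pts)
        loss.backward()
        optimizer.step()
        total += loss.item() * imgs.size(0)
    train_losses.append(total / len(train_loader.dataset))

    # Validation pass, no gradient tracking.
    model.eval()
    with torch.no_grad():
        total = sum(criterion(model(imgs), pts).item() * imgs.size(0)
                    for imgs, pts in val_loader)
    val_losses.append(total / len(val_loader.dataset))
```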

Here are images taken from the validation set, where red represents the ground-truth keypoints and light green the predictions.
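These overlays can be produced with a small matplotlib helper along these lines (`show_prediction` is a hypothetical name, and points are assumed to be in pixel coordinates):

```python
import matplotlib.pyplot as plt

def show_prediction(img, true_pts, pred_pts, path):
    """Overlay ground-truth (red) and predicted (light green) keypoints.
    `img` is a (1, H, W) tensor from the dataset."""
    true_pts = true_pts.reshape(-1, 2)  # works for one nose point or 58 face points
    pred_pts = pred_pts.reshape(-1, 2)
    plt.figure()
    plt.imshow(img.squeeze(0) + 0.5, cmap="gray")  # undo the -0.5 shift
    plt.scatter(true_pts[:, 0], true_pts[:, 1], c="red", s=12)
    plt.scatter(pred_pts[:, 0], pred_pts[:, 1], c="lightgreen", s=12)
    plt.axis("off")
    plt.savefig(path)
```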

The samples above do generally well; the ones below do not, for a variety of potential reasons. I believe the neural net predicts well when (1) the face shape is close to the average shape and (2) the head is facing straight ahead or is only mildly tilted. Otherwise, the predictions can be rather off, as shown below.


Part 2: Full Facial Keypoints Detection

Building the neural network for full facial keypoint detection is similar to Part 1, except there are now 58 points as outputs instead of 1.

Because we have a small dataset, I performed data augmentation to increase its size. For each image in the dataset, I generated a random transformation by rotating it by a random angle in (-10, 10) degrees and translating it by a random offset in (-10, 10) pixels, applying the same transformation to its keypoints. I generated 3 such transforms per image, for a total of 192 * 3 = 576 images for training and 48 * 3 = 144 for validation.
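One way to apply the same random rotation and translation to an image and its keypoints is with a composed skimage SimilarityTransform, sketched below; the actual implementation may differ:

```python
import numpy as np
import skimage.transform as sktr

def augment(img, pts):
    """Rotate by a random angle in (-10, 10) degrees and translate by a
    random offset in (-10, 10) pixels, applied to both the image and its
    (N, 2) array of (x, y) keypoints."""
    h, w = img.shape
    angle = np.deg2rad(np.random.uniform(-10, 10))
    offset = np.random.uniform(-10, 10, size=2)

    # Rotate about the image center, then translate; skimage transforms
    # compose left to right with `+`.
    center = np.array([w / 2, h / 2])
    tform = (sktr.SimilarityTransform(translation=-center)
             + sktr.SimilarityTransform(rotation=angle)
             + sktr.SimilarityTransform(translation=center + offset))

    # warp() expects a map from output to input coordinates, so the image
    # takes the inverse while the keypoints take the forward transform.
    return sktr.warp(img, tform.inverse), tform(pts)
```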

Here are reference images with ground-truth keypoints for faces.

For the neural network, I used 5 convolutional layers, with the first 4 each followed by a ReLU and a max pool, and 2 fully connected layers.
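A PyTorch sketch of this structure is below; as in Part 1, the channel counts and kernel sizes are placeholders, and the 80x60 greyscale input size is an assumption carried over from Part 1:

```python
import torch.nn as nn

class FaceNet(nn.Module):
    """5 conv layers, the first 4 each followed by ReLU and max pool,
    then 2 fully connected layers. Channel counts and kernel sizes are
    illustrative; the input is assumed to be 80x60 greyscale as in Part 1."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1),  nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1),  # fifth conv, no pool
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            # Four 2x2 pools shrink a 60x80 input to a 3x5 feature map.
            nn.Linear(64 * 3 * 5, 256), nn.ReLU(),
            nn.Linear(256, 58 * 2),  # 58 (x, y) keypoints
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```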

I trained the neural network on the augmented dataset for 20 epochs, using an Adam optimizer with a learning rate of 1e-3.

Here are the learned filters from the resulting net for the first two convolution layers.
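One way to render such filter grids, assuming the `FaceNet` sketch above (`show_filters` is a hypothetical helper):

```python
import matplotlib.pyplot as plt

def show_filters(conv_layer, path):
    """Plot each learned kernel in a conv layer as a small greyscale image.
    Only the first input channel of each kernel is shown."""
    weights = conv_layer.weight.detach().cpu().numpy()
    n = weights.shape[0]
    fig, axes = plt.subplots(1, n, figsize=(1.5 * n, 1.5))
    for i, ax in enumerate(axes):
        ax.imshow(weights[i, 0], cmap="gray")
        ax.axis("off")
    fig.savefig(path)

# `model` is the trained FaceNet; conv[0] and conv[3] are the first two conv layers.
show_filters(model.conv[0], "filters_conv1.png")
show_filters(model.conv[3], "filters_conv2.png")
```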

From here, we can look at images sampled from our validation set, with predicted points shown in green and ground-truth points in red. Images with well-predicted points had a clear facial expression and minimal rotation. Abnormalities such as non-neutral expressions can cause error. In addition, rotation introduced by augmentation can cause the neural net to mispredict, treating a rotated straight-on face as a turned head. The last two images in the set have such features, and as such, do poorly in terms of loss.