CS 194: Facial Keypoint Detection with Neural Networks

Project 5

Derek Wu


This project gives students practice developing and tuning convolutional neural networks by training CNNs on different facial datasets to predict the nose keypoint and the full set of facial keypoints.

Part 1: Nose Tip Detection

In this part of the project, I try to predict the nose keypoints of faces from the Dane dataset. To do this, I use the Dataset and DataLoader classes provided by PyTorch to set up a split of the dataset such that each sample contains the grayscale, normalized image data and a tuple of the ground-truth nose keypoint coordinates. Below are some of the randomly sampled ground-truth images.

Dataloader 1
Dataloader 2
Dataloader 3
Dataloader 4
Dataloader 5
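The dataset wrapper described above could be sketched roughly as follows. This is a minimal illustration, not my exact implementation; the class name, the input formats, and the normalization constants are assumptions.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NoseKeypointDataset(Dataset):
    """Hypothetical sketch: yields (grayscale image, nose keypoint) pairs."""

    def __init__(self, images, keypoints):
        # images: list of HxW pixel arrays; keypoints: list of (x, y) tuples
        self.images = images
        self.keypoints = keypoints

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = torch.as_tensor(self.images[idx], dtype=torch.float32)
        img = img / 255.0 - 0.5                # normalize to roughly [-0.5, 0.5]
        kp = torch.as_tensor(self.keypoints[idx], dtype=torch.float32)
        return img.unsqueeze(0), kp            # add channel dim: [1, H, W]

# loader = DataLoader(NoseKeypointDataset(imgs, kps), batch_size=4, shuffle=True)
```

A DataLoader built on this dataset then handles batching and shuffling for the training loop.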

The CNN is trained on the dataset for 25 epochs. The training and validation losses are plotted below.
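A training loop along these lines could be sketched as below. This is a generic sketch, not my exact code; the optimizer choice (Adam) and the MSE loss on keypoint coordinates are assumptions.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    """Hypothetical training loop: MSE between predicted and true keypoints,
    recording average train/validation loss per epoch for plotting."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    train_losses, val_losses = [], []
    for _ in range(epochs):
        model.train()
        total = 0.0
        for imgs, kps in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), kps)
            loss.backward()
            optimizer.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))

        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(i), k).item() for i, k in val_loader)
        val_losses.append(val / len(val_loader))
    return train_losses, val_losses
```

The returned loss lists are what get plotted as the train/validation curves.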

Below are two examples where the CNN successfully predicts the nose keypoint:

Learning Rate 0.001
Learning Rate 0.001
Learning Rate 0.0001
Learning Rate 0.0001

Below are two cases where the CNN fails to predict the nose keypoint accurately:

Learning Rate 0.001
Learning Rate 0.001
Learning Rate 0.0001
Learning Rate 0.0001

The nose keypoint predictor seems to have more difficulty with animated facial expressions, as in the first failure case, and with strongly turned faces, where half of the face is not visible and information is lost.

Part 2: Full Facial Keypoints

Building on the previous part, this part attempts to predict all 58 facial keypoints that the Dane dataset provides. Since the existing set of 240 images is too small to train on directly, we augment it. To do this, I randomly generate input values to PyTorch's functional transformation functions (angle, shift, etc.). Using these same random inputs, I transform the original ground-truth keypoints so that they stay aligned with the affine-transformed images. Below are some of the randomly sampled ground-truth images.

The CNN I used to achieve the results shown in the images below has the following setup: six convolutional layers with 32 channels each. The first layer uses a 7x7 filter, the next two use 5x5 filters, and the rest use 3x3 filters. Every layer except the first employs one pixel of padding. Each layer is followed by a ReLU, and only the first two layers apply a MaxPool2d to the ReLU output. Lastly, two fully connected layers produce a single tensor of size [1, 116]: the 58 (x, y) coordinate pairs that determine the positions of the 58 facial keypoints.

Network Details
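The architecture described above could be sketched roughly as below. This is a reconstruction from the text, not the exact network: the input resolution and the hidden width of the first fully connected layer are assumptions.

```python
import torch
import torch.nn as nn

class FaceKeypointNet(nn.Module):
    """Sketch of the six-conv-layer network: 7x7, then two 5x5, then 3x3
    filters; 32 channels each; ReLU everywhere; MaxPool2d after the first
    two layers only; two FC layers emitting 116 keypoint coordinates."""

    def __init__(self, in_h=120, in_w=160):       # input size is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 7),                  # 7x7, no padding
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 5, padding=1),      # 5x5
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 5, padding=1),      # 5x5
            nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1),      # 3x3
            nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.ReLU(),
        )
        # infer the flattened feature size with a dummy forward pass
        with torch.no_grad():
            n = self.features(torch.zeros(1, 1, in_h, in_w)).numel()
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(n, 256),
                                nn.ReLU(), nn.Linear(256, 116))

    def forward(self, x):
        return self.fc(self.features(x))
```

The dummy forward pass in the constructor avoids hand-computing the flattened size for a given input resolution.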

Additionally, this neural net was trained for 40 epochs with a batch size of 25 and a learning rate of 0.0005. Below is a plot detailing the training and validation loss of one instance of the training process for this CNN.

Here are 2 cases where the CNN correctly identified the facial keypoints:

Here are 2 cases where the CNN incorrectly identified the facial keypoints:

Similar to the nose keypoint case, the network has some difficulty identifying the keypoints of turned faces.

Lastly, here are some of the learned filters from the first (7x7) convolutional layer of the CNN.

Part 3: Train With Larger Dataset

In this part of the project, we are tasked with learning from a larger dataset (6666 images). Using the boilerplate code, I loaded and processed the images from the dataset as in the previous parts, making sure to randomly rotate and shift images for data augmentation and to transform the associated keypoints to match. One main difference from the previous parts is the use of bounding boxes: I used the boxes from the downloaded files to crop the images and transform the facial keypoints. Because some of the bounding boxes left keypoints outside the 224x224 image range, I transformed the data in the bboxes array, increasing the width and height of the crop by a factor of 1.4 and scaling the top-left corner by a factor of 0.8. This raised the number of eligible training/validation images from 384 to 4193. This set was used to train the CNN. The exact architecture specifications are displayed below.
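The bounding-box adjustment could be sketched as below. This is an illustrative reading of the 1.4/0.8 factors described above; the function name and the exact interpretation of "shifting the corner" are assumptions.

```python
def expand_bbox(left, top, width, height, scale=1.4, corner_scale=0.8):
    """Hypothetical bbox adjustment: grow the crop by `scale` and pull the
    top-left corner toward the origin by `corner_scale`, so fewer ground-truth
    keypoints fall outside the eventual 224x224 crop."""
    return (left * corner_scale, top * corner_scale,
            width * scale, height * scale)
```

Boxes whose adjusted crop still leaves keypoints out of range would then be filtered out, which is what yields the 4193 eligible images.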

Here is the training and validation loss plot. The model was trained for 10 epochs with a batch size of 2 and a learning rate of 0.001.

Here are some of the predicted images with keypoints.

Here is the trained CNN run on some of the photos from my own collection.

In the 2nd and 3rd images, the predicted keypoints correspond relatively well to the facial structure and position; the 3rd one, of Jimmy O. Yang, is especially good. The CNN, however, performs rather poorly on my face.