Project 5: Facial Keypoint Detection with Neural Networks

Part 1: Nose Tip Detection

In this part, I build a CNN to predict the nose tip in face images from the IMM Face Database.

Below are three examples of face images with ground-truth nose keypoints:

[Figures: three face images with ground-truth nose keypoints]

I use a 3-layer CNN whose network architecture is shown below.

[Figure: architecture of the 3-layer CNN]

For hyperparameter tuning, I tested two learning rates: 0.001 and 0.01.

For a learning rate of 0.001, the loss vs. epoch graph is shown below.

[Figure: loss vs. epoch, learning rate 0.001]

Successful Cases:

[Figures: two successful predictions, learning rate 0.001]

Unsuccessful Cases:

[Figures: two unsuccessful predictions, learning rate 0.001]

For a learning rate of 0.01, the loss vs. epoch graph is shown below.

[Figure: loss vs. epoch, learning rate 0.01]

Successful Cases:

[Figures: two successful predictions, learning rate 0.01]

Unsuccessful Cases:

[Figures: two unsuccessful predictions, learning rate 0.01]

The learning rate of 0.001 worked better than the learning rate of 0.01. Although the model trained with a learning rate of 0.01 converged to a local optimum more quickly, it performed worse on the images: it predicted a point close to the center of the image for every image, which I think happened because it overfit to the training images. In the unsuccessful cases for the model trained with a learning rate of 0.001, the person is facing to one side or making a facial expression. Bad lighting and shadows could also have contributed to these failures.
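One way to check the "same point for every image" behavior described above is to measure how much the predictions vary and to compare the model against a constant baseline. The following is a minimal sketch, not the code used in this project. It assumes predictions and targets are available as lists of (x, y) pairs in normalized coordinates; the function names are hypothetical.

```python
from statistics import mean, pstdev


def prediction_spread(preds):
    """Per-axis standard deviation of predicted (x, y) points.

    Values near zero mean the model outputs almost the same
    point regardless of the input image.
    """
    xs = [p[0] for p in preds]
    ys = [p[1] for p in preds]
    return pstdev(xs), pstdev(ys)


def mse(preds, targets):
    """Mean squared error averaged over every coordinate."""
    errors = [
        (p - t) ** 2
        for pred, target in zip(preds, targets)
        for p, t in zip(pred, target)
    ]
    return mean(errors)


def mean_point_baseline(targets):
    """Constant predictor: the average target point, repeated."""
    mx = mean(t[0] for t in targets)
    my = mean(t[1] for t in targets)
    return [(mx, my)] * len(targets)
```

If a trained model's error is no better than `mse(mean_point_baseline(targets), targets)` and its `prediction_spread` is close to zero, it has most likely collapsed to a single point rather than learned image-specific features.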

Part 2: Full Facial Keypoints Detection

In this part, I build a CNN to predict all the facial keypoints in face images from the IMM Face Database.

For data augmentation, I applied color jittering, rotation, and shifting to the images. Below are three examples of face images with ground-truth facial keypoints:

[Figures: three face images with ground-truth facial keypoints]
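Rotation and shifting change where the face sits in the image, so the ground-truth keypoints have to be transformed in the same way as the pixels. The sketch below shows only the keypoint side of that, under my own assumptions: keypoints are (x, y) pairs in pixel coordinates, and the helper names are hypothetical. The sign of the angle must match whatever convention the image rotation routine uses.

```python
import math


def rotate_keypoints(points, angle_deg, center):
    """Rotate (x, y) points about `center` by `angle_deg` degrees.

    Uses the standard 2D rotation matrix. With image coordinates
    (y pointing down), a positive angle appears clockwise on screen.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + cos_t * dx - sin_t * dy,
                        cy + sin_t * dx + cos_t * dy))
    return rotated


def shift_keypoints(points, dx, dy):
    """Translate every keypoint by (dx, dy) pixels."""
    return [(x + dx, y + dy) for x, y in points]
```

For example, `shift_keypoints(rotate_keypoints(points, 10, center), 5, -3)` would apply a 10-degree rotation followed by a shift of 5 pixels right and 3 pixels up, assuming the image received the same two transforms in the same order.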

I use a 5-layer CNN whose network architecture is shown below.

[Figure: architecture of the 5-layer CNN]

The loss vs. epoch graph is shown below.

[Figure: loss vs. epoch]

Successful Cases:

[Figures: two successful predictions]

Unsuccessful Cases:

[Figures: two unsuccessful predictions]

As in Part 1, in the unsuccessful cases the person is facing to one side or making a facial expression. Bad lighting and shadows could also have contributed to these failures.

Here are some learned filters for the first convolutional layer:

[Figure: learned filters of the first convolutional layer]

Part 3: Train With Larger Dataset

In this part, I train a CNN to predict all the facial keypoints in face images from a much larger dataset that includes bounding boxes.

Below are three examples of images with ground-truth facial keypoints:

[Figures: three images with ground-truth facial keypoints]

I use a ResNet18 model. I replaced the first convolutional layer with one that has a kernel size of 5 and 64 output channels, and I replaced the final fully connected layer with one that produces 136 outputs.
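The 136 outputs can be read as 68 (x, y) pairs. Because the network sees a crop defined by the bounding box, its outputs have to be mapped back to the original image before they can be drawn or scored. The sketch below illustrates that mapping under assumptions that are mine, not stated above: outputs are interleaved as x0, y0, x1, y1, ..., they are normalized to [0, 1] within the crop, and the box is given as (left, top, width, height) in pixels. The function name is hypothetical.

```python
def outputs_to_keypoints(outputs, box):
    """Convert a flat network output into image-space keypoints.

    outputs: flat list [x0, y0, x1, y1, ...], assumed normalized
             to [0, 1] relative to the cropped bounding box.
    box:     (left, top, width, height) of the crop, in pixels
             of the original image.
    """
    if len(outputs) % 2 != 0:
        raise ValueError("expected an even number of outputs")
    left, top, width, height = box
    return [
        (left + outputs[i] * width, top + outputs[i + 1] * height)
        for i in range(0, len(outputs), 2)
    ]
```

With 136 outputs this returns 68 points. If the training targets were normalized differently (for example, relative to a resized crop), the scaling here would need to change to match.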

I achieved a mean absolute error (MAE) of 13.33706 on Kaggle. My username is Eric Cheng.
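For reference, mean absolute error averages the absolute difference between each predicted value and its ground-truth value. The sketch below is a plain-Python illustration. It assumes predictions and targets are flattened into equal-length lists of coordinates; exactly how the Kaggle leaderboard aggregates the score is not stated here, so treat this as the textbook definition only.

```python
def mean_absolute_error(preds, targets):
    """Average of |prediction - target| over all values."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)
```

Under that assumption, a score of about 13 would mean each predicted coordinate is off by roughly 13 units on average, in whatever units the competition uses.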

The loss vs. epoch graph is shown below.

[Figure: loss vs. epoch]

Examples of Predicted Points:

[Figures: three images with predicted keypoints]

Running the model on my own images:

[Figures: predictions on three of my own images]

The predictions work well on Trump and Chancellor Christ because their faces are clearly visible. They work less well on Obama, possibly because my bounding box was too large, which made his face too small in the cropped image.
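If an oversized bounding box is the cause, one possible remedy is to shrink the box around its center before cropping, so the face fills more of the network input. The sketch below shows that adjustment only. The (left, top, width, height) box format, the scale factor, and the function name are assumptions for illustration, not the procedure used above.

```python
def scale_box(box, scale, image_size):
    """Scale a bounding box about its center and clamp it to the image.

    box:        (left, top, width, height) in pixels.
    scale:      values below 1 tighten the box, above 1 enlarge it.
    image_size: (image_width, image_height) in pixels.
    """
    left, top, width, height = box
    img_w, img_h = image_size
    cx, cy = left + width / 2, top + height / 2
    new_w, new_h = width * scale, height * scale
    # Clamp each edge so the box never leaves the image.
    new_left = max(0.0, cx - new_w / 2)
    new_top = max(0.0, cy - new_h / 2)
    new_right = min(float(img_w), cx + new_w / 2)
    new_bottom = min(float(img_h), cy + new_h / 2)
    return (new_left, new_top, new_right - new_left, new_bottom - new_top)
```

For instance, `scale_box(box, 0.5, image_size)` halves the box in each dimension while keeping it centered on the same point. Whether this actually improves the Obama prediction would have to be checked by rerunning the model on the new crop.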