Project 5: Facial Keypoint Detection with Neural Networks

Part 1: Nose Tip Detection

In this part, I build a CNN to predict the nose-tip location in face images from the IMM Face Database.

Below are three examples of face images with ground-truth nose keypoints.

I use a 3-layer CNN whose network architecture is shown below.
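As a rough illustration of what a network of this kind can look like (not the exact architecture used here), the sketch below stacks three convolutional layers and regresses a single (x, y) point. The class name NoseTipNet, the channel counts, the kernel sizes, the hidden width of 128, and the 60x80 grayscale input size are all assumptions made for the example.

```python
import torch
from torch import nn


class NoseTipNet(nn.Module):
    """Hypothetical 3-conv-layer CNN that regresses one (x, y) point."""

    def __init__(self):
        super().__init__()
        # Channel counts and kernel sizes are illustrative assumptions.
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 24, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # An assumed 60x80 input is halved three times: 60->30->15->7, 80->40->20->10.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),
            nn.Linear(128, 2),  # one (x, y) pair for the nose tip
        )

    def forward(self, x):
        return self.regressor(self.features(x))


model = NoseTipNet()
batch = torch.zeros(4, 1, 60, 80)  # 4 grayscale images, 60 rows by 80 columns
out = model(batch)                 # one predicted point per image
```

The padding values keep the spatial size unchanged through each convolution, so only the pooling layers shrink the feature map, which makes the size of the first fully connected layer easy to work out by hand.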

For hyperparameter tuning, I tested different learning rates.

For a learning rate of 0.001, the loss vs. epoch graph is shown below.

Successful Cases:

Unsuccessful Cases:

For a learning rate of 0.01, the loss vs. epoch graph is shown below.

Successful Cases:

Unsuccessful Cases:

A learning rate of 0.001 worked better than 0.01. Although the model trained with 0.01 converged to a local optimum faster, it performed worse on the images: it predicted a point close to the image center for every image, which I think is because it overfit to the training images. For the model trained with 0.001, the unsuccessful cases are images where the person is facing to one side or making a facial expression. Poor lighting and shadows could also contribute.

Part 2: Full Facial Keypoints Detection

In this part, I build a CNN to predict all of the facial keypoints in face images from the IMM Face Database.

For data augmentation, I applied color jittering, rotation, and shifting to the images. Below are three examples of face images with ground-truth facial keypoints.
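Color jittering leaves the keypoints where they are, but rotation and shifting move them, so the labels have to go through the same geometric transform as the image. The sketch below shows only the keypoint side of that step. The function name rotate_and_shift, the sample coordinates, and the choice of rotating about the image center are assumptions for illustration; the angle sign must match whatever convention the image rotation routine uses.

```python
import math


def rotate_and_shift(points, angle_deg, shift, center):
    """Rotate (x, y) points about `center`, then translate them by `shift`."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    dx, dy = shift
    moved = []
    for x, y in points:
        # Standard 2D rotation of the offset from the center.
        rx = cos_t * (x - cx) - sin_t * (y - cy) + cx
        ry = sin_t * (x - cx) + cos_t * (y - cy) + cy
        moved.append((rx + dx, ry + dy))
    return moved


# Hypothetical keypoints in pixel coordinates for an image centered at (40, 30).
keypoints = [(40.0, 30.0), (50.0, 30.0)]
augmented = rotate_and_shift(keypoints, angle_deg=90, shift=(3, -2), center=(40.0, 30.0))
```

A point at the rotation center is only affected by the shift, which gives a quick sanity check when wiring this into a dataset class.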

I use a 5-layer CNN whose network architecture is shown below.

The loss vs. epoch graph is shown below.

Successful Cases:

Unsuccessful Cases:

As in Part 1, the unsuccessful cases are images where the person is facing to one side or making a facial expression. Poor lighting and shadows could also contribute.

Here are some learned filters for the first convolutional layer:

Part 3: Train With Larger Dataset

In this part, I build a CNN to predict all of the facial keypoints in face images from a much larger dataset that comes with face bounding boxes.

Below are three examples of images with ground-truth facial keypoints.

I use a ResNet18 model. I replaced the first convolutional layer with one that has a kernel size of 5 and 64 output channels, and replaced the final fully connected layer so that it produces 136 outputs, one (x, y) pair per keypoint.

I achieved a mean absolute error (MAE) of 13.33706 on Kaggle. My username is Eric Cheng.
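For reference, MAE is the average of the absolute differences between predicted and true coordinate values. The sketch below computes it on a few made-up numbers; the function name and the sample values are hypothetical, and the exact preprocessing Kaggle applies before scoring is not covered here.

```python
def mean_absolute_error(predicted, target):
    """Average absolute difference over all coordinate values."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Made-up flattened coordinates (x1, y1, x2, y2), not real results.
predicted = [10.0, 20.0, 30.0, 40.0]
target = [12.0, 19.0, 33.0, 40.0]
mae = mean_absolute_error(predicted, target)  # (2 + 1 + 3 + 0) / 4 = 1.5
```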

The loss vs. epoch graph is shown below.

Example of Predicted Points:

Running model on my images:

The predictions work well on Trump and Chancellor Christ, since their faces are clearly visible. They work less well on Obama, possibly because my bounding box was too large, which left his face too small within the crop.
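Since the bounding box decides how much of the crop the face fills, one thing to try on a failing image is a tighter box. The sketch below shrinks a box about its center and maps keypoints predicted on the crop back to full-image pixels. It assumes boxes are given as (left, top, width, height) and that predictions are normalized to [0, 1] within the crop; both conventions, the function names, and the sample values are hypothetical.

```python
def scale_box(box, factor):
    """Scale a (left, top, width, height) box about its center."""
    left, top, width, height = box
    cx, cy = left + width / 2, top + height / 2
    new_w, new_h = width * factor, height * factor
    return (cx - new_w / 2, cy - new_h / 2, new_w, new_h)


def crop_to_image_coords(points_norm, box):
    """Map keypoints normalized to [0, 1] within a crop back to image pixels."""
    left, top, width, height = box
    return [(left + x * width, top + y * height) for x, y in points_norm]


box = (100, 50, 200, 240)        # hypothetical hand-drawn bounding box
tighter = scale_box(box, 0.5)    # same center, half the width and height
pred = [(0.5, 0.5), (0.25, 0.75)]  # made-up normalized predictions
pixels = crop_to_image_coords(pred, box)
```

With these sample values, the tighter box is (150.0, 110.0, 100.0, 120.0) and the two predictions land at (200.0, 170.0) and (150.0, 230.0) in the original image.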