Facial Keypoint Detection with Neural Networks

Francis Pan

Part 1: Nose Tip Detection

To start, we will use the IMM face dataset to do nose tip detection before moving on to full facial keypoints. Below are some samples from the dataset along with their ground truth nose tip landmarks; a sketch of the dataloading follows the samples.

Sample 0
Sample 1
Sample 2
Sample 3
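A minimal sketch of the nose-tip dataloading, assuming the images and 58-point landmarks have already been parsed from the IMM .asf files into memory. The NOSE_IDX constant and the 60x80 input size are assumptions for illustration, not details from the report.

```python
import numpy as np
import torch
from torch.utils.data import Dataset
from skimage.transform import resize

NOSE_IDX = 52  # assumed index of the nose tip within the 58 IMM landmarks

class NoseTipDataset(Dataset):
    def __init__(self, images, landmarks, out_size=(60, 80)):
        # images: list of (H, W) grayscale float arrays in [0, 1]
        # landmarks: (N, 58, 2) array of (x, y) points normalized to [0, 1]
        self.images = images
        self.landmarks = landmarks
        self.out_size = out_size  # (H, W)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # resize the image; normalized keypoints are unaffected by resizing
        img = resize(self.images[idx], self.out_size)
        img = torch.from_numpy(img).float().unsqueeze(0)          # (1, H, W)
        nose = torch.from_numpy(self.landmarks[idx][NOSE_IDX]).float()  # (x, y)
        return img, nose
```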

I used a CNN with 3 convolutional layers, each with 12 channels and a 3x3 kernel; a sketch of the architecture follows, then a plot of the training and validation loss over 20 epochs.
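A minimal sketch of that architecture (3 convolutional layers, 12 channels each, 3x3 kernels). The 60x80 input size, the max pooling, and the fully connected head sizes are assumptions.

```python
import torch.nn as nn

class NoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 12, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 12, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # after three 2x2 poolings, a 60x80 input becomes 7x10
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(12 * 7 * 10, 128), nn.ReLU(),
            nn.Linear(128, 2),  # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))
```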

Below are some results, two bad and two good. Ground truth is shown in green, predictions in red.

Bad 1
Bad 2
Good 1
Good 2

For the two bad predictions, I believe the failures are due to the face orientation (turned to the side) and the hair. The dataset is very small and doesn't have many people who are bald or have curly blond hair.

Part 2: Full Facial Keypoints Detection

Now we do the same thing as in part 1, but with the full set of facial keypoints rather than just the nose tip. This time, since the dataset is very small, we have to do some data augmentation to train a better model and avoid overfitting. I chose simply to do a random crop followed by a random rotation of up to 15 degrees; a sketch of the augmentation follows the samples. Below are some sample results with data augmentation.

Sample 0
Sample 1
Sample 2
Sample 3
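A minimal sketch of that augmentation: a random crop followed by a random rotation of up to 15 degrees, applied to the image and its keypoints together. The crop margin and the conventions (pixel-space keypoints, grayscale image) are assumptions.

```python
import numpy as np
import cv2

def augment(img, pts, max_angle=15, max_crop=0.1, rng=np.random):
    """img: (H, W) grayscale array; pts: (K, 2) array of (x, y) pixel coords."""
    img = np.asarray(img, dtype=np.float32)
    h, w = img.shape[:2]

    # random crop: trim up to max_crop of the image on each side,
    # shifting the keypoints into the cropped frame
    x0 = rng.randint(0, int(max_crop * w) + 1)
    y0 = rng.randint(0, int(max_crop * h) + 1)
    x1 = w - rng.randint(0, int(max_crop * w) + 1)
    y1 = h - rng.randint(0, int(max_crop * h) + 1)
    img = img[y0:y1, x0:x1]
    pts = pts - np.array([x0, y0])

    # random rotation about the crop center; apply the same affine matrix
    # to the keypoints so they stay aligned with the image
    h, w = img.shape[:2]
    angle = rng.uniform(-max_angle, max_angle)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h))
    pts = pts @ M[:, :2].T + M[:, 2]
    return img, pts
```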

For this part, I used a CNN with 5 convolutional layers, each with 32 channels and a 5x5 kernel. I ran it for 30 epochs with a learning rate of 1e-3, using MSE to compute the loss; a sketch of the training loop follows, then a plot of the training and validation loss over 30 epochs.
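A minimal sketch of that training setup: MSE loss, learning rate 1e-3, 30 epochs. The Adam optimizer is an assumption, and `model`, `train_loader`, and `val_loader` are assumed to exist already.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=30, lr=1e-3, device="cpu"):
    model = model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []

    for epoch in range(epochs):
        model.train()
        total = 0.0
        for imgs, pts in train_loader:
            imgs, pts = imgs.to(device), pts.to(device)
            optimizer.zero_grad()
            loss = criterion(model(imgs), pts.view(pts.size(0), -1))
            loss.backward()
            optimizer.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))

        model.eval()
        total = 0.0
        with torch.no_grad():
            for imgs, pts in val_loader:
                imgs, pts = imgs.to(device), pts.to(device)
                total += criterion(model(imgs), pts.view(pts.size(0), -1)).item()
        val_losses.append(total / len(val_loader))
        print(f"epoch {epoch + 1}: train {train_losses[-1]:.4f}, val {val_losses[-1]:.4f}")

    return train_losses, val_losses
```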

Below are some results, two bad and two good. Ground truth is shown in green, predictions in red.

Bad 1
Bad 2
Good 1
Good 2

For the two bad predictions: once again, the bald man makes his return. For the second one, I believe it is because the man has a very narrow face, while most of the dataset has fairly average face shapes. Below are a few of the learned filters from the model.

Part 3: Train With Larger Dataset

Now it's time to move to the big boy iBUG dataset. Since the dataset is much larger, we move to Google Colab to take advantage of its GPU.

For dataloading, we have to do everything we did in part 2 (including the data augmentation), but now we also have to take into account the bounding boxes provided by the dataset. The faces in the images are often very small or in the background, so the bounding boxes help us focus on the right areas; a sketch of the bounding-box cropping follows the samples. As usual, below are some samples from the dataset post-data augmentation, along with their ground truth keypoints.

Sample 0
Sample 1
Sample 2
Sample 3
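A minimal sketch of that bounding-box handling: crop each image to its provided face box, resize to the network input size, and re-express the 68 keypoints relative to the crop. The 224x224 output size and the (x, y, w, h) box format are assumptions.

```python
import numpy as np
import cv2

def crop_to_box(img, pts, box, out_size=224):
    """img: (H, W) grayscale; pts: (68, 2) pixel coords; box: (x, y, w, h)."""
    x, y, w, h = [int(v) for v in box]
    x, y = max(x, 0), max(y, 0)
    crop = img[y:y + h, x:x + w]

    # shift keypoints into the crop, then normalize to [0, 1] so they are
    # independent of the resized resolution
    pts = (pts - np.array([x, y])) / np.array([crop.shape[1], crop.shape[0]])
    crop = cv2.resize(crop.astype(np.float32), (out_size, out_size))
    return crop, pts
```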

For this part, I used a pretrained ResNet-18 model, changing only the first and last layers. I changed the first layer to take a single input channel (grayscale) with 64 output channels to fit our dataset, and the last layer to output 68 * 2 = 136 values to account for the x, y coordinates of the 68 keypoints; a sketch of these changes follows, then a plot of the training and validation loss over 10 epochs. I ran my model for 10 epochs with a learning rate of 1e-3, using MSE to compute the loss (just like in part 2), ending with a mean absolute error of 23.25380 on our class's Kaggle leaderboard.
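A minimal sketch of those changes: a pretrained ResNet-18 with the first convolution swapped for a 1-channel (grayscale) input and the final fully connected layer replaced to output the 68 * 2 = 136 keypoint coordinates.

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1")  # pretrained=True in older torchvision

# accept a single grayscale channel instead of RGB
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# regress the flattened (x, y) coordinates of all 68 keypoints
model.fc = nn.Linear(model.fc.in_features, 68 * 2)
```

Note that the replaced first layer trains from scratch while the rest of the network keeps its ImageNet weights; an alternative would have been to repeat the grayscale channel three times and keep the original first layer.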

Below are some results. Ground truth is shown in green, predictions in red.

The overall shape of the faces in my model's predictions seems pretty good (although sometimes on the larger side). However, the finer details are a mess, especially around the eyes and mouth where the points are more concentrated. Below are some predictions on images of my own choosing:

Sana from Twice
Shang Chi
Sokka from ATLA

My model seems pretty garbage T_T, always predicting the face to be much bigger than it should be. This problem is WAY worse on Sokka, a cartoon character.

Final Thoughts

I should have started way earlier so that I would have more time to train and refine my model. As you can see, it's pretty bad on images that are not part of the dataset (maybe it's just a bounding box issue).