Facial Keypoint Detection with Neural Networks

Part 1: Nose Tip Detection

I created the dataloader for the images and their nose-tip keypoints, then displayed a few samples with the points plotted on them.
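A minimal sketch of what such a dataset class can look like, assuming image/keypoint pairs have already been loaded into memory; the class name and the `samples` argument are illustrative stand-ins, not the actual project code.

```python
from torch.utils.data import Dataset

class NoseTipDataset(Dataset):
    """Hypothetical dataset of (grayscale image, nose-tip point) pairs."""

    def __init__(self, samples, transform=None):
        # samples: list of (image tensor [1, H, W], nose point tensor [2])
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, nose_xy = self.samples[idx]
        if self.transform is not None:
            image = self.transform(image)
        return image, nose_xy
```

Wrapping this in a `torch.utils.data.DataLoader` then gives the batched iteration used for training and for plotting samples.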

Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7

I then trained my model and plotted the train and validation MSE loss over the course of training.
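A hedged sketch of a training loop that records the per-epoch train and validation MSE plotted above; `model`, `train_loader`, and `val_loader` are assumed to exist, and the hyperparameter defaults are illustrative.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    for epoch in range(epochs):
        # Training pass: accumulate average MSE across batches.
        model.train()
        total = 0.0
        for images, keypoints in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), keypoints)
            loss.backward()
            optimizer.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))

        # Validation pass: no gradients, just measure the loss.
        model.eval()
        total = 0.0
        with torch.no_grad():
            for images, keypoints in val_loader:
                total += criterion(model(images), keypoints).item()
        val_losses.append(total / len(val_loader))
    return train_losses, val_losses
```

The two returned lists are what gets plotted against epoch number.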

Train VS Validation MSE Loss, Part 1

I tried some hyperparameter tuning: I changed the model's channel size from 14 to 12 and the learning rate from 0.001 to 0.0001. The tuned model gave almost twice the average training loss of the original model, but about the same average validation loss. Here is the graph of train and validation losses:
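To make the comparison concrete, here is a hedged sketch of a small CNN with a configurable channel width; `NoseNet` and its `channels` argument are hypothetical stand-ins for the actual model class, not its real architecture.

```python
import torch.nn as nn

class NoseNet(nn.Module):
    """Hypothetical small CNN regressing a single (x, y) nose-tip point."""

    def __init__(self, channels=14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # LazyLinear infers its in-features on the first forward pass.
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(2))

    def forward(self, x):
        return self.head(self.features(x))

# Original run: NoseNet(channels=14) trained with lr=1e-3
# Tuned run:    NoseNet(channels=12) trained with lr=1e-4
```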

Train VS Validation MSE Loss (Hypertuned), Part 1

Here are 2 images on which the model works well, followed by 2 on which it detects incorrectly (red points are ground truth and blue points are predictions). For the 2 incorrect detections, I believe the first fails because the face is far off-center, even though the woman is facing the camera head-on. The second fails because the man's face is turned sharply to the side, so the nose tip is nowhere near the center of the frame, which makes it much harder to detect.
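The red/blue overlay above can be produced with a short matplotlib helper like the sketch below, which assumes a grayscale image array and keypoint coordinates normalized to [0, 1]; the function name and file path are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def show_prediction(image, gt_xy, pred_xy, path="prediction.png"):
    # image: array of shape [1, H, W] or [H, W]; coordinates in [0, 1].
    h, w = image.shape[-2], image.shape[-1]
    plt.imshow(image.squeeze(), cmap="gray")
    plt.scatter([gt_xy[0] * w], [gt_xy[1] * h], c="red", label="ground truth")
    plt.scatter([pred_xy[0] * w], [pred_xy[1] * h], c="blue", label="prediction")
    plt.legend()
    plt.savefig(path)
    plt.close()
```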

Prediction1 Prediction2 Prediction3 Prediction4

Part 2: Full Facial Keypoints Detection

I created the dataloader for the images and all of their facial keypoints, then displayed a few samples with the ground-truth points plotted on them. I also followed the data augmentation tutorial and ended up using transforms.ColorJitter to modify the brightness, contrast, and saturation.

Sample1 Sample2 Sample3 Sample4 Sample5 Sample6

Here is my model's detailed architecture:

Here are my average training loss, average validation loss, and the graph of training and validation loss over each epoch.

Part 2 MSE Graph

Here are 2 predictions that work well, followed by 2 that don't. The red points are the ground truths and the blue points are the model's predictions. The first failure case has a face that is tilted to the left and proportionally larger in the frame than in the other photos. The second fails because the head is shifted much further down the frame than in the rest of the dataset.

Prediction1 Prediction2 Prediction3 Prediction4

Here is my visualization of the learned filters:

Part2_Filters
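The filter grid above can be generated with a helper along these lines, which assumes access to the trained model's first Conv2d layer; the layout details are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def plot_filters(conv, path="filters.png"):
    # conv.weight has shape [out_channels, in_channels, kH, kW];
    # plot the first input channel of each learned filter.
    weights = conv.weight.detach().cpu()
    n = weights.shape[0]
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
    for i in range(n):
        ax = axes[i] if n > 1 else axes
        ax.imshow(weights[i, 0], cmap="gray")
        ax.axis("off")
    fig.savefig(path)
    plt.close(fig)
```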

Part 3: Train with Larger Dataset

The detailed architecture of the model is a standard resnet18(), with the first convolutional layer changed to take 1 input channel (64 output channels) and the final fully connected layer changed to 512 in-features and 136 out-features (68 keypoints × 2 coordinates). This is my graph of losses:

Part 3 MSE Graph

Here is the model run on some of my own images: