Facial Keypoint Detection with Neural Networks

Part 1: Nose Tip Detection

I created the dataloader for the images and their nose-tip keypoints, then displayed a few samples with the points plotted on them.
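A minimal sketch of what such a dataset class can look like, assuming image/keypoint pairs have already been loaded into memory; the class name and the `samples` argument are illustrative stand-ins, not the actual project code.

```python
from torch.utils.data import Dataset

class NoseTipDataset(Dataset):
    """Hypothetical dataset of (grayscale image, nose-tip point) pairs."""

    def __init__(self, samples, transform=None):
        # samples: list of (image tensor [1, H, W], nose point tensor [2])
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, nose_xy = self.samples[idx]
        if self.transform is not None:
            image = self.transform(image)
        return image, nose_xy
```

Wrapping this in a `torch.utils.data.DataLoader` then gives the batched iteration used for training and for plotting samples.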

Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7

I then trained my model and plotted the train and validation MSE loss over the course of training.
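A hedged sketch of a training loop that records the per-epoch train and validation MSE plotted above; `model`, `train_loader`, and `val_loader` are assumed to exist, and the hyperparameter defaults are illustrative.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    for epoch in range(epochs):
        # Training pass: accumulate average MSE across batches.
        model.train()
        total = 0.0
        for images, keypoints in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), keypoints)
            loss.backward()
            optimizer.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))

        # Validation pass: no gradients, just measure the loss.
        model.eval()
        total = 0.0
        with torch.no_grad():
            for images, keypoints in val_loader:
                total += criterion(model(images), keypoints).item()
        val_losses.append(total / len(val_loader))
    return train_losses, val_losses
```

The two returned lists are what gets plotted against epoch number.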

Train VS Validation MSE Loss, Part 1

I tried some hyperparameter tuning: I changed the model's channel size from 14 to 12 and the learning rate from 0.001 to 0.0001. The tuned model gave almost twice the average training loss of the original model, but about the same average validation loss. Here is the graph of train and validation losses:
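To make the comparison concrete, here is a hedged sketch of a small CNN with a configurable channel width; `NoseNet` and its `channels` argument are hypothetical stand-ins for the actual model class, not its real architecture.

```python
import torch.nn as nn

class NoseNet(nn.Module):
    """Hypothetical small CNN regressing a single (x, y) nose-tip point."""

    def __init__(self, channels=14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # LazyLinear infers its in-features on the first forward pass.
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(2))

    def forward(self, x):
        return self.head(self.features(x))

# Original run: NoseNet(channels=14) trained with lr=1e-3
# Tuned run:    NoseNet(channels=12) trained with lr=1e-4
```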

Train VS Validation MSE Loss (Hypertuned), Part 1

Here are 2 images on which the model works well, followed by 2 on which it detects incorrectly (red points are ground truth and blue points are predictions). For the 2 incorrect detections, I believe the first fails because the face is far off-center, even though the woman is facing the camera head-on. The second fails because the man's face is turned sharply to the side, so the nose tip is nowhere near the center of the frame, which makes it much harder to detect.
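The red/blue overlay above can be produced with a short matplotlib helper like the sketch below, which assumes a grayscale image array and keypoint coordinates normalized to [0, 1]; the function name and file path are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def show_prediction(image, gt_xy, pred_xy, path="prediction.png"):
    # image: array of shape [1, H, W] or [H, W]; coordinates in [0, 1].
    h, w = image.shape[-2], image.shape[-1]
    plt.imshow(image.squeeze(), cmap="gray")
    plt.scatter([gt_xy[0] * w], [gt_xy[1] * h], c="red", label="ground truth")
    plt.scatter([pred_xy[0] * w], [pred_xy[1] * h], c="blue", label="prediction")
    plt.legend()
    plt.savefig(path)
    plt.close()
```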

Prediction1 Prediction2 Prediction3 Prediction4

Part 2: Full Facial Keypoints Detection

I created the dataloader for the images and all of their facial keypoints, then displayed a few samples with the ground-truth points plotted on them. I also followed the data augmentation tutorial and ended up using transforms.ColorJitter to modify the brightness, contrast, and saturation.

Sample1 Sample2 Sample3 Sample4 Sample5 Sample6

Here is my model's detailed architecture:

Here are my average training loss, average validation loss, and the graph of training and validation loss over each epoch.

Part 2 MSE Graph

Here are 2 predictions that work well, followed by 2 that don't. The red points are the ground truths and the blue points are the model's predictions. The first failure case has a face that is tilted to the left and proportionally larger in the frame than in the other photos. The second fails because the head is shifted much further down the frame than in the rest of the dataset.

Prediction1 Prediction2 Prediction3 Prediction4

Here is my visualization of the learned filters:

Part2_Filters
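The filter grid above can be generated with a helper along these lines, which assumes access to the trained model's first Conv2d layer; the layout details are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def plot_filters(conv, path="filters.png"):
    # conv.weight has shape [out_channels, in_channels, kH, kW];
    # plot the first input channel of each learned filter.
    weights = conv.weight.detach().cpu()
    n = weights.shape[0]
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
    for i in range(n):
        ax = axes[i] if n > 1 else axes
        ax.imshow(weights[i, 0], cmap="gray")
        ax.axis("off")
    fig.savefig(path)
    plt.close(fig)
```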

Part 3: Train with Larger Dataset

The detailed architecture of the model is a standard resnet18(), with the first convolutional layer changed to take 1 input channel (64 output channels) and the final fully connected layer changed to 512 in-features and 136 out-features (68 keypoints × 2 coordinates). This is my graph of losses:

Part 3 MSE Graph

Here is the model run on some of my own images: