Automatic Keypoint Detection

Nose Keypoint Detection

Below is a sampled groundtruth image for our CNN, which predicts the nose keypoint of a given face.
For hyperparameter tuning, I tested learning rates of .001 and .01 and saw very little difference in training effect, as both models were able to learn the dataset efficiently. The model with an LR of .001 seemed a bit more stable, as its training loss typically decreased more steadily, while the model with an LR of .01 bounced up and down slightly, most likely because it overshot the loss minimum. I also tested setting the kernel size of my first convolution layer to 3 or 5. This had little effect on the validation loss, but the final training loss was very slightly lower for the net with kernel size 5. Below is a comparison of the training/validation loss graphs for these configurations.
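The two hyperparameters swept above can be sketched as arguments to a small keypoint-regression CNN. The layer sizes and the optimizer choice here are my own placeholders, not the actual architecture from this project; only the tunable pieces (first-layer kernel size and learning rate) mirror the experiment.

```python
import torch
import torch.nn as nn


class NoseNet(nn.Module):
    """Hypothetical nose-keypoint CNN; layer widths are illustrative only."""

    def __init__(self, first_kernel=3):
        super().__init__()
        pad = first_kernel // 2  # keep spatial size after the first conv
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, first_kernel, padding=pad), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size features for any input
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 2),  # (x, y) of the nose keypoint
        )

    def forward(self, x):
        return self.head(self.features(x))


# The two configurations compared: kernel size 3 vs 5, LR .001 vs .01.
net = NoseNet(first_kernel=5)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # optimizer choice assumed
```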
Below are some samples of the groundtruth in red and our net's output in green. The right 2 images are clearly failures. I believe the 3rd image failed because the nose was not very pronounced, so the net defaulted to guessing the middle of the image. I believe the rightmost image failed because the net confused the corner of the subject's eye with the nose.

Full Facial Keypoint Dataset

Below is a sampled groundtruth image for our CNN, which predicts all keypoints of a given face.
My model's architecture is summarized below. I used an Adam optimizer with LR .001.
The training/validation loss is pictured below.
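A minimal sketch of how such a model and its optimizer could be set up is below. The layer sizes and the keypoint count are assumptions (not the summarized architecture itself); only the Adam optimizer with LR .001 comes from the writeup.

```python
import torch
import torch.nn as nn

NUM_KEYPOINTS = 58  # assumed keypoint count; substitute the dataset's actual value


class FaceNet(nn.Module):
    """Illustrative full-face keypoint regressor; widths are placeholders."""

    def __init__(self, num_keypoints=NUM_KEYPOINTS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 6 * 6, 256), nn.ReLU(),
            nn.Linear(256, 2 * num_keypoints),  # (x, y) per keypoint
        )

    def forward(self, x):
        return self.head(self.features(x))


net = FaceNet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # LR .001, per the writeup
criterion = nn.MSELoss()  # loss choice assumed; typical for keypoint regression
```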
Below are some samples of the groundtruth in red and our net's output in green. The right 2 images are clearly failures. In the rightmost image, the chin keypoints are detected incorrectly, possibly because the net expected the subject to be slightly more centered, or because it interpreted shadows as the chin. The second image from the right fails to locate the eyes correctly, possibly because the net is not good at identifying faces at an angle.
Finally, below is a visualization of each convolutional layer's learned filters after training. Filters with multiple input channels are averaged along the channel axis to reduce the number of images shown.
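The channel-averaging step can be sketched as follows: walk the model's modules, grab each `Conv2d` weight tensor of shape `(out_channels, in_channels, kH, kW)`, and mean over the input-channel axis so every filter becomes a single 2-D image. The helper name `filter_images` is my own; the technique is standard.

```python
import torch
import torch.nn as nn


def filter_images(model):
    """Return one stack of 2-D filter images per Conv2d layer in the model."""
    imgs = []
    for layer in model.modules():
        if isinstance(layer, nn.Conv2d):
            w = layer.weight.detach()   # (out_ch, in_ch, kH, kW)
            imgs.append(w.mean(dim=1))  # average over input channels -> (out_ch, kH, kW)
    return imgs


# Demo on a toy two-layer stack: yields 8 filters of 5x5 and 16 filters of 3x3.
conv = nn.Sequential(nn.Conv2d(3, 8, 5), nn.Conv2d(8, 16, 3))
imgs = filter_images(conv)
```

Each entry in `imgs` can then be plotted directly, e.g. with `matplotlib`'s `imshow`, one subplot per filter.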

Training With a Larger Dataset

Below is the model architecture; I used ResNet18 with the required modifications.
To train it, I paid for a GPU instance with Lambda Labs here, since I was short on time and HATE using Colab for deep learning. I trained the pretrained model for 25 epochs, and the training/validation loss is pictured below.
3 prediction samples from the test set are pictured below. All indicate successful keypoint detection.
Below are 3 prediction samples from my own personal collection. All 3 examples showed poor predictions, most likely because the faces were not centered and cropped like the training data.
My final Kaggle mean absolute error was 33.503.