Automatic Keypoint Detection
Nose Keypoint Detection
Below is a sampled ground-truth image for our CNN, which tries to find the nose keypoint of a given face.
For hyperparameter tuning, I tested learning rates of 0.001 and 0.01 and saw very little difference in training
behavior, as both models learned the dataset efficiently. The model with an LR of 0.001 seemed slightly more
stable: its training loss typically decreased more steadily, while the training loss of the model with an LR of
0.01 bounced up and down slightly, most likely because it overshot the loss minimum. I also tested changing the
kernel size of my first convolution layer between 3 and 5. This had little effect on the validation loss, but the
final training loss was very slightly lower for the net with kernel size 5. Below is a comparison of the
training/validation loss graphs for these configurations.
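As a concrete reference, a minimal PyTorch sketch of such a nose-keypoint CNN is shown below. The write-up only specifies the first-layer kernel size (3 vs. 5); the number of layers, channel counts, input resolution, and the adaptive pooling before the head are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical nose-keypoint CNN. Only the first-layer kernel size (3 or 5)
# comes from the write-up; everything else is an illustrative assumption.
class NoseNet(nn.Module):
    def __init__(self, first_kernel=5):
        super().__init__()
        pad = first_kernel // 2
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, first_kernel, padding=pad), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            # Adaptive pooling makes the head independent of the input size.
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, 2),  # regress the (x, y) of the nose keypoint
        )

    def forward(self, x):
        return self.head(self.features(x))
```

Swapping `first_kernel` between 3 and 5 reproduces the comparison described above without touching the rest of the network.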
Below are some samples of the ground-truth in red and our net's output in green. The right 2 images are clearly
failures. I believe the 3rd image failed because the nose was not very clearly pronounced, so the net defaulted to
guessing the middle of the image. And I believe the rightmost image failed because the net confused the corner of
the subject's eye with its nose.
Full Facial Keypoint Dataset
Below is a sampled ground-truth image for our CNN, which tries to find all keypoints of a given face.
My model's architecture is summarized below. I used an Adam optimizer with an LR of 0.001.
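The training procedure can be sketched as the loop below. Adam at lr=1e-3 is from the write-up; the MSE loss on keypoint coordinates, the `train_loader`/`val_loader` names, and the epoch count are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the training loop: Adam at lr=1e-3 (as stated above) with an
# assumed MSE loss between predicted and ground-truth keypoint coordinates.
def train(model, train_loader, val_loader, epochs=20):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    history = {"train": [], "val": []}
    for _ in range(epochs):
        model.train()
        total = 0.0
        for imgs, pts in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(imgs), pts)
            loss.backward()
            opt.step()
            total += loss.item()
        history["train"].append(total / len(train_loader))

        # Track validation loss each epoch to produce the graphs below.
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        history["val"].append(val / len(val_loader))
    return history
```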
The training/validation loss is pictured below.
Below are some samples of the ground-truth in red and our net's output in green. The right 2 images are clearly
failures. In the rightmost image, the chin keypoints are detected incorrectly, possibly because the net expected
the subject to be slightly more centered, or because it interpreted some shadows as the chin. The second image
from the right fails to find the eyes correctly, possibly because the net was not good at identifying faces at an
angle.
Finally, below we have a visualization of each convolutional layer's learned filters after training. Each
multi-channel filter is averaged along its input channels to reduce the number of filter images shown.
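The channel-averaging step for these visualizations can be done in a few lines. This is a generic sketch (the helper name is my own), assuming the weights live in standard `nn.Conv2d` layers:

```python
import torch
import torch.nn as nn

# Collapse a conv layer's 4-D weight tensor (out_ch, in_ch, kH, kW) into
# one 2-D image per output channel by averaging over the input channels,
# so each learned filter can be plotted as a single grayscale patch.
def filters_to_images(conv: nn.Conv2d) -> torch.Tensor:
    w = conv.weight.detach()  # shape: (out_ch, in_ch, kH, kW)
    return w.mean(dim=1)      # shape: (out_ch, kH, kW)
```

Each resulting `kH x kW` patch can then be shown with e.g. `plt.imshow`.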
Training With a Larger Dataset
Below is the model architecture; I used ResNet18 with the required modifications.
To train it, I paid for a GPU instance from Lambda Labs
here, since I was short on time and HATE using
Colab for deep learning. I trained for 25 epochs using the pretrained model, and the
training/validation loss is pictured below.
3 prediction samples from the test set are pictured below. All indicate successful keypoint detection.
3 prediction samples from my own personal collection are pictured below. All 3 showed poor predictions, most likely because the faces were not centered/cropped.
My final Kaggle absolute error was 33.503.