PROJECT 4:
Facial Keypoint Detection
Part 1. Nose Tip Detection
For this part, I used images and keypoints from the IMM Face Database, resizing inputs to
the recommended 80x60 after converting them to grayscale images in the -0.5 to 0.5 range.
These images were loaded into two separate custom PyTorch Datasets for male and female
faces, respectively. A random sampling of images from each dataset can be seen below, with
their associated training labels (nose tip keypoints, in this case) shown in light blue.
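The grayscale conversion, [-0.5, 0.5] normalization, and 80x60 resize described above can be sketched as a small preprocessing function. The nearest-neighbor resize here is an illustrative stand-in (torchvision's `Resize` would work equally well); it is not necessarily the exact implementation used.

```python
import numpy as np
import torch

def preprocess(img, size=(60, 80)):
    """Convert an HxWx3 uint8 RGB image to a 1xHxW grayscale float
    tensor in the [-0.5, 0.5] range, resized to `size` = (H, W)."""
    gray = img.astype(np.float32).mean(axis=2) / 255.0 - 0.5
    # Simple nearest-neighbor resize (illustrative; a torchvision
    # transform could be substituted here).
    h, w = gray.shape
    rows = (np.arange(size[0]) * h / size[0]).astype(int)
    cols = (np.arange(size[1]) * w / size[1]).astype(int)
    resized = gray[rows][:, cols]
    return torch.from_numpy(resized).unsqueeze(0)
```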
Randomly selected female faces & associated nose tip landmarks:

Randomly selected male faces & associated nose tip landmarks:
The male faces dataset was loaded into a PyTorch Dataloader for training, using a batch
size of 1 alongside an 80/20 train-test split. After initial trials with LeNet, fine-tuning
ultimately yielded the neural network architecture defined below, with a max-pooling
layer and ReLU activation function following each convolutional layer.
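A sketch of a network in this spirit is below: several convolutional layers, each followed by ReLU and 2x2 max-pooling, feeding fully-connected layers that regress the (x, y) nose tip position. The specific channel counts, kernel sizes, and hidden width are illustrative assumptions, not necessarily the exact architecture used.

```python
import torch
import torch.nn as nn

class NoseNet(nn.Module):
    """Small CNN for nose tip regression on 1x60x80 inputs.
    Channel counts and kernel sizes here are illustrative."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 60x80 -> 30x40
            nn.Conv2d(12, 24, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 30x40 -> 15x20
            nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 15x20 -> 7x10
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),
            nn.Linear(128, 2),  # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))
```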
25 epochs of training were conducted, with Adam optimization and a learning rate of 1e-3.
Training and validation MSE losses recorded across these epochs are shown below.
The model was validated on a subset of male faces during each epoch, with strong
performance across the board. Success and failure cases among these are shown below.
Successful predictions in the male face dataset:
Failed predictions in the male face dataset:
As an informal test of generalization, I also evaluated my final trained model on unseen
data in the form of female faces. Considering that it had never seen female faces during
training, the model managed reasonable performance. Success and failure cases
among these examples are shown below.
Reasonably successful predictions in the unseen female face dataset:
Failed predictions in the female face dataset:
It’s difficult to say exactly why the model struggles with certain faces more than others.
The most likely explanation would be some combination of unfamiliar facial structure and
rotated position, both of which can throw off the model in inference if not handled during
training. To address this moving forward, I incorporated randomized image
transformations to help diversify future training inputs.
Part 2. Full Facial Keypoints Detection
I once again used inputs from the IMM Face Database here, resizing images to the
recommended 160x120 after converting them to grayscale within the -0.5 to 0.5 range.
These images were then augmented randomly, with a combination of rotations within a 30
degree range as well as brightness and saturation adjustments applied to diversify inputs.
Separate PyTorch Datasets were again created for male and female faces, respectively.
Some images selected from each dataset can be seen below along with the full set of
associated training labels (facial keypoints, in this case) shown in light blue.
Randomly selected female faces & associated facial landmarks:
Randomly selected male faces & associated facial landmarks:
A PyTorch Dataloader with all male faces and associated labels was again used for training,
with a batch size of 1 and an 80/20 train-test split. I built upon the architecture from the
nose tip detection task to maximize performance, yielding the neural network architecture
defined below. As before, each convolutional layer is followed by max-pooling and a ReLU activation.
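A sketch of such a network is below, with five convolutional layers (matching the five filter visualizations later in this part) regressing all 58 IMM landmarks on 160x120 inputs. Channel counts and the hidden width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FaceNet(nn.Module):
    """Five conv layers, each followed by ReLU and 2x2 max-pooling,
    regressing n_keypoints (x, y) pairs on 1x120x160 inputs.
    Channel counts here are illustrative."""
    def __init__(self, n_keypoints=58):
        super().__init__()
        chans = [1, 16, 32, 32, 64, 64]
        layers = []
        for c_in, c_out in zip(chans, chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*layers)  # 120x160 -> 3x5 after 5 pools
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 3 * 5, 256), nn.ReLU(),
            nn.Linear(256, n_keypoints * 2),
        )

    def forward(self, x):
        return self.head(self.features(x))
```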
25 epochs of training were conducted, with Adam optimization and a learning rate of 1e-3.
Training and validation MSE losses recorded across these epochs are shown below.
Performance is reasonably strong across the male dataset once again, particularly
considering the complexity of the task. Remaining errors can likely be attributed to
unfamiliar facial features, which could be addressed with a deeper network capable of
learning additional complexity. More training epochs would also help ensure the model fully converges.
Successful predictions in the male face dataset:
Failed predictions in the male face dataset:
The learned filters for each convolutional layer can be visualized sequentially below.
Learned filters for the first convolutional layer:
Learned filters for the second convolutional layer:
Learned filters for the third convolutional layer:
Learned filters for the fourth convolutional layer:
Learned filters for the fifth convolutional layer:
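Visualizations like these can be produced by plotting each layer's learned kernels as a grid. The helper below is a generic sketch (averaging over input channels so each filter reduces to a single 2D image), not the exact plotting code used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe when saving to file
import matplotlib.pyplot as plt
import torch.nn as nn

def plot_filters(conv, cols=8):
    """Show the kernels of one Conv2d layer in a grid, averaged over
    input channels so each filter is a single grayscale image."""
    w = conv.weight.detach().mean(dim=1).numpy()  # out_channels x kH x kW
    n = w.shape[0]
    rows = (n + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i < n:
            ax.imshow(w[i], cmap="gray")
    return fig
```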
Part 3. Train With Larger Dataset
For this part, I used inputs from ibug’s Faces In The Wild dataset, resizing cropped faces to
the recommended 224x224 after converting them to grayscale images in the -0.5 to 0.5
range. Data augmentation was also applied, with the same combination of random
rotations within a range and brightness & saturation adjustments used to diversify inputs.
These images were loaded into a single custom PyTorch Dataset for training. Select images
can be seen below, with their associated labels (full facial keypoints) shown in light blue.
Randomly selected faces & associated facial landmarks:
This augmented dataset was loaded into a PyTorch Dataloader, with a batch size of 1 but
no train-test split since a separate test set was already provided. A pre-trained ResNet18
model was fine-tuned for this task, with default values adjusted as shown below to both
accommodate the input data and optimize performance.
25 epochs of training were conducted, with Adam optimization and a learning rate of 1e-4.
Mean squared error loss as recorded across these training epochs is shown below.
Per the Kaggle results below, my model's predictions achieved an MSE loss of 7.48595
across the test set. That's not quite good enough for extra credit, but pretty strong nonetheless!
These strong performance metrics are backed up by the results for randomly selected
inputs from the test set shown below, all of which yielded fairly accurate predictions.
Predicting keypoints for portraits I used in the previous project yielded further strong
results, with generally accurate predictions across all three images. The model struggled
slightly with Rihanna, failing to outline her cheek area as accurately as in other inputs.
Similar issues were encountered with Stephen Curry, though to a lesser extent.