Part 1: Nose Tip Detection

To train a model to identify the nose tip from images, we first create a PyTorch Dataset that pairs each image with its nose keypoint. Here are three examples of image/keypoint pairs from our training set.
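The Dataset class itself isn't reproduced here, but a minimal sketch of this kind of class, assuming the images and keypoints have already been loaded as NumPy arrays (the names `NoseKeypointDataset`, `images`, and `keypoints` are illustrative, and the [0, 1] keypoint normalization is an assumption), might look like:

```python
import torch
from torch.utils.data import Dataset

class NoseKeypointDataset(Dataset):
    """Pairs each grayscale face image with its (x, y) nose keypoint."""

    def __init__(self, images, keypoints):
        # images: (N, H, W) NumPy arrays; keypoints: (N, 2) nose coordinates,
        # assumed normalized to [0, 1] relative to image width and height
        self.images = images
        self.keypoints = keypoints

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = torch.from_numpy(self.images[idx]).float().unsqueeze(0)  # add a channel dim
        keypoint = torch.from_numpy(self.keypoints[idx]).float()
        return image, keypoint
```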

For training this model, I used three convolutional layers with 20, 16, and 12 output channels, followed by a fully connected layer with 120 units and a final projection onto 2 outputs (the x and y position of the nose). I used the Adam optimizer and MSE loss. My training and validation loss per epoch is shown below.
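A sketch of that architecture, where the 3x3 kernels, the max-pooling placement, the 60x80 input resolution, and the learning rate are assumptions not stated above:

```python
import torch
import torch.nn as nn

class NoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 12, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(12 * 7 * 10, 120), nn.ReLU(),  # 60x80 inputs pool down to 7x10 maps
            nn.Linear(120, 2),  # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))

model = NoseNet()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption
```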

Here are some of the validation set results with the ground-truth (green) and predicted (red) keypoint for the nose. Images 1 and 2 were very successful, but 3 and 4 had significant error. This is likely because of the lighting difference: images 3 and 4 had significantly less contrast between the face and the background, which may make the nose harder for our model to detect.

Part 2: Full Facial Keypoints Detection

For this section, I'll move on to creating a model to detect all 58 keypoints. Here are some example photos (with ground-truth keypoints in green) from the training set.

For training this model, I used five convolutional layers with 20, 16, 14, 13, and 12 output channels, followed by a fully connected layer with 120 units and a final projection onto 2 * 58 outputs (an x and y coordinate for each keypoint). I used the Adam optimizer and MSE loss. My training and validation loss per epoch is shown below.
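A sketch of that architecture; the kernel sizes and pooling placement are again assumptions, and `nn.LazyLinear` is used here so the flattened size after five pooling stages doesn't have to be computed by hand:

```python
import torch.nn as nn

class FaceNet(nn.Module):
    def __init__(self):
        super().__init__()
        channels = [1, 20, 16, 14, 13, 12]
        layers = []
        for c_in, c_out in zip(channels, channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(120), nn.ReLU(),  # infers the flattened size on first forward
            nn.Linear(120, 2 * 58),  # (x, y) for each of the 58 keypoints
        )

    def forward(self, x):
        return self.head(self.features(x))
```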

Here we have examples showing ground-truth (red) and predicted (green) keypoints. We predict well for the first two images but struggle with the latter two. The model likely fails on these because of the large head tilt, which it handles less well than a straight-on headshot.

Now, we visualize some of the "learned filters" that our model has developed during training. Here they are for the first convolutional layer.
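A sketch of how those filters can be plotted, assuming `model` is a trained instance of the FaceNet sketch above (the attribute path into `features` matches that sketch, not necessarily the original code):

```python
import matplotlib.pyplot as plt

weights = model.features[0].weight.detach().cpu()  # shape (20, 1, 3, 3)
fig, axes = plt.subplots(4, 5, figsize=(8, 6))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0], cmap="gray")  # each filter is a single-channel kernel
    ax.axis("off")
plt.show()
```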

Part 3: Train With Larger Dataset

Now, we work with a much larger dataset with bounding boxes. Here are some of the example faces (along with ground-truth keypoints in green).
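The text above doesn't spell out how the bounding boxes are used, but a common approach is to crop each face to its box (shifting the keypoints accordingly) before resizing; a hypothetical sketch, where `crop_to_box`, the (x, y, w, h) box format, and the 224-pixel output size are all assumptions:

```python
import numpy as np
from PIL import Image

def crop_to_box(image, keypoints, box, out_size=224):
    # box: (x, y, w, h) in pixels; image is assumed to be a uint8 array
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    # shift keypoints into the crop's frame, then rescale to the resized crop
    kps = (keypoints - np.array([x, y])) * (out_size / np.array([w, h]))
    crop = np.asarray(Image.fromarray(crop).resize((out_size, out_size)))
    return crop, kps
```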

For training, I used the ResNet18 model, but replaced the first layer with a convolutional layer with a kernel size of 5 and 64 output channels. I also replaced the final fully connected layer so that it outputs 68 * 2 values, one x and y coordinate per keypoint. Below is the training and validation loss per epoch.
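A sketch of that modification using torchvision; the stride and padding of the new first layer, and the single-channel (grayscale) input, are assumptions:

```python
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18()
# 5x5 first conv with 64 output channels (1 input channel assumes grayscale;
# use 3 for RGB); stride 2 and padding 2 mirror the original layer's downsampling
model.conv1 = nn.Conv2d(1, 64, kernel_size=5, stride=2, padding=2, bias=False)
# final fully connected layer now emits 68 * 2 = 136 coordinates
model.fc = nn.Linear(model.fc.in_features, 68 * 2)
```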

After training, I apply my model to the test set. Here are two example images (with predicted keypoints in green). On Kaggle, my mean squared error is $22.02754$.
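Inference itself is just a forward pass; a minimal sketch, assuming `model` is the modified ResNet18 above and `test_images` is a placeholder for a preprocessed batch of face crops:

```python
import torch

model.eval()
with torch.no_grad():
    preds = model(test_images)  # (N, 136) flat coordinate vector
preds = preds.view(-1, 68, 2)   # reshape into (x, y) pairs per keypoint
# these live in the crop's coordinate frame; drawing them on the original
# photo means undoing the bounding-box crop and resize
```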

Now, I try this model on some photos of my own. It seems to work decently well, although it's hard to judge accuracy when the faces are so zoomed out. It got the second photo pretty much right, while failing on the first and third.