CS 194-26 Project 5: "Facial Keypoint Detection with Neural Networks"

Author: Joshua Fajardo

Project Overview

In this project, we use neural networks to predict keypoint/landmark locations by training on images that are already labeled. The networks we build here combine convolutional layers and fully connected layers. This project relies heavily on PyTorch, whose libraries make building neural networks much easier!

"Part 1: Nose Tip Detection"

To build up to the task of detecting facial keypoints, we first begin by detecting nose tips.


First, we need to load the data so we can feed it into our neural networks. Here are a few images sampled from the dataloader, shown with their keypoints.
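As a sketch of how the dataloader might be set up (the IMM file parsing is omitted, and the 60x80 grayscale resolution is an assumption), a minimal keypoint Dataset could look like this:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NoseTipDataset(Dataset):
    """Wraps pre-loaded grayscale images and nose-tip keypoints.

    File parsing is omitted; images are assumed to already be float
    tensors of shape (1, H, W), and keypoints normalized (x, y) pairs.
    """
    def __init__(self, images, keypoints):
        self.images = images        # (N, 1, H, W)
        self.keypoints = keypoints  # (N, 2)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.keypoints[idx]

# Sample a batch, as visualized above (dummy data stands in for IMM images).
imgs = torch.rand(4, 1, 60, 80) - 0.5
pts = torch.rand(4, 2)
loader = DataLoader(NoseTipDataset(imgs, pts), batch_size=2, shuffle=True)
batch_imgs, batch_pts = next(iter(loader))
```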


Convolutional Layers:

My CNN uses 3 convolutional layers and 2 fully connected layers. After each convolutional layer, there is a ReLU and a max pool. My first convolutional layer uses a 7x7 kernel (stride 2), the second uses a 5x5 kernel, and the last uses a 3x3 kernel. The number of channels starts at 1, goes to 12 after the first convolutional layer, 24 after the second, and stays at 24 after the third. The padding I use decreases from 2 to 1 to 0. All max pool layers are identical, using a 3x3 kernel and a stride of 2.

Fully Connected Layers:

My first fully connected layer has 48 input neurons and 16 output neurons. The second layer has 16 input neurons and 2 output neurons (the x and y coordinates of the nose tip). I wanted to keep the number of neurons small, scaling with the complexity of nose detection.
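Putting the two sections above together, here's a sketch of the network in PyTorch. The 60x80 input resolution is an assumption; with it, the flattened feature size works out to the 48 inputs of the first fully connected layer.

```python
import torch
import torch.nn as nn

class NoseNet(nn.Module):
    # Three conv layers (ReLU + 3x3/stride-2 max pool after each),
    # then two fully connected layers, as described above.
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 12, 7, stride=2, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(12, 24, 5, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(24, 24, 3, stride=1, padding=0), nn.ReLU(), nn.MaxPool2d(3, 2),
        )
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.Linear(48, 16), nn.ReLU(),  # 24 channels x 1 x 2 = 48 at 60x80 input
            nn.Linear(16, 2),              # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.fcs(self.convs(x))

# One forward pass on a dummy batch.
out = NoseNet()(torch.randn(2, 1, 60, 80))
```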

"Loss Function and Optimizer", "Hyperparameter Tuning"

Here are the losses for the network that I settled on.

Let's see how learning rate affects our network. Here, I've plotted the validation and training losses for models trained with learning rates of 1e-3 and 5e-3. Although the larger learning rate appears to give slightly better training and validation losses here, I still ended up using a learning rate of 1e-3 (as recommended by the spec).
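Here's a minimal sketch of the training loop, where only the learning rate is varied as in the comparison above; MSE loss on the keypoint coordinates and the Adam optimizer are assumed for illustration.

```python
import torch
import torch.nn as nn

def train(model, loader, lr=1e-3, epochs=1):
    # MSE loss on (x, y) coordinates, optimized with Adam (assumed setup).
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        for imgs, pts in loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), pts)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
    return losses

# Toy usage with a stand-in model and a single dummy batch.
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(60 * 80, 2))
toy_loader = [(torch.rand(2, 1, 60, 80), torch.rand(2, 2))]
losses = train(toy_model, toy_loader, lr=1e-3)
```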

Next, let's see how kernel size affects our network. I made the same type of plot, this time changing the first convolutional layer to use a 5x5 kernel with padding 1 and stride 2. As we can see, the 7x7 filter led to a very slight improvement in performance, so I ended up sticking with it.

Here are some sample predictions from my chosen model. The model predicts the two images in the top left corner pretty well, likely because the man has a prominent nose and is front-facing. In the two images in the bottom left corner, the man's face is at an angle to the camera, which seems to throw off our prediction. This may be because we have fewer angled images than front-facing images in our training set.

"Part 2: Full Facial Keypoints Detection"

"Sampled image from your dataloader visualized with ground-truth keypoints."

I augment the training data with random color jitter (brightness and saturation), rotations, and translations.
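When augmenting with rotations, the keypoints have to be rotated along with the image. Here's a sketch of the keypoint half of that transform; the sign convention has to match whichever image-rotation routine is used.

```python
import numpy as np

def rotate_keypoints(pts, angle_deg, h, w):
    """Rotate (x, y) keypoints about the image center by angle_deg degrees.

    The image itself would be rotated by the same angle so the labels
    stay consistent with the augmented image.
    """
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    center = np.array([w / 2.0, h / 2.0])
    return (pts - center) @ rot.T + center
```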

"Report the detailed architecture of your model. Include information on hyperparameters chosen for training and a plot showing both training and validation loss across iterations."

Convolutional Layers

A ReLU is applied after each convolutional layer.

Note: All pooling layers have a 3x3 kernel and stride 3.

Here are the layers in sequential order:

conv1: 1 input channel, 16 output channels, 7x7 kernel, stride = 2, padding = 2

conv2: 32 output channels, 5x5 kernel, stride = 1, padding = 2

conv3: 64 output channels, 5x5 kernel, stride = 1, padding = 2


conv4: 128 output channels, 5x5 kernel, stride = 1, padding = 1

conv5: 128 output channels, 3x3 kernel, stride = 1, padding = 1


Fully Connected Layers

A ReLU is applied after the first two FC layers.

fc1: 13824 input neurons, 2048 output neurons

fc2: 512 output neurons

fc3: 116 output neurons (58 keypoints x 2 coordinates)
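Here's a sketch of this architecture in PyTorch. The layer list above doesn't pin down which convolutional layers are followed by pooling, so placing a pool after conv2, conv3, and conv4 is an assumption on my part; nn.LazyLinear lets fc1 adapt to the input resolution (13824 inputs at the resolution used for training).

```python
import torch
import torch.nn as nn

class FaceNet(nn.Module):
    # Five conv layers with ReLUs; 3x3/stride-3 max pools after
    # conv2-conv4 (assumed placement), then three fully connected layers.
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 16, 7, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=1, padding=2), nn.ReLU(), nn.MaxPool2d(3, 3),
            nn.Conv2d(32, 64, 5, stride=1, padding=2), nn.ReLU(), nn.MaxPool2d(3, 3),
            nn.Conv2d(64, 128, 5, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(3, 3),
            nn.Conv2d(128, 128, 3, stride=1, padding=1), nn.ReLU(),
        )
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(2048), nn.ReLU(),  # 13824 -> 2048 at training resolution
            nn.Linear(2048, 512), nn.ReLU(),
            nn.Linear(512, 116),             # 58 keypoints x (x, y)
        )

    def forward(self, x):
        return self.fcs(self.convs(x))

# One forward pass on a dummy grayscale batch (resolution chosen arbitrarily).
out = FaceNet()(torch.randn(1, 1, 180, 240))
```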


I chose a learning rate of 5e-5 and trained the model for 50 epochs.

Here are the losses, plotted below. Though the initial drop in training loss looks very steep, I don't think this is a huge deal; the loss steadily decreases after the first epoch.

"Show 2 facial images which the network detects the facial keypoints correctly, and 2 more images where it detects incorrectly. Explain why you think it fails in those cases."

In this example, the two best keypoint detections are 1) the image in the third row from the top, third column from the left, and 2) the image in the bottom right corner. The network clearly fails on the first two images (in the top left corner). While there are plenty of reasons the model could have failed on these images, I think part of the reason may be the man's facial hair and thick eyebrows.

"Visualize the learned filters."

Here are the learned filters from the first convolutional layer.

"Part 3: Train With Larger Dataset"


Let's look at some sample augmented data!

"Report the mean absolute error by uploading your predictions on the testing set to our class Kaggle competition!"

Username: Joshua Fajardo

Score: 14.01975

"Report the detailed architecture of your model. Include information on hyperparameters chosen for training and a plot showing both training and validation loss across iterations."


I based my model on ResNet18. To adapt ResNet18 to our data (grayscale 224x224 images) and objective (predicting keypoints), I changed the number of input channels of the first convolutional layer from 3 to 1, and changed the number of output neurons in the fully connected layer from 1000 to 136.


I used a learning rate of 1e-3 and trained the model for 10 epochs. For the training-validation split, I went for 80-20.


Since I found that a lot of the landmarks fell outside the bounding box, I decided to scale every bounding box by a factor of 1.2 in each dimension. Additionally, I augmented the data with random changes in brightness and saturation, random rotations, and random translations.
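The bounding-box expansion can be sketched as a small helper; the (x, y, w, h) box format here is an assumption.

```python
def expand_bbox(x, y, w, h, scale=1.2):
    """Grow an (x, y, w, h) bounding box about its center by `scale`
    in each dimension, so keypoints that fall just outside the detected
    box still end up inside the crop."""
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale, h * scale
    return cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h
```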

Here are the training and validation losses. I could have trained the model for longer to get better results, but I ran out of time.

"Visualize some images with the keypoints prediction in the testing set."

First, let's look at some images up-close.

Honestly, it's quite shocking to me how good the results are.

"Try running the trained model on no less than 3 photos from your collection. Which ones does it get right? Which ones does it fail on?"

Here's my network trying to detect keypoints on my face throughout the years. Arguably, it did a pretty solid job on the first and last images. It looks like my glasses may have interfered with the network's performance around my eyes, however. The network did an adequate job on the second image, but some points miss the bottom of my face.


All quotes are taken from the project spec.