CS 194-26 - Project 5 (Facial Keypoint Detection with Neural Networks) [Andrew Ke]

In this project, I use convolutional neural networks in PyTorch to predict facial keypoints on images. CNNs are a subtype of neural network designed for classification and regression tasks on images. A typical CNN contains a combination of convolution, ReLU, and pooling layers, followed by fully connected layers at the end of the network.

Part 1 (Nose Tip Detection)

Using the IMM Face Database, I trained a CNN for nose tip detection.

The images were converted to grayscale, the pixel values were normalized to the range -0.5 to 0.5, and the images were resized to 80px x 60px.
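The grayscale and normalization steps can be sketched as follows. This is a minimal illustration, not the project's actual loader: the function name `to_normalized_gray` and the luminance weights are assumptions, and the resize to 80x60 is left to whichever image library is in use.

```python
import numpy as np

def to_normalized_gray(rgb_uint8):
    """Convert an (H, W, 3) uint8 RGB image to float32 grayscale in [-0.5, 0.5]."""
    rgb = rgb_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    # ITU-R BT.601 luminance weights (an assumption; any grayscale conversion works)
    gray = rgb @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return gray - 0.5  # shift to [-0.5, 0.5]

# Synthetic stand-in for a resized image: height 60, width 80
demo = np.random.default_rng(0).integers(0, 256, size=(60, 80, 3), dtype=np.uint8)
gray = to_normalized_gray(demo)
```

Centering the values around zero keeps the network inputs roughly balanced between positive and negative, which tends to help optimization.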

Sample Training Data

Model Structure

Below is the design of the network I settled on. I have three convolutional layers, all with kernel size 3 and stride 1. Every convolution is followed by a ReLU, and the last two convolutions have a MaxPool after the ReLU. I did not add a MaxPool after the first conv in order to avoid scaling down the image dimensions too early.

Finally, I added two fully connected layers to the end. The final layer has an output of size 2 (the x and y coordinates of the nose).

  (conv1): Conv2d(1, 24, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(24, 32, kernel_size=(3, 3), stride=(1, 1))
  (MaxPool2d kernel_size=2)
  (conv3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
  (MaxPool2d kernel_size=2)
  (fc1): Linear(in_features=7488, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=2, bias=True)
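The listing above could be written as a PyTorch module roughly as follows. This is a sketch under assumptions: the class name `NoseNet` is hypothetical, and the ReLU between the two fully connected layers is my guess, since the listing does not say what sits between them. The shape comments assume an input of height 60 and width 80.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 24, kernel_size=3)
        self.conv2 = nn.Conv2d(24, 32, kernel_size=3)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=3)
        self.fc1 = nn.Linear(7488, 256)
        self.fc2 = nn.Linear(256, 2)

    def forward(self, x):
        x = F.relu(self.conv1(x))                    # (N, 24, 58, 78), no pooling yet
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # (N, 32, 28, 38)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)   # (N, 32, 13, 18)
        x = torch.flatten(x, 1)                      # (N, 32 * 13 * 18) = (N, 7488)
        x = F.relu(self.fc1(x))                      # assumption: ReLU between FC layers
        return self.fc2(x)                           # (N, 2): nose x and y

out = NoseNet()(torch.zeros(4, 1, 60, 80))
```

The flattened size 7488 follows directly from the shapes in the comments, which is a quick way to check that `fc1` matches the conv stack.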


I trained my network to minimize MSELoss using the Adam optimizer with learning rate 0.001 over 25 epochs using a mini-batch size of 15. The train/validation loss graph is shown below (calculated at the end of each epoch using nn.MSELoss(reduction='sum')).
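A training loop with these settings might look like the sketch below. The data is synthetic and the model is a tiny placeholder so the snippet runs on its own. I also assume the training objective uses the default MSELoss reduction, with the summed variant used only for the per-epoch graph.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Synthetic stand-ins for the real data: 30 grayscale 60x80 images with (x, y) targets
images = torch.rand(30, 1, 60, 80) - 0.5
targets = torch.rand(30, 2)
loader = DataLoader(TensorDataset(images, targets), batch_size=15, shuffle=True)

# Tiny placeholder model; the real network is the CNN listed above
model = nn.Sequential(nn.Flatten(), nn.Linear(60 * 80, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
train_criterion = nn.MSELoss()                    # assumption: default reduction for training
report_criterion = nn.MSELoss(reduction='sum')    # summed loss reported once per epoch

epoch_losses = []
for epoch in range(25):
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        loss = train_criterion(model(x), y)
        loss.backward()
        optimizer.step()
    # End-of-epoch loss, summed over the whole set
    model.eval()
    with torch.no_grad():
        total = sum(report_criterion(model(x), y).item() for x, y in loader)
    epoch_losses.append(total)
```

The same evaluation pass over a held-out loader would give the validation curve.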

Correct Examples
(Blue is Ground Truth Label, Red is Prediction)
Incorrect Examples

I think my network fails in these cases because there are few females and bald individuals in the training dataset. Overall, the dataset is very small and not very diverse, so it is difficult for the model to extrapolate properly.

Hyperparameter Variations

LR = 0.01 (too high)
50% reduced number of channels
  (conv1): Conv2d(1, 12, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(12, 16, kernel_size=(3, 3), stride=(1, 1))
  (conv3): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))

Part 2 (Full Facial Keypoint Detection)

Data Preparation

Using the same IMM Face Database, we now attempt to predict all 58 keypoints. I used the same data loader as Part 1, but scaled the images to 240x180. To prevent the model from overfitting on a small dataset, I added rotation, shift, color jitter, and horizontal flip augmentations. I had to make sure to update the positions of the keypoints so they stayed consistent with the augmentations applied to the image.
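Keeping the keypoints consistent with a geometric augmentation means applying the same transform to the coordinates. The sketch below shows this for shift and rotation in pixel coordinates. The function names are hypothetical, and the rotation sign has to be matched to whatever image rotation routine is actually used.

```python
import numpy as np

def shift_keypoints(pts, dx, dy):
    """Translate (N, 2) keypoints by the same pixel offset applied to the image."""
    return pts + np.array([dx, dy], dtype=np.float64)

def rotate_keypoints(pts, angle_deg, center):
    """Rotate (N, 2) keypoints about `center` by `angle_deg`.

    With x to the right and y downward (image coordinates), a positive angle
    moves points clockwise on screen. Flip the sign if the image rotation
    routine in use follows the opposite convention.
    """
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    center = np.asarray(center, dtype=np.float64)
    return (pts - center) @ rot.T + center

pts = np.array([[2.0, 1.0], [1.0, 1.0]])
rotated = rotate_keypoints(pts, 90, center=(1.0, 1.0))  # [[1, 2], [1, 1]]
shifted = shift_keypoints(pts, dx=3, dy=-2)             # [[5, -1], [4, -1]]
```

A horizontal flip needs one extra step beyond mirroring x: left and right landmarks swap roles, so the point order has to be remapped as well. Color jitter leaves the keypoints untouched.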

Model Structure

I used a similar structure to Part 1, but increased the number of convolutional layers to 5. The number of channels increases to 64 and then decreases back down to 32.

Every Conv2d is followed by a ReLU and a MaxPool2d(kernel_size=2), with the exception of convs[0], which only has a ReLU.

  (convs): ModuleList(
    (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1)) # +ReLU
    (1): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1)) # +ReLU/MaxPool
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1)) # +ReLU/MaxPool
    (3): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1)) # +ReLU/MaxPool
    (4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1)) # +ReLU/MaxPool
  )
  (fc1): Linear(in_features=3744, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=116, bias=True)
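The `in_features` of `fc1` can be checked with a little shape arithmetic. The helper names below are my own. The sketch assumes unpadded 3x3 convolutions with stride 1 and 2x2 max pooling with the default stride, as in the listings for both parts.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2):
    """Output length after MaxPool2d with stride equal to the kernel size."""
    return size // kernel

def flattened_features(height, width, channels_out, pool_after):
    """pool_after[i] says whether conv i is followed by a 2x2 max pool."""
    for pooled in pool_after:
        height, width = conv_out(height), conv_out(width)
        if pooled:
            height, width = pool_out(height), pool_out(width)
    return channels_out * height * width

part1 = flattened_features(60, 80, 32, [False, True, True])
part2 = flattened_features(180, 240, 32, [False, True, True, True, True])
print(part1, part2)  # 7488 3744
```

Even though the Part 2 input is nine times larger in area, the two extra pooling stages leave fewer flattened features than in Part 1.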


I trained my network to minimize MSELoss using the Adam optimizer with learning rate 0.001 over 500 epochs using mini batch size 32. The train + validation (labeled as test) loss graph is shown below (calculated every 4 epochs). The loss is summed over all the keypoints using nn.MSELoss(reduction='sum'), which is why it starts so high.
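To see how much the reduction mode scales the reported number, the sketch below compares 'sum' and 'mean' on one random mini-batch of the same shape (32 images, 116 outputs). The tensors are random placeholders, not project data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pred = torch.rand(32, 116)     # one mini-batch: 32 images x 116 outputs
target = torch.rand(32, 116)

loss_sum = nn.MSELoss(reduction='sum')(pred, target)
loss_mean = nn.MSELoss(reduction='mean')(pred, target)
ratio = (loss_sum / loss_mean).item()  # number of elements: 32 * 116 = 3712
```

The summed loss is larger than the mean by the number of elements it covers, so the two curves have the same shape and differ only in scale.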

Correct Examples
(Blue is Ground Truth Label, Red is Prediction)
Incorrect Examples

I think the model has a hard time with these two cases because both individuals have long hair, which is uncommon in the dataset.

Here is a visualization of the filters in the first conv layer. Since they are only 3x3, there is not much interesting about them.

Part 3: Train With Larger Dataset

In this part, I trained a ResNet18 model to predict keypoints on images from the ibug dataset.

Data Preparation

I used the same augmentations as Part 2, but without flips since the dataset already contains flipped images. I also applied only one of the rotate and shift augmentations at a time, chosen at random, so that the points are not moved even further off the sides of the image.

I did not modify the points outside the bbox, since our training set needs to be representative of the test set. In the test set, our model is expected to predict points outside of the bbox, and we are also unable to resize the bbox in the test set since we don't know the ground truths.

Because the images in ibug are large, loading them from disk into memory each epoch takes a long time and slows down training. To resolve this, I cached the resized/cropped images so that image loads are fast after the initial access.
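One simple way to build such a cache is to memoize the preprocessed image by index, as in the sketch below. The class name `CachedImages` and the `load_fn` callback are hypothetical. In a real pipeline, `load_fn` would read, crop, and resize one image.

```python
class CachedImages:
    """Wrap an expensive loader so each item is loaded and preprocessed once."""

    def __init__(self, paths, load_fn):
        self.paths = list(paths)
        self.load_fn = load_fn   # hypothetical: reads, crops, and resizes one image
        self._cache = {}

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        if idx not in self._cache:
            self._cache[idx] = self.load_fn(self.paths[idx])
        return self._cache[idx]

# Demo with a fake loader that records how often it is called
calls = []
def fake_load(path):
    calls.append(path)
    return f"pixels of {path}"

data = CachedImages(["a.jpg", "b.jpg"], fake_load)
first_epoch = [data[i] for i in range(len(data))]
second_epoch = [data[i] for i in range(len(data))]  # served from the cache
```

With a design like this, random augmentations would need to be applied after the cache lookup so they still vary from epoch to epoch.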

Model Structure

I used a pretrained ResNet18 model as a starting point.

For the first conv layer, I changed the number of input channels to 1 in order to accept grayscale images. To keep the filters from the pretrained model, I summed the weights of the original first layer along the input-channel axis (torch.Size([64, 3, 7, 7]) -> torch.Size([64, 1, 7, 7])) and loaded them into the new first layer. This matters because if the first layer were set to random weights, the remaining pretrained layers would be of little use, since they depend on the specific filters learned by the first layer. I also modified the final layer to output 136 values, the x and y coordinates of the 68 keypoints.

The structure of my model is otherwise identical to ResNet18 (full setup below).

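import torch.nn as nn
from torchvision import models

net = models.resnet18(pretrained=True)

# Sum the pretrained RGB filters over the input-channel axis: [64, 3, 7, 7] -> [64, 1, 7, 7]
state = net.conv1.state_dict()
state['weight'] = state['weight'].sum(dim=1, keepdim=True)
net.conv1 = nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
net.conv1.load_state_dict(state)  # copy the summed filters into the new 1-channel layer

# Replace the classifier head with a 136-value regression output
net.fc = nn.Linear(in_features=512, out_features=136, bias=True)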
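Summing over the input-channel axis works because a 1-channel layer with summed weights gives the same output as the 3-channel layer applied to a grayscale image copied into R, G, and B. The sketch below checks this numerically. It uses randomly initialized layers as stand-ins for the pretrained weights, so it runs without downloading a model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Random weights standing in for the pretrained RGB first layer
rgb_conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
gray_conv = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    gray_conv.weight.copy_(rgb_conv.weight.sum(dim=1, keepdim=True))

gray = torch.rand(2, 1, 32, 32)
with torch.no_grad():
    out_gray = gray_conv(gray)
    out_rgb = rgb_conv(gray.repeat(1, 3, 1, 1))  # same image copied to R, G, B

same = torch.allclose(out_gray, out_rgb, atol=1e-4)
```

Because convolution is linear in its weights, the later pretrained layers see the same activations they would have seen for a gray RGB input.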


I trained my network to minimize MSELoss using the Adam optimizer over 30 epochs. I started my LR at 0.0001 and decreased it to 0.00005 after 25 epochs. I used a mini batch size of 8.
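One way to express this learning rate schedule is with a step scheduler, sketched below. Whether the original training used a scheduler or changed the rate by hand is not stated, so treat `MultiStepLR` here as an assumption. The model is a small placeholder.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # placeholder for the ResNet18 model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate once 25 epochs have completed: 1e-4 -> 5e-5
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[25], gamma=0.5)

lrs = []
for epoch in range(30):
    lrs.append(optimizer.param_groups[0]['lr'])
    # ... one epoch of training with mini-batches of 8 would go here ...
    optimizer.step()   # stands in for the real parameter updates
    scheduler.step()
```

Here `lrs` holds 0.0001 for the first 25 epochs and 0.00005 for the last 5.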

Thanks to the image cache described earlier, each epoch took only 30 seconds to train (after the initial epoch that populates the cache).

The train + validation (labeled as test) loss graph is shown below.

The test loss starts out smaller than the train loss because the training loss was calculated on a running basis over each minibatch, while the test loss was only calculated at the end of the epoch. I calculated the loss using reduction='mean'.
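A toy calculation shows the effect. The numbers below are made up for illustration and are not measurements from this project. They mimic one early epoch in which the loss drops steadily from batch to batch.

```python
# Made-up per-minibatch losses for one epoch in which the model improves steadily
batch_losses = [10.0, 8.0, 6.0, 4.0, 2.0]

# Running training loss: averages over weights from the start, middle, and end of the epoch
running_train_loss = sum(batch_losses) / len(batch_losses)

# An end-of-epoch evaluation only sees the final weights; the last batch loss is a rough proxy
end_of_epoch_loss = batch_losses[-1]
```

The running average (6.0) includes the poor early weights, while the end-of-epoch figure (2.0) does not. The gap shrinks in later epochs, when the loss changes little within an epoch.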


My model achieved an MAE score of 8.03412. My Kaggle profile is andrewke2.

Example Images

Examples from the test set

Even when the face is occluded, or has an unusual expression (tongue sticking out), the model still performs very well.

My own images
Logan Couture

The results are good on my own face, since the photo is pretty standard. On Logan Couture, they are also quite good, even with the hockey visor/equipment in the way.

On Woody, it gets the nose and mouth right, but misses the eyes and face shape since they are unrealistically sized.

Morphing Large Datasets (Bells and Whistles)

We can apply the CNN model from part 3 to Project 3 to automatically morph together large datasets, without having to pick the points by hand.

This part is similar to what I did for Project 3, but now using my own keypoint model instead of dlib's.

The dataset I worked with was headshots of Major League Baseball players, and more specifically SF Giants players. I scraped the images of all players on each team's active roster from the MLB website. In total my dataset had 839 photos. I used dlib to predict the bounding box of the faces, which is the same as what the creators of the ibug dataset did.

Morphing Movie

Using these keypoints, I created a morph movie that blends together all players on the Giants active roster. The transition between no smile and smile at 0:11 in the video is my favorite part.

Mean Faces

I also calculated the mean image of the Giants and compared it to other teams' averages, such as the LA Dodgers and the Houston Astros (who have the most international players on their roster of any team).

Finally, I morphed the face of all 839 MLB players to the mean shape to find the face of the average baseball player.