CS 194-26: Project 5

Facial Keypoint Detection with Neural Networks

Emily Ma

Summary, What I Learned: In this project, I used convolutional neural networks (CNNs) to predict facial keypoints from images. I started by predicting just the nose tip point and worked up to the full set of keypoints on a larger dataset, and I also tried the models on some of my own images. I learned how to construct CNNs as well as how to tune hyperparameters by analyzing the training and validation loss over time.


Part 1: Nose Tip Detection

To train a neural net to predict the entire set of facial keypoints, I first started with training a network to predict just the nose tip. I used a custom dataloader that converts my images to grayscale, resizes them, and normalizes all of the pixel values and keypoint coordinates. Below are some examples of the ground-truth points sampled from the dataloader.
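For reference, here is a minimal sketch of what such a dataloader can look like in PyTorch. The class name, the file-reading details, the 80x60 working resolution, and the exact normalization ranges are my own illustration, not the project's actual code.

import torch
from torch.utils.data import Dataset
import numpy as np
import cv2

class NoseTipDataset(Dataset):
    """Loads face images, converts to grayscale, resizes, and normalizes
    both pixel values and the nose-tip keypoint (names/sizes are illustrative)."""

    def __init__(self, image_paths, nose_points, out_size=(80, 60)):
        self.image_paths = image_paths   # list of image file paths
        self.nose_points = nose_points   # (N, 2) keypoints as fractions of width/height
        self.out_size = out_size         # (width, height) of the resized image

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx], cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, self.out_size)
        # Normalize pixel values to roughly [-0.5, 0.5]
        img = img.astype(np.float32) / 255.0 - 0.5
        img = torch.from_numpy(img).unsqueeze(0)   # (1, H, W): single grayscale channel
        point = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return img, point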


After setting up my dataloader, I set up my CNN for 1 input channel and a batch size of 16. Here is a picture of the setup for my CNN, trained with a learning rate of 0.001. I also show three good results as well as three bad results. Some results were worse probably because the CNN overfit to faces looking straight at the camera, since most of the images in the dataset are neutral, front-facing faces. This is why photos with different lighting or different head orientations had worse results overall.
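Since the architecture picture is not reproduced here, below is a rough sketch of a small CNN consistent with the description: 1 grayscale input channel, four convolutional layers (matching the 4-layer baseline compared against later), and two output coordinates for the nose tip. The channel counts, kernel sizes, and the assumed 80x60 input are illustrative assumptions.

import torch.nn as nn

class NoseNet(nn.Module):
    """Small CNN for nose-tip regression: 4 conv layers + 2 fully connected.
    Channel counts and the 80x60 input size are assumptions, not the exact setup."""

    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 12, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 24, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),  # 60x80 input -> 7x10 after three 2x pools
            nn.Linear(128, 2),                       # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.fc(self.convs(x))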


I used MSE loss to track how well the CNN predicted the nose tip keypoint on the training set and validation set at each epoch (25 epochs total). Here is the graph of training loss and validation loss.
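A sketch of a training loop that records these per-epoch losses might look like the following; the choice of Adam here is an assumption, since the write-up does not name the optimizer.

import torch

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    """Trains with MSE loss and records average per-epoch training and
    validation loss (a sketch of the loop described above)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimizer
    criterion = torch.nn.MSELoss()
    train_losses, val_losses = [], []
    for epoch in range(epochs):
        model.train()
        total = 0.0
        for imgs, pts in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), pts)
            loss.backward()
            optimizer.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))

        model.eval()
        total = 0.0
        with torch.no_grad():
            for imgs, pts in val_loader:
                total += criterion(model(imgs), pts).item()
        val_losses.append(total / len(val_loader))
    return train_losses, val_losses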

I also experimented with the hyperparameters to see how they would change my model. For the first loss graph below, I changed the CNN so that there were only 3 convolutional layers instead of 4. This actually improved the fit a bit, as the losses on the graph are lower overall. For the second graph below, I changed the learning rate to 0.1. This turned out to be a bad move: the losses are huge now because the weights adjust too much at each step.

Part 2: Full Facial Keypoints Detection

For this part, I extended the approach from Part 1 to predict the entire set of facial keypoints at once. I again used a custom dataloader that converts my images to grayscale, resizes them, and normalizes all of the pixel values and keypoint coordinates. Below are some examples of the ground-truth points sampled from the dataloader.


After setting up my dataloader, I set up my CNN for 1 input channel and a batch size of 16, again with a learning rate of 0.001; a picture of the setup is below. I also show three good results as well as three bad results. Some results were worse probably because the CNN overfit to faces looking straight at the camera, since most of the images in the dataset are neutral, front-facing faces. This is why photos with different lighting or different head orientations had worse results overall.


I used MSE loss to track how well the CNN predicted the full set of keypoints on the training set and validation set at each epoch (25 epochs total). Here is the graph of training loss and validation loss, along with a visualization of the learned filters.
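A filter visualization like this one can be produced by plotting the first convolutional layer's weights as small grayscale images. This sketch assumes the model exposes its layers as in the earlier sketch (model.convs[0] is a hypothetical accessor, not the project's actual attribute name).

import matplotlib.pyplot as plt

def show_filters(model):
    """Plots each filter of the first conv layer as a small grayscale image."""
    weights = model.convs[0].weight.detach().cpu()  # (out_channels, 1, kH, kW)
    n = weights.shape[0]
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
    for i, ax in enumerate(axes):
        ax.imshow(weights[i, 0], cmap='gray')
        ax.axis('off')
    plt.show()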

Part 3: Train With Larger Dataset

To work with a larger dataset, I used the ibug dataset with 6666 images. For each image, I cropped to the bounding box with a 20% margin, since many landmarks fell outside the bounding box. I resized all of my images to 224x224 in grayscale. I loaded the images and landmarks in the same way as Part 2, and also did data augmentation through rotation and color jitter. Here are some sample images from the dataloader below.
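Below is a rough sketch of the cropping and rotation augmentation described above. The margin handling and the ±15 degree range are assumptions; color jitter affects pixel values only (so the keypoints are untouched) and can be layered on with torchvision's transforms.ColorJitter.

import numpy as np
import cv2

def crop_with_margin(img, bbox, margin=0.2):
    """Expands the bounding box by `margin` on each side before cropping,
    since many landmarks fall outside the original box (a sketch)."""
    x, y, w, h = bbox
    x0 = max(int(x - margin * w), 0)
    y0 = max(int(y - margin * h), 0)
    x1 = min(int(x + w + margin * w), img.shape[1])
    y1 = min(int(y + h + margin * h), img.shape[0])
    return img[y0:y1, x0:x1], (x0, y0)

def rotate_sample(img, pts, max_deg=15):
    """Rotates the image and its landmarks together by a random angle;
    `pts` are pixel coordinates of shape (68, 2). max_deg is an assumption."""
    angle = np.random.uniform(-max_deg, max_deg)
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)  # 2x3 affine matrix
    img = cv2.warpAffine(img, M, (w, h))
    pts = np.hstack([pts, np.ones((len(pts), 1))]) @ M.T     # same affine on the points
    return img, pts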



I used MSE loss to track how well the CNN predicted the facial keypoints on the training set and validation set at each epoch (40 epochs total). I used a batch size of 128, a learning rate of 0.001, and the same optimizer as Part 1. For the CNN, I used ResNet18, as recommended, with the following architecture:

ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)
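The printout above corresponds to a torchvision ResNet18 with the first conv layer changed to accept 1 grayscale channel and the final fully connected layer changed to output 136 values (68 x, y pairs). A sketch of that modification:

import torch.nn as nn
import torchvision.models as models

# ResNet18 adapted for grayscale input and 68 (x, y) landmark outputs,
# matching the printed architecture above.
model = models.resnet18()
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 68 * 2)  # 136 outputs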

The results were decent overall, and the model predicted keypoints well for most images. The results above are the predictions on the test set. I did notice that the CNN did not do as well on images with busy backgrounds or on children, since their faces were usually smaller and wider than the adult faces that make up the majority of the dataset. It also had some trouble with faces with unusual expressions, glasses, or occlusions. Here is the graph of training loss and validation loss.

When submitting my predictions to Kaggle, I got an MAE of 32.18074, submitted under the username Emily Ma. I also used the model to predict the keypoints of my own images. Overall, the CNN captured the general face shape, although it was not very accurate on the outline of the face. It seemed to do worse on the image where my friend's face is turned to the side rather than facing straight on.
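For completeness, here is a hedged sketch of how predictions on one of my own photos can be produced. The [0, 1] keypoint normalization and the preprocessing details are assumptions, not the project's exact pipeline.

import torch
import cv2
import numpy as np

def predict_keypoints(model, path, size=224):
    """Grayscale, resize, normalize, forward pass, then map the predicted
    keypoints back to the original image's pixel coordinates."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = img.shape
    inp = cv2.resize(img, (size, size)).astype(np.float32) / 255.0 - 0.5
    inp = torch.from_numpy(inp)[None, None]          # shape (1, 1, 224, 224)
    model.eval()
    with torch.no_grad():
        pts = model(inp).reshape(68, 2).numpy()
    return pts * np.array([w, h])                    # assumes [0, 1] normalized outputs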