CS 294-26: Intro to Computer Vision and Computational Photography, Fall 2022

Project 5: Facial Keypoint Detection with Neural Networks

Katherine Song (cs-194-26-acj)



Overview

In this project, we explore the use of convolutional neural networks to detect facial keypoints. We use PyTorch to load datasets and to design and train CNNs.

Part 1: Nose Tip Detection

First, relying heavily on the PyTorch tutorial from recitation and the provided PyTorch tutorials, I created a dataset class for the images and their nose keypoints. Within this class, I convert each image to grayscale, normalize pixel values to [-0.5, 0.5], and resize the image to a more manageable 80x60. I then split the provided IMM Face Database into a training set (the first 192 images) and a validation set (the last 48 images), and used torch.utils.data.DataLoader to load the data. Below are some sampled images from the training dataset with their ground-truth nose keypoints.
Sampled images from dataloader visualized with ground-truth keypoints
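For reference, below is a minimal sketch of the dataset class described above. The constructor arguments and the cv2-based image loading are placeholders rather than my literal code; only the grayscale conversion, the normalization to [-0.5, 0.5], and the 80x60 resize reflect the actual pipeline.

import torch
from torch.utils.data import Dataset
import cv2  # assumed here for loading/resizing; PIL or skimage would work equally well

class NoseKeypointDataset(Dataset):
    """Sketch: returns (80x60 grayscale image tensor, nose keypoint)."""
    def __init__(self, image_paths, keypoints):
        # keypoints are assumed to already be (x, y) pairs for the nose tip
        self.image_paths = image_paths
        self.keypoints = keypoints

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx], cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, (80, 60))                # dsize is (width, height) = (80, 60)
        img = img.astype('float32') / 255.0 - 0.5      # normalize to [-0.5, 0.5]
        img = torch.from_numpy(img).unsqueeze(0)       # add channel dim -> 1 x 60 x 80
        kp = torch.tensor(self.keypoints[idx], dtype=torch.float32)
        return img, kp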
I next wrote my convolutional neural network (CNN). I used 3 convolutional layers: the first has 1 input channel and 12 output channels, and the other 2 have 12 input and 12 output channels. Following the recommendations of the VGG paper described in class, I fixed the kernel size of each layer to 3x3. Each convolutional layer is followed by a ReLU and then a 2x2 maxpool. Since an unpadded 3x3 convolution shrinks the image by 2 pixels in each dimension and a 2x2 max pool halves it, the 80x60 input becomes 8x5 after the 3 convolutional layers + maxpools, so the input to the first fully connected layer was set to 12*8*5=480, with an output of 8 features. A ReLU was applied to this layer, and a second fully connected layer then produces the final 2 outputs (the x and y coordinates of the nose keypoint). I used a batch size of 2 when loading my training dataset (and made sure to set the shuffle parameter to True) and a batch size of 64 when loading my validation dataset. I cut off training after 10 epochs because the model seemed to converge well before that point.
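A minimal sketch of this architecture as a PyTorch module is below; the layer names are my own, but the channel and feature sizes match the description above.

import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, 3)     # 1 input channel -> 12 channels, 3x3 kernel
        self.conv2 = nn.Conv2d(12, 12, 3)
        self.conv3 = nn.Conv2d(12, 12, 3)
        self.fc1 = nn.Linear(12 * 8 * 5, 8)  # 80x60 input shrinks to 8x5 after 3 conv + pool stages
        self.fc2 = nn.Linear(8, 2)           # final (x, y) nose coordinate

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = x.view(x.size(0), -1)            # flatten to 480 features
        x = F.relu(self.fc1(x))
        return self.fc2(x)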

Varying hyperparameters

I tried varying a few hyperparameters -- specifically the learning rate and the channel sizes -- to see how they affected results. When the learning rate is too low, the CNN learns too slowly. Below are epoch-to-epoch results with a far-too-low learning rate of 1E-6: the predicted nose keypoint stays in a corner and doesn't perceptibly move toward the correct solution.
Learning rate = 1E-6 (way too low)
On the other hand, when the learning rate is too high, the model is prone to converging on a sub-optimal solution. The epoch-to-epoch results with a too-high learning rate of 0.1 are shown below: after the 4th epoch, the nose keypoint prediction barely moves, yet it remains inaccurate.
Learning rate = 0.1 (too high)
I also tried varying the channel sizes of the CNN's layers. As an extreme example, when I multiplied the channel sizes of all layers by 10 (except the output of the last layer), the CNN performed very similarly to the baseline (i.e., a pretty good solution for most images was reached within 10 epochs). An example of epoch-to-epoch predictions is below:
Channel sizes 10x
At the other extreme, setting all the convolutional layers' channel sizes to 1, the predicted nose keypoints were all very wrong -- the model converged to roughly the same spot on people's foreheads in pretty much every image. Below is an epoch-to-epoch example.
Convolutional layer channel sizes=1
From this very small experiment, it seems that channel size must be increased to be "large enough," but past some point, perhaps the gains are minimal.

Results of final model

Below is the training and validation loss curve of my final model with a learning rate of 2E-4:
Learning rate = 2E-4 (good)
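For reference, the training and validation losses above come from a standard PyTorch training loop. The sketch below assumes MSE loss and the Adam optimizer for illustration; the function and variable names are placeholders.

import torch

def train(model, train_loader, val_loader, num_epochs=10, lr=2e-4):
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        model.train()
        for imgs, kps in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), kps)
            loss.backward()
            optimizer.step()
        # validation loss for this epoch (no gradient updates)
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(imgs), kps).item() for imgs, kps in val_loader)
        print(f"epoch {epoch}: val loss {val_loss / len(val_loader):.5f}")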
After tuning, below are validation-set samples where the CNN predicted the nose keypoint well, along with 2 images where it did not. I think the lack of people with voluminous hair in the training dataset contributed to the poor performance on subject 40. The predictions generally weren't great for that subject, though the example shown below is the worst, possibly because, in addition to having few voluminous-haired people, the training dataset doesn't have many people with expressions similar to subject 40's wide, gaping mouth and eyes rolled back. Similarly, bald people weren't represented in the training dataset, so the predictions for subject 33 from the validation dataset, where the subject was looking off to the side, were especially bad.
Good predictions
Poor predictions (40-6m and 33-3m)

Part 2: Full Facial Keypoints Detection

This part was conceptually similar to part 1. Below are sampled images from the training dataloader visualized with ground-truth keypoints in green:
Images with ground-truth keypoints
A few key modifications were that the CNN needed to be larger to accommodate the larger image size, and the training dataset needed to be augmented. I used a net with 5 convolutional layers and 2 fully connected layers. The layers of this part's model are below:

(conv1): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1))
(conv2): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
(conv3): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
(conv4): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1))
(conv5): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1))
(fc1): Linear(in_features=3840, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=116, bias=True)

Similar to part 1, I kept the filter size of each convolutional layer at 3x3, but I increased the number of output channels in each layer to 16, 32, 64, 128, and 256. Each convolutional layer is followed by a leaky ReLU and a 2x2 maxpool, and the first fully connected layer is followed by a leaky ReLU. Getting rid of the maxpools slowed the code down significantly without giving noticeable improvements, but replacing all the ReLUs with leaky ReLUs seemed to help mitigate an initial issue where the predictions for every image were almost identical (and wrong). As for other hyperparameters, I chose a learning rate of 1E-3 and a batch size of 4, and I trained the model for 12 epochs. For augmentation, I used color jittering (varying brightness and saturation), followed by a random crop (essentially shifting the image), followed by a random rotation between -12 and 12 degrees, before finally rescaling the images to 240x180. I added a second batch of augmentations that also squished the faces horizontally (I noticed that my model initially performed fairly poorly on narrow faces). I wrote my own classes for these, using the source code of torchvision.transforms as a base and adding a few lines to also update the landmarks as needed; a sketch of one such transform is shown below.
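Below is a minimal sketch of a landmark-aware random rotation in the spirit of what I describe above, built on torchvision.transforms.functional. The class interface and names are illustrative rather than my literal code.

import math, random
import torch
import torchvision.transforms.functional as TF

class RandomRotationWithLandmarks:
    """Rotate the image by a random angle and rotate the landmarks to match."""
    def __init__(self, max_degrees=12):
        self.max_degrees = max_degrees

    def __call__(self, image, landmarks):
        # image: C x H x W tensor; landmarks: N x 2 float32 tensor of (x, y) pixel coordinates
        angle = random.uniform(-self.max_degrees, self.max_degrees)
        image = TF.rotate(image, angle)
        # rotate the landmarks by the same angle about the image center;
        # TF.rotate is counter-clockwise on screen, which in y-down pixel coords is -angle
        theta = math.radians(-angle)
        cx, cy = image.shape[-1] / 2.0, image.shape[-2] / 2.0
        center = torch.tensor([cx, cy])
        rot = torch.tensor([[math.cos(theta), -math.sin(theta)],
                            [math.sin(theta),  math.cos(theta)]])
        landmarks = (landmarks - center) @ rot.T + center
        return image, landmarks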
Training and validation losses
Good predictions
Poor predictions
It turns out that some of the images with poor predictions in the first part also resulted in poor predictions in the second part. For some, the image augmentations helped. However, for others, such as the particular ones that failed above, they did not. None of the augmentations, for example, made people more or less bald or turned their faces further toward or away from the side. Similarly, none of them added people with voluminous hair or changed facial expressions. In the end, these faces were still ill-represented even in the augmented dataset. With each round of training, the model "learns" filters that correspond to characteristic features in the images. As an example, below are the 16 3x3 filters from the first convolutional layer:
Learned filters from 1st convolutional layer

Part 3: Train with Larger Dataset

For this part, I wrote my code in Google Colab with a GPU so that I could train a larger model on a larger dataset. The dataloader for this part was largely the same as for Part 2, except that images were first cropped to their given bounding boxes. Some of the bounding boxes extended beyond the image boundaries or did not include all the landmarks, so I made a few adjustments to ensure that the bounding boxes were valid, expanding them as necessary to include all the keypoints. The cropped images were then rescaled to 224x224 pixels. I used color jitter and random rotation for data augmentation. Below are some examples of images from the dataloader with ground-truth keypoints (a sketch of the bounding-box adjustment follows the figure):
Images with ground-truth keypoints
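A minimal sketch of the bounding-box adjustment is below. The data layout is an assumption: bounding boxes are taken to be (x, y, w, h) in pixels and landmarks to be an Nx2 array of (x, y) coordinates.

import numpy as np

def adjust_bbox(bbox, landmarks, img_w, img_h, pad=10):
    """Expand the bbox to contain every landmark, then clip it to the image."""
    x, y, w, h = bbox
    x0, y0, x1, y1 = x, y, x + w, y + h
    # expand to cover all landmarks, with a small margin
    x0 = min(x0, landmarks[:, 0].min() - pad)
    y0 = min(y0, landmarks[:, 1].min() - pad)
    x1 = max(x1, landmarks[:, 0].max() + pad)
    y1 = max(y1, landmarks[:, 1].max() + pad)
    # clip to the image boundaries so the crop is valid
    x0, y0 = max(0, int(np.floor(x0))), max(0, int(np.floor(y0)))
    x1, y1 = min(img_w, int(np.ceil(x1))), min(img_h, int(np.ceil(y1)))
    return x0, y0, x1 - x0, y1 - y0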
As suggested by the assignment description, I used ResNet18 for my CNN architecture. I kept the default parameters of the model, except that I changed the first convolutional layer to take 1 input channel and the final fully connected layer to output 68*2=136 values. For hyperparameters, I used a learning rate of 1E-3, a batch size of 4, and 15 epochs. The detailed architecture is below (a sketch of the modification follows the listing):

ResNet(
(conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
 (0): BasicBlock(
   (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (relu): ReLU(inplace=True)
   (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
 (1): BasicBlock(
   (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (relu): ReLU(inplace=True)
   (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(layer2): Sequential(
 (0): BasicBlock(
   (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
   (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (relu): ReLU(inplace=True)
   (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (downsample): Sequential(
     (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
     (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   )
 )
 (1): BasicBlock(
   (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (relu): ReLU(inplace=True)
   (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(layer3): Sequential(
 (0): BasicBlock(
   (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
   (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (relu): ReLU(inplace=True)
   (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (downsample): Sequential(
     (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
     (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   )
 )
 (1): BasicBlock(
   (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (relu): ReLU(inplace=True)
   (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(layer4): Sequential(
 (0): BasicBlock(
   (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
   (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (relu): ReLU(inplace=True)
   (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (downsample): Sequential(
     (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
     (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   )
 )
 (1): BasicBlock(
   (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (relu): ReLU(inplace=True)
   (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=512, out_features=136, bias=True)
)
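A sketch of the two modifications mentioned above (1-channel input, 136 outputs) is below. This is the standard way to swap layers of torchvision's resnet18 and is consistent with the listing, though my actual code may differ in minor details.

import torch.nn as nn
import torchvision

model = torchvision.models.resnet18()  # whether pretrained weights were used is not assumed here
# accept 1-channel (grayscale) input instead of 3-channel RGB
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# predict 68 (x, y) keypoints = 136 values instead of the default 1000 class scores
model.fc = nn.Linear(model.fc.in_features, 68 * 2)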

Below I plot the training and validation losses over 15 epochs. To be honest, it's not clear to me why the validation error increased in the 13th epoch; I thought it might have been overfitting. I compared the results on the test set after 10 and 15 epochs, though, and the results after 15 epochs were still better, so I stuck with that. The validation error also continued to decrease after the 13th epoch, so it might have just been a "bad" epoch (from what I understand, ideally you would throw that epoch out and resume training from the previous checkpoint, but I didn't do that here).

Training and validation losses over 15 epochs
Below are some images from validation with ground-truth and predicted landmarks:
Ground-truth (green) and predicted (red) landmarks
And below are some images from the test set with predicted landmarks:
I then ran the trained model on some images from my own collection:
The model's predictions for the 2nd and 3rd photos above look pretty good. The model is also particularly good on faces from the test set that are looking more or less straight at the camera, don't have too many accessories, and belong to middle-aged people. It is less accurate on photos like the first one in my personal collection. That is a picture of me as an infant (and a pretty dark picture at that), and I don't think infants were well-represented in the training dataset, so it makes sense that the model would be less accurate there.

Part 4: Pixelwise Classification

For this part, we turn the keypoint regression problem into pixelwise classification by converting each keypoint into a heatmap that represents a probability distribution. To generate the heatmaps, I used a 2D Gaussian with kernel size 11 and sigma 5. A normalized Gaussian (summing to 1) yielded pretty poor results, so I ended up scaling the Gaussian so that its peak value was 1. In theory, for each keypoint I would convolve this filter with a 2D array that is 1 at the keypoint location and 0 everywhere else. However, doing this in the __getitem__() function made training extremely slow, so I simply used subindexing to place the Gaussian on an empty heatmap, centered on the keypoint coordinates (a sketch of this construction follows the figure). I also skipped data augmentation for this part to help speed up training. The accumulated heatmaps of all landmarks for a few images from the training dataloader are shown below (I used a linear red + alpha channel to visualize these, because I find the classic "rainbow" colormap often used for heatmaps to be less intuitive):
Accumulated heatmaps of all landmarks
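A minimal sketch of this heatmap construction (placing a peak-1 Gaussian at each keypoint via subindexing rather than a full convolution) is below; the helper name and exact index handling are my own.

import numpy as np

def make_heatmap(kp, height, width, ksize=11, sigma=5.0):
    """Return a (height x width) heatmap with a peak-1 Gaussian centered on keypoint kp=(x, y)."""
    # build the (ksize x ksize) Gaussian kernel, scaled so its peak is 1
    r = ksize // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    g = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))

    heatmap = np.zeros((height, width), dtype=np.float32)
    cx, cy = int(round(kp[0])), int(round(kp[1]))
    # intersect the kernel window with the image so keypoints near the border still work
    x0, x1 = max(0, cx - r), min(width, cx + r + 1)
    y0, y1 = max(0, cy - r), min(height, cy + r + 1)
    heatmap[y0:y1, x0:x1] = g[(y0 - cy) + r:(y1 - cy) + r, (x0 - cx) + r:(x1 - cx) + r]
    return heatmap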
For the CNN for this part, I used a pre-trained UNet from here. I left all parameters at their defaults except the number of output channels, which I changed to 68 (for both the "classifier" and the "auxiliary classifier") so that the model outputs 68 heatmaps for each image; a sketch of this change follows the listing. Since the model takes 3-channel input, I modified my dataloader to return color images. Below is the detailed architecture of the model:

  

UNet(
 (encoder1): Sequential(
   (enc1conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (enc1norm1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (enc1relu1): ReLU(inplace=True)
   (enc1conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (enc1norm2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (enc1relu2): ReLU(inplace=True)
 )
 (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
 (encoder2): Sequential(
   (enc2conv1): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (enc2norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (enc2relu1): ReLU(inplace=True)
   (enc2conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (enc2norm2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (enc2relu2): ReLU(inplace=True)
 )
 (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
 (encoder3): Sequential(
   (enc3conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (enc3norm1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (enc3relu1): ReLU(inplace=True)
   (enc3conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (enc3norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (enc3relu2): ReLU(inplace=True)
 )
 (pool3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
 (encoder4): Sequential(
   (enc4conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (enc4norm1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (enc4relu1): ReLU(inplace=True)
   (enc4conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (enc4norm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (enc4relu2): ReLU(inplace=True)
 )
 (pool4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
 (bottleneck): Sequential(
   (bottleneckconv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bottlenecknorm1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (bottleneckrelu1): ReLU(inplace=True)
   (bottleneckconv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (bottlenecknorm2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (bottleneckrelu2): ReLU(inplace=True)
 )
 (upconv4): ConvTranspose2d(512, 256, kernel_size=(2, 2), stride=(2, 2))
 (decoder4): Sequential(
   (dec4conv1): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (dec4norm1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (dec4relu1): ReLU(inplace=True)
   (dec4conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (dec4norm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (dec4relu2): ReLU(inplace=True)
 )
 (upconv3): ConvTranspose2d(256, 128, kernel_size=(2, 2), stride=(2, 2))
 (decoder3): Sequential(
   (dec3conv1): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (dec3norm1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (dec3relu1): ReLU(inplace=True)
   (dec3conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (dec3norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (dec3relu2): ReLU(inplace=True)
 )
 (upconv2): ConvTranspose2d(128, 64, kernel_size=(2, 2), stride=(2, 2))
 (decoder2): Sequential(
   (dec2conv1): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (dec2norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (dec2relu1): ReLU(inplace=True)
   (dec2conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (dec2norm2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (dec2relu2): ReLU(inplace=True)
 )
 (upconv1): ConvTranspose2d(64, 32, kernel_size=(2, 2), stride=(2, 2))
 (decoder1): Sequential(
   (dec1conv1): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (dec1norm1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (dec1relu1): ReLU(inplace=True)
   (dec1conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (dec1norm2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (dec1relu2): ReLU(inplace=True)
 )
 (conv): Conv2d(32, 68, kernel_size=(1, 1), stride=(1, 1))
)
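For concreteness, a sketch of the output-channel change is below. I am assuming here that the linked UNet is the brain-segmentation UNet published on PyTorch Hub (the layer names in the listing match that implementation); if it is loaded differently, only the loading line changes, and the final 1x1 convolution swap is what produces the (conv): Conv2d(32, 68, ...) entry above.

import torch
import torch.nn as nn

# Assumed loading path (see note above); swap this line out if the UNet comes from elsewhere.
unet = torch.hub.load('mateuszbuda/brain-segmentation-pytorch', 'unet',
                      in_channels=3, out_channels=1, init_features=32, pretrained=True)

# Replace the final 1x1 convolution so the model outputs 68 heatmaps, one per keypoint.
unet.conv = nn.Conv2d(32, 68, kernel_size=1)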

For hyperparameters, I used a learning rate of 1E-3, a batch size of 4, and 10 epochs. I cut training off after 10 epochs mainly because it was taking so long to run (each epoch took about an hour). Below is a plot of the training and validation loss across the 10 epochs:

Training and validation loss using heatmaps
Predicted heatmaps are turned back into keypoints by taking a weighted average over each heatmap (i.e., multiplying the normalized heatmap by the pixel locations in x and y and summing); a sketch of this conversion follows the figure. Here are a few images with the keypoint predictions on the test set:
Predictions using heatmaps on test set images
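A minimal sketch of the weighted-average conversion described above is below. It operates on one 68 x H x W heatmap tensor; the function name and normalization details are my own.

import torch

def heatmaps_to_keypoints(heatmaps):
    """Convert a (68 x H x W) stack of heatmaps into 68 (x, y) keypoints via a weighted average."""
    n, h, w = heatmaps.shape
    # treat each heatmap as a probability distribution over pixel locations
    probs = heatmaps.clamp(min=0)
    probs = probs / probs.sum(dim=(1, 2), keepdim=True)
    ys = torch.arange(h, dtype=torch.float32).view(1, h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, 1, w)
    x = (probs * xs).sum(dim=(1, 2))   # expected x coordinate per heatmap
    y = (probs * ys).sum(dim=(1, 2))   # expected y coordinate per heatmap
    return torch.stack([x, y], dim=1)  # 68 x 2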
And here are a few images of the trained model run on 3 photos from my own collection.
Predictions using heatmaps on my own images
The model does well on well-lit images with faces looking more or less at the camera, which describes a lot of the images in the given dataset. It doesn't do as well on images that are dark or have significantly rotated faces (as seen especially in the left and right images above from my own collection). This might have been mitigated by the data augmentation techniques used in the previous parts.

Part 5: Kaggle

I submitted the results from my Part 3 model to Kaggle. I didn't make any modifications to the architecture described previously, but I did let the model train for more epochs - 40 in total. The mean absolute error is 9.61931, which is not great, but to be honest I felt that I had struggled enough with this project, so I left it at that. :) My Kaggle username is kws (I show up in the Leaderboard as Katherine).

Bells and Whistles

I tried replacing the Gaussian in part 4 with 1/0 masks (I actually did this before using Gaussians, because the 1/0 masks resulted in faster training and therefore faster debugging). Without much extra optimization, the results are far worse. The model more or less stopped learning after 2 epochs but still had a lot of error:
Training and validation loss using 1/0 masking
Below are results on a few of my images. You can see that the facial keypoints are starting to take shape, but the predictions are still pretty bad overall:
Bad predictions using 1/0 masking
Looking more closely at the generated heatmaps, this technique often resulted in heatmaps with more than one local maximum, so I think that when taking the weighted average, the resulting keypoint ended up being some nonsensical point in between the maxima:
Example heatmap using 1/0 masking