CS194-26 Project 5: Facial Keypoint Detection with Neural Networks

Rishi Upadhyay, rishi.upadhyay@berkeley.edu, 3033975663

In this project, we use neural networks to build tools for detecting facial keypoints. In Part 1, we use a simple CNN to detect nose tip locations before extending it to detect 58 facial keypoints in Part 2. In Part 3, we train a larger CNN on a significantly larger dataset to detect 68 facial keypoints.

Part 1: Nose Tip Detection

In this part, we used a simple CNN to detect nose tip locations from the IMM Face Database. Here are two examples of input images along with the ground truth nose tip locations:



These images look low-resolution because they were downsampled to 80x60 before processing but are visualized here at a higher resolution. The green dot marks the nose-tip location. The network used for this task had 3 convolutional layers followed by 3 linear layers. It was trained for 30 epochs with a learning rate of 1e-3 and MSE loss, using random flipping and shifting as data augmentation. The training and validation losses during training looked like this:
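The flip-and-shift augmentation mentioned above requires adjusting the keypoint coordinates so they keep lining up with the pixels. A minimal sketch of how this could be done with NumPy (the helper name and max shift are hypothetical, assuming keypoints are stored as (x, y) pixel coordinates):

```python
import numpy as np

def random_flip_and_shift(image, keypoints, rng, max_shift=10):
    """Randomly mirror the image horizontally and translate it,
    moving the keypoints by the same amounts."""
    h, w = image.shape[:2]
    image = image.copy()
    keypoints = keypoints.astype(float).copy()

    # Horizontal flip with probability 0.5: mirror pixels and x-coordinates.
    if rng.random() < 0.5:
        image = image[:, ::-1]
        keypoints[:, 0] = (w - 1) - keypoints[:, 0]

    # Random shift: roll the image and offset the keypoints identically.
    dx = int(rng.integers(-max_shift, max_shift + 1))
    dy = int(rng.integers(-max_shift, max_shift + 1))
    image = np.roll(image, (dy, dx), axis=(0, 1))
    keypoints += [dx, dy]
    return image, keypoints
```

As long as the shift stays small relative to the face position, the keypoints remain on the same facial features after the transform.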



Here are some results from the model:

Successes:


Failures:


Although it is hard to know exactly why these failed, the left example likely failed because of the head tilt: the network predicted a location much closer to the center of the face than the true one. The prediction for the right example was similarly too central, although the error there was smaller. Augmenting the training set with more tilted faces would be one way to tackle this.

Part 2: Full Facial Keypoint Detection

In this part, we expanded on part 1 to detect all 58 facial keypoints instead of just the nose. Here are two example images with ground truth locations:



Although these are visualized in color, the network input was grayscale. The network used for this part had 5 convolutional layers and 2 linear layers, laid out as follows:
--------------------------
Conv1 - 5x5 kernel, 1 input channel, 12 output channels, stride=1
--------------------------
MaxPool - 2x2 pooling
--------------------------
Conv2 - 3x3 kernel, 12 input channels, 20 output channels, stride=1
--------------------------
MaxPool - 2x2 pooling
--------------------------
Conv3 - 3x3 kernel, 20 input channels, 32 output channels, stride=1
--------------------------
MaxPool - 2x2 pooling
--------------------------
Conv4 - 3x3 kernel, 32 input channels, 64 output channels, stride=1
--------------------------
MaxPool - 2x2 pooling
--------------------------
Conv5 - 3x3 kernel, 64 input channels, 128 output channels, stride=1
--------------------------
Flatten
--------------------------
Linear - 384 inputs, 128 outputs
--------------------------
ReLU
--------------------------
Linear - 128 inputs, 116 outputs
--------------------------

This network was trained for 20 epochs with a learning rate of 1e-3 and a batch size of 4. Input images were 160x120, and both random flipping and random shifting were applied for data augmentation. Here is a graph of the training and validation losses during training:
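The layout above could be sketched in PyTorch as follows. This is an assumption-laden reconstruction: the table does not list activations, so ReLUs after each convolution are assumed, and nn.LazyLinear is used so the first linear layer's input size (384 in the table) is inferred automatically at the first forward pass rather than hard-coded:

```python
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """Sketch of the Part 2 network: 5 conv layers, 2 linear layers,
    predicting 58 keypoints as 116 (x, y) values."""
    def __init__(self, num_keypoints=58):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, 5),  nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 20, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128),  # flattened size inferred on first forward
            nn.ReLU(),
            nn.Linear(128, 2 * num_keypoints),  # 116 outputs
        )

    def forward(self, x):
        return self.head(self.features(x))
```

A forward pass on a batch of 160x120 grayscale images then yields a (batch, 116) tensor of predicted coordinates.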


Here are some results visualized:

Successes:


Failures:


Similarly to the nose-tip detection, the right image was likely mispredicted because of the head tilt. However, the network also handled a tilted image successfully, which is encouraging. The other failure is more puzzling: the face is straight-on, so the network was likely confused by other facial features. More training could help with these issues.
To further examine the network, we have visualized the CNN filters from the first convolutional layer:



These filters vary significantly from one another, suggesting the network has learned a diverse set of low-level features rather than redundant ones.
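For reference, the filters can be pulled out of the first conv layer and normalized for display with a few lines of PyTorch. This is a generic sketch (the helper name is hypothetical), assuming a single-input-channel first layer as in the networks above; each kernel is rescaled to [0, 1] so it can be rendered as an image:

```python
import torch
import torch.nn as nn

def first_layer_filters(conv):
    """Return the kernels of a 1-input-channel conv layer,
    each normalized to [0, 1] for visualization."""
    w = conv.weight.detach().clone()          # (out_ch, in_ch, kH, kW)
    w = w - w.amin(dim=(2, 3), keepdim=True)  # shift each kernel to start at 0
    w = w / w.amax(dim=(2, 3), keepdim=True).clamp(min=1e-8)
    return w.squeeze(1)                       # drop the single input channel
```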

Part 3: Train with a Larger Dataset

In this part, we expanded on part 2 to train a neural network on a dataset with 6666 images, up from 240 in the previous dataset. To make training feasible in a reasonable time, all code was run remotely on Google Colab. The results were submitted to Kaggle, where I got a score of 32.36. The neural network is based almost entirely on ResNet-18 with two modifications: the first layer accepts images with 1 channel as opposed to 3, and the output of the network is changed to 136 units (68 keypoints, 2 coordinates each). Everything else is kept the same. The network was trained for 10 epochs with a batch size of 128 and a learning rate that started at 1e-3 and decayed by 0.1 every 5 epochs. Here is a plot of the training and validation losses during training:

Here are some examples of outputs from the network when run on the test set:



As we can see, the right-most picture has the worst results, likely because of the tilt in the image; the other two results are much better. As a test, these 3 images were also run through the network:


The model performed reasonably on all three images, though not as well as it did on the original dataset.