CS 194-26: Intro to Computer Vision and Computational Photography

Yukai Luo

Project 5: Facial Keypoint Detection with Neural Networks



Overview

In this project, we learn how to use neural networks to automatically detect facial keypoints.

Part 1: Nose Tip Detection

For the first part, we use the IMM Face Database to train an initial toy model for nose tip detection. All images are annotated with 58 facial keypoints. We used all 6 images of the first 32 persons (indices 1-32) as the training set (32 x 6 = 192 images total) and the images of the remaining 8 persons (indices 33-40) as the validation set (8 x 6 = 48 images).

We first convert each image to grayscale and normalize the uint8 pixel values from the 0-255 range to float values in the range -0.5 to 0.5 (image.astype(np.float32) / 255 - 0.5). After that, we resize the image to a smaller size of 80x60. Once we have the dataloader, we sample a few images and display them along with the nose keypoints as below (a sketch of the dataset code follows the examples):


08-3f.jpg
12-4f.jpg
22-2f.jpg
30-2f.jpg
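For reference, here is a minimal sketch of the dataset just described, assuming the image file paths and nose-tip annotations have already been collected into image_paths and nose_points (placeholder names; parsing the IMM annotation files is omitted):

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

class NoseTipDataset(Dataset):
    """Grayscale 80x60 images paired with the (x, y) nose-tip keypoint."""
    def __init__(self, image_paths, nose_points):
        self.image_paths = image_paths    # list of image file paths
        self.nose_points = nose_points    # one (x, y) nose-tip annotation per image

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx], cv2.IMREAD_GRAYSCALE)
        img = img.astype(np.float32) / 255 - 0.5   # normalize to [-0.5, 0.5]
        img = cv2.resize(img, (80, 60))            # (width, height) = 80x60
        img = torch.from_numpy(img).unsqueeze(0)   # add channel dim -> 1x60x80
        pt = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return img, pt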

Now that we have the dataloader, we define our CNN architecture. We used three convolutional layers with 32, 24, and 16 channels respectively, all with a kernel size of 3x3. Each convolutional layer is followed by a ReLU and then a maxpool. Finally, there are two fully connected layers; we applied a ReLU after the first fully connected layer but not after the last one.
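In code, this architecture might look like the following sketch. The hidden width of the first fully connected layer (128 here) is an assumption, since it is not specified above:

import torch.nn as nn

class NoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 24, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 16, 3), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 8, 128),   # 16 channels on a 5x8 map for 1x60x80 input
            nn.ReLU(),
            nn.Linear(128, 2),            # (x, y) of the nose tip; no ReLU at the end
        )

    def forward(self, x):
        return self.regressor(self.features(x))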

We used mean squared error (MSE) as the prediction loss and the Adam optimizer; a sketch of the training loop follows the examples below. During hyperparameter tuning, we tried different combinations of learning rate and epoch count to find the best one. Here are some of the combinations we tried:






lr=1e-3, epoch=10
lr=1e-3, epoch=25
lr=1e-2, epoch=10
lr=1e-2, epoch=25
lr=5e-2, epoch=10
lr=5e-2, epoch=25
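The training loop itself follows the standard PyTorch pattern; here is a minimal sketch, where train_loader and num_epochs stand in for the actual dataloader and chosen epoch count:

import torch

model = NoseNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

for epoch in range(num_epochs):
    model.train()
    for imgs, pts in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(imgs), pts)   # MSE between predicted and true keypoints
        loss.backward()
        optimizer.step()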

After a lot of experimentation, we found that a learning rate of 1e-3 with 20 epochs seems to work best:

lr=1e-3, epoch=20
predictions against ground truth (validation)

Notice that the CNN works very well on some photos but not on others. Here are some examples of both cases:


bad prediction: 23-4m
bad prediction: 31-4m
good prediction: 18-1m
good prediction: 30-2f

I suspect that lighting and face angle are two factors that cause bad predictions. As we can see, the bad predictions have different (either heavier or lighter) lighting and different face angles than the front-facing faces. These differences may introduce variation that the CNN is not familiar with, which leads to bad predictions. Training the CNN on a larger dataset may mitigate this issue.

Part 2: Full Facial Keypoints Detection

In this section we move forward and detect all 58 facial keypoints/landmarks. The dataloader in this section is similar to part 1, but this time we use a larger input image size of 160x120. Since the dataset is small, we also need data augmentation to prevent the trained model from overfitting. We randomly rotated the face by -15 to 15 degrees and randomly shifted it by -10 to 10 pixels (a sketch of this step follows the examples below). Note that we also updated the keypoints so that they reflect the changes above.

Below are some examples of the image after data augmentation:


08-3f.jpg
12-4f.jpg
22-2f.jpg
30-2f.jpg
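Here is a minimal sketch of the augmentation step, assuming the keypoints are stored as an (N, 2) array of (x, y) pixel coordinates:

import cv2
import numpy as np

def augment(img, pts):
    # randomly rotate by -15..15 degrees and shift by -10..10 pixels,
    # applying the same affine transform to the keypoints
    h, w = img.shape[:2]
    angle = np.random.uniform(-15, 15)
    shift = np.random.uniform(-10, 10, size=2)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)  # rotate about the center
    M[:, 2] += shift                                         # add the random translation
    img = cv2.warpAffine(img, M, (w, h))
    pts = pts @ M[:, :2].T + M[:, 2]                         # transform the keypoints too
    return img, pts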

For the CNN, I added two more convolutional layers and put a maxpool layer after every two convolutional layers. I used a batch size of 8, 40 epochs, and a learning rate of 5e-4. As in the previous part, I used MSE loss and the Adam optimizer. Below is an overview of my CNN architecture and the training and validation loss during the process (a rough code sketch follows the figures).

CNN architecture
loss
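As a rough sketch in code: the five-conv-layer layout is from the description above, but the channel widths, padding, placement of the final pool, and fully connected sizes are my assumptions:

import torch.nn as nn

class FaceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 120x160 -> 60x80
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 60x80 -> 30x40
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 30x40 -> 15x20
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 15 * 20, 256), nn.ReLU(),
            nn.Linear(256, 58 * 2),   # (x, y) for all 58 keypoints
        )

    def forward(self, x):
        return self.regressor(self.features(x))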

Below are some results from plotting the predictions together with the ground truth on the validation data; blue dots are the ground truth and red dots are the predictions made by the CNN. Notice that, similar to part 1, front-facing faces seem to get more accurate predictions than faces that are looking away. This may be because the training dataset does not have enough side-looking faces to train the model to predict accurately on them. Training the model on a larger dataset may resolve this issue.


bad: 33-4m.jpg
bad: 37-4m.jpg
good: 35-1f.jpg
good: 38-6m.jpg

I also visualized the filters learned by the CNN throughout the training process. Since there are many layers, I only visualized layers 1, 3, and 5 here for reference (a sketch of the plotting helper follows).

layer1
layer3
layer5
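The filter grids can be produced with a small matplotlib helper along these lines (show_filters is a hypothetical name; it plots the first input-channel slice of each filter in the given conv layer):

import matplotlib.pyplot as plt

def show_filters(conv_layer, ncols=8):
    weights = conv_layer.weight.detach().cpu()   # shape: (out_ch, in_ch, kH, kW)
    filters = weights[:, 0]                      # first input-channel slice per filter
    nrows = (len(filters) + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(ncols, nrows))
    for ax in axes.flat:
        ax.axis("off")
    for ax, f in zip(axes.flat, filters):
        ax.imshow(f.numpy(), cmap="gray")        # each filter as a small grayscale image
    plt.show()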

Part 3: Train With Larger Dataset

In this section, we built upon the code from part 2 in Google Colab so that we could train on larger datasets using the GPU. To get rid of excess background in the images, we cropped each image to 1.5x the provided facial bounding box before resizing it to a 224x224 square, then applied data augmentation similar to part 2. The facial landmarks are transformed along the way; a sketch of the cropping step is shown below, followed by a few examples of the input images with ground-truth keypoints:
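A minimal sketch of the crop step, assuming the boxes come as (x, y, width, height) in pixels (the exact box format is an assumption, and the subsequent 224x224 resize, which rescales the landmarks the same way, is omitted):

import numpy as np

def crop_to_box(img, pts, box, scale=1.5):
    # crop to the bounding box enlarged by `scale` about its center,
    # shifting the landmarks into the crop's coordinate frame
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    x0 = max(int(cx - scale * w / 2), 0)
    y0 = max(int(cy - scale * h / 2), 0)
    x1 = min(int(cx + scale * w / 2), img.shape[1])
    y1 = min(int(cy + scale * h / 2), img.shape[0])
    return img[y0:y1, x0:x1], pts - np.array([x0, y0])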

For this part, I used ResNet18 with some modifications: I changed the input channel number to 1 (since we are working with grayscale images) and the output channel number to 68 * 2 = 136, i.e., the (x, y) coordinates of the 68 landmarks of each face.
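These two changes amount to swapping out the first convolutional layer and the final fully connected layer of torchvision's ResNet18; a minimal sketch:

import torch.nn as nn
from torchvision.models import resnet18

model = resnet18()
# accept 1-channel grayscale input instead of RGB
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# regress (x, y) for each of the 68 landmarks
model.fc = nn.Linear(model.fc.in_features, 68 * 2)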

Below is a summary of the modified ResNet18:

For the optimizer and loss function, I still used Adam with a learning rate of 1e-3 and MSE loss, as in part 2. Training ran on Colab's GPU (sketched below) for 20 epochs, and I plotted the training and validation error throughout the process:
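Moving training to the GPU only requires placing the model and each batch on the CUDA device; a minimal sketch of the relevant lines (train_loader, optimizer, and criterion are as in the earlier part-1 sketch):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

for imgs, pts in train_loader:
    imgs, pts = imgs.to(device), pts.to(device)   # move each batch to the GPU
    optimizer.zero_grad()
    loss = criterion(model(imgs), pts)
    loss.backward()
    optimizer.step()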

Now that we have a trained network, I ran it on the test data (with batch size = 8) and visualized some of the results, shown below. We can see that the convolutional neural network works pretty well on the test data.


I also used the model to predict the facial keypoints on some photos of me and my friend. As shown below, I think it did pretty well on all of the images.


Final Thoughts

CNNs are really powerful! I wish I had learned more about them before taking this class.