Fall 2020

CS194 Project 4: Facial Keypoint Detection with Neural Networks

April Sin

Overview

In the last project, Project 3: Face Morphing, we had to manually click the keypoints that identify a face. To step up our game, in this project we make the computer detect the face for us automatically! We achieve this by using convolutional neural networks.

Part 1: Nose Tip Detection

Dataset and Dataloader

Before detecting the whole face, we will try detecting just the nose. We will be using the IMM Face Dataset.
All the images are resized to 240 x 320 and no data augmentation is applied.

For the dataloader, I used batch_size = 64.
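
As a rough sketch of the data pipeline (the file lists and keypoint values here are illustrative placeholders; in practice the nose keypoint comes from the IMM .asf annotation files), the dataset and dataloader could look like this:

    import torch
    from torch.utils.data import Dataset, DataLoader
    import torchvision.transforms.functional as TF
    from PIL import Image

    class NoseDataset(Dataset):
        """Grayscale face images paired with their nose-tip keypoint."""
        def __init__(self, image_paths, nose_points, size=(240, 320)):
            self.image_paths = image_paths      # list of file paths (placeholder)
            self.nose_points = nose_points      # list of (x, y) nose coordinates (placeholder)
            self.size = size                    # (H, W)

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            img = Image.open(self.image_paths[idx]).convert("L")    # grayscale
            img = img.resize((self.size[1], self.size[0]))          # PIL expects (W, H)
            img = TF.to_tensor(img)                                 # (1, H, W), values in [0, 1]
            kpt = torch.tensor(self.nose_points[idx], dtype=torch.float32)
            return img, kpt

    # batch_size = 64, as used in this part
    train_loader = DataLoader(NoseDataset(train_paths, train_nose_points),
                              batch_size=64, shuffle=True)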

Here are some sampled images and the ground-truth nose keypoint:

Sample 1
Sample 2
Sample 3

CNN Model

I used a neural network with three convolution layers, each followed by ReLU and max pooling. They are then followed by two fully connected layers. I used kernel_size = (3, 3) for all the layers that needed one.

Here are all the layers (a PyTorch sketch follows the list):

  1. C1 = Conv2d(1, 24, 3)
  2. C2 = Conv2d(24, 30, 3)
  3. C3 = Conv2d(30, 20, 3)
  4. FC1 = Linear(20 * 7 * 10, 128)
  5. FC2 = Linear(128, 2)
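
As a sketch in PyTorch (mirroring the layer list above; the 20 * 7 * 10 flatten size depends on the exact input resolution and pooling, so I simply copy the numbers from the list), the model looks roughly like this:

    import torch.nn as nn
    import torch.nn.functional as F

    class NoseNet(nn.Module):
        """Sketch of the Part 1 architecture described above."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 24, 3), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(24, 30, 3), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(30, 20, 3), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.fc1 = nn.Linear(20 * 7 * 10, 128)
            self.fc2 = nn.Linear(128, 2)   # (x, y) of the nose tip

        def forward(self, x):
            x = self.features(x)
            x = x.flatten(1)
            x = F.relu(self.fc1(x))
            return self.fc2(x)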

To train my model, I used MSELoss and the Adam optimizer with learning_rate = 3e-4 and num_epochs = 20.
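
A minimal training loop along these lines (assuming the NoseDataset loader and NoseNet sketch above) might look like:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = NoseNet().to(device)
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

    for epoch in range(20):                       # num_epochs = 20
        model.train()
        running_loss = 0.0
        for imgs, kpts in train_loader:
            imgs, kpts = imgs.to(device), kpts.to(device)
            optimizer.zero_grad()
            pred = model(imgs)
            loss = criterion(pred, kpts)          # MSE between predicted and true keypoints
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * imgs.size(0)
        print(f"epoch {epoch}: train MSE = {running_loss / len(train_loader.dataset):.5f}")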

Here are the results:

Success Cases
Success 1
Success 2
Success 3
Failure Cases
Fail 1
Fail 2
Fail 3

Some reasons why these images failed: the nose is generally around the center of an image, and the model is not good at detecting it when that is not the case -- when the person's head is rotated, and especially when the person is not standing in the middle and the head is strongly rotated.

Training and Validation Accuracy

Part 1 Accuracy

Part 2: Full Facial Keypoints Detection

Dataset and Dataloader

Now that we have succeeded in detecting one point, we can do the same for more keypoints - 58 keypoints, to be exact. Again, we are using the IMM Face Dataset.
All the images are resized to 240 x 320. For training, data augmentation is done by randomly rotating the images (at most 12 degrees).
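
One way to implement this augmentation is sketched below. It assumes the keypoints are stored as (x, y) pixel coordinates; when the image is rotated, the keypoints have to be rotated about the image center by the same angle (the sign convention is worth double-checking by plotting an augmented sample):

    import math
    import torch
    import torchvision.transforms.functional as TF

    def random_rotation(img, kpts, max_deg=12.0):
        """Rotate an image tensor (1, H, W) and its keypoints (N, 2) by the same
        random angle of at most max_deg degrees."""
        deg = (torch.rand(1).item() * 2 - 1) * max_deg     # uniform in [-max_deg, max_deg]
        img = TF.rotate(img, deg)                           # positive angle = counter-clockwise

        h, w = img.shape[-2:]
        cx, cy = (w - 1) / 2, (h - 1) / 2
        theta = math.radians(deg)
        dx, dy = kpts[:, 0] - cx, kpts[:, 1] - cy
        # in image coordinates (y pointing down), a visually counter-clockwise rotation is:
        new_x = cx + math.cos(theta) * dx + math.sin(theta) * dy
        new_y = cy - math.sin(theta) * dx + math.cos(theta) * dy
        return img, torch.stack([new_x, new_y], dim=1)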

For the dataloader, I used batch_size = 64.

Here are some sampled images and the ground-truth face keypoints:

Sample 1
Sample 2
Sample 3

CNN Model

I used a neural network with five convolution layers, each followed by ReLU and max pooling, and then two fully connected layers. I used kernel_size = (3, 3) for all the layers that needed one. The layers are listed below, with a PyTorch sketch after the list.

  1. C1 = Conv2d(1, 18, 3)
  2. C2 = Conv2d(18, 24, 3)
  3. C3 = Conv2d(24, 30, 3)
  4. C4 = Conv2d(30, 30, 3)
  5. C5 = Conv2d(30, 25, 3)
  6. FC1 = Linear(25 * 21 * 30, 128)
  7. FC2 = Linear(128, 2 * 58)
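
As with Part 1, here is a rough PyTorch sketch mirroring this list (again, the 25 * 21 * 30 flatten size is copied from the list rather than derived):

    import torch.nn as nn
    import torch.nn.functional as F

    class FaceNet(nn.Module):
        """Sketch of the Part 2 architecture listed above."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 18, 3), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(18, 24, 3), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(24, 30, 3), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(30, 30, 3), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(30, 25, 3), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.fc1 = nn.Linear(25 * 21 * 30, 128)
            self.fc2 = nn.Linear(128, 2 * 58)   # 58 (x, y) keypoints, flattened

        def forward(self, x):
            x = self.features(x)
            x = x.flatten(1)
            x = F.relu(self.fc1(x))
            return self.fc2(x)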

As in Part 1, I trained my model using MSELoss and the Adam optimizer with learning_rate = 15e-5 and num_epochs = 50.

Here are the results:

Success Cases
Success 1
Success 2
Success 3
Failure Cases
Fail 1
Fail 2
Fail 3

As in Part 1, the model is not good at handling rotations of the head. It also detects less accurately when the person's head is not at the very center of the input image. This problem improved when I lowered the learning rate and increased the number of epochs. It could perhaps be improved further with more data augmentation, especially shifting.

Training and Validation Accuracy

Part 2 Accuracy Plot

Learned Filters Visualized

Here are the learned filters from the first two layers of my trained CNN model.

first conv layer filters
second conv layer filters
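
These plots can be produced with something like the snippet below (a sketch; model.features[0] and model.features[3] refer to the first and second conv layers of the FaceNet sketch above, not necessarily to how my actual training code is organized):

    import matplotlib.pyplot as plt

    def show_filters(conv_layer, path=None):
        """Plot each filter of a Conv2d layer (first input channel only) as a grayscale image."""
        weights = conv_layer.weight.data.cpu()          # (out_channels, in_channels, k, k)
        n = weights.shape[0]
        fig, axes = plt.subplots(1, n, figsize=(1.5 * n, 1.5))
        for i, ax in enumerate(axes):
            ax.imshow(weights[i, 0].numpy(), cmap="gray")
            ax.axis("off")
        if path:
            fig.savefig(path, bbox_inches="tight")
        plt.show()

    # show_filters(model.features[0])    # first conv layer
    # show_filters(model.features[3])    # second conv layer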

Part 3: Training with a Larger Dataset

Dataset, Dataloader, and CNN Model

In this part we will be using a way bigger dataset! (It has 6666 training images and 1008 test images.)

It would take forever if we trained our model on our laptops. Luckily, we have access to GPUs through Google Colab. With batch_size = 512, training took about an hour!

For the CNN, I used ResNet18. To train, I again used MSELoss and the Adam optimizer with learning_rate = 3e-4, but this time with num_epochs = 100.
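
Concretely, adapting the stock torchvision ResNet18 amounts to swapping its input and output layers. The sketch below assumes grayscale input as in the earlier parts, and NUM_KPTS is a placeholder for however many keypoints the larger dataset annotates:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    NUM_KPTS = 68   # placeholder: set to the dataset's actual number of keypoints

    model = models.resnet18()
    # grayscale input: replace the stock 3-channel first conv with a 1-channel one
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # regression head: one (x, y) pair per keypoint instead of 1000 ImageNet classes
    model.fc = nn.Linear(model.fc.in_features, 2 * NUM_KPTS)

    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)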

Here are the results:

Success 1
Success 2
Success 3
Success 4
Success 5
Success 6

My Collection :)

My Collection 1
My Collection 2
My Collection 3
My Collection 4

Bells and Whistles

Using the anti-aliased max pool antialiased_cnns.BlurPool from the work of Richard Zhang (GitHub) in place of torch.nn.MaxPool2d, I was able to produce better results for Part 2. As we can see from the graph, the loss drops more quickly than before.
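
The swap itself is small. Following the pattern recommended in the antialiased-cnns README (a sketch, shown here on the first block of the Part 2 model sketch), each MaxPool2d(2) becomes a stride-1 max pool followed by a BlurPool downsample:

    import torch.nn as nn
    import antialiased_cnns   # pip install antialiased-cnns

    def aa_pool(channels):
        """Anti-aliased replacement for nn.MaxPool2d(2): take the max at stride 1,
        then blur and downsample with BlurPool."""
        return nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=1),
            antialiased_cnns.BlurPool(channels, stride=2),
        )

    # e.g. the first block of the Part 2 model, with the pooling swapped out:
    first_block = nn.Sequential(nn.Conv2d(1, 18, 3), nn.ReLU(), aa_pool(18))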

Part 2 Accuracy Plot
Accuracy from before, without the anti-aliased CNN.

I also used the trained model from Part 3 to find keypoints of my face in photos from various years, and created a morphing video out of them!
Here it is: https://youtu.be/0nH9xbJT-m0