Project 5:
Facial Keypoint Detection with Neural Networks


Author: Isaac Bae
Class: CS 194-26 (UC Berkeley)
Date: 11/16/21


Part 1: Nose Tip Detection


In this part, I had to build a fairly simple convolutional neural network (CNN) to detect nose tips for a database of faces.

First, I had to confirm that my data was being loaded correctly. Here are some samples from the dataloader that I created.


p1_samples
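
The dataloader wraps a torch Dataset along these lines. This is only a sketch: it assumes the grayscale images and nose-tip coordinates are already in memory as arrays, and the file parsing for the face database is omitted.

  import torch
  from torch.utils.data import Dataset, DataLoader

  class NoseTipDataset(Dataset):
      # Sketch only: images and nose-tip coordinates are assumed to be
      # pre-loaded arrays; reading them from the face database is omitted.
      def __init__(self, images, nose_tips):
          self.images = images        # (N, H, W), float32 in [0, 1]
          self.nose_tips = nose_tips  # (N, 2), (x, y) normalized to [0, 1]

      def __len__(self):
          return len(self.images)

      def __getitem__(self, idx):
          img = torch.as_tensor(self.images[idx]).unsqueeze(0)  # add channel dim
          pt = torch.as_tensor(self.nose_tips[idx])
          return img, pt

  # loader = DataLoader(NoseTipDataset(imgs, pts), batch_size=4, shuffle=True)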

Now, I was ready to train my simple CNN with the data! Here are the plots for the training and validation loss.


p1_orig_train
p1_orig_test
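
For concreteness, here is a minimal sketch of this kind of training/validation loop, assuming MSE loss on the normalized keypoint coordinates and the Adam optimizer:

  import torch
  import torch.nn as nn
  import torch.optim as optim

  def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
      criterion = nn.MSELoss()
      optimizer = optim.Adam(model.parameters(), lr=lr)
      train_losses, val_losses = [], []
      for _ in range(epochs):
          # One pass over the training set, updating the weights.
          model.train()
          running = 0.0
          for imgs, pts in train_loader:
              optimizer.zero_grad()
              loss = criterion(model(imgs), pts)
              loss.backward()
              optimizer.step()
              running += loss.item()
          train_losses.append(running / len(train_loader))

          # One pass over the validation set, no gradient updates.
          model.eval()
          running = 0.0
          with torch.no_grad():
              for imgs, pts in val_loader:
                  running += criterion(model(imgs), pts).item()
          val_losses.append(running / len(val_loader))
      return train_losses, val_losses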

To show how hyperparameters affect the results, I decided to change the learning rate from the original 1e-3 to 1e-2 and 1e-4. Here are the results.


p1_lr_train
p1_lr_test

As you can see, when the learning rate is increased to 1e-2, the curve drops and flattens out very quickly. In contrast, the lower learning rate of 1e-4 has a much slower drop until it begins to flatten out at values similar to those of 1e-3 and 1e-2.

I also decided to remove the last layer of the CNN and see how that affected the results.


p1_conv_train
p1_conv_test

It seems that removing a layer made the loss drop more quickly than with the extra layer. This was interesting because I expected worse performance in some way (e.g., the curve flattening out at a higher loss).

Now, let's see some actual results! Here are two successes and two failures.


Successes

p1_succ1
p1_succ2

Failures

p1_fail1
p1_fail2

It is pretty clear that when a person's nose is generally in the middle area of the image, the predicted point is almost spot-on. However, when a person's face is oriented to the side, the point seems to move only slightly toward the nose from the middle. I believe this is due to many different factors, such as not training for enough epochs, using a suboptimal CNN architecture, etc.


Part 2: Full Facial Keypoints Detection


For this part, I had to detect not just the nose tip but all of the facial keypoints for the same database of faces.

Here are some samples from the dataloader for this part.


p2_samples

This is the CNN architecture that I used:

  1. conv1 (in_channels=1, out_channels=16, kernel_size=7, stride=3)
  2. relu
  3. max_pool (kernel_size=5, stride=3)
  4. conv2 (in_channels=16, out_channels=16, kernel_size=3, padding=1)
  5. relu
  6. conv3 (in_channels=16, out_channels=16, kernel_size=3, padding=1)
  7. relu
  8. conv4 (in_channels=16, out_channels=32, kernel_size=3, padding=1)
  9. relu
  10. conv5 (in_channels=32, out_channels=32, kernel_size=3, padding=1)
  11. relu
  12. max_pool (kernel_size=7, stride=3)
  13. fc1 (in_features=256, out_features=128)
  14. relu
  15. fc2 (in_features=128, out_features=116)
Hopefully this is intuitive enough, but just in case: the layers seen here are convolutional, ReLU, max-pooling, and fully connected layers. My idea was to downsample aggressively at the beginning, double the channel count every two convolutional layers, and pool at the end. This was inspired by standard CNNs such as VGG and GoogLeNet.
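
Put together as a PyTorch module, the stack above looks like the sketch below. The input resolution isn't listed, but fc1's in_features=256 is consistent with a 1 x 120 x 160 grayscale input, which the conv stack reduces to a 32 x 2 x 4 feature map (32 * 2 * 4 = 256).

  import torch
  import torch.nn as nn

  class KeypointNet(nn.Module):
      def __init__(self):
          super().__init__()
          self.features = nn.Sequential(
              nn.Conv2d(1, 16, kernel_size=7, stride=3), nn.ReLU(),
              nn.MaxPool2d(kernel_size=5, stride=3),
              nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
              nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
              nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
              nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
              nn.MaxPool2d(kernel_size=7, stride=3),
          )
          self.fc1 = nn.Linear(256, 128)   # 32 * 2 * 4 = 256 for a 120x160 input
          self.fc2 = nn.Linear(128, 116)   # 58 keypoints * (x, y)

      def forward(self, x):
          x = torch.flatten(self.features(x), 1)
          return self.fc2(torch.relu(self.fc1(x)))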

Also, I used a learning rate of 9e-4, 75 epochs, and a batch size of 4; the setup is sketched below, followed by the resulting loss plots.
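
Reusing the train helper sketched in Part 1, the setup amounts to something like this (train_set and val_set stand in for the actual dataset objects, whose construction is omitted):

  from torch.utils.data import DataLoader

  # train_set / val_set: Dataset objects yielding a (1, 120, 160) image and
  # its 58 keypoints flattened to a 116-vector (construction omitted).
  train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
  val_loader = DataLoader(val_set, batch_size=4)

  model = KeypointNet()
  train_losses, val_losses = train(model, train_loader, val_loader,
                                   epochs=75, lr=9e-4)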


p2_orig_train
p2_orig_test

Here are the visualizations of the learned filters of the first convolutional layer.


p2_filter0 – p2_filter15 (all 16 learned filters of the first convolutional layer)
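
These can be pulled out and plotted with a few lines of matplotlib. The sketch below assumes the KeypointNet module sketched above, where features[0] is the first convolutional layer:

  import matplotlib.pyplot as plt

  def show_first_layer_filters(model):
      # First conv layer weights: shape (out_channels=16, in_channels=1, 7, 7).
      weights = model.features[0].weight.detach().cpu().numpy()
      fig, axes = plt.subplots(2, 8, figsize=(12, 3))
      for i, ax in enumerate(axes.flat):
          ax.imshow(weights[i, 0], cmap='gray')
          ax.set_title(str(i))
          ax.axis('off')
      plt.show()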

Here are two successes and two failures.


Successes

p2_succ1
p2_succ2

Failures

p2_fail1
p2_fail2

Judging from my successes and failures, it seems that my CNN first finds the area of the image where the face is most likely to be, and then tries to minimize the loss with respect to the orientation of the face (though I may be wrong).


Part 3: Train With Larger Dataset


The standard CNN that I decided to go with was ResNet-18. I used it with pretrained weights, and modified the first and last layers to accommodate grayscale images and the larger number of facial landmarks. Specifically, I changed in_channels to 1 in the first convolutional layer, and changed out_features to 136 in the last fully connected layer. I also used a trick where I collapsed each filter of the first convolutional layer from 3 x K x K down to 1 x K x K, so the pretrained weights still carry over.
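
Roughly, these modifications look like the following. I show the pretrained RGB filters being summed over the color dimension; averaging them is an equally common way to do the 3 x K x K to 1 x K x K collapse.

  import torch
  import torchvision

  model = torchvision.models.resnet18(pretrained=True)

  # Swap the stem for a single-channel conv, collapsing the pretrained
  # (64, 3, 7, 7) filters down to (64, 1, 7, 7) by summing over color.
  old_conv = model.conv1
  model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                padding=3, bias=False)
  with torch.no_grad():
      model.conv1.weight.copy_(old_conv.weight.sum(dim=1, keepdim=True))

  # New head: 68 keypoints * (x, y) = 136 outputs.
  model.fc = torch.nn.Linear(model.fc.in_features, 136)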

For reference, I used a learning rate of 1e-4 for 25 epochs (batch size of 32) for "Run 1". When the loss began to stagnate despite still creeping downward, I lowered the learning rate and trained for another 10 epochs for "Run 2". Here are the results.


p3_p1_train
p3_p1_test
p3_p2_train
p3_p2_test

In hindsight, yellow was not necessarily the best color choice for plotting the keypoints on the following test images, but here they are anyway!


p3_test1
p3_test2
p3_test3

Also, here are some results on my own images!


p3_pic0
p3_pic1
p3_pic2