CS 194-26 Project 5

Facial Keypoint Detection

Angela Liu

Overview

In this project, I trained multiple neural networks to identify keypoints on face images.

Part 1: Nose Tip Detection




Dataloader

Here is a sample of four images and labeled keypoints that are loaded from my dataloader.
[Figure: four sample images from the dataloader with their labeled keypoints]
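
Under the hood, the dataloader follows the standard PyTorch Dataset pattern, roughly like the sketch below. The 80x60 image size, the .asf line offsets, and the nose-tip index NOSE_IDX are illustrative assumptions rather than exact details of my code.

import numpy as np
import torch
from torch.utils.data import Dataset
from skimage import io, transform

NOSE_IDX = 52  # assumed index of the nose tip among the 58 IMM points

class NoseDataset(Dataset):
    def __init__(self, img_paths, asf_paths):
        self.img_paths, self.asf_paths = img_paths, asf_paths

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, i):
        # grayscale, downsized to 80x60, shifted to roughly [-0.5, 0.5]
        img = io.imread(self.img_paths[i], as_gray=True)
        img = transform.resize(img, (60, 80)).astype(np.float32) - 0.5
        # each .asf point line stores (x, y) relative to the image size in
        # columns 2 and 3; the 58 point lines are assumed to start at line 16
        with open(self.asf_paths[i]) as f:
            lines = f.readlines()[16:16 + 58]
        pts = np.array([[float(l.split()[2]), float(l.split()[3])]
                        for l in lines], dtype=np.float32)
        return torch.from_numpy(img[None]), torch.from_numpy(pts[NOSE_IDX])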


Network architecture

The architecture I used for nose detection was a small convolutional network: three conv layers feeding into fully connected layers that regress the (x, y) nose position. A sketch is below.
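
Here is a minimal PyTorch sketch of that network. The kernel sizes default to the (7,7), (5,5), (3,3) configuration used in the experiments below; the channel counts, pooling placement, and hidden width are illustrative assumptions.

import torch
import torch.nn as nn

class NoseNet(nn.Module):
    # A sketch of the nose-tip regressor; only the kernel sizes are taken
    # from the writeup, everything else is illustrative.
    def __init__(self, kernels=(7, 5, 3), channels=(12, 16, 32)):
        super().__init__()
        layers, in_ch = [], 1
        for k, out_ch in zip(kernels, channels):
            layers += [nn.Conv2d(in_ch, out_ch, k), nn.ReLU(), nn.MaxPool2d(2)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        # infer the flattened feature size once, for an 80x60 input
        with torch.no_grad():
            n = self.features(torch.zeros(1, 1, 60, 80)).numel()
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(n, 128),
                                  nn.ReLU(), nn.Linear(128, 2))

    def forward(self, x):
        return self.head(self.features(x))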

Performance

Using the default architecture described above with a learning rate of 1e-3, I noticed decent performance. The loss graph and selected results are shown below.
[Training loss graph]

Good example
[Faces where the predicted nose point is close to the true point]

Bad example
[Faces where the predicted nose point is far from the true point]
The green dot marks the true point while the red dot marks the prediction. (Sorry, I didn't realize they would be this small, but if you zoom in, they should all be visible.)

Reflecting on the above examples, it seems like the model performed very well on the first face. One reason might be that the first face looks a lot like the "average Dane" face that I created a couple of projects ago. Since the face is very typical in shape and expression, the model did a much better job of learning it. The second face, on the other hand, has a longer chin and a much more expressive mouth, which is likely why the model got confused.


Different Learning Rates

The first hyperparameter I experimented with was the learning rate. The results for 1e-2, 1e-3, and 1e-4 are shown below.
Loss Graphs
[Loss graph for 1e-2]
[Loss graph for 1e-3]
[Loss graph for 1e-4]
In the graphs, we can see that the highest learning rate (1e-2) had the worst validation loss at well over 0.005, whereas the learning rate of 1e-3 reached around 0.002 by the end, and the learning rate of 1e-4 had a final validation loss of 0.0009. (This final value is not obvious in the graph due to the poor axis labels, but I printed out the loss at each step and that is what it ended with.) So according to the loss, the lowest learning rate performed the best.
Good example
[Detection for 1e-2]
[Detection for 1e-3]
[Detection for 1e-4]
In these good examples, the 1e-2 learning rate performed a bit poorly, but the two lower learning rates both performed very well.
Bad example
[Detection for 1e-2]
[Detection for 1e-3]
[Detection for 1e-4]
For the bad examples, the difference in learning rate is very obvious, with 1e-2 being pretty far off while each decrease in learning rate brought the prediction point a bit closer to the true label.
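
Concretely, each run only changed the learning rate handed to the optimizer, along these lines (the Adam optimizer and MSE loss are illustrative of the setup, not confirmed details):

import torch

criterion = torch.nn.MSELoss()  # loss on normalized (x, y) coordinates
for lr in (1e-2, 1e-3, 1e-4):
    net = NoseNet()  # the sketch from above
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    # ... train and record train/validation loss per epoch ...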


Different Filter Sizes

The second hyperparameter I experimented with was the filter size of each conv layer. The results for [(10,10), (7,7), (5,5)], [(7,7), (5,5), (3,3)] (the default), and [(5,5), (3,3), (2,2)] are shown below; each graph and detection is labeled by the configuration's first-layer filter size.
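
Using the (hypothetical) NoseNet sketch from the previous section, each configuration is one constructor call:

# one model per kernel-size configuration
configs = [(10, 7, 5), (7, 5, 3), (5, 3, 2)]
models = [NoseNet(kernels=k) for k in configs]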
Loss Graphs
[Loss graph for (5,5)]
[Loss graph for (7,7)]
[Loss graph for (10,10)]
Looking at the graphs, the model with the largest filter sizes performed relatively better: the smallest filter size had a very inconsistent validation loss that did not decrease steadily, whereas the largest filter size was much more stable and decreased to the lowest final loss.
Good example
[Detection for (5,5)]
[Detection for (7,7)]
[Detection for (10,10)]
In these good examples, the smallest filter size performed a bit poorly, but the two larger filter sizes both performed very well.
Bad example
[Detection for (5,5)]
[Detection for (7,7)]
[Detection for (10,10)]
For the bad examples, it actually seems like the smallest filter size performed a bit better on this one, since its prediction is closer to the nose region; however, all of the models are still quite a bit off.

Part 2: Full Facial Keypoints Detection




Dataloader

Here is a sample of four images and labeled keypoints that are loaded from my dataloader with random rotation augmentation.
[Figure: four augmented sample images from the dataloader with their labeled keypoints]
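
The rotation augmentation rotates the image and its keypoints together, roughly like this sketch (the ±15 degree range, the function name, and the pixel-coordinate convention are assumptions):

import numpy as np
from scipy import ndimage

def rotate_sample(image, keypoints, max_deg=15.0):
    # image: (H, W) array; keypoints: (N, 2) array of (x, y) pixel coords
    deg = np.random.uniform(-max_deg, max_deg)
    rotated = ndimage.rotate(image, deg, reshape=False, mode="nearest")
    h, w = image.shape[:2]
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    t = np.deg2rad(deg)
    x, y = keypoints[:, 0] - cx, keypoints[:, 1] - cy
    # a positive scipy angle rotates the image counterclockwise on screen,
    # so the keypoints get the matching rotation about the center
    pts = np.stack([cx + x * np.cos(t) + y * np.sin(t),
                    cy - x * np.sin(t) + y * np.cos(t)], axis=1)
    return rotated, pts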


Network architecture

The architecture I used for full facial keypoint detection built directly off of the architecture that had worked well for nose detection; the motivation was to keep it simple (a sketch is below). At first, I tried adding more max pool layers between the convs as well as an additional conv layer; however, that network was too complicated and caused my model to predict essentially the same face shape for every input. After removing those extra layers, my model performed significantly better.
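
A rough PyTorch sketch of the final network: the 16 first-layer filters match what is visualized in the next section, while the remaining layer sizes and the 160x120 input are illustrative assumptions.

import torch
import torch.nn as nn

class FaceNet(nn.Module):
    # A sketch of the 58-keypoint regressor built off the nose network;
    # only the 16 first-layer filters are taken from the writeup.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 7), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
        )
        # infer the flattened feature size once, for a 160x120 input
        with torch.no_grad():
            n = self.features(torch.zeros(1, 1, 120, 160)).numel()
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(n, 256),
                                  nn.ReLU(), nn.Linear(256, 58 * 2))

    def forward(self, x):
        return self.head(self.features(x))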

Convolution filters

Here are the 16 convolution filter weights from the first layer of my network. Each filter has a slightly different weight pattern, which reflects the different feature types that each filter is learning.
[Figure: the 16 first-layer convolution filters]
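
The grid was plotted straight from the first conv layer's weights, along these lines (net.features[0] assumes a trained instance of the FaceNet sketch above):

import matplotlib.pyplot as plt

w = net.features[0].weight.detach().cpu().numpy()  # shape (16, 1, 7, 7)
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for ax, f in zip(axes.flat, w[:, 0]):
    ax.imshow(f, cmap="gray")
    ax.axis("off")
plt.show()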
Performance

Using the default architecture described above with a learning rate of 1e-3, I noticed decent performance overall. The loss graph and selected results are shown below.
[Training loss graph; the title should read face_1e3]

Good example
[Faces with accurately predicted keypoints]

Bad example
[Faces with poorly predicted keypoints]
The green dots mark the true points while the red dots mark the predictions.

Looking at the good vs. bad keypoints, it appears that the model for the most part performed well when there was a clearly defined face with hair and facial features. The model performed particularly poorly on male 33; this was likely because this person is bald (lacking hair that would provide edge boundaries) and his face is turned at a much sharper angle, and these two things together made it very difficult for the model to find the appropriate points.

Part 3: Train With Larger Dataset




Dataloader

Here is a sample of four images and labeled keypoints that are loaded from my dataloader with random rotation augmentation.
[Figure: four augmented sample images from the dataloader with their labeled keypoints]


Network architecture

The architecture I used for this part was a ResNet18 model with a modified first and final layer so that it takes in a black-and-white image and outputs 68 detected points:

(conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
 (0): BasicBlock(
  (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
 (1): BasicBlock(
  (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(layer2): Sequential(
 (0): BasicBlock(
  (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (downsample): Sequential(
   (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
   (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
 )
 (1): BasicBlock(
  (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(layer3): Sequential(
 (0): BasicBlock(
  (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (downsample): Sequential(
   (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
   (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
 )
 (1): BasicBlock(
  (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(layer4): Sequential(
 (0): BasicBlock(
  (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (downsample): Sequential(
   (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
   (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
 )
 (1): BasicBlock(
  (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=512, out_features=136, bias=True)

I printed out the model's named children and copied the result above to show the names of each layer in the model. We can also see the dimensions of the model, where it is clear that I modified the input layer to take a 1-channel image (since my inputs are black and white) and changed the output to have 136 features, which represents a flattened 2-D array of 68 (x, y) points.
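
In code, these two changes amount to swapping out conv1 and fc on a stock torchvision ResNet18:

import torch.nn as nn
import torchvision.models as models

model = models.resnet18()
# 1-channel input instead of RGB, since the images are grayscale
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# 68 (x, y) keypoints flattened into 136 outputs
model.fc = nn.Linear(model.fc.in_features, 68 * 2)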


Performance

Using the ResNet18 architecture described above with a learning rate of 1e-3, I scored a mean absolute error of 12.09849 on Kaggle (https://www.kaggle.com/aliu917). The loss graph from training is shown below:
[Training loss graph; the title should read train_1e3]

Looking at the graph, the shape of the loss curves looks very good, with train and validation loss decreasing largely in sync. The loss has mostly flattened out by the end, but training for longer might have pushed it even closer to 0, since it still appears to be on a slightly decreasing trend at the end of 25 epochs.
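
The curves come from averaging the loss over each epoch on both splits, roughly like the loop below; the Adam optimizer, MSE loss, and the train_loader/val_loader names are assumptions rather than confirmed details.

import torch

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_hist, val_hist = [], []
for epoch in range(25):
    model.train()
    total = 0.0
    for imgs, pts in train_loader:  # pts assumed to be (batch, 68, 2)
        optimizer.zero_grad()
        loss = criterion(model(imgs), pts.flatten(1))
        loss.backward()
        optimizer.step()
        total += loss.item()
    train_hist.append(total / len(train_loader))

    model.eval()
    with torch.no_grad():
        val = sum(criterion(model(imgs), pts.flatten(1)).item()
                  for imgs, pts in val_loader)
    val_hist.append(val / len(val_loader))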

Good example
[Faces with accurately predicted keypoints]

Bad example
[Faces with poorly predicted keypoints]
Overall, my model performed very well on most faces, even at different orientations, and I'm quite impressed by how the results turned out. There were a few cases, pictured above, where the model seemed to be a bit off from the actual face. In each bad example, the face exhibits a slightly unusual shape or angle (the first is very angled with hair, the second has a very long face, and the third is very light/pale colored with a very round face shape).

Part 3: Testing on My Own Data

Good example
[My photos with reasonable keypoint predictions]

Bad example
[My photos with poor keypoint predictions]


It appears that, for the most part, the model did okay on my own data, but it performed noticeably worse than on the provided train/test data. This is likely due to multiple reasons: some of the older pictures (those in the bad example) have much lower resolution, which makes them blurrier and likely contributed to the model's difficulty in identifying facial features. Another reason might be that my bounding boxes (which I estimated myself) were not as accurate, which resulted in poor predictions. But overall, the model was able to recognize at least some of the facial features!
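
For reference, running one of my own photos through the model looked roughly like this sketch; the file name, the box coordinates, and the 224x224 crop size are all hypothetical.

import numpy as np
import torch
from skimage import io, transform

# hypothetical hand-estimated bounding box (x, y, w, h) for one photo
img = io.imread("my_photo.jpg", as_gray=True)
x, y, w, h = 420, 180, 900, 900
crop = transform.resize(img[y:y + h, x:x + w], (224, 224))
inp = torch.from_numpy(crop.astype(np.float32) - 0.5)[None, None]
model.eval()
with torch.no_grad():
    pred = model(inp).reshape(68, 2)  # keypoints in crop coordinates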