Facial Keypoint Detection with Neural Networks

COMPSCI 194-26: Computational Photography & Computer Vision

Professors Alexei Efros & Angjoo Kanazawa

October 31st, 2021

Nick Kisel


Part 1: Nose tip detection

I trained a convolutional neural network, NoseNet, to detect the nose keypoint by processing face images and their keypoint annotations, using the following parameters:
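(The exact settings live in the figure; below is a rough, hypothetical sketch of what a NoseNet-style model looks like in PyTorch. The layer sizes are illustrative assumptions, apart from the 12-channel conv1 discussed under hyperparameter tuning, and the input is assumed to be a 1 x 60 x 80 grayscale image.)

import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    # Hypothetical layer sizes; only the 12-channel conv1 comes from this report.
    def __init__(self, conv1_channels=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, conv1_channels, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(conv1_channels, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc1 = nn.Linear(32 * 4 * 7, 128)   # 4 x 7 spatial map for an assumed 60 x 80 input
        self.fc2 = nn.Linear(128, 2)            # a single (x, y): the nose keypoint

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.fc2(F.relu(self.fc1(x)))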

Sample picture & its corresponding nose keypoint.





Some hits:




Before I explain the misses, let's be clear that we're detecting the base of the nose, rather than the tip.
I think these misses stem from two main causes. First, the model tends to assume that the nose should be centered in any photo, since every photo it has learned from has the face, and therefore the nose, centered. This particularly explains why the left photo is off: the nose is expected in a central location, and the top of the cheek most closely emulates the shading of a nose in that region of the photo.

Second, the lighting of the faces is relatively consistent across all photos. That means the lighting surrounding each nose is roughly the same; in particular, noses cast shadows in this dataset. Thus, the network learns to look for dark areas, or a transition from light to dark, but sometimes lacks the context to tell the shadow of a nose apart from the shadow of, say, the mouth, an eye socket, or a chin. That's how we end up with predictions like the middle photo.
I might call the right photo a combination of the two errors: the prediction biases toward the center and gets caught on the side of the nose rather than the tip.


Some misses:

Here, the X marks my prediction, and the green O marks the actual point!

Hyperparameter tuning: learning rate

After a few trial runs in which I modified the learning rate, I settled on 5e-5. I had initially found that the model converged on a "solution" quickly, meaning it would come close to a minimum loss, but then consistently overshoot when it tried to correct itself and hit that minimum (see the graph at right, "Default learning rate run"). Thus, I slowed the learning rate down so that the model would approach that minimum more gracefully and closely as I kept training.
Notice that the default 1e-3 learning rate begins very favorably, and its first loss is low enough to bring the plot's scale down to 0.002 per vertical axis bar; however, this is misleading. Because the learning rate is so high compared to the other two models, the losses down the line keep fluctuating and stop decreasing. While this model reached a similar prediction quality as the low-LR model in a shorter amount of time, it was ultimately unable to keep optimizing and become more accurate after ten or so epochs. I observed that the validation loss for both 5e-5 models ran below 0.002 as the epochs went on, while the default model converged at 0.002 and didn't sink any lower, regardless of how long I trained.

Low (5e-5) learning rate run


5e-5 learning rate, after increasing the number of channels on conv1 from 12 to 16


Default learning rate run


Hyperparameter tuning: channels

I first ran conv1 with 12 channels; increasing that number to 16 appears to reduce the "bumpiness" and "variability" of the validation loss, according to the graph. In other words, the network trains more consistently overall, but the final results after many epochs aren't actually much different between the two 5e-5 models.

I used a batch size of 1 for all runs.
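For reference, here is a minimal sketch of the training setup implied by the settings above; the Adam optimizer and MSE loss are assumptions, and train_set / val_set are placeholders for my dataset splits:

import torch
from torch.utils.data import DataLoader

# train_set / val_set stand in for the nose-keypoint dataset splits described above.
train_loader = DataLoader(train_set, batch_size=1, shuffle=True)   # batch size 1, as above
val_loader = DataLoader(val_set, batch_size=1)

model = NoseNet()                                          # 12- or 16-channel conv1 variant
criterion = torch.nn.MSELoss()                             # assumed loss on the (x, y) target
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # compared against the 1e-3 default

num_epochs = 25                                            # placeholder; run lengths varied
for epoch in range(num_epochs):
    model.train()
    for image, keypoint in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(image), keypoint)
        loss.backward()
        optimizer.step()

    # track validation loss each epoch to compare the runs
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(img), kp).item()
                       for img, kp in val_loader) / len(val_loader)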


Part 2: Full Facial Keypoints Detection

Next, we predict all 58 of the face points, not just the nose. A few runs showed that the learning rate should be lowered again, so I dropped it to 0.00002 (2e-5). I used a batch size of 1 for all runs. In all, I trained for 140 epochs, until my validation loss stopped decreasing.

FaceNet(
(conv1): Conv2d(1, 12, kernel_size=(7, 7), stride=(1, 1))
(conv2): Conv2d(12, 20, kernel_size=(5, 5), stride=(1, 1))
(conv3): Conv2d(20, 25, kernel_size=(5, 5), stride=(1, 1))
(conv4): Conv2d(25, 32, kernel_size=(5, 5), stride=(1, 1))
(conv5): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=4032, out_features=300, bias=True)
(fc2): Linear(in_features=300, out_features=116, bias=True)
(features): Sequential(
(0): Conv2d(1, 12, kernel_size=(7, 7), stride=(1, 1))
(1): ReLU()
(2): Conv2d(12, 20, kernel_size=(5, 5), stride=(1, 1))
(3): ReLU()
(4): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(20, 25, kernel_size=(5, 5), stride=(1, 1))
(6): ReLU()
(7): Conv2d(25, 32, kernel_size=(5, 5), stride=(1, 1))
(8): ReLU()
(9): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(10): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
(11): ReLU()
(12): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
)
)
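Note that the printout lists each conv layer twice because the same modules are registered both as named attributes and inside the features Sequential. Here is a sketch of how this architecture can be written as an nn.Module; the forward pass is my reconstruction, and a 1 x 120 x 160 grayscale input makes the flattened size come out to exactly 32 x 9 x 14 = 4032, matching fc1:

import torch.nn as nn
import torch.nn.functional as F

class FaceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=7)
        self.conv2 = nn.Conv2d(12, 20, kernel_size=5)
        self.conv3 = nn.Conv2d(20, 25, kernel_size=5)
        self.conv4 = nn.Conv2d(25, 32, kernel_size=5)
        self.conv5 = nn.Conv2d(32, 32, kernel_size=5)
        self.fc1 = nn.Linear(4032, 300)
        self.fc2 = nn.Linear(300, 116)           # 58 keypoints, (x, y) each
        # Reusing the same conv modules inside a Sequential is why they print twice above.
        self.features = nn.Sequential(
            self.conv1, nn.ReLU(),
            self.conv2, nn.ReLU(), nn.MaxPool2d(2),
            self.conv3, nn.ReLU(),
            self.conv4, nn.ReLU(), nn.MaxPool2d(2),
            self.conv5, nn.ReLU(), nn.MaxPool2d(2),
        )

    def forward(self, x):                        # x: (N, 1, 120, 160), assumed input size
        x = self.features(x).flatten(1)          # -> (N, 32 * 9 * 14) = (N, 4032)
        return self.fc2(F.relu(self.fc1(x)))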




The X marks my prediction, and the green O marks the actual point!

Some hits:




Some misses:

A miss while looking away.

A miss while looking away.

A miss while looking away.

But why did it miss?

Firstly, there are a lot more points to predict this time. One consequence is that once a single point is chosen, the positions of the other points generally fall into place relative to it; in effect, the points are dependent on one another.
This invites a lot of errors based on how we've trained the model: if some point is thought to be the tip of the nose, it only makes sense for the model to map a few points around it as the sides of the nose, a few points diagonally to the upper left as the right eye, and so on. When one such prediction goes wrong, a number of others may be pulled along with it and map to the wrong locations.

While there are some extreme transformations that wouldn't make sense (e.g., flipping a face horizontally without re-marking the points), we haven't drastically modified most of the faces that the model is training on beyond a few degrees of rotation and a few translations. That means that, taking an extreme case like an upside-down face as an example, the model would probably still try to map the points right-side up. That exact case doesn't exist in our dataset, but it's replaced by a similar "rotated face" problem, where a face turned away from the camera isn't mapped particularly well by the model.
However, not all is lost, because the first slightly turned face is mapped quite impressively!
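The key requirement for this kind of augmentation is that the keypoints move together with the image. A minimal numpy sketch of the translation case follows; the function name and shift range are illustrative, and rotation is handled analogously by rotating the keypoints about the image center by the same angle:

import numpy as np

def random_translate(image, keypoints, max_shift=10):
    # image: (H, W) float array; keypoints: (N, 2) array of (x, y) pixel coordinates.
    # Shifts the image content by a random (dx, dy) and moves the keypoints with it.
    h, w = image.shape
    dx = np.random.randint(-max_shift, max_shift + 1)
    dy = np.random.randint(-max_shift, max_shift + 1)
    shifted = np.zeros_like(image)
    # source and destination slices for the overlapping region
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted, keypoints + np.array([dx, dy])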


Part 3: Training on a larger dataset

Lastly, I downloaded the large ibug dataset and trained on all of the annotated faces, cropping the images and applying slight adjustments (rotations, translations, etc.). I took the advice on the project page and used ResNet18 as a base, modifying the first convolutional layer conv1 to accept just one color channel, and the last fully connected layer to produce 136 outputs - one for each of the 68 points' x & y coordinates.

I chose a learning rate of 0.0001 (1e-4), and trained for 25 epochs.
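A sketch of those two modifications in torchvision terms; whether the run started from pretrained weights isn't stated here, so the default initialization is shown, and the Adam optimizer is an assumption:

import torch
import torchvision

model = torchvision.models.resnet18()

# Accept one grayscale channel instead of three RGB channels.
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Output an (x, y) pair for each of the 68 ibug keypoints: 136 values.
model.fc = torch.nn.Linear(model.fc.in_features, 136)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate from above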

Model Architecture

ResNet(
(conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): BasicBlock(
  (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
  (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer2): Sequential(
(0): BasicBlock(
  (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (downsample): Sequential(
   (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
   (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(1): BasicBlock(
  (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(layer3): Sequential(
 (0): BasicBlock(
  (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (downsample): Sequential(
   (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
   (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
 )
 (1): BasicBlock(
  (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(layer4): Sequential(
 (0): BasicBlock(
  (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (downsample): Sequential(
    (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
 )
 (1): BasicBlock(
  (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
 )
)
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=512, out_features=136, bias=True)
)

Losses over each epoch


Raw data (validation):

153.71337484885117,
128.88900953281424,
91.8845456488832,
64.59337344141063,
41.3287937569761,
34.247558496669384,
27.970341104233338,
25.27063581186854,
28.61103214903506,
22.455415200330542,
20.843517247074377,
19.391145770421286,
20.757094388236542,
19.84007723531323,
19.1822521408161,
18.67110346962592,
18.38411842920109,
18.7877002369144,
17.66790440839208,
17.341433541503495,
17.93859914177192,
17.44737537892279,
17.890463032408388,
18.242590165423774,
17.913578881475026
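For reference, the figure above can be reproduced from this list with a few lines of matplotlib (values rounded here for brevity):

import matplotlib.pyplot as plt

val_losses = [153.71, 128.89, 91.88, 64.59, 41.33, 34.25, 27.97, 25.27,
              28.61, 22.46, 20.84, 19.39, 20.76, 19.84, 19.18, 18.67,
              18.38, 18.79, 17.67, 17.34, 17.94, 17.45, 17.89, 18.24, 17.91]

plt.plot(range(1, len(val_losses) + 1), val_losses, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Validation loss")
plt.title("Losses over each epoch")
plt.show()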






The small red dots identify my predicted points.

Some hits:




Some misses:

He's not that chubby.

But why did it miss?

I don't think there's any single strong reason for the misidentifications in this set. In fact, you can see that many of my "good" images above share some of the same characteristics as these "bad" ones. Personally, I think a lack of strength in the image augmentation led to these results; maybe I could have focused on more examples where the faces are in unusual orientations and lighting scenarios.

The most obvious errors are those where the edge of the face is mapped off of the face entirely. This was probably caused by a number of face points lying outside the bounding boxes or being obscured in the original photos, with the model learning that some face points can still be placed even when the full face isn't visible.
As you can see from the "bad" predictions, and even from many of the "good" ones, my model tends to overshoot how large a face really is.

Testing on my own images

You'll notice the same overshoot problem as with the regular images.
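These predictions come from running the trained model on each photo. Here is a sketch of that inference step; the 224 x 224 input size, the grayscale conversion, and the rescaling of outputs back to pixel coordinates are my assumptions about the preprocessing, and any normalization used during training would need to match:

import torch
from PIL import Image
import torchvision.transforms.functional as TF

def predict_keypoints(model, path, size=224):
    img = Image.open(path).convert("L")                  # grayscale, as in training
    w, h = img.size
    x = TF.to_tensor(TF.resize(img, (size, size)))       # (1, size, size) in [0, 1]
    model.eval()
    with torch.no_grad():
        pts = model(x.unsqueeze(0)).view(-1, 2)          # 68 (x, y) pairs in resized coords
    # scale predictions back to the original image's pixel coordinates
    pts[:, 0] *= w / size
    pts[:, 1] *= h / size
    return pts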

Obama's face actually looks quite great!



My face, though, is a little bit of a mixed bag. The model seems to have grasped some of the facial features correctly, but it mistook the forest behind me for the side of my face.



As for Gigachad, the neural network has never seen such gorgeousness in a single photo, and especially not with his head turned. I suspect that this misidentification is due to both the lower contrast with the background and the uncharacteristic squareness of his face (most faces are rounder).