Facial Keypoint Detection with Neural Networks

By Joshua Levine

*Note: in all images below that show the predictions of a network, green points are the ground-truth labels, while red points are the outputs from the network. Additionally, all loss graphs have the training loss in orange and the test loss (which is actually the validation loss) in blue, unless otherwise indicated. Lastly, assume that any parameters to the PyTorch nn layers that aren't specified take the PyTorch default values. Since I pasted the function calls themselves, any arguments I left at their defaults are not written out explicitly below. For reference, these are the signatures of the relevant layers, from the documentation:

Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', device=None, dtype=None)

MaxPool2d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False)

Linear(in_features, out_features, bias=True, device=None, dtype=None)

Part 1: Nose Tip Detection

Here are some images from my data loader, which resized images to be 80x60:

img img img
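To give a sense of the pipeline, a data loader along these lines could look like the sketch below; the class name, file handling, and normalization are illustrative assumptions rather than my exact code.

import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image

class NoseTipDataset(Dataset):
    """Illustrative dataset: a grayscale image resized to 80x60 plus its nose-tip point."""

    def __init__(self, image_paths, nose_points):
        self.image_paths = image_paths  # list of image file paths
        self.nose_points = nose_points  # (N, 2) array of nose-tip coordinates

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("L").resize((80, 60))
        # Scale pixels to zero-centered floats (an assumed normalization).
        img = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0 - 0.5)
        point = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return img.unsqueeze(0), point  # image tensor has shape (1, 60, 80)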

For this part, I used a batch size of 64, with a network containing:

nn.Conv2d(1, 16, 3, padding=1)
nn.Conv2d(16, 32, 3, padding=1)
nn.Conv2d(32, 32, 3, padding=1)
nn.Conv2d(32, 16, 3, padding=1)
nn.Linear(240, 128)
nn.Linear(128, 2)

Each convolution was followed by a ReLU and a max_pool2d(x, 2, 2), and the two linear layers were separated by a ReLU. I used a learning rate of 1e-3 and trained for 20 epochs. A sketch of this network as a PyTorch module is below, followed by a plot of my losses and some predictions.
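Here is a minimal sketch of that architecture as a PyTorch module; the class and attribute names are my own, and only the layers, ReLUs, and pooling described above are assumed.

import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    """Illustrative sketch of the nose-tip network described above."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.conv3 = nn.Conv2d(32, 32, 3, padding=1)
        self.conv4 = nn.Conv2d(32, 16, 3, padding=1)
        self.fc1 = nn.Linear(240, 128)  # 16 channels * 3 * 5 spatial positions after four pools of an 80x60 input
        self.fc2 = nn.Linear(128, 2)    # (x, y) of the nose tip

    def forward(self, x):
        # Each convolution is followed by a ReLU and a 2x2 max pool.
        for conv in (self.conv1, self.conv2, self.conv3, self.conv4):
            x = F.max_pool2d(F.relu(conv(x)), 2, 2)
        x = x.flatten(1)
        return self.fc2(F.relu(self.fc1(x)))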

img

Good images

img img

Bad images

img img

I think it failed in the cases above because the images are atypical: one has an exaggerated expression, and the other face is turned significantly to the side. Such images were rare in the training set, so the network is less familiar with them.

Here is a loss graph that I got when I increased the learning rate to 1e-2. You can see that while it converges faster than my final result, the loss doesn’t get as low, likely because it gets stuck in a local minimum.

img

Here is a loss graph with a learning rate of 1e-3, as I used in my final result. However, this model switches all of the convolution layers' channel counts to 16, except for the out_channels of the third layer and the in_channels of the fourth layer. It also has a smaller linear layer, with 32 features in place of 128. It isn't easy to tell from the graph below, but the loss of this model was slightly higher than my final result's, and that small numerical difference made a big difference in how the visualized results looked.

img

This part was fairly simple, because the restrictions on the net in the spec limited the room for error. I was a bit disappointed in the model's performance on turned faces, but it went well otherwise.

Part 2: Full Facial Keypoints Detection

For this section, I resized all images to 240x180. I also augmented the training data using a random affine transformation: a random rotation between -15 and 15 degrees, a random shear with an angle between -15 and 15 degrees, and a random translation between -10 and 10 pixels in each direction. I also applied a random color jitter, with 0.3 as the input for all parameters. Below are some augmented images from my data loader.

img img img
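Roughly, these augmentations could be sketched with torchvision as below; the helper is illustrative, and the same sampled affine parameters would also need to be applied to the keypoint labels, which is not shown here.

import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

color_jitter = T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.3)

def augment(img):
    """Randomly rotate, shear, translate, and color-jitter a PIL image."""
    angle = random.uniform(-15, 15)   # rotation, degrees
    shear = random.uniform(-15, 15)   # shear angle, degrees
    tx = random.randint(-10, 10)      # horizontal translation, pixels
    ty = random.randint(-10, 10)      # vertical translation, pixels

    img = TF.affine(img, angle=angle, translate=[tx, ty], scale=1.0, shear=[shear, 0.0])
    img = color_jitter(img)
    # Return the sampled parameters so the caller can transform the keypoints too.
    return img, (angle, tx, ty, shear)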

I used a learning rate of 1e-4, and a batch size of 64, with an Adam optimizer. I trained for 60 epochs. My network architecture was:

nn.Conv2d(1, 32, 3, padding=1)
F.max_pool2d(x, 2, 2)
nn.Conv2d(32, 64, 3, padding=1)
nn.Conv2d(64, 128, 3, padding=1)
nn.Conv2d(128, 256, 3, padding=1)
nn.Conv2d(256, 256, 3, padding=1)
nn.Conv2d(256, 256, 3, padding=1)
F.max_pool2d(x, 2, 2)
nn.Linear(307200, 1024)
nn.Linear(1024, 1024)
nn.Linear(1024, 58 * 2)

All non-pool layers, except for the last linear layer, were followed by ReLUs. A sketch of the training loop is below, followed by my loss graphs and some samples.
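As a rough illustration of this training setup, a minimal loop might look like the following; the model and data loader are assumed to be constructed elsewhere, the names are placeholders rather than my actual classes, and the MSE loss is an assumption.

import torch
import torch.nn as nn

def train(model, train_loader, epochs=60, lr=1e-4):
    """Train a keypoint regressor with MSE loss and Adam, using the settings above."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        for images, keypoints in train_loader:
            optimizer.zero_grad()
            pred = model(images)  # (B, 58 * 2) flattened keypoint predictions
            loss = criterion(pred, keypoints.view(pred.shape[0], -1))
            loss.backward()
            optimizer.step()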

img

Good images

img img

Bad images

img img

I think it failed in the cases above because the images differ from most of the training images: one has an exaggerated expression, and the other face is turned significantly. Such images were rare in the training set, so the network fits faces pointed at the camera with subtle expressions better than more expressive or turned ones.

Here is my loss graph with only two linear layers. The model clearly underfits: it converges far too quickly and to a high loss. To fix this, I added an additional layer. Since it was still underfitting, I also increased the layer sizes after this run.

img

In this example, my learning rate was too high; one can see this because the loss diverges. This occurred because I decreased the batch size without also decreasing the learning rate, which was 1e-2. Only the learning rate was changed.

img

Here are the filters of my first convolutional layer:

img

This part was easy to get working, but it took a lot of tuning to reach my final result. Again, I wish it performed better on turned and expressive faces, but I'm satisfied otherwise. The hardest part was settling on the actual network architecture.

Part 3: Train With Larger Dataset

Kaggle MAE (submitted under Joshua Levine, https://www.kaggle.com/jlevine12): 10.58514

In my data loader, I started by resizing the bounding boxes so that all keypoints fall inside them, while ensuring they stay square. I then padded each axis by 12%. Finally, I cropped the images and applied augmentations when training. I used the same rotations, color jitters, and shears as in the last part, but because of the bounding boxes, translations didn't actually do anything, so I left them out. Here are some images I loaded:

img img img
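A sketch of that bounding-box adjustment is below; the helper name and the (x_min, y_min, x_max, y_max) box format are assumptions, and my actual code may differ in the details.

import numpy as np

def adjust_bbox(bbox, keypoints, pad_frac=0.12):
    """Grow a box to contain all keypoints, keep it square, then pad each axis by pad_frac."""
    x_min = min(bbox[0], keypoints[:, 0].min())
    y_min = min(bbox[1], keypoints[:, 1].min())
    x_max = max(bbox[2], keypoints[:, 0].max())
    y_max = max(bbox[3], keypoints[:, 1].max())

    # Keep the box square by growing the shorter side about the box center.
    side = max(x_max - x_min, y_max - y_min)
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    half = side / 2 * (1 + pad_frac)  # pad by 12% on each axis by default
    return cx - half, cy - half, cx + half, cy + half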

I used the torchvision model resnet18, whose architecture, from Deep Residual Learning for Image Recognition, is shown below.

img

I then replaced the first convolution layer with an identical one, except with a single input channel, to account for the grayscale input. I also replaced the fully connected layer with an identical one, but with 136 output features (one per keypoint coordinate). I used the pretrained model and trained for 50 epochs on the augmented data, with a batch size of 64 and a learning rate of 1e-4. Due to some issues with W&B, which I used to log the training, and my notebook timing out, my loss graph only covers the first 25 epochs; I was able to recover the losses for the remaining epochs, so it is split into two figures. Below is a sketch of the model setup, followed by the results and some sample images:
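For reference, the model changes might be sketched like this with torchvision; the pretrained flag is shown in its older form (newer torchvision versions take a weights argument instead).

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)  # start from ImageNet-pretrained weights

# Swap the first convolution for a single-channel version (grayscale input),
# keeping the other hyperparameters identical to the original layer.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Replace the classification head with a 136-way regression head
# (two coordinates for each of the 68 keypoints).
model.fc = nn.Linear(model.fc.in_features, 136)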

img img

Outputs:

img img img

The following are bad results. I suspect they are due to the low contrast between the face and the background, and to these faces being fairly thin.

img img

Here is the model tested on some of my own images. The first two worked well because they are well cropped. I cropped several of the images myself because that was easier than providing bounding boxes.

img img

The next image is bad because it isn't cropped. This causes problems because the training data was cropped using bounding boxes.

img

The next image is also bad, but this one didn't work well because the face is largely obstructed. This image was also cropped.

img

I thought it would be cool to try it on a painting, too, and it worked pretty well.

img

This part was the most time-consuming. It was pretty easy to implement because I used a predefined network, but it was very challenging to train because I had to deal with a lot of timeouts and other issues. Each run took several hours, so I was fortunate to get good results quickly.