Facial Keypoint Detection

Kaushal Partani

In this project, we build off of our previous project and learn to use neural networks to detect facial keypoints. The results are fairly impressive, and can be quite accurate depending on training time and choice of model.

Part 1: Nose Tip Detection

In this part, I first had to write a custom dataloader that gave me access to all of the training images and their nose tip keypoints. All images were grayscale and of size 60x80.
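A minimal sketch of the kind of Dataset class this requires, assuming a precomputed list of (image path, nose point) pairs; the file parsing is omitted and the names here are my own, not an exact transcript of my code:

```python
import cv2
import torch
from torch.utils.data import Dataset

class NoseDataset(Dataset):
    """Serves (image, nose point) pairs. `samples` is an assumed
    precomputed list of (image_path, (x, y)) tuples."""
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, (x, y) = self.samples[idx]
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, (80, 60))                   # 60x80 (HxW) grayscale
        img = torch.from_numpy(img).float() / 255 - 0.5   # normalize to ~[-0.5, 0.5]
        return img.unsqueeze(0), torch.tensor([x, y], dtype=torch.float32)
```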

Ground truth points for nose

image.png

image.png

image.png

image.png

I created the following neural network for this part:

image.png

I tried multiple learning rates, but ultimately decided on the following parameters (a sketch of the training loop follows the list):

  • Learning rate: .001
  • Epochs: 20
  • Batch Size: 1
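The training loop itself is the standard PyTorch pattern: MSE loss on the predicted point and one optimizer step per image. A sketch, where the optimizer choice (Adam) and the NoseNet / train_set / val_set names are assumptions:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

model = NoseNet()                                    # the small conv net above
train_loader = DataLoader(train_set, batch_size=1, shuffle=True)
val_loader = DataLoader(val_set, batch_size=1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(20):
    model.train()
    train_loss = 0.0
    for img, pt in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(img), pt)
        loss.backward()                              # backprop through the net
        optimizer.step()
        train_loss += loss.item()
    model.eval()
    val_loss = 0.0
    with torch.no_grad():                            # no updates on validation data
        for img, pt in val_loader:
            val_loss += criterion(model(img), pt).item()
    print(f"epoch {epoch}: train {train_loss / len(train_loader):.5f}, "
          f"val {val_loss / len(val_loader):.5f}")
```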

These parameters result in the following losses:

image.png

This graph makes sense, since we expect our validation error to be higher than the training error: the model is never fit to the validation set, so its loss on that data stays non-negligible.

Here are some results that worked well and poorly:

Worked Well (Predicted = Blue Dot, Truth = Red Plus)

image.png

image.png

Worked Poorly (Predicted = Blue Dot, Truth = Red Plus)

image.png

image.png

I believe this model doesn't work as well for the sideways-facing faces due to the small net and small training set that we have to work with. The prediction looks like one for a forward-facing face, since the keypoint lands in the "center" of the head. After taking a look at the training set, more pictures are forward-facing than sideways-facing, which somewhat explains why our model outputs a "forward-facing" point.

Part 2: Facial Keypoints Detection

Now we move on from just the nose keypoint to all facial keypoints. For this part, I started by writing a new dataloader that adds data augmentation: rotating images by up to 10 degrees and color jittering them. The images in this part are also grayscale, but scaled to a higher resolution, 120x160, since we're using a more involved net.
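A sketch of the augmentation step, assuming pixel-coordinate keypoints (the helper name and conventions are mine). The important detail is that the keypoints have to be rotated about the image center along with the pixels:

```python
import math
import numpy as np
import torch
import torchvision.transforms.functional as TF

def augment(img, pts):
    """img: 1xHxW float tensor in [0, 1]; pts: Nx2 tensor of pixel (x, y)
    keypoints. Applies brightness jitter plus a random +/-10 degree rotation."""
    # Color jitter: scale brightness by a random factor in [0.8, 1.2].
    img = TF.adjust_brightness(img, 0.8 + 0.4 * torch.rand(1).item())

    # Rotate the image counter-clockwise by a random angle.
    angle = float(np.random.uniform(-10, 10))
    img = TF.rotate(img, angle)

    # Rotate the keypoints about the image center by the same angle.
    # In image coordinates (y points down), a counter-clockwise rotation
    # moves an offset (dx, dy) to (dx*cos + dy*sin, -dx*sin + dy*cos).
    h, w = img.shape[-2:]
    cx, cy = w / 2, h / 2
    t = math.radians(angle)
    dx, dy = pts[:, 0] - cx, pts[:, 1] - cy
    return img, torch.stack([cx + dx * math.cos(t) + dy * math.sin(t),
                             cy - dx * math.sin(t) + dy * math.cos(t)], dim=1)
```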

Ground truth points for full facial keypoints

image.png

image.png

image.png

image.png

I created the following neural network for this part:

image.png

I tried a couple of learning rates for this part, but opted to pick one between .0001 and .001:

  • Learning Rate: .0005
  • Epochs: 20
  • Batch Size: 1

These parameters result in the following losses:

image.png

Again, this makes sense because we expect the validation loss to be greater than the training loss: we never train on the validation set, so it is fresh data to the model.

Visualizing the Neural Net's Learned Filters

Conv1

image.png

Conv2

image.png

Conv3

image.png
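These come straight out of the trained model's conv layers. A minimal sketch of how they can be plotted (the layer attribute name conv1 is an assumption about my net):

```python
import matplotlib.pyplot as plt

weights = model.conv1.weight.detach().cpu()          # (out_ch, in_ch, kH, kW)
fig, axes = plt.subplots(1, weights.shape[0], figsize=(2 * weights.shape[0], 2))
for i, ax in enumerate(axes):
    ax.imshow(weights[i, 0].numpy(), cmap='gray')    # channel 0 of filter i
    ax.axis('off')
plt.show()
```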

Worked Well (Predicted = Blue Dot, Truth = Red Plus)

image.png

image.png

Worked Poorly (Predicted = Blue Dot, Truth = Red Plus)

image.png

image.png

It seems clear to me that the facial keypoint detection still isn't that great for people who are looking to the side. Since our training set doesn't change from part 1, I still believe it's too small and doesn't have enough variability to cover the sideways-looking faces. The majority of the images are forward-facing people with fairly neutral expressions, which I think is the major reason the sideways-looking faces could still use work. If we take a closer look, the sideways-looking people still have predicted keypoints that lean toward the center rather than to the sides (the predicted keypoints are not as extremely shifted as we'd hope), and I think this makes it clear that our model is not as robust and well trained as we'd like it to be.

Part 3: Train with Larger Dataset

I started this part off by creating a new dataloader once more. In this case, we needed to add more functionality, since we had to crop the faces as well as read the annotations from an XML file. I created a dataloader that cropped and resized my images for me while also performing the same data augmentation that I did in part 2.
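The annotations come in the dlib-style XML used by the ibug 300-W labels, where each <image> element holds a <box> (the face bounding box) whose children are the 68 <part> keypoints. A sketch of the parsing (the helper name is mine):

```python
import xml.etree.ElementTree as ET
import numpy as np

def parse_annotations(xml_path):
    """Returns a list of (filename, (left, top, w, h) box, 68x2 points)."""
    samples = []
    for image in ET.parse(xml_path).getroot().iter('image'):
        box = image.find('box')
        crop = (int(box.get('left')), int(box.get('top')),
                int(box.get('width')), int(box.get('height')))
        pts = np.array([[float(p.get('x')), float(p.get('y'))]
                        for p in box.findall('part')])
        samples.append((image.get('file'), crop, pts))
    return samples
```

After cropping, the keypoints just need to be shifted by (left, top) and rescaled by the resize factor.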

Ground truth points

image.png

image.png

image.png

image.png

I used a premade resnet18 model for this part. Here is the architecture:

image.png
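For reference, the two standard edits to torchvision's premade resnet18 for this task are swapping the first conv to accept grayscale input and the final fully connected layer to regress 68 (x, y) pairs; this is a sketch of that idea, not a transcript of my exact changes:

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=False)
# 1-channel grayscale input instead of the default 3-channel RGB.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Regress 68 keypoints, i.e. 136 output values.
model.fc = nn.Linear(model.fc.in_features, 68 * 2)
```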

This model is already much more involved than our previous simple models, so along with the larger training set, we hope the results will be much more robust and well trained.

Since the model takes so long to train, I trusted my gut instinct and went with the following parameters:

  • Learning rate: .0001
  • Epochs: 10
  • Batch Size: 1

My model trained for around 1.5 hours and resulted in the following loss diagram:

image.png

Results

image.png

image.png

image.png

image.png

image.png

image.png

image.png

It's scary how well this model works! For some images, like the second-to-last and third-to-last ones above, the eyes are slightly mismatched, but aside from that, the facial keypoints seem at first glance to be pretty spot on!

My collection

I tried out the net on some of my own images, as seen below:

image.png

image.png

image.png

image.png

image.png

While these keypoints are good, they could be better. In general, when an image shows more than just the face, the net is worse at finding the facial keypoints. For example, in the first three pictures (me, Joe Biden, and George Clooney), the eyes are mismatched and tend to be predicted higher than they actually are. However, when I cropped in closer to the face, as in the Kamala Harris picture, the eyes land in a much more reasonable spot and the overall facial keypoint detection is pretty good! My performance here could have improved a bit if I had created exact bounding boxes for the faces, but for these images, to simulate a real-life scenario where this net could be used, I took rough crops of the faces instead. Regardless, the model performs pretty well!
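A sketch of the crop-and-predict step used on these photos, where the 224x224 input size, the normalization, and the helper name are all assumptions:

```python
import cv2
import torch

def predict_keypoints(model, path, crop):
    """crop is a hand-picked (x, y, w, h) face box in the original image."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    x, y, w, h = crop
    face = cv2.resize(img[y:y + h, x:x + w], (224, 224))
    inp = torch.from_numpy(face).float().div(255).sub(0.5)[None, None]
    model.eval()
    with torch.no_grad():
        pts = model(inp).view(68, 2)
    # Map predictions (in resized-crop pixels) back onto the original image.
    pts[:, 0] = pts[:, 0] * (w / 224) + x
    pts[:, 1] = pts[:, 1] * (h / 224) + y
    return pts
```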
