CS194 Project 4: Facial Keypoint Detection with Neural Networks

Cesar Plascencia Zuniga

Overview

Machine learning has been all the buzz recently, especially in the area of computer vision. One popular application is facial recognition and keypoint detection. In what follows, we will see how a computer can 'learn' to identify facial keypoints. We will be using the IMM Face Database for the first two parts and the IBUG database for the third.

Part 1: Nose Tip Detection

Instead of finding all keypoints at once, let us start by trying to identify just the nose tip in an image. For example, these are the targets we are trying to find. The green dot marks the true position of the nose tip, while the red dot is our model's current guess.

Let us now train our model to improve on this guess and get closer to the green dot. The architecture of our model is as follows:
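As a rough PyTorch sketch of what such a model might look like (the channel counts and the 60x80 grayscale input size below are illustrative assumptions, not the exact configuration used here):

    import torch.nn as nn

    class NoseTipNet(nn.Module):
        """Small CNN that regresses the (x, y) nose tip position.

        Layer sizes are illustrative; input is assumed to be a 1x60x80
        grayscale image with values in [0, 1].
        """

        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 12, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 60x80 -> 30x40
                nn.Conv2d(12, 24, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 30x40 -> 15x20
                nn.Conv2d(24, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 15x20 -> 7x10
            )
            self.regressor = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 7 * 10, 128), nn.ReLU(),
                nn.Linear(128, 2),  # (x, y) of the nose tip
            )

        def forward(self, x):
            return self.regressor(self.features(x))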

The dataset was then split into training and validation sets. The model was trained with MSELoss and a learning rate of 0.001 for 25 epochs, producing the following results.
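A minimal sketch of the corresponding training loop, assuming an Adam optimizer and hypothetical train_loader / val_loader DataLoaders; the MSE loss, learning rate of 0.001, and 25 epochs match the setup above:

    import torch
    import torch.nn as nn

    def train(model, train_loader, val_loader, epochs=25, lr=1e-3, device="cpu"):
        """Train a keypoint regressor with MSE loss, reporting per-epoch losses."""
        model = model.to(device)
        criterion = nn.MSELoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)

        for epoch in range(epochs):
            model.train()
            train_loss = 0.0
            for images, keypoints in train_loader:
                images, keypoints = images.to(device), keypoints.to(device)
                optimizer.zero_grad()
                loss = criterion(model(images), keypoints)
                loss.backward()
                optimizer.step()
                train_loss += loss.item()

            model.eval()
            val_loss = 0.0
            with torch.no_grad():
                for images, keypoints in val_loader:
                    images, keypoints = images.to(device), keypoints.to(device)
                    val_loss += criterion(model(images), keypoints).item()

            print(f"epoch {epoch + 1:2d}: "
                  f"train {train_loss / len(train_loader):.5f}  "
                  f"val {val_loss / len(val_loader):.5f}")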

Now, with a trained model, our predictions move much closer to the true nose position.

Failure Cases

We see two failure cases above, both from our validation data. The most likely reasons for these failures are face orientations that are underrepresented in our training data and/or the model overfitting to a specific position in the image.

Part 2: Full Facial Keypoints Detection

Now that we have a good understanding, we can extend our model to capture all 58 keypoints of the face! We modify our model slightly to have the following architecture:
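A hedged sketch of what this extended model might look like: a deeper convolutional stack (the channel counts and the 1x120x160 input size are again illustrative assumptions) whose final layer outputs 58 x 2 = 116 values, one (x, y) pair per keypoint.

    import torch.nn as nn

    class FaceKeypointNet(nn.Module):
        """CNN regressing a full set of facial keypoints.

        Channel counts are illustrative; input is assumed to be a
        1x120x160 grayscale image. Output has shape (batch, num_keypoints, 2).
        """

        def __init__(self, num_keypoints=58):
            super().__init__()
            self.num_keypoints = num_keypoints
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 120x160 -> 60x80
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 60x80  -> 30x40
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 30x40  -> 15x20
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 15x20  -> 7x10
            )
            self.regressor = nn.Sequential(
                nn.Flatten(),
                nn.Linear(128 * 7 * 10, 256), nn.ReLU(),
                nn.Linear(256, num_keypoints * 2),  # one (x, y) pair per keypoint
            )

        def forward(self, x):
            out = self.regressor(self.features(x))
            return out.view(-1, self.num_keypoints, 2)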

Before training our model, we get the following outputs:

After training our model for 25 epochs with MSELoss and a learning rate of 0.001, as before, we obtain the following results:

Now that our model has been trained, let us see the results.

Failure Case 1
Failure Case 2

Again, we have some failure cases. Here the angle of the face seems to play a larger role in prediction error, as our model appears to have learned the most common, on-center face orientation.

Lastly, we can visualize some of the learned filters from our model! Here are a few of them.

First layer
Second layer
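These grids are just the learned convolution kernels of the trained network. A short matplotlib sketch of how such filters can be pulled out of a model (assuming the convolutional layers live in a features Sequential, as in the hypothetical sketches above):

    import matplotlib.pyplot as plt

    def show_filters(conv_layer, cols=8):
        """Plot each kernel of a Conv2d layer as a small grayscale image."""
        weights = conv_layer.weight.data.cpu().numpy()  # (out_channels, in_channels, k, k)
        kernels = weights[:, 0]                         # first input channel of each filter
        rows = (len(kernels) + cols - 1) // cols
        fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
        for ax, kernel in zip(axes.flat, kernels):
            ax.imshow(kernel, cmap="gray")
        for ax in axes.flat:
            ax.axis("off")
        plt.show()

    # Example usage: show_filters(model.features[0]) for the first conv layer.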

Part 3: Train With Larger Dataset

Now is when the real fun begins. We will train the same model from Part 2 on a much larger dataset of 6666 images. The only change is the output layer, which now predicts 68 landmarks instead of 58; the rest of the architecture is identical to Part 2. Here is what our images look like before training:
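As an aside, a hedged sketch of that output-layer change, reusing the hypothetical FaceKeypointNet class from the Part 2 sketch above (sizes remain illustrative assumptions):

    import torch

    # Same architecture as Part 2, but regressing 68 landmarks instead of 58,
    # so the final layer outputs 68 * 2 = 136 values.
    model = FaceKeypointNet(num_keypoints=68)

    dummy = torch.zeros(1, 1, 120, 160)  # assumed input size from the Part 2 sketch
    print(model(dummy).shape)            # torch.Size([1, 68, 2])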

After hours of training, our model produced a mean absolute error of 324.7 and the following results on our training data:
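For reference, the mean absolute error here is simply the average absolute difference between predicted and ground-truth landmark coordinates; a rough sketch, assuming predictions and targets are tensors of shape (N, 68, 2) in pixel space:

    import torch

    def mean_absolute_error(preds, targets):
        """Average absolute coordinate difference over all images and landmarks."""
        return (preds - targets).abs().mean().item()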

We also get the following results on our testing data:

Lastly, we also have some failure cases. These examples show how awkward perspectives and background features can trick our model into placing points where they shouldn't be. The best results come from images with well-defined face bounding boxes and straight-on perspectives.