Programming Project #5: Facial Keypoint Detection with Neural Networks

Niraek Jain-Sharma

Part 1: Nose Tip Detection

In this part of the project, we create a dataloader for the IMM Face Database that loads each face's keypoints and extracts the nose keypoint. We convert all faces to grayscale, normalize their pixel values to the range [-0.5, 0.5], and resize them to 80x60. Below are the first four images shown with all keypoints, as well as with just the nose point.

First four images: keypoints + nose point.
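Roughly, the dataloader looks like the sketch below. This is a simplified version, not the exact code: the nose-tip landmark index and the file handling are assumptions.

```python
import torch
import numpy as np
import cv2
from torch.utils.data import Dataset

class NoseDataset(Dataset):
    """Loads IMM faces as grayscale in [-0.5, 0.5], resized to 80x60,
    and returns the nose-tip keypoint in normalized (x, y) coordinates."""
    def __init__(self, image_paths, keypoints):
        # keypoints: list of (58, 2) arrays with coordinates normalized to [0, 1]
        self.image_paths = image_paths
        self.keypoints = keypoints

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx])
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
        img = img / 255.0 - 0.5                 # normalize to [-0.5, 0.5]
        img = cv2.resize(img, (80, 60))         # width x height = 80 x 60
        nose = self.keypoints[idx][-6]          # nose-tip landmark index (assumption)
        return (torch.from_numpy(img).unsqueeze(0),
                torch.tensor(nose, dtype=torch.float32))
```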

Next, we create a CNN with the architecture found below. There are three convolution layers: the first takes in 1 channel and outputs 16 with a 5x5 kernel, the second takes in 16 and outputs 16 with a 3x3 kernel, and the last takes in 16 and outputs 32 with a 3x3 kernel. We apply a ReLU and max-pool after each convolution layer, and finish with two fully connected layers going from 1280 to 128 to 2, for the nose keypoint.
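In PyTorch this architecture might look like the following sketch (unpadded convolutions and 2x2 pooling are assumed; with a 60x80 input, that is what makes the flattened size come out to 1280):

```python
import torch.nn as nn

class NoseNet(nn.Module):
    """Three conv layers (1->16 5x5, 16->16 3x3, 16->32 3x3), each followed by
    ReLU and 2x2 max-pooling, then FC 1280 -> 128 -> 2 for the nose (x, y)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 16, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1280, 128),   # 32 channels * 5 * 8 spatial = 1280 on a 60x80 input
            nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, x):
        return self.fc(self.features(x))
```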

Finally, we train this CNN using mean squared error loss and the Adam optimizer with a learning rate of 1e-3. We run this for 15 epochs, with the loss seen below.
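The training loop is standard; a minimal sketch (device handling and logging omitted):

```python
import torch

def train(model, train_loader, val_loader, epochs=15, lr=1e-3):
    """MSE loss + Adam, printing average train/validation loss per epoch."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for imgs, pts in train_loader:
            opt.zero_grad()
            loss = criterion(model(imgs), pts)
            loss.backward()
            opt.step()
            train_loss += loss.item()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(i), p).item() for i, p in val_loader)
        print(f"epoch {epoch}: train {train_loss / len(train_loader):.5f}  "
              f"val {val_loss / len(val_loader):.5f}")
```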

Let's visualize some of the results; some are good and some are bad. I believe the good examples are mostly facing forward, so the model does better on those, while the bad ones are faces turned at odd angles, and the model probably doesn't have enough training data to do well on these yet.

Left column = Good, Right Column = Bad

Let's also test some different hyperparameters to see how the losses change. First, let's change the learning rate to 0.003.

Next, let's change the number of convolution layers from three to two, so we just have convolutions going from 1 to 16 and from 16 to 16.

Summing it up, the higher learning rate makes the training error jump around more, probably because it keeps skipping over local minima. And having fewer layers seems weaker in terms of the validation error converging to a lower value; it keeps flipping back and forth.

Part 2: Full Facial Keypoints Detection

In this part, we try to predict ALL of the keypoints. So, we create a dataloader that loads in the data and augment it with ColorJitter as well as horizontal flips so that we have more data. See below for some of the images.
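One detail worth noting: jitter only changes pixel values, but a horizontal flip must also mirror the x-coordinates of the keypoints. A rough sketch of the augmentation (the jitter strengths and the flip probability are assumptions):

```python
import random
import torchvision.transforms.functional as TF
from torchvision import transforms

jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3)  # strengths are assumptions

def augment(pil_img, keypoints):
    """keypoints: (58, 2) numpy array with x, y normalized to [0, 1]."""
    img = jitter(pil_img)
    if random.random() < 0.5:
        img = TF.hflip(img)
        keypoints = keypoints.copy()
        keypoints[:, 0] = 1.0 - keypoints[:, 0]   # mirror x about the image center
    return img, keypoints
```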

As usual, we make a CNN for this data. The CNN architecture is below, so I won't bother reiterating it here, but one thing to note is that I added a ReLU after every convolutional layer and a max-pool after the first, second, and third, but not after the fourth and fifth. Fiddling around, I found this was the most effective. Also, I went from a kernel of 7 to 5, 5, and finally 3, 3, because lowering the kernel size over the layers seemed to make the model perform better in my tests. My fully connected layers go from 2688 to 116, since we need 116 outputs for the 58 keypoints. I chose a learning rate of 0.001 again with mean squared error loss, and trained for 15 epochs.
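A sketch of what this network could look like is below. Only the kernel sizes, the pooling placement, and the 2688-to-116 head come from the description above; the intermediate channel counts and the 120x160 input size are assumptions, chosen so the flattened size works out to 2688 and the first layer has the 8 filters shown later.

```python
import torch.nn as nn

class FaceNet(nn.Module):
    """Five conv layers with kernels 7, 5, 5, 3, 3; ReLU after every conv,
    2x2 max-pooling after the first three only, then FC 2688 -> 116
    (58 keypoints * 2 coordinates)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 7),   nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 5),  nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3), nn.ReLU(),
            nn.Conv2d(32, 32, 3), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(2688, 116))

    def forward(self, x):
        # On a 120x160 grayscale input this flattens to 32 * 7 * 12 = 2688.
        return self.fc(self.features(x))
```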

Let's visualize some examples and see how this model does. The first column is good, second column is bad.

Again, we see that the forward-facing faces do a lot better than the tilted faces. However, the lower-right face was mostly pointing straight ahead, so perhaps that particular facial structure isn't very common, or the slight tilt threw off the model. Now, let's visualize the filters of our model.


All filters of first convolution layer (8 in total)
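The filters can be pulled straight out of the first conv layer's weight tensor. A small sketch, assuming the FaceNet layout above:

```python
import matplotlib.pyplot as plt

def show_first_layer_filters(model):
    """Plot each 7x7 kernel of the first conv layer as a grayscale image."""
    weights = model.features[0].weight.detach().cpu()   # shape (8, 1, 7, 7)
    fig, axes = plt.subplots(1, weights.shape[0], figsize=(2 * weights.shape[0], 2))
    for i, ax in enumerate(axes):
        ax.imshow(weights[i, 0], cmap="gray")
        ax.axis("off")
    plt.show()
```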

Part 3: Train With Larger Dataset

The grand finale is that we need to train on a huge dataset. First, we must load in all of the images and keypoints by parsing them out of XML files. We then crop the faces using the bounding boxes provided, and drop all images with incorrect bounding boxes, e.g. bounding boxes that fall outside the image itself. Moreover, we clip all keypoints that fall outside the image to the image edge, e.g. if a coordinate is negative, we set it to 0. Below you can see examples of this:
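A rough sketch of this loading step is below. It assumes a dlib-style annotation XML (a `box` element with `left`/`top`/`width`/`height` attributes and `part` children with `x`/`y`); the exact schema and the outside-the-image check are simplified.

```python
import xml.etree.ElementTree as ET
import numpy as np

def load_annotations(xml_path):
    """Parse the landmark XML into (filename, box, keypoints) tuples,
    dropping boxes that start outside the image and clipping stray keypoints."""
    samples = []
    for image in ET.parse(xml_path).getroot().iter('image'):
        fname = image.attrib['file']
        box = image.find('box').attrib
        left, top = int(box['left']), int(box['top'])
        w, h = int(box['width']), int(box['height'])
        if left < 0 or top < 0:          # drop boxes that fall outside the image
            continue
        pts = np.array([[float(p.attrib['x']), float(p.attrib['y'])]
                        for p in image.find('box').iter('part')])
        # express keypoints relative to the crop and clip them to its edges
        pts -= [left, top]
        pts[:, 0] = np.clip(pts[:, 0], 0, w - 1)
        pts[:, 1] = np.clip(pts[:, 1], 0, h - 1)
        samples.append((fname, (left, top, w, h), pts))
    return samples
```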

As usual, the CNN. We use ResNet-18 for this. To adapt it to grayscale images and to our task, we change the first convolution layer to accept 1 channel instead of 3, and we make the last fully connected layer output 136 values instead of its original 1000. This gives us the 68 landmarks (x and y for each). See below for the entire network. A learning rate of 0.001 and the Adam optimizer were used again.
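The two modifications are small; a sketch using torchvision (whether pretrained ImageNet weights were used is not stated, so it's left off here):

```python
import torch.nn as nn
import torchvision

def make_landmark_resnet():
    """ResNet-18 adapted to 1-channel input and 68 * 2 = 136 outputs."""
    model = torchvision.models.resnet18(pretrained=False)
    # swap the stem to accept grayscale input (1 channel instead of 3)
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # swap the classification head for a 136-dimensional regression head
    model.fc = nn.Linear(model.fc.in_features, 136)
    return model
```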

Now we visualize the results on some of the test images. Note that for the test images we adjusted the bounding boxes by clamping them inside the image and making sure no box had degenerate bounds (identical coordinates, i.e. zero width or height).
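That test-time cleanup can be done with a small helper like this (a sketch; the minimum-extent rule of one pixel is an assumption):

```python
def sanitize_box(left, top, width, height, img_w, img_h):
    """Clamp a test-time bounding box to the image and avoid degenerate boxes."""
    left, top = max(0, left), max(0, top)
    right = min(img_w, left + max(1, width))      # force at least 1 pixel of extent
    bottom = min(img_h, top + max(1, height))
    return left, top, right - left, bottom - top
```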

As we can see, our model doesn't do too well on the test images, even though the loss seemed to decrease significantly. To me this suggests something was wrong with how the bounding boxes were used; the bounding box data was pretty bad, and I didn't find a way to easily fix it so that the model could train well. Below is the loss curve for the model over 30 epochs. My Kaggle score was 241, which is unfortunately terrible.

Lastly, we visualize the model on some freely licensed images I found online. I chose not to use personal pictures because I did not have anyone's consent to put their photos on a public website (and didn't want to use my own face either).




As you can see, the keypoints are roughly in the shape of faces but do not match the faces very well. Hopefully with more time in the future I can explore this again.