Project 5: Facial Keypoint Detection with Neural Networks

In this project, I created convolutional neural networks that would be able to find facial landmarks in images.

Part 1: Nose Tip Detection

Before writing any code defining the CNNs, I created a dataloader to load the images and their labels into the models. Here are a few examples of images with their nose tips labeled:
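For reference, here is a minimal sketch of the kind of Dataset/DataLoader setup this relies on; the class name, the normalized (x, y) label format, and the dummy data are my own illustrative assumptions, not the project's exact code.

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader

    class NoseTipDataset(Dataset):
        """Pairs each grayscale face image with its (x, y) nose-tip label."""
        def __init__(self, images, nose_points):
            self.images = images            # list of H x W float arrays
            self.nose_points = nose_points  # list of (x, y) pairs, assumed normalized to [0, 1]

        def __len__(self):
            return len(self.images)

        def __getitem__(self, idx):
            img = torch.from_numpy(self.images[idx]).float().unsqueeze(0)   # 1 x H x W
            label = torch.tensor(self.nose_points[idx], dtype=torch.float32)
            return img, label

    # Dummy data just to show the shapes; the real images and labels come from
    # the project's face dataset.
    dummy_imgs = [np.random.rand(60, 80).astype(np.float32) for _ in range(8)]
    dummy_pts = [(0.5, 0.5)] * 8
    loader = DataLoader(NoseTipDataset(dummy_imgs, dummy_pts), batch_size=4, shuffle=True)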

I first created a simple model that finds the nose tip in an image. It had four convolutional layers: a 3-by-3 convolution at the beginning, then two 5-by-5 convolutions, and finally another 3-by-3 convolution. Between each pair of convolutional layers was a ReLU activation and a 2D max pool layer that preserved the number of dimensions. The model ended with two fully connected layers: one with 12800 units, followed by a ReLU, and then one with 2 units (the x and y coordinates of the nose tip).
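A rough sketch of that architecture is below. The kernel sizes (3, 5, 5, 3) and the fully connected sizes follow the description above; the channel counts and the plain 2-by-2 max pooling are placeholder assumptions, since the write-up does not specify them.

    import torch
    import torch.nn as nn

    # Sketch of the Part 1 nose-tip network; the channel counts (12, 16, 24, 32)
    # are illustrative placeholders.
    nose_net = nn.Sequential(
        nn.Conv2d(1, 12, 3),  nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(12, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 24, 5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(24, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.LazyLinear(12800),  # first fully connected layer
        nn.ReLU(),
        nn.Linear(12800, 2),   # (x, y) of the nose tip
    )

    # Quick shape check on a dummy grayscale image (resolution chosen arbitrarily).
    print(nose_net(torch.zeros(1, 1, 60, 80)).shape)  # torch.Size([1, 2])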

I initially used a learning rate of 0.5, but that caused my model to underfit and produce bogus predictions. A learning rate of 0.02 worked much better, and the model seemed to actually find the nose tips.

The left image shows the predicted nose tip from a model trained with a learning rate of 0.5, and the right image shows the prediction from a model trained with a learning rate of 0.02.
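For completeness, the training setup was the usual regression recipe, roughly as sketched below (reusing nose_net and loader from the sketches above). Only the MSE loss, the learning rate of 0.02, and the 25 epochs come from this write-up; the choice of Adam is my assumption.

    import torch.nn as nn
    import torch.optim as optim

    criterion = nn.MSELoss()
    optimizer = optim.Adam(nose_net.parameters(), lr=0.02)  # optimizer choice assumed

    for epoch in range(25):
        for imgs, labels in loader:
            optimizer.zero_grad()
            loss = criterion(nose_net(imgs), labels)
            loss.backward()
            optimizer.step()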

This model performed quite well, and had the following training and validation MSE loss curves over time.

Because of the very high error in the first few epochs, it is a bit hard to see the trends later in training, so here are the curves after removing the first three epochs.

In the loss curves above, the model's loss converged relatively smoothly. In comparison, training the model with a learning rate of 0.5 made the curves oscillate noticeably.

After training for 25 epochs, this model performed quite well on images where the head was centered, but did not do as well on images where the head was turned.

Images that were a success
Images that were failures

This led me to believe that the model relies heavily on the fact that the nose tip is usually located near the center of the face (and image), rather than detecting the nose itself; this would explain why the model's predicted nose tips for turned faces were closer to the corner of the mouth.

Part 2: Full Facial Keypoints Detection

In this section, I created a CNN that detects all of the facial keypoints. Here are some images with the facial keypoints labeled on the faces:

Predicted Facial Landmarks on Images

I augmented the data set by translating the images slightly (10 pixels to the left and right) and by using torchvision.transforms.ColorJitter with brightness 0.3 to randomly vary the intensity of the face images.
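The augmentation was along the lines of the sketch below. ColorJitter(brightness=0.3) is the transform named above; the shift helper is illustrative (my own naming) and assumes the keypoints are stored as normalized (x, y) coordinates, which must be shifted together with the image.

    import torchvision.transforms as T
    import torchvision.transforms.functional as TF

    # Brightness jitter, as named above.
    jitter = T.ColorJitter(brightness=0.3)

    def shift_sample(img, keypoints, dx):
        """Translate a (1 x H x W) tensor image by dx pixels horizontally (dx = -10 or +10)
        and shift the normalized keypoint x-coordinates to match."""
        shifted_img = TF.affine(img, angle=0.0, translate=(dx, 0), scale=1.0, shear=0.0)
        shifted_kpts = keypoints.clone()
        shifted_kpts[..., 0] += dx / img.shape[-1]   # x is assumed normalized by image width
        return shifted_img, shifted_kpts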

My model had the following convolutional layers (in-channels x out-channels x kernel size): 1x12x3, 12x32x5, 32x64x5, 64x64x5, and 64x32x5. It also had a ReLU unit between each pair of convolutional layers, as well as a max pool layer that preserved the number of channels. At the end, it had two fully connected layers: one with 128 units and, after a ReLU activation unit, a final layer of size 116 (one x and one y coordinate for each of the 58 keypoints).

I used a learning rate of 0.02 while training.

Here is the full list of layers (this sits inside the model's __init__, with nn being torch.nn):

    conv_channel_dims = [1, 12, 32, 64, 64, 32]
    conv_kernel_dim = [3, 5, 5, 5, 5]
    lin_layer_dims = [128, 58*2]   # hidden width, then 58 keypoints x 2 coordinates
    self.convlayer = nn.Sequential(
        nn.Conv2d(conv_channel_dims[0], conv_channel_dims[1], conv_kernel_dim[0]),
        nn.ReLU(),
        nn.AdaptiveMaxPool2d(conv_channel_dims[1]),   # AdaptiveMaxPool2d(s) pools each channel to an s x s map
        nn.Conv2d(conv_channel_dims[1], conv_channel_dims[2], conv_kernel_dim[1]),
        nn.ReLU(),
        nn.AdaptiveMaxPool2d(conv_channel_dims[2]),
        nn.Conv2d(conv_channel_dims[2], conv_channel_dims[3], conv_kernel_dim[2]),
        nn.ReLU(),
        nn.AdaptiveMaxPool2d(conv_channel_dims[3]),
        nn.Conv2d(conv_channel_dims[3], conv_channel_dims[4], conv_kernel_dim[3]),
        nn.ReLU(),
        nn.AdaptiveMaxPool2d(conv_channel_dims[4]),
        nn.Conv2d(conv_channel_dims[4], conv_channel_dims[5], conv_kernel_dim[4]),
        nn.ReLU(),
        nn.AdaptiveMaxPool2d(10),
        nn.Flatten(),
        nn.Linear(10*10*conv_channel_dims[-1], lin_layer_dims[0]),  # 10 x 10 x 32 = 3200 flattened features
        nn.ReLU(),
        nn.Linear(lin_layer_dims[0], lin_layer_dims[1]),
    )

Here are the training and validation loss curves:

Because of the very high error in the first epoch, it is a bit hard to see the trends later in training, so here are the curves after removing only the first epoch.

Below are the visualized filters that my model learned:

Filters in the first convolutional layer
Filters in the second convolutional layer
Filters in the third convolutional layer
Filters in the fourth convolutional layer
Filters in the fifth convolutional layer

After training for 25 epochs with a batch size of 64 (each epoch covering the entire training set), this model behaved similarly to the nose tip model: it performed well on images where the head was centered, but not as well on images where the head was turned.

Images that were a success
Images that were failures

I believe that this is because turned faces are very different from faces looking straight on, at least from the perspective of the model. The model performed somewhat well on images that were translated, which suggests to me that the model learned how the keypoints should be arranged and where they should roughly be placed for faces looking forward. However, it seems like it did not learn how to handle faces looking to the side. Since this is not a very deep model, it is also likely that it underfit the data and could not generalize well to faces outside of the average.

Part 3: Train With Larger Dataset

As I had done in Part 2, I augmented the data set by translating the images by 10 pixels and changing the intensity of some images with transforms.ColorJitter.

My model was a ResNet-18 whose first convolutional layer was replaced to accept grayscale (single-channel) input and whose final fully connected layer was replaced to output the 58 keypoint coordinates. I used a learning rate of 0.03.

I trained this model on 1700 images with a batch size of 16.

Here is the code for this part (inside the model's __init__, with models being torchvision.models):

    self.model_name = 'resnet18'
    self.model = models.resnet18()
    # Replace the first convolution so the network accepts 1-channel (grayscale) input.
    self.model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Replace the final fully connected layer to output the keypoint coordinates.
    self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)
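As a self-contained version of the snippet above, the wrapper below shows roughly how the pieces of this part fit together. The class name, the use of Adam, and the 58 x 2 output size are my assumptions; only the grayscale conv1, the replaced fc layer, the learning rate of 0.03, and the batch size of 16 are from the write-up.

    import torch.nn as nn
    import torch.optim as optim
    from torchvision import models

    class KeypointResNet(nn.Module):
        """Hypothetical wrapper around the ResNet-18 modifications shown above."""
        def __init__(self, num_out=58 * 2):   # 58 keypoints x 2 coordinates (assumed)
            super().__init__()
            self.model = models.resnet18()
            # Accept 1-channel (grayscale) input instead of RGB.
            self.model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
            # Regress the keypoint coordinates instead of predicting ImageNet classes.
            self.model.fc = nn.Linear(self.model.fc.in_features, num_out)

        def forward(self, x):
            return self.model(x)

    net = KeypointResNet()
    optimizer = optim.Adam(net.parameters(), lr=0.03)  # optimizer assumed; lr from the write-up
    # The DataLoader over the ~1700 training images used batch_size=16.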

Here are the training and validation loss curves:

Here is how my model performed on some images from the test set:

Images from the Test Set

Here are some photos from my collection that have been labeled by my model. The following photos are from the K-Pop group Oh My Girl's promotional posters from 2020.

Images from My Collection

The first image above from my collection had, by far, the worst labelling. I believe that the angle of the face in the image, especially after resizing the image to be square, is the primary reason why the facial landmarks were wrong. The face in the first image is very compressed after resizing, and the model was most likely biased towards treating it as if it were an average-proportioned face. The other images were originally fairly square, so resizing them did not hurt the model's results much. These images were generally labeled well, although the landmarks for the eyes and eyebrows did poorly on faces with long bangs that partially covered the area around the eyes. The model performed best on the second image, where the face is looking forward and has nothing obscuring it.

I submitted to Kaggle as Selina Kim and got an MAE of 8.13398. My Kaggle submission was created after training my model for around 30 epochs.

Bells and Whistles: Fully Convolutional Networks

In this section, I attempted to create a fully convolutional classification network to find facial landmarks.

My model in this attempt was quite similar to my model from part 2, but instead of two fully connected layers at the end, it had two additional convolutional layers. The final convolutional layer output 58 channels of 160-by-120 maps, each channel representing the heatmap of one individual landmark. This model used cross-entropy loss, since it was set up as a classification network.

Here are the layers that my model contained (conv_channel_dims and conv_kernel_dim extend the Part 2 lists with entries for the two added layers; their values are not shown here):

    nn.Conv2d(conv_channel_dims[0], conv_channel_dims[1], conv_kernel_dim[0]),
    nn.ReLU(),
    nn.BatchNorm2d(conv_channel_dims[1]),
    nn.Conv2d(conv_channel_dims[1], conv_channel_dims[2], conv_kernel_dim[1]),
    nn.ReLU(),
    nn.AdaptiveMaxPool2d(conv_channel_dims[2]),
    nn.Dropout(0.2),
    nn.Conv2d(conv_channel_dims[2], conv_channel_dims[3], conv_kernel_dim[2]),
    nn.ReLU(),
    nn.AdaptiveMaxPool2d(conv_channel_dims[3]),
    nn.Dropout(0.2),
    nn.Conv2d(conv_channel_dims[3], conv_channel_dims[4], conv_kernel_dim[3]),
    nn.ReLU(),
    nn.AdaptiveMaxPool2d(conv_channel_dims[4]),
    nn.Dropout(0.2),
    nn.Conv2d(conv_channel_dims[4], conv_channel_dims[5], conv_kernel_dim[4]),
    nn.ReLU(),
    nn.BatchNorm2d(conv_channel_dims[5]),
    nn.Conv2d(64, conv_channel_dims[6], conv_kernel_dim[5]),
    nn.ReLU(),
    nn.AdaptiveMaxPool2d(conv_channel_dims[6]),
    nn.Conv2d(conv_channel_dims[6], conv_channel_dims[7], conv_kernel_dim[6]),
    nn.ReLU(),
    nn.AdaptiveMaxPool2d((160, 120)),
    nn.Conv2d(conv_channel_dims[7], 58, 1)   # one 160 x 120 heatmap per landmark
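The cross-entropy loss was applied over pixel locations; a rough sketch of how that can be set up is below. This is an illustration of the idea, not the project's exact code, and the encoding of each landmark's true position as a flattened pixel index is my assumption.

    import torch
    import torch.nn as nn

    H, W = 160, 120                 # heatmap resolution, matching the final adaptive pool
    criterion = nn.CrossEntropyLoss()

    def heatmap_loss(logits, keypoints_px):
        """
        logits:       (B, 58, H, W) raw heatmap scores from the network
        keypoints_px: (B, 58, 2) integer (x, y) pixel locations of the true landmarks
        """
        B, K = logits.shape[:2]
        flat_logits = logits.reshape(B * K, H * W)                     # one classification per landmark
        target_idx = keypoints_px[..., 1] * W + keypoints_px[..., 0]   # row-major pixel index
        return criterion(flat_logits, target_idx.reshape(B * K).long())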

I trained this model for 25 epochs, with a learning rate of 0.02 and a batch size of 64, using the same augmented dataset that part 2 used.

Here are some heatmaps output by my model. Each facial landmark has its own heatmap, and I plotted all of the heatmaps for each face on a grid (a sketch of the plotting code follows the figures below).

Heatmaps from the first image
Heatmaps from the second image
Heatmaps from the third image
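The grid plots above were produced with something along these lines; this is an illustrative sketch, where heatmaps is assumed to be the 58-channel output for a single face.

    import matplotlib.pyplot as plt

    def plot_heatmap_grid(heatmaps, cols=8):
        """Plot one small panel per landmark heatmap; `heatmaps` has shape (58, H, W)."""
        rows = (len(heatmaps) + cols - 1) // cols
        fig, axes = plt.subplots(rows, cols, figsize=(2 * cols, 2 * rows))
        for i, ax in enumerate(axes.flat):
            ax.axis("off")
            if i < len(heatmaps):
                ax.imshow(heatmaps[i], cmap="hot")
                ax.set_title(str(i), fontsize=6)
        plt.show()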

Overall, this model did not perform as well as I would have liked. I believe that training for more iterations could have made it work better. A U-Net style implementation, using skip connections and upsampling, might also have been more successful, albeit larger. Still, the heatmaps for different points were distinctly different, with certain areas highlighted more than others. I also noticed that the heatmaps somewhat followed the orientation of the eyes, as evident in the last two images, where a face looking forward had circular heatmaps while a face looking to the side had ellipsoidal heatmaps.