CS194-26 Project 4 Facial Keypoint Detection with Neural Networks

You can find a more detailed explanation of the project here.

Overview

In this assignment, I train a neural network model to automatically detect facial keypoints, using PyTorch as the deep learning framework and Google Colab for GPU access.

Part 1 : Nose Tip Detection

For the first part, using data from this website, I trained a simple convolutional neural network (CNN) that finds the nose tip in a given face image. The input to the model is an 80 x 60 grayscale image whose pixel values are normalized to the range -0.5 to 0.5. Here are some sample images with their nose tips plotted.

Sample Images for Nose Tip Detection
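
A minimal sketch of the normalization described above, assuming the images are loaded as 8-bit grayscale arrays and have already been resized to 80 x 60 (read here as 80 pixels wide and 60 high, which is an assumption). The function name `preprocess` and the random stand-in image are hypothetical, not taken from the project code.

```python
import numpy as np

def preprocess(image_uint8: np.ndarray) -> np.ndarray:
    """Map a grayscale uint8 image (values 0..255) to float32 in [-0.5, 0.5]."""
    image = image_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    return image - 0.5                               # center to [-0.5, 0.5]

# Hypothetical stand-in for a resized face photo: 60 rows by 80 columns.
dummy = np.random.randint(0, 256, size=(60, 80), dtype=np.uint8)
normalized = preprocess(dummy)
```

Centering the pixel values around zero keeps the inputs in a small symmetric range, which tends to make optimization better behaved.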

The architecture of the CNN model trained for this part is the following :

The architecture of the CNN model for nose tip detection.

The model was trained with the Adam optimizer at a learning rate of 1e-2 for 25 epochs. Here is the plot of the training and validation losses for each epoch. The loss criterion is MSELoss, the mean squared error (the squared L2 distance, averaged over coordinates) between the actual and the predicted nose tips.

Losses per Epoch
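
To make the training setup concrete, here is a rough sketch of a few optimization steps with Adam (learning rate 1e-2) and MSELoss. The tiny model, the batch size, and the random tensors are all hypothetical placeholders; the actual architecture is the one shown in the figure above. The last lines check that MSELoss equals the mean of the squared differences between predictions and targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the nose-tip CNN: it outputs one (x, y) pair per image.
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((6, 8)),
    nn.Flatten(),
    nn.Linear(4 * 6 * 8, 2),
)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Random tensors in place of a real batch: 8 grayscale images and 8 nose-tip targets.
images = torch.rand(8, 1, 60, 80) - 0.5
targets = torch.rand(8, 2)

losses = []
for step in range(5):
    optimizer.zero_grad()                     # clear gradients from the previous step
    loss = criterion(model(images), targets)  # mean squared error over the batch
    loss.backward()                           # backpropagate
    optimizer.step()                          # Adam parameter update
    losses.append(loss.item())

# MSELoss (default reduction) is the mean of the element-wise squared differences.
with torch.no_grad():
    pred = model(images)
    manual_mse = ((pred - targets) ** 2).mean()
    library_mse = criterion(pred, targets)
```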

Here are some predictions made by the trained model compared with the ground truths. The ground truths are plotted in green and the predictions in red.

Samples of predictions by the trained model.

Notice that the two left images are predicted relatively well, whereas the two right ones are not. A possible reason for the difference is the orientation of the faces.

Part 2 : Full Facial Keypoints Detection

Now I move on to predicting the full set of facial keypoints. For better training, I augmented the data by randomly rotating and shifting the given images. I also trained a somewhat more complex and heavier model.

Here are some examples of the transformed images with their key points plotted in red.

Sample Transformed Images for Key Points Detection
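
When an image is rotated and shifted, its keypoints have to be moved by the same transform. The sketch below shows one way the keypoint side of that could work, using NumPy. The function name, the image size of 160 x 120, and the rotation and shift ranges are assumptions for illustration only; the sign of the angle also has to match the convention of whatever routine rotates the image itself.

```python
import numpy as np

def rotate_and_shift_keypoints(keypoints, angle_deg, shift_xy, image_size):
    """Rotate (x, y) pixel keypoints about the image center, then translate them.

    keypoints:  array-like of shape (N, 2), pixel coordinates as (x, y)
    angle_deg:  rotation angle in degrees
    shift_xy:   (dx, dy) translation in pixels
    image_size: (width, height) of the image in pixels
    """
    width, height = image_size
    center = np.array([width / 2.0, height / 2.0])
    theta = np.deg2rad(angle_deg)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    points = np.asarray(keypoints, dtype=np.float64)
    rotated = (points - center) @ rotation.T + center  # rotate about the center
    return rotated + np.asarray(shift_xy, dtype=np.float64)

# Hypothetical augmentation parameters drawn at random for one sample.
rng = np.random.default_rng(0)
angle = rng.uniform(-15.0, 15.0)        # degrees
shift = rng.integers(-10, 11, size=2)   # pixels
keypoints = np.array([[80.0, 60.0], [100.0, 60.0]])
augmented = rotate_and_shift_keypoints(keypoints, angle, shift, image_size=(160, 120))
```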

The architecture of the bigger CNN model trained for this part is the following. Here I used more stacked convolutional layers. I also added BatchNorm layers after the convolutional and linear layers. Last but not least, I used PReLU instead of ReLU as the activation function.

The architecture of the LargeCNN model for key points detection.
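
As an illustration of the Conv, BatchNorm, PReLU pattern described above, one stage of such a network might look like the following sketch. The helper `conv_block`, the channel counts, the kernel size, the max-pooling step, and the input size are all assumptions; the real layer configuration is the one in the figure.

```python
import torch
import torch.nn as nn

def conv_block(in_channels: int, out_channels: int) -> nn.Sequential:
    """One hypothetical stage: convolution, batch normalization, PReLU, pooling."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),  # normalizes each channel over the batch
        nn.PReLU(),                    # like ReLU, but with a learnable negative slope
        nn.MaxPool2d(2),               # halves the height and width
    )

block = conv_block(1, 16)
# Hypothetical batch of 4 grayscale images, 120 rows by 160 columns.
out = block(torch.rand(4, 1, 120, 160))
```

Unlike ReLU, PReLU lets a scaled version of negative activations through, and the scale is learned together with the other weights.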

The model was trained for 35 epochs with the Adam optimizer at a learning rate of 1e-2. Here is the graph of the training and validation losses per epoch.

Losses per Epoch

Here are some predictions made by the trained model compared with the ground truths. The ground truths are plotted in green and the predictions in red.

Samples of predictions by the trained model.

Notice that the 4th and 5th images have better-aligned keypoints than the others. I suspect this is because the other images are more rotated and show faces turned to the side, which puts them farther from the mean face.

Here is a visualization of the first and the last convolution layers of the trained model. I tried to visualize the other layers as well, but the later layers produce too many outputs to show.

Visualization of the first convolution layer of the trained model.
Visualization of the last convolution layer of the trained model.
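
One possible way to pull out the filters of the first and last convolution layers for plotting is sketched below. It assumes the visualization is of the learned kernels (rather than of feature maps), that the model's modules are registered in forward order, and it uses a small hypothetical model in place of the trained one. Each kernel is rescaled to the range 0 to 1 so it can be shown as a grayscale image.

```python
import torch
import torch.nn as nn

def conv_filters_for_display(model: nn.Module):
    """Return the first and last Conv2d kernels, each rescaled to [0, 1]."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    first, last = convs[0], convs[-1]

    def normalize(weight: torch.Tensor) -> torch.Tensor:
        w = weight.detach().clone()
        w = w - w.amin(dim=(-2, -1), keepdim=True)  # per-kernel minimum becomes 0
        return w / w.amax(dim=(-2, -1), keepdim=True).clamp_min(1e-8)

    return normalize(first.weight), normalize(last.weight)

# Hypothetical two-layer model standing in for the trained network.
toy_model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),
    nn.PReLU(),
    nn.Conv2d(8, 16, kernel_size=3),
)
first_filters, last_filters = conv_filters_for_display(toy_model)
```

The number of kernels in a layer is the product of its input and output channels, so later layers have many more images to show than the first one.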

Part 3 : Train With Larger Dataset

For this part, I trained a better facial keypoint detector using a larger dataset, specifically the ibug face in the wild dataset. Here, I did not build a model from scratch but instead made use of a pretrained model, namely resnet18. I made slight modifications to its last layers so that the dimensions of its outputs match the targets. I also froze all the layers except the last few.

The architecture of Resnet18 as fine-tuned for the task. The layers in the red box are the ones that are not frozen, i.e., the ones that are being trained.

The model was trained with the Adam optimizer at a learning rate of 1e-3. The following is the graph of the training and validation losses.

Losses per Epoch

Here are some of the sample predictions for Kaggle images.

Kaggle Predictions

The score I got for the Kaggle competition is 351.86020... (Still working on improving it...)

Here are some keypoint predictions on pictures I picked. (A somewhat disappointing result...)

My Picture Predictions