Facial Keypoint Detection with Neural Networks

Roma Desai | CS-194 Project 4

 

 

OVERVIEW

 

For this project, I implemented facial keypoint detection using convolutional neural networks. I built it up in three parts. First, I created a net to detect just the nose tips of people in the IMM Face Dataset. Next, I modified this net to learn the full set of points outlining a face. Finally, I scaled these ideas up to a much larger dataset and used a standard ResNet CNN model to learn the facial keypoints of those faces.

 

 

 

PART 1: NOSE TIP DETECTION

 

For the first part, I began by creating a PyTorch dataloader to load all of my images along with their nose keypoints. This dataloader abstraction made it easy to fetch data and interface with other PyTorch functions. I scaled the images to 60 x 80. Here are some images with the nose point shown (before being resized):

 

[Sample dataset images with the nose tip keypoint marked]
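Below is a rough sketch of the data loading just described; the class name, the pre-parsed path/keypoint lists, and the coordinate conventions are my own assumptions rather than the exact implementation:

import torch
from torch.utils.data import Dataset, DataLoader
from skimage import io, color, transform

class NoseTipDataset(Dataset):
    # Hypothetical dataset class: returns a grayscale 60x80 image tensor and its
    # nose tip (x, y) keypoint, taken from pre-parsed annotation lists.
    def __init__(self, image_paths, nose_points, out_size=(60, 80)):
        self.image_paths = image_paths   # list of image file paths
        self.nose_points = nose_points   # list of (x, y) in original pixel coordinates
        self.out_size = out_size         # (height, width) after resizing

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = color.rgb2gray(io.imread(self.image_paths[idx]))
        h, w = img.shape
        img = transform.resize(img, self.out_size)
        # Scale the keypoint into the resized image's pixel coordinates.
        x, y = self.nose_points[idx]
        x = x * self.out_size[1] / w
        y = y * self.out_size[0] / h
        img = torch.from_numpy(img).float().unsqueeze(0)   # shape (1, 60, 80)
        return img, torch.tensor([x, y], dtype=torch.float32)

# Usage (train_paths / train_points are hypothetical pre-parsed lists):
# train_loader = DataLoader(NoseTipDataset(train_paths, train_points), batch_size=4, shuffle=True)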

 

 

 

Next, I created the actual neural network. I used 3 convolutional layers, each followed by a max pool layer, and then 2 fully connected layers whose final output size is 2, representing the (x, y) coordinates of the nose point. Here are the specifications of my net:

 

[Printed layer-by-layer specification of the nose detection net]
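Below is a minimal sketch of a net with that shape; the channel counts, kernel sizes, and hidden layer width are illustrative assumptions rather than the exact values from the printout:

import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    # Sketch of the part 1 architecture: 3 conv + pool blocks, then 2 FC layers -> (x, y).
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(12, 20, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(20, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(32 * 7 * 10, 128)   # a 60x80 input shrinks to 7x10 after three pools
        self.fc2 = nn.Linear(128, 2)             # (x, y) of the nose tip

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))     # (N, 12, 30, 40)
        x = self.pool(F.relu(self.conv2(x)))     # (N, 20, 15, 20)
        x = self.pool(F.relu(self.conv3(x)))     # (N, 32, 7, 10)
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)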

 

 

Finally, I trained the neural network on my training data and tested it on my validation data. My training function calculated the loss using mean squared error (MSE) and updated the weights accordingly. I trained the net for 15 epochs. Below is the plot of the training and validation loss per epoch. The x-axis represents the epochs and the y-axis represents the loss. (I did not normalize my points, which is why the loss values are so large.) The blue line is the training loss and the orange line is the validation loss. As you can see, the loss decreased as the net trained on more data and had more time to update its weights.

 

[Training and validation loss per epoch for the nose tip net]
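The training loop itself followed the standard PyTorch pattern. Here is a minimal sketch, assuming train_loader and val_loader are DataLoaders built as in the earlier sketch; the optimizer choice and learning rate are assumptions, since only the MSE loss and 15 epochs are fixed above:

import torch

net = NoseNet()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)   # optimizer and lr are assumed

for epoch in range(15):
    net.train()
    train_loss = 0.0
    for imgs, points in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(imgs), points)
        loss.backward()        # backpropagate the MSE loss
        optimizer.step()       # update the weights
        train_loss += loss.item()

    net.eval()
    val_loss = 0.0
    with torch.no_grad():
        for imgs, points in val_loader:
            val_loss += criterion(net(imgs), points).item()

    print(f"epoch {epoch}: train {train_loss / len(train_loader):.3f}, "
          f"val {val_loss / len(val_loader):.3f}")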

 

 

Here are some examples of my learned points. The green points are the ground truth and the red points are the outputs of the net.

 

 

[Nose tip predictions (red) vs. ground truth (green) on validation images]

 

As you can see, not all of the nose keypoints were successfully identified. I believe this is because of the small dataset and variations in facial view and expression. The images that failed were usually side views, which made detection harder since the majority of the training images were not taken from sharp angles. As you can see above, one of the images also has a different facial expression, which may have thrown off the results. Since a lack of data is a recurring problem in machine learning, I believe more varied data or data augmentation could have improved the results.

 

 

 

PART 2: FULL FACIAL KEYPOINTS DETECTION

 

This section was very similar to part one, except I modified my neural network to learn a full set of facial keypoints instead of just the nose. This time, I loaded all 58 facial keypoints in my dataloader. However, to learn a full set of 58 points, we need more data. Since we do not have new images, we can augment the current dataset to produce similar but different images that will help with training. I implemented random brightness/contrast changes, random rotations, and random shifts. Here are some samples:

[Sample augmented training images with keypoints]
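Here is a sketch of how such augmentations can be applied so the keypoints stay aligned with the transformed image; the jitter ranges and the use of skimage's SimilarityTransform are my own assumptions:

import numpy as np
from skimage import transform as sktf

def augment(img, pts):
    # img: float grayscale HxW image in [0, 1]; pts: (58, 2) array of (x, y) pixel coordinates.
    # Random brightness/contrast jitter (ranges are illustrative).
    img = np.clip((img - 0.5) * np.random.uniform(0.8, 1.2) + 0.5
                  + np.random.uniform(-0.1, 0.1), 0, 1)

    # Random rotation (about the image center) and random shift, applied to image and points.
    h, w = img.shape
    center = np.array([w / 2, h / 2])
    angle = np.deg2rad(np.random.uniform(-15, 15))
    shift = np.random.uniform(-10, 10, size=2)
    tform = (sktf.SimilarityTransform(translation=-center)
             + sktf.SimilarityTransform(rotation=angle)
             + sktf.SimilarityTransform(translation=center + shift))
    # warp() expects the output-to-input mapping, hence tform.inverse;
    # the keypoints are mapped forward through the same transform.
    img = sktf.warp(img, tform.inverse, mode="edge")
    pts = tform(pts)
    return img, pts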

 

 

Next, I modified my net from part 1 to output 58 * 2 = 116 values (one (x, y) pair per keypoint) instead of just 2. I added more convolutional layers, more max pool layers, and changed the input/output sizes of the different layers. Here is what my net looked like:

 

[Printed layer-by-layer specification of the full facial keypoint net]
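A sketch of an enlarged net of this kind is below, assuming five conv + pool blocks; the block count and channel widths are assumptions, and nn.LazyLinear is used so the flattened feature size does not have to be hard-coded (it is inferred on the first forward pass):

import torch.nn as nn
import torch.nn.functional as F

class FaceNet(nn.Module):
    # Sketch of the part 2 net: more conv + pool blocks than part 1, ending in 58 * 2 = 116 outputs.
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 32, 64, 64]
        self.convs = nn.ModuleList([
            nn.Conv2d(cin, cout, kernel_size=3, padding=1)
            for cin, cout in zip(chans[:-1], chans[1:])])
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.LazyLinear(256)      # input size inferred from the flattened features
        self.fc2 = nn.Linear(256, 116)     # one (x, y) pair per keypoint, flattened

    def forward(self, x):
        for conv in self.convs:
            x = self.pool(F.relu(conv(x)))
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)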

 

 

After training, I plotted the loss. I trained the net for 15 epochs. Below is the plot of the training and validation loss per epoch. The x-axis represents the epochs and the y-axis represents the loss. (I did not normalize my points, which is why the loss values are so large.) The blue line is the training loss and the orange line is the validation loss. As you can see, the loss decreased as the net trained on more data and had more time to update its weights.

 

[Training and validation loss per epoch for the full facial keypoint net]

 

 

Here are some of my results. As you can see, some did well while others did not. Again, I think this is due to a lack of varied data, even with the data augmentation. The net did not do well on faces that were angled away from the center or tilted; for the most part it predicted keypoint layouts typical of faces looking straight ahead. Surprisingly, the net often got the shape of the face correct, and it is just the location that seems a little off. This may be an indication of the net assuming most faces will be forward facing.

 

 

[Full facial keypoint predictions vs. ground truth on validation images]

 

 

In addition, I visualized the learned filters. Here are the filters for the first and second layers. The number of filters grows quickly in the deeper layers, so I only display the first two.
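The filters are just the weights of the trained convolutional layers. Here is a sketch of how a layer's filters can be pulled out and displayed with matplotlib, assuming a single-input-channel layer like the first conv of the FaceNet sketch above:

import matplotlib.pyplot as plt

def show_filters(conv_layer):
    # conv_layer: an nn.Conv2d whose weights have shape (out_channels, in_channels, kH, kW).
    weights = conv_layer.weight.data.cpu().numpy()
    fig, axes = plt.subplots(1, len(weights), figsize=(2 * len(weights), 2))
    for ax, w in zip(axes, weights):
        ax.imshow(w[0], cmap="gray")   # show the kernel for the first input channel
        ax.axis("off")
    plt.show()

# e.g. show_filters(net.convs[0]) for the first layer of a trained FaceNet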

 

First Layer:


 

Second Layer:


 

 

 

PART 3: TRAIN WITH LARGER DATASET

 

For part 3, I used the larger ibug dataset with over 6,000 images. Starting from the dataloader from the previous parts, I modified it to read in the new images, crop them according to the bounding box around each face, and resize the crops to 224x224. I also implemented data augmentation to improve the results. Here are a few images along with their ground truth keypoints.
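A rough sketch of that preprocessing step, cropping each face with its bounding box, resizing the crop to 224x224, and rescaling the keypoints to match; the (x, y, w, h) bounding-box format and the variable names are assumptions:

import numpy as np
from skimage import transform as sktf

def crop_and_resize(img, keypoints, bbox, out_size=224):
    # bbox is assumed to be (x, y, w, h) in pixel coordinates of the original image.
    x, y, w, h = [int(v) for v in bbox]
    crop = sktf.resize(img[y:y + h, x:x + w], (out_size, out_size))
    # Shift the keypoints into the crop's frame, then scale them into the 224x224 frame.
    pts = keypoints.astype(np.float32)
    pts[:, 0] = (pts[:, 0] - x) * out_size / w
    pts[:, 1] = (pts[:, 1] - y) * out_size / h
    return crop, pts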

 

 

 

Next, I loaded a standard ResNet18 PyTorch model to train. I modified this net to take a single-channel input and output 136 values for the 68 facial keypoints. Below is my net in detail.

 

[Printed layer-by-layer specification of the modified ResNet18]
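A sketch of that modification to torchvision's ResNet-18: swap the first convolution to accept a single grayscale channel and replace the final fully connected layer with a 136-value regression head (whether pretrained weights were used is not stated above, so that flag is an assumption):

import torch.nn as nn
import torchvision.models as models

resnet = models.resnet18(pretrained=False)   # newer torchvision versions use weights=None instead

# Accept a single grayscale channel instead of 3 RGB channels.
resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Output 68 keypoints * (x, y) = 136 values instead of the 1000 ImageNet class scores.
resnet.fc = nn.Linear(resnet.fc.in_features, 136)

The rest of the ResNet-18 backbone is left unchanged.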

 

Below is the plot of the training and validation loss per epoch. My training set consisted of 5,000 images and my validation set consisted of the rest. The x-axis represents the epochs and the y-axis represents the loss. (I did not normalize my points, which is why the loss values are so large.) The blue line is the training loss and the orange line is the validation loss. As you can see, the loss decreased as the net trained on more data and had more time to update its weights.

 

 

Here are some resulting images from the validation set compared to the green ground truth points.

 

 

 

 

Here are some resulting images from the test set.

 

[Keypoint predictions on images from the test set]

 

 

 

I also ran the model on three of my own images; the results are below. As you can see, the faces that were cropped similarly to the training data gave better results, which I think may be a sign of overfitting. I also noticed that the keypoints on the tilted face were slightly harder for the model to place.

 

 

 

REFLECTIONS:

 

Overall, I think this was a very interesting project that showed me the depth of possibilities with machine learning. I really enjoyed reading about the material in detail, finally learning PyTorch (a long-time goal of mine that I had never quite been disciplined enough to start), and playing with parameters to try to perfect the network. While I had no prior machine learning experience, I am really excited to apply my newly learned skills to other projects. With this project I also came to realize the true difficulty of detecting even simple faces and what that means for the field of computer vision as a whole. By struggling with this simple facial feature recognition task, I came to appreciate, and wonder at, how these concepts scale up to applications such as autonomous vehicles and more.

 

 

 

SOURCES:

 

I used the following resources to learn more about convolutional neural networks and various aspects of using PyTorch.

 

·      https://towardsdatascience.com/how-to-cook-neural-nets-with-pytorch-7954c1e62e16

·      https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

·      https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#define-the-network

·      https://colab.research.google.com/github/Niranjankumar-c/DeepLearningPadhAI/blob/master/DeepLearning_Materials/6_VisualizationCNN_Pytorch/CNNVisualisation.ipynb

·      https://www.youtube.com/watch?v=Zvd276j9sZ8&ab_channel=AladdinPersson

·      https://pytorch.org/docs/stable/notes/cuda.html

·      https://stackoverflow.com/questions/24659814/how-to-write-a-numpy-array-to-a-csv-file