## Facial Keypoint Detection with Neural Networks

Zachary Wu

In this lab, we will leverage neural networks for automatic facial keypoint detection. Earlier in the class (project 3), we saw how to create face morphs between individuals given correspondences of key facial features. Now we will look at how to avoid the manual selection of those features that was originally required, and instead use neural networks to detect them automatically.

For this project, we will use PyTorch to create our neural networks.

## Part 1: Nose Tip Detection

For this first part, we will focus on detecting a single facial keypoint: the tip of the nose. The dataset comes from the IMM face database and consists of 240 face images. We will use 192 as a training set and the remaining 48 as a validation set.

The first thing we did was create a custom dataloader. It reads the files from the dataset folder and returns the corresponding image and keypoint when prompted. It also does some processing on each image, resizing it to a smaller size and converting the pixel values to floats.
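A minimal sketch of such a dataset class, assuming the images and keypoints have already been read into memory (the file parsing for the IMM annotations and the resizing step are omitted):

```python
import torch
from torch.utils.data import Dataset

class NoseDataset(Dataset):
    # Illustrative sketch: images is a list of H x W pixel arrays (or
    # nested lists), keypoints is a list of (x, y) nose-tip positions.
    def __init__(self, images, keypoints):
        self.images = images
        self.keypoints = keypoints

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = torch.as_tensor(self.images[idx], dtype=torch.float32)
        # convert to floats roughly in [-0.5, 0.5] and add a channel dim
        img = img / 255.0 - 0.5
        img = img.unsqueeze(0)
        kp = torch.as_tensor(self.keypoints[idx], dtype=torch.float32)
        return img, kp
```

A `DataLoader` wrapped around this dataset then yields batched `(image, keypoint)` pairs for training.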

With this dataloader, we can plot and visualize some of the images in our dataset, and the corresponding nose keypoint for each one.

Once we have our dataloader, we create a simple convolutional neural network to try to predict the nose point. The design of the network is as follows.

```
ConvNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=3328, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=2, bias=True)
)
```

Max pooling is applied between the convolutional layers, and a ReLU follows every step.
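The architecture above can be written out as a PyTorch module. The 60x80 input size here is an assumption, chosen to be consistent with the 3328 features entering `fc1` (32 channels x 8 x 13 after two poolings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    # Matches the printed architecture, assuming 1 x 60 x 80 input.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.conv3 = nn.Conv2d(16, 32, 5)
        self.fc1 = nn.Linear(3328, 64)
        self.fc2 = nn.Linear(64, 2)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # -> 6 x 28 x 38
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # -> 16 x 12 x 17
        x = F.relu(self.conv3(x))                   # -> 32 x 8 x 13 = 3328
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)                          # (x, y) nose prediction
```

The final layer has no ReLU, since the output is an unconstrained coordinate regression.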

We train our network for 25 epochs with the dataloader split into training and validation data, keeping track of the training and validation loss at each epoch.
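The training loop can be sketched as follows; the MSE loss fits the coordinate regression, but the Adam optimizer and learning rate are assumptions, not necessarily the settings actually used:

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    # Returns per-epoch average losses on both splits.
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = {"train": [], "val": []}
    for _ in range(epochs):
        model.train()
        total = 0.0
        for imgs, kps in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), kps)
            loss.backward()
            optimizer.step()
            total += loss.item()
        history["train"].append(total / len(train_loader))
        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(i), k).item() for i, k in val_loader)
        history["val"].append(val / len(val_loader))
    return history
```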

The results are shown below, with 2 good predictions on the left side, and 2 predictions that did not do as good a job. It seems our neural network struggles with smiling and turned faces.

## Part 2: Full Facial Keypoints Detection

It would be a lot more useful if we could predict multiple facial keypoints beyond just the nose. For this part of the project, we modify our dataloader to return larger images, and implement random data augmentation to avoid overfitting. The dataloader applies some color jittering, a random rotation between -15 and 15 degrees, and a random translation between -10 and 10 pixels in both the x and y directions.
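Because the labels must move with the image, the rotation and translation have to be applied to the keypoint coordinates as well. A sketch of the keypoint side of that transform (the matching image warp, e.g. via torchvision's functional transforms with the same parameters, is omitted):

```python
import math
import random

def augment_keypoints(keypoints, w, h, max_angle=15, max_shift=10, rng=random):
    # Draw one rotation angle and one (dx, dy) shift per image, rotate
    # the keypoints about the image center, then translate. Note that
    # the sign convention depends on whether the image warp rotates
    # clockwise or counterclockwise in y-down pixel coordinates.
    angle = math.radians(rng.uniform(-max_angle, max_angle))
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    cx, cy = w / 2, h / 2
    out = []
    for x, y in keypoints:
        rx = cx + (x - cx) * math.cos(angle) - (y - cy) * math.sin(angle)
        ry = cy + (x - cx) * math.sin(angle) + (y - cy) * math.cos(angle)
        out.append((rx + dx, ry + dy))
    return out, (angle, dx, dy)
```

Returning the sampled parameters makes it easy to apply the identical warp to the image tensor.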

Below are some randomly sampled images from this dataloader, with the random transformations applied.

Since we now have larger images and more features to predict, we must also use a deeper convolutional neural network. We will utilize the following design, adding more convolutional layers.

```
FaceNet(
  (cnn_layers): Sequential(
    (0): Conv2d(1, 8, kernel_size=(7, 7), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Conv2d(8, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (8): Conv2d(16, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (12): Conv2d(24, 30, kernel_size=(5, 5), stride=(1, 1), padding=(1, 1))
    (13): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (14): ReLU()
    (15): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (16): Conv2d(30, 35, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (17): BatchNorm2d(35, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (18): ReLU()
  )
  (linear_layers): Sequential(
    (0): Linear(in_features=4550, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=116, bias=True)
  )
)
```

We train this network over 50 epochs on the face dataset, now trying to predict all facial keypoints. In each epoch there is slight variation in the images due to the random data augmentation we implemented. Below is the training and validation error for each epoch. Unfortunately there are a lot of spikes where the error goes back up, so perhaps 50 epochs is too many for this dataset.

With this larger neural network, we can also visualize the learned filters in the first convolutional layer to see what our network tries to identify within the faces. The 7x7 filters below do not appear to be clear edge detectors of any sort, nor do they follow any other pattern that I can easily identify.
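A sketch of how such filters can be pulled out for display; the `cnn_layers` attribute name follows the architecture printed above, and each 7x7 kernel is rescaled to [0, 1] so it can be shown with e.g. `plt.imshow`:

```python
import torch

def first_layer_filters(model):
    # Extract the 8 single-channel 7x7 kernels from the first conv layer
    # and normalize each one independently for visualization.
    w = model.cnn_layers[0].weight.detach().clone()  # shape (8, 1, 7, 7)
    w = w.squeeze(1)                                 # shape (8, 7, 7)
    for i in range(w.shape[0]):
        lo, hi = w[i].min(), w[i].max()
        w[i] = (w[i] - lo) / (hi - lo + 1e-8)
    return w
```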

Now let's see the results of our neural network. Full facial keypoint detection is a harder problem, but the network manages to do OK sometimes. Below are 2 examples on the left where it does a good job, and 2 where it misses the mark. As before, turned faces, emotional expressions, and now random rotations seem to be factors that worsen the network's predictions.

## Part 3: Train With Larger Dataset

Now we are ready to build a very deep model trained on a large dataset. Hopefully this model will be good enough at facial keypoint detection that we can use it for applications like face morphing.

We set up a custom dataloader similar to previous parts of the project. This time, though, we use bounding-box information provided with the dataset to automatically crop the images around the subject's face. Random data augmentation is still used for the training set, but is not applied to the test set.
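The crop itself is just a slice plus a coordinate shift, since cropping moves the keypoints' origin to the crop's top-left corner. A minimal sketch, assuming the face bounding box is given in pixels:

```python
def crop_face(image, keypoints, box):
    # image: list of pixel rows; keypoints: (x, y) pairs in full-image
    # coordinates; box: (left, top, right, bottom) face bounds, e.g.
    # derived from the bounding boxes shipped with the dataset.
    left, top, right, bottom = box
    cropped = [row[left:right] for row in image[top:bottom]]
    # shift keypoints into the crop's coordinate frame
    shifted = [(x - left, y - top) for x, y in keypoints]
    return cropped, shifted
```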

For our model, we will use ResNet-18, one of the predefined PyTorch models, recommended because it is relatively light and easy to train. The network is modified to have 1 input channel and 136 outputs, matching the grayscale input and the 68 (x, y) keypoint pairs. Other than that, the model is left as is.

The details of the model are as follows.

```
ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)
```

We train our model for 6 epochs. This takes quite a while, requiring a little under 2 hours. A random 20% subset of the training data was set aside as a validation set. The training and validation error are as follows.

Below are some test set predictions that our model makes. They look pretty good: even though the ground-truth keypoints are not given, the model appears to mark the keypoints on the face accurately.

In order to recover the keypoints in the full image, rather than just in the crop around the face, we simply undo the transformations the dataloader applies to produce the cropped image. After doing so, we can generate a .csv file containing all of our predictions and submit it to Kaggle.
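A sketch of that inverse mapping and the submission file. The `(left, top, right, bottom)` crop box, the resized crop size, and the Id/Predicted column layout are all illustrative assumptions, not the exact submission format:

```python
import csv

def keypoints_to_full_image(pred, box, crop_size):
    # pred: (x, y) pairs predicted in resized-crop coordinates.
    # Undo the resize (scale by crop / resized size) then the crop
    # (shift by the crop's top-left corner).
    left, top, right, bottom = box
    cw, ch = crop_size
    sx = (right - left) / cw
    sy = (bottom - top) / ch
    return [(left + x * sx, top + y * sy) for x, y in pred]

def write_predictions(path, rows):
    # rows: iterable of flattened keypoint lists, one per test image.
    # The header and one-value-per-line layout are a guess at the
    # submission format; adjust to the competition's sample file.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Predicted"])
        i = 0
        for row in rows:
            for v in row:
                writer.writerow([i, v])
                i += 1
```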

As of writing, I placed 72nd, with a score of 12.95, submitted under my name Zachary Wu. Not bad given that I trained for only 6 epochs, and it looks like my model has yet to converge. With more time and compute, I think training for more epochs would yield better results.

Below are some results on the test set. You can see how we scale the facial keypoints to match the face in the overall image, even though we predict on the cropped version of the image around the face.

## My own images

Now let's see how our model does on images of our own choosing. For this part of the project, I will use pictures of my friend John, Gemma Chan, and Christopher Nolan. Let's see how it does!

Uh oh! It looks like our model isn't able to make good predictions on these images; for the most part it predicts an average face shape. Perhaps this stems from not preprocessing these images with the exact same dataloader steps, resulting in a different input distribution. I'm not sure what the error is, and further investigation is necessary.

However, on the test set, the results look quite promising, and this can save us quite a bit of hassle moving forward for projects like the face morph one.
