CS194-26 Project 5: Facial Keypoint Detection with Neural Networks

Part 1.1: Nose Tip Detection

For Part 1, we used the IMM Face Database to train an initial model for nose tip detection. The IMM Face Database comprises 40 people with 6 images each, taken from different viewpoints, giving us a total of 240 images. I started by initializing a custom Dataset and DataLoader. For the custom Dataset, I subclassed PyTorch's abstract Dataset class with a class representing the IMM Face Database, overriding the __len__ and __getitem__ methods to return the size of the dataset and allow sampling by index. I was then able to rely on the provided starter code to collect the nose keypoints for each of the 32 training people (192 images total).
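
As a minimal sketch of that Dataset (the variable names and the preprocessing into arrays are assumptions for illustration, not the actual starter code):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class IMMNoseDataset(Dataset):
    """Wraps the IMM images and their nose-tip keypoints for indexed sampling."""
    def __init__(self, images, nose_keypoints):
        # images: list/array of grayscale images; nose_keypoints: (N, 2) array
        self.images = images
        self.keypoints = nose_keypoints

    def __len__(self):
        # Number of (image, keypoint) pairs in the dataset
        return len(self.images)

    def __getitem__(self, idx):
        image = torch.as_tensor(self.images[idx], dtype=torch.float32).unsqueeze(0)
        point = torch.as_tensor(self.keypoints[idx], dtype=torch.float32)
        return image, point

# train_images / train_points would come from the starter code's parsing step:
# train_loader = DataLoader(IMMNoseDataset(train_images, train_points),
#                           batch_size=4, shuffle=True)
```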

Example of an image and all its labels and ground-truth nose keypoint:

image

As you can see, the nose point is highlighted in orange; this is the point we will be predicting for each image in this section.

Once I had my images and keypoints, I passed them to the Dataset, which created our training and test datasets. For this I followed the structure provided in the discussion section and tutorial. I was then able to visualize the images with their ground-truth keypoints, as shown below.

Ground-Truth Nose Keypoint:

image

Part 1.2: CNN and Training Model

Now that we had our data, we could implement our convolutional neural network. I began by reviewing the discussion section and the tutorial provided. Following a similar structure, I created a class with 3 convolutional layers, each with kernel size 5 and with channel sizes greater than or equal to 12. The full network architecture can be seen below.

image
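
As a rough sketch of such a network (the exact channel counts, input resolution, and fully connected sizes here are assumptions for illustration; the real values are in the screenshot above):

```python
import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    """3 conv layers (kernel size 5) followed by fully connected layers
    that regress the (x, y) coordinate of the nose tip."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=5)   # grayscale input
        self.conv2 = nn.Conv2d(12, 16, kernel_size=5)
        self.conv3 = nn.Conv2d(16, 24, kernel_size=5)
        # Assumes images resized to 60x80 (HxW); the flattened size
        # below follows from that and must change for other resolutions.
        self.fc1 = nn.Linear(24 * 4 * 6, 128)
        self.fc2 = nn.Linear(128, 2)  # (x, y) of the nose tip

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```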

For implementing my training function, I referenced PyTorch's transfer learning tutorial (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html). It guided me in creating the training function as well as visualizing the images with the model's predictions.
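
A minimal sketch of such a training function (this is my paraphrase of the standard PyTorch loop, not the tutorial's exact code):

```python
def train_one_epoch(model, loader, criterion, optimizer, device="cpu"):
    """Runs one pass over the training set and returns the mean loss."""
    model.train()
    total_loss = 0.0
    for images, points in loader:
        images, points = images.to(device), points.to(device)
        optimizer.zero_grad()            # reset gradients from the last step
        preds = model(images)            # predicted keypoint coordinates
        loss = criterion(preds, points)  # MSE between predicted and true points
        loss.backward()                  # backpropagate
        optimizer.step()                 # update the weights
        total_loss += loss.item() * images.size(0)
    return total_loss / len(loader.dataset)
```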

Part 1.3: Loss Function and Optimizer

I used Adam as my optimizer with a learning rate of 1e-3, and mean squared error (MSE) as my loss function. I trained with a batch size of 4 for 25 epochs and achieved a test loss of around 0.0055.
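
Putting the pieces together, the setup looks roughly like this (NoseNet, train_one_epoch, and train_loader refer to the sketches above):

```python
import torch.nn as nn
import torch.optim as optim

model = NoseNet()
criterion = nn.MSELoss()                             # mean squared error loss
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # Adam, learning rate 1e-3

for epoch in range(25):  # 25 epochs; batch size 4 is set in the DataLoader
    train_loss = train_one_epoch(model, train_loader, criterion, optimizer)
    print(f"epoch {epoch}: train MSE = {train_loss:.4f}")
```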

Epoch Loss:

image

Train and Test MSE Loss Graph:

image

Good Examples:

Red points are the actual labels, Blue points are the predicted points

image image

Bad Examples:

Red points are the actual labels, Blue points are the predicted points

image image

I believe some of the mispredictions happen because the model learns to look for the dark nostril when locating the nose, and so maps the predicted point to another dark spot on the person's face, for instance right under the ear or at the dark circle under an eye. I'm hoping I can improve this with hyperparameter tuning and by further developing my model in the next sections of the project.

Part 1.4: Tuning Hyperparameters

In my first attempt to tune my hyperparameters, I added an additional convolutional layer and adjusted kernel and channel sizes to see how it would affect my network. You can see the difference between the previous architecture and the tuned architecture below:

image image

image
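
A rough sketch of the tuned variant (the specific kernel and channel sizes here are assumptions for illustration; the actual values are in the screenshots above):

```python
import torch.nn as nn

# Four conv layers instead of three, with adjusted kernel/channel sizes.
# nn.LazyLinear infers the flattened size at the first forward pass,
# so the sketch stays valid regardless of the exact input resolution.
tuned_net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 24, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(24, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 32, kernel_size=3), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(128), nn.ReLU(),
    nn.Linear(128, 2),
)
```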

I saw the testing loss drop a bit at each epoch, and the average testing loss dropped slightly as well, but the difference was very small.

The graph on the left is the original training and test MSE loss graph, while the graph on the right is for the new architecture with an additional convolutional layer and changed kernel and channel sizes.

image image

As you can see, there was a slight improvement in the training and testing loss.

Part 2: Full Facial Keypoint Detection

Nose detection is great, but it would be even better if we could detect all the keypoints and labels! That is exactly what we do in Part 2: detect all 58 label points on the face.

As in the previous part, I began by implementing a dataset for all the facial keypoints and images, then created my dataloaders from the train and test datasets. This time the dataset also had to perform some data augmentation via transformations, which helps avoid the overfitting that can otherwise occur. The augmentation technique I used was ColorJitter, which adjusts the brightness, contrast, hue, and saturation of the image. I sampled some images to ensure that all 58 facial keypoints were being collected, as shown below. The ground-truth points are labeled in red; in the examples section further down, the blue points are the predictions.
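
A sketch of the augmentation step, assuming torchvision (the jitter strengths and the pil_image variable here are placeholders, not my tuned values). Note that ColorJitter only changes pixel values, so the keypoint coordinates need no adjustment:

```python
from torchvision import transforms

# Randomly perturb brightness, contrast, saturation, and hue of each image.
color_jitter = transforms.ColorJitter(
    brightness=0.3, contrast=0.3, saturation=0.2, hue=0.1
)

augmented = color_jitter(pil_image)  # keypoints stay the same
```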

image

After creating my dataset and dataloader and sampling some images, I moved on to building my CNN. I used 6 convolutional layers with kernel size 5 and varying in and out channels; the architecture can be seen below. I then used this with the same training function from the previous part to see how well my neural net would perform. Below, you can find information on the architecture, the loss at each epoch, and the train-validation MSE loss graph.

image
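
A rough sketch of a six-conv-layer network along these lines (channel counts and pooling placement are assumptions; the real values are in the screenshot above). The final layer regresses all 58 (x, y) pairs, i.e. 116 outputs:

```python
import torch.nn as nn

# Six conv layers, kernel size 5; pooling only after some layers so the
# feature map does not shrink below the kernel size. Assumes an input
# of roughly 120x160 or larger.
face_net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 24, kernel_size=5), nn.ReLU(),
    nn.Conv2d(24, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 48, kernel_size=5), nn.ReLU(),
    nn.Conv2d(48, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 64, kernel_size=5), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(256), nn.ReLU(),
    nn.Linear(256, 58 * 2),  # 58 keypoints, (x, y) each
)
```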

image

image

Good Examples:

Red points are the actual labels, Blue points are the predicted points

image image

Bad Examples:

Red points are the actual labels, Blue points are the predicted points

image image

Below we can see the learned filters.

image
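
A sketch of how such a visualization can be produced; it assumes the first conv layer is reachable as face_net[0] (as in the Sequential sketch above, with 16 filters) and plots each 5x5 kernel as a small grayscale image:

```python
import matplotlib.pyplot as plt

# Weights of the first conv layer: shape (out_channels, in_channels, 5, 5)
filters = face_net[0].weight.detach().cpu()

fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i, ax in enumerate(axes.flat):
    ax.imshow(filters[i, 0].numpy(), cmap="gray")  # first (grayscale) input channel
    ax.axis("off")
plt.show()
```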

Part 3: Full Facial Keypoint Detection on Larger Dataset

For Part 3, I work with a larger dataset of images. As before, I began by implementing my dataset and dataloader for this section. In this dataset, the face may occupy only a very small fraction of the entire image, so during training we need to crop the image to contain only the face portion. I then resize the cropped image to (224, 224) and update the keypoint coordinates as well. When creating my dataset I used the same transformations as in Part 2. Below are some images from the dataloader with their ground-truth labels/points.
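
A sketch of the crop-and-resize step with the matching keypoint update (the (x, y, w, h) bounding-box format is an assumption; the dataset's actual boxes may need converting first):

```python
import cv2
import numpy as np

def crop_and_resize(image, keypoints, bbox, size=224):
    """Crop the face given an (x, y, w, h) box, resize to size x size,
    and map the (58, 2) keypoints into the new coordinate frame."""
    x, y, w, h = bbox
    face = image[y:y + h, x:x + w]
    face = cv2.resize(face, (size, size))
    # Shift keypoints into the crop, then scale to the resized image.
    new_points = (keypoints - np.array([x, y])) * np.array([size / w, size / h])
    return face, new_points
```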

image

This is one of my first projects working with PyTorch and neural networks. The mean loss that I had for Part 3 was 2.2903.

The architecture for my neural network:

image

image

image

image

These are from my own collection of images; the first two went well, while the last one failed. I believe that even after cropping to the face, there is other noise in the third image that throws the model off a bit and makes the points drift over the other people's body parts. My Part 3 model performed quite poorly, so you can see what it was like when I ran these images through my Part 2 model.

image imageimage