CS 294-26 Project 5: Facial Keypoint Detection with Neural Networks

Author: Tiancheng Sun

Part 1: Nose Tip Detection

In this part, I apply a simple CNN to the IMM face database to detect the nose tip. The implementation has three major components: the dataloader, the CNN architecture, and the training loop.
Here are some example results of my data loader:

[Figure: sample images and nose-tip annotations from the dataloader]
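A dataset class like the one described above could be sketched roughly as follows; the class name and the normalized-coordinate convention are my assumptions, not the exact code from the project:

```python
import torch
from torch.utils.data import Dataset

class NoseTipDataset(Dataset):
    """Hypothetical sketch: each item is a grayscale face image tensor
    and its (x, y) nose-tip coordinate, normalized to [0, 1]."""

    def __init__(self, images, keypoints):
        # images: list of HxW float arrays; keypoints: list of (x, y) pairs
        self.images = images
        self.keypoints = keypoints

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Add a channel dimension so the CNN sees a 1xHxW tensor.
        img = torch.as_tensor(self.images[idx], dtype=torch.float32).unsqueeze(0)
        pt = torch.as_tensor(self.keypoints[idx], dtype=torch.float32)
        return img, pt
```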

Here is the first-generation architecture of my NN:

Initially I trained this NN with a learning rate of 1e-3 using the Adam optimizer for 15 epochs, but the results were not that good: as you can see, the validation loss jumps around a lot.

[Figure: training/validation loss for the first-generation network]
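The training setup described here (Adam, a regression loss on the predicted coordinates, train and validation loss tracked per epoch) might look roughly like the sketch below; the function name, loop structure, and the choice of MSE as the loss are my assumptions:

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=15, lr=1e-3):
    # Sketch of a coordinate-regression training loop: Adam optimizer,
    # MSE between predicted and annotated keypoint coordinates.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    train_hist, val_hist = [], []
    for _ in range(epochs):
        model.train()
        total = 0.0
        for img, pt in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(img), pt)
            loss.backward()
            opt.step()
            total += loss.item()
        train_hist.append(total / len(train_loader))
        # Validation pass: no gradients, just accumulate the loss.
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(i), p).item() for i, p in val_loader)
        val_hist.append(val / len(val_loader))
    return train_hist, val_hist
```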

To improve my NN, I updated it to this second-generation architecture (mainly changing the channel sizes of the conv layers):
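As an illustration only (the report's exact channel sizes and input resolution are not reproduced here), a small coordinate-regression CNN of this flavor, assuming 80×60 grayscale inputs, could look like:

```python
import torch
import torch.nn as nn

class NoseTipNet(nn.Module):
    # Hypothetical stand-in for a second-generation architecture:
    # three conv blocks, then two fully connected layers that regress
    # the (x, y) nose-tip position. Channel widths are my assumptions.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            # 80x60 input becomes 10x7 after three 2x2 max-pools.
            nn.Linear(32 * 10 * 7, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, x):
        return self.head(self.features(x))
```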

In this iteration, I also lowered the learning rate to 1e-4 and increased the number of epochs to 30. Here is the loss graph:

[Figure: training/validation loss for the second-generation network]
As you can see, the validation loss is much more stable this time. Using the second-generation network, I generated predictions for the images in the IMM face database. Here are two results where the system did the job decently:
[Figure: two successful nose-tip predictions]

Here are two results where the NN failed to produce a good prediction:

[Figure: two failed nose-tip predictions]
Given what I have seen in the full set of predictions, I think the NN learned to look for a dark spot near the center of the image, so it becomes confused when the real nose tip is not located near the center of the image. In other words, this is an unbalanced dataset issue.

Part 2: Full Facial Keypoints Detection

After finishing part 1, we have more or less validated that the design and construction of our NN is good to go. In this part, I therefore predict not only the nose tip but also the other facial keypoints. Just like in part 1, I constructed a dedicated dataset and dataloader to feed samples into the NN. To counter the unbalanced dataset issue mentioned before, I applied random brightness/contrast, random horizontal flips, random scaling, random rotation, and random color jitter to the original dataset, to prevent the model from overfitting this small IMM dataset.
[Figure: augmented training samples with keypoint annotations]
The annotation points are black; I forgot to adjust their color...
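A minimal sketch of keypoint-aware augmentation, covering just the brightness and horizontal-flip cases (the function name and the jitter ranges are my assumptions, not the report's pipeline): the key point is that geometric transforms must be applied to the annotations as well as the pixels.

```python
import random
import torch

def augment(img, pts):
    # img: CxHxW tensor with values in [0, 1];
    # pts: Nx2 tensor of (x, y) keypoints normalized to [0, 1].
    # Brightness jitter changes no geometry, so keypoints stay put.
    if random.random() < 0.5:
        img = (img * random.uniform(0.7, 1.3)).clamp(0.0, 1.0)
    # A horizontal flip must mirror the x coordinates too, otherwise
    # the labels no longer match the flipped image.
    if random.random() < 0.5:
        img = torch.flip(img, dims=[-1])
        pts = pts.clone()
        pts[:, 0] = 1.0 - pts[:, 0]
    return img, pts
```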

Since the task is a lot more complicated than nose-tip prediction, I deepened the model on top of the previous architecture (now 5 conv layers). Here is the new one:
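A hypothetical 5-conv-layer regressor in this spirit (the channel widths, adaptive pooling, and factory-function form are my choices, not the report's exact architecture) might be:

```python
import torch
import torch.nn as nn

def make_keypoint_net(num_points):
    # Five conv layers, then adaptive pooling so the head does not
    # depend on the input resolution, then FC layers that output
    # 2 * num_points coordinates (x, y per keypoint).
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d((4, 4)),
        nn.Flatten(),
        nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
        nn.Linear(256, 2 * num_points),
    )
```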

For this model, I picked MSELoss as the evaluation method and the Adam optimizer with lr = 0.0025, training for 100 epochs. During tuning, I mainly looked at the shape of the loss curve and compared it with the samples on the CS231n website, and in the end I found 0.0025 to work best.
Here is the train loss and validation loss graph I got:

[Figure: training and validation loss]

Here are some success cases I picked from the predictions:

[Figure: two successful full-keypoint predictions]
And here are two failure cases; to me it seems the network is still not deep enough, so it underfits a bit:
[Figure: two failed full-keypoint predictions]

And here are the learned filters from my NN:

[Figure: learned filters — from left to right: conv_1, conv_2, and conv_3]
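Filters like these can be pulled out of a trained model for plotting (e.g. with matplotlib) using a small helper; this is a sketch that simply grabs the first Conv2d it finds, which here would correspond to conv_1:

```python
import torch

def first_layer_filters(model):
    # Return the learned kernels of the first conv layer as a
    # (out_channels, in_channels, kH, kW) tensor, detached from the
    # graph and moved to the CPU so it can be plotted directly.
    for m in model.modules():
        if isinstance(m, torch.nn.Conv2d):
            return m.weight.detach().cpu()
    raise ValueError("no Conv2d layer found in model")
```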

Part 3: Train With Larger Dataset

After the warm-up, in part 3 the real work starts: a large dataset and a deep NN.

Kaggle

The team name I used for the Kaggle competition is "Tiancheng Sun", and my public score (MAE) is 9.12399. At first, I ran into some big trouble: no matter how I adjusted my model, the network could never give good predictions. So I used the debugging method suggested by Prof. Kanazawa: train on only one image and check whether the NN can memorize it. It turned out the problem was in my dataloader! The data augmentation I applied was far too extreme, so the network simply gave up and output the average result, all along the center line.
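The one-image sanity check described above can be written as a small helper (a sketch; the function name, step count, and tolerance are my choices). If the model cannot drive the loss to near zero on a single sample, the bug is in the pipeline or model, not the dataset size:

```python
import torch
import torch.nn as nn

def can_memorize_one(model, img, target, steps=200, lr=1e-3, tol=1e-3):
    # Overfit a single (img, target) pair and report whether the
    # final MSE falls below `tol`. A healthy model/pipeline should
    # memorize one sample easily.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    loss = loss_fn(model(img), target)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(img), target)
        loss.backward()
        opt.step()
    return loss.item() < tol
```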

[Figure: results generated with the wrong data augmentation — predictions collapsed onto the center line]

This debugging process also led me to implement a number of NN architectures for this project, including ResNet18, ResNet50, ResNeXt50, and ResUNet. Since reporting all of their architectures would take too much space, I only post the ResUNet architecture here, since it is the network I chose in the end:

[Figure: ResUNet architecture]
Above is the default ResUNet architecture. I added three FC layers at the end to make it able to output keypoint coordinates (I know this is kind of a waste of ResUNet's ability, but I had already run out of time by the time I finally understood how to use the 68 224*224 activation maps).

Here are the three layers I added on top:
(fc1): Linear(in_features=50176, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=512, bias=True)
(fc3): Linear(in_features=512, out_features=136, bias=True)
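Wrapped as a module, those three layers could look like the following sketch; the factory function itself is my construction, but the default sizes mirror the shapes listed above (50176 → 4096 → 512 → 136, i.e. 68 keypoints × 2 coordinates):

```python
import torch
import torch.nn as nn

def make_head(in_features=50176, hidden1=4096, hidden2=512, num_outputs=136):
    # Three FC layers that turn a flattened backbone activation
    # (e.g. a single 224x224 map: 224 * 224 = 50176 features) into
    # 136 values, interpreted as 68 (x, y) keypoint coordinates.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features, hidden1), nn.ReLU(),
        nn.Linear(hidden1, hidden2), nn.ReLU(),
        nn.Linear(hidden2, num_outputs),
    )
```

With the default sizes, the first FC layer alone holds roughly 205M parameters, which is most of why the full model is so heavy to train.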

Due to the size of ResUNet (mainly the FC layers I added), this monster is really hard to train. Here is the result I got (and the utilization graph of Colab's P100 GPU, just for fun :P):

[Figure: training result and Colab P100 GPU utilization]

Here are some images I picked from the test set as a demonstration of my network's capability:

[Figure: predictions on test-set images]

Here are some images I ran from my own collection. As you can see, the model does quite well on pictures 1 and 3, so I was a little shocked when I saw image 2 (though it is a failure case I picked out from among many successful results).

[Figure: predictions on photos from my own collection]