Project 5. Facial Keypoint Detection with Neural Networks

Michael Wan, SID: 3034012128

Part 1. Nose Tip Detection

In this part, I wrote much of the code that I reused for the rest of the project: a dataset class that parses the ASF files to collect the keypoints (with an option to keep only the nose tip as the target or to keep all of the facial points), a RegressionCNN module, and a general training loop that can be applied to different models and dataloaders.
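A minimal sketch of what this looks like, assuming the IMM-style ASF layout (seven whitespace-separated fields per keypoint row, with relative coordinates in the third and fourth fields); the names, image size, and nose-tip index here are illustrative rather than my exact code:

```python
import glob

import numpy as np
import torch
from torch.utils.data import Dataset
import skimage.io as skio
from skimage.color import rgb2gray
from skimage.transform import resize

NOSE_IDX = -6  # index of the nose tip among the 58 ASF points (assumed here)

def parse_asf(path):
    """Parse an IMM-style ASF file into an (N, 2) array of relative (x, y) coordinates."""
    pts = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Keypoint rows have exactly 7 whitespace-separated fields; skip comments.
            if len(fields) == 7 and not line.startswith("#"):
                pts.append([float(fields[2]), float(fields[3])])
    return np.array(pts, dtype=np.float32)

class FaceKeypointDataset(Dataset):
    """Pairs grayscale face images with keypoints (all 58 points, or only the nose tip)."""

    def __init__(self, root, nose_only=False, out_size=(60, 80)):
        self.img_paths = sorted(glob.glob(f"{root}/*.jpg"))
        self.asf_paths = sorted(glob.glob(f"{root}/*.asf"))
        self.nose_only = nose_only
        self.out_size = out_size

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, i):
        img = rgb2gray(skio.imread(self.img_paths[i]))
        img = resize(img, self.out_size).astype(np.float32) - 0.5   # normalize to [-0.5, 0.5]
        pts = parse_asf(self.asf_paths[i])                          # relative coords in [0, 1]
        if self.nose_only:
            pts = pts[NOSE_IDX:NOSE_IDX + 1]
        return torch.from_numpy(img).unsqueeze(0), torch.from_numpy(pts).flatten()

def train_model(model, train_loader, val_loader, epochs, lr):
    """Generic training loop reused across all three parts; returns per-epoch losses."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    train_hist, val_hist = [], []
    for _ in range(epochs):
        model.train()
        batch_losses = []
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            batch_losses.append(loss.item())
        train_hist.append(float(np.mean(batch_losses)))
        model.eval()
        with torch.no_grad():
            val_hist.append(float(np.mean([loss_fn(model(x), y).item()
                                           for x, y in val_loader])))
    return train_hist, val_hist
```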

As a sanity check, I verified that the dataloader stores the keypoints correctly by visualizing a few samples.
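Concretely, a quick overlay like the following is enough (matplotlib-based; the dataset path and the number of samples shown are placeholders):

```python
import matplotlib.pyplot as plt

# Assumes FaceKeypointDataset from the sketch above; keypoints are stored as
# relative (x, y) coordinates, so they are scaled back to pixels for plotting.
dataset = FaceKeypointDataset("data/imm_face_db", nose_only=True)  # placeholder path
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for idx, ax in enumerate(axes):
    img, pts = dataset[idx]
    _, h, w = img.shape
    pts = pts.reshape(-1, 2).numpy()
    ax.imshow(img[0].numpy(), cmap="gray")
    ax.scatter(pts[:, 0] * w, pts[:, 1] * h, c="lime", s=12)
    ax.axis("off")
plt.show()
```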

Sampled images and their respective nose keypoints.

I then trained two models, NoseCNN1 and NoseCNN2, with learning rates of $1e-3$ and $5e-4$, respectively. I also varied the size of the hidden fully connected layer: the first model uses 256 units, while the second uses 64. The training graphs for both models are shown below.
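A sketch of the nose regressor with the hidden fully connected width exposed as a parameter (the convolutional channel counts and the 60x80 input size are representative choices, not necessarily my exact ones):

```python
import torch.nn as nn

class NoseCNN(nn.Module):
    """Small CNN that regresses the (x, y) nose-tip location from a 60x80 grayscale image."""

    def __init__(self, hidden_size=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 12 x 30 x 40
            nn.Conv2d(12, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 16 x 15 x 20
            nn.Conv2d(16, 24, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 24 x 7 x 10
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(24 * 7 * 10, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 2),    # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))

# NoseCNN1 and NoseCNN2 differ only in the hidden-layer width (and learning rate):
model1 = NoseCNN(hidden_size=256)   # trained with lr = 1e-3
model2 = NoseCNN(hidden_size=64)    # trained with lr = 5e-4
```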

Training graphs of NoseCNN1 and NoseCNN2, respectively.

Predicted nose points (red) versus ground truth (blue). Some of the predictions are quite poor, possibly due to the small amount of training data and the wide variance in facial configurations.


Part 2. Full Facial Keypoints Detection

In this part, I defined my own transforms to augment the data and reused much of the code from Part 1 to train the model; in particular, I used the same dataset class and the same training loop.
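The important detail is that each transform has to modify the image and the keypoints together so the labels stay aligned. A sketch of a rotation transform in that style (the angle range is illustrative, and keypoints are assumed to be relative coordinates in $[0, 1]$):

```python
import numpy as np
from skimage.transform import rotate

class RandomRotation:
    """Rotate an image by a random angle and rotate its keypoints to match.

    Keypoints are assumed to be relative (x, y) coordinates in [0, 1].
    """

    def __init__(self, max_deg=15):
        self.max_deg = max_deg

    def __call__(self, image, pts):
        deg = np.random.uniform(-self.max_deg, self.max_deg)
        image = rotate(image, deg, mode="edge")   # skimage rotates CCW about the center
        h, w = image.shape[:2]
        theta = np.deg2rad(deg)
        # Rotate the keypoints about the image center. Because y grows downward in
        # image coordinates, a CCW image rotation uses this sign convention:
        rot = np.array([[np.cos(theta),  np.sin(theta)],
                        [-np.sin(theta), np.cos(theta)]])
        centered = (pts - 0.5) * np.array([w, h])          # to pixel offsets from center
        pts = centered @ rot.T / np.array([w, h]) + 0.5    # back to relative coords
        return image, pts
```

A list of such transforms can then be applied inside the dataset's `__getitem__`, so augmentation happens on the fly as batches are sampled.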

Sampled images and their respective full facial keypoints. Notice that some of the images are rotated (a result of data augmentation).

The model I chose has the following architecture; I picked it because it is relatively fast to train, with relatively few weights for a six-layer regression CNN.
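As a rough sketch of that kind of architecture (the channel counts and the 120x160 input size here are representative, not the exact figures from the diagram):

```python
import torch.nn as nn

class FullCNN(nn.Module):
    """Six-convolution regression CNN for all 58 keypoints on a 120x160 grayscale input."""

    def __init__(self, n_points=58):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 16 x 60 x 80
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 32 x 30 x 40
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 32 x 15 x 20
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 32 x 7 x 10
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 256), nn.ReLU(),
            nn.Linear(256, 2 * n_points),    # (x, y) for every keypoint
        )

    def forward(self, x):
        return self.head(self.features(x))
```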

Model architecture and training graph, respectively. I used a learning rate of $2e-4$ and trained for 10 epochs; I performed a hyperparameter search over learning rates, and $2e-4$ worked best.
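The search itself was nothing fancy: retrain the same architecture for a few epochs at each candidate learning rate and keep the one with the lowest validation loss. Roughly (the candidate rates, epoch count, and loader names are placeholders):

```python
# Reuses train_model and FullCNN from the sketches above; train_loader and
# val_loader are assumed to already exist.
best_lr, best_val = None, float("inf")
for lr in (1e-3, 5e-4, 2e-4, 1e-4):    # candidate rates (illustrative)
    _, val_hist = train_model(FullCNN(), train_loader, val_loader, epochs=5, lr=lr)
    if val_hist[-1] < best_val:
        best_lr, best_val = lr, val_hist[-1]
print("best learning rate:", best_lr)
```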

Visualization of first convolution layer in FullCNN.
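These filters come straight out of the first convolution's weight tensor; a sketch of how the grid can be produced (assuming a trained model with 16 filters in its first layer, as in the sketch above):

```python
import matplotlib.pyplot as plt

# model is a trained FullCNN; the first conv's weights have shape (out_channels, 1, 3, 3).
filters = model.features[0].weight.detach().cpu()
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i, ax in enumerate(axes.flat):
    ax.imshow(filters[i, 0].numpy(), cmap="gray")
    ax.axis("off")
plt.show()
```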

Model predictions (blue) versus ground truth (red). Possible reasons for the failures include the need for even more training data augmentation, as well as a wider range of facial configurations to train on.


Part 3. Train with Larger Dataset

My MSE is 9.07269. For my model, I used a pretrained ResNet18 with a modified initial convolution layer and final fully connected layer.

Modified ResNet18 architecture.
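A sketch of that modification with torchvision (the single-channel grayscale input and the keypoint count of 68 are assumptions about the setup, not quoted from my code):

```python
import torch.nn as nn
import torchvision

def make_keypoint_resnet(n_points=68, pretrained=True):
    """ResNet18 with its input and output layers swapped out for keypoint regression."""
    model = torchvision.models.resnet18(pretrained=pretrained)
    # First conv: accept single-channel (grayscale) input instead of 3-channel RGB,
    # keeping ResNet18's usual 7x7 kernel, stride 2, and 64 output channels.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Final FC: regress 2 coordinates per keypoint instead of 1000 ImageNet classes.
    model.fc = nn.Linear(model.fc.in_features, 2 * n_points)
    return model
```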

I trained the ResNet18 on an augmented dataset of 33,330 data points for 30 epochs with a learning rate of $2e-4$. The training graph and validation predictions are shown below.

Training graph of the modified ResNet18.

Validation predictions. Predicted points are in red, ground truth points are in blue.

Test predictions.