Proj5: Facial Keypoint Detection with Neural Networks

Name: Tzu-Chuan Lin

Part 1: Nose Tip Detection

In this section, I trained models with three different architectures, without any data augmentation.

|  | SimpleModel | SimpleModel + one more conv layer (SimpleModelDeeper) | SimpleModel with 5x5 filters (SimpleModelLargeKernel) |
|---|---|---|---|
| Architecture | 3 Conv(3x3) + 2 FC | 4 Conv(3x3) + 2 FC | 3 Conv(5x5) + 2 FC |
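As a concrete reference, here is a minimal PyTorch sketch of the SimpleModel variant; the channel widths and the 60x80 (HxW) input size are illustrative assumptions, not the exact values used.

```python
import torch.nn as nn

class SimpleModel(nn.Module):
    """3 Conv(3x3) + 2 FC, regressing the (x, y) nose tip.

    Channel widths and the 60x80 (HxW) grayscale input are assumptions.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),  # 60x80 input -> 7x10 after 3 poolings
            nn.Linear(128, 2),                       # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))
```

SimpleModelDeeper adds one more Conv(3x3) block, and SimpleModelLargeKernel swaps the 3x3 kernels for 5x5.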

Because I trained SimpleModel on only 192 images (without any augmentation), the predictions seem more susceptible to rotation of the face or changes of expression.

Part 2: Full Facial Keypoints Detection

Sampled augmented images from the training data
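The report doesn't list the exact transforms, so as an illustration, here is a minimal keypoint-aware augmentation sketch (the shift range and brightness factors are my assumptions). The important detail is that any geometric transform applied to the image must also be applied to the keypoint coordinates:

```python
import numpy as np

def augment(image, keypoints, rng=np.random.default_rng()):
    """Randomly shift and brightness-jitter an image, moving keypoints along.

    image:     (H, W) float array
    keypoints: (K, 2) array of (x, y) pixel coordinates
    The +/-10 px shift and brightness range are illustrative assumptions.
    """
    h, w = image.shape
    dx, dy = rng.integers(-10, 11, size=2)
    shifted = np.zeros_like(image)
    # Paste the image at the shifted location, cropping at the borders.
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    shifted *= rng.uniform(0.7, 1.3)                # brightness jitter
    return shifted, keypoints + np.array([dx, dy])  # shift keypoints too
```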

Model details:

I tried out four different models, which differ only in their convolution kernel size (see the sketch after the list below).

Hyperparameters

  1. Baseline: 3x3 filters
  2. Baseline_5x5: 5x5 filters
  3. Baseline_7x7: 7x7 filters
  4. Baseline_9x9: 9x9 filters
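Since the four baselines share one layout and differ only in kernel size, they can be written as a single parameterized constructor. This is a sketch: the channel widths, depth, input size, and the 58-keypoint count are assumptions.

```python
import torch.nn as nn

def make_baseline(kernel_size=3, n_keypoints=58):
    """Baseline CNN for full facial keypoints; only kernel_size varies
    between Baseline / Baseline_5x5 / Baseline_7x7 / Baseline_9x9.

    Channel widths, depth, and the 58-keypoint count are assumptions.
    """
    pad = kernel_size // 2  # 'same' padding so every variant keeps spatial sizes
    layers, c_in = [], 1
    for c_out in (16, 32, 64, 128, 256):           # 5 conv blocks (assumed)
        layers += [nn.Conv2d(c_in, c_out, kernel_size, padding=pad),
                   nn.ReLU(), nn.MaxPool2d(2)]
        c_in = c_out
    return nn.Sequential(
        *layers, nn.Flatten(),
        nn.Linear(256 * 7 * 5, 512), nn.ReLU(),    # assumes a 240x180 input -> 7x5
        nn.Linear(512, 2 * n_keypoints),           # (x, y) per keypoint
    )
```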

Experiment results:

| Model | Baseline | Baseline_5x5 | Baseline_7x7 | Baseline_9x9 |
|---|---|---|---|---|
| Training loss | 0.000091 | 0.000081 | 0.000135 | 0.001008 |
| Validation loss | 0.000331 | 0.000303 | 0.000272 | 0.001702 |
| Loss plot | | | | |

Because Baseline_7x7 outperforms the other three models on the validation set, I chose to use Baseline_7x7.

Prediction Results:

I think the model fails on these cases because it seems to have learned the shape of an average face and tends to predict the "average" of the keypoints in the dataset.

That's why it seems to predict a large face for the first image and a smaller face for the second.

Learned filters of Baseline_7x7:

Part 3: Train With Larger Dataset

Architecture of my models

I directly use a pretrained ResNet (18 and 34) as the starting point and replace its last FC layer with two newly initialized FC layers.

I train with lr=0.001.
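A minimal sketch of this modification (the hidden width of 512 and the Adam optimizer are assumptions; lr=0.001 is from the report):

```python
import torch
import torch.nn as nn
import torchvision.models as models

def make_model(arch="resnet18", n_keypoints=68, hidden=512):
    """Pretrained ResNet whose final FC is replaced by two fresh FC layers.

    The hidden width (512) is an illustrative assumption.
    """
    net = getattr(models, arch)(pretrained=True)   # "resnet18" or "resnet34"
    net.fc = nn.Sequential(                        # two newly initialized FC layers
        nn.Linear(net.fc.in_features, hidden),
        nn.ReLU(),
        nn.Linear(hidden, 2 * n_keypoints),        # (x, y) for each of 68 keypoints
    )
    return net

model = make_model("resnet18")
# Optimizer choice (Adam) is an assumption; lr=0.001 as in the report.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```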

Results

NOTE: the MAE in my report is computed as follows:
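Assuming the standard mean absolute error over all $N$ predicted coordinates, with predictions $\hat{p}_i$ and ground truth $p_i$ in the resized input's coordinate system:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left|\hat{p}_i - p_i\right|$$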

And Kaggle's MAE is computed as follows:
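Presumably the same absolute-error average, but with coordinates measured in the original image's pixel space, which would explain why the Kaggle-metric numbers are larger:

$$\mathrm{MAE}_{\mathrm{Kaggle}} = \frac{1}{N}\sum_{i=1}^{N} \left|\hat{p}_i^{\,\mathrm{orig}} - p_i^{\,\mathrm{orig}}\right|$$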

| Model | ResNet18 + 2FC | ResNet34 + 2FC |
|---|---|---|
| Train | 1.548969 | 1.917886 |
| Valid | 1.631135 | 1.777731 |
| Kaggle | 7.73143 | N/A |
| Graph | | |
| Model | ResNet18 + 2FC |
|---|---|
| Train | 1.02 |
| Valid | 1.11 |
| Valid (Kaggle metric) | 5.86 |
| Kaggle | 6.41695 |
| Graph | |

Some sampled augmented images:

|   | Before | After |
|---|--------|-------|
| 1 |        |       |
| 2 |        |       |
| 3 |        |       |

I noticed that the model seems to perform best on the first image, but not so well on the second and third. I guess there might be several reasons:

  1. The second image is tilted, and I didn't train the model with rotational augmentation.
  2. The third image seems to have been filtered heavily by the photographer, and the head is not exactly in the middle, so the model predicted the 68 keypoints closer to the middle of the image.

Bells & Whistles

I use this repository as the backbone of my fully convolutional network for keypoint detection (U-Net + ResNet).

I place a Gaussian at each keypoint, resulting in a (68, H, W) heatmap for my model to learn.
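A sketch of the heatmap construction (the value of sigma is an assumption):

```python
import numpy as np

def make_heatmaps(keypoints, h, w, sigma=3.0):
    """Render one 2-D Gaussian per keypoint -> (68, H, W) regression target.

    keypoints: (68, 2) array of (x, y) pixel coordinates.
    sigma (in pixels) is an illustrative assumption.
    """
    ys, xs = np.mgrid[0:h, 0:w]                      # pixel coordinate grids
    maps = np.exp(-((xs[None] - keypoints[:, 0, None, None]) ** 2 +
                    (ys[None] - keypoints[:, 1, None, None]) ** 2)
                  / (2 * sigma ** 2))
    return maps.astype(np.float32)                   # shape (68, H, W)
```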

Visualization of the Gaussian map for this image (only 2 keypoints shown):

Here are some example predictions from the dense output, taking the argmax of each heatmap channel as the peak:
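The peak extraction can be as simple as a per-channel argmax; a minimal sketch:

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Take the argmax of each (H, W) channel as that keypoint's location.

    heatmaps: (68, H, W) array of predicted heatmaps -> (68, 2) (x, y).
    """
    k, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(k, -1).argmax(axis=1)  # per-channel peak index
    ys, xs = np.unravel_index(flat_idx, (h, w))        # back to 2-D coordinates
    return np.stack([xs, ys], axis=1)
```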

And the loss graph:

Train loss: 0.027509, valid loss: 0.088322

And I scored 5.88478 on Kaggle.

Conclusions