Project 5: Facial Keypoint Detection with Neural Networks

CS194-26: Image Manipulation and Computational Photography

Author: Sunny Shen

Part 1: Nose Tip Detection

In the first part of the project, I trained a neural network to detect the nose tip in facial images, using the IMM Face Database: 240 images of 40 persons, with 6 images per person taken from different viewpoints.

I used the images of the first 32 persons (192 images) as the training set and those of the remaining 8 persons (48 images) as the validation set.
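
A minimal sketch of this split, assuming the IMM files sort by person id (6 views per person) and that each image has a matching .asf file holding the 58 landmarks as relative (x, y) coordinates; the directory name and parsing details are my assumptions:

import glob
import numpy as np

# person-level split: the first 192 sorted files cover persons 1-32
all_imgs = sorted(glob.glob("imm_face_db/*.jpg"))      # 240 images
train_imgs, val_imgs = all_imgs[:192], all_imgs[192:]

def read_landmarks(asf_path):
    # landmark rows of an .asf file have 7 whitespace-separated fields,
    # with relative x and y in fields 2 and 3 (values in [0, 1])
    pts = []
    with open(asf_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 7:
                pts.append([float(fields[2]), float(fields[3])])
    return np.array(pts)                               # shape (58, 2)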

Here are some face images with their ground-truth nose tips:

nose_tip_truth.jpg

NN Design:

Net(
  (conv1): Conv2d(1, 12, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(12, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(24, 36, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=6912, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=2, bias=True)
)
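
For reference, a module definition consistent with this printout might look like the sketch below. The ReLU/max-pool placement and the 96x128 input size are my assumptions: three 2x2 max-pools reduce 96x128 to 12x16, and 36 * 12 * 16 = 6912 matches fc1's in_features.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, 3, padding=1)
        self.conv2 = nn.Conv2d(12, 24, 3, padding=1)
        self.conv3 = nn.Conv2d(24, 36, 3, padding=1)
        self.fc1 = nn.Linear(6912, 128)
        self.fc2 = nn.Linear(128, 2)

    def forward(self, x):
        # each conv followed by ReLU and a 2x2 max-pool (assumed)
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)   # predicted (x, y) of the nose tip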

Hyperparameter tuning 1: Learning Rate

I trained my model for 25 epochs with learning rates of 0.001, 0.005, and 0.01. Here are the training and validation MSE losses during training.

changing_lr_k3_recalculate.jpg

The training set is quite small, and with a larger learning rate the loss fluctuates a lot. I therefore think a learning rate of 0.001 with 25 epochs looks best (the validation loss was not increasing, so I would not worry much about overfitting at 25 epochs).
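
A minimal sketch of the training loop under these settings; Adam is my assumption, since the optimizer isn't stated above, and train_loader / val_loader are assumed to yield (image, keypoint) batches:

import torch

net = Net()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

for epoch in range(25):
    net.train()
    for imgs, pts in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(imgs), pts)   # MSE on the (x, y) prediction
        loss.backward()
        optimizer.step()

    # track average validation MSE after each epoch
    net.eval()
    with torch.no_grad():
        val_loss = sum(criterion(net(i), p).item()
                       for i, p in val_loader) / len(val_loader)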

Hyperparameter tuning 2: Kernel Size

Net_k5(
  (conv1): Conv2d(1, 12, kernel_size=(5, 5), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(12, 24, kernel_size=(5, 5), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(24, 36, kernel_size=(5, 5), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=5040, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=2, bias=True)
)

I also experimented with a kernel size of 5. Note that fc1's in_features shrinks from 6912 to 5040: a 5x5 convolution with a padding of 1 trims each feature map by 2 pixels per dimension, and a quick way to check the new flattened size is to push a dummy tensor through the conv stack, as sketched below. Here are the training and validation MSE losses during training.
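
A sketch of that check, reusing the same (assumed) 96x128 input size as above:

import torch
import torch.nn as nn
import torch.nn.functional as F

# each 5x5 conv with padding=1 trims the map by 2 pixels per dimension
# before pooling, so the flattened size drops from 6912 to 5040
convs = [nn.Conv2d(1, 12, 5, padding=1),
         nn.Conv2d(12, 24, 5, padding=1),
         nn.Conv2d(24, 36, 5, padding=1)]
x = torch.zeros(1, 1, 96, 128)
for conv in convs:
    x = F.max_pool2d(conv(x), 2)
print(x.flatten(1).shape[1])   # -> 5040, matching fc1 above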

changing_lr_k5_recalculate.jpg

Hyperparameter tuning in summary

Here's a summary of the training and validation MSE losses with different learning rates and kernel sizes:

kszie_lr_recalculate.jpg

A kernel size of 3 with a learning rate of 0.001 appears to be the best choice, as it yields low loss and little fluctuation across epochs.

Here are some example predictions:

Predictions that seem to work pretty well

result_part1_6.jpg

result_part1_15.jpg

result_part1_11.jpg

result_part1_8.jpg

Predictions that didn't really work

result_part1_7.jpg

result_part1_2.jpg

result_part1_9.jpg

I think the model fails on these images because of inconsistent or poor lighting. In the first two images the faces are half lit and half in shadow, and in the third the face is entirely in shadow (darker than the rest of the dataset). These are not the "traditional" well-lit photos that make up most of what the model sees.

Part 2: Full Facial Keypoints Detection

Now we move on to predicting all 58 facial keypoints.

Data Augmentation: Since the dataset is small, I applied data augmentation during training to avoid overfitting: I rotated images by a random angle in [-15, 15] degrees, applied a random horizontal flip with probability 0.3, and randomly jittered the images. A sketch of these augmentations follows.
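
A minimal sketch of the augmentations; the jitter strength is an assumption, and for a symmetric landmark scheme a horizontal flip should also swap the left/right point indices, which I omit here for brevity:

import math
import random
import torch
import torchvision.transforms.functional as TF

def augment(image, keypoints):
    # image: (1, H, W) float tensor; keypoints: (58, 2) tensor of
    # (x, y) pixel coordinates
    _, h, w = image.shape
    cx, cy = w / 2.0, h / 2.0

    # 1) random rotation in [-15, 15] degrees about the image center
    angle = random.uniform(-15.0, 15.0)
    image = TF.rotate(image, angle)   # counter-clockwise for positive angle
    a = math.radians(angle)
    dx, dy = keypoints[:, 0] - cx, keypoints[:, 1] - cy
    # rotate the keypoints to match; in image coordinates (y pointing
    # down) the signs are easy to flip, so sanity-check on one example
    keypoints = torch.stack(
        [cx + math.cos(a) * dx + math.sin(a) * dy,
         cy - math.sin(a) * dx + math.cos(a) * dy], dim=1)

    # 2) horizontal flip with probability 0.3; mirror the x coordinates
    if random.random() < 0.3:
        image = TF.hflip(image)
        keypoints[:, 0] = (w - 1) - keypoints[:, 0]

    # 3) random brightness jitter (strength is an assumption)
    image = TF.adjust_brightness(image, random.uniform(0.7, 1.3))
    return image, keypoints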

Here are some images with their ground-truth facial keypoints:

facial_pts_truth.jpg

NN Design:

Net(
  (conv1): Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(20, 40, kernel_size=(5, 5), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(40, 60, kernel_size=(5, 5), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(60, 80, kernel_size=(5, 5), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(80, 100, kernel_size=(5, 5), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=1800, out_features=1500, bias=True)
  (fc2): Linear(in_features=1500, out_features=116, bias=True)
)

Here's a summary of the training and validation MSE losses:

P2_loss.jpg

Predictions that seem to work pretty well

p2_out20_1.jpg

p2_out19_2.jpg

p2_out18_1.jpg

p2_out11_1.jpg

Predictions that didn't really work

p2_out16_2.jpg

p2_out15_2.jpg

p2_out14_2.jpg

p2_out10_2.jpg

I noticed that when the subject's head is turned away, the model is more likely to misplace the keypoints. As in Part 1, poor lighting (a face that is half bright and half dark, or not well lit at all) also led to worse predictions.

Learned Filters

Conv 1 Filters:

Conv 1 Filters.jpg

Conv 2 Filters:

Conv 2 Filters.jpg

Conv 3 Filters:

Conv 3 Filters.jpg

Conv 4 Filters:

Conv 4 Filters.jpg

Conv 5 Filters:

Conv 5 Filters.jpg
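
The filter grids above can be produced by pulling the learned weights out of each conv layer and plotting every kernel as a small image. A sketch for conv1, where net is the trained Part 2 model; deeper layers have one kernel per input channel, and only the first is plotted here:

import matplotlib.pyplot as plt

weights = net.conv1.weight.detach().cpu()   # shape (out_ch, in_ch, 5, 5)
fig, axes = plt.subplots(1, weights.shape[0],
                         figsize=(2 * weights.shape[0], 2))
for i, ax in enumerate(axes):
    ax.imshow(weights[i, 0], cmap='gray')    # first input channel only
    ax.axis('off')
plt.show()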

Part 3: Train With Larger Dataset

In this part, we use a larger dataset of 6666 images of varying sizes, each annotated with 68 facial keypoints.

I used 85% of the images as the training set and the remaining 15% as the validation set, along with the same data augmentation techniques from Part 2. A sketch of the split follows.
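
A minimal sketch of the split; random_split is my assumption about how it was done, and dataset stands for the full 6666-image dataset:

import torch

n_train = int(0.85 * len(dataset))          # 85% train, 15% validation
train_set, val_set = torch.utils.data.random_split(
    dataset, [n_train, len(dataset) - n_train])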

part3_origin

NN Design:

I used a pretrained ResNet34 model with two modifications: the first conv layer takes 1 input channel, since I work with grayscale images, and the final fully connected layer outputs 136 values, the (x, y) coordinates of the 68 landmarks. A sketch of these changes is below, followed by the full resulting architecture.
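
A minimal sketch of the two changes; note that replacing conv1 discards its pretrained weights, so that layer is retrained from scratch (averaging the pretrained RGB filters into one channel would be an alternative):

import torch.nn as nn
import torchvision.models as models

model = models.resnet34(pretrained=True)
# 1 input channel for grayscale instead of 3
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3,
                        bias=False)
# 136 outputs: (x, y) for each of the 68 landmarks
model.fc = nn.Linear(model.fc.in_features, 136)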

ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (3): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (3): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (4): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (5): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)

Results

My MAE (Mean Absolute Error) on Kaggle is 12.23218.
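
Here MAE is, as far as I can tell, the mean absolute difference between predicted and true coordinates over all keypoints (the exact Kaggle formula is an assumption):

import numpy as np

def mae(pred, truth):
    # pred, truth: (N, 68, 2) arrays of keypoints in pixel coordinates
    return np.abs(pred - truth).mean()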

resnet34_2

Predictions on test set

0

13

18

19

Predictions on my own images

my_imge0

my_imge0

my_imge0

my_imge0

my_imge0

My model did pretty well on the images of Ron, Harry, and Hermione. The performance on Lily Collins's face was acceptable, but the model didn't get the boundary of her face completely right.