CS 194-26 Project 5: Facial Keypoint Detection with Neural Networks

Brian Zhu, brian_zhu@berkeley.edu

Part 1: Nose-Tip Detection

Ground Truth Keypoints

Samples from training set:

Samples from validation set:
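As a point of reference, here is a minimal sketch of a dataset that pairs each grayscale image with its nose-tip keypoint. The file handling, the landmark index NOSE_IDX, and the 60x80 input size are assumptions (the latter inferred from the network's 1728-unit flatten size below), not the exact pipeline used:

import torch
from torch.utils.data import Dataset
from skimage import io
from skimage import transform as sktr

class NoseDataset(Dataset):
    # Yields (1xHxW grayscale tensor, nose-tip (x, y) in [0, 1] coordinates).
    # image_paths and keypoints (N x 58 x 2) are assumed parsed elsewhere.
    NOSE_IDX = 52  # hypothetical index of the nose-tip landmark

    def __init__(self, image_paths, keypoints, out_hw=(60, 80)):
        self.image_paths = image_paths
        self.keypoints = keypoints
        self.out_hw = out_hw

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, i):
        img = io.imread(self.image_paths[i], as_gray=True)
        img = sktr.resize(img, self.out_hw).astype("float32") - 0.5  # roughly zero-centered
        pt = torch.tensor(self.keypoints[i][self.NOSE_IDX], dtype=torch.float32)
        return torch.from_numpy(img).unsqueeze(0), pt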

Convnet Architecture

ConvNetPart1(
  (conv_layers): Sequential(
    (0): Conv2d(1, 16, kernel_size=(7, 7), stride=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=same)
    (4): ReLU(inplace=True)
    (5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=same)
    (6): ReLU(inplace=True)
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (flatten)
  (fc_layers): Sequential(
    (0): Linear(in_features=1728, out_features=512, bias=True)
    (1): ReLU(inplace=True)
    (2): Linear(in_features=512, out_features=2, bias=True)
  )
)
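A module definition reproducing this printout is sketched below. The 60x80 grayscale input size is an inference from the flatten size (60x80 becomes 27x37 after the stride-2 7x7 conv, 13x18 after the first pool, and 6x9 after the second, so 32*6*9 = 1728), not something stated explicitly:

import torch.nn as nn

class ConvNetPart1(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=7, stride=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 32, kernel_size=3, padding="same"),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding="same"),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(1728, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 2),  # (x, y) of the nose tip
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = x.flatten(1)  # the (flatten) step above
        return self.fc_layers(x)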

Training and Results

Training parameters:

MSE Loss:

Validation set predictions:

Legend:

We see that the model performs best when the face has a closed mouth and short hair:

But when something deviates from that pattern (e.g., an open mouth or a distinctive hairstyle), the model has a hard time finding the nose tip:

Mouth shape and hairstyle should be irrelevant to locating the nose tip, yet the model picks up on these signals and they throw off its predictions. Combined with the MSE plot showing significantly lower training MSE than validation MSE, this indicates the model is overfitting to the training set, relying on hair, mouth, and possibly other incidental details to predict the nose-tip position.
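For concreteness, here is a minimal sketch of the training setup. Adam and nn.MSELoss are assumptions consistent with the MSE plots, and the default hyperparameters below are placeholders, not the values listed under Training parameters:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, val_set, epochs=25, lr=1e-3, batch_size=8):
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        for imgs, pts in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), pts)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(i), p).item()
                           for i, p in val_loader) / len(val_loader)
        print(f"epoch {epoch}: val MSE = {val_loss:.5f}")
    return val_loss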

Hyperparameter Sweep

Sweeping across:

Constant parameters:

All runs together:

Holding learning rate constant at 3e-4:

Holding batch size constant at 8:

Trends:
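As an aside, a sketch of how such a grid sweep can be driven, reusing the train() sketch from Part 1 (the grids are hypothetical, chosen to include the constants lr = 3e-4 and batch size 8 from the plots above):

import itertools

learning_rates = [1e-4, 3e-4, 1e-3]  # hypothetical grid
batch_sizes = [4, 8, 16]             # hypothetical grid

results = {
    (lr, bs): train(ConvNetPart1(), train_set, val_set, lr=lr, batch_size=bs)
    for lr, bs in itertools.product(learning_rates, batch_sizes)
}
best = min(results, key=results.get)  # lowest final validation MSE
print("best (lr, batch_size):", best)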

Part 2: Full Facial Keypoints Detection

Data Augmentation

Parameters:

Samples with ground truth keypoints:
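A sketch of keypoint-aware augmentation: the same rotation and shift applied to the image must also be applied to the keypoint coordinates. The ranges below are placeholders, not the values listed under Parameters:

import numpy as np
from skimage.transform import rotate

def augment(img, kpts, max_deg=15, max_shift=10):
    # img: HxW grayscale array; kpts: Nx2 array of (x, y) pixel coordinates.
    h, w = img.shape
    center = np.array([w / 2, h / 2])
    # Rotate the image counter-clockwise about its center...
    deg = np.random.uniform(-max_deg, max_deg)
    img = rotate(img, deg, center=tuple(center), mode="edge")
    # ...and rotate the keypoints with the matching matrix for
    # image coordinates (x right, y down).
    rad = np.deg2rad(deg)
    c, s = np.cos(rad), np.sin(rad)
    kpts = (kpts - center) @ np.array([[c, s], [-s, c]]).T + center
    # Shift image and keypoints together (np.roll wraps at the borders,
    # which is acceptable for small shifts in a sketch).
    dx, dy = np.random.randint(-max_shift, max_shift + 1, size=2)
    img = np.roll(img, (dy, dx), axis=(0, 1))
    return img, kpts + [dx, dy]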

Convnet Architecture

ConvNetPart2(
  (conv_layers): Sequential(
    (0): Conv2d(1, 64, kernel_size=(11, 11), stride=(4, 4))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 128, kernel_size=(5, 5), stride=(1, 1), padding=same)
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=same)
    (7): ReLU(inplace=True)
    (8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=same)
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=same)
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (flatten)
  (fc_layers): Sequential(
    (0): Linear(in_features=8960, out_features=3072, bias=True)
    (1): ReLU(inplace=True)
    (2): Linear(in_features=3072, out_features=116, bias=True)
  )
)
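The 116 outputs are the 58 keypoints' (x, y) coordinates, and the flatten size of 8960 corresponds to a 5x7 feature map with 256 channels, which is consistent with 180x240 grayscale inputs (an inference; the input resolution is not stated above). A quick shape check:

import torch
import torch.nn as nn

conv_layers = nn.Sequential(
    nn.Conv2d(1, 64, 11, stride=4), nn.ReLU(inplace=True), nn.MaxPool2d(2, 2),
    nn.Conv2d(64, 128, 5, padding="same"), nn.ReLU(inplace=True), nn.MaxPool2d(2, 2),
    nn.Conv2d(128, 256, 3, padding="same"), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding="same"), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding="same"), nn.ReLU(inplace=True), nn.MaxPool2d(2, 2),
)
x = torch.randn(1, 1, 180, 240)
print(conv_layers(x).flatten(1).shape)  # torch.Size([1, 8960]) = 256 * 5 * 7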

Training and Results

Parameters:

MSE Loss:

Validation set predictions:

Just like in Part 1, the model is overfitting to the training set and has a hard time generalizing to unseen details, such as hairstyle, a smiling or open mouth, or hands in the image.

Learned filters from the first convolution layer:
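The filters can be read straight out of the first Conv2d and plotted as a grid; a minimal sketch, assuming model is the trained ConvNetPart2 from above:

import matplotlib.pyplot as plt

weights = model.conv_layers[0].weight.detach().cpu()  # shape (64, 1, 11, 11)
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w.squeeze(0), cmap="gray")
    ax.axis("off")
plt.show()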

Part 3: Train With Larger Dataset

Convnet Architecture

We use a pretrained ResNet-18, modified for grayscale input.
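A sketch of the modification (the exact surgery is an assumption; only "pretrained ResNet-18, modified for grayscale" is stated above, and NUM_KPTS is a placeholder for the dataset's keypoint count):

import torch
import torch.nn as nn
from torchvision import models

NUM_KPTS = 68  # placeholder: number of keypoints in the larger dataset

model = models.resnet18(pretrained=True)

# Swap the 3-channel stem for a 1-channel conv, initializing it with the
# average of the pretrained RGB filters so pretrained features carry over.
old_conv = model.conv1
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    model.conv1.weight.copy_(old_conv.weight.mean(dim=1, keepdim=True))

# Replace the 1000-way classification head with a keypoint regression head.
model.fc = nn.Linear(model.fc.in_features, NUM_KPTS * 2)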

Training and Results

Parameters:

# ImageNet statistics expected by the pretrained ResNet-18 (RGB inputs):
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
# Since our inputs are single-channel grayscale, collapse the RGB
# statistics into one channel by averaging:
normalize = transforms.Normalize(mean=[(0.485 + 0.456 + 0.406)/3],
                                 std=[(0.229 + 0.224 + 0.225)/3])

MSE Loss:

MAE on test set:
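Here MAE is the mean absolute difference between predicted and ground-truth keypoint coordinates; a sketch of the computation (whether coordinates are in pixels or normalized units is left to the surrounding pipeline):

import torch

@torch.no_grad()
def mean_absolute_error(model, loader):
    model.eval()
    total_err, total_n = 0.0, 0
    for imgs, pts in loader:
        preds = model(imgs)
        total_err += (preds - pts).abs().sum().item()
        total_n += pts.numel()
    return total_err / total_n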

Validation set predictions:

Test set predictions: