CS 294: Project #4: Facial Keypoint Detection with Neural Networks ¶

Emaad Khwaja¶

Part 1: Nose Tip Detection¶

Dataloader

The Dataloader was highly inspired from the Pytorch tutorial page. The images are converted to grayscale, resized (with points adjusted accordingly) to 80x60, and normalized such that the float range is from -.5 to .5.

The test set contains the all images corresponding to the first 32 people, and the test set contains the last 8. Images within these sets are shuffled.

Examples are shown below of faces run through the dataloader. The ground truth nose point is marked in green.

CNN

The CNN architecture, NoseNet, uses 4 convolutional layers, followed by 2 fully connected layers.

The first convolutional layer is followed by an Max pooling operation and a ReLU regularizer. The output of the final fully-connected layer is a [1,2] array.

The input of the layer is a single image, and the output is a 2-element tensor. This tensor is resized to a 2x1 array and compared against the ground-truth nose point.

3x3 convolutional filters were used with padding = 1. Pooling occureed of 2x2 sections.

The architecture is shown below:

NoseNet(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (maxpool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (relu1): ReLU()
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (maxpool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (relu2): ReLU()
  (conv3): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (maxpool3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (relu3): ReLU()
  (conv4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (maxpool4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (relu4): ReLU()
  (fc1): Linear(in_features=480, out_features=240, bias=True)
  (relu5): ReLU()
  (fc2): Linear(in_features=240, out_features=2, bias=True)
)

Loss Function and Optimizer

MSELoss was selected as the loss function, and Adam with a learning rate of 1e-3 was chosen as an optimizer. The training was performed over the course of 25 epochs.

Final Training Loss: 0.0006580417709640698
Final Validation Loss: 0.0017410991336825833

Ground truth labels are marked in Green and Predicted Points are marked in orange.

We can see this model performs very well. The examples on the right show that it struggles in dealing with the region between cheekbones and dimples. This could be compositionally similar to the area beneath the nose.

Alternative Model

I played around with the model and obtained slightly better performance from a few key modifications:

Learning Rate was tuned to 5e-4.
The first pooling operation was switched to an Average Pool rather than a Max Pool.
Dropout with a p=.2 was added before the fully connected layer in order to reinforce generalizability of the model.
A sigmoid regularizer was added as the final step in the network to constrain the outputs between 0 and 1.

I believe the Average pooling operation assists in learning because it prserves spatial features of the image. Since there is a slight discrepancy between the actual nose tip and the annotated point (i.e. not along the edge), it's likely that this region would not be well represented by a Max Pooling operation.

Constraining the output between 0 and 1 allows for the network to converge faster by limiting the output to valid values.

NoseNetAlt(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (avgpool): AvgPool2d(kernel_size=2, stride=2, padding=0)
  (relu1): ReLU()
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (maxpool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (relu2): ReLU()
  (conv3): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (maxpool3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (relu3): ReLU()
  (conv4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (maxpool4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (relu4): ReLU()
  (conv2_drop): Dropout2d(p=0.2, inplace=False)
  (fc1): Linear(in_features=480, out_features=240, bias=True)
  (relu5): ReLU()
  (fc2): Linear(in_features=240, out_features=2, bias=True)
  (sigmoid): Sigmoid()
)

Final Training Loss: 0.0002499745857073186
Final Validation Loss: 0.001130979319986633

We can see that the training and validation losses are marginally better than the original architecture. It is interesting to note, however, that the size of the largest loss on the validation set are significantly lower than before, indicating more robustness to outliers.

CS 294: Project #4: Facial Keypoint Detection with Neural Networks ¶

Emaad Khwaja¶

Part 1: Nose Tip Detection¶

Part 2: Full Facial Keypoints Detection¶

Part 3: Train With Larger Dataset¶

Bells and Whistles¶