CS 194-26 Project 5

Dylan Tran

Part 1: Nose Tip Detection

Sampled images w/ ground truth keypoints

Pixel values have been normalized to [-0.5, 0.5] and each image has been resized to 60 x 80. The nose keypoint coordinates have also been normalized to [0, 1].
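
A minimal sketch of this preprocessing in PyTorch, assuming grayscale PIL input and that 60 x 80 means height x width (preprocess is a hypothetical helper, not the project's actual code):

import torch
import torchvision.transforms.functional as TF

def preprocess(img, nose_xy, out_h=60, out_w=80):
    # img: grayscale PIL image; nose_xy: (x, y) nose tip in pixel coords.
    w, h = img.size
    img = TF.resize(img, [out_h, out_w])    # resize to 60 x 80 (H x W)
    x = TF.to_tensor(img) - 0.5             # [0, 1] -> [-0.5, 0.5]
    kp = torch.tensor([nose_xy[0] / w,      # normalize coords to [0, 1]
                       nose_xy[1] / h], dtype=torch.float32)
    return x, kp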

Train and validation MSE loss

We train for 15 epochs with a learning rate of 1e-3 and a batch size of 4. Our network consists of 3 convolutional layers (each followed by a ReLU and a max pool) as well as 2 fully connected layers.
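
A minimal sketch of such a network (only the 3-conv / 2-FC structure is from the writeup; the channel widths and hidden size are assumptions):

import torch.nn as nn

class NoseTipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 16, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 24, 3), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(24 * 5 * 8, 128),  # 5 x 8 spatial left after pooling a 60 x 80 input
            nn.ReLU(),
            nn.Linear(128, 2),           # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.regressor(self.features(x))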

Network predictions

Generally, the network works well on images that are centered with the face looking forward. It tends to fail on off-center images and on faces turned at an angle, most likely due to the limited amount of training data containing off-center and rotated faces. It also fails on some images with noticeably higher brightness or contrast than the rest.

Part 2: Facial Keypoint Detection

Sampled images w/ ground truth keypoints

Pixel values have been normalized to [-0.5, 0.5] and each image has been resized to 240 x 180. The facial keypoint coordinates have also been normalized to [0, 1]. Below are the ground truth labels for a couple of images.

Model Architecture

We use a larger network consisting of 5 convolutional layers and 3 fully connected layers. The detailed breakdown is shown below.
FacialPointsConvNet(
  (conv1): Conv2d(1, 12, kernel_size=(7, 7), stride=(1, 1))
  (conv2): Conv2d(12, 15, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(15, 24, kernel_size=(3, 3), stride=(1, 1))
  (conv4): Conv2d(24, 20, kernel_size=(3, 3), stride=(1, 1))
  (conv5): Conv2d(20, 15, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=225, out_features=200, bias=True)
  (fc2): Linear(in_features=200, out_features=200, bias=True)
  (fc3): Linear(in_features=200, out_features=116, bias=True)
)
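
The printout above omits the functional pieces, so here is a sketch of the full module. The ReLU and 2 x 2 max pool after every conv are assumptions (PyTorch's repr doesn't show functional ops), but they are consistent with a 180 x 240 input flattening to the 225 features fc1 expects (15 channels x 3 x 5):

import torch.nn as nn
import torch.nn.functional as F

class FacialPointsConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, 7)
        self.conv2 = nn.Conv2d(12, 15, 5)
        self.conv3 = nn.Conv2d(15, 24, 3)
        self.conv4 = nn.Conv2d(24, 20, 3)
        self.conv5 = nn.Conv2d(20, 15, 3)
        self.fc1 = nn.Linear(225, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 116)  # 58 keypoints x 2 coordinates

    def forward(self, x):               # x: (N, 1, 180, 240)
        for conv in (self.conv1, self.conv2, self.conv3, self.conv4, self.conv5):
            x = F.max_pool2d(F.relu(conv(x)), 2)
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
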
Hyperparameters:
lr: 1e-3
batch_size: 32
epochs: 20

Train and validation MSE loss

The plot of training and validation loss is shown below.
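
A minimal sketch of the loop producing these curves. train_loader and val_loader are assumed DataLoaders yielding (image, keypoints) batches, and the Adam optimizer is an assumption (the writeup only states the learning rate):

import torch
import torch.nn as nn

def fit(model, train_loader, val_loader, epochs=20, lr=1e-3, device="cpu"):
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is an assumption
    loss_fn = nn.MSELoss()
    train_hist, val_hist = [], []
    for _ in range(epochs):
        model.train()
        total = 0.0
        for imgs, kps in train_loader:
            imgs, kps = imgs.to(device), kps.to(device)
            opt.zero_grad()
            loss = loss_fn(model(imgs), kps.view(kps.size(0), -1))
            loss.backward()
            opt.step()
            total += loss.item() * imgs.size(0)
        train_hist.append(total / len(train_loader.dataset))

        model.eval()
        total = 0.0
        with torch.no_grad():
            for imgs, kps in val_loader:
                imgs, kps = imgs.to(device), kps.to(device)
                total += loss_fn(model(imgs), kps.view(kps.size(0), -1)).item() * imgs.size(0)
        val_hist.append(total / len(val_loader.dataset))
    return train_hist, val_hist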

Network predictions

Generally, the network works well on images that are centered with the face looking forward. It tends to fail on off-center images and on faces turned at an angle, most likely due to the limited amount of training data containing off-center and rotated faces. Although we do perform data augmentation, we still have only a small set of distinct faces. We also feed the entire image to the network, so the many background pixels act as features and may contribute noise. Our kernels are fairly small and may not capture larger facial structures. Additionally, since the many keypoint coordinates are normalized, small shifts in predicted points may not increase the loss by much.
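
A minimal sketch of keypoint-aware augmentation. The specific transforms (brightness jitter and a small random shift) are assumptions; the important part is that any geometric transform is applied to the keypoints as well:

import random
import torch

def augment(img, kps, max_shift=10):
    # img: (1, H, W) tensor in [-0.5, 0.5]; kps: (K, 2) of (x, y) in [0, 1].
    _, H, W = img.shape
    # Brightness jitter.
    img = (img * random.uniform(0.8, 1.2)).clamp(-0.5, 0.5)
    # Random shift, applied identically to image and keypoints.
    # (torch.roll wraps pixels around the border, which is acceptable
    # for small shifts in a sketch.)
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    img = torch.roll(img, shifts=(dy, dx), dims=(1, 2))
    kps = kps + torch.tensor([dx / W, dy / H])
    return img, kps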

Visualizing Filters

The learned filters for each convolutional layer are visualized below.
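
A minimal sketch of how such a visualization can be produced. Averaging each filter over its input channels is an assumption; the writeup doesn't say how multi-channel filters were rendered:

import matplotlib.pyplot as plt

def show_filters(conv, cols=6):
    # Average over input channels so deeper layers still render as 2-D images.
    w = conv.weight.detach().cpu().mean(dim=1)  # (out_channels, kH, kW)
    rows = -(-w.size(0) // cols)                # ceiling division
    _, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for ax in axes.flat:
        ax.axis("off")
    for i, ax in enumerate(axes.flat[: w.size(0)]):
        ax.imshow(w[i], cmap="gray")
    plt.show()

For example, show_filters(model.conv1) renders the first layer's 7 x 7 filters.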

Part 3: Larger Dataset

Sampled images w/ ground truth keypoints

Below is an example of an image with ground truth keypoints, cropped using the bounding boxes, resized, and augmented.
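
A minimal sketch of the crop-and-normalize step. The (left, top, width, height) box format and the square output size are assumptions:

import torch
import torchvision.transforms.functional as TF

def crop_to_box(img, kps, box, out_size=224):
    # img: (1, H, W) tensor; kps: (K, 2) of (x, y) in pixel coords.
    left, top, w, h = box
    img = TF.crop(img, top, left, h, w)
    img = TF.resize(img, [out_size, out_size])
    # Shift keypoints into the crop, then normalize to [0, 1].
    kps = (kps - torch.tensor([left, top])) / torch.tensor([w, h])
    return img, kps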

Mean absolute error on Kaggle: 186.52353

Model Architecture

For this larger dataset, we used the ResNet18 architecture, which is described below.



It consists of 5 convolutional stages and a fully connected layer. It was pretrained on color images, which isn't directly applicable since we train on grayscale images, so we modify the first layer to take 1 input channel and the final fully connected layer to output 68 * 2 values.
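
A minimal sketch of these modifications (the stem hyperparameters mirror torchvision's ResNet18 defaults; pretrained=True reflects the older torchvision API):

import torch.nn as nn
import torchvision.models as models

def build_model(num_keypoints=68, pretrained=True):
    model = models.resnet18(pretrained=pretrained)
    # Replace the 3-channel stem with a 1-channel one for grayscale input.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Replace the classification head with a 68 * 2 coordinate regressor.
    model.fc = nn.Linear(model.fc.in_features, num_keypoints * 2)
    return model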

Hyperparameters:
lr: 1e-3
batch_size: 64
epochs: 10

Train and validation MSE loss

The plot of training and validation loss is shown below.

Network predictions on Test set

Network predictions on other images