CS 194-26 Project 5: Facial Keypoint Detection with Neural Networks

In this project I used neural networks to automatically detect keypoints on faces: first the nose tip alone, then the full set of facial keypoints, and finally the full set again trained on a larger dataset.

Part 1: Nose Tip Detection

In this part I trained on 32 people with 6 images each (192 images) and validated on 8 people with 6 images each (48 images). The goal is to detect the location of the nose tip.
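The split above is by person rather than by image, so no face appears in both sets. A minimal sketch of that bookkeeping follows; the function name and the assumption that images are stored as 6 consecutive indices per person are hypothetical, not taken from the project code.

```python
def split_by_person(num_people=40, images_per_person=6, num_train_people=32):
    """Return (train, val) image indices, keeping each person in one set only.

    Assumes image i belongs to person i // images_per_person (hypothetical layout).
    """
    train, val = [], []
    for person in range(num_people):
        for view in range(images_per_person):
            idx = person * images_per_person + view
            if person < num_train_people:
                train.append(idx)
            else:
                val.append(idx)
    return train, val


train_idx, val_idx = split_by_person()
print(len(train_idx), len(val_idx))  # 192 48
```

Splitting by person keeps the validation loss an estimate of performance on unseen faces instead of unseen views of familiar faces.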

Dataloader Samples:

im1 im2 im3 im4

CNN:

Train Loss: 0.005753048229962587, Val Loss: 0.014989902265369892 (LR: 1e-3, 3 convolution layers, 25 epochs)

1e3_3conv

Train Loss: 0.005749324802309275, Val Loss: 0.01004868745803833 (LR: 1e-3, 4 convolution layers, 25 epochs)

1e3_4conv

Train Loss: 0.005196912679821253, Val Loss: 0.005820723250508308 (LR: 1e-4, 4 convolution layers, 50 epochs)

1e4_4conv

The best model used 4 convolution layers, a learning rate of 1e-4, and 50 epochs. It had the lowest validation loss of the three configurations:
model

Using this model, here are two good and two bad predictions:
Successes:
good1 good2

Failures:
bad1 bad2

These two failures are likely due to head rotation. In the second image, the shadows under the eyes may also resemble the shadow under the nose.

Part 2: Full Facial Keypoints Detection

In this part I predict all 58 facial keypoints, using a larger input image size of 240x180.

Dataloader

For data augmentation, the dataloader applies a random brightness and contrast offset and a random rotation between -10 and 10 degrees.

dataloader1 dataloader2 dataloader3
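Brightness and contrast changes leave the keypoint labels untouched, but a rotation presumably has to be applied to the labels as well as to the pixels. The sketch below shows one way to rotate keypoints in pixel coordinates. The function names, the choice of image centre, and the sign convention are assumptions, and the sign would need to match whatever routine rotates the image itself.

```python
import math
import random


def rotate_keypoints(points, angle_deg, center):
    """Rotate (x, y) pixel coordinates about `center` by `angle_deg`.

    Positive angles turn points counter-clockwise as seen on screen,
    where x grows to the right and y grows downward.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_t + dy * sin_t,
                        cy - dx * sin_t + dy * cos_t))
    return rotated


def sample_rotation_angle(rng=random):
    """Draw an angle uniformly from the -10 to 10 degree range."""
    return rng.uniform(-10.0, 10.0)


# Hypothetical usage on a 240x180 image with two made-up keypoints.
width, height = 240, 180
keypoints = [(120.0, 90.0), (150.0, 90.0)]
angle = sample_rotation_angle()
print(rotate_keypoints(keypoints, angle, center=(width / 2, height / 2)))
```

Rotating in pixel coordinates rather than in normalized [0, 1] coordinates avoids distorting the points when the image is not square.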

CNN

I used 5 convolution layers followed by 2 fully connected layers, trained with a learning rate of 1e-3.
model
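As a rough illustration of a network with this shape, here is a PyTorch sketch with 5 convolution layers and 2 fully connected layers that maps a 240x180 image to 58 (x, y) pairs. Only the layer counts, the input size, the number of keypoints, and the learning rate come from this write-up. The channel widths, kernel sizes, pooling, hidden size, single grayscale input channel, height-180 by width-240 orientation, and the Adam optimizer are all assumptions.

```python
import torch
from torch import nn


class KeypointCNN(nn.Module):
    """Hypothetical 5-conv, 2-FC keypoint regressor (layer sizes are assumed)."""

    def __init__(self, num_keypoints: int = 58):
        super().__init__()
        channels = [1, 16, 32, 64, 128, 256]  # assumed widths, grayscale input
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            ]
        self.features = nn.Sequential(*blocks)
        # Five 2x poolings shrink a 180x240 input to 5 rows by 7 columns.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 5 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, num_keypoints * 2),
        )

    def forward(self, x):
        return self.regressor(self.features(x))


model = KeypointCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer is assumed
batch = torch.zeros(4, 1, 180, 240)  # (batch, channels, height, width)
print(model(batch).shape)  # torch.Size([4, 116])
```

The 116 outputs are the 58 keypoints flattened into x and y coordinates.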

loss

Train Loss: 0.003081735922023654, Val Loss: 0.004428303800523281 (LR: 1e-3, 5 convolution layers, 25 epochs)

Using this model, here are two good and two bad predictions, where orange is my prediction and blue is the ground truth.
Successes:
good1 good2
Failures:
bad1 bad2

The failures here are likely due to head rotation: the model appears to fall back toward the mean face rather than fitting the features of the photo.

Learned convolutional filters:
Layer 1:
layer1
Layer 2:
layer2
Layer 3:
layer3
Layer 4:
layer4
Layer 5:
layer5

Part 3: Train with a Larger Dataset

Dataloader: Same augmentation as Part 2, with random brightness and contrast offsets and random rotation.
dataloader1 dataloader2 dataloader3

Kaggle Submission: Username JerryZ, Public Score 24.60084

Architecture: I used ResNet18, changing the first convolution to accept 1 input channel and the final layer to output 68*2 = 136 values (an x and y coordinate for each of the 68 keypoints).
model

Train Loss: 0.0009061, Val Loss: 0.02571 (LR: 1e-3, 10 epochs)
loss

Results:

1 2 3 4

Own photos:
1 2 3

The model does reasonably well on all three photos but performs worst on the photo of Bezos, likely because his narrow head shape is not well represented among the faces in the dataset.