CS194-26: Intro to Computer Vision and Computational Photography, Fall 2021

Project 5: Facial Keypoint Detection with Neural Networks

Angela Chen



Part 1: Nose Tip Detection

I used images from the IMM Face Database. After converting the images to grayscale and normalizing the float pixel values, I resized them to 80x60. I used a batch size of 1.
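As a rough sketch of what this dataloader setup could look like in PyTorch (the class name, normalization scheme, and keypoint format below are my assumptions, not the exact code):

    import torch
    import skimage.io
    import skimage.transform
    from torch.utils.data import Dataset, DataLoader

    class NoseDataset(Dataset):
        # Hypothetical dataset: grayscale IMM images resized to 80x60 with one
        # nose-tip keypoint each. Subtracting 0.5 is one common normalization;
        # the exact scheme used here is an assumption.
        def __init__(self, image_paths, nose_keypoints):
            self.image_paths = image_paths        # list of IMM image file paths
            self.nose_keypoints = nose_keypoints  # (N, 2) array of (x, y) coords

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            img = skimage.io.imread(self.image_paths[idx], as_gray=True)  # floats in [0, 1]
            img = skimage.transform.resize(img - 0.5, (60, 80))           # H x W = 60 x 80
            img = torch.from_numpy(img).unsqueeze(0).float()              # (1, 60, 80)
            return img, torch.tensor(self.nose_keypoints[idx]).float()

    # Batch size 1, as described above:
    # train_loader = DataLoader(NoseDataset(train_paths, train_kps), batch_size=1, shuffle=True)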

Here are some sampled images from my dataloader with the ground-truth nose keypoint marked in green:



Here's the architecture of my CNN for detecting nose keypoints:
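As an illustration of what a small CNN for this task might look like (the layer counts and channel sizes below are my assumptions, not the exact architecture), a minimal PyTorch sketch:

    import torch.nn as nn

    class NoseNet(nn.Module):
        # Illustrative only: three conv layers followed by two fully connected
        # layers, ending in 2 outputs for the (x, y) nose-tip coordinate.
        def __init__(self):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 12, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 60x80 -> 30x40
                nn.Conv2d(12, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 15x20
                nn.Conv2d(16, 24, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 7x10
            )
            self.fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(24 * 7 * 10, 128), nn.ReLU(),
                nn.Linear(128, 2),
            )

        def forward(self, x):
            return self.fc(self.conv(x))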

I used MSE loss, the Adam optimizer, and a learning rate of 1e-3, and trained for 25 epochs. Here are the training and validation losses:
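A minimal sketch of that training loop, assuming a model and the dataloaders from the sketches above:

    import torch

    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(25):
        model.train()
        train_loss = 0.0
        for imgs, kps in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), kps)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for imgs, kps in val_loader:
                val_loss += criterion(model(imgs), kps).item()

        print(f"epoch {epoch}: train {train_loss / len(train_loader):.5f}, "
              f"val {val_loss / len(val_loader):.5f}")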

Green is ground-truth. Red is my prediction.

Here are some images where my network correctly detected the nose:


Here are some images where my network incorrectly detected the nose:


My network seems to fail to detect noses when the person's head is turned to their left. This is probably because the network never properly learned head turns during training and overfit instead, likely due to the small amount of training data.

Part 2: Full Facial Keypoints Detection

For this part, I resized images to 160x120 and used a batch size of 1. For data augmentation, I randomly changed brightness and saturation, randomly rotated the image by an angle between -15 and 15 degrees, and randomly shifted the image by up to 10 pixels in each direction. I only augmented images in the training set; the validation set received no augmentation.
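A sketch of how these augmentations could be implemented so the keypoints stay consistent with the image (the jitter strengths, the 58-landmark IMM keypoint shape, and the sign conventions are my assumptions):

    import numpy as np
    import torch
    import torchvision.transforms as T
    import torchvision.transforms.functional as TF

    # Jitter strengths are assumptions; the write-up only says brightness and
    # saturation were randomly changed.
    color_jitter = T.ColorJitter(brightness=0.3, saturation=0.3)

    def augment(pil_img, kps):
        # pil_img: RGB PIL image; kps: (58, 2) float tensor of (x, y) pixel coords.
        # Photometric jitter (before grayscale conversion); keypoints unaffected.
        pil_img = color_jitter(pil_img)
        img = TF.to_tensor(TF.to_grayscale(pil_img))  # (1, H, W) floats in [0, 1]
        _, h, w = img.shape

        # Random rotation in [-15, 15] degrees about the image center.
        angle = np.random.uniform(-15, 15)
        img = TF.rotate(img, angle)  # positive angle = counter-clockwise on screen
        theta = np.deg2rad(angle)
        # In image coordinates (y pointing down), that rotation maps a point p
        # to c + M(p - c) with this M:
        M = torch.tensor([[np.cos(theta),  np.sin(theta)],
                          [-np.sin(theta), np.cos(theta)]], dtype=torch.float32)
        c = torch.tensor([w / 2.0, h / 2.0])
        kps = (kps - c) @ M.T + c

        # Random shift of up to 10 pixels in each direction; shift keypoints too.
        dx, dy = np.random.randint(-10, 11, size=2)
        img = TF.affine(img, angle=0.0, translate=[int(dx), int(dy)],
                        scale=1.0, shear=[0.0])
        kps = kps + torch.tensor([dx, dy], dtype=torch.float32)

        return img, kps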

Here are some sampled images from my dataloader with the ground-truth keypoints marked in green:



Here's the architecture of my CNN for detecting face keypoints:

I used MSE loss, the Adam optimizer, and a learning rate of 1e-3, and trained for 25 epochs. Here are the training and validation losses:

Green is ground-truth. Red is my prediction.

Here are some images where my network correctly detected facial keypoints (for the most part):


Here are some images where my network incorrectly detected facial keypoints (very far off):


I personally had a lot of trouble with this part because my model kept appearing to underfit when I compared the validation loss against the training loss. I think I could have trained for more epochs or switched up my architecture more drastically, since adding more convolutional layers didn't really seem to help for this part.

Here are the learned filters for the first convolutional layer:
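A short sketch of how such a visualization can be produced (the model.conv[0] attribute path follows the hypothetical sketch in part 1 and may differ from the real model):

    import math
    import matplotlib.pyplot as plt

    # First conv layer weights: shape (out_channels, in_channels=1, kH, kW).
    weights = model.conv[0].weight.detach().cpu().numpy()

    n = weights.shape[0]
    cols = 8
    rows = math.ceil(n / cols)
    fig, axes = plt.subplots(rows, cols, figsize=(2 * cols, 2 * rows))
    for i, ax in enumerate(axes.flat):
        if i < n:
            ax.imshow(weights[i, 0], cmap="gray")  # single input channel per filter
        ax.axis("off")
    plt.savefig("conv1_filters.png")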

Part 3: Train With Larger Dataset

For this part, I trained on images from the ibug face in the wild dataset. I used the given bounding boxes to crop each image and resized each crop to 224x224. I split the dataset into 80% for training and 20% for validation. For the training dataloader, I used a batch size of 64 and applied the same data augmentations as in part 2. For the validation dataloader, I used a batch size of 256.
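A sketch of the cropping and the split/dataloader setup (the (x, y, w, h) bounding-box format and the dataset variable are assumptions):

    import numpy as np
    import skimage.transform
    from torch.utils.data import DataLoader, random_split

    def crop_and_resize(img, bbox, kps, size=224):
        # Crop the face with its ibug bounding box, resize the crop to 224x224,
        # and map the keypoints into the crop's coordinate frame.
        x, y, w, h = [int(v) for v in bbox]
        crop = skimage.transform.resize(img[y:y + h, x:x + w], (size, size))
        kps = (kps - np.array([x, y])) * np.array([size / w, size / h])
        return crop, kps

    # 80/20 train/validation split and the two dataloaders described above
    # (`dataset` is assumed to be a Dataset over the cropped ibug images).
    n_train = int(0.8 * len(dataset))
    train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
    train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=256)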

Here are some sampled images from my dataloader with green marked as the ground-truth keypoints:



For training, I used a ResNet18 model. I changed the first layer to Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) to accept 1-channel grayscale input, and the last layer to Linear(in_features=512, out_features=136, bias=True) so the 136 outputs would be the (x, y) coordinates of the 68 landmarks for each face. I used MSE loss, the Adam optimizer, a learning rate of 1e-3, and a weight decay of 1e-6.
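In code, those two modifications look like this, building on torchvision's ResNet18:

    import torch
    import torchvision

    model = torchvision.models.resnet18()
    # 1 input channel for grayscale instead of 3 for RGB:
    model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2),
                                  padding=(3, 3), bias=False)
    # 136 outputs = (x, y) for each of the 68 landmarks:
    model.fc = torch.nn.Linear(in_features=512, out_features=136, bias=True)

    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)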

I trained for 5 epochs on Kaggle. Here are the training and validation losses for the 5 epochs:

My mean absolute error for the test set on Kaggle using 5 epochs of training was 11.99853.

Here are some images from the test set with my keypoint predictions. This time, green represents the prediction.



Here are some images of me with my model run on them. Green again represents the prediction.



I didn't do any bounding-box cropping, so I didn't really expect any of these to turn out well, especially the one of me holding a rabbit, since my face is at the top of the image. The prediction for the fourth image of my face turned out alright, though. I also trained with more epochs, but that didn't seem to improve my results by much. To get better results, I could experiment more with different batch sizes, learning rates, and architectures other than ResNet18.

Here are some more predictions on images I found on the internet: