CS 194-26: Computational Photography, Fall 2020

Project 4

Derek Phan

Report

Part 1: Nose Tip Detection

This part offers an introduction to CNNs by detecting the nose tip of a facial image: we train a convolutional neural network that outputs a single nose point. First, I needed to create a DataLoader to easily pass data into the network; the primary challenge was processing the data in the dataloader. Below, we can see some samples from our dataloader, the training and validation losses, two good outputs of the neural network, and two bad outputs.
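As a rough sketch, the DataLoader is built on top of a small PyTorch Dataset along these lines (the class and variable names here are hypothetical, not the exact project code):

    import torch
    from torch.utils.data import Dataset, DataLoader

    class NoseDataset(Dataset):
        # Pairs each grayscale face image with its nose point. Assumes
        # `images` is a list of HxW float arrays and `points` an array of
        # (x, y) coordinates normalized to [0, 1].
        def __init__(self, images, points):
            self.images = images
            self.points = points

        def __len__(self):
            return len(self.images)

        def __getitem__(self, idx):
            img = torch.from_numpy(self.images[idx]).float().unsqueeze(0)  # add channel dim
            pt = torch.from_numpy(self.points[idx]).float()
            return img, pt

    train_loader = DataLoader(NoseDataset(train_images, train_points),
                              batch_size=16, shuffle=True)  # batch size assumed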



DataLoader sample 1
DataLoader sample 2
Training and validation losses over epochs
One good prediction
Another good prediction
One bad prediction
Another bad prediction

The predictions for the side-facing images seemed to fail more often. The dataset appears biased toward front-facing images, so it makes sense that the side-facing images fail more. It may also be biased toward male faces, since the female faces shown above fail.

Part 2: Full Facial Keypoints Detection

This part expands on part 1, but predicts the full set of facial landmarks rather than just a single nose point. It followed the same process as the previous part, where a DataLoader needs to be created to pass data into a CNN. The main difference is that we augment the data to prevent overfitting; in this step, we use a random rotation, applied to both the image and its landmarks (sketched below). Below you can see some samples from the DataLoader with augmentation, training and validation losses across epochs, two good predictions as well as two bad predictions, and a visualization of the filters from the trained network.
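A minimal sketch of the rotation augmentation, assuming landmarks are stored in pixel coordinates and images are PIL images (the helper name and the ±15 degree range are assumptions):

    import numpy as np
    import torchvision.transforms.functional as TF

    def random_rotate(image, landmarks, max_deg=15):
        # Rotate a PIL image and its (N, 2) pixel-coordinate landmarks
        # by the same random angle about the image center.
        angle = np.random.uniform(-max_deg, max_deg)
        rotated = TF.rotate(image, angle)  # counter-clockwise, about the center

        theta = np.deg2rad(angle)
        cx, cy = image.width / 2, image.height / 2
        offsets = landmarks - np.array([cx, cy])
        # Counter-clockwise rotation in pixel coordinates (y points down).
        rot = np.array([[np.cos(theta),  np.sin(theta)],
                        [-np.sin(theta), np.cos(theta)]])
        return rotated, offsets @ rot.T + np.array([cx, cy])

Rotating the landmarks with the same transform as the image keeps the labels consistent, which is the whole point of augmenting a regression dataset.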


DataLoader sample 1
DataLoader sample 2
Neural Network Architecture

The neural network uses 6 convolution layers, each with 30 channels and a 5x5 kernel. In addition, we use 3 pooling layers of size 2x2, alternated between the convolution layers. We also use 2 fully connected layers with an output size of 116, since we have 58 landmark points, each with an x and a y coordinate. We train for 25 epochs with a learning rate of 0.0001 and the Adam optimizer.
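A sketch of this architecture in PyTorch; the text above fixes the layer counts, channel width, kernel size, and output dimension, while the exact conv/pool ordering, padding, 120x160 input resolution, and hidden width of 256 are assumptions:

    import torch
    import torch.nn as nn

    class FaceNet(nn.Module):
        # 6 conv layers (30 channels, 5x5 kernels) with a 2x2 max-pool after
        # every second conv, then 2 fully connected layers ending in 116
        # outputs (58 landmarks x 2 coordinates).
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 30, 5, padding=2), nn.ReLU(),
                nn.Conv2d(30, 30, 5, padding=2), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(30, 30, 5, padding=2), nn.ReLU(),
                nn.Conv2d(30, 30, 5, padding=2), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(30, 30, 5, padding=2), nn.ReLU(),
                nn.Conv2d(30, 30, 5, padding=2), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            # For a 120x160 input, three 2x2 pools leave a 15x20 feature map.
            self.regressor = nn.Sequential(
                nn.Flatten(),
                nn.Linear(30 * 15 * 20, 256), nn.ReLU(),
                nn.Linear(256, 116),
            )

        def forward(self, x):
            return self.regressor(self.features(x))

    model = FaceNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)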

Training and validation losses over epochs
Good face prediction 1
Good face prediction 2
Bad face prediction 1
Bad face prediction 2
Convolution layer 1 visualization
Convolution layer 2 visualization

The images here fail perhaps due to shifting of the points. For both images, there is a significant shift in the predictions when the subject's head is turned: the overall shape of the prediction is close, but its position is off. This is perhaps because the data augmentation included rotation but not shifting (translation), so the network is biased toward faces positioned like those in the training set.

Part 3: Train With Larger Dataset

In this section, we take what we did for part 2 and expand it to train on a much larger dataset of images. The primary challenge was the preprocessing, since we are using a pre-built model for our CNN. We first crop each image to its bounding box, updating the landmarks and converting them to coordinates relative to the crop to make things easier. We then apply data augmentation such as a random rotation and a random color jitter. After that, we pass the data into the CNN to train the model. The best model scored a 27.87297 MAE on Kaggle.
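The cropping step might look like the sketch below, assuming boxes come as (x, y, w, h) pixel tuples that fit inside the image (clipping to image bounds is omitted; the function name is hypothetical):

    import numpy as np

    def crop_to_bbox(image, landmarks, bbox):
        # Crop an H x W image array to the face bounding box and convert
        # the (N, 2) pixel-space landmarks to crop-relative coordinates
        # normalized to [0, 1].
        x, y, w, h = bbox
        crop = image[y:y + h, x:x + w]
        rel = (landmarks - np.array([x, y])) / np.array([w, h])
        return crop, rel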


For training, we use resnet18, with its first convolutional layer changed to (1, 64): 1 input channel for grayscale images and 64 output channels. We train for 10 epochs with a learning rate of 0.0001 and an MSE loss function. We train in batches of 60 and validate in batches of 10 (these numbers were fairly arbitrary and mostly affect how much is loaded into memory at once). In addition, we use 600 images for validation and the remaining 6066 images for training.
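Concretely, the setup amounts to swapping out two layers of torchvision's resnet18 and training with the hyperparameters above (the keypoint count N_POINTS depends on the dataset, and train_loader is assumed to exist):

    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18()
    # Grayscale input: replace the stock (3, 64) first conv with a (1, 64) one.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    N_POINTS = 68  # assumption; set to the dataset's keypoint count
    model.fc = nn.Linear(model.fc.in_features, 2 * N_POINTS)

    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for epoch in range(10):
        for imgs, pts in train_loader:  # batches of 60, as described above
            optimizer.zero_grad()
            loss = criterion(model(imgs), pts.view(pts.size(0), -1))
            loss.backward()
            optimizer.step()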

Training and validation losses over epochs
Test good prediction 1
Test good prediction 2
Test bad prediction 1
Test bad prediction 2
Personal image 1
Personal image 2
Personal image 3

The first image is a decent result, with some issues in the face shape. The second image is a bad result, very poorly predicted; this is likely due to the slant of the face, which is probably steeper than what we accounted for with our random rotation. This makes sense, since our model tended to fail on slanted faces in the test images, as seen above. The final image is probably the best of the three results, and gives a good approximation of where the facial features are.