CS-194 Project 4

By: Calvin Chen

Part 1: Nose Tip Detection

For this section, I used the IMM Face database to conduct nose tip detection for the different faces in the dataset.

Dataloader

For the Dataloader in Part 1, I first converted the images to grayscale, normalized the pixel values to be centered around 0, and then downsampled the images to 80 x 60.
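
Here's a rough sketch of that preprocessing as a PyTorch Dataset. The class name, the skimage-based loading, and the assumption that the nose point arrives as an (x, y) pixel coordinate in the original image are all mine; only the grayscale / center-around-0 / 80x60 pipeline comes from the write-up above.

```python
import numpy as np
import torch
from torch.utils.data import Dataset
import skimage.io as skio
import skimage.transform as sktr

class NoseTipDataset(Dataset):
    # Hypothetical dataset wrapper: image paths and nose points are assumed to
    # be parsed from the IMM annotation files ahead of time.
    def __init__(self, image_paths, nose_points):
        self.image_paths = image_paths
        self.nose_points = nose_points      # one (x, y) pixel coordinate per image

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = skio.imread(self.image_paths[idx], as_gray=True).astype(np.float32)
        h, w = img.shape
        img = img - 0.5                                   # as_gray gives [0, 1], so this centers around 0
        img = sktr.resize(img, (60, 80))                  # 80 x 60 (width x height)
        img = torch.from_numpy(img).unsqueeze(0).float()  # add channel dim -> 1 x 60 x 80
        # scale the nose point from original pixel coords into the resized frame
        x, y = self.nose_points[idx]
        pt = torch.tensor([x * 80.0 / w, y * 60.0 / h], dtype=torch.float32)
        return img, pt
```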

Sampled images from dataloader with ground-truth keypoints

Here are some images from the dataloader with their corresponding ground-truth keypoints on their noses.

CNN

Here, I defined the CNN architecture used for Part 1 as follows (a rough sketch in PyTorch appears after the list):

  1. Conv2D layer with 3 input channels and 12 output channels, kernel size 5
  2. Conv2D layer with 12 input channels and 20 output channels, kernel size 5
  3. Conv2D layer with 20 input channels and 28 output channels, kernel size 5
  4. Fully connected layer with ReLU non-linearity and maxpooling
  5. Second fully connected layer with output 2 (for x and y coordinates)
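
Below is a minimal sketch of that architecture in PyTorch. The channel counts and kernel sizes follow the list above; since the images are loaded as grayscale, this sketch assumes a single input channel for the first conv, and the placement of ReLU + max pooling after each conv as well as the hidden width of the first fully connected layer are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=5)   # grayscale input assumed
        self.conv2 = nn.Conv2d(12, 20, kernel_size=5)
        self.conv3 = nn.Conv2d(20, 28, kernel_size=5)
        self.fc1 = nn.LazyLinear(128)                   # hidden width is a guess; LazyLinear infers the flattened size
        self.fc2 = nn.Linear(128, 2)                    # (x, y) of the nose tip

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```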

I trained the model over 20 epochs using an Adam optimizer with a learning rate of 0.001.
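
And here's a minimal sketch of the training loop under those settings, assuming `model`, `train_loader`, and `val_loader` exist as above and that MSE on the predicted (x, y) was the loss.

```python
import torch
import torch.nn as nn

model = NoseNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()   # loss function assumed to be MSE on the (x, y) prediction

train_losses, val_losses = [], []
for epoch in range(20):
    model.train()
    running = 0.0
    for imgs, pts in train_loader:          # train_loader assumed from the Dataset above
        optimizer.zero_grad()
        loss = criterion(model(imgs), pts)
        loss.backward()
        optimizer.step()
        running += loss.item()
    train_losses.append(running / len(train_loader))

    model.eval()
    with torch.no_grad():
        val = sum(criterion(model(i), p).item() for i, p in val_loader)
    val_losses.append(val / len(val_loader))
```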

Losses

Below are the training and validation losses plotted over the 20 epochs used to train the model.

Plotting results

Here are a few examples of where the CNN worked well and a few where it didn't. The accuracy of the CNN seemed closely tied to the angle/direction the face was oriented in: the model was more accurate on faces looking straight at the camera.

Good images

Bad images

Part 2: Full Facial Keypoints Detection

For this section, I went beyond just detecting the nose point and moved on to detecting all of the keypoints in the face images. This meant the dataloader returned 58 facial points per image instead of just 1.

DataLoader

For this section's dataloader, I applied data augmentation to improve the accuracy and robustness of the model. This entailed the following (sketched in code after the list):

  • Rescaling the image downwards
  • Translating it randomly
  • Changing its color randomly
  • Rotating it randomly
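
Here's a rough sketch of what one such augmentation pass could look like. The specific ranges (brightness factor, ±10 px shift, ±15° rotation) and the 160x120 output size are assumptions rather than the exact values used; the important part is that the keypoints are transformed together with the image.

```python
import random
import numpy as np
from skimage.transform import resize, rotate

def augment(img, pts):
    # img: H x W grayscale image with values in [0, 1]
    # pts: (N, 2) array of (x, y) keypoints in pixel coordinates
    h, w = img.shape

    # random brightness jitter (stand-in for the color change on grayscale images)
    img = np.clip(img * random.uniform(0.8, 1.2), 0.0, 1.0)

    # random translation: shift the image and the keypoints by the same amount
    dx, dy = random.randint(-10, 10), random.randint(-10, 10)
    img = np.roll(img, (dy, dx), axis=(0, 1))   # wrap-around at the borders is ignored here
    pts = pts + np.array([dx, dy])

    # random rotation about the image center, applied to image and keypoints alike
    angle = random.uniform(-15, 15)
    img = rotate(img, angle, mode="edge")
    c = np.array([w - 1, h - 1]) / 2.0
    theta = np.deg2rad(-angle)                  # skimage rotates CCW; in (x, y-down) pixel coords that is R(-angle)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    pts = (pts - c) @ R.T + c

    # finally rescale down to the network's input resolution (160 x 120 assumed)
    out_w, out_h = 160, 120
    pts = pts * np.array([out_w / w, out_h / h])
    img = resize(img, (out_h, out_w))
    return img, pts
```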

Sampled images from the dataloader visualized with ground-truth keypoints

CNN

For this section, I created my CNN architecture with the following layers (a sketch follows the list):

  1. Conv2D layer with 3 input and 12 output channels, 7x7 kernel, followed by ReLU + max pool
  2. Conv2D layer with 12 input and 20 output channels, 5x5 kernel, followed by ReLU + max pool
  3. Conv2D layer with 20 input and 28 output channels, 3x3 kernel, followed by ReLU + max pool
  4. Conv2D layer with 28 input and 30 output channels, 7x7 kernel, followed by ReLU + max pool
  5. Conv2D layer with 30 input and 32 output channels, 5x5 kernel, followed by ReLU + max pool
  6. Fully connected layer with output size 400
  7. Fully connected layer with output size 116 (to correspond to the 58 facial points)
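
Here's a hedged sketch of that architecture in PyTorch. The channel counts, kernel sizes, and the 400/116 fully connected widths follow the list; the single-channel grayscale input, the 'same' padding, and the use of LazyLinear to infer the flattened size are my assumptions.

```python
import torch
import torch.nn as nn

class FaceKeypointNet(nn.Module):
    def __init__(self):
        super().__init__()
        def block(c_in, c_out, k):
            # conv -> ReLU -> max pool, with 'same' padding assumed so the
            # feature maps only shrink at the pooling steps
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(
            block(1, 12, 7),
            block(12, 20, 5),
            block(20, 28, 3),
            block(28, 30, 7),
            block(30, 32, 5),
        )
        self.fc1 = nn.LazyLinear(400)
        self.fc2 = nn.Linear(400, 116)   # 58 keypoints x (x, y)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)
```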

For this model, I used an Adam optimizer with a learning rate of 0.01 and trained over 20 epochs.

Plotting the training and validation losses over time for the Part 2 model

Facial images inputted into trained network

For this part, I fed different original images into the network to see what prediction points it would output. From preliminary findings, faces oriented more directly toward the camera (not skewed or turned away) did better, likely because much of the training data was also centered around an average, straight-on face. Below are some of the images that did well and some that didn't.

Images that do well

Images that don't do as well

Visualize the learned filters

After training the model, I took a look at the filters from the model's layers themselves. Here's what some of the filters from the first layer look like:
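
As a sketch of how this can be done (assuming the FaceKeypointNet sketch above), the first convolution's weights can be pulled out of the model and shown with matplotlib:

```python
import matplotlib.pyplot as plt

# Grab the first conv layer's weights; under the sketch above this is
# model.features[0][0] with shape (12, 1, 7, 7).
weights = model.features[0][0].weight.detach().cpu().numpy()
fig, axes = plt.subplots(2, 6, figsize=(12, 4))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0], cmap="gray")   # each filter is a single-channel 7x7 kernel
    ax.axis("off")
plt.show()
```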

Part 3: Training with Larger Dataset

Creating Dataloader

For this section, I constructed a dataloader similar to the one used in Part 2, the main difference being the output size (224x224 instead of 120x160). Additionally, I used the bounding boxes provided with the images to crop the faces out of the photos before resizing.
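
A minimal sketch of that crop-and-resize step follows. The (left, top, width, height) bounding-box format and the helper's name are assumptions; the key point is that the keypoints are shifted and scaled into the cropped frame.

```python
import numpy as np
from skimage.transform import resize

def crop_and_resize(img, pts, bbox, out_size=224):
    # img: H x W image; pts: (N, 2) keypoints in pixel coords of the full image
    # bbox: (left, top, width, height) in pixels (assumed format)
    left, top, w, h = [int(v) for v in bbox]
    crop = img[top:top + h, left:left + w]
    crop = resize(crop, (out_size, out_size))
    # shift keypoints into the crop, then scale to the resized resolution
    pts = (pts - np.array([left, top])) * np.array([out_size / w, out_size / h])
    return crop, pts
```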

Constructing CNN

For this section, I constructed a CNN using the ResNet18 architecture, with two minor tweaks: the first layer takes in a single color channel, and the final layer outputs 136 values (for the 68 keypoints) rather than 1000 class scores. Additionally, I trained this model using an Adam optimizer with a learning rate of 0.001 over 10 epochs.
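
Here's roughly what those tweaks look like with torchvision's ResNet18. Whether the first conv keeps its original stride/padding and whether pretrained weights were used aren't stated, so those are assumptions here.

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18()   # no pretrained weights assumed
# 1 input channel instead of 3, keeping ResNet18's original conv settings
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# 136 outputs = 68 keypoints x (x, y), replacing the 1000-way classification head
model.fc = nn.Linear(model.fc.in_features, 136)
```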

Training model

Plotting losses

Below, I plotted the training and validation losses of the model over the 10 training epochs.

Applying model to test set

Kaggle

On Kaggle, I received an MAE of 15.78838.

Visualizing images in test set

Additionally, I visualized some of the different images from the test set and plotted the keypoints predicted onto them.

Testing on images from "my collection"

Below you'll find the model's predictions on various images pulled from the web. The model does noticeably better on images where the face is directly facing the camera than on images with slightly angled faces.