CS-194 Project 4

By: Calvin Chen

Part 1: Nose Tip Detection

For this section, I used the IMM Face database to conduct nose tip detection for the different faces in the dataset.

Dataloader

For the Dataloader in Part 1, I first converted the images to grayscale, normalized the pixel values to be centered around 0, and then downsampled the images to 80 x 60.
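
Here's a rough sketch of that preprocessing as a PyTorch Dataset. The class name, the skimage-based loading, and the assumption that the nose point arrives as an (x, y) pixel coordinate in the original image are all mine; only the grayscale / center-around-0 / 80x60 pipeline comes from the write-up above.

```python
import numpy as np
import torch
from torch.utils.data import Dataset
import skimage.io as skio
import skimage.transform as sktr

class NoseTipDataset(Dataset):
    # Hypothetical dataset wrapper: image paths and nose points are assumed to
    # be parsed from the IMM annotation files ahead of time.
    def __init__(self, image_paths, nose_points):
        self.image_paths = image_paths
        self.nose_points = nose_points      # one (x, y) pixel coordinate per image

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = skio.imread(self.image_paths[idx], as_gray=True).astype(np.float32)
        h, w = img.shape
        img = img - 0.5                                   # as_gray gives [0, 1], so this centers around 0
        img = sktr.resize(img, (60, 80))                  # 80 x 60 (width x height)
        img = torch.from_numpy(img).unsqueeze(0).float()  # add channel dim -> 1 x 60 x 80
        # scale the nose point from original pixel coords into the resized frame
        x, y = self.nose_points[idx]
        pt = torch.tensor([x * 80.0 / w, y * 60.0 / h], dtype=torch.float32)
        return img, pt
```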

Sampled images from dataloader with ground-truth keypoints

Here are some images from the dataloader with their corresponding ground-truth keypoints on their noses.

CNN

Here, I defined the CNN architecture used for Part 1 as follows (a rough sketch in PyTorch appears after the list):

  1. Conv2D layer with 3 input channels and 12 output channels, kernel size 5
  2. Conv2D layer with 12 input channels and 20 output channels, kernel size 5
  3. Conv2D layer with 20 input channels and 28 output channels, kernel size 5
  4. Fully connected layer with ReLU non-linearity and maxpooling
  5. Second fully connected layer with output 2 (for x and y coordinates)
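
Below is a minimal sketch of that architecture in PyTorch. The channel counts and kernel sizes follow the list above; since the images are loaded as grayscale, this sketch assumes a single input channel for the first conv, and the placement of ReLU + max pooling after each conv as well as the hidden width of the first fully connected layer are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=5)   # grayscale input assumed
        self.conv2 = nn.Conv2d(12, 20, kernel_size=5)
        self.conv3 = nn.Conv2d(20, 28, kernel_size=5)
        self.fc1 = nn.LazyLinear(128)                   # hidden width is a guess; LazyLinear infers the flattened size
        self.fc2 = nn.Linear(128, 2)                    # (x, y) of the nose tip

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```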

I trained the model over 20 epochs using an Adam optimizer with a learning rate of 0.001.
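
And here's a minimal sketch of the training loop under those settings, assuming `model`, `train_loader`, and `val_loader` exist as above and that MSE on the predicted (x, y) was the loss.

```python
import torch
import torch.nn as nn

model = NoseNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()   # loss function assumed to be MSE on the (x, y) prediction

train_losses, val_losses = [], []
for epoch in range(20):
    model.train()
    running = 0.0
    for imgs, pts in train_loader:          # train_loader assumed from the Dataset above
        optimizer.zero_grad()
        loss = criterion(model(imgs), pts)
        loss.backward()
        optimizer.step()
        running += loss.item()
    train_losses.append(running / len(train_loader))

    model.eval()
    with torch.no_grad():
        val = sum(criterion(model(i), p).item() for i, p in val_loader)
    val_losses.append(val / len(val_loader))
```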

Losses

Below are the training and validation losses plotted over the 20 epochs used to train the model.

Plotting results

Here are a few examples of where the CNN worked well and a few where it didn't. The accuracy of the CNN seemed closely tied to the angle/direction the face was oriented in: the model was more accurate on faces looking straight at the camera.

Good images

Bad images

Part 2: Full Facial Keypoints Detection

For this section, I went beyond just detecting the nose point and moved on to detecting all of the keypoints in the face images. This meant the dataloader returned 58 facial points per image instead of just 1.

DataLoader

For this section's dataloader, I applied data augmentation to improve the accuracy and robustness of the model. This entailed the following (sketched in code after the list):

  • Rescaling the image downwards
  • Translating it randomly
  • Changing its color randomly
  • Rotating it randomly
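
Here's a rough sketch of what one such augmentation pass could look like. The specific ranges (brightness factor, ±10 px shift, ±15° rotation) and the 160x120 output size are assumptions rather than the exact values used; the important part is that the keypoints are transformed together with the image.

```python
import random
import numpy as np
from skimage.transform import resize, rotate

def augment(img, pts):
    # img: H x W grayscale image with values in [0, 1]
    # pts: (N, 2) array of (x, y) keypoints in pixel coordinates
    h, w = img.shape

    # random brightness jitter (stand-in for the color change on grayscale images)
    img = np.clip(img * random.uniform(0.8, 1.2), 0.0, 1.0)

    # random translation: shift the image and the keypoints by the same amount
    dx, dy = random.randint(-10, 10), random.randint(-10, 10)
    img = np.roll(img, (dy, dx), axis=(0, 1))   # wrap-around at the borders is ignored here
    pts = pts + np.array([dx, dy])

    # random rotation about the image center, applied to image and keypoints alike
    angle = random.uniform(-15, 15)
    img = rotate(img, angle, mode="edge")
    c = np.array([w - 1, h - 1]) / 2.0
    theta = np.deg2rad(-angle)                  # skimage rotates CCW; in (x, y-down) pixel coords that is R(-angle)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    pts = (pts - c) @ R.T + c

    # finally rescale down to the network's input resolution (160 x 120 assumed)
    out_w, out_h = 160, 120
    pts = pts * np.array([out_w / w, out_h / h])
    img = resize(img, (out_h, out_w))
    return img, pts
```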

Sampled images from the dataloader visualized with ground-truth keypoints

CNN

For this section, I created my CNN architecture with the following layers (a sketch follows the list):

  1. Conv2D layer with 3 input and 12 output channels, 7x7 kernel, followed by ReLU + max pool
  2. Conv2D layer with 12 input and 20 output channels, 5x5 kernel, followed by ReLU + max pool
  3. Conv2D layer with 20 input and 28 output channels, 3x3 kernel, followed by ReLU + max pool
  4. Conv2D layer with 28 input and 30 output channels, 7x7 kernel, followed by ReLU + max pool
  5. Conv2D layer with 30 input and 32 output channels, 5x5 kernel, followed by ReLU + max pool
  6. Fully connected layer with output size 400
  7. Fully connected layer with output size 116 (to correspond to the 58 facial points)
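
Here's a hedged sketch of that architecture in PyTorch. The channel counts, kernel sizes, and the 400/116 fully connected widths follow the list; the single-channel grayscale input, the 'same' padding, and the use of LazyLinear to infer the flattened size are my assumptions.

```python
import torch
import torch.nn as nn

class FaceKeypointNet(nn.Module):
    def __init__(self):
        super().__init__()
        def block(c_in, c_out, k):
            # conv -> ReLU -> max pool, with 'same' padding assumed so the
            # feature maps only shrink at the pooling steps
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(
            block(1, 12, 7),
            block(12, 20, 5),
            block(20, 28, 3),
            block(28, 30, 7),
            block(30, 32, 5),
        )
        self.fc1 = nn.LazyLinear(400)
        self.fc2 = nn.Linear(400, 116)   # 58 keypoints x (x, y)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)
```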

For this model, I used an Adam optimizer with a learning rate of 0.01 and trained over 20 epochs.

Plotting the training and validation losses over time for the Part 2 model

Facial images inputted into trained network

For this part, I fed different original images into the network to see what prediction points it would output. From preliminary findings, faces oriented more directly toward the camera (not skewed or turned away) did better, likely because much of the training data was also centered around an average, straight-on face. Below are some of the images that did well and some that didn't.

Images that do well

Images that don't do as well

Visualize the learned filters

After training the model, I took a look at the filters from the model's layers themselves. Here's what some of the filters from the first layer look like:
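
As a sketch of how this can be done (assuming the FaceKeypointNet sketch above), the first convolution's weights can be pulled out of the model and shown with matplotlib:

```python
import matplotlib.pyplot as plt

# Grab the first conv layer's weights; under the sketch above this is
# model.features[0][0] with shape (12, 1, 7, 7).
weights = model.features[0][0].weight.detach().cpu().numpy()
fig, axes = plt.subplots(2, 6, figsize=(12, 4))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0], cmap="gray")   # each filter is a single-channel 7x7 kernel
    ax.axis("off")
plt.show()
```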

Part 3: Training with Larger Dataset

Creating Dataloader

For this section, I constructed a dataloader similar to the one used in Part 2, the main difference being the output size (224x224 instead of 120x160). Additionally, I used the bounding boxes provided with the images to crop the faces out of the photos before resizing.
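
A minimal sketch of that crop-and-resize step follows. The (left, top, width, height) bounding-box format and the helper's name are assumptions; the key point is that the keypoints are shifted and scaled into the cropped frame.

```python
import numpy as np
from skimage.transform import resize

def crop_and_resize(img, pts, bbox, out_size=224):
    # img: H x W image; pts: (N, 2) keypoints in pixel coords of the full image
    # bbox: (left, top, width, height) in pixels (assumed format)
    left, top, w, h = [int(v) for v in bbox]
    crop = img[top:top + h, left:left + w]
    crop = resize(crop, (out_size, out_size))
    # shift keypoints into the crop, then scale to the resized resolution
    pts = (pts - np.array([left, top])) * np.array([out_size / w, out_size / h])
    return crop, pts
```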

Constructing CNN

For this section, I constructed a CNN using the ResNet18 architecture, with two minor tweaks: the first layer takes in a single color channel, and the final layer outputs 136 values (for the 68 keypoints) rather than 1000 class scores. Additionally, I trained this model using an Adam optimizer with a learning rate of 0.001 over 10 epochs.
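
Here's roughly what those tweaks look like with torchvision's ResNet18. Whether the first conv keeps its original stride/padding and whether pretrained weights were used aren't stated, so those are assumptions here.

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18()   # no pretrained weights assumed
# 1 input channel instead of 3, keeping ResNet18's original conv settings
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# 136 outputs = 68 keypoints x (x, y), replacing the 1000-way classification head
model.fc = nn.Linear(model.fc.in_features, 136)
```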

Training model

Plotting losses

Below, I plotted the training and validation losses of the model over the 10 training epochs.

Applying model to test set

Kaggle

On Kaggle, I received an MAE of 15.78838.

Visualizing images in test set

Additionally, I visualized some of the different images from the test set and plotted the keypoints predicted onto them.

Testing on images from "my collection"

Below you'll find the model's predictions on various images pulled from the web. The model does noticeably better on images where the face is directly facing the camera than on images with slightly angled faces.