CS 194-26: Intro to Computer Vision and Computational Photography, Fall 2021

Project 5: Facial Keypoint Detection with Neural Networks

By Xinyun Cao


Overview

In this project, we trained convolutional neural networks to detect facial keypoints.

Part 1: Nose Tip Detection

In this part, we used a simple convolutional network to detect where the nose tip is in a picture.

Results

1) Ground Truth Keypoints:

face1
face2
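As context for how these image/keypoint pairs are fed to the network, here is a minimal sketch of a PyTorch dataset wrapper. The normalization to [-0.5, 0.5] and the data layout are assumptions; the writeup does not list the exact preprocessing.

import torch
from torch.utils.data import Dataset

class NoseTipDataset(Dataset):
    """Hypothetical dataset: pairs a grayscale face image with its nose tip (x, y)."""
    def __init__(self, images, nose_tips):
        # images: list of HxW arrays with values in [0, 255] (assumed);
        # nose_tips: list of (x, y) coordinates for each image
        self.images = images
        self.nose_tips = nose_tips

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = torch.as_tensor(self.images[idx], dtype=torch.float32)
        img = img / 255.0 - 0.5  # assumed normalization to [-0.5, 0.5]
        keypoint = torch.as_tensor(self.nose_tips[idx], dtype=torch.float32)
        return img.unsqueeze(0), keypoint  # add a channel dim for the 1-channel CNN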

2) Training and Validation Loss

I experimented with several different hyperparameters; the resulting loss curves are plotted below.

Version 1

For version 1, we used the following CNN structure:

Net(
  (conv1): Conv2d(1, 32, kernel_size=(7, 7), stride=(1, 1))
  (conv2): Conv2d(32, 24, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(24, 32, kernel_size=(3, 3), stride=(1, 1))
  (conv4): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=64, out_features=1000, bias=True)
  (fc2): Linear(in_features=1000, out_features=2, bias=True)
)

For hyperparameters, we used a learning rate of 1e-3, 25 epochs, and a batch size of 1.

pt1_plot1
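For reference, here is a minimal PyTorch sketch consistent with the version 1 printout above. The printout does not show activations or pooling, so the ReLU and 2x2 max-pool placement and the input size below are assumptions; nn.LazyLinear lets the flatten size be inferred at the first forward pass instead of hard-coding the 64 from the summary.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=7)
        self.conv2 = nn.Conv2d(32, 24, kernel_size=5)
        self.conv3 = nn.Conv2d(24, 32, kernel_size=3)
        self.conv4 = nn.Conv2d(32, 32, kernel_size=5)
        self.fc1 = nn.LazyLinear(1000)  # in_features inferred on first forward
        self.fc2 = nn.Linear(1000, 2)   # (x, y) of the nose tip

    def forward(self, x):
        # Assumed layout: ReLU + 2x2 max-pool after each conv layer
        for conv in (self.conv1, self.conv2, self.conv3, self.conv4):
            x = F.max_pool2d(F.relu(conv(x)), 2)
        x = torch.flatten(x, 1)
        return self.fc2(F.relu(self.fc1(x)))

net = Net()
print(net(torch.zeros(1, 1, 120, 160)).shape)  # torch.Size([1, 2]); input size is hypothetical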
Version 2

For version 2, we used the same CNN structure as version 1, changing only the learning rate.

For hyperparameters, we used a learning rate of 2e-3, 25 epochs, and a batch size of 1.

pt1_plot2
Version 3

For version 3, we used a slightly different CNN structure from versions 1 and 2:

Net(
  (conv1): Conv2d(1, 32, kernel_size=(7, 7), stride=(1, 1))
  (conv2): Conv2d(32, 24, kernel_size=(7, 7), stride=(1, 1))
  (conv3): Conv2d(24, 32, kernel_size=(5, 5), stride=(1, 1))
  (conv4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=64, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=2, bias=True)
)

For hyperparameters, we used a learning rate of 1e-3, 25 epochs, and a batch size of 1.

pt1_plot3
Version 4

For version 4, we used another CNN structure:

Net(
  (conv1): Conv2d(1, 32, kernel_size=(7, 7), stride=(1, 1))
  (conv2): Conv2d(32, 24, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(24, 24, kernel_size=(5, 5), stride=(1, 1))
  (conv4): Conv2d(24, 12, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=36, out_features=500, bias=True)
  (fc2): Linear(in_features=500, out_features=2, bias=True)
)

For hyperparameters, we used a learning rate of 1e-3, 25 epochs, and a batch size of 1.

pt1_plot4

In the parts that follow, I show results from the fourth version.

3) Training Results

Correct Results

The following are some pictures on which my model predicts the nose tip accurately. The green dot is the ground truth and the red dot is the prediction.

pt1_good1
pt1_good2
pt1_good3
Incorrect Results

The following are some pictures on which my model predicts incorrectly. The green dot is the ground truth and the red dot is the prediction. I think the model fails on these because the faces are turned too far to the side, and we don't have enough data of this kind to train on.

pt1_bad1
pt1_bad2

Part 2: Full Facial Keypoints Detection

In this part, we used a larger network than in Part 1 to train a model that predicts all 58 facial keypoints.

1) Sample image and ground truth

To augment the images, I applied random rotations between ±10 degrees and random shifts of up to ±10 pixels in both the x and y directions, transforming the keypoints accordingly. The resulting images with ground truth look like:

pt2_face1
Augmented face and ground truth
pt2_face2
Another augmented face and ground truth
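A sketch of this augmentation, assuming OpenCV for the warping; the function and parameter names are hypothetical.

import cv2
import numpy as np

def augment(img, keypoints, max_angle=10, max_shift=10):
    """Randomly rotate (within ±max_angle degrees) and shift (within ±max_shift
    pixels) an image, applying the same transform to its (N, 2) keypoint array."""
    h, w = img.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    dx, dy = np.random.uniform(-max_shift, max_shift, size=2)
    # 2x3 affine matrix: rotation about the image center, plus a translation
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    M[:, 2] += (dx, dy)
    out = cv2.warpAffine(img, M, (w, h))
    # Apply the identical affine transform to the keypoints
    pts = np.hstack([keypoints, np.ones((len(keypoints), 1))])  # homogeneous coords
    return out, pts @ M.T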

2) Detailed Architecture of the Model

The detailed model structure is shown as follows:

pt2_model
The model structure used in this part

The hyperparameters are a learning rate of 5e-5, 20 epochs, and a batch size of 1.
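Below is a minimal sketch of a training loop with these hyperparameters; the Adam optimizer and MSE loss are assumptions, as the writeup states only the learning rate, epoch count, and batch size.

import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=20, lr=5e-5):
    criterion = nn.MSELoss()  # regress all 58 (x, y) keypoints at once
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice assumed
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for imgs, keypoints in train_loader:
            optimizer.zero_grad()
            # Flatten (B, 58, 2) targets to (B, 116) to match the model output
            loss = criterion(model(imgs), keypoints.flatten(1))
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(i), k.flatten(1)).item()
                           for i, k in val_loader)
        print(f"epoch {epoch}: train {train_loss / len(train_loader):.4f}, "
              f"val {val_loss / len(val_loader):.4f}")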

3) Training and Validation Loss

The training and validation loss for this part are shown below.

pt2_plot

4) Training Results

Correct Results

The following are some pictures on which my model predicts the keypoints accurately. The green dots are the ground truth and the red dots are the predictions.

pt2_good1
pt2_good2
pt2_good3
Incorrect Results

The following are some pictures on which my model predicts incorrectly. The green dots are the ground truth and the red dots are the predictions. I think the model fails on these because the faces are either turned too far or show an expression uncommon in the training set (e.g., surprise), and we don't have enough data of this kind to train on.

pt2_bad1
pt2_bad2

5) Learned Filters

The following are the learned filters of the first convolutional layer:

pt2_filters
The 12 learned filters (indexed 0-11) of the first convolutional layer
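These filter grids can be produced with a short matplotlib helper like the following sketch, which assumes the trained model exposes its first conv layer as conv1:

import matplotlib.pyplot as plt

def show_filters(model, cols=6):
    # conv1 weights have shape (out_channels, in_channels=1, kH, kW)
    weights = model.conv1.weight.detach().cpu()
    rows = (len(weights) + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(2 * cols, 2 * rows), squeeze=False)
    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i < len(weights):
            ax.imshow(weights[i, 0], cmap="gray")
            ax.set_title(str(i))
    plt.show()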

Part 3: Train With Larger Dataset

In this part, we fine-tuned a pretrained ResNet18 on a much larger dataset using Google Colab.

1) Kaggle Submission

See Kaggle.

2) Detailed Structure of model

I took a pretrained ResNet18 and changed its first and last layers to fit our data. Specifically, I got the model by running:

import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet18
resnet18 = models.resnet18(pretrained=True)

# Replace the first conv layer so it accepts 1-channel (grayscale) input
resnet18.conv1 = nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)

# Replace the final FC layer so it regresses 68 keypoints x 2 coordinates
resnet18.fc = nn.Linear(in_features=512, out_features=68 * 2, bias=True)
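A quick sanity check of the modified network (the 224x224 input size here is a hypothetical choice, not stated in the writeup):

import torch

out = resnet18(torch.zeros(1, 1, 224, 224))
print(out.shape)  # torch.Size([1, 136]): 68 keypoints x 2 coordinates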

The resulting structure looks like:

pt3_structure1
pt3_structure2

I trained the model with a batch size of 128. I first trained with a learning rate of 5e-5 for 12 epochs, then with a learning rate of 2e-5 for 5 more epochs.
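That two-stage schedule could look like the following, reusing the hypothetical train() sketch from Part 2; train_dataset and val_dataset are placeholders.

from torch.utils.data import DataLoader

# Hypothetical loaders over the large dataset; the batch size is from the writeup
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128)

# Stage 1: 12 epochs at lr = 5e-5, then Stage 2: 5 more epochs at lr = 2e-5
train(resnet18, train_loader, val_loader, epochs=12, lr=5e-5)
train(resnet18, train_loader, val_loader, epochs=5, lr=2e-5)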

3) Training and Validation Loss

pt3_plot
Training and validation loss

4) Visualize Images in Testing Set

The following are some visualizations of the predictions on the testing set.

pt3_test1
pt3_test2
pt3_test3

5) Run Model on My Own Collection

I ran the model on some of my own photos, with the results shown below. The model works well on frontal faces without glasses, like image 1. It works less well on faces rotated more than about 10 degrees or turned past a certain threshold (for example, image 2). It also works less well when my facial expression is uncommon in the dataset (for example, image 3).

pt3_my1
pt3_my2
pt3_my3