CS 194-26 Project 4 - Andrew Loeza

Overview:

In this project, we first learned how to load and normalize our image sets and to build neural networks that can utilize this data using PyTorch. We then designed a neural network capable of detecting the nose tip on human faces. Next, we learned how to build a more sophisticated network that could perform simple facial keypoint detection, and how to augment our data with transformations to improve the learning process. And lastly, we used all of these techniques to process a larger image set and train a predesigned PyTorch neural network that can do much more robust facial keypoint detection.

Part 1: Nose Tip Detection:

In this section we imported the premarked Dane face images and designed a very simple five-to-six-layer convolutional neural network. The images were converted to greyscale, normalized, and rescaled to (80 x 60) prior to being used for training and validation.
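A minimal sketch of this preprocessing, assuming the images and their nose-tip annotations have already been parsed into Python lists (the class and argument names here are hypothetical, not the exact code used):

```python
import torch
from torch.utils.data import Dataset
import torchvision.transforms.functional as TF

class NoseDataset(Dataset):
    """Greyscale, resize to (80 x 60), and normalize each Dane face image."""
    def __init__(self, images, nose_points):
        self.images = images            # list of PIL images
        self.nose_points = nose_points  # list of (x, y) nose-tip coordinates

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = TF.to_grayscale(self.images[idx])
        img = TF.resize(img, (60, 80))   # (height, width) = (60, 80)
        img = TF.to_tensor(img) - 0.5    # scale to [0, 1], then center
        pt = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return img, pt
```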

Below you can see some of the preprocessed Dane face images, with the green dot indicating the true location of each nose.

Here you can see the graphed training and validation loss for our simple nose detection neural network. As can be seen, our validation loss ends up above our training loss, which is to be expected.

And finally, you can see the results of 4 images from our validation set being passed into our neural network. The first and fourth images are successes, with the network either predicting the exact location or being only a couple of pixels from the actual value. Unfortunately, though, the second and third images are failures. This is most likely because most of the images in our training dataset show people centered in the frame while staring directly at the camera. The high frequency of these images causes our neural network to overfit on them, making it less robust to variations in facial expression or tilting of the head. As we can see, the head tilt, curled lips, and slight squinting of the eyes in the second image cause the network to predict the lips instead. The reasoning for the third image's failure is similar.

Part 2: Full Facial Keypoints Detection:

In this section, we expanded on what we did in the first part, building a neural network with more convolutional layers that outputs 116 values, representing the 58 keypoints used to detect faces. Moreover, we used data augmentation to help prevent our network from overfitting on the training data and to mitigate the problems seen in the first part.

Below you can see the results of our data augmentation on the Dane image set. The images are randomly rotated $\pm$ 15 degrees, randomly shifted vertically and horizontally by 10 pixels, and their brightness and contrast are randomly jittered. Lastly, the images that the network processes have dimensions (160 x 120).
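A rough sketch of this augmentation, assuming keypoints are stored in pixel coordinates as an (N, 2) NumPy array of (x, y) pairs; the jitter strengths and the keypoint bookkeeping are illustrative, not the exact code used:

```python
import numpy as np
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def augment(img, pts):
    # Randomly jitter brightness and contrast (keypoints are unaffected).
    img = T.ColorJitter(brightness=0.3, contrast=0.3)(img)

    # Randomly rotate +/- 15 degrees about the image center, rotating the
    # keypoints with the image (image y grows downward, hence the sign flip).
    angle = np.random.uniform(-15, 15)
    img = TF.rotate(img, angle)
    c = np.array([img.width / 2, img.height / 2])
    a = np.deg2rad(angle)
    rot = np.array([[np.cos(a), np.sin(a)],
                    [-np.sin(a), np.cos(a)]])
    pts = (pts - c) @ rot.T + c

    # Randomly shift up to 10 pixels horizontally and vertically.
    dx, dy = np.random.randint(-10, 11, size=2)
    img = TF.affine(img, angle=0, translate=(int(dx), int(dy)),
                    scale=1.0, shear=0)
    pts = pts + np.array([dx, dy])
    return img, pts
```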

For the neural network's design, it begins with 6 convolutional layers, designed as follows, where the number to the left of the arrow is the number of input channels and the number to the right is the number of output channels:

1.  1 -> 16 (Padding = 3 and kernel size = (7x7))
2. 16 -> 32 (Padding = 2 and kernel size = (5x5))
3. 32 -> 64 (Padding = 1 and kernel size = (3x3))
4. 64 -> 64 (Padding = 2 and kernel size = (5x5))
5. 64 -> 32 (Padding = 1 and kernel size = (3x3))
6. 32 -> 16 (Padding = 2 and kernel size = (5x5))

Moreover, layers 1, 3, and 5 are followed by a max pool with kernel size = (2x2), and every layer is followed by a ReLU.

These are then followed by two fully connected layers, designed as follows, where the number to the left of the arrow is the number of input features and the number to the right is the number of output features:

1. 4800 -> 2400
2. 2400 -> 116

Moreover, layer 2 is followed by a ReLU. A PyTorch sketch of this architecture is shown below.
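This sketch follows the layer specification above; the placement of each ReLU before its max pool is my assumption, and the final ReLU sits after the second fully connected layer exactly as described:

```python
import torch.nn as nn
import torch.nn.functional as F

class FaceKeypointNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 16, 7, padding=3), nn.ReLU(), nn.MaxPool2d(2),   # 1
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(),                   # 2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 3
            nn.Conv2d(64, 64, 5, padding=2), nn.ReLU(),                   # 4
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 5
            nn.Conv2d(32, 16, 5, padding=2), nn.ReLU(),                   # 6
        )
        # With (160 x 120) inputs, three rounds of pooling leave a
        # 16-channel (20 x 15) feature map: 16 * 15 * 20 = 4800 features.
        self.fc1 = nn.Linear(4800, 2400)
        self.fc2 = nn.Linear(2400, 116)

    def forward(self, x):
        x = self.convs(x).flatten(1)
        return F.relu(self.fc2(self.fc1(x)))  # ReLU after layer 2, per above
```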

The hyperparameters I used to train the model are:

Learning Rate: 0.001
Batch Size   : 16
Epochs       : 30
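A minimal training loop under these hyperparameters, assuming the `FaceKeypointNet` sketch above; the writeup does not state the loss function or optimizer, so MSE loss and Adam are assumptions here, and `train_loader` is a hypothetical DataLoader with batch size 16:

```python
import torch

model = FaceKeypointNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.MSELoss()

for epoch in range(30):
    for imgs, keypoints in train_loader:
        optimizer.zero_grad()
        pred = model(imgs)  # (batch, 116) predicted coordinates
        loss = criterion(pred, keypoints.view(keypoints.size(0), -1))
        loss.backward()
        optimizer.step()
```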

Below you can see the training and validation loss for our simple facial keypoint detection network.

Below you can see the results of our simple facial keypoint detection network on the validation set images. As can be seen, even with the data augmentation, the network still struggles to deal with unusual facial expressions and with heads tilted at angles. The head tilting issue can be seen in the first image. Looking at the fourth image, though, it appears that the data augmentation has relieved some of these issues, since the network does a fairly good job of correctly detecting the facial keypoints despite the tilted head. The second image also fails, although the reason for this is a bit more nebulous since the face is clearly looking straight at the camera. The only explanation I can offer is that our data augmentation has made it difficult for the network to overfit to the large number of straight-on faces in our image set. So, the trade-off of being able to properly detect faces like the one in the fourth image is that our network will now sometimes underfit some of the faces looking straight at the camera, like in the second image. However, as can be seen in the third image, which the network identifies correctly, this does not happen all the time.

Lastly for this part, here are the filters, or weights, of the first convolutional layer in this simple facial keypoint detection network.
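One way to produce such a visualization, assuming the `FaceKeypointNet` sketch above, is to plot each of the 16 (7x7) kernels of the first convolutional layer as a small greyscale image:

```python
import matplotlib.pyplot as plt

weights = model.convs[0].weight.detach().cpu()  # shape (16, 1, 7, 7)
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0].numpy(), cmap='gray')  # one (7 x 7) kernel per panel
    ax.axis('off')
plt.show()
```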

Part 3: Train With Larger Dataset:

In this section, we processed the iBug image set and used a predefined PyTorch model called ResNet18 with two modifications: the first layer was modified to accept images with 1 input channel (i.e. greyscale images), and the last layer was modified so that its output size was 136, or 2 x 68, the number of landmark coordinates for each image (a sketch of these modifications appears after the hyperparameters below). In addition to this, the following hyperparameters were used to train the network:

Learning Rate: 0.01
Batch Size   : 64
Epochs       : 10
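Here is a sketch of the two ResNet18 modifications described above, using torchvision's model definition; all other layers are left at their defaults:

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18()
# 1 input channel instead of 3, for greyscale images.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# 136 outputs: (x, y) coordinates for each of the 68 landmark points.
model.fc = nn.Linear(model.fc.in_features, 136)
```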

Lastly, in the Kaggle competition, the ResNet18 model's predictions received a mean absolute error of 16.20860.

Below you can see some of the images from our training set along with their ground truth points in green. As can be seen, the images have had several augmentations applied to them, including random rotation, shifting, horizontal flipping, and brightness and contrast jittering.

Below you can see the training loss and validation loss for the ResNet18 network.

Below you can see a few example predictions the network made on the test set images. Overall, the results seem very good, except that the network sometimes oversizes the outline of the head and tends to struggle with getting the precise location of the lips correct.

Lastly, here are three images of some celebrities whose faces I decided to have the network detect. The results are all pretty good, although the network seems a bit picky about the angle of the face and tends to be confused by background objects and lighting. As long as the face being passed to it is the main focus of the image and is more-or-less looking in the direction of the camera, the network works pretty well. Some of these issues can be seen in the third image, of Elon Musk, where the locations of the eyes and lips are slightly off, most likely due to the angle his head is facing.