Feature Detection with Neural Networks

Introduction

In this project we examine the use of neural networks for facial feature detection. Since we are dealing with images, the base model for every task is a CNN. We will try out different models on different datasets and tasks, and compare their performance.

 

Nose Tip Detection

The first task is to predict the nose tip of a human face. The dataset used here is the IMM Face Database.

The original database annotates each face with 58 facial keypoints; the nose tip is point 53. Before training the network, we first extract the nose tip from each annotation and use it as the label y. Meanwhile, I reduced the image size from 480×640 to 60×80 and converted the images to grayscale to reduce computational cost.
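A minimal sketch of this preprocessing as a PyTorch Dataset is below; the directory layout and the .asf parsing details are illustrative assumptions based on the IMM annotation format.

```python
import glob

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


def parse_asf(path):
    # Minimal .asf parser: point rows have 7 whitespace-separated fields,
    # with relative x and y in the 3rd and 4th columns (an assumption
    # based on the IMM annotation format).
    pts = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 7:
                pts.append((float(fields[2]), float(fields[3])))
    return np.array(pts, dtype=np.float32)


class NoseTipDataset(Dataset):
    """Grayscale 60x80 images paired with the nose tip keypoint."""

    NOSE_IDX = 52  # point 53 in the 1-based numbering used above

    def __init__(self, root="imm_face_db"):  # directory name is an assumption
        self.image_paths = sorted(glob.glob(f"{root}/*.jpg"))

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, i):
        # Resize 480x640 -> 60x80 and convert to grayscale.
        img = Image.open(self.image_paths[i]).convert("L").resize((80, 60))
        x = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0 - 0.5)
        pts = parse_asf(self.image_paths[i].replace(".jpg", ".asf"))
        y = torch.from_numpy(pts[self.NOSE_IDX])  # relative (x, y) in [0, 1]
        return x.unsqueeze(0), y                  # x has shape (1, 60, 80)
```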

Post-processed images and their corresponding feature points are shown below:

Post-processed image 1

part1_visualize1

Post-processed image 2

part1_visualize2

For training, I used 3 convolutional layers (the first with kernel size 3, the second and third with kernel size 5) followed by 2 fully connected layers. The convolutional layers have 4, 8, and 12 channels respectively. I previously tried wider layers with 20-25 channels, but the results were suboptimal. This is probably because we only have 192 training images, too few to fully train a large convolutional network.
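A minimal PyTorch sketch of this architecture follows; only the kernel sizes and channel counts are fixed above, so the pooling placement and fully connected widths here are illustrative assumptions.

```python
import torch.nn as nn


class NoseTipNet(nn.Module):
    """3 conv layers (4, 8, 12 channels; kernels 3, 5, 5) + 2 FC layers.

    Pooling placement and FC widths are assumptions for illustration.
    Input: (N, 1, 60, 80) grayscale images.
    """

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),   # -> 4 x 29 x 39
            nn.Conv2d(4, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # -> 8 x 12 x 17
            nn.Conv2d(8, 12, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # -> 12 x 4 x 6
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(12 * 4 * 6, 128), nn.ReLU(),
            nn.Linear(128, 2),  # predicted (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))
```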

I trained the network for 100 epochs, once with a learning rate of 5×10⁻⁴ and once with 4×10⁻³. Training and validation loss for both learning rates are plotted below. We can see that the larger learning rate induces more violent fluctuations in the validation loss, while the model trained with the smaller learning rate has milder "bumps". The effect of the large learning rate also shows in the validation results: a learning rate of 4×10⁻³ produces more unstable predictions.
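A minimal sketch of such a training loop is below; the Adam optimizer and MSE loss are illustrative choices, not pinned down above.

```python
import torch
import torch.nn as nn


def train(model, train_loader, val_loader, lr=5e-4, epochs=100, device="cpu"):
    """Train and validate, recording per-epoch average losses."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer is an assumption
    loss_fn = nn.MSELoss()
    train_hist, val_hist = [], []
    for epoch in range(epochs):
        model.train()
        total = 0.0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            total += loss.item() * x.size(0)
        train_hist.append(total / len(train_loader.dataset))

        model.eval()
        total = 0.0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                total += loss_fn(model(x), y).item() * x.size(0)
        val_hist.append(total / len(val_loader.dataset))
    return train_hist, val_hist
```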

Loss curve with small learning rate

part1_loss

Loss curve with large learning rate

part1_loss_4e-3

Below are some validation results. Some predict the nose tip quite accurately, while others deviate from the ground truth.

Good result 1

part1_good1

Good result 2

part1_good2

Bad result 1

part1_bad1

Bad result 2

part1_bad2

While the underlying cause of some deviations remains open, I noticed that the model performs worse on female faces. This is probably because female images make up a smaller portion of the dataset. Moreover, female faces differ from the more common male faces in hair style and facial structure. Since the model mostly sees male inputs, it performs suboptimally on the less common female images.

 

Full Facial Feature Prediction

In the second part, we predict the full set of facial feature points. Since the prediction target is more complex, we use a deeper neural network. To make sure this deeper network can be fully trained, we also perform data augmentation on the original images, rotating and shifting them to enlarge the dataset.

Here are a few images sampled from the augmented dataset, along with the ground truth key points.

Original image

part2_original

Rotated image

part2_rotate

Vertically shifted image

part2_vertical

Horizontally shifted image

part2_horizontal
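The key detail of this augmentation is that the keypoints must undergo the same transform as the pixels. Below is a minimal sketch of a rotation-and-shift augmentation; the angle and shift ranges are illustrative assumptions.

```python
import numpy as np
from PIL import Image


def augment(img, pts, max_angle=12, max_shift=10):
    """Randomly rotate and shift a PIL image together with its keypoints.

    pts: (K, 2) array of pixel-space (x, y) keypoints.
    The ranges max_angle (degrees) and max_shift (pixels) are assumptions.
    """
    angle = np.random.uniform(-max_angle, max_angle)
    dx, dy = np.random.randint(-max_shift, max_shift + 1, size=2)

    # PIL rotates counter-clockwise about the image center; then shift
    # the content by (dx, dy) with an affine transform.
    out = img.rotate(angle).transform(
        img.size, Image.AFFINE, (1, 0, -dx, 0, 1, -dy))

    # Apply the same transform to the keypoints. In pixel coordinates
    # (y pointing down), a counter-clockwise rotation by theta about the
    # center (cx, cy) is the matrix below.
    cx, cy = img.size[0] / 2, img.size[1] / 2
    theta = np.deg2rad(angle)
    rot = np.array([[np.cos(theta), np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])
    new_pts = (pts - [cx, cy]) @ rot.T + [cx, cy] + [dx, dy]
    return out, new_pts
```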

The neural network consists of 5 convolutional layers and 3 fully connected layers, described below and summarized in a code sketch after the description.

The first convolutional layer has 12 output channels, kernel size 5 with padding 2 in both dimensions, and stride 1. It is followed by a ReLU activation.

The second convolutional layer has 24 output channels, kernel size 5 with padding 2 in both dimensions, and stride 1. It is followed by a ReLU activation.

The third convolutional layer has 32 output channels, kernel size 5 with padding 2 in both dimensions, and stride 1. It is followed by a ReLU activation.

The fourth convolutional layer has 64 output channels, kernel size 5 with padding 2 in both dimensions, and stride 1. It is followed by a ReLU activation and then an average pooling layer with kernel size 4.

The fifth convolutional layer has 128 output channels, kernel size 5 with padding 2 in both dimensions, and stride 1. Its output is batch-normalized, then passed through a ReLU activation and an average pooling layer with kernel size 4.

The output tensor from the convolutional layers is then flattened into a vector of dimension 8960.

The first fully connected layer shrinks the vector from 8960 to 2640 dimensions. It is followed by a batch-norm layer and a ReLU activation.

The second fully connected layer further shrinks the vector from 2640 to 512 dimensions. It is followed by a batch-norm layer and a ReLU activation.

The third fully connected layer produces a 116-dimensional vector containing the 58 flattened (x, y) coordinates.
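Putting the layer-by-layer description together, here is a minimal PyTorch sketch of the architecture. The input resolution is an assumption: 8960 = 128×7×10, which works out if the grayscale input is 112×160, since each average pooling layer with kernel size 4 divides the spatial size by 4.

```python
import torch.nn as nn


class FaceNet(nn.Module):
    """5 conv + 3 FC layers as described above.

    Assumes a (N, 1, 112, 160) input so the flattened size is
    128 * 7 * 10 = 8960, matching the dimensions in the text.
    """

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, 5, stride=1, padding=2), nn.ReLU(),
            nn.Conv2d(12, 24, 5, stride=1, padding=2), nn.ReLU(),
            nn.Conv2d(24, 32, 5, stride=1, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=1, padding=2), nn.ReLU(),
            nn.AvgPool2d(4),                             # 112x160 -> 28x40
            nn.Conv2d(64, 128, 5, stride=1, padding=2),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.AvgPool2d(4),                             # 28x40 -> 7x10
        )
        self.head = nn.Sequential(
            nn.Flatten(),                                # 128*7*10 = 8960
            nn.Linear(8960, 2640), nn.BatchNorm1d(2640), nn.ReLU(),
            nn.Linear(2640, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, 116),  # 58 (x, y) pairs, flattened
        )

    def forward(self, x):
        return self.head(self.features(x))
```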

I trained the network for 200 epochs with a learning rate of 4×10⁻³. The training and validation loss curves are shown below:

Loss curve

part2_loss

Here are a few validation images where the network's predictions align well with the ground truth, and a few where they fail to:

Good result 1

part2_good1

Good result 2

part2_good2

Bad result 1

part2_bad1

Bad result 2

part2_bad2

Again we can see that the network fails to predict well on female images. This is probably because female images make up only a small part of the dataset, making it hard for the network to learn the general structure of female faces.

I also plotted the filters of the convolutional layers for visualization:

First layer

part2_conv1

Second layer

part2_conv2

Third layer

part2_conv3

Fourth layer

part2_conv4

Fifth layer

part2_conv5

 

Training with a Larger Dataset

In this part we train on a larger dataset with 6666 images. Since there are sufficient images, I use a much deeper neural network than in Part 2. In particular, I adapted ResNet18 for the prediction task.

The only changes I made to ResNet18 to fit my dataset were to the first convolutional layer, from 3→64 channels to 1→64 channels (since the inputs are grayscale), and to the last fully connected layer, from 512→1000 to 512→136 (68 flattened (x, y) coordinates).
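With torchvision this adaptation takes only a couple of lines, sketched below; the 224×224 input size in the sanity check is an illustrative assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Start from an untrained ResNet18; see below for why the pretrained
# weights were not used.
model = resnet18(weights=None)

# Grayscale input: 1 input channel instead of 3, keeping the stock
# 7x7 / stride-2 layout of the first layer.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# 68 keypoints flattened to 136 outputs instead of 1000 ImageNet classes.
model.fc = nn.Linear(512, 136)

# Sanity check: a batch of grayscale crops yields (N, 136) predictions.
out = model(torch.zeros(4, 1, 224, 224))
print(out.shape)  # torch.Size([4, 136])
```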

I did not use the pretrained model. The original task for ResNet18 is classifying images into 1000 categories, so the pretrained weights are tuned to detect general structures shared within a category. Since all inputs to our model are human faces, the pretrained model would likely output a similar set of feature points for every face, as they should all be classified as "human". My experiments supported this hypothesis: with the pretrained network, the training loss drops drastically by the second epoch but stays flat afterwards, suggesting the weights converge to a local minimum corresponding to the "general structure", i.e. an average face.

I trained the model from scratch for 30 epochs. The training and validation loss curves are plotted below:

Loss curve

part3_loss

In addition to the training, validation, and test sets, I also tried the model on my own images.

Self image 1

part3_selfim1

Self image 2

part3_selfim2

Self image 3

part3_selfim3

We can see that the model is good at predicting faces in a frontal view. This is probably because frontal faces make up most of the dataset, so the model regards them as the "most general case".

My Kaggle score for this part is 11.18744. Here are a few sampled test set predictions:

Test image 1

part3_test1

Test image 2

part3_test2