In this project we explore the use of neural networks for facial feature detection. Since we are dealing with images, the base model for every task is a CNN; we will try out different architectures and compare their performance across datasets and tasks.
The first task is to predict the nosetip of a human face. The dataset used here is the IMM Face Database.
The original database annotates each face with 58 facial feature points; the nosetip is point 53. Before training the network, we first extract the nosetip from each annotated image and use it as the training label.
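Assuming the annotation file has already been parsed into an array of 58 (x, y) points, the extraction step amounts to picking out a single index. A minimal sketch (the normalized coordinate convention and the dummy data are assumptions of mine):

```python
import numpy as np

def extract_nosetip(points):
    """Pick the nosetip landmark out of the 58 IMM annotation points.

    `points` is assumed to be a (58, 2) array of (x, y) coordinates,
    normalized to [0, 1] relative to the image size. The nosetip is
    point 53, i.e. zero-based index 52.
    """
    points = np.asarray(points)
    assert points.shape == (58, 2)
    return points[52]

# Hypothetical usage with 58 dummy landmarks
landmarks = np.linspace(0.0, 1.0, 116).reshape(58, 2)
nose = extract_nosetip(landmarks)
```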
Post-processed images and their corresponding feature points are shown below:
Post-processed image 1
Post-processed image 2
For training, I used 3 convolutional layers, the first with kernel size 3 and the second and third with kernel size 5, followed by 2 fully connected layers. The convolutional layers have 4, 8, and 12 channels respectively. I previously tried wider layers, with channel counts around 20 to 25, but the results were worse. This is probably because we only have 192 training images, which is not enough to fully train a large convolutional network.
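A sketch of this architecture in PyTorch. The report only specifies the channel and kernel counts, so the 60x80 grayscale input size, the max-pooling after each conv, and the hidden width of the first fully connected layer are all assumptions of mine:

```python
import torch
import torch.nn as nn

class NoseTipNet(nn.Module):
    """3 conv layers (channels 4, 8, 12; kernels 3, 5, 5) + 2 FC layers.

    The (1, 60, 80) input size and the MaxPool2d after each conv are
    assumptions; only channels and kernel sizes come from the report.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(4, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 12, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(12 * 4 * 6, 128), nn.ReLU(),  # 4x6 spatial map after 3 pools
            nn.Linear(128, 2),                      # (x, y) of the nosetip
        )

    def forward(self, x):
        return self.regressor(self.features(x))

out = NoseTipNet()(torch.zeros(1, 1, 60, 80))
```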
I trained the network for 100 epochs, first with a small learning rate and then with a larger one; the loss curves for both runs are shown below:
Loss curve with small learning rate
Loss curve with large learning rate
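The training procedure can be sketched as below. The MSE loss, the Adam optimizer, and the dummy data standing in for the 192-image training set are all assumptions of mine, since the report does not name them:

```python
import torch
import torch.nn as nn

def train(model, loader, lr, epochs):
    """Minimal training-loop sketch; loss and optimizer are assumed."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    losses = []
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
    return losses

# Dummy regression data in place of real images and nosetip labels
x = torch.randn(8, 16)
y = torch.randn(8, 2)
loader = [(x, y)]
model = nn.Linear(16, 2)
small_lr_losses = train(model, loader, lr=1e-4, epochs=5)
```

Rerunning `train` with a larger `lr` reproduces the two-curve comparison above.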
Below are some validation results. Some of them predict the nosetip pretty nicely, while others deviate from the ground truth.
Good result 1
Good result 2
Bad result 1
Bad result 2
While the underlying reasons for some of these deviations remain open, I noticed that the model is worse at predicting female faces. This is probably because female images make up a smaller portion of the dataset. Moreover, female faces differ from the more common male faces in hair style and facial structure, so a model accustomed to male inputs performs worse on the less common female ones.
In the second part, we predict the full set of facial feature points. Since the prediction target is more complicated, we use a deeper neural network. To make sure this deeper network can be trained properly, we also perform data augmentation, rotating and shifting the original images to enlarge the dataset.
Here are a few images sampled from the augmented dataset, along with the ground truth key points.
Original image
Rotated image
Vertically shifted image
Horizontally shifted image
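A sketch of the shift augmentation; the key detail is that the keypoints must move together with the pixels. The zero-filled borders are an assumption of mine (the report does not say how vacated pixels are handled), and rotation works the same way with a rotation matrix applied to both image and points:

```python
import numpy as np

def shift_augment(image, points, dx, dy):
    """Shift an image by (dx, dy) pixels and move its keypoints with it.

    `image` is (H, W); `points` is (N, 2) in (x, y) pixel coordinates.
    Vacated border pixels are zero-filled (an assumption).
    """
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Source and destination windows for the overlapping region
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted, points + np.array([dx, dy])

# Tiny hypothetical example: shift one pixel to the right
img = np.arange(12.0).reshape(3, 4)
pts = np.array([[1.0, 1.0]])
aug_img, aug_pts = shift_augment(img, pts, dx=1, dy=0)
```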
The neural network consists of 5 convolutional layers and 3 fully connected layers.
The first convolutional layer has 12 channels, kernel size 5 with padding 2 in both directions, and stride 1, followed by a ReLU.
The second convolutional layer has 24 channels, kernel size 5 with padding 2 in both directions, and stride 1, followed by a ReLU.
The third convolutional layer has 32 channels, kernel size 5 with padding 2 in both directions, and stride 1, followed by a ReLU.
The fourth convolutional layer has 64 channels, kernel size 5 with padding 2 in both directions, and stride 1. It is followed by a ReLU and then an average pooling layer with kernel size 4.
The fifth convolutional layer has 128 channels, kernel size 5 with padding 2 in both directions, and stride 1. Its output is batch-normalized, then passed through a ReLU and an average pooling layer with kernel size 4.
The output tensor of the convolutional layers is then flattened into a vector of dimension 8960.
The first fully connected layer shrinks the vector from 8960 to 2640 dimensions. It is followed by a batchnorm layer and a ReLU.
The second fully connected layer further shrinks the vector from 2640 to 512 dimensions. It is followed by a batchnorm layer and a ReLU.
The third fully connected layer produces a 116-dimensional vector: the 58 feature points flattened into (x, y) coordinates.
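The layer list above can be sketched as a PyTorch module. The (1, 120, 160) input size is an assumption of mine, chosen because it is the size that makes the flattened feature vector come out to the stated 8960 dimensions (128 channels x 7 x 10 after the two average pools):

```python
import torch
import torch.nn as nn

class FaceNet(nn.Module):
    """5 conv layers + 3 FC layers, as described in the text.

    Input size (1, 120, 160) is assumed; it yields the stated
    8960-dimensional flattened vector (128 * 7 * 10).
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, 5, stride=1, padding=2), nn.ReLU(),
            nn.Conv2d(12, 24, 5, stride=1, padding=2), nn.ReLU(),
            nn.Conv2d(24, 32, 5, stride=1, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=1, padding=2), nn.ReLU(),
            nn.AvgPool2d(4),
            nn.Conv2d(64, 128, 5, stride=1, padding=2),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.AvgPool2d(4),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),                                            # 8960
            nn.Linear(8960, 2640), nn.BatchNorm1d(2640), nn.ReLU(),
            nn.Linear(2640, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, 116),                       # 58 points * 2
        )

    def forward(self, x):
        return self.regressor(self.features(x))

out = FaceNet().eval()(torch.zeros(2, 1, 120, 160))
```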
I trained the network for 200 epochs with a fixed learning rate. The loss curve is shown below:
Loss curve
Here are a few validation images that the network detected well, and some that the network failed to align with ground truth:
Good result 1
Good result 2
Bad result 1
Bad result 2
Again we can see that the network predicts poorly on female images. This is probably because female images make up only a small part of the dataset, making it hard for the neural network to learn the general structure of female faces.
I also plotted the filters of convolutional layers for visualization:
First layer
Second layer
Third layer
Fourth layer
Fifth layer
In this part we train on a larger dataset of 6666 images. Since there are now sufficient images, I use a much deeper neural network than in part 2. In particular, I adapted ResNet18 for the prediction.
The only change I made to ResNet18 to fit my dataset is to adjust its first convolutional layer.
I did not use the pretrained model. ResNet18's original task is to classify images into 1000 categories, so a pretrained model tends to detect the general image structure shared by each category. Since all inputs in our task are human faces, the original model is likely to output a similar set of feature points for every face, since they would all be classified as "human". My experiments support this hypothesis: with the pretrained network, the training loss drops drastically by the second epoch but stays flat afterwards. This suggests the model weights converge to a local minimum corresponding to that "general structure", i.e. the average face.
I trained the model from scratch for 30 epochs. The training and validation loss curves are plotted below:
Loss curve
In addition to the training, validation, and test sets, I also tried the model on my own images.
Self image 1
Self image 2
Self image 3
We can see that the model is good at predicting faces in frontal view. This is probably because frontal faces make up most of the dataset, so the model treats them as the "most general case".
My Kaggle score for this part is 11.18744. Here are a few sampled test set predictions:
Test image 1
Test image 2