CS 194-26 Project 3: Facial Keypoint Detection with Neural Networks

Xidong Wu

Overview

In this project, I use neural networks to detect facial keypoints automatically, instead of annotating them by hand as in the previous project. PyTorch is chosen as the deep learning framework.

Part 1: Nose Tip Detection

In the first part, I use the IMM Face Database to train a model for nose tip detection. The database consists of 240 facial images: 40 persons, each photographed from 6 different viewpoints. Every image is annotated with 58 facial keypoints. I use the first 32 persons (32 * 6 = 192 images) as the training set and the remaining 8 persons (8 * 6 = 48 images) as the validation set; a data-loading sketch follows the sample images below.

Nose tip points sample

Nose tip points sample

Nose tip points sample
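The split could be set up roughly as follows. This is a minimal sketch: the load_imm helper is a hypothetical placeholder for the actual IMM image/annotation parsing, and only the person-wise split and the loader settings come from the text above.

import torch
from torch.utils.data import Dataset, DataLoader

def load_imm(person, view):
    # Hypothetical helper: read one image and its 58 (x, y) keypoints here.
    raise NotImplementedError

class IMMDataset(Dataset):
    # 6 images per person; each sample is (grayscale image, 58x2 keypoints)
    def __init__(self, person_ids):
        self.samples = [load_imm(p, v) for p in person_ids for v in range(6)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

train_set = IMMDataset(range(1, 33))   # first 32 persons -> 192 images
val_set = IMMDataset(range(33, 41))    # remaining 8 persons -> 48 images
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8, shuffle=False)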

After reading each image [1], I convert it to grayscale and normalize the pixel values to [-0.5, 0.5]. All images are then resized to 80x60. As the regression target I use the keypoint at index -6 of the annotation list, which corresponds to the nose tip. The CNN has 3 convolutional layers with channel sizes [1, 12, 12, 16] and 5 * 5 kernels; each convolutional layer is followed by a ReLU and a max pool of size 2. Three fully connected layers of sizes [120, 84, 2] follow. The learning rate is 1e-4 [2]. The training loader uses batch_size = 32 with shuffle = True, while the validation loader uses batch_size = 8 with shuffle = False. The model information is listed below, followed by a code sketch.

Model information
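As a rough sketch, the architecture described above could be written as follows. The layer sizes come from the text; the flattened feature size of 16 * 4 * 6 = 384 is derived assuming a 60x80 input with no padding, and should be adjusted if the input shape differs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    # channels [1, 12, 12, 16], all 5x5 kernels, conv -> ReLU -> 2x2 max pool
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, 5)
        self.conv2 = nn.Conv2d(12, 12, 5)
        self.conv3 = nn.Conv2d(12, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 6, 120)  # 384 features for a 60x80 input
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 2)            # (x, y) of the nose tip

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)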

The loss curves show that the loss decreases as training proceeds.

Loss curve: train vs validation

The images below show the 2 faces on which the network detects the nose tip most accurately, and 2 more on which it fails worst. I think some images fail because the training set is small: for unusual head poses, the model cannot locate the nose.

Best result sample 1

Best result sample 2

Worst result sample 1

Worst result sample 2

Part 2: Full Facial Keypoints Detection

In this part, I detect all 58 facial keypoints (landmarks). To enlarge the training set, data augmentation is applied: random crop, rotation, shift, and horizontal flip are implemented (a code sketch follows the example images below).

Facial key points sample

Facial key points sample

Facial key points sample

data augmentation (Crop)

data augmentation (Rotation)

data augmentation (Shift)

data augmentation (Flip)
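The key detail is that every geometric transform must be applied to the keypoints as well as to the pixels. Below is a sketch using numpy and skimage; the exact crop sizes, shift ranges, and rotation angles used in the project are not reproduced here, and the sign convention and rotation center should be double-checked against the plotting code.

import numpy as np
from skimage.transform import rotate as sk_rotate

def shift(img, pts, dx, dy):
    # Translate the image and move the (x, y) points by the same offset.
    out = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return out, pts + np.array([dx, dy])

def hflip(img, pts):
    # Mirror the image and the x-coordinates. Note: for symmetric landmark
    # sets the point ordering must also be remapped (left/right pairs swap).
    h, w = img.shape[:2]
    return img[:, ::-1], np.column_stack([w - 1 - pts[:, 0], pts[:, 1]])

def rotate(img, pts, deg):
    # Rotate counter-clockwise about the image center; rotate points to match.
    h, w = img.shape[:2]
    c = np.array([w / 2.0, h / 2.0])
    t = np.deg2rad(deg)
    R = np.array([[np.cos(t), np.sin(t)],
                  [-np.sin(t), np.cos(t)]])  # y points down in image coords
    return sk_rotate(img, deg), (pts - c) @ R.T + c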

My CNN architecture has 5 convolutional layers with channel sizes [1, 12, 16, 20, 24, 28] and corresponding kernel sizes [7, 5, 5, 5, 3]. A max pool of size 2 * 2 follows each of the first three convolutional layers. Three fully connected layers of sizes [500, 500, 116] produce the 58 (x, y) outputs. The learning rate is 1e-3. The training loader uses batch_size = 2 with shuffle = True, while the validation loader uses batch_size = 8 with shuffle = False. The rotation and shift transforms are used to enlarge the training set. The model summary and a training-loop sketch follow.

ConvNetL(
  (conv1): Conv2d(1, 12, kernel_size=(7, 7), stride=(1, 1))
  (conv2): Conv2d(12, 16, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(16, 20, kernel_size=(5, 5), stride=(1, 1))
  (conv4): Conv2d(20, 24, kernel_size=(5, 5), stride=(1, 1))
  (conv5): Conv2d(24, 28, kernel_size=(3, 3), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=1400, out_features=500, bias=True)
  (fc2): Linear(in_features=500, out_features=500, bias=True)
  (fc3): Linear(in_features=500, out_features=116, bias=True)
)
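The write-up does not name the loss function or optimizer, so the training loop below is a sketch under assumptions: MSE loss on the flattened keypoint coordinates and an Adam optimizer at the stated learning rate.

import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    criterion = nn.MSELoss()  # assumed; not stated in the write-up
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # also assumed
    for epoch in range(epochs):
        model.train()
        for imgs, pts in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), pts.view(pts.size(0), -1))
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(i), p.view(p.size(0), -1)).item()
                           for i, p in val_loader) / len(val_loader)
        print(f"epoch {epoch}: validation loss {val_loss:.5f}")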

Loss curve: train vs validation

The images below show the 2 faces on which the network detects the keypoints most accurately, and 2 more on which it fails worst. In the failure cases, we can see that both images are of the same person. I think these images fail because the model may overfit and generalize poorly to certain face shapes; it could be improved by adding more varied face shapes and poses to the training data.

Best result sample 1

Best result sample 2

Worst result sample 1

Worst result sample 2

Finally, the images below visualize the learned filters of the first convolutional layer [3]; a short plotting sketch follows them.

First layer filters

First layer filters
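These visualizations can be produced with a few lines of matplotlib. A sketch, assuming the trained model exposes its first layer as conv1 (as in the summary above) with 12 single-channel filters:

import matplotlib.pyplot as plt

def show_first_layer_filters(model):
    # first conv layer weights: shape (out_channels, in_channels, kH, kW)
    w = model.conv1.weight.data.cpu().numpy()
    fig, axes = plt.subplots(2, 6, figsize=(9, 3))
    for ax, f in zip(axes.ravel(), w[:, 0]):
        ax.imshow(f, cmap="gray")
        ax.axis("off")
    plt.show()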

Part 3: Train With Larger Dataset

In this part, I use Google Colab for GPU training. After loading each image, I crop it to the face region and resize it to 224x224 (a crop-and-resize sketch follows the sample images below). Shift and rotation transforms are again applied for augmentation.

Sample 1

Sample 2

Sample 3
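Cropping and resizing must also remap the keypoints. A sketch, assuming each sample comes with a face bounding box given as integer (left, top, width, height):

import numpy as np
from skimage.transform import resize

def crop_and_resize(img, pts, box, out_size=224):
    # pts are (x, y) pixel coordinates; box is (left, top, width, height)
    l, t, w, h = box
    crop = img[t:t + h, l:l + w]
    out = resize(crop, (out_size, out_size))
    # move keypoints into the crop, then scale to the new resolution
    new_pts = (pts - np.array([l, t])) * np.array([out_size / w, out_size / h])
    return out, new_pts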

I choose the ResNet18 model [4]. To match the input dimensions, I modify the first layer to Conv2d(1, 64, (7, 7), (2, 2), (3, 3), bias=False), because our grayscale images have only one channel, and I set the output dimension of the last layer to 68 * 2. The learning rate is 1e-4. The training loader uses batch_size = 1 with shuffle = True and num_workers = 4. To enlarge the training set, I again apply data augmentation such as rotation and shift.
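The two modifications amount to a few lines with torchvision; a sketch (whether pretrained weights were used is not stated, so none are loaded here):

import torch.nn as nn
import torchvision.models as models

model = models.resnet18()  # [4]
# grayscale input: 1 channel instead of 3
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# 68 landmarks with (x, y) each -> 136 outputs
model.fc = nn.Linear(model.fc.in_features, 68 * 2)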

The score of my prediction on Kaggle is 7.93920.

Kaggle score

The loss curve shows that the loss decreases as training proceeds.

Loss curve: train vs validation

The images below show facial keypoint detection results on the test set.

Test result sample 1

Test result sample 2

Test result sample 3

Test result sample 4

Test result sample 5

Test result sample 6

Finally, I use the model to detect facial keypoints on photos of actress Liu Yifei. The first two results show good performance. I was then curious what happens if there are two faces in one image: the third result shows that the model cannot handle this case.

Image 1

Image 2

Image 3

Image 4

Image 5

Image 6

References

[1] https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
[2] https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
[3] https://colab.research.google.com/github/Niranjankumar-c/DeepLearning-PadhAI/blob/master/DeepLearning_Materials/6_VisualizationCNN_Pytorch/CNNVisualisation.ipynb
[4] https://pytorch.org/docs/stable/torchvision/models.html