Facial Keypoints Detection with Neural Networks

The objective of this project was to design convolutional neural network architectures to automatically detect facial keypoints, using the IMM Face Database. In the first part we predicted the nose tip, framing the task as a pixel-coordinate regression problem (input: greyscale image, output: the x and y coordinates of the nose tip). In the second part we detected the full set of facial keypoints, and in the third part we trained a facial keypoints detector on the larger ibug Faces In-the-Wild dataset.

Part 1. Nose Tip Detection

1. Dataloader. The dataset contains 240 facial images of 40 persons, with 6 images per person taken from different viewpoints. We split the dataset by person into a training set (32 persons) and a test set (8 persons). We first converted the images to greyscale (so the CNN input has a single channel) and normalized the pixel values as floats. We also resized the images to (80, 60).
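
As a rough sketch of this preprocessing, the dataset class below uses torchvision; the class name, the (height, width) = (60, 80) reading of the resize, and the shift of the pixel values to roughly [-0.5, 0.5] are illustrative assumptions rather than details from the report.

```python
import torch
from torch.utils.data import Dataset
import torchvision.transforms.functional as TF
from PIL import Image

class NoseTipDataset(Dataset):
    """Illustrative dataset: greyscale, float-normalized, resized to 80x60."""
    def __init__(self, image_paths, nose_keypoints):
        self.image_paths = image_paths        # list of image file paths
        self.nose_keypoints = nose_keypoints  # one (x, y) nose tip per image

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("L")  # greyscale -> 1 channel
        img = TF.resize(img, (60, 80))                        # (height, width)
        img = TF.to_tensor(img) - 0.5                         # floats in roughly [-0.5, 0.5]
        keypoint = torch.tensor(self.nose_keypoints[idx], dtype=torch.float32)
        return img, keypoint
```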

[Figures: sample images from the dataloader]

2. CNN Architecture. For my final model I used 3 convolutional layers with kernel sizes of (2, 2) and (5, 5) and consistent padding of (2, 2). Each convolution was followed by a ReLU and a (2, 2) max pooling layer. I used an MSE loss function and the Adam optimizer with a learning rate of 0.0005, and trained for 25 epochs. A PyTorch sketch of the model follows the layer list below.

  1. Conv1: Conv2d(1, 12, kernel_size=(2, 2), stride=(1, 1), padding=(2, 2))
  2. Conv2: Conv2d(12, 12, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  3. Conv3: Conv2d(12, 16, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  4. Fc1: Linear(in_features=1120, out_features=200, bias=True)
  5. Fc2: Linear(in_features=200, out_features=2, bias=True)
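
A PyTorch sketch of the model described above: with the resized 80x60 greyscale input, the conv/pool stack leaves 16 x 7 x 10 = 1120 features, which matches Fc1. The ReLU between the two fully connected layers is an assumption.

```python
import torch
import torch.nn as nn

class NoseTipNet(nn.Module):
    """Part 1 sketch: three conv layers, each followed by ReLU and 2x2 max pooling."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=2, padding=2)
        self.conv2 = nn.Conv2d(12, 12, kernel_size=5, padding=2)
        self.conv3 = nn.Conv2d(12, 16, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(1120, 200)  # 16 channels x 7 x 10 for the 80x60 input
        self.fc2 = nn.Linear(200, 2)     # (x, y) of the nose tip

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = self.pool(torch.relu(self.conv3(x)))
        x = x.flatten(start_dim=1)
        x = torch.relu(self.fc1(x))      # ReLU here is an assumption
        return self.fc2(x)

# Training setup as described in the text: MSE loss, Adam with learning rate 0.0005.
model = NoseTipNet()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
```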

Here are the best results:

[Figures: best results]

Here are the worst results. These cases might have failed because of head rotation and because the faces are not centered in the image; in the majority of training samples the nose was roughly in the center:

[Figures: worst results]

3. Training and Validation Accuracy. Loss at the 25th epoch: 0.192 (training), 0.232 (validation).

[Figure: training and validation loss curves]
---------------------------------------

Part 2. Full Facial Keypoints Detection

1. Dataloader. The dataset contains 240 facial images of 40 persons, with 6 images per person taken from different viewpoints. We split the dataset by person into a training set (32 persons) and a test set (8 persons). We first converted the images to greyscale (so the CNN input has a single channel) and normalized the pixel values as floats. To keep the model from overfitting, we performed data augmentation: ColorJitter to vary the brightness, plus random rotations and horizontal flips with the landmarks transformed accordingly. A sketch of the landmark-aware augmentation is shown below.
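
Below is a minimal sketch of how the landmarks can be kept consistent with these augmentations, assuming a PIL image and landmarks given as (x, y) pixel coordinates; the brightness and rotation ranges and the function name are illustrative, and TF.adjust_brightness stands in for the brightness term of ColorJitter.

```python
import math
import random
import torchvision.transforms.functional as TF

def augment(image, landmarks, max_angle=15):
    """Brightness jitter, small random rotation, and random horizontal flip,
    with the landmarks transformed to stay aligned with the image."""
    w, h = image.size

    # Brightness jitter leaves the landmark positions unchanged.
    image = TF.adjust_brightness(image, brightness_factor=random.uniform(0.7, 1.3))

    # Rotate the image about its center; move the points by the same angle.
    angle = random.uniform(-max_angle, max_angle)
    image = TF.rotate(image, angle)  # counter-clockwise in the displayed image
    t = math.radians(angle)
    cx, cy = w / 2, h / 2
    landmarks = [
        (cx + (x - cx) * math.cos(t) + (y - cy) * math.sin(t),
         cy - (x - cx) * math.sin(t) + (y - cy) * math.cos(t))
        for x, y in landmarks
    ]

    # Horizontal flip: mirror the x coordinate across the image width.
    if random.random() < 0.5:
        image = TF.hflip(image)
        landmarks = [(w - 1 - x, y) for x, y in landmarks]

    return image, landmarks
```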

[Figures: augmented dataloader samples]

2. CNN Architecture. For my final model I used 6 convolutional layers, all with kernel size (2, 2) and stride (1, 1). Each convolution was followed by a ReLU non-linearity, and every layer except the first and the last was also followed by a (2, 2) max pooling layer. We trained for 60 epochs, using an MSE loss function and the Adam optimizer with a learning rate of 0.05. A PyTorch sketch of the model follows the layer list below.

  1. Conv1: Conv2d(1, 16, kernel_size=(2, 2), stride=(1, 1))
  2. Conv2: Conv2d(16, 16, kernel_size=(2, 2), stride=(1, 1))
  3. Conv3: Conv2d(16, 32, kernel_size=(2, 2), stride=(1, 1))
  4. Conv4: Conv2d(32, 128, kernel_size=(2, 2), stride=(1, 1))
  5. Conv5: Conv2d(128, 128, kernel_size=(2, 2), stride=(1, 1))
  6. Conv6: Conv2d(128, 32, kernel_size=(2, 2), stride=(1, 1))
  7. Fc1: Linear(in_features=96, out_features=500, bias=True)
  8. Fc2: Linear(in_features=500, out_features=116, bias=True)
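
A PyTorch sketch of the model described above, assuming the same 80x60 greyscale input as in Part 1; with that input the conv/pool stack leaves 32 x 1 x 3 = 96 features, which matches Fc1, and the 116 outputs correspond to the 58 IMM keypoints in x and y. The ReLU between the two fully connected layers is an assumption.

```python
import torch
import torch.nn as nn

class FaceKeypointNet(nn.Module):
    """Part 2 sketch: six 2x2 conv layers with ReLU; 2x2 max pooling after every
    conv layer except the first and the last."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=2)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=2)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=2)
        self.conv4 = nn.Conv2d(32, 128, kernel_size=2)
        self.conv5 = nn.Conv2d(128, 128, kernel_size=2)
        self.conv6 = nn.Conv2d(128, 32, kernel_size=2)
        self.pool = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(96, 500)    # 32 channels x 1 x 3 for an 80x60 input
        self.fc2 = nn.Linear(500, 116)   # 58 keypoints x 2 coordinates

    def forward(self, x):
        x = torch.relu(self.conv1(x))              # no pooling after the first layer
        x = self.pool(torch.relu(self.conv2(x)))
        x = self.pool(torch.relu(self.conv3(x)))
        x = self.pool(torch.relu(self.conv4(x)))
        x = self.pool(torch.relu(self.conv5(x)))
        x = torch.relu(self.conv6(x))              # no pooling after the last layer
        x = x.flatten(start_dim=1)
        x = torch.relu(self.fc1(x))                # ReLU here is an assumption
        return self.fc2(x)
```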

Here are the best results:

[Figures: best results]

Here are the worst results. These cases might have failed because of head rotation and faces shifted to the left or right of the frame; it might be worth performing more aggressive data augmentation in the future.

[Figures: worst results]

3. Training and Validation Accuracy.

[Figure: training and validation loss curves]

4. Learned Filters. Row one: layers 1-3; row two: layers 4-6.

[Figures: learned filters for layers 1-6]
---------------------------------------

Part 3. Larger Training Set

1. Dataloader. The dataset needed to be cropped around the provided bounding boxes and resized to a common scale, so that training could be performed on the cropped faces; the facial keypoints also had to be updated to match the crops. I ran into errors downloading the class images, so I worked with the same dataset that I found online. A sketch of the crop-and-rescale step is shown below.
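
A minimal sketch of that step, assuming a PIL image, an (N, 2) array of pixel-coordinate keypoints, and a (left, top, width, height) bounding box; the function name and the 224x224 output size are illustrative choices rather than details from the report.

```python
import numpy as np

def crop_and_resize(image, keypoints, bbox, out_size=(224, 224)):
    """Crop a PIL image to its face bounding box, resize it, and update the keypoints."""
    left, top, w, h = bbox
    cropped = image.crop((left, top, left + w, top + h))
    resized = cropped.resize(out_size)

    # Shift the keypoints into the crop's frame, then scale them to the new resolution.
    kps = np.array(keypoints, dtype=np.float32)
    kps[:, 0] = (kps[:, 0] - left) * out_size[0] / w
    kps[:, 1] = (kps[:, 1] - top) * out_size[1] / h
    return resized, kps
```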

[Figures: cropped and resized dataloader samples]

2. CNN Architecture. Training on 6666 images, we used the ResNet18 architecture and a Google Colab GPU. I used MSE loss and the Adam optimizer with a learning rate of 0.0001, and modified the first layer so that the input has 1 channel, since the images are greyscale. A sketch of this setup is shown below.
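
A sketch of this setup with torchvision's ResNet-18; replacing the final fully connected layer is implied by the regression task rather than stated above, and the 68-landmark (136-output) head follows the ibug annotation format.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Start from the standard torchvision ResNet-18.
model = models.resnet18()

# Replace the first conv layer so the network accepts 1-channel greyscale input
# instead of 3-channel RGB.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Replace the final fully connected layer so the network regresses the keypoints
# (68 ibug landmarks -> 136 outputs).
model.fc = nn.Linear(model.fc.in_features, 68 * 2)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```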


Here are the results:

[Figures: results]