The goal of this project is to automatically detect facial keypoints using neural networks, which we build and train with PyTorch. We begin by detecting only the tip of a person's nose and then move on to detecting all of the facial keypoints on a face.

I used torch.utils.data.DataLoader to build a custom dataloader that accesses the images and keypoints from the IMM Face Database. Below I have displayed a few of the sample images with the nose keypoint marked in red.
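A minimal sketch of how such a dataloader could look is given here. It is not the code used for this project: the class name NoseKeypointDataset, the nose_index value, the image size, and the random stand-in arrays are all assumptions for illustration, and parsing of the actual IMM image and annotation files is left out.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class NoseKeypointDataset(Dataset):
    """Pairs each grayscale image with a single (x, y) keypoint (hypothetical class)."""

    def __init__(self, images, keypoints, nose_index):
        # images: (N, H, W) float array; keypoints: (N, K, 2) array of (x, y) pairs
        self.images = images
        self.keypoints = keypoints
        self.nose_index = nose_index

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Add a channel dimension so the image is (1, H, W), as Conv2d expects.
        image = torch.from_numpy(self.images[idx]).float().unsqueeze(0)
        nose = torch.from_numpy(self.keypoints[idx, self.nose_index]).float()
        return image, nose


# Random stand-in data; real code would load the IMM images and annotations here.
rng = np.random.default_rng(0)
images = rng.random((8, 60, 80), dtype=np.float32)
keypoints = rng.random((8, 58, 2), dtype=np.float32)

# nose_index=52 is a placeholder, not a value confirmed by the write-up.
dataset = NoseKeypointDataset(images, keypoints, nose_index=52)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
batch_images, batch_noses = next(iter(loader))
```

Each batch then holds a stack of single-channel images and one (x, y) target per image, which is the shape a nose-tip regressor would train on.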

After building my preliminary CNN model, I performed some hyperparameter tuning on the learning rate and the filter size. I raised the learning rate from 1e-3 to 2e-3, and I increased the filter size from 5 to 7 to see how each change would affect my losses. Neither change produced a significant difference in the overall results.

Below is a graph showing the Training and Validation MSE loss during the training process. The graphs with adjusted hyperparameters are also shown.
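To make the tuning and loss tracking concrete, here is a hedged sketch of recording per-epoch training and validation MSE for a few settings. Everything other than the learning rates (1e-3, 2e-3) and filter sizes (5, 7) is an assumption: the TinyNoseNet model, the run_epoch helper, the Adam optimizer, the random tensors, the three epochs, and the particular combinations tried are illustrative only.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class TinyNoseNet(nn.Module):
    """Small stand-in CNN (hypothetical); kernel_size is exposed so it can be tuned."""

    def __init__(self, kernel_size=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, 2)  # one (x, y) pair for the nose tip

    def forward(self, x):
        return self.head(self.features(x).flatten(1))


def run_epoch(model, loader, criterion, optimizer=None):
    """Return the mean loss over loader; weights are updated only if an optimizer is given."""
    training = optimizer is not None
    model.train(training)
    total, count = 0.0, 0
    with torch.set_grad_enabled(training):
        for images, targets in loader:
            loss = criterion(model(images), targets)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * images.size(0)
            count += images.size(0)
    return total / count


torch.manual_seed(0)
# Random stand-in data in place of the real images and nose keypoints.
train_data = TensorDataset(torch.rand(16, 1, 60, 80), torch.rand(16, 2))
val_data = TensorDataset(torch.rand(8, 1, 60, 80), torch.rand(8, 2))
train_loader = DataLoader(train_data, batch_size=4, shuffle=True)
val_loader = DataLoader(val_data, batch_size=4)

history = {}
for lr, kernel_size in [(1e-3, 5), (2e-3, 5), (1e-3, 7)]:
    model = TinyNoseNet(kernel_size)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is assumed
    criterion = nn.MSELoss()
    train_losses, val_losses = [], []
    for epoch in range(3):
        train_losses.append(run_epoch(model, train_loader, criterion, optimizer))
        val_losses.append(run_epoch(model, val_loader, criterion))
    history[(lr, kernel_size)] = (train_losses, val_losses)
```

Plotting the two lists stored for each key in history against the epoch number gives one training curve and one validation curve per hyperparameter setting.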

Below are two facial images where the network detects the nose correctly. The red point represents the predicted point, and the green point represents the ground truth.

Below are two facial images where the network detects the nose incorrectly. Again, the red point represents the predicted point, and the green point represents the ground truth. In one image the face is tilted to the side, which may have made the nose harder for the network to locate. In the other image, the line near the mouth may have looked similar enough to a nose that the model placed the keypoint near the bottom of the face instead.

I used torch.utils.data.DataLoader again, but this time the dataloader returns all of the facial keypoints. I also performed data augmentation to help prevent the network from overfitting. This was done by adding images to the dataloader that were either randomly rotated by between -15 and 15 degrees or randomly cropped to a specific portion of the image. Below are some of the sample images from the dataloader labeled with the ground-truth keypoints.
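The important detail in rotation augmentation is that the keypoints have to move with the image. The sketch below shows one way to do that with plain NumPy, using nearest-neighbor resampling; the function name rotate_image_and_keypoints and the toy image are assumptions, and the project may well have used a library routine instead.

```python
import numpy as np


def rotate_image_and_keypoints(image, keypoints, angle_deg):
    """Rotate a (H, W) image and its (K, 2) pixel keypoints (x, y) about the image center.

    With the y axis pointing down, a positive angle appears clockwise on screen.
    """
    h, w = image.shape
    center = np.array([(w - 1) / 2.0, (h - 1) / 2.0])  # (x, y)
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])

    # Forward-map the keypoints: p' = R (p - c) + c
    new_keypoints = (keypoints - center) @ rot.T + center

    # Inverse-map every output pixel back to its source pixel: p = R^T (p' - c) + c
    ys, xs = np.mgrid[0:h, 0:w]
    out_xy = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    src_xy = (out_xy - center) @ rot + center
    src_x = np.rint(src_xy[:, 0]).astype(int)
    src_y = np.rint(src_xy[:, 1]).astype(int)
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)

    # Pixels whose source falls outside the image stay zero (black).
    flat = np.zeros(h * w, dtype=image.dtype)
    flat[inside] = image[src_y[inside], src_x[inside]]
    return flat.reshape(h, w), new_keypoints


# Toy example: one bright pixel with a keypoint on top of it.
rng = np.random.default_rng(0)
image = np.zeros((21, 21))
image[10, 15] = 1.0  # row y=10, column x=15
keypoints = np.array([[15.0, 10.0]])
angle = rng.uniform(-15, 15)  # same range as described above
aug_image, aug_keypoints = rotate_image_and_keypoints(image, keypoints, angle)
```

Because the same rotation matrix drives both the pixel lookup and the keypoint update, the labels stay aligned with the facial features after augmentation.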

Below are the layers for my model architecture. The learning rate was 1e-3 and the batch size was 4.

FaceDetectionCNN(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1))
  (conv4): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
  (conv5): Conv2d(64, 128, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=4608, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=116, bias=True)
)
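The printout does not show the input size or where pooling happens, but the numbers can still be sanity-checked. The sketch below computes the flattened feature count feeding fc1 for a given configuration. The 224x224 input and the 2x2 max pooling after the first four convolutions are purely hypothetical; they are just one combination that happens to reproduce in_features=4608, and other combinations could too. The helper names conv_out and flattened_features are also made up for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(height, width, channels, pool_after, kernel=5):
    """Feature count after stacked convs, with optional 2x2 max pooling after each one."""
    out_channels = channels[0]
    for out_channels, pool in zip(channels, pool_after):
        height, width = conv_out(height, kernel), conv_out(width, kernel)
        if pool:
            height, width = height // 2, width // 2  # 2x2 pooling floors odd sizes
    return out_channels * height * width


channels = [6, 16, 32, 64, 128]  # output channels of conv1..conv5 from the printout
# Hypothetical configuration: 224x224 input, pooling after conv1..conv4 only.
n_features = flattened_features(224, 224, channels, [True, True, True, True, False])
n_outputs = 58 * 2  # 116 outputs correspond to 58 (x, y) keypoint pairs
```

Under these assumed settings the final feature map is 6x6 with 128 channels, so n_features comes out to 4608, matching fc1.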

Below is a graph showing the Training and Validation MSE loss during the training process.

Below are two facial images where the network detects the facial keypoints correctly. The red points represent the predicted points, and the green points represent the ground truth.

Below are two facial images where the network detects the facial keypoints incorrectly. Again, the red points represent the predicted points, and the green points represent the ground truth. The network may have failed in these cases because the faces are turned much farther to the left than in most of the training images, so the features do not appear where the network usually sees them. The nose keypoint detector had a difficult time with similar images.

Below I have visualized some of the filters from the first convolutional layer of my neural network.
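One possible way to lay out first-layer filters for display is to rescale each filter to [0, 1] and tile them into a single image. The sketch below does this with NumPy; the function name filters_to_grid, the grid layout, and the random stand-in weights are assumptions rather than the code behind the figure.

```python
import numpy as np


def filters_to_grid(weights, cols, pad=1):
    """Tile (N, 1, k, k) convolution weights into one 2-D array for display."""
    n, _, kh, kw = weights.shape
    rows = -(-n // cols)  # ceiling division

    # Rescale each filter to [0, 1] independently so its pattern is visible.
    flat = weights.reshape(n, -1)
    lo = flat.min(axis=1).reshape(n, 1, 1)
    hi = flat.max(axis=1).reshape(n, 1, 1)
    scaled = (weights[:, 0] - lo) / np.maximum(hi - lo, 1e-12)

    # Leave a border of `pad` zero pixels around every filter.
    grid = np.zeros((rows * (kh + pad) + pad, cols * (kw + pad) + pad))
    for i, filt in enumerate(scaled):
        r, c = divmod(i, cols)
        top, left = pad + r * (kh + pad), pad + c * (kw + pad)
        grid[top:top + kh, left:left + kw] = filt
    return grid


# Random stand-in with the same shape as conv1's weights: six 5x5 single-channel filters.
weights = np.random.default_rng(0).normal(size=(6, 1, 5, 5))
grid = filters_to_grid(weights, cols=3)
```

With a trained model, the weights could be taken from model.conv1.weight.detach().cpu().numpy() and the grid shown with matplotlib's plt.imshow(grid, cmap="gray").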

For my Kaggle submission, my username is Sean Kim. At the time of submitting this website, my score was 10.13204.

This dataloader was essentially the same as the one from the previous part, but I also performed color jittering for additional data augmentation. Below are some example images from my dataloader with the keypoints labeled.

Here is the ResNet model architecture I used to train with a larger dataset. I used the default ResNet18 model and modified only the input channels of the first convolution and the output size of the final layer to match the dimensions needed for this data. The learning rate was 1e-3 and the batch size was 16.

ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)

Below is a graph showing the Training and Validation MSE loss for the ResNet18 model over 10 epochs of training.

Here are some of the visualized keypoint results for the ResNet18 model on the test set after training for 50 epochs on the full dataset.

Here are some of the visualized keypoint results for the ResNet18 model on some images of my choice after training for 50 epochs on the full dataset.

For bells and whistles, I applied the automatic facial keypoint detection from this project to Project 3 so that I could morph from one face to another and compute the mid-way face without selecting points by hand. The results are shown below. I performed a morph from Jisoo to Eunwoo and also computed the mid-way faces for Eunwoo with me and with Jisoo.

I found it really cool to apply deep learning for this application of detecting facial points and making the process automatic. I really liked applying the project to my project 3 code to see this process done easily without having to select points!