Shreyas Patankar
I began the project by writing a custom PyTorch Dataset and DataLoader to load the input images. Each image was converted to grayscale, normalized, and resized to 80x60, and the nose tip keypoint was extracted for each one. Some examples of the original images and the images the dataloader produces are shown below.
The next step was to write a convolutional neural network to learn the location of the nose keypoint. The network was constructed with the following layers:
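The preprocessing described above could be sketched roughly as follows. This is not the original code: the class name `NoseDataset`, the `nose_index` argument, the choice of a channel average for grayscale, and the normalization range of [-0.5, 0.5] are all assumptions made for illustration, and file loading is left out entirely.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset


class NoseDataset(Dataset):
    """Hypothetical dataset wrapping in-memory images and keypoint annotations."""

    def __init__(self, images, keypoints, nose_index, size=(60, 80)):
        # images: list of (3, H, W) uint8 tensors
        # keypoints: list of (N, 2) float tensors, (x, y) as fractions of width/height
        self.images = images
        self.keypoints = keypoints
        self.nose_index = nose_index  # assumed: position of the nose tip in the annotations
        self.size = size              # (rows, cols) of the resized image

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        # Grayscale via a simple channel average (an assumption, not the author's method).
        gray = self.images[i].float().mean(dim=0, keepdim=True)
        # Assumed normalization: map [0, 255] to [-0.5, 0.5].
        gray = gray / 255.0 - 0.5
        # Resize to the target size; interpolate expects a batch dimension.
        gray = F.interpolate(gray[None], size=self.size, mode="bilinear",
                             align_corners=False)[0]
        return gray, self.keypoints[i][self.nose_index]
```

Storing keypoints as fractions of the image size means they do not need to be adjusted when the image is resized.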
| | conv1 | conv2 | conv3 | fc1 | fc2 |
|---|---|---|---|---|---|
| input channels | 1 | 12 | 22 | 1280 | 64 |
| output channels | 12 | 22 | 32 | 64 | 2 |
| filter | 3x3 | 3x3 | 3x3 | n/a | n/a |
Each convolutional layer was followed first by a ReLU, then a Max Pool of size 2x2. The first fully connected layer was followed with a ReLU.
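A network with these layers might be written in PyTorch roughly as below. This is a sketch rather than the original code: the class name `NoseNet` is made up, and the convolutions are assumed to be unpadded with stride 1, which is what makes the flattened size come out to the 1280 inputs listed for fc1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoseNet(nn.Module):
    """Three 3x3 conv layers (each followed by ReLU and 2x2 max pool) and two fc layers."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=3)
        self.conv2 = nn.Conv2d(12, 22, kernel_size=3)
        self.conv3 = nn.Conv2d(22, 32, kernel_size=3)
        self.pool = nn.MaxPool2d(2)
        # A 60x80 input shrinks to 29x39, then 13x18, then 5x8 over the three
        # conv + pool stages, so the flattened size is 32 * 5 * 8 = 1280.
        self.fc1 = nn.Linear(1280, 64)
        self.fc2 = nn.Linear(64, 2)  # (x, y) of the nose tip

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = torch.flatten(x, start_dim=1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```

The same flattened size results whether the 80x60 image is stored as 60 rows by 80 columns or the other way around.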
I split the data into training and test sets: the test set was 20% of the original data, and the remaining 80% was used for training. I trained the net with a learning rate of 1e-3, unbatched, for 25 epochs. Below are the training and validation losses averaged per epoch, along with some success and failure cases. The blue point is the predicted keypoint, while the red point is the ground truth.
The most likely reason for the failure cases is simply the lack of training data. There were under 200 images in the training set to begin with, and only a fraction of those showed any particular orientation. Neural nets tend to perform significantly better when training data is plentiful. It is also worth noting that the faces in the failure cases are generally not facing straight ahead, though I believe this is a more subtle reason for failure.
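The split and training procedure could look something like the sketch below. Several details are assumptions, since the write-up does not state them: the split is taken to be random, "unbatched" is modeled as a batch size of 1, the loss is mean squared error, and the optimizer is Adam. The function name `train` and its arguments are likewise illustrative.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split


def train(model, dataset, epochs=25, lr=1e-3, test_fraction=0.2, seed=0):
    """Return (train_losses, val_losses), one averaged value per epoch."""
    n_test = int(len(dataset) * test_fraction)
    n_train = len(dataset) - n_test
    generator = torch.Generator().manual_seed(seed)
    train_set, test_set = random_split(dataset, [n_train, n_test], generator=generator)

    # batch_size=1 stands in for "unbatched" training (an assumption).
    train_loader = DataLoader(train_set, batch_size=1, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=1)

    criterion = nn.MSELoss()                                 # assumed loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimizer

    train_losses, val_losses = [], []
    for _ in range(epochs):
        model.train()
        total = 0.0
        for images, keypoints in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), keypoints)
            loss.backward()
            optimizer.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))

        # Evaluate on the held-out 20% without updating the weights.
        model.eval()
        total = 0.0
        with torch.no_grad():
            for images, keypoints in test_loader:
                total += criterion(model(images), keypoints).item()
        val_losses.append(total / len(test_loader))
    return train_losses, val_losses
```

The two returned lists are what would be plotted as the per-epoch training and validation curves.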
Finally, below is a visualization of the learned filters from the convolutional layers.
I have omitted the 3rd layer of filters because the number of filters grows significantly from layer to layer, as evidenced by conv1 and conv2.
In this part, the goal was to continue to identify keypoints, but for the full face. Below are sampled images from a new dataloader with the ground truth keypoints. The images were scaled to 160x120, converted to grayscale, and normalized.
As in the previous part, the next step was to write a convolutional neural network to learn the locations of the keypoints, this time for the full face. The network had additional convolutional layers to account for the larger image size:
| | conv1 | conv2 | conv3 | conv4 | conv5 | fc1 | fc2 |
|---|---|---|---|---|---|---|---|
| input channels | 1 | 12 | 20 | 24 | 30 | 96 | 256 |
| output channels | 12 | 20 | 24 | 30 | 32 | 256 | 116 |
| filter | 3x3 | 3x3 | 3x3 | 3x3 | 3x3 | n/a | n/a |
Each convolutional layer was followed first by a ReLU, then a Max Pool of size 2x2. The first fully connected layer was followed with a ReLU.
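The fc1 input sizes in the two layer tables (1280 for the nose network, 96 for this one) can be checked with a little arithmetic. The helper below is an illustrative sketch that assumes unpadded, stride-1 convolutions and 2x2 max pooling that rounds down; under those assumptions both numbers are reproduced.

```python
def flattened_size(height, width, out_channels, n_stages, kernel=3, pool=2):
    """Number of features left after n_stages of (unpadded conv -> max pool)."""
    for _ in range(n_stages):
        # An unpadded k x k convolution trims k - 1 pixels from each dimension.
        height, width = height - (kernel - 1), width - (kernel - 1)
        # Max pooling divides each dimension by the pool size, rounding down.
        height, width = height // pool, width // pool
    return out_channels * height * width


print(flattened_size(60, 80, out_channels=32, n_stages=3))    # 1280
print(flattened_size(120, 160, out_channels=32, n_stages=5))  # 96
```

For the 160x120 input, the spatial size goes 59x79, 28x38, 13x18, 5x8, and finally 1x3, so only 32 * 1 * 3 = 96 features reach the first fully connected layer.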
Exactly as before, I split the data into training and test sets: the test set was 20% of the original data, and the remaining 80% was used for training. I trained the net with a learning rate of 1e-3, unbatched, for 30 epochs. Below are the training and validation losses averaged per epoch, along with some success and failure cases. The blue points are the predicted keypoints, while the red points are the ground truth.
The main reason for the failure cases is simply the lack of training data. As before, there were under 200 images in the training set to begin with, and only a fraction of those showed any particular orientation. Neural nets tend to perform significantly better when training data is plentiful. Moreover, the vast majority of the faces were photographed from the same distance, and when that distance varies, the net does not respond well. This could explain the strange shapes predicted in the two failure cases. To improve this, I could have applied additional data augmentation so that the net learned more robust features.
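One augmentation that targets the distance issue is random rescaling, which simulates faces that are closer to or farther from the camera. The sketch below was not part of the project. It assumes keypoints are stored as (x, y) pixel coordinates, and the function name `rescale_about_center` is made up for illustration.

```python
import torch
import torch.nn.functional as F


def rescale_about_center(image, keypoints, scale):
    """Resize a (1, H, W) float image by `scale`, then center-crop or zero-pad to (H, W).

    keypoints is an (N, 2) float tensor of (x, y) pixel positions.
    """
    _, h, w = image.shape
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    resized = F.interpolate(image[None], size=(nh, nw), mode="bilinear",
                            align_corners=False)[0]

    # Offset of the resized image inside the output canvas (negative means crop).
    ox, oy = (w - nw) // 2, (h - nh) // 2
    out = torch.zeros_like(image)
    dst_x = slice(max(0, ox), min(w, ox + nw))
    dst_y = slice(max(0, oy), min(h, oy + nh))
    src_x = slice(max(0, -ox), max(0, -ox) + (dst_x.stop - dst_x.start))
    src_y = slice(max(0, -oy), max(0, -oy) + (dst_y.stop - dst_y.start))
    out[:, dst_y, dst_x] = resized[:, src_y, src_x]

    # Keypoints follow the same resize and offset as the pixels.
    factor = torch.tensor([nw / w, nh / h], dtype=keypoints.dtype)
    offset = torch.tensor([ox, oy], dtype=keypoints.dtype)
    return out, keypoints * factor + offset
```

During training, the scale could be drawn per sample, for example with `random.uniform(0.8, 1.2)`; that particular range is a guess and would need tuning.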
Finally, below is a visualization of the learned filters from the convolutional layers.
I have omitted the remaining layers of filters because the number of filters grows significantly from layer to layer, as evidenced by conv1 and conv2.
The final section of the project was to detect facial keypoints using a larger training set, for which I used the ibug faces dataset. My network was based on an existing architecture, resnet18, with the first convolution changed to take a single-channel input and the final fully connected layer changed to output 136 values. The architecture of the network is shown below:
```
ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)
```
The network was trained with a learning rate of 1e-3, as before, using a batch size of 32 for 5 epochs. I would have trained for longer, but Colab kept crashing for an unknown reason after about 7 epochs. Below are my training and validation errors over the epochs.
Here are the final results on some of the testing data.