**CS294-26 Project 4: Facial Keypoint Detection with Neural Networks**

By Neerja Thakkar

In this assignment, we move beyond manual face keypoint annotation to detection with neural networks!

Nose Tip Detection
=========================

First, we train a simple toy neural network with 3 convolutional layers to detect the nose tip. Here are some sampled images from my dataloader, with ground-truth keypoints in blue:

![Example training images](out/sample_gt.png width=700)

Here is the training loss (blue) and validation loss (orange) during training, with a batch size of 4 images.

![x axis - training epochs, y axis - MSE](out/train_val_plot.png width=600)

Here are some results on validation data, unseen during training. Green is the ground truth and red is the predicted keypoint.

![](out/sample_out_1.png width=700)
![Example result images](out/sample_out_2.png width=700)

The network does pretty well on most of the images, aside from the two of the woman whose face is rotated or tilted at an angle. I hypothesize that these poses were underrepresented during training, so the network could not learn to localize the nose accurately in them. The network is also far from perfect on a few of the more frontal images, likely due to the small amount of training data (which was not augmented) and the simple network architecture.

Full Facial Keypoints Detection
=============================

Here are some sampled images from my dataloader, with ground-truth keypoints in blue:

![Example training images](out_full_face/sample_gt.png width=700)

For the network architecture, I use the following:

| Layer | Layer Type | In dim | Out dim | Kernel Size | Activation function |
|-------|------------|--------|---------|-------------|---------------------|
| 1 | Convolutional | 1 | 12 | 3x3 | ReLU |
| 2 | Max Pool | - | - | 2x2 | - |
| 3 | Convolutional | 12 | 16 | 3x3 | ReLU |
| 4 | Max Pool | - | - | 2x2 | - |
| 5 | Convolutional | 16 | 20 | 3x3 | ReLU |
| 6 | Max Pool | - | - | 2x2 | - |
| 7 | Convolutional | 20 | 24 | 5x5 | ReLU |
| 8 | Max Pool | - | - | 2x2 | - |
| 9 | Convolutional | 24 | 32 | 7x7 | ReLU |
| 10 | Max Pool | - | - | 2x2 | - |
| 11 | Fully Connected | 96 | 120 | - | ReLU |
| 12 | Fully Connected | 120 | 58*2 | - | - |

During training, I use data augmentation in the form of random rotations between -5 and 5 degrees, random cropping, and random changes to hue and saturation. At test time, the images are only resized. I train with MSE loss and the Adam optimizer at a learning rate of 1e-3.

Here is the training loss (blue) and validation loss (orange) during training, with a batch size of 4 images.

![x axis - training epochs, y axis - MSE](out_full_face/train_val_plot.png width=600)

The validation loss is quite variable, which suggests the network may be overfitting to the training data. Given the small training set, this makes sense. Here are some outputs:

![](out_full_face/sample_out_1.png width=700)
![Example result images](out_full_face/sample_out_2.png width=700)

As with the nose detector, this network performs quite decently on images where the person is facing straight ahead. On the three images where the woman's face is turned (the two rightmost images in the top row, and the second from the left in the bottom row), the network struggles. A larger training dataset would likely help. Rotating the images more aggressively for data augmentation did not seem to help, but perhaps a random warp would.
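To make the table above concrete, here is a minimal PyTorch sketch of this architecture (not the exact code used). The input resolution is not stated in the write-up; 240x180 (width x height) grayscale inputs are assumed below because they make the flattened feature size work out to the 96 units listed for layer 11.

```python
import torch
import torch.nn as nn

class FaceKeypointNet(nn.Module):
    """Sketch of the 5-conv / 2-FC architecture from the table above."""
    def __init__(self, num_keypoints=58):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 16, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 20, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 24, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Linear(96, 120), nn.ReLU(),
            nn.Linear(120, num_keypoints * 2),  # (x, y) per keypoint
        )

    def forward(self, x):
        x = self.features(x)     # (N, 32, 1, 3) for 180x240 inputs
        x = torch.flatten(x, 1)  # (N, 96)
        return self.regressor(x)

# Quick shape check with a dummy batch of 4 grayscale images.
model = FaceKeypointNet()
print(model(torch.zeros(4, 1, 180, 240)).shape)  # torch.Size([4, 116])
```

The last layer has no activation, so the network regresses the 58 keypoint coordinates directly.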
Here are the learned filters in the first conv layer:

![](out_full_face/0.png width=150) ![](out_full_face/1.png width=150) ![](out_full_face/2.png width=150) ![](out_full_face/3.png width=150) ![](out_full_face/4.png width=150) ![](out_full_face/5.png width=150) ![](out_full_face/6.png width=150) ![](out_full_face/7.png width=150) ![](out_full_face/8.png width=150) ![](out_full_face/9.png width=150) ![](out_full_face/10.png width=150) ![](out_full_face/11.png width=150)

Train With Larger Dataset
=================

Now, using the same ideas as with the smaller dataset, we train a model on a larger dataset. During training, I use data augmentation in the form of random rotations between -15 and 15 degrees, random cropping, and random changes to hue and saturation. At test time, the images are only resized. I train with MSE loss and the Adam optimizer at a learning rate of 0.005. Before resizing, the images are cropped to a bounding box of the face, which I extended by 1.4x. I trained on 6000 images and held out the rest for validation.

For the network architecture, I use ResNet-18, with the input layer modified to take a single-channel image and the output layer modified to produce 68*2 = 136 outputs. This results in the following architecture:

`````
ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)
`````
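One way to construct the model printed above is to start from torchvision's ResNet-18 and swap its first and last layers. This is a minimal sketch, not the exact code used; whether ImageNet-pretrained weights were loaded is not stated, so the model is built from scratch here.

```python
import torch.nn as nn
from torchvision import models

def make_keypoint_resnet18(num_keypoints=68):
    """ResNet-18 adapted for grayscale input and 2 * num_keypoints outputs,
    matching the architecture printed above."""
    model = models.resnet18(pretrained=False)
    # Stem now accepts single-channel (grayscale) images instead of RGB.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Classification head becomes a 68*2 = 136-dimensional keypoint regressor.
    model.fc = nn.Linear(model.fc.in_features, num_keypoints * 2)
    return model
```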
Here are some example ground-truth images sampled from my dataloader, with keypoints in blue:

![](out_resnet/gt.png width=700)

Here is the training loss (blue) and validation loss (orange) during training, with a batch size of 4 images.

![](out_resnet/train_val_plot.png width=400)

Here are some example outputs on the held-out validation set:

![](out_resnet/out1.png width=700)
![](out_resnet/out2.png width=700)
![](out_resnet/out3.png width=700)
![](out_resnet/out4.png width=700)

These results were overall pretty decent. Rotated faces were handled well, likely because of the data augmentation. The network sometimes struggled on faces seen only in profile (e.g. the leftmost image in the first row) and on tilted faces (e.g. the second image from the left in the last row). It was also sometimes less accurate when the face was not centered (e.g. the leftmost image in the second row).

Here are some results on images from the testing set:

![](out_resnet/test4.png width=450) ![](out_resnet/test2.png width=450)
![](out_resnet/test3.png width=300) ![](out_resnet/test1.png width=300) ![](out_resnet/test5.png width=300)

And here are some results on my own images:

![](out_resnet/im1.jpg width=400) ![](out_resnet/im2.jpg width=400)
![](out_resnet/im3.jpg width=400) ![](out_resnet/im4.jpg width=400)

Given that the results seemed very reasonable on the test set, it was surprising that these results were not great, especially for the eyes and eyebrows. For the baby images, it is possible that the network was trained mostly on adult faces and therefore had a hard time localizing their features. The poor result on the second image, however, is harder to explain. With this network, I got an MAE of 14.68 on the Kaggle competition.
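For completeness, here is a minimal sketch of a training loop matching the setup described above (MSE loss on the keypoint coordinates, the Adam optimizer, batch size 4). The dataloader names and epoch count are placeholders rather than the exact code used, and the plots above log validation loss per batch, whereas this sketch averages it per epoch.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=10, lr=5e-3, device="cuda"):
    """Minimal keypoint-regression training loop: MSE loss, Adam optimizer.
    train_loader / val_loader are assumed to yield (image, keypoints) batches."""
    model = model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []

    for _ in range(epochs):
        model.train()
        for imgs, kpts in train_loader:
            imgs = imgs.to(device)
            # Flatten (N, 68, 2) targets to (N, 136) to match the network output.
            kpts = kpts.to(device).view(kpts.size(0), -1)
            optimizer.zero_grad()
            loss = criterion(model(imgs), kpts)
            loss.backward()
            optimizer.step()
            train_losses.append(loss.item())

        model.eval()
        with torch.no_grad():
            batch_losses = [
                criterion(model(i.to(device)),
                          k.to(device).view(k.size(0), -1)).item()
                for i, k in val_loader
            ]
        val_losses.append(sum(batch_losses) / len(batch_losses))

    return train_losses, val_losses
```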
Antialiased CNNs
=================

I decided to try [Antialiased CNNs](https://github.com/adobe/antialiased-cnns), which make the network's downsampling antialiased by inserting BlurPool layers. I trained my modified ResNet-18 with BlurPool layers for 15 epochs with a learning rate of 0.001. Here are some example results:

![](out_resnet/antialiased_out.png width=700)

![](out_resnet/aa4.png width=450) ![](out_resnet/aa2.png width=450)
![](out_resnet/aa3.png width=300) ![](out_resnet/aa1.png width=300) ![](out_resnet/aa5.png width=300)

While the antialiased CNN looked slightly better on some images, the overall results were not much different: my Kaggle MAE was only reduced to 13.79303. It would be interesting to try this idea with other architectures and hyperparameters in the future to push performance further.
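For reference, the antialiased backbone can be obtained from the `antialiased_cnns` package in the repository linked above and then adapted the same way as the plain ResNet-18. This is a sketch assuming the package's `resnet18` factory; the exact construction and training code used may differ.

```python
import torch.nn as nn
import antialiased_cnns  # pip install antialiased-cnns

def make_antialiased_keypoint_net(num_keypoints=68):
    """ResNet-18 with BlurPool-based antialiased downsampling, adapted for
    single-channel input and 2 * num_keypoints regression outputs."""
    model = antialiased_cnns.resnet18(pretrained=False)
    # Swap the stem for a 1-channel version with the same hyperparameters,
    # so grayscale face crops can be fed in directly.
    old = model.conv1
    model.conv1 = nn.Conv2d(1, old.out_channels, kernel_size=old.kernel_size,
                            stride=old.stride, padding=old.padding, bias=False)
    # Replace the classification head with a 68*2 = 136-dimensional regressor.
    model.fc = nn.Linear(model.fc.in_features, num_keypoints * 2)
    return model
```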