Part 1

The neural network for Part 1 followed a relatively simple scheme: 3 convolutional layers, each followed by a ReLU and a max pool, finished with 2 fully-connected linear layers at the end; a rough sketch of the architecture follows. Below that are some results: a sampled image from the DataLoader with ground-truth keypoints, the training vs. validation accuracy per epoch, and some successes and failures. Note that because validation is only run after at least 1 epoch of training, even the initial validation point appears 'good', since some training has already been performed.
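The sketch below is only illustrative: the channel counts, kernel sizes, and flattened size are placeholders rather than my exact values.

            import torch.nn as nn
            import torch.nn.functional as F

            class NoseNet(nn.Module):
                """Sketch of the Part 1 scheme: 3x (Conv2d -> ReLU -> MaxPool2d), then 2 linear layers.
                Channel counts, kernel sizes, and the flattened size are placeholders."""
                def __init__(self):
                    super().__init__()
                    self.features = nn.Sequential(
                        nn.Conv2d(1, 16, 3), nn.ReLU(), nn.MaxPool2d(2),
                        nn.Conv2d(16, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
                        nn.Conv2d(32, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
                    )
                    self.fc1 = nn.Linear(32 * 5 * 8, 128)  # assumes a 60x80 input; depends on resolution
                    self.fc2 = nn.Linear(128, 2)           # (x, y) of the nose keypoint

                def forward(self, x):
                    x = self.features(x).flatten(1)
                    return self.fc2(F.relu(self.fc1(x)))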

Sample image from the DataLoader (already downscaled, thus low-res)

Accuracy per epoch (blue is training, orange is validation)

These two predictions had decent results (R = ground truth, B = prediction)

These two predictions had poor results (R = ground truth, B = prediction)

As I'm not super familiar with NNs, I'm not sure exactly why it failed. I noticed that the NN tended to simply return the average nose location across all images, so images whose noses were close to the average had good results, while those with faces at off angles or otherwise far from the average failed. Perhaps this is due to not enough data or variance in the data, or perhaps it is a product of overfitting.

Part 2

Part 2 was fairly similar to Part 1 in abstract structure, with some implementation differences: data augmentation was used in Part 2 (random slight rotations and brightness/contrast adjustments) to attempt to give the CNN more robustness. The detailed architecture of the CNN is shown below (not shown is a ReLU after each Conv2d layer):


            FaceNet(
                (conv1): Conv2d(1, 16, kernel_size=(5, 5), stride=(1, 1)) 
                (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
                (conv2): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
                (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
                (conv3): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1))
                (pool3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
                (conv4): Conv2d(64, 40, kernel_size=(3, 3), stride=(1, 1))
                (pool4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
                (conv5): Conv2d(40, 64, kernel_size=(3, 3), stride=(1, 1))
                (pool5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
                (fc1): Linear(in_features=960, out_features=1024, bias=True)
                (fc2): Linear(in_features=1024, out_features=116, bias=True)
            )
        

The following hyperparameters were used: learning rate of 0.001, batch size of 4, and shuffling enabled in the DataLoader; a rough sketch of the training loop is below. After that are more images showing the network's behavior: loss over time, a sampled image with ground-truth keypoints, two successes, two failures, and visualized filters.
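The batch size, learning rate, and shuffling above are as used; the optimizer choice and the dataset name in this sketch are assumptions for illustration.

            import torch
            from torch.utils.data import DataLoader

            # train_set is the keypoint dataset defined elsewhere; FaceNet is the model printed above.
            train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
            model = FaceNet()
            criterion = torch.nn.MSELoss()
            optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # optimizer choice is an assumption

            for epoch in range(20):
                total = 0.0
                for images, keypoints in train_loader:
                    optimizer.zero_grad()
                    loss = criterion(model(images), keypoints)
                    loss.backward()
                    optimizer.step()
                    total += loss.item()
                print(f"epoch {epoch}: avg MSE {total / len(train_loader):.4f}")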

Average MSE per epoch (20 total)

Sample DataLoader image. There is a slight rotation that is difficult to see (the image has been cropped in to remove the edges), and a randomized contrast and brightness adjustment that has inverted the pixel values, giving it the 'night vision' look.

Above are two pictures on which the CNN performed reasonably well.

Above are two pictures on which the CNN performed reasonably poorly.

As in Part 1, the CNN seemed to lean towards simply returning the average keypoint locations across all faces, perhaps due to overfitting and/or insufficient data. I tried several data augmentations (sketched below) to try and alleviate this issue, but unfortunately was not too successful.
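For illustration, the rotation plus brightness/contrast augmentation can be sketched as follows; the angle and jitter ranges are placeholders, and the keypoints have to be rotated along with the image so the labels stay correct.

            import random
            import numpy as np
            import torchvision.transforms.functional as TF

            def augment(image, keypoints):
                """Randomly rotate and jitter brightness/contrast of a PIL image, moving the
                (N, 2) keypoint array with it. Ranges are illustrative placeholders."""
                angle = random.uniform(-10, 10)
                image = TF.rotate(image, angle)
                image = TF.adjust_brightness(image, random.uniform(0.7, 1.3))
                image = TF.adjust_contrast(image, random.uniform(0.7, 1.3))

                # Rotate the keypoints about the image center by the same angle.
                # (TF.rotate is counter-clockwise on screen, and y grows downward in pixel coords.)
                w, h = image.size
                theta = np.deg2rad(-angle)
                rot = np.array([[np.cos(theta), -np.sin(theta)],
                                [np.sin(theta),  np.cos(theta)]])
                center = np.array([w / 2.0, h / 2.0])
                keypoints = (np.asarray(keypoints) - center) @ rot.T + center
                return image, keypoints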

Here, we see some of the visualized filters from the first layer of the CNN. The number of filters increases rapidly after the first convolutional layer, but the filters look similar in character.
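A minimal sketch of how such first-layer filters can be pulled out and plotted, assuming matplotlib and the FaceNet model above:

            import matplotlib.pyplot as plt

            # First-layer weights have shape (out_channels, in_channels, kH, kW); with a single
            # input channel each filter is just a small 2D kernel we can show as a grayscale image.
            weights = model.conv1.weight.data.cpu().numpy()
            fig, axes = plt.subplots(2, 8, figsize=(12, 3))
            for ax, w in zip(axes.flat, weights[:, 0]):
                ax.imshow(w, cmap='gray')
                ax.axis('off')
            plt.show()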

Part 3

For Part 3, I used the ResNet18 model prebuilt into PyTorch. Slight changes were made: the 3 input channels were changed to 1 to account for our images being grayscale, and the output layer was changed to 136, the number of facial keypoints times two. I used a learning rate of 0.05 and a batch size of 30. A sketch of these modifications and the detailed printed architecture can be seen below:
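A minimal sketch of those two modifications, assuming torchvision's built-in resnet18 (my exact construction may differ slightly):

            import torch.nn as nn
            from torchvision.models import resnet18

            model = resnet18(pretrained=False)
            # Grayscale input: 1 channel instead of the default 3.
            model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
            # 68 keypoints * 2 coordinates = 136 outputs.
            model.fc = nn.Linear(model.fc.in_features, 136)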


            ResNet(
                (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
                (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (relu): ReLU(inplace=True)
                (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
                (layer1): Sequential(
                    (0): BasicBlock(
                    (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (relu): ReLU(inplace=True)
                    (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    )
                    (1): BasicBlock(
                    (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (relu): ReLU(inplace=True)
                    (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    )
                )
                (layer2): Sequential(
                    (0): BasicBlock(
                    (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
                    (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (relu): ReLU(inplace=True)
                    (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (downsample): Sequential(
                        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
                        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    )
                    )
                    (1): BasicBlock(
                    (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (relu): ReLU(inplace=True)
                    (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    )
                )
                (layer3): Sequential(
                    (0): BasicBlock(
                    (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
                    (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (relu): ReLU(inplace=True)
                    (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (downsample): Sequential(
                        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
                        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    )
                    )
                    (1): BasicBlock(
                    (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (relu): ReLU(inplace=True)
                    (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    )
                )
                (layer4): Sequential(
                    (0): BasicBlock(
                    (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
                    (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (relu): ReLU(inplace=True)
                    (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (downsample): Sequential(
                        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
                        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    )
                    )
                    (1): BasicBlock(
                    (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    (relu): ReLU(inplace=True)
                    (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                    (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                    )
                )
                (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
                (fc): Linear(in_features=512, out_features=136, bias=True)
                )
        

Here is the training loss per epoch, hovering around 28 by epoch 15.

Here are the results of running the trained model on some testing set photos.

Here are the results of running the trained model on some photos from my collection. We see that unfortunately, none of the points fit particularly well; some discussion is below.

I realized fairly late into the project that the cause of the seemingly poor models was my augmentations, notably the way I cropped. Although I'm not sure of the details, the way in which I cropped and resized images seemed to cause the model to fail on images that were not cropped in a similar fashion - for example, images from my collection. Specifically, you may notice that all the predicted points on my images are shifted up and to the left, leading me to believe my implementation of the bounding-box cropping somehow upset the neural network training. Meanwhile, on images that underwent the same transformations as the test images, the model actually performed fairly well (though not perfectly).

Due to the above issues with the unusual way in which I implemented cropping, my Kaggle score was not too impressive, with an MSE of 123. I spent a great deal of time attempting to 'undo' my transformations so that the predicted keypoints could be correctly scaled back up to the original image, but was ultimately unsuccessful, and I did not have the time to redesign the transformations and re-run the neural networks. Depending on the image, some rescaled points fit the original image perfectly, some were slightly off, and some were far off. I unfortunately didn't recognize this issue until Part 3; had I seen this pattern in Part 2, I likely would've been able to fix the augmentation in time and submit a Kaggle entry with a much more competitive MSE.
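For completeness, the 'undo' step I was aiming for would look roughly like the sketch below, assuming each face was cropped to a bounding box (box_x, box_y, box_w, box_h) and then resized to a fixed network input size; the variable names here are hypothetical.

            import numpy as np

            def keypoints_to_original(pred, box_x, box_y, box_w, box_h, net_w, net_h):
                """Map predicted (x, y) keypoints from the resized crop back to the original image.
                pred: (N, 2) keypoints in the network's input frame (net_w x net_h pixels)."""
                pred = np.asarray(pred, dtype=float)
                scale = np.array([box_w / net_w, box_h / net_h])  # undo the resize
                offset = np.array([box_x, box_y])                 # undo the crop
                return pred * scale + offset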