MSE was the loss function used. As the spec required, the images were normalized and the coordinates were expressed as ratios between 0 and 1 when the loss was calculated. The plot below shows the loss decreasing over the epochs. The training loss is lower than the validation loss, and the validation loss is also still decreasing, which suggests that training has not resulted in underfitting or overfitting.
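As a rough illustration of the loss described above, here is a minimal sketch in plain Python. It assumes the image is 160 pixels wide and 120 pixels high, and the keypoint and prediction values are made up for the example; the helper names are hypothetical and not from my training code.

```python
def normalize_keypoints(points_px, width, height):
    """Convert (x, y) pixel coordinates to ratios in [0, 1]."""
    return [(x / width, y / height) for x, y in points_px]


def mse(pred, target):
    """Mean squared error over every coordinate of every keypoint."""
    squared = [
        (p - t) ** 2
        for pred_pt, target_pt in zip(pred, target)
        for p, t in zip(pred_pt, target_pt)
    ]
    return sum(squared) / len(squared)


# Hypothetical nose label at the center of a 160x120 (width x height) image.
target = normalize_keypoints([(80, 60)], width=160, height=120)
# Hypothetical network output, already expressed as ratios.
pred = [(0.55, 0.45)]
loss = mse(pred, target)
print(target, loss)
```

Because both the labels and the predictions are ratios, the loss does not depend on the image resolution.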
The first hyperparameter I varied was the learning rate. With 1e-2 (first picture), the step was too big: the loss bounced around areas of high loss and never found a low-loss region on the loss surface. With 1e-4 (second picture), the step was too small: the loss was flat and barely decreased, possibly because we got stuck in a flat region of the loss surface. With 1e-3 (third picture), the loss decreased steadily without bouncing around, so this was the best of the three. The numbers confirm this: 1e-3 had the lowest training and validation loss.
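The three regimes above can be reproduced on a toy problem. This is only a sketch, not my network: it runs plain gradient descent on a one-dimensional quadratic loss, and the curvature of 210 is a hypothetical value chosen so that the three learning rates from the text land in three different regimes.

```python
def run_gd(lr, curvature=210.0, w0=1.0, steps=20):
    """Plain gradient descent on loss(w) = 0.5 * curvature * w**2."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = curvature * w          # derivative of the quadratic loss
        w = w - lr * grad             # gradient descent update
        losses.append(0.5 * curvature * w * w)
    return losses


initial_loss = 0.5 * 210.0 * 1.0 ** 2
final_loss = {lr: run_gd(lr)[-1] for lr in (1e-2, 1e-3, 1e-4)}
for lr, loss in final_loss.items():
    print(f"lr={lr:g}: final loss {loss:.4g} (initial {initial_loss:.4g})")
```

In this toy setting 1e-2 overshoots the minimum on every step and the loss grows, 1e-4 moves so slowly that the loss is still far from the minimum after 20 steps, and 1e-3 ends with the lowest loss.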
The second hyperparameter I varied was the number of convolution layers. I compared 3 layers against 4 and saw that 3 layers was slightly better and definitely quicker than 4 layers. This is likely because 4 layers mean more backprop to do and many more parameters to tune, which may not have been tuned well enough in the given number of epochs. The performance for 3 layers vs 4 is shown below; 3 layers has a slightly lower loss.
Two success cases
Two failure cases. I think the failures occur when the person looks to the left or the right, so the nose moves away from the center. In most pictures the head is straight and the nose is at the center, so this is what the network learns. That is why it does well on front-on images but not on images where the head is turned.
I have also included examples of augmented data. These include pictures where the brightness has been increased, the saturation has been changed, or both.
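For reference, brightness and saturation changes of this kind can be sketched with NumPy as below. This is an assumption-laden illustration rather than my actual augmentation code: the function names, the factors, and the random stand-in image are all hypothetical.

```python
import numpy as np


def adjust_brightness(img, factor):
    """Scale pixel intensities; img is float RGB in [0, 1] with shape (H, W, 3)."""
    return np.clip(img * factor, 0.0, 1.0)


def adjust_saturation(img, factor):
    """Blend each pixel with its grayscale value; factor 0 gives a gray image."""
    gray = img @ np.array([0.299, 0.587, 0.114])  # standard luma weights
    gray = gray[..., None]                        # shape (H, W, 1) for broadcasting
    return np.clip(gray + factor * (img - gray), 0.0, 1.0)


rng = np.random.default_rng(0)
img = rng.random((120, 160, 3))      # random stand-in for a face photo
keypoints = np.array([[0.5, 0.5]])   # hypothetical nose label as ratios

brighter = adjust_brightness(img, 1.3)
desaturated = adjust_saturation(img, 0.0)
both = adjust_saturation(adjust_brightness(img, 1.3), 0.5)
```

Since these changes only alter pixel colors and do not move any pixels, the keypoint labels can be reused unchanged for the augmented images.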
I used 5 convolution layers followed by a couple of fully connected layers. All layers had a ReLU after them, and most also had a max pooling layer after them. The exact architecture is shown below. The input was a 120*160 image.
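To give a feel for how a 120*160 input shrinks through such a stack, here is a small shape calculation. The specifics are assumptions for illustration only: 120 is taken as the height, every convolution is 3x3 with no padding, and 2x2 max pooling follows the first four convolutions but not the fifth. The real kernel sizes and pooling positions are the ones in the architecture figure.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, pool_after):
    """Follow (height, width) through 3x3 convs, each optionally followed by 2x2 max pooling."""
    shapes = []
    for use_pool in pool_after:
        height, width = conv_out(height, 3), conv_out(width, 3)
        if use_pool:
            height = conv_out(height, 2, stride=2)
            width = conv_out(width, 2, stride=2)
        shapes.append((height, width))  # ReLU does not change the shape
    return shapes


shapes = trace_shapes(120, 160, pool_after=[True, True, True, True, False])
for i, shape in enumerate(shapes, start=1):
    print(f"after conv block {i}: {shape}")
```

Under these assumptions the last feature map is 3 by 6, so the first fully connected layer would take 3 * 6 * (number of channels in the last convolution) input features.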
MSE was the loss function used. As the spec required, the images were normalized and the coordinates were expressed as ratios between 0 and 1 when the loss was calculated. The plot below shows the loss decreasing over the epochs. The training loss is lower than the validation loss, and the validation loss is also still decreasing, which suggests that training has not resulted in underfitting or overfitting.
Two success cases
Two failure cases. I think the failures occur when the person looks to the left or the right, so the nose and the rest of the face move away from the center. In most pictures the head is straight, so this is what the network learns. That is why it does well on front-on images but not on images where the head is turned.
These are the first 12 filters of the first convolution layer. It looks like these filters are detecting edges of the image and even to a certain extent the middle of the image which is where important facial features such as the nose lie.
I have submitted to Kaggle. My team name is just my name, Sarthak Arora.
For this part, I used a standard pretrained resnet18 model from torchvision.models. The first change I made was to set the number of input channels in the first convolution layer to 1. The second change was to set the number of output features in the FC layer to 136 to match the number of outputs of our net: 68 points * 2 (x, y). This was the model:

ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)
Here is a picture (and one zoomed in) of how the train and validation loss progressed over 20 epochs.
Here are a few examples of the predictions of the neural net on the test set. The predictions seem to do well for pictures taken at all angles.
Here are the results of the net on 3 photos of my own. It worked well on images that were tightly zoomed in on just the face and not well on pictures with a lot of noise (things other than the face), because there was no bounding box. Overall, the network did a good job of predicting face shape, direction, and features.