In this project I trained several convolutional neural networks in PyTorch to perform regression. The first network was a 3-conv nose-tip detector for grayscale images, the second a 5-conv facial keypoint detector, and the third a modified ResNet18 for facial keypoint detection.
For this part, I trained a simple network for nose tip detection.
The architecture was:
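(For a text version of the architecture, here is a hypothetical PyTorch sketch of a 3-conv nose-tip regressor of this kind. The channel counts, kernel sizes, hidden width, and the 60×80 input resolution are all assumptions for illustration, not the exact values used.)

```python
import torch
import torch.nn as nn

# Hypothetical 3-conv nose-tip regressor. Channel counts, kernel sizes,
# and the assumed 60x80 grayscale input may differ from the real model.
class NoseNet(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),  # infers flattened size at first call
            nn.Linear(128, 2),              # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))

model = NoseNet()
out = model(torch.zeros(4, 1, 60, 80))  # batch of 4 grayscale 60x80 images
print(out.shape)  # torch.Size([4, 2])
```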
Here we see the nose keypoint in green.
If the axes are hard to read, disable dark mode; it hides the plot axes. The training loss decreases consistently, while the validation loss is more erratic but still lower by the end.
Note that I multiplied the loss by 100 to get a more readily interpretable value. I also omitted the first recorded epoch's loss, as it was much higher and obscured the shape of the rest of the curve.
Green is ground truth, red is predicted.
Here, possibly because of the shadowing on the person’s face, the nose is incorrectly detected to the right of where it should be.
A similar failure case, though less pronounced, again possibly due to lighting (even though the lighting here is not as harsh).
I experimented with changing the number of channels for the conv layers and the learning rate. Here are the final validation losses after training for about 25 epochs.
The best results came from a very low learning rate with many channels, but for most of the results above I used a learning rate of 0.001 with 16 channels, as it was a good balance: nearly the same loss with lower compute overhead.
Generally, it seems that too high a learning rate degrades model quality.
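The sweep over learning rates and channel counts can be sketched as a simple grid search. The exact grids below are assumptions (the writeup only fixes 0.001/16 as the chosen setting), and the evaluator here is a dummy stand-in for the real training loop, shaped to mirror the observed trend of low learning rate plus more channels doing best.

```python
import itertools

# Hypothetical sweep grids; the actual values tried are not all listed.
learning_rates = [1e-2, 1e-3, 1e-4]
channel_counts = [8, 16, 32]

def pick_best(grid, evaluate):
    """Return the (lr, channels) pair with the lowest validation loss."""
    results = {cfg: evaluate(*cfg) for cfg in grid}
    return min(results, key=results.get), results

grid = list(itertools.product(learning_rates, channel_counts))

# Dummy evaluator for illustration only: in a real sweep this would be
# "train for ~25 epochs, return final validation loss".
best, losses = pick_best(grid, lambda lr, ch: lr * 10 + 1.0 / ch)
print(best)  # (0.0001, 32)
```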
Here we see all the keypoints, and we can clearly see the effects of the data augmentation (rotation, shifting, brightness) applied for this second part, where I trained on all keypoints.
I used a learning rate of 0.001 and tried conv layers with both 16 and 32 channels. The precise architecture was:
All conv2d layers output 16 or 32 channels depending on the training run.
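(As a text sketch, a 5-conv keypoint regressor with a configurable channel width might look like the following in PyTorch. Only the conv width of 16 or 32 is fixed by the description above; the kernel sizes, the FC width, the 120×160 input, and the 58-point output count are assumptions for illustration.)

```python
import torch
import torch.nn as nn

# Hypothetical reconstruction of the 5-conv keypoint detector.
# Only the conv width (16 or 32) comes from the writeup; the rest
# (kernel sizes, FC width, input size, point count) is assumed.
def make_keypoint_net(channels=16, num_points=58):
    layers = []
    in_ch = 1
    for _ in range(5):
        layers += [nn.Conv2d(in_ch, channels, 3, padding=1),
                   nn.ReLU(),
                   nn.MaxPool2d(2)]
        in_ch = channels
    return nn.Sequential(*layers,
                         nn.Flatten(),
                         nn.LazyLinear(256), nn.ReLU(),
                         nn.Linear(256, num_points * 2))  # (x, y) per point

net = make_keypoint_net(16)
out = net(torch.zeros(2, 1, 120, 160))
print(out.shape)  # torch.Size([2, 116])
```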
Again, I omitted the first epoch and multiplied the loss by 100 to make the plot clearer and more interpretable. Here I show the 16-channel version, as it was much better (0.088 validation loss after 25 epochs vs. 0.248 for the 32-channel version).
Again, green = ground truth, red = predicted.
Here's an example showing that the model has a strong prior on how a face is arranged. The predicted face is coherent as a whole, but it is slightly offset and at the wrong angle, which makes some points more incorrect than others. In other words, if the model gets a few points wrong, it is likely to get many wrong, since it places the points in a face-like arrangement.
Here's an interesting result from the training set. It seems that rotations can confuse the network, which indicates that the data augmentation is indeed challenging it and probably reducing overfitting, though not enough that it can reliably predict rotated images.
For this part, I trained on the CSUA compute cluster using 2 GPUs. (I'm not sure whether the code actually used the GPUs, though.)
If I or others resubmit, this score and ranking are subject to change. The ranking could drop, but I believe the score cannot, given how Kaggle scoring works.
The architecture is the stock torchvision ResNet18. The only changes I made were giving the first layer one input channel and the last layer 136 output features, as described on the assignment website, to make it compatible with the dataset we used.
I trained with LR of 0.001 for 10 epochs with a batch size of 4.
Here I didn't omit the first epoch, and I divided by 100 after multiplying, so this is a more "raw" plot of the loss; it drops a lot in the first epoch. However, I don't think the model converged, and training for more epochs could have meaningfully improved the overall results.
I set aside part of the training dataset for validation. That’s better than using the test dataset as we can also compare to ground truth.
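A minimal sketch of such a holdout split with `torch.utils.data.random_split` follows; the 80/20 fraction and the toy tensors are assumptions for illustration, and the fixed generator seed just keeps the split reproducible.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the real training dataset (100 samples, 136 targets).
full = TensorDataset(torch.zeros(100, 1, 8, 8), torch.zeros(100, 136))

# Hold out an assumed 20% of the training data for validation.
n_val = len(full) // 5
train_set, val_set = random_split(
    full, [len(full) - n_val, n_val],
    generator=torch.Generator().manual_seed(0))
print(len(train_set), len(val_set))  # 80 20
```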
Red = predicted, green = ground truth
Unfortunately, it looks like the model failed to detect the face in any of the custom images. My hypothesis is that I didn't crop these images as tightly as the dataset's images were cropped, which would also explain why the predicted faces are larger than my face appears in the custom images.
I learned the importance of training with a small dataset first to catch bugs and get everything working before training on the full dataset.