Neural Network Project

Praveen Batra


In this project I trained several convolutional neural networks in PyTorch to perform regression. The first network was a 3-conv nose-tip detector for grayscale images, the second a 5-conv facial keypoint detector, and the third a modified ResNet18 for facial keypoint detection.

Part 1: Nose Tip Detection

For this part, I trained a simple network for nose tip detection.

The architecture was:
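A minimal sketch of what a 3-conv nose detector like this could look like; the kernel sizes, pooling, hidden width, and input resolution are my assumptions, not the exact values used in this project.

```python
import torch
import torch.nn as nn

class NoseNet(nn.Module):
    """Sketch of a 3-conv regressor that outputs one (x, y) nose-tip location."""

    def __init__(self, channels=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),  # infers flattened size on first forward
            nn.Linear(128, 2),              # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))

model = NoseNet()
out = model(torch.randn(4, 1, 60, 80))  # batch of grayscale images
```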

Sampled image from dataloader with ground-truth keypoints

Here we see the nose keypoint in green.

Training and validation MSE loss

If the axes are hard to read, disable dark mode, as it hides the axes of the plot. The training loss decreases consistently, while the validation loss is more erratic but still decreases by the end.

Note: I multiplied the loss by 100 to get a more readily interpretable value. I also omitted the first recorded epoch's loss, since it was much higher and made the shape of the loss curve harder to see.

Correctly detected images

Green is ground truth, red is predicted.

Incorrectly detected images

Here, possibly because of the shadowing on the person’s face, the nose is incorrectly detected to the right of where it should be.

A similar failure case, though not as pronounced, again possibly due to lighting (though the lighting is not as bad here).

Hyperparameter tuning

I experimented with the number of channels in the conv layers and with the learning rate. Here are the final validation losses after training for about 25 epochs.

The best results came from a very low learning rate with many channels, but for most of the results above I used an LR of 0.001 with 16 channels, as it was fairly balanced and achieved nearly the same loss with less compute overhead.
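The sweep above can be sketched as a simple grid search; `train_and_eval` here is a placeholder standing in for a full 25-epoch training run, and the grid values are illustrative, not the exact ones from my experiments.

```python
import itertools

def train_and_eval(lr, channels):
    # Placeholder for a full training run; returns a dummy validation
    # loss so the sweep below is runnable. Not real project code.
    return lr * 10 + 1.0 / channels

# Try every (learning rate, channel count) combination and keep the best.
results = {(lr, ch): train_and_eval(lr, ch)
           for lr, ch in itertools.product([1e-4, 1e-3, 1e-2], [8, 16, 32])}
best_lr, best_ch = min(results, key=results.get)
```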

Generally, it seems that an overly high learning rate reduces the quality of the model.

Part 2: Full Facial Keypoint Detection

Dataloader and data augmentation

Here, we see all the keypoints, and we can clearly see the effects of the data augmentation (rotation, shifting, brightness) applied for this part, where we train on all keypoints.
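The key detail in keypoint augmentation is that any geometric transform must be applied to the image and the keypoints together. A simplified sketch of the brightness and shift augmentations (using a wrap-around `np.roll` shift for brevity; the exact transforms and ranges used in the project may differ):

```python
import numpy as np

def augment(image, keypoints, max_shift=10, rng=None):
    """Randomly brighten and shift an image, moving keypoints to match.

    image: (H, W) float array in [0, 1]; keypoints: (N, 2) array of (x, y).
    """
    rng = rng or np.random.default_rng()
    # Random brightness scaling (photometric: keypoints unaffected).
    image = np.clip(image * rng.uniform(0.8, 1.2), 0.0, 1.0)
    # Random shift, applied identically to image and keypoints.
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    image = np.roll(image, (dy, dx), axis=(0, 1))  # wraps at edges (simplification)
    keypoints = keypoints + np.array([dx, dy], dtype=float)
    return image, keypoints
```

Rotation works the same way: the keypoints must be rotated about the image center by the same angle as the image.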

Model architecture

I used a learning rate of 0.001 and tried conv layers with both 16 and 32 channels. The precise architecture was:

All Conv2d layers output 16 or 32 channels depending on the training run.
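A hedged sketch of a 5-conv keypoint network in this spirit; only the conv depth and the 16-vs-32 channel option come from the text above, while the kernel sizes, pooling placement, input resolution, and keypoint count are my assumptions.

```python
import torch
import torch.nn as nn

def make_keypoint_net(channels=16, num_keypoints=58):
    """Sketch of a 5-conv network regressing (x, y) for each keypoint."""
    return nn.Sequential(
        nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        nn.Flatten(),
        nn.LazyLinear(num_keypoints * 2),  # one (x, y) pair per keypoint
    )

net = make_keypoint_net()
out = net(torch.randn(2, 1, 120, 160))
```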

Plot of loss

Again, I omitted the first epoch and multiplied the loss by 100 to make the plot clearer and more interpretable. Here I show the 16-channel version, as it performed much better (0.088 validation loss at the end of 25 epochs vs. 0.248 for the 32-channel version).

Good results

Again, green = ground truth, red = predicted.

Bad results

Here's an example showing that the model has a strong prior on how a face is arranged. The predicted face is coherent as a whole, but it is slightly offset and at the wrong angle, which makes some points more incorrect than others. In other words, if the model gets a few points wrong, it is likely to get many wrong, because it always places the points in a face-like arrangement.

Here's an interesting result from the training set. It seems that rotations can confuse the network, which indicates that the data augmentation is genuinely challenging it and probably reducing overfitting, though not enough for it to reliably predict rotated images.

Learned filters

Part 3: Train with Larger Dataset

For this part, I trained on the CSUA compute cluster using 2 GPUs (though I'm not sure the code actually used the GPUs).

Dataloader sampled images

Kaggle results and username

If I or others resubmit, this score/ranking is subject to change. I believe the ranking could decrease but the score cannot, given how Kaggle scoring works.

Detailed architecture and hyperparameters

The architecture is the stock torchvision ResNet18. The only changes I made were to give the first layer 1 input channel and the last layer 136 output features, as described on the assignment website, to make it compatible with the dataset we used.

I trained with LR of 0.001 for 10 epochs with a batch size of 4.

Plot of loss

Here I didn't omit the first epoch, and I divided the loss back down by 100 after multiplying, so this is a more "raw" plot of the loss. We can see it drops sharply in the first epoch. However, I don't think the model converged, and training for more epochs could have substantially improved the overall results.

Visualized predictions on Validation set

I set aside part of the training dataset for validation. This is better than using the test dataset, since we can also compare against the ground truth.
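Setting aside a validation split can be done with `torch.utils.data.random_split`; the split size and the toy dataset here are illustrative, not the actual values from the project.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset standing in for the real keypoint training set.
data = TensorDataset(torch.randn(100, 1, 8, 8), torch.randn(100, 136))

# Hold out 20 samples for validation, with a fixed seed for reproducibility.
n_val = 20
train_set, val_set = random_split(
    data, [len(data) - n_val, n_val],
    generator=torch.Generator().manual_seed(0))
```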

Red = predicted, green = ground truth

My custom results

Unfortunately, it looks like the model failed to detect the face in any of the custom images. My hypothesis is that I didn't crop these images as tightly as the dataset's images are cropped; that would also explain why the predicted faces are larger than my face appears in the custom images.

What did you learn?

I learned the importance of training with a small dataset first to catch bugs and get everything working before training on the full dataset.