The goal of this part is to predict the location of the nose keypoint in an input face image using a convolutional neural network. The first step is to create a dataloader for our face dataset. Below are some sample images from my dataloader, visualized with their ground-truth keypoints.
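A minimal sketch of such a dataset class (the class name and data layout are assumptions; the actual dataset format may differ):

```python
import numpy as np
from torch.utils.data import Dataset

class NoseKeypointDataset(Dataset):
    """Yields (grayscale image, nose (x, y) keypoint) pairs."""

    def __init__(self, images, keypoints):
        # images: list of HxW float arrays in [0, 1]
        # keypoints: list of (x, y) nose positions, normalized to [0, 1]
        self.images = images
        self.keypoints = keypoints

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx].astype(np.float32)[None, ...]  # add channel dim
        kp = np.asarray(self.keypoints[idx], dtype=np.float32)
        return img, kp
```

This can then be wrapped in a standard `torch.utils.data.DataLoader` for batching and shuffling.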
We now train our model for 25 epochs using the Adam optimizer, MSELoss as our criterion, and a learning rate of 0.001. Shown below is the plot of the training and validation MSE loss.
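The training loop can be sketched roughly as follows (the `model` and loader arguments are placeholders; the actual architecture is not shown here):

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    """Train with Adam + MSE loss; returns per-epoch train/val losses."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    train_losses, val_losses = [], []
    for _ in range(epochs):
        model.train()
        total = 0.0
        for imgs, kps in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), kps)
            loss.backward()
            optimizer.step()
            total += loss.item()
        train_losses.append(total / len(train_loader))
        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(i), k).item() for i, k in val_loader)
        val_losses.append(val / len(val_loader))
    return train_losses, val_losses
```

The two returned lists are what get plotted as the loss curves below.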
I also performed hyperparameter tuning to see if I could improve model performance. I first tried increasing the learning rate to 0.05. I then tried modifying the model architecture by increasing the sizes of the convolutional layers and adding one more fully connected layer. Here are graphs of the training and validation loss over time:
Varying Learning Rate
Changing Model Architecture
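The architecture change described above might look something like the following (the layer sizes here are illustrative assumptions, not the exact ones I used):

```python
import torch.nn as nn

class NoseNet(nn.Module):
    """Small nose-keypoint CNN with widened conv layers and an extra
    fully connected layer (exact sizes are assumptions)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),  # the added FC layer
            nn.Linear(128, 2),               # (x, y) of the nose
        )

    def forward(self, x):
        return self.head(self.features(x))
```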
Now we can evaluate our trained model to see the quality of its predictions. Shown below are 2 images that the model did well on, and 2 images that the model did not do so well on. Based on these examples, it seems like my model is able to generalize well to images where the face is in a standard orientation with the mouth closed. In the two images that my model did poorly on, the person's face is turned sideways in the first, and the mouth is open in the second.
We now want our model to detect all facial keypoints in a face image, not just the nose. We also apply some data augmentation to prevent the model from overfitting. I chose to apply small rotations between -15 and 15 degrees to the images as my method of data augmentation. Here are some samples from my dataloader:
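Because rotating the image also moves the keypoints, the augmentation has to transform both together. A sketch of the keypoint half of this in NumPy (the image itself would be rotated by the same angle, e.g. with `scipy.ndimage.rotate` or a torchvision transform):

```python
import numpy as np

def rotate_keypoints(kps, angle_deg, center):
    """Rotate an (N, 2) array of (x, y) keypoints by angle_deg about center.

    Positive angles rotate counter-clockwise in standard math coordinates;
    in image coordinates (y pointing down) the visual direction is flipped.
    """
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return (np.asarray(kps) - center) @ rot.T + center

# Draw a fresh random angle in [-15, 15] degrees for each training sample.
angle = np.random.uniform(-15, 15)
```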
I trained for 25 epochs using a learning rate of 0.01, a batch size of 8, the Adam optimizer, and MSE loss. Here is the graph of training and validation loss:
Shown below are 2 images that the model did well on, and 2 images that the model did not do so well on. Similar to the previous part, my model does well on faces in a front-facing orientation. The two images it does poorly on are ones where the person's face is turned to the side. My model may have overfit to frontal orientations, or may simply not have the capacity to generalize to different face orientations yet.
I also visualized the filters learned in the first and second convolutional layers of our network:
Layer 1:
Layer 2:
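The filters can be pulled straight out of a conv layer's weight tensor and normalized per filter for display (a sketch; the plotting itself, e.g. `plt.imshow` over each `filters[i, 0]`, is omitted):

```python
import torch
import torch.nn as nn

def conv_filters(layer):
    """Return a conv layer's filters as an (out, in, kH, kW) NumPy array,
    rescaled to [0, 1] per filter so they display as images."""
    w = layer.weight.detach().cpu().numpy()
    mins = w.min(axis=(1, 2, 3), keepdims=True)
    maxs = w.max(axis=(1, 2, 3), keepdims=True)
    return (w - mins) / (maxs - mins + 1e-8)

# Stand-in for the trained network's first conv layer.
layer1 = nn.Conv2d(1, 16, 5)
filters = conv_filters(layer1)
```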
In this part, we train on the iBUG face-in-the-wild dataset. For this dataset, I first had to crop the images to their specified bounding boxes, since we only want to feed the face portion of each image into the model. I also applied the same rotation augmentation from part 2, and added color jittering. Here are some images with labeled keypoints sampled from our dataloader.
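Cropping to the bounding box is just array slicing, with the keypoints shifted into the crop's coordinate frame. A sketch, assuming boxes come as (x, y, w, h) in pixels:

```python
import numpy as np

def crop_to_bbox(img, kps, bbox):
    """Crop img (H, W) to bbox = (x, y, w, h) and shift keypoints to match."""
    x, y, w, h = (int(round(v)) for v in bbox)
    # Clamp to the image bounds, since some boxes extend past the edges.
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + w, img.shape[1]), min(y + h, img.shape[0])
    crop = img[y0:y1, x0:x1]
    shifted = np.asarray(kps, dtype=np.float64) - [x0, y0]
    return crop, shifted
```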
The architecture I used for this part is based on ResNet-18, with a few modifications to fit our task. First, I changed the first convolutional layer to take 1 input channel, since our input images are grayscale. Second, I changed the output dimension of the final fully connected layer to 136, to account for the 68 keypoints (an x and y coordinate for each). I left the rest of the layers unchanged.
I trained for 25 epochs using a learning rate of 0.01. Here is the graph of training and validation loss:
For my final model, I trained on the entire training dataset (no more training/validation split) for 200 epochs. Here are some results of the model in action on some test set images.
Here are some results of the model in action on some images of top TFT streamers.
For my Kaggle submission, I trained my model on the entire provided dataset for 3300 epochs with a learning rate of 0.01 and a batch size of 16. My Kaggle username is Kevin Chen and my Kaggle score is 6.63344.
I chose to create a morph sequence using automated keypoint prediction. The images I chose for the sequence are the ones of TFT streamers shown above. Here is the morph sequence: