CS 194-26: Project 5 - Keypoints Detection!

Tejas Thvar, Fall 2021

Part 1 - Nose Tip Detection

Images from the IMM dataset were loaded using a custom data loader: each image was converted to grayscale and resized to (80, 60). Sample images with their annotated nose-tip keypoints, as well as the full landmark keypoints, are shown below.
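The loading step can be sketched roughly as below. This is a minimal, hypothetical version of the data loader (the real one parses the IMM `.asf` annotation files); the class name, normalization to [-0.5, 0.5], and treating (80, 60) as width x height are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset

# Hypothetical sketch of the IMM data loader: grayscale images resized to
# 80x60 (width x height), keypoints kept as normalized (x, y) coordinates.
class NoseDataset(Dataset):
    def __init__(self, images, keypoints):
        # images: list of H x W grayscale arrays/tensors (0..255)
        # keypoints: list of (x, y) nose-tip coords, normalized to [0, 1]
        self.images = images
        self.keypoints = keypoints

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = torch.as_tensor(self.images[idx], dtype=torch.float32)
        img = img.unsqueeze(0).unsqueeze(0)           # 1 x 1 x H x W
        img = F.interpolate(img, size=(60, 80), mode="bilinear",
                            align_corners=False)[0]   # 1 x 60 x 80
        img = img / 255.0 - 0.5                       # normalize to [-0.5, 0.5]
        kp = torch.as_tensor(self.keypoints[idx], dtype=torch.float32)
        return img, kp
```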

A simple CNN was then trained to perform nose-tip detection. The CNN was implemented in PyTorch and follows the design recommendations from the spec. The network, NoseNet, has 3 convolutional layers; its full construction and loss curves are detailed below.
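A plausible NoseNet matching this description is sketched below: three conv layers, each followed by ReLU and 2x2 max-pooling, then two fully connected layers regressing the (x, y) nose-tip coordinate. The specific channel counts (12/20/32) and hidden size are assumptions, not the exact values used.

```python
import torch
import torch.nn as nn

# Sketch of a 3-conv-layer NoseNet for 1 x 60 x 80 grayscale input;
# channel counts and the FC hidden size are illustrative assumptions.
class NoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 20, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),
            nn.Linear(128, 2),   # normalized (x, y) of the nose tip
        )

    def forward(self, x):        # x: B x 1 x 60 x 80
        return self.fc(self.features(x))
```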







Results of varying hyperparameters are also shown below. Increasing the kernel size to 7 [right] and raising the learning rate to 5e-3 [left] do not change the loss curves much. A larger learning rate makes each update step bigger, which shows up as jerkier training and validation curves. The larger kernel gives each filter more parameters and a wider receptive field, which should let the network fit the data somewhat faster; this is weakly apparent empirically.

Example results are shown below; the green point is the ground truth and the red point is the prediction [successes on the left two, failures on the right two]. My assumption is that there is not enough training data for the turned-head cases, making accurate predictions difficult. Also, when heads are turned, many more corner-like features appear in frame (ear, jaw, etc.), which can mislead the detector.


Part 2 - Full Keypoint Detection

To prevent overfitting, data augmentation was first performed on the dataset. Transformations follow the PyTorch custom-transformation tutorial: random rotation of up to 15 degrees, brightness changes of up to 10%, random translation of up to 10 pixels, and finally scaling to (240, 180). Examples of augmented data, with annotated keypoints, are shown below.
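One pure-PyTorch way to apply the rotation, translation, and brightness jitter while keeping the keypoints consistent with the warped image is sketched below (the final resize to (240, 180) is handled separately, as in Part 1). The exact ranges match the write-up, but the implementation via `grid_sample` and the self-consistent rotation convention are assumptions about how it could be done, not the actual code.

```python
import math
import random
import torch
import torch.nn.functional as F

# Sketch: rotate up to +/-15 deg about the center, translate up to 10 px,
# jitter brightness by +/-10%, and move the keypoints with the image.
def augment(img, kpts):
    # img: 1 x H x W float tensor in [0, 1]; kpts: N x 2 (x, y) pixel coords
    _, h, w = img.shape
    a = math.radians(random.uniform(-15, 15))
    dx, dy = random.uniform(-10, 10), random.uniform(-10, 10)
    img = (img * random.uniform(0.9, 1.1)).clamp(0, 1)   # brightness jitter

    c, s = math.cos(a), math.sin(a)
    cx, cy = (w - 1) / 2, (h - 1) / 2

    # inverse warp: for each output pixel, find the source pixel
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    px, py = xs - cx - dx, ys - cy - dy
    src_x = c * px - s * py + cx
    src_y = s * px + c * py + cy
    grid = torch.stack([(2 * src_x + 1) / w - 1,     # normalized coords for
                        (2 * src_y + 1) / h - 1],    # align_corners=False
                       dim=-1).unsqueeze(0)
    img = F.grid_sample(img.unsqueeze(0), grid, align_corners=False)[0]

    # forward transform for the keypoints (same rotation + translation)
    qx, qy = kpts[:, 0] - cx, kpts[:, 1] - cy
    out = torch.stack([c * qx + s * qy + cx + dx,
                       -s * qx + c * qy + cy + dy], dim=1)
    return img, out
```

Because `grid_sample` maps output pixels back to input pixels, the image uses the inverse of the transform applied forward to the keypoints, which keeps the two in sync.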

2.1: CNN Architecture

For the full keypoint detection task, a more complex CNN was needed. The detailed architecture is included below; as an overview, 5 convolutional layers were used with kernel sizes of 5 and 3. I tested multiple designs to arrive at this one, and the fully connected layers were sized partly by experiment (making sure the dimensions lined up at prediction time). I used dropout between the fully connected layers to reduce overfitting, which is especially important given the small training set; it causes the validation loss to exceed the training loss initially, but the two converge over time.

Training used a learning rate of 1e-3 for 100 epochs with a batch size of 10 to reach "convergence". Loss curves and sample predictions are shown below [correct on the left, incorrect on the right]. There is some intuition behind the errors: keypoints are harder to detect in more sharply rotated images, and thinner or more distorted faces are harder to localize accurately. Overall, though, the network performed well and recovered the general facial structure for almost every face.
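A sketch of this kind of architecture is below, sized for the (240, 180) inputs from the augmentation step and the 58 IMM landmarks. The channel progression, dropout rate, and FC width are assumptions chosen to make the dimensions line up, not the exact values used.

```python
import torch
import torch.nn as nn

# Sketch: 5 conv layers (5x5 then 3x3 kernels) with ReLU + 2x2 max-pool,
# dropout between the FC layers, regressing 58 * 2 keypoint coordinates.
# Channel counts (12/24/32/48/64) and Dropout(0.3) are assumptions.
class KeypointNet(nn.Module):
    def __init__(self, n_keypoints=58):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 24, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 48, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(48, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                       # 180x240 -> 5x7 after 5 pools
            nn.Linear(64 * 5 * 7, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, n_keypoints * 2),
        )

    def forward(self, x):                       # x: B x 1 x 180 x 240
        return self.fc(self.features(x))
```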







Learned 5 x 5 filters from the first convolutional layer are also visualized below.
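The filters can be extracted and tiled for display roughly as follows; this assumes the first `Conv2d` sits at `model.features[0]` (as in the architecture sketch above), and the padding value and single-row layout are illustrative choices.

```python
import torch

# Sketch: pull the learned first-layer filters out of a Conv2d and tile them
# into a single row (normalized to [0, 1]) for display with imshow.
def filter_grid(conv, pad=1):
    w = conv.weight.detach().cpu()                    # out_ch x in_ch x k x k
    w = w[:, 0]                                       # first input channel
    w = (w - w.min()) / (w.max() - w.min() + 1e-8)    # normalize to [0, 1]
    n, k = w.shape[0], w.shape[1]
    grid = torch.ones(k, n * (k + pad) - pad)         # white padding columns
    for i in range(n):
        grid[:, i * (k + pad): i * (k + pad) + k] = w[i]
    return grid   # show with plt.imshow(grid, cmap="gray")
```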



Part 3 - Kaggle Keypoint Detection

For this part of the project, the data was first loaded via the provided sample code and then transformed using the same methods as in Part 2. Each image was also cropped to just the face using the provided bounding boxes. Sample images are shown below:
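The crop step can be sketched as below. The (left, top, width, height) box format and the optional margin are assumptions about the dataset metadata; the returned offset is what the keypoints need to be shifted by after cropping.

```python
import torch

# Sketch: crop an image to its face bounding box, clamped to the image,
# returning the crop plus the (left, top) offset for shifting keypoints.
# Box format (left, top, width, height) is an assumption.
def crop_face(img, box, margin=0.0):
    # img: ... x H x W tensor; box: (left, top, width, height) in pixels
    left, top, bw, bh = box
    mx, my = int(bw * margin), int(bh * margin)
    h, w = img.shape[-2], img.shape[-1]
    t = max(0, int(top) - my)
    b = min(h, int(top + bh) + my)
    l = max(0, int(left) - mx)
    r = min(w, int(left + bw) + mx)
    return img[..., t:b, l:r], (l, t)
```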

For this part of the project, a larger ResNet18 model with pretrained weights was used. The key differences are that the first convolutional layer's input channel count was lowered to 1 (for grayscale input) and the last fully connected layer was replaced to output 68*2 keypoint coordinates. The full CNN architecture is detailed below.





The model was trained on the first 6000 images for 10 epochs with a learning rate of 1e-3, and validated on the last 666 images. Sample predictions and the loss curves are shown below.
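The train/validate loop used across the project can be sketched as below. The MSE loss on keypoint coordinates matches a regression setup like this one, but the choice of Adam as the optimizer is an assumption; the write-up only states the learning rate.

```python
import torch
import torch.nn as nn

# Sketch of a training loop with per-epoch validation; records average
# train and validation loss per epoch for the loss-curve plots.
def train(model, train_loader, val_loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # optimizer assumed
    loss_fn = nn.MSELoss()
    train_hist, val_hist = [], []
    for _ in range(epochs):
        model.train()
        total = 0.0
        for imgs, kpts in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(imgs), kpts)
            loss.backward()
            opt.step()
            total += loss.item()
        train_hist.append(total / len(train_loader))

        model.eval()
        total = 0.0
        with torch.no_grad():
            for imgs, kpts in val_loader:
                total += loss_fn(model(imgs), kpts).item()
        val_hist.append(total / len(val_loader))
    return train_hist, val_hist
```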