Tejas Thvar, Fall 2021
Images from the IMM dataset were processed using a custom data loader. Images were converted to grayscale and resized to (80, 60). Sample images with their annotated nose keypoints as well as landmark keypoints are shown below.
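A minimal sketch of such a data loader is shown here. The class name, the normalization to [-0.5, 0.5], and the assumption that keypoints arrive as (x, y) pairs are all illustrative, not taken from the original code.

```python
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image

class NoseKeypointDataset(Dataset):
    """Loads images, converts to grayscale, resizes to (80, 60) — a sketch."""

    def __init__(self, image_paths, keypoints):
        self.image_paths = image_paths
        self.keypoints = keypoints  # one (x, y) nose-tip coordinate per image

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("L")  # grayscale
        img = img.resize((80, 60))  # PIL takes (width, height)
        arr = np.asarray(img, dtype=np.float32) / 255.0 - 0.5  # center around 0
        tensor = torch.from_numpy(arr).unsqueeze(0)  # shape (1, 60, 80)
        return tensor, torch.tensor(self.keypoints[idx], dtype=torch.float32)
```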
After this, a simple CNN was trained to perform nose-tip detection. The CNN was implemented in PyTorch and follows the design recommendations from the spec. The network, NoseNet, has 3 convolutional layers; its full construction and loss curves are detailed below.
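A sketch of a 3-conv-layer network of this kind is below. The channel counts, kernel sizes, and hidden-layer width are assumptions for illustration; only the 3-layer shape and the 2-value (x, y) output follow from the text.

```python
import torch
import torch.nn as nn

class NoseNet(nn.Module):
    """3 conv layers + small regression head for a (1, 60, 80) grayscale input.

    Layer sizes here are illustrative assumptions, not the report's exact values."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # -> (12, 28, 38)
            nn.Conv2d(12, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # -> (16, 12, 17)
            nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),  # -> (32, 5, 7)
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 5 * 7, 128), nn.ReLU(),
            nn.Linear(128, 2),  # predicted (x, y) nose-tip coordinate
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```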
Results of varying hyperparameters are also shown below. Increasing the kernel size to 7 [Right] and raising the learning rate to 5e-3 [Left] do not change the loss curves much. Generally speaking, the learning rate controls how large each update to the model is, which is visible in the jerkier training and validation curves at the higher rate. The larger kernel gives each filter more parameters and a wider receptive field, which could in principle drive the loss down faster; this is only somewhat apparent empirically.
Example results are shown below. The green point is the ground truth and the red point is the prediction [successes on the left two, failures on the right two]. My assumption is that there is not enough training data for the turned-head cases, making accurate predictions difficult. In addition, when a head is turned, many more corner-like features (ear, jaw, etc.) appear in frame.
To prevent overfitting, data augmentation was first performed on the dataset. Examples of augmented data, with annotated keypoints, are shown below. Transformations were performed following the PyTorch custom-transformation tutorial: random rotation of up to 15 degrees, brightness changes of 10%, random translation of up to 10 pixels, and finally scaling to (240, 180).
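The subtle part of keypoint augmentation is that geometric transforms must be applied to the annotations as well as the pixels. A sketch of the rotation case is below (the function name and signature are assumptions); it uses the fact that PIL's `rotate` turns the image counter-clockwise about its center, which in image coordinates (y pointing down) corresponds to the rotation matrix with a flipped sine sign.

```python
import numpy as np
from PIL import Image

def random_rotate(img, keypoints, max_deg=15, rng=np.random):
    """Rotate a PIL image and its (x, y) pixel keypoints together — a sketch.

    keypoints: array of shape (n, 2). Returns (rotated image, rotated keypoints)."""
    deg = rng.uniform(-max_deg, max_deg)
    rotated = img.rotate(deg)  # counter-clockwise about the image center
    w, h = img.size
    cx, cy = w / 2.0, h / 2.0
    theta = np.deg2rad(deg)
    dx = keypoints[:, 0] - cx
    dy = keypoints[:, 1] - cy
    # visual CCW rotation in y-down pixel coordinates
    new_x = cx + dx * np.cos(theta) + dy * np.sin(theta)
    new_y = cy - dx * np.sin(theta) + dy * np.cos(theta)
    return rotated, np.stack([new_x, new_y], axis=1)
```

Translation and brightness jitter follow the same pattern: shift the keypoints along with the pixels for translation, and leave them untouched for brightness.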
Learned 5 x 5 filters from the first convolutional layer are also visualized below.
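One way to build such a visualization is to tile the first layer's weight tensor into a single 2-D array and display it with `imshow`. The helper below is a sketch; it assumes the first `Conv2d` is reachable as `model.conv[0]`, which may differ from the actual model definition.

```python
import numpy as np
import torch
import torch.nn as nn

def first_layer_filter_grid(model, n_cols=6, pad=1):
    """Tile the first conv layer's learned k x k filters into one 2-D grid.

    Assumes model.conv[0] is the first nn.Conv2d with a 1-channel input."""
    w = model.conv[0].weight.detach().cpu().numpy()  # (out_ch, in_ch, k, k)
    w = w[:, 0]  # single grayscale input channel -> (out_ch, k, k)
    n, k = w.shape[0], w.shape[1]
    n_rows = (n + n_cols - 1) // n_cols
    grid = np.full((n_rows * (k + pad) - pad, n_cols * (k + pad) - pad), w.min())
    for i in range(n):
        r, c = divmod(i, n_cols)
        grid[r * (k + pad):r * (k + pad) + k,
             c * (k + pad):c * (k + pad) + k] = w[i]
    return grid  # display with plt.imshow(grid, cmap="gray")
```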
For this part of the project, the data was first loaded via the provided sample code. It was then transformed using the same methods as in Part 2, and each image was cropped to just the face using the provided bounding boxes. Sample images are shown below:
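The cropping step also has to shift the keypoint annotations into the cropped frame. A minimal sketch, assuming the bounding box is given as (left, top, right, bottom) in pixel coordinates:

```python
import numpy as np
from PIL import Image

def crop_face(img, keypoints, bbox):
    """Crop a PIL image to the face bounding box and shift keypoints to match.

    bbox: (left, top, right, bottom) pixel coordinates — an assumed layout."""
    left, top, right, bottom = bbox
    cropped = img.crop((left, top, right, bottom))
    shifted = keypoints - np.array([left, top], dtype=np.float64)
    return cropped, shifted
```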
For this part of the project, a larger ResNet-18 model with pretrained weights was used. The key difference is that the first convolutional layer has its number of input channels lowered to 1, and the last fully connected layer is updated to output 68 × 2 keypoint coordinates. The full CNN architecture is detailed below.
The model was trained on the first 6000 images for 10 epochs with a 1e-3 learning rate. The model was validated on the last 666 images. Sample predictions and the loss curves are shown below.
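A training loop matching this setup (10 epochs, lr = 1e-3, per-epoch validation) might look like the sketch below; the use of Adam and MSE loss is an assumption, as the report does not name its optimizer or loss.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=10, lr=1e-3, device="cpu"):
    """Regression training loop with per-epoch validation — a sketch.

    Returns (train_losses, val_losses), one averaged value per epoch."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimizer
    loss_fn = nn.MSELoss()                             # assumed loss
    train_losses, val_losses = [], []
    for _ in range(epochs):
        model.train()
        total = 0.0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            total += loss.item() * x.size(0)
        train_losses.append(total / len(train_loader.dataset))
        model.eval()
        total = 0.0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                total += loss_fn(model(x), y).item() * x.size(0)
        val_losses.append(total / len(val_loader.dataset))
    return train_losses, val_losses
```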