By: Calvin Chen
For this section, I used the IMM Face Database to perform nose tip detection on the different faces in the dataset.
For the Dataloader in Part 1, I first converted the images to greyscale, normalized the pixel values around 0, and then resized the images down to 80 x 60.
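As a rough sketch of what this dataloader might look like in PyTorch (the class name and the pre-parsed annotation lists are illustrative assumptions; parsing the IMM annotation files is omitted):

```python
import cv2
import torch
from torch.utils.data import Dataset

class NoseDataset(Dataset):
    """Hypothetical Part 1 dataloader: greyscale -> normalize around 0 -> 80 x 60."""
    def __init__(self, image_paths, nose_points):
        self.image_paths = image_paths  # list of image file paths
        self.nose_points = nose_points  # list of (x, y) nose coords, assumed pre-parsed

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx], cv2.IMREAD_GRAYSCALE)
        img = img.astype('float32') / 255.0 - 0.5   # center pixel values around 0
        img = cv2.resize(img, (80, 60))             # cv2 takes (width, height)
        img = torch.from_numpy(img).unsqueeze(0)    # add channel dim: 1 x 60 x 80
        point = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return img, point
```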
Here are some images from the dataloader with their corresponding ground-truth keypoints on their noses.
Here, I defined the CNN architecture used for Part 1 as the following:
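As a rough code sketch of a network in this style (the channel counts and hidden layer size here are illustrative assumptions, not necessarily the exact ones used):

```python
import torch.nn as nn

class NoseNet(nn.Module):
    """Sketch of a small nose-detection CNN for 1 x 60 x 80 greyscale inputs."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 10, 128), nn.ReLU(),  # 60x80 pooled 3x -> 7x10
            nn.Linear(128, 2),                       # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```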
I trained the model over 20 epochs using an Adam optimizer with a learning rate of 0.001.
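A minimal sketch of that training setup, assuming MSE loss on the predicted coordinates and pre-built train/validation dataloaders (both assumptions):

```python
import torch

model = NoseNet()  # the CNN sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

train_losses, val_losses = [], []
for epoch in range(20):
    model.train()
    total = 0.0
    for imgs, points in train_loader:   # assumed DataLoader over NoseDataset
        optimizer.zero_grad()
        loss = criterion(model(imgs), points)
        loss.backward()
        optimizer.step()
        total += loss.item()
    train_losses.append(total / len(train_loader))

    model.eval()
    with torch.no_grad():               # validation pass for the plots below
        val_losses.append(sum(criterion(model(i), p).item()
                              for i, p in val_loader) / len(val_loader))
```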
Below are the training and validation losses plotted over the 20 epochs used to train the model.
Here are a few examples of where the CNN worked well and a few where it didn't. The CNN's accuracy seemed to depend mainly on the angle/direction the face was oriented in: the model was more accurate at labeling faces looking straight at the camera.
For this section, I went beyond just detecting the nose point and moved on to detecting all the keypoints in the face images. This meant the dataloader returned 58 facial points instead of just 1.
For this section's dataloader, I applied data augmentation to increase the accuracy and robustness of the model. This entailed:
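As an illustration of how such augmentations can be applied on the fly (random rotation and shift are shown here as typical choices, not necessarily the exact set used), note that the keypoints have to be transformed along with the image:

```python
import numpy as np
import cv2

def augment(img, points, max_angle=15, max_shift=10):
    """Randomly rotate and shift an image, transforming keypoints to match.
    img: H x W greyscale array; points: N x 2 array of (x, y) pixel coords."""
    h, w = img.shape
    angle = np.random.uniform(-max_angle, max_angle)
    tx, ty = np.random.uniform(-max_shift, max_shift, size=2)

    # affine matrix: rotate about the image center, then translate
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    M[:, 2] += (tx, ty)

    img_aug = cv2.warpAffine(img, M, (w, h))
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    return img_aug, pts @ M.T                             # same transform on keypoints
```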
For this section, I created my CNN architecture with the following:
For this model, I used an Adam optimizer with a learning rate of 0.01 and trained over 20 epochs.
For this part, I fed different original images into the network to see what prediction points the model would output. From these preliminary findings, faces oriented more towards the camera (not skewed or facing away) did better, likely because much of the training data was centered around an average, straight-facing face as well. Below are depictions of some of the images that did well and some that didn't.
After training the model, I took a look at the filters from the model's layers themselves. Here's what some of the filters from the first layer look like:
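A sketch of how those filters can be pulled out and displayed, assuming the first conv layer is reachable as model.conv[0] (the exact attribute name depends on how the model class is written):

```python
import matplotlib.pyplot as plt

weights = model.conv[0].weight.data.cpu()  # shape: out_channels x in_channels x kH x kW
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0].numpy(), cmap='gray')   # first input channel of each filter
    ax.axis('off')
plt.show()
```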
For this section, I constructed a dataloader similar to the one used in Part 2, the only main difference being the size of the output shape (224x224 instead of 120x160). Additionally, I used the bounding boxes labeled on the images to crop the faces out of the photos themselves.
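A hypothetical sketch of that crop-and-resize step (the (x, y, w, h) pixel box format is an assumption; the keypoints are shifted and rescaled so they stay aligned with the cropped face):

```python
import cv2
import numpy as np

def crop_and_resize(img, points, box, out_size=224):
    """Crop a face out of img using its bounding box, resize to out_size,
    and map the keypoints into the cropped frame."""
    x, y, w, h = box
    face = cv2.resize(img[y:y + h, x:x + w], (out_size, out_size))
    points = (np.asarray(points) - (x, y)) * (out_size / w, out_size / h)
    return face, points
```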
For this section, I constructed a CNN using ResNet18's architecture, with only minor tweaks: taking in one color channel instead of three, and outputting 136 values rather than 1000 (two coordinates for each of the 68 keypoints, instead of the 1000 class scores). Additionally, I trained this model using an Adam optimizer with a learning rate of 0.001 over 10 epochs.
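In code, those two tweaks look roughly like this, using torchvision's ResNet18:

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18()
# take 1 greyscale channel instead of 3 RGB channels
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# output 136 values (68 keypoints x 2 coordinates) instead of 1000 class scores
model.fc = nn.Linear(model.fc.in_features, 136)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```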
Below, I plotted the training and validation losses of the model over the 10 training epochs.
On Kaggle, I received an MAE of 15.78838.
Additionally, I visualized some of the images from the test set and plotted the predicted keypoints onto them.
Below you'll find images from the web that I ran through the model. It seems the model does much better on images where the face directly faces the camera, compared to images with slightly angled faces.