Before inferring all of the facial keypoints, we first focus on detecting just the nose. I start by writing a dataloader that extends the Dataset class to load the IMM Face Dataset. For each item queried from the dataset, I normalize the image and then resize it to 60x80. Here are some sampled images with their corresponding keypoints overlaid on top:
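As a sketch, a dataloader along these lines might look like the following. The class and argument names are my own, and the actual IMM file parsing is omitted; only the normalize-then-resize steps follow the text:

```python
import torch
from torch.utils.data import Dataset

class NoseKeypointDataset(Dataset):
    """Hypothetical loader sketch; real code would parse the IMM .asf files."""
    def __init__(self, images, keypoints):
        # images: list of HxW grayscale arrays in [0, 255]
        # keypoints: list of (x, y) nose coordinates normalized to [0, 1]
        self.images = images
        self.keypoints = keypoints

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = torch.as_tensor(self.images[idx], dtype=torch.float32).unsqueeze(0)
        # Normalize to roughly [-0.5, 0.5], then resize to 60x80 (HxW assumed)
        img = img / 255.0 - 0.5
        img = torch.nn.functional.interpolate(
            img.unsqueeze(0), size=(60, 80), mode="bilinear", align_corners=False
        ).squeeze(0)
        kp = torch.as_tensor(self.keypoints[idx], dtype=torch.float32)
        return img, kp
```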
I then design a neural network with 3 convolution layers and 2 fully connected layers, where each convolution layer is followed by a ReLU and a max pool. I train the model for 20 epochs using MSE loss and the Adam optimizer with a learning rate of 5e-04. Here are the training and validation loss curves during training:
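The network and training setup described above might be sketched like this. The conv depth, loss, optimizer, and learning rate follow the text; the channel counts, kernel sizes, and hidden width are my assumptions:

```python
import torch
import torch.nn as nn

class NoseNet(nn.Module):
    """Sketch of a 3-conv / 2-FC nose detector; widths are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # A 1x60x80 input shrinks to 32x7x10 after three 2x max pools
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 7 * 10, 128), nn.ReLU(), nn.Linear(128, 2)
        )

    def forward(self, x):
        return self.head(self.features(x))

model = NoseNet()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

def train_step(imgs, kps):
    # One optimization step: predict, compute MSE, backprop, update
    optimizer.zero_grad()
    loss = criterion(model(imgs), kps)
    loss.backward()
    optimizer.step()
    return loss.item()
```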
Below are a few results that my trained model produces. The orange dot is my predicted label, while the blue dot represents the ground truth label. The top row contains good results, while the bottom row contains suboptimal results. I believe my model fails in some cases because it is not very robust: it has trouble with faces in extreme orientations and poses. In addition, it seems to rely on intensity patterns to detect the nose, so it sometimes places the nose keypoint on a cheek where the intensity variation resembles that of a nose.
After detecting one facial keypoint, we move on to detecting all 58 of them. Using the same dataset, I slightly modify the dataloader from before to return all of the landmarks. This time, I also resize the images to 120x160 pixels, double the previous size. Here are some sampled images with their corresponding facial landmarks overlaid on top:
The neural network I designed for this task has 5 convolution layers and 2 fully connected layers, with each convolution layer again followed by a ReLU and a max pool. Here is the architecture of my network:
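As a sketch consistent with that description (5 conv layers, each followed by ReLU and max pooling, then 2 fully connected layers; the channel widths and hidden size are my assumptions):

```python
import torch
import torch.nn as nn

class LandmarkNet(nn.Module):
    """Sketch of a 5-conv / 2-FC landmark net; channel widths are assumptions."""
    def __init__(self, n_landmarks=58):
        super().__init__()
        chans = [1, 16, 32, 32, 64, 64]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*layers)
        # A 1x120x160 input shrinks to 64x3x5 after five 2x max pools
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 3 * 5, 256), nn.ReLU(),
            nn.Linear(256, n_landmarks * 2),
        )
        self.n_landmarks = n_landmarks

    def forward(self, x):
        # Return predictions as (batch, n_landmarks, 2) coordinate pairs
        return self.head(self.features(x)).view(-1, self.n_landmarks, 2)
```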
I train the model for 30 epochs, using MSE loss and the Adam optimizer with a learning rate of 1e-03. During training, I also augment the data with random horizontal flips. Here are the training and validation loss curves during training:
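Flipping an image horizontally also requires mirroring the keypoint x coordinates (and, strictly, swapping left/right landmark indices). A minimal sketch, assuming keypoints are stored as (x, y) pairs normalized to [0, 1]:

```python
import torch

def random_hflip(img, kps, p=0.5):
    """Flip image and keypoints together. img: (C, H, W); kps: (N, 2) in [0, 1]."""
    if torch.rand(()) < p:
        img = torch.flip(img, dims=[-1])  # flip along the width axis
        kps = kps.clone()
        kps[:, 0] = 1.0 - kps[:, 0]       # mirror the x coordinates
        # NOTE: a full pipeline would also swap symmetric landmark indices
        # (e.g. left eye <-> right eye) so each label keeps its meaning.
    return img, kps
```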
Additionally, here are the convolution filters that my neural network learned. These correspond to the first convolution layer:
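For reference, one way to render first-layer filters like these (a generic visualization sketch, not my exact plotting code) is to plot each output channel's kernel as a small grayscale image:

```python
import matplotlib.pyplot as plt

def show_first_layer_filters(conv, ncols=8):
    """conv: an nn.Conv2d; draws each out-channel kernel (first in-channel)."""
    w = conv.weight.detach().cpu()  # shape: (out_ch, in_ch, kh, kw)
    n = w.shape[0]
    nrows = (n + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(ncols, nrows))
    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i < n:
            ax.imshow(w[i, 0], cmap="gray")
    return fig
```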
The results that my trained model was able to produce are shown below. Once again, the orange dots are my predicted labels, while the blue dots represent the ground truth labels. The top row contains good results, while the bottom row contains suboptimal results. Since my data augmentation was not very extensive, I think my model still has trouble generalizing to varied poses, as you can see in the first bad example. It also struggles when faces show extreme expressions, as in the second bad example. In the future, I'd want to add random rotations when augmenting my data.
With my trained model, I was able to achieve a score of 19.11047 on Kaggle under the username 'William Ren'. Here are some results I produced on the test data:
Lastly, I ran predictions on a few of my own images as well, after adding custom bounding boxes. My model seems to perform fairly well on them, though it has some trouble pinning down the exact jaw outline.