In this project, I used neural networks to automatically detect facial keypoints. I built a convolutional neural network in PyTorch and trained and tested it on data from the IMM Face Database.
This first part involved building an initial toy model for nose tip detection. I first parsed the landmarks from the dataset and stored them in a Dataset object, then wrapped that dataset in the PyTorch DataLoader, splitting it into training and validation sets. Below are some samples of the annotated images.
[figure: sample annotated images from the dataset]
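For reference, here is a minimal sketch of the dataset setup. The 60x80 working size and the parse_nose_tip helper are hypothetical stand-ins for my actual loading and .asf parsing code, not exact reproductions of it.

import torch
from torch.utils.data import Dataset, DataLoader
from skimage import io, transform

class NoseTipDataset(Dataset):
    """Pairs each grayscale face image with its (x, y) nose-tip landmark."""

    def __init__(self, image_paths, landmark_paths):
        self.image_paths = image_paths
        self.landmark_paths = landmark_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = io.imread(self.image_paths[idx], as_gray=True)
        img = transform.resize(img, (60, 80))              # assumed working size
        img = torch.from_numpy(img).float().unsqueeze(0)   # 1 x H x W

        # parse_nose_tip is a hypothetical helper standing in for the actual
        # .asf landmark parsing; it returns a tensor of shape (2,).
        nose = parse_nose_tip(self.landmark_paths[idx])
        return img, nose

train_loader = DataLoader(NoseTipDataset(train_imgs, train_asfs),
                          batch_size=1, shuffle=True)
val_loader = DataLoader(NoseTipDataset(val_imgs, val_asfs), batch_size=1)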
My neural net structure is shown below. I followed the guidelines in the spec, keeping the output dimension of each convolutional layer at 12 and the kernel size at 3. I trained for 25 epochs with a learning rate of 0.001 and a batch size of 1.
Net(
(conv1): Conv2d(1, 12, kernel_size=(3, 3), stride=(1, 1))
(conv2): Conv2d(12, 12, kernel_size=(3, 3), stride=(1, 1))
(conv3): Conv2d(12, 12, kernel_size=(3, 3), stride=(1, 1))
(fc1): Linear(in_features=480, out_features=260, bias=True)
(fc2): Linear(in_features=260, out_features=2, bias=True)
)
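Here is a sketch of the module behind this printout. The printout omits activations and pooling, so the ReLU after each layer and the 2x2 max-pool after each convolution are assumptions; with 60x80 inputs they produce exactly the 480 flattened features fc1 expects.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=3)
        self.conv2 = nn.Conv2d(12, 12, kernel_size=3)
        self.conv3 = nn.Conv2d(12, 12, kernel_size=3)
        # 60x80 input -> three (conv, 2x2 pool) stages -> 12 x 5 x 8 = 480 features
        self.fc1 = nn.Linear(480, 260)
        self.fc2 = nn.Linear(260, 2)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = torch.flatten(x, 1)
        return self.fc2(F.relu(self.fc1(x)))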
Here is a plot of the training and validation MSE over 25 epochs.
[figure: training and validation MSE over 25 epochs]
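These curves come from a standard training/validation loop along the following lines. This is a sketch: the Adam optimizer is an illustrative choice, since only the learning rate, batch size, and epoch count are stated above; net, train_loader, and val_loader are as sketched earlier.

import torch
import torch.nn as nn
import torch.optim as optim

criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=1e-3)

train_losses, val_losses = [], []
for epoch in range(25):
    # One pass over the training set, updating weights.
    net.train()
    total = 0.0
    for img, pts in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(img), pts)
        loss.backward()
        optimizer.step()
        total += loss.item()
    train_losses.append(total / len(train_loader))

    # One pass over the validation set, no gradient updates.
    net.eval()
    total = 0.0
    with torch.no_grad():
        for img, pts in val_loader:
            total += criterion(net(img), pts).item()
    val_losses.append(total / len(val_loader))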
These are examples of my neural net correctly identifying the nose tip, with the ground truth point in red and the prediction in blue.
[figure: successful nose tip predictions]
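Overlays like these can be drawn with a small matplotlib helper along these lines (a sketch, assuming the points are already in pixel coordinates):

import matplotlib.pyplot as plt

def show_prediction(img, truth, pred):
    # img: 1 x H x W tensor; truth/pred: (x, y) in pixel coordinates
    plt.imshow(img.squeeze(), cmap="gray")
    plt.scatter(truth[0], truth[1], c="red", label="ground truth")
    plt.scatter(pred[0], pred[1], c="blue", label="prediction")
    plt.legend()
    plt.show()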
These are failure cases where the neural net does not correctly identify the nose tip. I believe this is due to the faces being tilted, which warps the facial features relative to the mostly upright training examples.
In this part, I applied the same pipeline to larger images and predicted the full set of 58 facial keypoints instead of just one. To prevent overfitting, I augmented the data with random rotations and random shifts of the face; a sketch of this augmentation follows below. I also tried torchvision's ColorJitter to randomly change the brightness and saturation of the face, but found it rather unsuccessful. Finally, I modified my CNN from before, adding two more convolutional layers and experimenting extensively with input and output channel sizes; the resulting architecture is printed after the sketch. I trained with a learning rate of 0.001, a batch size of 1, and 25 epochs.
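Here is a sketch of the rotation-and-shift augmentation. The angle and shift ranges are illustrative choices rather than my exact values, and the keypoints are assumed to be in pixel coordinates so they can be transformed alongside the image.

import math
import numpy as np
import torchvision.transforms.functional as TF

def augment(img, pts):
    # img: 1 x H x W tensor; pts: (N, 2) tensor of (x, y) pixel coordinates.
    angle = float(np.random.uniform(-15, 15))    # degrees, counter-clockwise
    dx, dy = (int(v) for v in np.random.randint(-10, 11, size=2))

    # Rotate about the image center, then shift right/down by (dx, dy).
    img = TF.rotate(img, angle)
    img = TF.affine(img, angle=0.0, translate=[dx, dy], scale=1.0, shear=[0.0])

    # Apply the same rotation-about-center and shift to every keypoint.
    # With y pointing down, a counter-clockwise rotation by a maps
    # (x, y) -> (x cos a + y sin a, -x sin a + y cos a) about the center.
    h, w = img.shape[-2:]
    cx, cy = w / 2.0, h / 2.0
    a = math.radians(angle)
    x, y = pts[:, 0] - cx, pts[:, 1] - cy
    out = pts.clone()
    out[:, 0] = cx + x * math.cos(a) + y * math.sin(a) + dx
    out[:, 1] = cy - x * math.sin(a) + y * math.cos(a) + dy
    return img, out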
Net(
(conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
(conv2): Conv2d(32, 16, kernel_size=(3, 3), stride=(1, 1))
(conv3): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(conv4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(conv5): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(pool3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(pool4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(pool5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(fc1): Linear(in_features=48, out_features=300, bias=True)
(fc2): Linear(in_features=300, out_features=116, bias=True)
)
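As before, a sketch of the module behind this printout; the 120x160 input size and the ReLU after each convolution are assumptions, chosen so that five (conv, pool) stages leave exactly the 48 flattened features fc1 expects.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)
        self.conv2 = nn.Conv2d(32, 16, kernel_size=3)
        self.conv3 = nn.Conv2d(16, 16, kernel_size=3)
        self.conv4 = nn.Conv2d(16, 16, kernel_size=3)
        self.conv5 = nn.Conv2d(16, 16, kernel_size=3)
        self.pool1 = nn.MaxPool2d(2, 2)
        self.pool2 = nn.MaxPool2d(2, 2)
        self.pool3 = nn.MaxPool2d(2, 2)
        self.pool4 = nn.MaxPool2d(2, 2)
        self.pool5 = nn.MaxPool2d(2, 2)
        # 120x160 input -> five (conv, 2x2 pool) stages -> 16 x 1 x 3 = 48 features
        self.fc1 = nn.Linear(48, 300)
        self.fc2 = nn.Linear(300, 116)   # 58 keypoints x (x, y)

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        x = self.pool3(F.relu(self.conv3(x)))
        x = self.pool4(F.relu(self.conv4(x)))
        x = self.pool5(F.relu(self.conv5(x)))
        x = torch.flatten(x, 1)
        return self.fc2(F.relu(self.fc1(x)))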
Here is a plot of the training and validation MSE over 25 epochs.
[figure: training and validation MSE over 25 epochs]
These are examples of my neural net correctly identifying the facial keypoints, with the ground truth points in red and the predictions in blue.
[figure: successful facial keypoint predictions]
These are failure cases where the neural net does not correctly identify the facial keypoints. I believe this is because these faces' structures differ somewhat from the rest of the dataset: the first example has a slim face and a broad smile; the second has a long chin with a beard and is also smiling.
Here are the learned filter visualizations for my neural net.
[figure: learned filters of the trained network]
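The visualization code itself came from Piazza; a sketch of the same idea, plotting each learned 3x3 filter of the first convolutional layer as a small grayscale image, looks roughly like this (assuming the trained net from above):

import matplotlib.pyplot as plt

# conv1 of the part-2 net has 32 single-channel 3x3 filters: shape (32, 1, 3, 3).
weights = net.conv1.weight.detach().cpu()
fig, axes = plt.subplots(4, 8, figsize=(12, 6))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w.squeeze(), cmap="gray")
    ax.axis("off")
plt.show()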
I used the PyTorch DataLoader and CNN tutorials linked in the spec, as well as the filter visualization code linked on Piazza.