Part 1: Nose Tip Detection

We began with a small convolutional neural network trained on a small face keypoint dataset to regress the location of only the nose keypoint. My network consisted of 3 convolutional layers with channel depths of 12, 20, and 32 (the input to layer 1 has 1 channel, since the inputs are grayscale images). Each convolutional layer was followed by a max pool of size 2 with stride 2 and a ReLU nonlinearity. The final feature map is flattened and decoded into the nose keypoint coordinates through a fully connected layer. I did not use batch norm layers since the problem seemed very small, but my network probably would have benefited from them. I downscaled each image to 1/8 of its original size before passing it into the network. By default, I trained with a learning rate of 0.001 and a batch size of 4, updating the model parameters with an Adam optimizer.
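As a rough sketch of what this might look like in PyTorch (the layer names, the padding choice, and the use of LazyLinear to avoid hard-coding the flattened feature size are my assumptions, and the dry-run input size is only a guess at the 1/8-scale resolution):

```python
import torch
import torch.nn as nn

class NoseNet(nn.Module):
    """Sketch of the Part 1 network; layer ordering follows the description above."""

    def __init__(self, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2  # padding choice is an assumption, not stated above
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size, padding=pad),   # grayscale input -> 12 channels
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(),
            nn.Conv2d(12, 20, kernel_size, padding=pad),
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(),
            nn.Conv2d(20, 32, kernel_size, padding=pad),
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(),
        )
        # LazyLinear avoids hard-coding the flattened feature size, which depends on the
        # downscaled input resolution; it regresses the single (x, y) nose coordinate.
        self.fc = nn.LazyLinear(2)

    def forward(self, x):
        x = self.features(x)
        return self.fc(torch.flatten(x, start_dim=1))

model = NoseNet()
model(torch.zeros(1, 1, 60, 80))  # dry run to initialize the lazy layer; size is a guess
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```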

Below are some examples of the dataset with the nose keypoint annotation visualized.

Below are some results of the network predicting the nose keypoint. This model was trained with the 3x3 kernel configuration and a learning rate of 0.001; the corresponding loss graph appears below, in row 4 of the hyperparameter comparison.

Predicted nose keypoints shown for test images 0, 2, 9, 10, 12, 26, 38, and 47.

We see that the network performs better in some cases than others. The most common trend is that the network outputs a good nose keypoint when the face is looking forward, but a suboptimal one when the face is turned to one side. This is probably more difficult for the network: when the subject faces the camera straight on, the face is roughly symmetric and the nose lies on the line of symmetry, whereas a turned face loses that symmetry and the nose sits off to one side.

I tried changing the size of the kernels as well as the learning rate to see how the hyperparameters impacted training and the final result. The rows represent the following experiments:

  1. 5x5 kernels (default is 3x3 kernels)
  2. 7x7 kernels (default is 3x3 kernels)
  3. Learning rate = 0.0001
  4. Learning rate = 0.001
  5. Learning rate = 0.01
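A hedged sketch of how these five runs could be configured, reusing the hypothetical `NoseNet` from the Part 1 sketch with everything else held fixed:

```python
# Hypothetical setup of the five runs above; only kernel size and learning rate vary.
configs = [
    {"kernel_size": 5, "lr": 1e-3},  # row 1
    {"kernel_size": 7, "lr": 1e-3},  # row 2
    {"kernel_size": 3, "lr": 1e-4},  # row 3
    {"kernel_size": 3, "lr": 1e-3},  # row 4 (default)
    {"kernel_size": 3, "lr": 1e-2},  # row 5
]
for cfg in configs:
    model = NoseNet(kernel_size=cfg["kernel_size"])
    model(torch.zeros(1, 1, 60, 80))  # dry run for the lazy layer
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg["lr"])
    # ... train with MSE loss on the predicted nose coordinate, batch size 4 ...
```

The per-row predictions and loss curves are shown below.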

Row 1: Image 0, Image 2, and the loss for 5x5 kernel training
Row 2: Image 0, Image 2, and the loss for 7x7 kernel training
Row 3: Image 0, Image 2, and the loss for learning rate = 0.0001 training
Row 4: Image 0, Image 2, and the loss for learning rate = 0.001 training
Row 5: Image 0, Image 2, and the loss for learning rate = 0.01 training

We see that the best performance comes from 3x3 kernels with a 0.001 learning rate, while the worst came from training with a learning rate of 0.01. Too high a learning rate makes the parameter updates unstable, so the loss fluctuates rather than converging smoothly.

Part 2: Full Facial Keypoints Detection

In part 2, we expanded our network and trained it to output all 58 facial keypoints. To make this work, we also needed to add data augmentation, since the dataset is too small for the network to converge well otherwise. My network architecture grew to 6 convolutional layers, with channel depths of 12, 20, 32, 40, 48, and 64. Each convolutional layer was followed by a ReLU layer, but max pool layers were only inserted after every other convolutional layer to avoid reducing the spatial resolution too aggressively. The final feature map is flattened and run through two fully connected layers with a ReLU in between to decode the coordinates of all 58 keypoints. I trained with a learning rate of 0.0001 and a batch size of 4, again updating the model parameters with an Adam optimizer. The images were downscaled to 1/4 scale before being fed into the network. My data augmentation involved random rotations of the image, followed by adjustments to its color, saturation, and brightness.
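A minimal sketch of this augmentation, assuming the keypoints are stored in pixel coordinates and the jitter is applied to a 3-channel float image tensor before grayscale conversion (torchvision treats positive rotation angles as counter-clockwise; the rotation and jitter ranges here are placeholders):

```python
import math
import random

import torch
import torchvision.transforms.functional as TF

def augment(image, keypoints, max_deg=15.0):
    """Sketch of the Part 2 augmentation: a random rotation plus photometric jitter.

    Assumes `image` is a (3, H, W) float tensor in [0, 1] and `keypoints` is an
    (N, 2) tensor of (x, y) pixel coordinates.
    """
    angle = random.uniform(-max_deg, max_deg)
    image = TF.rotate(image, angle)  # positive angle = counter-clockwise, about the image center
    h, w = image.shape[-2:]
    cx, cy = w / 2.0, h / 2.0  # approximate rotation center
    theta = math.radians(angle)
    dx, dy = keypoints[:, 0] - cx, keypoints[:, 1] - cy
    # In image coordinates (y pointing down), rotating the pixels counter-clockwise moves a
    # point (dx, dy) to (cos*dx + sin*dy, -sin*dx + cos*dy) relative to the center.
    keypoints = torch.stack(
        [math.cos(theta) * dx + math.sin(theta) * dy + cx,
         -math.sin(theta) * dx + math.cos(theta) * dy + cy],
        dim=1,
    )
    # Photometric jitter on brightness, saturation, and hue.
    image = TF.adjust_brightness(image, random.uniform(0.7, 1.3))
    image = TF.adjust_saturation(image, random.uniform(0.7, 1.3))
    image = TF.adjust_hue(image, random.uniform(-0.05, 0.05))
    return image, keypoints
```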

Below are some examples of data loaded from the dataset and augmented for training.

Below are some example results from the network on test images.

We see some trends in where the network performs well or poorly. The first set of images performed quite poorly; I hypothesize that this is due to the man's lack of hair. On the flip side, the last set of images also performed quite poorly for the opposite reason: the man has very wild hair. The second and third sets of images show good results, most likely because the subjects have more typical hair and expressions.

Below is the loss graph from training the network.

Below are some of the learned kernels from the input convolutional layer. These are the only kernels that operate directly on image pixels, so they are the only ones whose weights can be read as patterns in pixel space. We see that they are quite abstract in nature, which is to be expected: with only 12 output channels, the network cannot specialize them too heavily.
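A short sketch of how such a visualization could be produced (the attribute path used to reach the first convolutional layer is an assumption):

```python
import matplotlib.pyplot as plt

# Hypothetical visualization of the 12 learned first-layer filters; `model.features[0]`
# assumes the first Conv2d sits at the start of a Sequential, as in the Part 1 sketch.
weights = model.features[0].weight.detach().cpu()  # shape (12, in_channels, k, k)
fig, axes = plt.subplots(2, 6, figsize=(12, 4))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0], cmap="gray")  # show each filter's first input channel
    ax.axis("off")
plt.show()
```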

Part 3: Train With Larger Dataset

In part 3 of the project, we expanded beyond the small starting facial keypoint dataset and trained/tested on a much larger one. Since the dataset was much larger and more diverse, a more complex network architecture was needed. For this part of the project, I started with a pretrained ResNet18 model and made several adjustments. To reduce overfitting, I froze the weights of the first 3 convolutional layers of the network. I also replaced the last fully connected layer of ResNet18 with a fully connected layer of size 512x2048, followed by a ReLU and one more fully connected layer of size 2048x(num_keypoints x 2). I also expanded my data augmentation scheme by adding a random amount of Gaussian blur to the images and increasing the range of rotations applied. I trained the network with a learning rate of 0.0001 and a batch size of 4, updating the model parameters with an Adam optimizer. My Kaggle submission reports a mean absolute error (in pixels) of 6.01715.
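A hedged sketch of these modifications using torchvision's ResNet18 (exactly which layers are frozen, and the placeholder keypoint count, are my assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

num_keypoints = 68  # placeholder; substitute the keypoint count of the larger dataset

# ImageNet-pretrained ResNet18 (the `weights` API is torchvision >= 0.13; older
# versions use `pretrained=True`).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the earliest layers to reduce overfitting. Interpreting "the first 3
# convolutional layers" as conv1 plus the first residual block is my assumption.
for module in (model.conv1, model.bn1, model.layer1[0]):
    for p in module.parameters():
        p.requires_grad = False

# Replace the final 512 -> 1000 classifier with the regression head described above:
# 512 -> 2048, ReLU, then 2048 -> num_keypoints * 2 flattened (x, y) coordinates.
model.fc = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, num_keypoints * 2),
)

optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4
)
```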

Below are some results on the test set.

Below are the training and testing losses over the course of training.

Below are some results on my own images.

We see that, surprisingly, the network performs well even on the second image, where the face is almost entirely occluded by a mask. The only one it performs poorly on is the last image, which is a picture of a computer screen with a face on it. The artifacts introduced by the screen's pixels are probably what cause the most difficulty for the network.