Project 5: Facial Keypoint Detection with Neural Networks

Part 1: Nose Tip Detection

Data loader:

We first grayscale and downsize each image for faster processing. Then only the keypoint for the nose tip is considered; the other keypoints are disregarded.
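A minimal sketch of what this loader could look like (the class name NoseTipDataset, the 80x60 output size, and the NOSE_IDX index are our own placeholders, not fixed by the project):

    import cv2
    import numpy as np
    import torch
    from torch.utils.data import Dataset

    NOSE_IDX = 52  # placeholder: index of the nose-tip keypoint in the annotations

    class NoseTipDataset(Dataset):
        def __init__(self, image_paths, keypoints, out_size=(80, 60)):
            self.image_paths = image_paths  # list of image file paths
            self.keypoints = keypoints      # (N, num_points, 2) array, normalized to [0, 1]
            self.out_size = out_size        # (width, height) after downsizing

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, i):
            # Grayscale and downsize for faster processing.
            img = cv2.imread(self.image_paths[i], cv2.IMREAD_GRAYSCALE)
            img = cv2.resize(img, self.out_size).astype(np.float32) / 255.0 - 0.5
            # Keep only the nose-tip keypoint; disregard the rest.
            nose = self.keypoints[i][NOSE_IDX].astype(np.float32)
            return torch.from_numpy(img).unsqueeze(0), torch.from_numpy(nose)

Keeping the keypoints in normalized [0, 1] coordinates makes the labels invariant to the resize.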

CNN architecture:

We first attempt a CNN with 3 convolution layers, 3 max pool layers and 2 fully connected layers. The first convolution layer outputs 8 channels, the next outputs 16 channels and the third outputs 32 channels. All filters are 3x3. The last fully connected layer outputs 2 predictions, one for the x-coordinate and one for the y-coordinate.
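A sketch of this architecture in PyTorch. The class name NoseNet, the 80x60 input size, the padding=1 convolutions, and the 128-wide hidden layer are our own assumptions for details not stated above:

    import torch.nn as nn

    class NoseNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 60x80 -> 30x40
                nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 30x40 -> 15x20
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 15x20 -> 7x10
            )
            self.fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 7 * 10, 128), nn.ReLU(),
                nn.Linear(128, 2),  # (x, y) of the nose tip
            )

        def forward(self, x):
            return self.fc(self.features(x))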

Hyperparameter Tuning:

We vary the learning rate and the number of convolution layers, testing learning rate 0.0001 vs. 0.001 and 3 vs. 4 conv layers. Below are the training curves for each:

After 10 epochs, all models reach similar training and validation losses. However, with the smaller learning rate of 0.0001 the network converges more slowly. With four layers, the loss oscillates less than with three layers.
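The sweep itself can be organized as a simple grid; in this sketch, make_model, train_one_epoch, and evaluate are placeholders for the training code described above:

    import itertools
    import torch

    for lr, n_conv in itertools.product([1e-4, 1e-3], [3, 4]):
        model = make_model(num_conv_layers=n_conv)           # placeholder constructor
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        history = []
        for epoch in range(10):
            train_loss = train_one_epoch(model, optimizer)   # placeholder
            val_loss = evaluate(model)                       # placeholder
            history.append((train_loss, val_loss))           # points for the curves above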

Results:

Here are some sample model predictions (red) vs. ground truth (green):

Below is the training curve of the final model:

Here are some failed cases:

Versus correct cases:

We think the failed cases occur because the model looks at the shape of the face as a whole and predicts that the nose tip should be somewhere in the middle of the face. This approach fails when people look down or to the side, which moves the nose tip away from the face center. Note that the model predicts most accurately when the person looks straight at the camera rather than tilting their face downward or sideways.

Part 2: Full Facial Keypoints Detection

Data loader:

As before, we downsize and grayscale each image. Now we consider all 58 keypoints and add data augmentation. The augmentations we use are: random color jitter, rotation by a random angle in [-15, 15] degrees, and a shift by a random amount in [-10, 10] pixels.
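One subtlety worth showing: rotation and shift move the keypoints as well as the pixels, so the labels must be transformed together with the image. A sketch, assuming PIL images and pixel-space keypoints; the jitter ranges are illustrative:

    import numpy as np
    import torchvision.transforms.functional as TF

    def augment(img, pts):
        # img: PIL grayscale image; pts: (58, 2) array of (x, y) pixel coordinates.
        # Color jitter changes pixel values only, so the keypoints stay put.
        img = TF.adjust_brightness(img, np.random.uniform(0.7, 1.3))
        img = TF.adjust_contrast(img, np.random.uniform(0.7, 1.3))
        # Rotate the image and the keypoints about the image center by the same angle.
        angle = np.random.uniform(-15, 15)
        img = TF.rotate(img, angle)  # positive angle = counter-clockwise
        theta = np.deg2rad(angle)
        center = np.array([img.width / 2, img.height / 2])
        rot = np.array([[np.cos(theta), np.sin(theta)],
                        [-np.sin(theta), np.cos(theta)]])  # y-down image coordinates
        pts = (pts - center) @ rot.T + center
        # Shift the image and the keypoints by the same random offset.
        dx, dy = np.random.randint(-10, 11, size=2)
        img = TF.affine(img, angle=0, translate=(int(dx), int(dy)), scale=1.0, shear=0)
        return img, pts + np.array([dx, dy])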

CNN architecture:

We use five convolution layers, four max pool layers and two fully connected layers. The first convolution layer outputs 8 channels, the next outputs 16 channels, the third and fourth output 32 channels, and the last outputs 64 channels. All filters are 3x3. The last fully connected layer outputs 58*2 predictions, an x- and a y-coordinate for each of the 58 keypoints. We train for 15 epochs with a learning rate of 0.0001.
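A sketch of this architecture, with our own assumptions for the details not pinned down above (the class name FaceNet, 120x160 inputs, padding=1 convolutions, the fifth conv placed after the last pool, and a 256-wide hidden layer):

    import torch.nn as nn

    class FaceNet(nn.Module):
        def __init__(self, num_points=58):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 120x160 -> 60x80
                nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 60x80 -> 30x40
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 30x40 -> 15x20
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 15x20 -> 7x10
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),                  # no pool after the last conv
            )
            self.fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 7 * 10, 256), nn.ReLU(),
                nn.Linear(256, num_points * 2),  # (x, y) for each of the 58 keypoints
            )

        def forward(self, x):
            return self.fc(self.features(x))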

Results:

Below is the training curve of our model:

Visualization of learned filters in the first three layers:
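Since each filter is a small 3x3 kernel, it can be rendered directly as a grayscale image. A sketch, where model is an instance of the network sketched above and the features indices follow that nn.Sequential layout:

    import matplotlib.pyplot as plt

    def show_filters(conv, title):
        # Render each 3x3 filter as a tiny grayscale image; for layers with
        # more than one input channel we average over the input channels.
        w = conv.weight.detach().cpu().numpy().mean(axis=1)  # (out_channels, 3, 3)
        fig, axes = plt.subplots(1, len(w), figsize=(len(w), 1.5))
        for ax, f in zip(axes, w):
            ax.imshow(f, cmap='gray')
            ax.axis('off')
        fig.suptitle(title)
        plt.show()

    show_filters(model.features[0], 'conv1 filters')
    show_filters(model.features[3], 'conv2 filters')
    show_filters(model.features[6], 'conv3 filters')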

Here are some sample model predictions (red) vs. ground truth (green):

Here are some failed cases:

Versus correct cases:

We think the failure cases are due to the network 1) being biased towards faces that look straight at the camera and 2) looking at the shape of the face as a whole. Thus, in some of the failure cases, the subject is facing to the side or tilting their head, but the network still predicts keypoints as if they were looking straight ahead.

Part 3: Train With Larger Dataset

We use the same data augmentations as in Part 2, in addition to randomly shuffling the training set.
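Shuffling is handled by the data loader; a sketch, where train_set is the augmented dataset and the batch size is our own choice for illustration:

    from torch.utils.data import DataLoader

    # shuffle=True re-randomizes the sample order at the start of every epoch.
    train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)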

CNN architecture:

We use a pre-trained ResNet-18 model from the torchvision library. We make two modifications to the model: 1) the first conv layer is modified to take in grayscale images with only 1 input channel, and 2) the last fully connected layer is modified to output 68*2 predictions, an x- and a y-coordinate for each of the 68 keypoints. We train for 10 epochs with a learning rate of 0.0001.
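A sketch of the two modifications; we keep ResNet-18's default conv1 kernel, stride, and padding and change only the input channel count, which re-initializes that layer's weights:

    import torch
    import torch.nn as nn
    import torchvision

    model = torchvision.models.resnet18(pretrained=True)
    # 1) Accept 1-channel grayscale input instead of 3-channel RGB.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # 2) Output 68*2 values: an (x, y) coordinate for each of the 68 keypoints.
    model.fc = nn.Linear(model.fc.in_features, 68 * 2)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)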

Results:

Below is our model's training curve:

Here are some sample model predictions (red) vs. ground truth (green):

Here are some predictions on photos from my own collection:

Part 4: Bells & Whistles: Integrating with project 3

Using the face morphing algorithm from Project 3, we combine it with our facial keypoint detection model to morph between any two arbitrary faces without needing to manually annotate features.
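A sketch of the resulting pipeline, where predict_keypoints is a placeholder wrapping the Part 3 model and morph stands in for the Project 3 morphing routine:

    import numpy as np

    # pts have shape (68, 2); t sweeps the morph from face_a (t=0) to face_b (t=1).
    pts_a = predict_keypoints(model, face_a)
    pts_b = predict_keypoints(model, face_b)
    frames = [morph(face_a, face_b, pts_a, pts_b, t) for t in np.linspace(0, 1, 45)]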