Project 5 Report

Kevin Mo (SID 3034598971)

Introduction

In this project, we will explore automatic facial keypoint detection using machine learning techniques. In particular, we will train deep convolutional neural networks to detect keypoints in image data sourced from popular facial datasets, and compare various models.

Nose Tip Detection

In this section, I trained a toy network to detect the nose tip keypoint in images, using a dataset of 240 images of 60 distinct faces. For training, a 220/20 train/validation split was used, with a custom dataloader that loads the full 240-image dataset into memory and transforms each image (grayscale, normalization, resizing) into a format the network can easily digest; a sketch of the loader follows.
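A minimal sketch of such a loader (assuming the 240 images and their normalized nose coordinates are already parsed into Python lists; the class and argument names here are illustrative, not the exact ones used):

import torch
from torch.utils.data import Dataset
import torchvision.transforms.functional as TF

class NoseDataset(Dataset):
    """Keeps all 240 images in memory; images are PIL images and
    keypoints are (x, y) nose positions normalized to [0, 1]."""

    def __init__(self, images, keypoints, size=(60, 80)):  # (height, width)
        self.images = images
        self.keypoints = keypoints
        self.size = size

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = TF.to_grayscale(self.images[idx])
        img = TF.to_tensor(TF.resize(img, self.size))
        img = TF.normalize(img, mean=[0.5], std=[0.5])  # roughly center pixel values
        return img, torch.tensor(self.keypoints[idx], dtype=torch.float32)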

[Figure: three sample renders from the nose dataloader]

The nose detection network uses three convolution layers, each followed by ReLU and max pooling (stride 2), plus two fully connected layers. The code is shown below.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NosepointCNN(nn.Module):
    def __init__(self):
        super(NosepointCNN, self).__init__()

        # convolution layers
        self.conv1 = nn.Conv2d(1, 12, 5)
        self.conv2 = nn.Conv2d(12, 24, 5)
        self.conv3 = nn.Conv2d(24, 32, 5)

        # fully connected layers; 32 * 3 * 8 = 768 features remain
        # after the three conv + pool stages on the 80x60 input
        self.fc1 = nn.Linear(32 * 3 * 8, 120)
        self.fc2 = nn.Linear(120, 2)  # a single (x, y) nose position

    def forward(self, x):
        # max pooling over a (2, 2) window after each convolution
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = torch.flatten(x, 1)  # flatten all dimensions except the batch dimension
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

The network was initialized and trained with the Adam optimizer at a learning rate of 1e-3, using a batch size of 1, for 25 epochs. A sketch of the training loop is shown below, followed by the resulting MSE loss curves.
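This is a minimal sketch rather than the exact script; train_loader and val_loader are assumed to wrap the 220/20 split described above, and logging details are omitted:

import torch
import torch.nn as nn
import torch.optim as optim

model = NosepointCNN()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(25):
    model.train()
    for image, keypoint in train_loader:  # batch size 1
        optimizer.zero_grad()
        loss = criterion(model(image), keypoint)
        loss.backward()
        optimizer.step()

    # evaluate on the 20 held-out images each epoch for the validation curve
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(image), keypoint).item()
                       for image, keypoint in val_loader) / len(val_loader)
    print(f"epoch {epoch}: validation MSE {val_loss:.4f}")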

MSE loss

Below are some example predictions from the toy model.

Good examples:

good example
good example

Bad examples:

bad example
bad example

The model tended to fail on angled faces, with the failed predictions landing near other facial keypoints (e.g., the mouth) rather than the nose tip.

Full Facial Keypoints Detection

In this section, we use the same 240-image dataset to train a network to identify all 58 keypoints on a face. Much of the process is similar to our work on nose tip detection, but the following modifications are made:

  1. The resize transformation is changed so that the input image is larger (from 80x60 to 160x120).
  2. Additional augmentations were added to prevent overfitting, including color jitter in the brightness and saturation directions (a sketch of the pipeline follows this list).
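A sketch of that augmentation pipeline with torchvision (the jitter strengths are illustrative, and the jitter is applied before grayscaling so the saturation shift acts on the color image):

from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, saturation=0.3),  # photometric augmentation
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((120, 160)),  # torchvision expects (height, width)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])

Because these augmentations are purely photometric, the keypoint labels need no adjustment.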

Below are a few examples of our dataloader at work:

good example good example good example good example

We use an expanded network to train on this dataset, adding two more convolution layers with more channels. The network is shown below:

class FaceCNN(nn.Module):
    def __init__(self):
        super(FaceCNN, self).__init__()

        # convolution layers
        self.conv1 = nn.Conv2d(1, 12, 5)
        self.conv2 = nn.Conv2d(12, 32, 5)
        self.conv3 = nn.Conv2d(32, 64, 5)
        self.conv4 = nn.Conv2d(64, 128, 5)
        self.conv5 = nn.Conv2d(128, 32, 5)

        # fully connected layers; 32 * 3 * 8 = 768 features remain
        # after the conv stack on the 160x120 input
        self.fc1 = nn.Linear(32 * 3 * 8, 160, bias=True)
        self.fc2 = nn.Linear(160, 2 * 58, bias=True)  # 58 (x, y) keypoints

    def forward(self, x):
        # max pooling over a (2, 2) window after the first three convolutions
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = F.relu(self.conv4(x))  # conv4 and conv5 shrink the spatial size without pooling
        x = F.relu(self.conv5(x))
        x = torch.flatten(x, 1)  # flatten all dimensions except the batch dimension
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

The network was initialized and trained with the Adam optimizer at a learning rate of 1e-3, using a batch size of 1, for 30 epochs. Below are the results of the model.

Good examples:

good example
good example

Bad examples:

bad example
bad example

This time around, the network had trouble positioning and scaling some keypoints correctly even when the face looks directly at the camera, with some predictions mistaking features such as eyebrows for eyes or tracing a different jawline.

Besides the third image, the larger network identified keypoints rather accurately. The loss curves are shown below:

MSE loss

Below are some visualizations of the learned filters in the first few convolution layers:

CNN layer 1
CNN layer 1
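These grids can be produced directly from the trained weights; a small sketch of one way to do it (assuming model is the trained FaceCNN):

import matplotlib.pyplot as plt

# conv1 holds 12 single-channel 5x5 filters: weight shape (12, 1, 5, 5)
filters = model.conv1.weight.detach().cpu()
fig, axes = plt.subplots(2, 6, figsize=(9, 3))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.squeeze(0), cmap="gray")
    ax.axis("off")
plt.show()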

Train with Larger Dataset

In this part, we operate on a much larger dataset of faces in the wild and use a more established network. Training was performed locally with a custom dataloader that loads the dataset into memory and transforms each image (normalization, cropping, resizing) for network input. In particular, we use the dataset’s bounding boxes to crop each image and resize it, while applying additional augmentations such as color jitter, rotation, and random grayscaling. Below are some samples from the dataloader, followed by a sketch of the cropping step:

good example
good example
good example
good example
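The bounding-box crop at the heart of this loader can be sketched as follows; the (left, top, width, height) box format and the helper name are assumptions, and the rotation augmentation (which must rotate the keypoints as well) is omitted for brevity:

import numpy as np

def crop_and_resize(image, bbox, keypoints, out_size=224):
    # crop the face with the dataset's bounding box, assumed (left, top, width, height)
    left, top, w, h = bbox
    face = image.crop((left, top, left + w, top + h)).resize((out_size, out_size))
    # shift keypoints into the crop, then scale them to the resized image
    kps = (np.asarray(keypoints) - [left, top]) * [out_size / w, out_size / h]
    return face, kps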

The dataset was split 80/20 between training and validation, with the training data reshuffled every epoch.
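Concretely, with torch.utils.data (assuming dataset is the in-memory dataset described above):

from torch.utils.data import DataLoader, random_split

n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)  # reshuffled each epoch
val_loader = DataLoader(val_set, batch_size=32)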

The network utilized is ResNet18, which takes a color (3-channel) image as input and outputs the 68 keypoints (flattened to 136 values) from its final fully connected layer. During training, the model and all relevant data are moved to the GPU for faster parallel processing.
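Setting this up only requires swapping out the final fully connected layer; a minimal sketch using torchvision (whether pretrained weights were used is not recorded here, so the untrained default is an assumption):

import torch
import torch.nn as nn
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = models.resnet18()  # standard 3-channel input stem
model.fc = nn.Linear(model.fc.in_features, 68 * 2)  # 136 outputs: one (x, y) per keypoint
model = model.to(device)

Each batch from the dataloader is then moved to the same device with .to(device) before the forward pass.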

The network was initialized and trained with the Adam optimizer at a learning rate of 1e-3, using a batch size of 32, for 20 epochs. Below are the results of the model.

Validation dataset:

good example
good example
good example
good example

As we can see, our network does a much better job predicting keypoints on the validation data!

Below are the MSE loss curves for training and validation:

MSE loss

By the end of training, the model was able to reduce loss to around 1.5.

KAGGLE: The public score I obtained in the Kaggle competition is 10.40924. My Kaggle name is Kevin Mo (kevmo).

Using Other Images

I’ve also fed some of my own original images into the model to see if they would work. Many of the predictions were accurate enough!

Good examples:

good example
good example
good example

Bad examples:

bad example
bad example
bad example

Many of the errors the model makes were also visible in the earlier toy models, including trouble deciding where the face boundary is and keypoints misplaced onto the wrong features. However, I found that most of this could be remedied by making the crop tighter/smaller so that the face covers a majority of the 224x224 image, as in the sketch below.
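A hypothetical helper for that remedy, shrinking the detector’s box about its center before cropping (the 0.8 factor is illustrative):

def shrink_bbox(bbox, factor=0.8):
    """Shrink a (left, top, width, height) box about its center so the
    face fills more of the final 224x224 crop."""
    left, top, w, h = bbox
    cx, cy = left + w / 2, top + h / 2
    nw, nh = w * factor, h * factor
    return (cx - nw / 2, cy - nh / 2, nw, nh)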

Overall, while this project was challenging in parts, it was very worthwhile to apply my pre-existing ML knowledge to building and training a network.