Project 4: Facial Keypoint Detection with Neural Networks

Lizhi (Gary) Yang

Part 1: Nose Tip Detection

I used the PyTorch Dataset class and DataLoader to load the images and ground-truth keypoints into the pipeline. Here are sample images from my dataloader, visualized with their ground-truth nose keypoints.

[Figures: part1-sample1, part1-sample2, part1-sample3, part1-sample4]
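
For reference, here is a minimal sketch of the Dataset/DataLoader wiring described above. The class name NoseDataset and the preloaded train_images / train_keypoints lists are assumptions for illustration, not my exact code:

import torch
from torch.utils.data import Dataset, DataLoader

class NoseDataset(Dataset):
    def __init__(self, images, keypoints):
        self.images = images        # list of (H, W) grayscale float arrays
        self.keypoints = keypoints  # list of (x, y) nose-tip coordinates

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = torch.as_tensor(self.images[idx]).float().unsqueeze(0)  # (1, H, W)
        kp = torch.as_tensor(self.keypoints[idx]).float()             # (2,)
        return img, kp

# batch size 1 to match the training setup described below
train_loader = DataLoader(NoseDataset(train_images, train_keypoints),
                          batch_size=1, shuffle=True)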

The network has 3 convolutional layers with ReLU activations and stride 1, each followed by a max pooling layer with stride 2, and 2 fully connected layers. I used the Adam optimizer with a learning rate of 1e-3 and trained for 25 epochs with a batch size of 1.
The convolutional layers have 32, 24, and 12 output channels with kernel sizes of 7x7, 5x5, and 3x3 respectively. Below is the network construction and the train-validation loss graph:

import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):

    def __init__(self):
        super(NoseNet, self).__init__()
        # Three conv layers: 1 -> 32 -> 24 -> 12 channels,
        # with 7x7, 5x5, and 3x3 kernels respectively.
        self.conv1 = nn.Conv2d(1, 32, 7)
        self.conv2 = nn.Conv2d(32, 24, 5)
        self.conv3 = nn.Conv2d(24, 12, 3)
        # After the three conv + pool stages the feature map is 12 x 4 x 7.
        self.fc1 = nn.Linear(12 * 4 * 7, 6 * 4 * 7)
        self.fc2 = nn.Linear(6 * 4 * 7, 2)  # (x, y) of the nose tip

    def forward(self, x):
        # Each conv layer is followed by ReLU and 2x2 max pooling (stride 2).
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        # Flatten all dimensions except the batch dimension.
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
[Figure: part1-loss — train and validation loss]
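
For completeness, a minimal training-loop sketch matching the setup above (Adam, learning rate 1e-3, 25 epochs, batch size 1 via the dataloader). The MSE loss is an assumption here, standard for coordinate regression, and may differ from my exact code:

import torch.nn as nn
import torch.optim as optim

net = NoseNet()
optimizer = optim.Adam(net.parameters(), lr=1e-3)  # lr = 1e-3 as stated above
criterion = nn.MSELoss()  # assumed loss for coordinate regression

for epoch in range(25):  # 25 epochs
    net.train()
    for img, kp in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(img), kp)
        loss.backward()
        optimizer.step()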

Here are 2 examples that worked and 2 that did not. I think the model failed because it could not detect noses on angled faces: for the same person, the model consistently fails when the face is turned. The left two are the good examples and the right two are the failures.

[Figures: part1-good1, part1-good2 | part1-bad1, part1-bad2]



Part 2: Full Facial Keypoints Detection

For data augmentation, I applied a random shift of between -10 and 10 pixels, a random rotation of between -15 and 15 degrees, and random color jitter of brightness, contrast, saturation, and hue, all with a factor of 0.2 (a sketch of this pipeline follows the sample images below). Here are sample images from my dataloader, visualized with their ground-truth keypoints.

[Figures: part2-sample1, part2-sample2, part2-sample3, part2-sample4]
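
Here is a rough sketch of that augmentation pipeline, assuming PIL images and an (N, 2) array of pixel-coordinate keypoints; the helper structure is illustrative, not my exact code. The key point is that the geometric transforms must be applied to the keypoints as well, while the color jitter leaves them unchanged:

import random
import numpy as np
import torchvision.transforms.functional as TF
from torchvision import transforms

color_jitter = transforms.ColorJitter(
    brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2)

def augment(img, keypoints):
    # Random shift in [-10, 10] pixels along each axis.
    dx, dy = random.randint(-10, 10), random.randint(-10, 10)
    img = TF.affine(img, angle=0, translate=(dx, dy), scale=1.0, shear=0)
    keypoints = keypoints + np.array([dx, dy])

    # Random rotation in [-15, 15] degrees about the image center
    # (counter-clockwise; image y points down, hence the sign convention).
    angle = random.uniform(-15, 15)
    img = TF.rotate(img, angle)
    c = np.array([img.width / 2, img.height / 2])
    t = np.deg2rad(angle)
    rot = np.array([[np.cos(t), np.sin(t)],
                    [-np.sin(t), np.cos(t)]])
    keypoints = (keypoints - c) @ rot.T + c

    # Photometric jitter leaves the keypoints unchanged.
    img = color_jitter(img)
    return img, keypoints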

The network has 6 convolutional layers with ReLU activations and stride 1, the first 4 each followed by a max pooling layer with stride 2, and 2 fully connected layers. I used the Adam optimizer with a learning rate of 1e-3 and trained for 25 epochs with a batch size of 1.
The convolutional layers have 128, 64, 32, 16, 16, and 16 output channels respectively, all with 3x3 kernels. Below is the network construction and the train-validation loss graph:

import torch.nn as nn
import torch.nn.functional as F

class FaceNet(nn.Module):

    def __init__(self):
        super(FaceNet, self).__init__()
        # Six conv layers: 1 -> 128 -> 64 -> 32 -> 16 -> 16 -> 16 channels,
        # all with 3x3 kernels.
        self.conv1 = nn.Conv2d(1, 128, 3)
        self.conv2 = nn.Conv2d(128, 64, 3)
        self.conv3 = nn.Conv2d(64, 32, 3)
        self.conv4 = nn.Conv2d(32, 16, 3)
        self.conv5 = nn.Conv2d(16, 16, 3)
        self.conv6 = nn.Conv2d(16, 16, 3)

        # After the conv stack the feature map is 16 x 9 x 5.
        self.fc1 = nn.Linear(16 * 9 * 5, 8 * 9 * 5)
        self.fc2 = nn.Linear(8 * 9 * 5, 116)  # 58 keypoints x (x, y)

    def forward(self, x):
        # The first four conv layers are followed by 2x2 max pooling.
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = F.max_pool2d(F.relu(self.conv4(x)), 2)
        # The last two conv layers keep the spatial resolution.
        x = F.relu(self.conv5(x))
        x = F.relu(self.conv6(x))
        # Flatten all dimensions except the batch dimension.
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
[Figure: part2-loss — train and validation loss]

Here are 2 examples that worked and 2 that did not. I think the model failed because it is not good at detecting rotated faces and faces tilted upwards. A lack of data is another likely cause, since the IMM dataset only contains a limited number of images. The left two are the good examples and the right two are the failures.
[Figures: part2-good1, part2-good2 | part2-bad1, part2-bad2]


Here are the learned filters.

[Figure: part2-filter — learned filters]
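
A visualization like this takes only a few lines of matplotlib. This sketch assumes net is the trained FaceNet from above and that the first conv layer's 128 filters of shape 3x3 are being shown; the grid layout is arbitrary:

import matplotlib.pyplot as plt

weights = net.conv1.weight.data.cpu()  # shape (128, 1, 3, 3)
fig, axes = plt.subplots(8, 16, figsize=(16, 8))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0].numpy(), cmap='gray')
    ax.axis('off')
plt.savefig('part2-filter.png', bbox_inches='tight')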


Part 3: Train With Larger Dataset

On Kaggle I achieved a mean absolute error of 8.91051. My Kaggle username is lzyang, and I should show up on the leaderboard as Lizhi Yang.

The network is ResNet-18, with the input layer modified to take 1-channel grayscale images and the last fully connected layer modified to output a 136-dimensional vector for the 68 keypoints. I used the Adam optimizer with a learning rate of 1e-3 and trained for 20 epochs with a batch size of 1. The dataset is augmented with a random shift of between -10 and 10 pixels, a random rotation of between -15 and 15 degrees, and random color jitter of brightness, contrast, saturation, and hue, all with a factor of 0.2. Below is the network construction and the train-validation loss graph, which shows the train and validation losses over the 20 epochs; the x-axis was set automatically for less cluttered viewing:

from torchvision import models
import torch.nn as nn

net = models.resnet18()
# Accept 1-channel grayscale input instead of 3-channel RGB.
net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Regress 68 keypoints as a 136-dimensional (x, y) vector.
net.fc = nn.Linear(512, 136)
[Figure: part3-loss — train and validation loss]
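
At test time the predictions come out in the resized input's coordinate frame and have to be scaled back to each image's original resolution before writing the submission. A rough sketch, where the 224x224 input size and the variable names are assumptions for illustration:

import torch

net.eval()
with torch.no_grad():
    pred = net(test_img.unsqueeze(0)).view(68, 2)  # test_img: (1, 224, 224)
pred = pred.numpy()
# Scale from the network's input resolution back to the original image size.
pred[:, 0] *= orig_w / 224.0
pred[:, 1] *= orig_h / 224.0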

Here are some predictions from the test set:
[Figures: part3-test1, part3-test2, part3-test3, part3-test4, part3-test5]

Here are some of my own photos. You can see that the model has some trouble detecting my eyes when I wear glasses in the first 3 images; in the first and third especially, the eye detections are offset, probably because of the glasses' frames.

[Figures: part3-custom1, part3-custom2, part3-custom3, part3-custom4]


Bells & Whistles

Auto Morph

I incorporated the keypoint detection into project 3's morph function and generated a morph video of the 4 images of myself shown in part 3 above. Here is the link to the video in case the embed below does not load. It is implemented by changing project 3's load_points function to detect and return keypoints using the model trained in part 3.
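
A hypothetical sketch of the modified load_points; the 224x224 input size and the preprocessing are assumptions for illustration, not my exact code:

import numpy as np
import torch
from PIL import Image

def load_points(image_path, net, size=(224, 224)):
    # Run the part 3 model instead of reading hand-labeled points.
    img = Image.open(image_path).convert('L').resize(size)
    x = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0)
    x = x.unsqueeze(0).unsqueeze(0)       # shape (1, 1, H, W)
    with torch.no_grad():
        pts = net(x).view(68, 2).numpy()  # 68 (x, y) keypoints
    return pts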

Anti-aliased max pool

I trained the part 2 network on 300 images from the iBUG dataset, validating on 50, once with anti-aliased max pooling and once without. With anti-aliased max pooling, training tends to go more smoothly but more slowly. Below are the two train-validation loss graphs, the left from training with anti-aliased max pooling and the right without. The graphs show the train and validation losses over 20 epochs; the x-axis was set automatically for less cluttered viewing. A sketch of the pooling layer follows the graphs.

[Figures: bells-anti (with anti-aliased max pool) | bells-orig (without)]