COMPSCI 194-26: Project 5

Kaijie Xu

nortrom@berkeley.edu

Background

In this project, I use neural networks to automatically detect keypoints on the shape of a face. Given any face image, the model should locate these keypoints without manual annotation.

Part 1: Nose Tip Detection

The Dataset:

I used the IMM Face Database for this project.

The dataset contains 240 facial images of 40 people, with 6 images of each person taken from different viewpoints. For every image, 58 facial keypoints are predefined.

However, in Part 1 I only consider the keypoint corresponding to the nose tip.

The Dataloader:

Of the 240 images, I loaded the first 192 and their corresponding annotated points as the training set, and the remaining 48 as the validation set.
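To make the split concrete, here is a minimal sketch of what such a dataloader could look like, assuming the images have already been loaded as grayscale arrays (images) and the keypoints as (num_points, 2) coordinate arrays (points); the class name and batch size here are illustrative, not necessarily what was used verbatim.

import torch
from torch.utils.data import Dataset, DataLoader

class FaceKeypointDataset(Dataset):
    """Wraps pre-loaded grayscale images and their keypoints."""
    def __init__(self, images, keypoints):
        self.images = images        # list of HxW float arrays
        self.keypoints = keypoints  # list of (num_points, 2) arrays

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = torch.tensor(self.images[idx], dtype=torch.float32).unsqueeze(0)
        pts = torch.tensor(self.keypoints[idx], dtype=torch.float32)
        return img, pts

# First 192 images for training, remaining 48 for validation.
train_set = FaceKeypointDataset(images[:192], points[:192])
val_set   = FaceKeypointDataset(images[192:], points[192:])
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
val_loader   = DataLoader(val_set, batch_size=4)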

Here are a couple of sampled images from my dataloader with ground-truth keypoints:

Model Architecture:

Net(
  (conv1): Conv2d(1, 12, kernel_size=(7, 7), stride=(1, 1))
  (conv2): Conv2d(12, 18, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(18, 24, kernel_size=(3, 3), stride=(1, 1))
  (conv4): Conv2d(24, 32, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=64, out_features=20, bias=True)
  (fc2): Linear(in_features=20, out_features=2, bias=True)
)
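Here is a sketch of how this architecture and its training loop can be written in PyTorch. The placement of ReLU and 2×2 max pooling after each conv layer is inferred from the flattened size of 64 (which works out exactly for an 80×60 grayscale input); the learning rate shown is illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, 7)
        self.conv2 = nn.Conv2d(12, 18, 5)
        self.conv3 = nn.Conv2d(18, 24, 3)
        self.conv4 = nn.Conv2d(24, 32, 3)
        self.fc1 = nn.Linear(64, 20)
        self.fc2 = nn.Linear(20, 2)

    def forward(self, x):
        # ReLU + 2x2 max pool after each conv (assumed; this yields the
        # flattened size of 64 for an 80x60 grayscale input)
        for conv in (self.conv1, self.conv2, self.conv3, self.conv4):
            x = F.max_pool2d(F.relu(conv(x)), 2)
        x = torch.flatten(x, 1)
        return self.fc2(F.relu(self.fc1(x)))  # predicted (x, y) of the nose tip

model = Net()
criterion = nn.MSELoss()  # coordinate regression
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr illustrative

for epoch in range(25):
    model.train()
    for imgs, pts in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(imgs), pts.view(pts.size(0), -1))
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(imgs), pts.view(pts.size(0), -1)).item()
                       for imgs, pts in val_loader) / len(val_loader)
    print(f"epoch {epoch}: val loss {val_loss:.4f}")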

After training with the Adam optimizer (which is also used in later parts), here is my train and validation accuracy over 25 epochs:

Result:

Correct Cases:

Here are two examples on which the model worked correctly.

My model works well on them because the subjects face straight toward the camera. In other words, the network likely learns most of its information from the center of the image, so keypoints in these "normal" frontal images are easy to detect.

Incorrect Cases:

Here are two examples on which the model worked incorrectly.

Since there is not enough data, the network fails to locate the keypoints correctly for faces turned in other directions (the left one, which is the only image with bad performance). And since most of the faces are devoid of expression, it is also hard to capture the features of the smiling one (the right one, though the prediction is close).

Part 2: Full Facial Keypoints Detection

The Dataset:

I used the same dataset as in Part 1, but in Part 2 I predict all 58 keypoints instead of just the nose point.

The Dataloader:

In Part 2, I resized the images in the dataset to 160×120 instead of 80×60 to retain more training information.

To prevent overfitting, I also created an augmentation class that randomly rotates the images and changes their brightness; a sketch of the transform follows.
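The augmentation class itself is not reproduced here, but its core transform could look roughly like this; the rotation and brightness ranges are illustrative, and the keypoints must be rotated around the image center in step with the image.

import random
import numpy as np
import torchvision.transforms.functional as TF

def augment(img, pts):
    # img: (1, H, W) float tensor; pts: (58, 2) array of pixel (x, y) coords
    img = TF.adjust_brightness(img, random.uniform(0.7, 1.3))  # range illustrative

    angle = random.uniform(-15, 15)  # degrees; range illustrative
    img = TF.rotate(img, angle)

    # Rotate the keypoints around the image center by the same angle.
    h, w = img.shape[1], img.shape[2]
    theta = np.deg2rad(angle)
    c, s = np.cos(theta), np.sin(theta)
    center = np.array([w / 2, h / 2])
    d = pts - center
    # The sign convention must match TF.rotate's rotation direction in image
    # coordinates; worth verifying by plotting one augmented sample.
    pts = np.stack([c * d[:, 0] + s * d[:, 1],
                    -s * d[:, 0] + c * d[:, 1]], axis=1) + center
    return img, pts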

Here are some examples:

Model Architecture:

Net(
  (conv1): Conv2d(1, 12, kernel_size=(7, 7), stride=(1, 1))
  (conv2): Conv2d(12, 18, kernel_size=(6, 6), stride=(1, 1))
  (conv3): Conv2d(18, 24, kernel_size=(5, 5), stride=(1, 1))
  (conv4): Conv2d(24, 32, kernel_size=(4, 4), stride=(1, 1))
  (conv5): Conv2d(32, 42, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=2772, out_features=400, bias=True)
  (fc2): Linear(in_features=400, out_features=116, bias=True)
)

Here is my train and validation accuracy over 50 epochs:

Result:

Good Cases:

Bad Cases:

Here are some learned filters from each conv layer.

Below are the first two filters of conv1, conv2, and conv3.

Below are the first two filters of conv4 and conv5.
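These filter grids can be produced by reading the conv weights directly; a minimal sketch (the grid size is illustrative):

import matplotlib.pyplot as plt

def show_filters(conv, n=8):
    # conv.weight has shape (out_channels, in_channels, kH, kW);
    # plot the first n filters, first input channel only.
    weights = conv.weight.detach().cpu()
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
    for i, ax in enumerate(axes):
        ax.imshow(weights[i, 0].numpy(), cmap="gray")
        ax.axis("off")
    plt.show()

show_filters(model.conv1)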

Part 3: Train With Larger Dataset

For Part 3, I use a larger dataset, the iBUG faces-in-the-wild dataset, to train a facial keypoints detector.

I perform data augmentation similar to the previous part: all images are first converted to grayscale and rescaled, then randomly rotated, randomly flipped, and brightness-jittered.

Sample Images

I use the pretrained ResNet-18 model. To fit my training set, I modify the first layer and the last layer and leave the other parameters unchanged.
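Concretely, the two replaced layers look like this; the rest of the pretrained network is kept as-is.

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)
# First layer: accept 1-channel grayscale input instead of 3-channel RGB.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Last layer: output 68 keypoints x 2 coordinates = 136 values.
model.fc = nn.Linear(model.fc.in_features, 136)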

I trained for 65 epochs with learning rate = 0.0001.

Here is the detailed model architecture.

Model Architecture:

ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)

Here is the loss curve for the last 10 epochs.

Result:

Sample keypoints-detected images in the testing set:

My images with keypoints predicted:

My model does not work well on the first image, probably because of my glasses; the second one works a little better since my glasses are smaller there, but the result is still not satisfying.

The image of Trump works pretty well, but the one of Depp does not, due to his "messy" hair.

Bells & Whistles

I also tried using the network to automatically detect the keypoints for the given images and then morph between them as in Project 3.
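A rough sketch of the detection step, assuming the network was trained on 224×224 grayscale inputs and predicts pixel coordinates at that resolution (both assumptions):

import torch
import torchvision.transforms.functional as TF

def detect_keypoints(model, img):
    # img: (1, H, W) grayscale tensor at its original resolution
    h, w = img.shape[1], img.shape[2]
    x = TF.resize(img, [224, 224]).unsqueeze(0)  # network input size assumed
    model.eval()
    with torch.no_grad():
        pts = model(x).view(-1, 2)  # 68 (x, y) pairs
    # Map predictions back to the original resolution before morphing.
    pts[:, 0] *= w / 224
    pts[:, 1] *= h / 224
    return pts

The returned points can then stand in for the hand-clicked correspondences used in the Project 3 morphing pipeline.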

Here is the sample morph video generated this way.