Facial Key Point Detection with Neural Networks

Roth Yin | rothyin@berkeley.edu

Part 1: Nose Tip Detection

Use torch.utils.data.DataLoader to load the data. Reserve 20% for validation.

Resize the images to 60 * 80, adjust the landmarks accordingly. PIL has better anti-aliasing when resizing.
Transform the data so that it ranges from -0.5 to 0.5

samples with ground truth landmarks

Create a CNN with torch.nn.Module with

3 layers of convolution such that
- the number of out-channels increases from 12 to 22 to 32
- the filter sizes decreases from 7 to 5 to 3
2 layers of fully connected with feature size from 3872 to 1024 to 2
ReLU and maxpool after each convolution layers
Flatten before the fully connected layers
ReLU after the first fully connected layers

Train with torch.optim.Adam with learning rate 1e-3 and use loss function torch.nn.MSELoss

The tuning result is as follows:

In terms of the learning rate, 3e-3 plateaus in the beginning, indicating that the learning rate is too high. 5e-4 does not make too much progress, which means that it steps too slow and is trapped in a local minimum.
In terms of the batch sizes, batch size of 2 makes the training oscalates too much. Batch size of 16 makes the training plateau too early.

successful examples
failed examples

The successful cases are mainly the front face, which is standard and learned well by the network. The side facing images are prone to error. In the above examples, we can see that the network learned that the nose point is right under some shade, but was not good enough to tell apart different kinds of shade.

Part 2: Full Facial Key Points Detection

Use torch.utils.data.DataLoader to load the data. Reserve 20% for validation.

Resize the images to 120 * 160, adjust the landmarks accordingly. PIL has better anti-aliasing when resizing.
Transform the data so that it ranges from -0.5 to 0.5
Because now we are predicting all facial points, to prevent overfitting, augment the data with
- torchvision.transforms.ColorJitter
- torchvision.transforms.RandomAutocontrast
- torchvision.transforms.RandomEqualize
- Rotation: Because the landmarks have to rotate with the images as well, implement the rotation manually. For a random probability,
  - Use torchvision.transforms.functional.rotate to rotate the images
  - Use rotation matrix to rotate the landmarks
- Horizontal flip: Because the landmarks have to flip with the images as well, implement the flip manually. For a random probability,
  - Use torchvision.transforms.functional.hflip to flip the images
  - Negate the x-components of the landmarks

Samples with ground truth landmarks:

Create a CNN with torch.nn.Module with

5 layers of convolution such that
- the number of out-channels increases from 1 to 256
- the filter size is 3
2 layers of fully connected with feature size from 768 to 512 to 116
ReLU and maxpool after each convolution layers
Flatten before the fully connected layers
ReLU after the first fully connected layers

Train with

torch.optim.Adam with learning rate 1e-3
loss function torch.nn.MSELoss
batch size 16

Result

successful examples
failed examples

For the first failed example, it might be because the lighting is very strong which makes his face not contrasting enough. So the network is hard to recognize the shadows.
For the second failed example, she has too much motions

The following are the visualized filters:

Part 3: Train With Larger Dataset

Use ibug faces, resizing the bounding box crops to size 224 * 224. Though the labeling sometimes causes the landmarks to be outside of the bounding boxes, experimentally it is better not to do anything, because it mimics the situation in the test time. The following samples are shown with the bounding boxes enlarged for aesthetic reasons (specially enlarging it such that it fits the circle with image center as the center, the largest distance between any key point and the image center as the radius).

Use torchvision.models.resnet18.

Change the in-channel of the first convolution layer to 1 because gray-scaled images are used to save training time.
Change the out features of the last fully connected layer to size 136 because each image has 68 key points.

Train with

torch.optim.Adam with learning rate 1e-3
loss function torch.nn.MSELoss
batch size 128

On Kaggle I was able to get MSE of 12.60784.
Visualization of sampled results:

Results when running from images outside of ibug

For the first image, the network apparantly takes the hair edge as the jaw edge and recognizes the eyebrows as eyes.
For the second image, the network is pretty accurate.
For the thrid image, the network overestimates the shape of the person's face, which should be thinner than predicted.

Bells & Whistles (Extra Point)

Use the above trained networks and results to automatically detect key points to morph faces with the implementation from project 3.