Daniel Edrisian
Project 5
Esketit
In this project, we try to detect facial keypoints using neural networks. It is split into three parts: Part 1 uses a simple neural network to detect a single nose keypoint. Part 2 detects the full set of facial keypoints on the same dataset used in Part 1. Part 3 extends Part 2 with a larger dataset and a more robust neural network architecture.
Using the Danes' face dataset, we train a neural network to predict a single facial keypoint: the nose. The images are preprocessed and downsized in the dataloader.
The following images are sampled from the dataloader and represented with their respective ground truth labels (as the green dot).
CNN1(
(conv1): Conv2d(1, 32, kernel_size=(7, 7), stride=(1, 1))
(n1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
(n2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
(n3): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
(n4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(fc1): Linear(in_features=64, out_features=512, bias=True)
(fc2): Linear(in_features=512, out_features=2, bias=True)
)
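The printed repr above lists the layers but not the forward pass. The sketch below reconstructs `CNN1` so that it reproduces the summary exactly (the parameter counts and output shapes match); the max-pooling after each conv is inferred from the halved spatial sizes at each BatchNorm, and the ReLU placement is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN1(nn.Module):
    """Nose-point regressor reconstructed from the printed summary."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=7)
        self.n1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=5)
        self.n2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=3)
        self.n3 = nn.BatchNorm2d(32)
        self.conv4 = nn.Conv2d(32, 32, kernel_size=3)
        self.n4 = nn.BatchNorm2d(32)
        self.fc1 = nn.Linear(64, 512)  # 32 channels * 2 * 1 spatial
        self.fc2 = nn.Linear(512, 2)   # (x, y) of the nose

    def forward(self, x):
        # Pooling before each BatchNorm matches the halved summary shapes.
        x = F.relu(self.n1(F.max_pool2d(self.conv1(x), 2)))
        x = F.relu(self.n2(F.max_pool2d(self.conv2(x), 2)))
        x = F.relu(self.n3(F.max_pool2d(self.conv3(x), 2)))
        x = F.relu(self.n4(F.max_pool2d(self.conv4(x), 2)))
        x = F.relu(self.fc1(x.flatten(1)))
        return self.fc2(x)
```

On a `(N, 1, 80, 60)` input this produces the 74×54 → 37×27 → … → 2×1 chain shown in the summary and the same 80,290-parameter total.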
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 74, 54] 1,600
BatchNorm2d-2 [-1, 32, 37, 27] 64
Conv2d-3 [-1, 32, 33, 23] 25,632
BatchNorm2d-4 [-1, 32, 16, 11] 64
Conv2d-5 [-1, 32, 14, 9] 9,248
BatchNorm2d-6 [-1, 32, 7, 4] 64
Conv2d-7 [-1, 32, 5, 2] 9,248
BatchNorm2d-8 [-1, 32, 2, 1] 64
Linear-9 [-1, 512] 33,280
Linear-10 [-1, 2] 1,026
================================================================
Total params: 80,290
Trainable params: 80,290
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 1.49
Params size (MB): 0.31
Estimated Total Size (MB): 1.82
----------------------------------------------------------------
During my experiments, I noticed that higher learning rates made the loss converge faster, although I observed some unexpected loss increases toward the end of training (around epoch 20). Increasing the number of convolution features only improved performance slightly.
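The report does not show the training setup, so the loop below is only a sketch of how the learning-rate experiments could be run: Adam and MSE loss are assumptions, and `train` is a hypothetical helper, not the author's code.

```python
import torch
import torch.nn as nn

def train(model, loader, lr=1e-3, epochs=25):
    """Train a keypoint regressor and return the per-epoch mean loss,
    so different learning rates can be compared."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    for _ in range(epochs):
        total = 0.0
        for imgs, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(imgs), labels)
            loss.backward()
            opt.step()
            total += loss.item()
        history.append(total / len(loader))
    return history
```

Running this with several `lr` values and plotting the returned histories side by side makes the convergence-speed comparison above concrete.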
The following two images were predicted correctly. In both, the subject faces the camera directly.
The incorrect predictions correspond to subjects turned away from the camera: when the viewing angle changes and the full nose is not visible, the network misses the nose information. The dataset is also too small to cover those cases adequately, and no data transformation was applied.
In this section, I implemented a neural network that predicts all of the facial keypoints. I applied data augmentation to the images: resizing, color saturation manipulation, vertical and horizontal flipping, normalization, and rotation.
CNN2(
(conv1): Conv2d(1, 16, kernel_size=(7, 7), stride=(1, 1))
(n1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(16, 16, kernel_size=(7, 7), stride=(1, 1))
(n2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(16, 16, kernel_size=(5, 5), stride=(1, 1))
(n3): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(n4): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv5): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1))
(n5): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(fc1): Linear(in_features=1056, out_features=512, bias=True)
(fc2): Linear(in_features=512, out_features=116, bias=True)
)
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 16, 154, 114] 800
BatchNorm2d-2 [-1, 16, 77, 57] 32
Conv2d-3 [-1, 16, 71, 51] 12,560
BatchNorm2d-4 [-1, 16, 35, 25] 32
Conv2d-5 [-1, 16, 31, 21] 6,416
BatchNorm2d-6 [-1, 16, 15, 10] 32
Conv2d-7 [-1, 16, 13, 8] 2,320
BatchNorm2d-8 [-1, 16, 13, 8] 32
Conv2d-9 [-1, 16, 11, 6] 2,320
BatchNorm2d-10 [-1, 16, 11, 6] 32
Linear-11 [-1, 512] 541,184
Linear-12 [-1, 116] 59,508
================================================================
Total params: 625,268
Trainable params: 625,268
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.07
Forward/backward pass size (MB): 3.37
Params size (MB): 2.39
Estimated Total Size (MB): 5.83
----------------------------------------------------------------
[Visualizations of the learned filters for Conv1 through Conv5]
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 224, 224] 640
BatchNorm2d-2 [-1, 64, 224, 224] 128
ReLU-3 [-1, 64, 224, 224] 0
MaxPool2d-4 [-1, 64, 112, 112] 0
Conv2d-5 [-1, 64, 112, 112] 36,864
BatchNorm2d-6 [-1, 64, 112, 112] 128
ReLU-7 [-1, 64, 112, 112] 0
Conv2d-8 [-1, 64, 112, 112] 36,864
BatchNorm2d-9 [-1, 64, 112, 112] 128
ReLU-10 [-1, 64, 112, 112] 0
BasicBlock-11 [-1, 64, 112, 112] 0
Conv2d-12 [-1, 64, 112, 112] 36,864
BatchNorm2d-13 [-1, 64, 112, 112] 128
ReLU-14 [-1, 64, 112, 112] 0
Conv2d-15 [-1, 64, 112, 112] 36,864
BatchNorm2d-16 [-1, 64, 112, 112] 128
ReLU-17 [-1, 64, 112, 112] 0
BasicBlock-18 [-1, 64, 112, 112] 0
Conv2d-19 [-1, 128, 56, 56] 73,728
BatchNorm2d-20 [-1, 128, 56, 56] 256
ReLU-21 [-1, 128, 56, 56] 0
Conv2d-22 [-1, 128, 56, 56] 147,456
BatchNorm2d-23 [-1, 128, 56, 56] 256
Conv2d-24 [-1, 128, 56, 56] 8,192
BatchNorm2d-25 [-1, 128, 56, 56] 256
ReLU-26 [-1, 128, 56, 56] 0
BasicBlock-27 [-1, 128, 56, 56] 0
Conv2d-28 [-1, 128, 56, 56] 147,456
BatchNorm2d-29 [-1, 128, 56, 56] 256
ReLU-30 [-1, 128, 56, 56] 0
Conv2d-31 [-1, 128, 56, 56] 147,456
BatchNorm2d-32 [-1, 128, 56, 56] 256
ReLU-33 [-1, 128, 56, 56] 0
BasicBlock-34 [-1, 128, 56, 56] 0
Conv2d-35 [-1, 256, 28, 28] 294,912
BatchNorm2d-36 [-1, 256, 28, 28] 512
ReLU-37 [-1, 256, 28, 28] 0
Conv2d-38 [-1, 256, 28, 28] 589,824
BatchNorm2d-39 [-1, 256, 28, 28] 512
Conv2d-40 [-1, 256, 28, 28] 32,768
BatchNorm2d-41 [-1, 256, 28, 28] 512
ReLU-42 [-1, 256, 28, 28] 0
BasicBlock-43 [-1, 256, 28, 28] 0
Conv2d-44 [-1, 256, 28, 28] 589,824
BatchNorm2d-45 [-1, 256, 28, 28] 512
ReLU-46 [-1, 256, 28, 28] 0
Conv2d-47 [-1, 256, 28, 28] 589,824
BatchNorm2d-48 [-1, 256, 28, 28] 512
ReLU-49 [-1, 256, 28, 28] 0
BasicBlock-50 [-1, 256, 28, 28] 0
Conv2d-51 [-1, 512, 14, 14] 1,179,648
BatchNorm2d-52 [-1, 512, 14, 14] 1,024
ReLU-53 [-1, 512, 14, 14] 0
Conv2d-54 [-1, 512, 14, 14] 2,359,296
BatchNorm2d-55 [-1, 512, 14, 14] 1,024
Conv2d-56 [-1, 512, 14, 14] 131,072
BatchNorm2d-57 [-1, 512, 14, 14] 1,024
ReLU-58 [-1, 512, 14, 14] 0
BasicBlock-59 [-1, 512, 14, 14] 0
Conv2d-60 [-1, 512, 14, 14] 2,359,296
BatchNorm2d-61 [-1, 512, 14, 14] 1,024
ReLU-62 [-1, 512, 14, 14] 0
Conv2d-63 [-1, 512, 14, 14] 2,359,296
BatchNorm2d-64 [-1, 512, 14, 14] 1,024
ReLU-65 [-1, 512, 14, 14] 0
BasicBlock-66 [-1, 512, 14, 14] 0
AdaptiveAvgPool2d-67 [-1, 512, 1, 1] 0
Dropout-68 [-1, 512] 0
Linear-69 [-1, 136] 69,768
ResNet-70 [-1, 136] 0
================================================================
Total params: 11,237,512
Trainable params: 11,237,512
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.19
Forward/backward pass size (MB): 251.13
Params size (MB): 42.87
Estimated Total Size (MB): 294.19
----------------------------------------------------------------
The model (expectedly) does well on the training set.
Here are some predictions our network made on the test set, which at first glance look great. The model cannot predict keypoints outside the image boundaries, so when the face is cropped it collapses the facial points onto the image edges. I personally think this is acceptable. One thing I could try in the future is overlapping two fully framed faces at different overlap ratios to see whether this effect persists.
Some full-sized examples: