To train a model to identify nose keypoints from images, we first create a PyTorch Dataset that pairs each image with its nose-tip keypoint. Here are three examples of images/keypoints from our training set.
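A minimal sketch of what such a Dataset might look like is below. The image-loading library, the normalization, and the names image_paths / nose_keypoints are my assumptions for illustration, not the exact code used here.

import numpy as np
import torch
from torch.utils.data import Dataset
from skimage import io

class NoseKeypointDataset(Dataset):
    """Pairs a grayscale face image with a single (x, y) nose-tip keypoint."""

    def __init__(self, image_paths, nose_keypoints, transform=None):
        # image_paths: list of file paths; nose_keypoints: list of (x, y) pairs (assumed layout)
        self.image_paths = image_paths
        self.nose_keypoints = nose_keypoints
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = io.imread(self.image_paths[idx], as_gray=True).astype(np.float32)
        img = img - 0.5  # center pixel values around zero (assumed normalization)
        img = torch.from_numpy(img).unsqueeze(0)  # add a channel dimension: (1, H, W)
        if self.transform is not None:
            img = self.transform(img)
        kp = torch.tensor(self.nose_keypoints[idx], dtype=torch.float32)
        return img, kp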
NoseTipNet(
  (conv1): Conv2d(1, 20, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(20, 16, kernel_size=(3, 3), stride=(1, 1))
  (conv3): Conv2d(16, 12, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=480, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=2, bias=True)
)
For training this model, I used 3 convolutional layers with 20, 16, and 12 filters, respectively. Then I used a fully connected layer with 120 neurons, finally projecting onto 2 output neurons (the x and y position of the nose). I used the Adam optimizer and MSE loss. A sketch of the module is below, followed by my training and validation loss per epoch.
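Reconstructing that architecture as an nn.Module gives a sketch like the following. The 2x2 max-pool after each convolution and the 60x80 grayscale input are assumptions; I chose them because they make the flattened size come out to the 480 features that fc1 expects (12 channels * 5 * 8).

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseTipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=3)
        self.conv2 = nn.Conv2d(20, 16, kernel_size=3)
        self.conv3 = nn.Conv2d(16, 12, kernel_size=3)
        self.fc1 = nn.Linear(480, 120)  # 12 * 5 * 8, assuming a 60x80 input with the pooling below
        self.fc2 = nn.Linear(120, 2)    # (x, y) of the nose tip

    def forward(self, x):
        # a 2x2 max-pool after each conv is an assumption consistent with fc1's input size
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)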
Done with epoch 0, training loss: 3.1888229132655397, validation loss: 0.33840077044442296
Done with epoch 1, training loss: 1.1044911621345364, validation loss: 0.3055537597683724
Done with epoch 2, training loss: 0.9410652152146213, validation loss: 0.2150196764487191
Done with epoch 3, training loss: 0.7361474163071762, validation loss: 0.22840970719698817
Done with epoch 4, training loss: 0.47167071784963355, validation loss: 0.13664373198298563
Done with epoch 5, training loss: 0.3717792864290459, validation loss: 0.14921850803511916
Done with epoch 6, training loss: 0.2901686102468375, validation loss: 0.18013405842066277
Done with epoch 7, training loss: 0.3065151146474818, validation loss: 0.10420908059609246
Done with epoch 8, training loss: 0.18149715138315514, validation loss: 0.10389898047651513
Done with epoch 9, training loss: 0.20527198455692997, validation loss: 0.09057916551137168
Done with epoch 10, training loss: 0.16466897432087535, validation loss: 0.08595468351450108
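The loop that produces numbers like these follows the standard PyTorch pattern. A minimal sketch, assuming DataLoaders named train_loader and val_loader, the default Adam learning rate, and a guess at how losses are aggregated per epoch (the exact aggregation is not shown above):

import torch
import torch.nn as nn

model = NoseTipNet()
optimizer = torch.optim.Adam(model.parameters())  # learning rate left at the default (assumed)
criterion = nn.MSELoss()

for epoch in range(11):
    model.train()
    train_loss = 0.0
    for imgs, kps in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(imgs), kps)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for imgs, kps in val_loader:
            val_loss += criterion(model(imgs), kps).item()
    print(f"Done with epoch {epoch}, training loss: {train_loss}, "
          f"validation loss: {val_loss / len(val_loader)}")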
Here are some of the validation-set results with the ground-truth (green) and predicted (red) keypoint for the nose. Images 1 and 2 were very successful, but 3 and 4 had significant error. This might be because of the lighting difference (images 3 and 4 had significantly less contrast between face and background), which may make the nose more difficult for our model to detect.
For this section, I'll be moving on to creating a model to detect all 58 keypoints. Here are some example photos (with ground-truth keypoints in green) from the training set.
For training this model, I used 5 convolutional layers with 20, 16, 14, 13, and 12 filters, respectively. Then I used a fully connected layer with 120 neurons, finally projecting onto 2 * 58 output neurons (an x and a y for each of the 58 keypoints). I used the Adam optimizer and MSE loss. A sketch of the module is below, followed by my training and validation loss per epoch.
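A sketch of that module is below. Because the write-up doesn't state the input resolution, I use nn.LazyLinear so the flattened feature size is inferred on the first forward pass rather than guessed; the class name and the pooling-free forward pass are also my assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceKeypointNet(nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=3)
        self.conv2 = nn.Conv2d(20, 16, kernel_size=3)
        self.conv3 = nn.Conv2d(16, 14, kernel_size=3)
        self.conv4 = nn.Conv2d(14, 13, kernel_size=3)
        self.conv5 = nn.Conv2d(13, 12, kernel_size=3)
        self.fc1 = nn.LazyLinear(120)      # infers in_features from the first batch
        self.fc2 = nn.Linear(120, 2 * 58)  # an (x, y) pair for each of the 58 keypoints

    def forward(self, x):
        for conv in (self.conv1, self.conv2, self.conv3, self.conv4, self.conv5):
            x = F.relu(conv(x))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)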
Done with epoch 0, training loss: 4.85357520589605, validation loss: 0.21017058813595213
Done with epoch 1, training loss: 0.6813927794937626, validation loss: 0.23388455476379022
Done with epoch 2, training loss: 0.6776246287772665, validation loss: 0.18995635154715274
Done with epoch 3, training loss: 0.6842949464335106, validation loss: 0.18375414823822211
Done with epoch 4, training loss: 0.631536857021274, validation loss: 0.2618754438008182
Done with epoch 5, training loss: 0.5985652899144043, validation loss: 0.26280800107633695
Done with epoch 6, training loss: 0.5442319698195206, validation loss: 0.19077659198956098
Done with epoch 7, training loss: 0.4873185678879963, validation loss: 0.15939016304764664
Done with epoch 8, training loss: 0.422212222169037, validation loss: 0.2317342308233492
Done with epoch 9, training loss: 0.36253218107594876, validation loss: 0.16536742163589224
Done with epoch 10, training loss: 0.29173711489420384, validation loss: 0.11308583000209183
Done with epoch 11, training loss: 0.28007589421758894, validation loss: 0.11042622686363757
Done with epoch 12, training loss: 0.26548042416834505, validation loss: 0.0811194194102427
Here we have examples showing ground-truth (red) and predicted (green) keypoints. We predict well for the first two images but struggle with the latter two. The failures are likely due to the large head tilt, which the model doesn't handle as well as a straight-on headshot.
Now, we visualize some of the "learned filters" that our model has developed during training. Here they are for the first convolutional layer.
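Extracting those filters only requires reading the first conv layer's weight tensor. A sketch with matplotlib, assuming the 20-filter first layer described above (each filter is a 3x3 kernel over one input channel):

import matplotlib.pyplot as plt

weights = model.conv1.weight.detach().cpu()  # shape: (20, 1, 3, 3)
fig, axes = plt.subplots(4, 5, figsize=(10, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(weights[i, 0], cmap="gray")  # visualize each 3x3 kernel as a tiny image
    ax.set_title(f"filter {i}")
    ax.axis("off")
plt.show()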
Now, we work with a much larger dataset with bounding boxes. Here are some of the example faces (along with ground-truth keypoints in green).
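Because the keypoints in this dataset are given in full-image coordinates, each face is typically cropped to its bounding box, with the keypoints shifted and rescaled to match, before training. A sketch of that step, assuming boxes given as (x, y, w, h) in pixels; the function name and output size are illustrative:

import numpy as np
from skimage.transform import resize

def crop_to_bbox(img, keypoints, bbox, out_size=(224, 224)):
    """Crop img to bbox and map keypoints into the resized crop's frame."""
    x, y, w, h = (int(v) for v in bbox)
    crop = img[y:y + h, x:x + w]
    kps = keypoints.astype(np.float32).copy()
    kps[:, 0] = (kps[:, 0] - x) * out_size[1] / w  # x into resized-crop coordinates
    kps[:, 1] = (kps[:, 1] - y) * out_size[0] / h  # y into resized-crop coordinates
    return resize(crop, out_size), kps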
For training, I used the ResNet18 model, but replaced the first layer with a convolutional layer with a kernel size of 5 and 64 filters. I also replaced the final fully connected layer to output 68 * 2 values, the (x, y) coordinates of the 68 keypoints. A sketch of these replacements is below, followed by the training and validation loss per epoch.
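With torchvision, those two replacements look like the sketch below. The input channel count, stride, and padding of the new first layer are assumptions (the write-up only specifies the 5x5 kernel and 64 filters):

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=False)  # whether pretrained weights were used isn't stated
# Replace the first layer: 64 output channels with a 5x5 kernel
# (stride and padding here are assumptions).
model.conv1 = nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2, bias=False)
# Replace the final fully connected layer: (x, y) for each of the 68 keypoints.
model.fc = nn.Linear(model.fc.in_features, 68 * 2)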
After training, I apply my model to the test set. Here are two example images (with predicted keypoints in green). On Kaggle, my mean squared error is $22.02754$.
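Producing those predictions is a single forward pass with the flat output reshaped into per-keypoint coordinates. A minimal sketch, assuming an image-only DataLoader named test_loader:

import torch

model.eval()
with torch.no_grad():
    for imgs in test_loader:
        preds = model(imgs)                # shape: (batch, 136)
        keypoints = preds.view(-1, 68, 2)  # one (x, y) per keypoint
        # plot keypoints over the input crops, undoing any resize before display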
Now, we try this model on my own photos. It seems to work decently well, although it's hard to judge accuracy when the faces are so zoomed out. It got the second photo roughly right, while failing on the first and third.