CS194-26 Project 5 - Yin Deng

Part 1: Nose Tip Detection

Sampled image from dataloader visualized with ground-truth keypoints:

Plot the train and validation MSE loss during the training process:

On the left, the model uses a learning rate of 0.005 and a different number of channels in each convolutional layer.

On the right, the model uses a learning rate of 0.001 and the same number of channels in every convolutional layer.

The two configurations perform similarly.
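As a reminder of what the plotted quantity measures, here is a minimal sketch of the MSE loss between predicted and ground-truth keypoints. It assumes each keypoint is an (x, y) pair and that the loss is averaged over every coordinate, which is the default behavior of PyTorch's `nn.MSELoss`. The function name `mse_loss` and the sample coordinates are hypothetical and are not taken from the project code.

```python
def mse_loss(predicted, target):
    """Mean squared error averaged over every coordinate of every keypoint."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must contain the same number of keypoints")
    squared_errors = [
        (p - t) ** 2
        for pred_point, true_point in zip(predicted, target)
        for p, t in zip(pred_point, true_point)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical nose-tip prediction vs. ground truth, in normalized image coordinates.
example_loss = mse_loss([(0.5, 0.5)], [(0.5, 0.6)])
```

With a single nose-tip keypoint, the loss is simply the average of the squared x error and the squared y error.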

Two facial images in which the network detects the nose correctly:

Two facial images in which the network detects the nose incorrectly:

In the two images where the network detects the nose incorrectly, the prediction is still fairly close to the ground truth. The remaining error suggests that the model may need further training to improve accuracy.

Part 2: Full Facial Keypoints Detection

Sampled image from dataloader visualized with ground-truth keypoints:

Report the detailed architecture of your model. Include information on hyperparameters chosen for training and a plot showing both training and validation loss across iterations.

Learning rate = 0.0001. The loss does not converge well when the learning rate is too large.
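The effect of an overly large learning rate can be illustrated with a toy example. The sketch below runs plain gradient descent on f(x) = x^2, which is not the project's network or loss; the function name `final_loss` and the two step sizes are chosen only for illustration. Each update multiplies x by (1 - 2 * lr), so the iterate shrinks when that factor has magnitude below 1 and grows otherwise.

```python
def final_loss(learning_rate, steps=20, x0=1.0):
    """Run gradient descent on f(x) = x**2 and return the final loss value."""
    x = x0
    for _ in range(steps):
        gradient = 2.0 * x          # derivative of x**2
        x -= learning_rate * gradient
    return x ** 2


small_lr_loss = final_loss(0.1)   # update factor 0.8: the loss shrinks
large_lr_loss = final_loss(1.1)   # update factor -1.2: the loss grows
```

A real network's loss surface is far more complicated, but the same mechanism (steps that overshoot the minimum) is one common reason training fails to converge at large learning rates.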

Two facial images in which the network detects the keypoints correctly:

Two facial images in which the network detects the keypoints incorrectly:

In the two images where the network detects the keypoints incorrectly, the likely cause is that the subjects have turned their faces sideways, so the contour of the face (especially the chin) is not captured precisely.

Visualize the learned filters of the first convolutional layer:

Part 3: Train with Larger Dataset

Report the mean absolute error by uploading your predictions on the testing set: 23.72660
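For reference, mean absolute error can be sketched as below. This assumes predictions and ground truth are flattened into equal-length lists of coordinates; the function name `mean_absolute_error` and the sample values are hypothetical, and the exact scoring code used for the submission is not shown in this report.

```python
def mean_absolute_error(predicted, target):
    """Average of |prediction - truth| over all coordinates."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical flattened coordinates: absolute errors are 1, 0 and 2.
example_mae = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
```

The result is in the same units as the submitted coordinates, so it can be read as an average per-coordinate offset.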

Report the detailed architecture of your model. Include information on hyperparameters chosen for training and a plot showing both training and validation loss across iterations.

My neural network has the same architecture as ResNet18 except for two modifications: (1) the number of input channels to the first convolutional layer is 1 instead of 3; (2) the number of output nodes of the last fully connected layer is 136. Learning rate = 0.0001.

Visualize some images with the keypoints prediction in the testing set.

Try running the trained model on no less than 3 photos from your collection. Which ones does it get right? Which ones does it fail on?

If a person's face is facing forward, the model seems to do a decent job. However, if the face is turned sideways, the model predicts the keypoints poorly.