Project 5:
Facial Keypoint Detection with Neural Networks


Author: Isaac Bae
Class: CS 194-26 (UC Berkeley)
Date: 11/16/21


Part 1: Nose Tip Detection


In this part, I had to build a fairly simple convolutional neural network (CNN) to detect nose tips for a database of faces.

First, I had to confirm that my data was being loaded correctly. Here are some samples from the dataloader that I created.


p1_samples
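
The dataloader wraps a torch Dataset along these lines. This is only a sketch: it assumes the grayscale images and nose-tip coordinates are already in memory as arrays, and the file parsing for the face database is omitted.

  import torch
  from torch.utils.data import Dataset, DataLoader

  class NoseTipDataset(Dataset):
      # Sketch only: images and nose-tip coordinates are assumed to be
      # pre-loaded arrays; reading them from the face database is omitted.
      def __init__(self, images, nose_tips):
          self.images = images        # (N, H, W), float32 in [0, 1]
          self.nose_tips = nose_tips  # (N, 2), (x, y) normalized to [0, 1]

      def __len__(self):
          return len(self.images)

      def __getitem__(self, idx):
          img = torch.as_tensor(self.images[idx]).unsqueeze(0)  # add channel dim
          pt = torch.as_tensor(self.nose_tips[idx])
          return img, pt

  # loader = DataLoader(NoseTipDataset(imgs, pts), batch_size=4, shuffle=True)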

Now, I was ready to train my simple CNN with the data! Here are the plots for the training and validation loss.


p1_orig_train
p1_orig_test
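
For concreteness, here is a minimal sketch of this kind of training/validation loop, assuming MSE loss on the normalized keypoint coordinates and the Adam optimizer:

  import torch
  import torch.nn as nn
  import torch.optim as optim

  def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
      criterion = nn.MSELoss()
      optimizer = optim.Adam(model.parameters(), lr=lr)
      train_losses, val_losses = [], []
      for _ in range(epochs):
          # One pass over the training set, updating the weights.
          model.train()
          running = 0.0
          for imgs, pts in train_loader:
              optimizer.zero_grad()
              loss = criterion(model(imgs), pts)
              loss.backward()
              optimizer.step()
              running += loss.item()
          train_losses.append(running / len(train_loader))

          # One pass over the validation set, no gradient updates.
          model.eval()
          running = 0.0
          with torch.no_grad():
              for imgs, pts in val_loader:
                  running += criterion(model(imgs), pts).item()
          val_losses.append(running / len(val_loader))
      return train_losses, val_losses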

To show how hyperparameters affect the results, I decided to change the learning rate from the original 1e-3 to 1e-2 and 1e-4. Here are the results.


p1_lr_train
p1_lr_test

As you can see, when the learning rate is increased to 1e-2, the curve drops and flattens out very quickly. In contrast, the lower learning rate of 1e-4 has a much slower drop until it begins to flatten out at values similar to those of 1e-3 and 1e-2.

I also decided to remove the last layer of the CNN and see how that affected the results.


p1_conv_train
p1_conv_test

It seems that removing a layer made the loss drop more quickly than with the extra layer. This was interesting because I expected worse performance in some way (e.g., the curve flattening out at a higher loss).

Now, let's see some actual results! Here are two successes and two failures.


Successes

p1_succ1
p1_succ2

Failures

p1_fail1
p1_fail2

It is pretty clear that when a person's nose is generally in the middle area of the image, the predicted point is almost spot-on. However, when a person's face is oriented to the side, the point seems to move only slightly toward the nose from the middle. I believe this is due to many different factors, such as not training for enough epochs, using a suboptimal CNN architecture, etc.


Part 2: Full Facial Keypoints Detection


For this part, I had to detect not just the nose tip but all of the facial keypoints for the same database of faces.

Here are some samples from the dataloader for this part.


p2_samples

This is the CNN architecture that I used:

  1. conv1 (in_channels=1, out_channels=16, kernel_size=7, stride=3)
  2. relu
  3. max_pool (kernel_size=5, stride=3)
  4. conv2 (in_channels=16, out_channels=16, kernel_size=3, padding=1)
  5. relu
  6. conv3 (in_channels=16, out_channels=16, kernel_size=3, padding=1)
  7. relu
  8. conv4 (in_channels=16, out_channels=32, kernel_size=3, padding=1)
  9. relu
  10. conv5 (in_channels=32, out_channels=32, kernel_size=3, padding=1)
  11. relu
  12. max_pool (kernel_size=7, stride=3)
  13. fc1 (in_features=256, out_features=128)
  14. relu
  15. fc2 (in_features=128, out_features=116)
Hopefully this is intuitive enough, but just in case: the layers seen here are convolutional, ReLU, max-pooling, and fully connected layers. My idea was to downsample aggressively at the beginning, double the channel count every two convolutional layers, and pool at the end. This was inspired by standard CNNs such as VGG and GoogLeNet.
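
Put together as a PyTorch module, the stack above looks like the sketch below. The input resolution isn't listed, but fc1's in_features=256 is consistent with a 1 x 120 x 160 grayscale input, which the conv stack reduces to a 32 x 2 x 4 feature map (32 * 2 * 4 = 256).

  import torch
  import torch.nn as nn

  class KeypointNet(nn.Module):
      def __init__(self):
          super().__init__()
          self.features = nn.Sequential(
              nn.Conv2d(1, 16, kernel_size=7, stride=3), nn.ReLU(),
              nn.MaxPool2d(kernel_size=5, stride=3),
              nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
              nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
              nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
              nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
              nn.MaxPool2d(kernel_size=7, stride=3),
          )
          self.fc1 = nn.Linear(256, 128)   # 32 * 2 * 4 = 256 for a 120x160 input
          self.fc2 = nn.Linear(128, 116)   # 58 keypoints * (x, y)

      def forward(self, x):
          x = torch.flatten(self.features(x), 1)
          return self.fc2(torch.relu(self.fc1(x)))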

Also, I used a learning rate of 9e-4, 75 epochs, and a batch size of 4; the setup is sketched below, followed by the resulting loss plots.
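
Reusing the train helper sketched in Part 1, the setup amounts to something like this (train_set and val_set stand in for the actual dataset objects, whose construction is omitted):

  from torch.utils.data import DataLoader

  # train_set / val_set: Dataset objects yielding a (1, 120, 160) image and
  # its 58 keypoints flattened to a 116-vector (construction omitted).
  train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
  val_loader = DataLoader(val_set, batch_size=4)

  model = KeypointNet()
  train_losses, val_losses = train(model, train_loader, val_loader,
                                   epochs=75, lr=9e-4)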


p2_orig_train
p2_orig_test

Here are the visualizations of the learned filters of the first convolutional layer.


p2_filter0 – p2_filter15 (all 16 learned filters of the first convolutional layer)
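
These can be pulled out and plotted with a few lines of matplotlib. The sketch below assumes the KeypointNet module sketched above, where features[0] is the first convolutional layer:

  import matplotlib.pyplot as plt

  def show_first_layer_filters(model):
      # First conv layer weights: shape (out_channels=16, in_channels=1, 7, 7).
      weights = model.features[0].weight.detach().cpu().numpy()
      fig, axes = plt.subplots(2, 8, figsize=(12, 3))
      for i, ax in enumerate(axes.flat):
          ax.imshow(weights[i, 0], cmap='gray')
          ax.set_title(str(i))
          ax.axis('off')
      plt.show()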

Here are two successes and two failures.


Successes

p2_succ1
p2_succ2

Failures

p2_fail1
p2_fail2

Judging from my successes and failures, it seems that my CNN first finds the area of the image where the face is most likely to be, and then tries to minimize the loss with respect to the orientation of the face (though I may be wrong).


Part 3: Train With Larger Dataset


The standard CNN that I decided to go with was ResNet-18. I used it with pretrained weights, and modified the first and last layers to accommodate grayscale images and the larger number of facial landmarks. Specifically, I changed in_channels to 1 in the first convolutional layer, and changed out_features to 136 in the last fully connected layer. I also used a trick where I collapsed each filter of the first convolutional layer from 3 x K x K down to 1 x K x K, so the pretrained weights still carry over.
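
Roughly, these modifications look like the following. I show the pretrained RGB filters being summed over the color dimension; averaging them is an equally common way to do the 3 x K x K to 1 x K x K collapse.

  import torch
  import torchvision

  model = torchvision.models.resnet18(pretrained=True)

  # Swap the stem for a single-channel conv, collapsing the pretrained
  # (64, 3, 7, 7) filters down to (64, 1, 7, 7) by summing over color.
  old_conv = model.conv1
  model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                padding=3, bias=False)
  with torch.no_grad():
      model.conv1.weight.copy_(old_conv.weight.sum(dim=1, keepdim=True))

  # New head: 68 keypoints * (x, y) = 136 outputs.
  model.fc = torch.nn.Linear(model.fc.in_features, 136)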

For reference, I used a learning rate of 1e-4 for 25 epochs (batch size of 32) for "Run 1". When the loss began to stagnate despite still creeping downward, I lowered the learning rate and trained for another 10 epochs for "Run 2". Here are the results.


p3_p1_train
p3_p1_test
p3_p2_train
p3_p2_test

In hindsight, yellow was not necessarily the best color choice for plotting the keypoints on the following test images, but here they are anyway!


p3_test1
p3_test2
p3_test3

Also, here are some results on my own images!


p3_pic0
p3_pic1
p3_pic2