CS 194-26 Intro to Computer Vision and Computational Photography, Fall 2021

Project 5: Facial Keypoint Detection with Neural Networks!

Name: Sarthak Arora

Part 1: Nose Tip Detection

Show samples of data loader (5 points)

[Figure: five sample images from the data loader]
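
For reference, here is a minimal sketch of the kind of Dataset the loader wraps. The file list, the keypoint source, and the 60x80 output size are placeholders for illustration, not the exact values used.

import torch
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms.functional as TF
from PIL import Image

class NoseDataset(Dataset):
    """Grayscale face images plus the nose-tip keypoint as an (x, y) ratio in [0, 1]."""
    def __init__(self, image_paths, nose_points, out_size=(60, 80)):  # (H, W), assumed
        self.image_paths = image_paths   # list of image file paths (placeholder)
        self.nose_points = nose_points   # list of (x, y) ratios (placeholder)
        self.out_size = out_size

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("L")  # grayscale
        img = TF.resize(img, self.out_size)
        img = TF.to_tensor(img) - 0.5                         # roughly center pixel values
        point = torch.tensor(self.nose_points[idx], dtype=torch.float32)
        return img, point

# Usage (paths and points are placeholders):
# loader = DataLoader(NoseDataset(paths, points), batch_size=4, shuffle=True)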

Plot train and validation loss (5 points)

MSE was the loss function. Following the spec, the images were normalized and the keypoint coordinates were expressed as ratios in [0, 1] when the loss was computed. The plot below shows the loss decreasing over the epochs; the training loss stays below the validation loss, and the validation loss keeps decreasing, which indicates that training resulted in neither underfitting nor overfitting.
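
Below is a minimal sketch of the training/validation loop behind these curves; the model, data loaders, epoch count, and learning rate are placeholders.

import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    criterion = nn.MSELoss()                                  # loss on (x, y) ratios in [0, 1]
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for imgs, pts in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), pts)
            loss.backward()
            optimizer.step()
            running += loss.item()
        train_losses.append(running / len(train_loader))

        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(imgs), pts).item() for imgs, pts in val_loader)
        val_losses.append(val / len(val_loader))
    return train_losses, val_losses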

[Figure: train and validation loss over epochs]

Show how hyperparameters affect results (5 points)

The first hyperparameter I varied was the learning rate. With 1e-2 (first plot), the step size was too large: the loss bounced around high-loss regions and never settled into a low-loss region of the loss surface. With 1e-4 (second plot), the step size was too small: the loss barely decreased and stayed nearly flat, perhaps because we got stuck in a flat region of the loss surface. With 1e-3 (third plot), the loss decreased steadily without bouncing around, and this setting also achieved the lowest training and validation loss; a sketch of the sweep follows the plots.

[Figure: loss curves with learning rate 1e-2]
[Figure: loss curves with learning rate 1e-4]
[Figure: loss curves with learning rate 1e-3]
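
The sweep itself can be run as below, assuming the train helper sketched earlier and a hypothetical make_model() factory that returns a freshly initialized network.

# Retrain a fresh model for each learning rate and keep the loss curves.
results = {}
for lr in (1e-2, 1e-3, 1e-4):
    model = make_model()   # hypothetical factory returning a freshly initialized net
    train_losses, val_losses = train(model, train_loader, val_loader, lr=lr)
    results[lr] = (train_losses, val_losses)
    print(f"lr={lr}: final validation loss {val_losses[-1]:.5f}")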

The second hyperparameter I varied was the number of convolution layers. I compared 3 layers against 4 and found that 3 layers gave slightly lower loss and trained noticeably faster. With 4 layers there is more backpropagation to do and many more parameters to fit, which may not have tuned well enough in the given number of epochs. The plots below compare the two (3 layers has slightly lower loss), followed by a sketch of a configurable net.

[Figure: loss curves with 3 convolution layers]
[Figure: loss curves with 4 convolution layers]
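
The exact channel widths are not listed here, so the sketch below only illustrates how the 3-layer vs. 4-layer comparison can be set up with a configurable number of convolution layers; the widths and kernel sizes are assumptions.

import torch.nn as nn

def make_nose_net(num_conv_layers=3, in_size=(60, 80)):
    """Small CNN with a configurable number of conv layers (channel widths are illustrative)."""
    layers, in_ch = [], 1
    for i in range(num_conv_layers):
        out_ch = 16 * (2 ** i)                     # 16, 32, 64, ... channels (assumed)
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(),
                   nn.MaxPool2d(2)]
        in_ch = out_ch
    h = in_size[0] // (2 ** num_conv_layers)       # spatial size after the poolings
    w = in_size[1] // (2 ** num_conv_layers)
    layers += [nn.Flatten(),
               nn.Linear(in_ch * h * w, 128), nn.ReLU(),
               nn.Linear(128, 2)]                  # (x, y) nose-tip ratio
    return nn.Sequential(*layers)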

Show 2 success/failure cases (5 points)

Two success cases

[Figure: two success cases]

Two failure cases. The failures seem to occur when the person looks to the left or the right, which moves the nose away from the center of the image. In most training pictures the head faces forward, so the nose sits near the center; this is what the network learns, which is why it does well on frontal images but not on non-frontal ones.

[Figure: two failure cases]

Part 2: Full Facial Keypoints Detection

Show samples of data loader (5 points)

I have also included examples of augmented data: pictures where the brightness has been increased, the saturation has been changed, or both. A sketch of these transforms follows the samples.

[Figure: five sample images from the data loader, including augmented examples]
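
The augmentation is purely photometric, so only pixel values change and the keypoint labels need no adjustment. A rough sketch with torchvision (the jitter ranges are assumptions):

import torchvision.transforms as T

# Randomly perturb brightness and saturation of the PIL image before tensor conversion.
# No geometric change is applied, so the keypoint coordinates stay valid.
augment = T.ColorJitter(brightness=0.5, saturation=0.5)
# e.g. augmented_img = augment(pil_image)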

Report detailed architecture (5 points)

I used 5 convolution layers followed by a couple of fully connected layers. Every layer was followed by a ReLU, and most convolution layers were also followed by a max-pooling layer. The exact architecture is shown below; the input was a 120x160 image. A rough sketch of the layout follows the figure.

[Figure: detailed architecture of the Part 2 network]
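
Since the figure carries the exact layer sizes, the sketch below only mirrors the overall layout (5 conv layers, ReLU everywhere, pooling after most conv layers, then fully connected layers). The channel widths, kernel sizes, which layer skips pooling, and the single-channel input are assumptions; the output size assumes the 58 IMM keypoints, i.e. 58 * 2 = 116 values.

import torch.nn as nn

face_net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1),   nn.ReLU(), nn.MaxPool2d(2),  # 120x160 -> 60x80
    nn.Conv2d(16, 32, 3, padding=1),  nn.ReLU(), nn.MaxPool2d(2),  # -> 30x40
    nn.Conv2d(32, 64, 3, padding=1),  nn.ReLU(), nn.MaxPool2d(2),  # -> 15x20
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 7x10
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),                  # no pooling here (assumed)
    nn.Flatten(),
    nn.Linear(256 * 7 * 10, 512), nn.ReLU(),
    nn.Linear(512, 58 * 2),                                        # 58 keypoints, (x, y) each
)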

Plot train and validation loss (5 points)

As in Part 1, MSE was the loss function, with normalized images and keypoint coordinates expressed as ratios in [0, 1]. The loss again decreases over the epochs, the training loss stays below the validation loss, and the validation loss keeps decreasing, indicating neither underfitting nor overfitting.

[Figure: train and validation loss over epochs]

Show 2 success/failure cases (5 points)

Two success cases

[Figure: two success cases]

Two failure cases. As in Part 1, the failures occur when the person looks to the left or the right, so the nose and the rest of the face move away from the center. Most training pictures are frontal, so that is what the network learns, and it does well on frontal images but not on non-frontal ones.

[Figure: two failure cases]

Visualize learned features (5 points)

These are the first 12 filters of the first convolution layer. The filters appear to detect edges, and to some extent respond to the middle of the image, where important facial features such as the nose lie. A sketch of how they can be extracted and plotted follows.

[Figure: the first 12 filters of the first convolution layer]
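
A sketch of how such a visualization can be produced; face_net refers to the sequential sketch above, and indexing into it to reach the first conv layer is an assumption about how the real model exposes that layer.

import matplotlib.pyplot as plt

# First conv layer weights: shape (out_channels, in_channels, kH, kW).
weights = face_net[0].weight.detach().cpu()

fig, axes = plt.subplots(2, 6, figsize=(12, 4))
for i, ax in enumerate(axes.flat):                 # first 12 filters
    ax.imshow(weights[i, 0].numpy(), cmap="gray")  # single input channel -> 2D kernel
    ax.axis("off")
plt.show()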

Part 3: Train With Larger Dataset

Submit a working model to Kaggle competition (15 points)

I have submitted to Kaggle. My team name is just my name, Sarthak Arora.

Report detailed architecture (10 points)

For this part, I used a standard pretrained ResNet-18 from torchvision.models. The first change I made was to set the number of input channels of the first convolution layer to 1. The second was to change the number of output features of the final FC layer to 136, matching our 68 keypoints * 2 coordinates (x, y). The resulting model is printed below, followed by a sketch of the two modifications.

ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=136, bias=True)
)
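
A minimal sketch of the two modifications described above (note that replacing conv1 discards that layer's pretrained weights; the rest of the pretrained network is kept):

import torch
import torchvision.models as models

model = models.resnet18(pretrained=True)
# Change the first conv layer to accept a single-channel (grayscale) input.
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Change the final FC layer to output 68 keypoints * 2 coordinates = 136 values.
model.fc = torch.nn.Linear(model.fc.in_features, 136)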
    

Plot train and validation loss (10 points)

Here are plots (one full and one zoomed in) showing how the train and validation loss progressed over 20 epochs.

[Figure: train and validation loss over 20 epochs, full and zoomed-in views]

Visualize results on test set (10 points)

Here are a few examples of the network's predictions on the test set. The predictions hold up well for faces at a range of angles; a sketch of how the predicted coordinates are mapped back to pixels follows the examples.

[Figure: predictions on test-set images]
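
For these visualizations the predicted coordinates have to be mapped back to pixels. A sketch, assuming the same ratio parameterization as in Parts 1 and 2 and a single 1xHxW test image tensor img (both assumptions):

import matplotlib.pyplot as plt
import torch

model.eval()
with torch.no_grad():
    pred = model(img.unsqueeze(0)).view(68, 2)   # one (x, y) ratio per keypoint

h, w = img.shape[-2:]
xs = (pred[:, 0] * w).numpy()                    # back to pixel coordinates
ys = (pred[:, 1] * h).numpy()

plt.imshow(img.squeeze().numpy(), cmap="gray")
plt.scatter(xs, ys, s=8, c="red")
plt.axis("off")
plt.show()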

Run on at least 3 of your own photos (10 points)

Here are the results of the net on 3 of my own photos. It worked well on images that were tightly cropped to the face and poorly on pictures containing a lot of clutter besides the face, since no bounding box was used to crop the input. Overall the network did a good job of predicting face shape, orientation, and features.

[Figure: predictions on three of my own photos]