Tony Lian's Project 5

Part 1: Nose Tip Detection

Show samples of data loader (5 points)

Plot train and validation loss (5 points)


The loss function is MSE with reduction set to "mean", computed on normalized predictions and targets. The validation loss stays low, indicating that training has converged.
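
As a minimal sketch (the variable names and the assumption that coordinates are already normalized ratios are mine, not the report's exact code):

import torch
import torch.nn as nn

# Mean-reduced MSE on normalized coordinates, as described above. Both
# predictions and targets are assumed to be (x, y) ratios in [0, 1]
# relative to the image width and height.
criterion = nn.MSELoss(reduction='mean')
predicted_xy = torch.rand(16, 2)  # dummy batch of predicted nose positions
target_xy = torch.rand(16, 2)     # dummy batch of ground-truth positions
loss = criterion(predicted_xy, target_xy)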

Show how hyper parameters affect results (5 points)

Learning Rate: 1e-3 is too large and causes oscillations. 1e-5 is too small to learn sufficiently. 1e-4 works best.
1e-3:

1e-4:

1e-5:

Number of conv layers: adding another layer helps a little, but it also adds computation (I use the 3-layer model for the other subparts). 3 layers:

4 layers:

Show 2 success/failure cases (5 points)

Success cases:


Failure cases:


I think that because noses appear as dark spots in the image, the network learns to predict dark spots near the center of the image, and it fails when the person turns too far to the side (so the nose is too far from the center).

Part 2: Full Facial Keypoints Detection

Show samples of data loader (5 points)


Report detailed architecture (5 points)

Input: 120 (h) x 160 (w)
Conv1: Conv2d(1, 48, 7)
Conv2: Conv2d(48, 16, 7)
Conv3: Conv2d(16, 16, 7)
MaxPool: stride=2
Conv4: Conv2d(16, 16, 7)
Conv5: Conv2d(16, 16, 7)
Conv6: Conv2d(16, 8, 7)
Flatten
FC1: Linear(13992, 2048)
FC2: Linear(2048, 116)
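
A minimal sketch of this architecture in PyTorch (the ReLU placement and the MaxPool kernel size of 2 are my assumptions; the kernel size is inferred from the flattened size 13992 = 8 × 33 × 53 for a 120 × 160 input):

import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 48, 7), nn.ReLU(),   # 120x160 -> 114x154
            nn.Conv2d(48, 16, 7), nn.ReLU(),  # -> 108x148
            nn.Conv2d(16, 16, 7), nn.ReLU(),  # -> 102x142
            nn.MaxPool2d(2, stride=2),        # -> 51x71
            nn.Conv2d(16, 16, 7), nn.ReLU(),  # -> 45x65
            nn.Conv2d(16, 16, 7), nn.ReLU(),  # -> 39x59
            nn.Conv2d(16, 8, 7), nn.ReLU(),   # -> 33x53
        )
        self.fc1 = nn.Linear(13992, 2048)     # 8 * 33 * 53 = 13992
        self.fc2 = nn.Linear(2048, 116)       # 58 keypoints x 2 coordinates

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)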

Plot train and validation loss (5 points)


Show how hyper parameters affect results

Hyperparams: LR: 1e-4, Epochs: 15, weight_decay: 1e-5
Learning Rate: I chose 1e-4. As a comparison, the 1e-3 given on the project webpage for Part 3 leads to some fluctuation in the early stage. 1e-4:

1e-3:

Number of conv layers:
Original:

Remove Conv5 and change FC1's input dimension to 18408:

This improves the overall validation loss.
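
For reference, a sketch of the training setup with these hyperparameters (the Adam optimizer is my assumption for this part, as the report only states the optimizer explicitly for the B&W model; model, criterion, and train_loader stand in for the actual objects):

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
for epoch in range(15):
    for images, keypoints in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), keypoints)  # normalized MSE as in Part 1
        loss.backward()
        optimizer.step()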

Show 2 success/failure cases (5 points)

Success cases:


Failure cases:


It works for common head orientations. It fails when people turn in uncommon directions that the model rarely sees in the training data and cannot generalize to due to the lack of such samples.

Visualize learned features (5 points)

Conv1:

Conv2 (first set of filters):

Conv1 appears to summarize the image, describing it in a lower dimension (similar to what PCA does), while Conv2 starts detecting edges (the first filter on the second row of Conv2 looks like an edge detector).
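
A sketch of one way to render these filters (this assumes the KeypointNet sketch from the architecture section above, where model.features[0] is Conv1 with 48 filters of size 7 × 7):

import matplotlib.pyplot as plt

model = KeypointNet()  # from the sketch above; load trained weights in practice
weights = model.features[0].weight.detach().cpu()  # shape (48, 1, 7, 7)
fig, axes = plt.subplots(6, 8, figsize=(12, 9))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0], cmap='gray')  # each 7x7 filter as a grayscale image
    ax.axis('off')
plt.show()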

Part 3: Train With Larger Dataset

Submit a working model to Kaggle competition (15 points)

My team name is aaaaaaaaaaaaaaaaaa, and the public score without B&W is 9.59216.

Report detailed architecture (5 points)

I use a ResNet18 with the pretrained weights offered by torchvision (pretrained=True), except that I replaced the final FC layer with one that outputs a 136-dimensional tensor (68 keypoints × 2) for each image. Since this is a standard model, here is the description of ResNet18.
A diagram from the page above:

Model:
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=1000, bias=True)
)
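
A minimal sketch of this setup (how I read the description above; handling of the grayscale input channels is not shown):

import torch.nn as nn
import torchvision.models as models

# Pretrained ResNet18 with the final FC layer swapped for a 136-dim
# regressor (68 keypoints x 2), as described above.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 136)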

Plot train and validation loss (10 points)

This is the normalized MSE loss.

Visualize results on test set (10 points)



It works well on frontal images. For side-facing images, the predicted face outline may be slightly off from the actual edge of the face.

Run on at least 3 of your chosen photos (10 points)




It works best on the frontal photos. It works worst on side-facing photos and on faces with glasses or other distractions.

B&W

(12 pts) More keypoint detection networks such as Toshev et al. (2014) or Jain et al. (2014) turn the regression problem of predicting the keypoint coordinates into a pixelwise classification problem: for every pixel, they predict how likely it is that this pixel is a keypoint. You can do this by using an architecture that outputs pixel-aligned heatmaps, such as a fully convolutional network or a UNet. You can turn the ground-truth keypoint coordinates into pixel-aligned heatmaps to supervise your model by placing 2D Gaussians at the coordinate locations in the map. Try training your model with this setup and see how it does! Report on the details of your implementation and your findings.

I implemented this with a UNet architecture based on VGG16 (because PyTorch offers pretrained VGG weights). It takes an image and outputs a map with 68 channels, each encoding one keypoint. I then supervise the pixel-aligned heatmaps with Gaussians. Example:

The loss is KL divergence: a softmax is applied to the predictions over the spatial dimensions, and the target heatmaps are normalized so that each channel sums to 1 over the spatial dimensions.
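
A minimal sketch of this supervision (function names and the Gaussian sigma are my assumptions, not the exact code):

import torch
import torch.nn.functional as F

def make_gaussian_heatmaps(keypoints, h, w, sigma=2.0):
    # keypoints: (K, 2) pixel coordinates (x, y); returns (K, h, w) maps.
    ys = torch.arange(h).float().view(1, h, 1)
    xs = torch.arange(w).float().view(1, 1, w)
    x0 = keypoints[:, 0].view(-1, 1, 1)
    y0 = keypoints[:, 1].view(-1, 1, 1)
    heat = torch.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))
    # Normalize each channel to sum to 1 over the spatial dimensions.
    return heat / heat.sum(dim=(1, 2), keepdim=True)

def heatmap_kl_loss(pred, target):
    # pred/target: (N, K, h, w); spatial softmax on pred, then KL vs. target.
    n, k, h, w = pred.shape
    log_p = F.log_softmax(pred.view(n, k, -1), dim=-1)
    return F.kl_div(log_p, target.view(n, k, -1), reduction='batchmean')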
Model:
UNet16(
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (encoder): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): ReLU(inplace=True)
    (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU(inplace=True)
    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU(inplace=True)
    (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): ReLU(inplace=True)
    (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): ReLU(inplace=True)
    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU(inplace=True)
    (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (relu): ReLU(inplace=True)
  (conv1): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
  )
  (conv2): Sequential(
    (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
  )
  (conv3): Sequential(
    (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU(inplace=True)
  )
  (conv4): Sequential(
    (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU(inplace=True)
  )
  (conv5): Sequential(
    (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU(inplace=True)
  )
  (center): DecoderBlockV2(
    (block): Sequential(
      (0): Interpolate()
      (1): ConvRelu(
        (conv): Conv2d(512, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (activation): ReLU(inplace=True)
      )
      (2): ConvRelu(
        (conv): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (activation): ReLU(inplace=True)
      )
    )
  )
  (dec5): DecoderBlockV2(
    (block): Sequential(
      (0): Interpolate()
      (1): ConvRelu(
        (conv): Conv2d(576, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (activation): ReLU(inplace=True)
      )
      (2): ConvRelu(
        (conv): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (activation): ReLU(inplace=True)
      )
    )
  )
  (dec4): DecoderBlockV2(
    (block): Sequential(
      (0): Interpolate()
      (1): ConvRelu(
        (conv): Conv2d(576, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (activation): ReLU(inplace=True)
      )
      (2): ConvRelu(
        (conv): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (activation): ReLU(inplace=True)
      )
    )
  )
  (dec3): DecoderBlockV2(
    (block): Sequential(
      (0): Interpolate()
      (1): ConvRelu(
        (conv): Conv2d(320, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (activation): ReLU(inplace=True)
      )
      (2): ConvRelu(
        (conv): Conv2d(64, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (activation): ReLU(inplace=True)
      )
    )
  )
  (dec2): DecoderBlockV2(
    (block): Sequential(
      (0): Interpolate()
      (1): ConvRelu(
        (conv): Conv2d(144, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (activation): ReLU(inplace=True)
      )
      (2): ConvRelu(
        (conv): Conv2d(32, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (activation): ReLU(inplace=True)
      )
    )
  )
  (dec1): ConvRelu(
    (conv): Conv2d(72, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (activation): ReLU(inplace=True)
  )
  (final): Conv2d(8, 68, kernel_size=(1, 1), stride=(1, 1))
)
I train the model for 20 epochs with Adam, lr=1e-3, weight_decay=1e-5. A problem I encountered: some of the facial keypoints fall outside the bounding box, so after cropping there is no pixel corresponding to the keypoint. This is fine for regression, but with pixel-wise correspondence the model cannot represent such points, since we take the argmax location over the spatial dimensions and the model cannot output a location outside its spatial extent. As a fix, I use -1 to 1 as the output coordinate scale and map the bounding box to the lower-right quadrant. Then, if a point falls outside the box on the top, left, or top-left, the model can still place it there. Although this removes the direct correspondence between each input pixel and one output pixel, it works for predicting keypoints outside the bounding box on the top and left. The right and bottom sides are not fixed this way, but visualizations show that this matters less. Two example predictions:

As you can see, it does not predict points beyond the right edge, but it does predict ones beyond the left edge. Fixing the right/bottom sides should improve it further. The Kaggle public score is 8.43268, better than the ResNet18 regression version. However, UNet16 also has larger memory requirements, so this comparison does not directly show whether the improvement comes from the loss (training method) and/or the model.
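
For reference, a sketch of the argmax decoding with this remapping (my reading of the fix described above; the names are hypothetical):

import torch

def keypoints_from_heatmaps(heat, box_w, box_h):
    # heat: (K, H, W) predicted maps; returns (K, 2) coordinates in box pixels.
    k, h, w = heat.shape
    idx = heat.view(k, -1).argmax(dim=-1)  # spatial argmax per channel
    ys, xs = idx // w, idx % w
    # The heatmap spans [-1, 1]^2 while the bounding box occupies the
    # lower-right quadrant [0, 1]^2, so negative coordinates correspond to
    # keypoints above or to the left of the box.
    x_norm = xs.float() / (w - 1) * 2 - 1
    y_norm = ys.float() / (h - 1) * 2 - 1
    return torch.stack([x_norm * box_w, y_norm * box_h], dim=-1)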