Project 4

By Samuel Sunghun Lee [CS194-26 - Fall 2020]

Part 1: Nose Tip Detection

We will first sample some images from my dataloader and visualize them with the ground-truth keypoints given by the dataset, plotting both the full set of landmark points and the nose keypoint.

[Figure: sampled training images with ground-truth nose keypoints]
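As a rough illustration of how the samples are drawn and plotted (the loader name and tensor layout here are my assumptions, not the exact code used in this project):

import matplotlib.pyplot as plt

# Assumed: `train_loader` yields (image, keypoints) batches, with images as
# (N, 1, H, W) grayscale tensors and keypoints as (N, K, 2) pixel coordinates.
images, keypoints = next(iter(train_loader))

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, img, pts in zip(axes, images, keypoints):
    ax.imshow(img.squeeze(0), cmap="gray")           # drop the channel dimension
    ax.scatter(pts[:, 0], pts[:, 1], s=8, c="lime")  # ground-truth keypoints
    ax.axis("off")
plt.show()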

Next, we plot the training and validation loss over the course of training. The neural network that I made is similar to LeNet-5: it has 3 convolutional layers followed by 2 fully connected layers. During training, we use an Adam optimizer with a learning rate of 0.001 and run for 20 epochs.
[Figure: training and validation loss curves]
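Here is a minimal sketch of the kind of network and training setup described above; the channel counts, hidden size, and dummy input resolution are my guesses, while the layer counts, optimizer, learning rate, and epoch count come from the text:

import torch
import torch.nn as nn
import torch.optim as optim

class NoseNet(nn.Module):
    """LeNet-5-style CNN: 3 convolutional layers followed by 2 fully connected layers."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, 5), nn.ReLU(), nn.MaxPool2d(2),   # grayscale input
            nn.Conv2d(12, 20, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),   # infers the flattened size on first forward
            nn.Linear(128, 2),               # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))

net = NoseNet()
net(torch.zeros(1, 1, 60, 80))   # dummy forward at an assumed input resolution to initialize the lazy layer
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)   # trained for 20 epochs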

Finally, I will show 3 facial images where the network detects the nose correctly, or at least reasonably close, and 2 more images where it is way off. I believe some images are predicted poorly because the training set does not have enough data and does not generalize well to faces at different angles or positioned in different parts of the image.

Fairly Good results

[Figure: correct nose detections]
Fairly Bad results

[Figure: incorrect nose detections]

Part 2: Full Facial Keypoints Detection

Sampling some ground-truth images from the dataloader

Again, we first sample some images from my dataloader and visualize them with the ground-truth keypoints given by the dataset, this time plotting the full set of facial landmark points.

[Figure: sampled training images with ground-truth landmarks]

Accuracy (i.e. Loss) Graphs

Now, we plot the training and validation loss over the course of training. We can see that both curves drop drastically and then plateau. According to the validation graph, we might have picked a learning rate that is too high.
[Figure: training and validation loss curves]

Architecture

The neural network that I made is similar to LeNet-5: it has 5 convolutional layers followed by 2 fully connected layers. During training, we use an Adam optimizer with a learning rate of 0.001 and run for 20 epochs.
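A compact sketch of that architecture (the channel counts and dummy input size are my assumptions; the output assumes 58 keypoints, i.e. 116 values per face):

import torch
import torch.nn as nn
import torch.optim as optim

# 5 conv layers followed by 2 fully connected layers.
net = nn.Sequential(
    nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 24, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(24, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 48, 3), nn.ReLU(),
    nn.Conv2d(48, 64, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(256), nn.ReLU(),   # flattened size inferred on first forward
    nn.Linear(256, 58 * 2),          # 58 keypoints x (x, y)
)
net(torch.zeros(1, 1, 120, 160))     # dummy forward at an assumed input resolution
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)   # trained for 20 epochs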

Now, we display some images predicted by the CNN described above. The top 2 images show faces where the network detects the keypoints correctly, and the bottom 2 show images where it detects them incorrectly. I believe some images are predicted poorly because the training set is small and does not generalize well to faces at different angles or positioned differently within the image. In particular, since the output lives in a larger space, a vector of 58 keypoints (116 numbers) per sample, we need more data to train the network well. Thus, there are still a lot of inaccuracies.

Fairly Good results
[Figure: accurate landmark predictions]

Fairly Bad results
[Figure: inaccurate landmark predictions]

Finally, we visualize the convolutional layer filters from Layers 1, 2, and 3. Below, some of the learned filters are shown in a grid.
[Figure: visualized convolutional filters]
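A small sketch of how such a filter grid can be produced (here from the first conv layer of the Sequential sketch above; layers 2 and 3 work the same way by indexing the corresponding modules):

import matplotlib.pyplot as plt

filters = net[0].weight.detach().cpu()   # shape (out_channels, in_channels, kH, kW)

fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f[0], cmap="gray")   # visualize each filter's first (only) input channel
    ax.axis("off")
plt.show()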

Part 3: Train with a Larger Dataset

For the large dataset and the Kaggle submission, I scored an MSE of 8.79, which was in 3rd place at the time of writing. It is evident that a relatively small neural network with the right hyperparameters is sufficient to predict landmarks well. For this part, we use a larger dataset, specifically the ibug face-in-the-wild dataset, to train a facial keypoints detector. This dataset contains 6666 images of varying sizes, and each image has 68 annotated facial keypoints. The network used was a ResNet18 with the input and output layers changed. Images with landmarks predicted on the test set are shown further below.
The following is the loss graph for the above-mentioned network with the above hyperparameters.
[Figure: training and validation loss curves]

Architecture

The architecture of my neural network is based on ResNet18. I simply changed conv layer 1 and the final FC layer so that the network takes grayscale images and outputs a tensor of shape (136,). The hyperparameters are the interesting part. I tried many combinations: learning rates between 0.001 and 0.0001, batch sizes of 6, 8, and 256, and 10, 20, or 25 epochs. I was very limited by compute, since training 20 epochs took around 4-5 hours for each hyperparameter setting. I also tried various transforms, but it turned out that the smallest set of transforms, i.e. grayscale conversion and normalization, worked best on the test set; random rotation and color jitter did not improve performance.
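A sketch of those two changes (whether pretrained ImageNet weights were used is not stated, so that flag is a guess, and the learning rate shown is just one of the explored values):

import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

net = models.resnet18(pretrained=False)
net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)   # 1-channel grayscale input
net.fc = nn.Linear(512, 136)   # ResNet18's final feature width is 512; 68 keypoints x (x, y) = 136
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)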

Compute

I used the free tier of Google Colab but made sure to get a Tesla T4. I optimized both GPU and CPU usage by loading all the data into RAM, so samples can be fetched in O(1) time from the cache, and I used mixed precision from NVIDIA's APEX to reduce GPU memory so that a batch size of 128 fits without problems. This brought each epoch down to about 48 seconds, so 100 epochs ran in under 1 hour.
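For reference, here is roughly how APEX mixed precision plugs into a PyTorch training loop (the opt_level, the flattening of the targets, and the loader and criterion names are assumptions carried over from the sketches above):

from apex import amp

net = net.cuda()
net, optimizer = amp.initialize(net, optimizer, opt_level="O1")   # enable mixed precision

for images, targets in train_loader:                 # assumed DataLoader with batch size 128
    images, targets = images.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = criterion(net(images), targets.view(targets.size(0), -1))   # flatten (N, 68, 2) -> (N, 136)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                       # backprop through the scaled loss
    optimizer.step()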

New Architecture - After Making Compute Efficient

Now that I had optimized compute, i.e. roughly 45 seconds per epoch, I got to try a lot of things (these are also written up in the Bells and Whistles). I was also able to try ResNet50: I loaded the pretrained model from PyTorch and fine-tuned it for 100 epochs (a sketch of this setup follows the hyperparameters below). For some odd reason, I couldn't get the output to work correctly, so I did not submit to Kaggle with this particular model. However, it was clear that the validation loss for the ResNet50 at 100 epochs was far better than at 10 epochs (obviously).

Hyperparameters

Epochs: 100
Learning Rate: 0.05
Batch Size: 128 for both validation and training
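
A sketch of the ResNet50 fine-tuning setup with these hyperparameters (the grayscale conv1 and 136-way output mirror the ResNet18 changes above; everything else here is an assumption):

import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

net = models.resnet50(pretrained=True)   # pretrained torchvision ResNet50
net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)   # grayscale input
net.fc = nn.Linear(2048, 136)            # ResNet50's final feature width is 2048
optimizer = optim.Adam(net.parameters(), lr=0.05)   # batch size 128, trained for 100 epochs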

Output images on the test set.


[Figure: predicted landmarks on test-set images]

Bells and Whistles 1 - Anti-aliased max pool.

Modern convolutional networks are not shift-invariant: small input shifts or translations can cause drastic changes in the output. Commonly used downsampling methods, such as max-pooling, strided convolution, and average-pooling, ignore the sampling theorem. For this bell and whistle, I used the antialiased-cnns package from https://richzhang.github.io/antialiased-cnns/.

Dataset

For this part I used the same dataset as in Part 3: the ibug dataset of 6666 images.

Neural Architecture

I compared two neural networks side by side. The first was an antialiased-cnns ResNet50 model, which I instantiated like this:
import torch.nn as nn
import torch.optim as optim
import antialiased_cnns

net = antialiased_cnns.resnet50(pretrained=True, filter_size=4)   # anti-aliased (blur-pooled) ResNet50
net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)   # accept grayscale input
net.fc = nn.Linear(2048, 136)   # 68 keypoints x (x, y) = 136 outputs for the ibug dataset
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)
net = net.double()   # cast parameters to float64 to match the double-precision inputs
I compared this antialiased network with its vanilla counterpart, the out-of-the-box ResNet50.

Hyperparameters

Epochs: 100
Learning Rate: 0.05
Batch Size: 128 for both validation and training

Compute

Same compute setup as in Part 3: the free tier of Google Colab with a Tesla T4, all data cached in RAM, and mixed precision from NVIDIA's APEX, giving about 48 seconds per epoch (100 epochs in under 1 hour).

Loss Comparison

The following graphs were made with TensorBoard in PyTorch and show the validation loss per epoch.
[Figure: validation loss per epoch, antialiased vs. vanilla ResNet50]
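The logging itself is only a couple of lines; a sketch, assuming an evaluate() helper that returns the mean validation MSE for an epoch:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/antialiased_resnet50")   # hypothetical run name

for epoch in range(100):
    val_loss = evaluate(net, val_loader)              # assumed helper and loader
    writer.add_scalar("loss/validation", val_loss, epoch)
writer.close()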

Performance

As you can see from the graph, the performance was about the same: both networks reached a minimum validation loss of around 88. Since I was computing with mixed precision, the model gradients blew up and became NaN after around 30 or 40 epochs; this is why the graph shows nothing after 40 epochs. It seems that with more epochs the antialiased network could perform better, but due to these APEX precision limitations I could not test further at the compute speed I wanted.

Bells and Whistles 2 - Facial Landmarks + Morphing Sequence

Output images on my favorite ladies.

For this bell and whistle, I took some images of the most beautiful women [BLACKPINK] and ran facial landmark detection with the model above. Then I used the morph sequence from the previous project to make a GIF. Here it is! We can see that for the images without any face coverings, the facial points are spot on. However, when hair covers part of the face, the points become harder to place, and it is difficult to tell whether they are in the right location relative to the true points. Since most of the training images had no face coverings, i.e. the hair was away from the face, the network performs better on images with clearly visible facial features.
[GIF: morph sequence through the predicted landmarks]


Another Blackpink Morph

Because BLACKPINK deserves high-quality morphs, I took very high-resolution album pictures and morphed them the same way I described above. The file size was quite large, so I uploaded it to Imgur. Please click this link to see the second morph: HIGH QUALITY BLACKPINK MORPH