Project 4: Classification and Segmentation

Justin Wang | cs194-26-aeh


Overview

In this project, we train two Convolutional Neural Nets: one to perform image classification, and the other to tackle the very interesting problem of semantic segmentation. Image classification is the classic scenario of inputting an image and getting a prediction of what that image is as the output. Semantic segmentation, on the other hand, acknowledges that the world is far too complex to always yield images featuring a single subject, and therefore seeks to identify multiple types of objects within one image. I enjoyed gaining exposure to the latter, and seeing the side-by-side comparison of the actual per-pixel labels with the net's predictions really emphasizes how neural nets don't possess any of the implicit understandings humans have of the world.

Part 1: Image Classification

Implementation and Training Details

I use a CNN with two convolution layers, each followed by ReLU and max pooling over a 2 by 2 window, followed by two fully connected layers. I used a crude implementation of random search to find sufficient values for the network's various configuration variables: learning rate, number of epochs, number of channels, and size of the "hidden" fully connected layer. A sketch of the architecture is given below.
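Here is a minimal PyTorch sketch of that classifier, parametrized by the searched-over channel count and hidden layer size. The unpadded 3 by 3 kernels (consistent with the learned filters shown later), the ReLU between the two fully connected layers, and the reuse of the same channel count for both convolution layers are my assumptions, as are all the names.

    import torch.nn as nn

    class ClassifierNet(nn.Module):
        """Two conv layers (each with ReLU and 2x2 max pooling), then two
        fully connected layers, for 28x28 grayscale Fashion-MNIST inputs."""

        def __init__(self, channels=57, fc_size=69, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, channels, kernel_size=3),         # 28x28 -> 26x26
                nn.ReLU(),
                nn.MaxPool2d(2),                               # 26x26 -> 13x13
                nn.Conv2d(channels, channels, kernel_size=3),  # 13x13 -> 11x11
                nn.ReLU(),
                nn.MaxPool2d(2),                               # 11x11 -> 5x5
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(channels * 5 * 5, fc_size),
                nn.ReLU(),
                nn.Linear(fc_size, num_classes),
            )

        def forward(self, x):
            return self.classifier(self.features(x))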

The final network was trained using the following values:

Variable        Value
-------------   -----
Learning Rate   0.001
Epochs          15
Channels        57
FC Size         69
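For reference, the crude random search mentioned above could look something like this sketch; the sampling ranges and the train_and_eval helper (which would train a ClassifierNet with a given configuration and report its validation accuracy) are hypothetical.

    import random

    def random_search(train_and_eval, n_trials=20):
        """Sample configurations at random and keep the best one by
        validation accuracy. The ranges here are illustrative only."""
        best_acc, best_cfg = 0.0, None
        for _ in range(n_trials):
            cfg = {
                "lr": 10 ** random.uniform(-4, -2),  # log-uniform learning rate
                "epochs": random.randint(5, 20),
                "channels": random.randint(16, 64),
                "fc_size": random.randint(32, 128),
            }
            acc = train_and_eval(cfg)  # train a net with cfg -> val accuracy
            if acc > best_acc:
                best_acc, best_cfg = acc, cfg
        return best_cfg, best_acc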

Training and Validation Accuracy
[plot: training and validation accuracy over epochs]

Per-class Accuracy
[charts: per-class accuracy on the validation set and on the test set]

The class that the model most evidently struggled on was class 6, shirts. I think this can be partially explained by the fact that many shirts appear very similar to pullovers, t-shirts, and coats (see below).

Examples

Left: correctly labeled examples. Right: incorrectly labeled examples, shown with their predicted labels.

Learned Filters
The 57 visualized 3 by 3 filters of the first convolution layer
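A visualization like the one described could be produced with a short matplotlib routine along these lines; show_filters is a hypothetical helper that assumes the ClassifierNet sketch from earlier, where features[0] is the first convolution layer.

    import matplotlib.pyplot as plt

    def show_filters(model, cols=8):
        """Render each 3x3 filter of the first conv layer as a grayscale tile."""
        weights = model.features[0].weight.detach().cpu()  # shape (57, 1, 3, 3)
        rows = (len(weights) + cols - 1) // cols
        fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
        for ax in axes.flat:
            ax.axis("off")
        for ax, w in zip(axes.flat, weights):
            ax.imshow(w[0], cmap="gray")
        plt.show()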

Part 2: Semantic Segmentation

Model Architecture

Before doing random sampling for hyperparameters, I looked into some of the literature on semantic segmentation. I found this paper on Fully Convolutional Networks. In their findings, they note that a fully convolutional architecture (i.e., no dense layers) that emulates the structure of the well-known VGG-16 CNN architecture, except with a transposed convolution at the end for upsampling, performed the best. I therefore adapted these findings to our dataset.

The model architecture is composed of five convolutions as per spec, each followed by a ReLU. After every two convolutions (and after the final one on its own), there is a maxpool layer. Finally, there is a transposed convolution that effectively does the upsampling. The channel sizes for the convolution layers are 32, 64, 128, 256, and 512. All convolution filters are 3 by 3, and each feature map is padded by 1. The maxpool kernel size is 2 and the stride is also 2, so as to halve the spatial dimensions. The final transposed convolution has a kernel size of 8 and a stride of 8, as well as 5 output channels, to produce a 256 by 256 by 5 output.

I decided to maxpool only after every two convolutions to avoid reducing the image size too much, and I made the final transposed convolution's kernel size 8 to give it some granularity over the 32 by 32 by 512 feature map it convolves over (I'm not sure this is the right intuition, but the results are good). A sketch of this architecture follows.
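Assuming 3-channel RGB inputs (the layer sizes follow the description above; the names are mine), a minimal PyTorch sketch of this model:

    import torch.nn as nn

    def conv_relu(in_ch, out_ch):
        # 3x3 convolution with padding 1 (preserves spatial size), then ReLU
        return [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]

    class SegNet(nn.Module):
        """Five 3x3 convolutions (32/64/128/256/512 channels), a maxpool
        after every two (and after the last), then an 8x8, stride-8
        transposed convolution back up to 256x256 with 5 class channels."""

        def __init__(self, num_classes=5):
            super().__init__()
            self.net = nn.Sequential(
                *conv_relu(3, 32),
                *conv_relu(32, 64),
                nn.MaxPool2d(2, 2),   # 256x256 -> 128x128
                *conv_relu(64, 128),
                *conv_relu(128, 256),
                nn.MaxPool2d(2, 2),   # 128x128 -> 64x64
                *conv_relu(256, 512),
                nn.MaxPool2d(2, 2),   # 64x64 -> 32x32
                # upsample the 32x32x512 feature map to 256x256xnum_classes
                nn.ConvTranspose2d(512, num_classes, kernel_size=8, stride=8),
            )

        def forward(self, x):
            return self.net(x)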

Training and Validation Loss
[plot: training and validation loss over epochs]

AP Values
[table: average precision for each class]

Example
[images: the original building photo, its per-pixel labels, and the model's predictions]

This example is interesting because it contains pillars, which are class 2 (AP = 0.18). The color map for pillars is green, and it's clearly visible that nearly none of the actual pillars are labeled properly, while some of the wall is labeled as a pillar. However, the model is quite accurate with the facade of the building (blue), and relatively accurate with the windows (orange) as well. I think this might be because the designs of these don't vary as much in the training data.

The White House

I didn't have any pictures on hand, and given the current COVID-19 situation I didn't want to go out and take one, so I used a cropped picture of the White House.

[images: the original image and the model's prediction]

The model gets the windows mostly correct, but also produces some false positives. Given the White House's prominent and classically designed pillars, it also detects the pillars better than expected. However, it heavily struggles with the facade, given the amount of false positives it produces: namely, the entire sky and lawn. This might be because the sky is rather similar in color to the White House, and a lawn might not appear much in the training data.