CS194-26 Project 4: Classification and Segmentation

by Heidi Dong

In this project, I learned what CNNs are and how to use them in PyTorch!

Part 1: Image Classification

The goal is to classify images from the Fashion MNIST dataset. I followed PyTorch's classification and neural network tutorials to complete this part.

Dataloader

The dataset is already available in torchvision.datasets.FashionMNIST. There are 10 classes:

0 T-shirt
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot

Here are some sample images of an ankle boot, coat, bag, and sneaker:

CNN

My neural net architecture was:

  1. convolutional layer (1 input channel, 32 output channels, 5x5 kernel, stride of 1) followed by ReLU and max pooling
  2. convolutional layer (32 input channels, 32 output channels, 5x5 kernel, stride of 1) followed by ReLU and max pooling
  3. fully connected layer (512 inputs, 120 outputs) followed by ReLU
  4. fully connected layer (120 inputs, 10 outputs)
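The architecture above could be written in PyTorch roughly as follows. This is a sketch of my description, not my exact code: I'm assuming 2x2 max pooling and no convolutional padding, which is what makes the flattened size work out to 512 for 28x28 Fashion MNIST images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FashionNet(nn.Module):
    """Sketch of the classifier described above (2x2 pooling assumed)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5)   # 28x28 -> 24x24
        self.conv2 = nn.Conv2d(32, 32, kernel_size=5)  # 12x12 -> 8x8
        self.fc1 = nn.Linear(32 * 4 * 4, 120)          # 512 = 32 channels * 4 * 4
        self.fc2 = nn.Linear(120, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # -> 32 x 12 x 12
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # -> 32 x 4 x 4
        x = torch.flatten(x, 1)                     # -> 512 features per image
        x = F.relu(self.fc1(x))
        return self.fc2(x)                          # 10 class scores
```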

I experimented with the kernel size in the convolutional layers and the number of inputs and outputs in the first fully connected network layer, and found that the above values gave the highest average validation accuracy.

Loss Function and Optimizer

I initially trained the neural network using Adam with cross entropy loss and a learning rate of 0.01. I experimented with the learning rate and number of epochs, and found that a learning rate of 0.001 and around 5-10 epochs worked best. At 9 epochs, my validation accuracy was 90.66%.
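The training setup can be sketched like this. The function name and the `trainloader` argument are placeholders; the actual batching details aren't shown here.

```python
import torch
import torch.nn as nn

def train(net, trainloader, epochs=9, lr=1e-3):
    """Train with the settings described above: Adam + cross entropy loss."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for epoch in range(epochs):
        running_loss = 0.0
        for images, labels in trainloader:
            optimizer.zero_grad()
            loss = criterion(net(images), labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"epoch {epoch + 1}: loss {running_loss / len(trainloader):.4f}")
    return net
```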

Results

Below is a plot of the train and validation accuracy during the training process. Note that the y-axis does not start at zero. The train accuracy eventually climbs above the validation accuracy because the model begins to overfit to the training data.

This is a breakdown of the per-class accuracies on the test data:

0 T-shirt 87%
1 Trouser 96%
2 Pullover 88%
3 Dress 90%
4 Coat 80%
5 Sandal 96%
6 Shirt 65%
7 Sneaker 98%
8 Bag 98%
9 Ankle boot 95%

Correctly/incorrectly classified images

Most of the class accuracies are pretty high. Coats and shirts were misclassified most often. My hypothesis is that shirts are easily confused with t-shirts and pullovers. This table shows examples of images in each class that were correctly classified and misclassified.

Class Correctly classified images Misclassified images
T-shirt
Trouser
Pullover
Dress
Coat
Sandal
Shirt
Sneaker
Bag
Ankle boot

Visualizing the Learned Filters

I couldn't really figure out what these mean, but in general, looking at the learned filters can help explain why and how a piece of data gets classified the way it does. These are the learned filters of the first convolutional layer of the network:

Part 2: Semantic Segmentation

Semantic segmentation refers to labeling each pixel of an image with its correct object class. In this second part, I attempted to train a neural network to label the different parts of a building facade.

Dataloader

I used 80% of the data for training and 20% for validation. These are the classes that will be used to label the parts of each image:

Class Color Value
others black 0
facade blue 1
pillar green 2
window orange 3
balcony red 4

Here is an example of a facade image and its expected, labeled version:
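The 80/20 split can be done by shuffling the dataset indices and slicing. This is just a sketch of the idea (my actual code may have used a built-in utility like `torch.utils.data.random_split` instead):

```python
import random

def split_indices(n, val_frac=0.2, seed=0):
    """Shuffle dataset indices and split them into train/validation lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded so the split is reproducible
    n_val = int(n * val_frac)
    return idx[n_val:], idx[:n_val]   # train indices, validation indices
```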

CNN

I googled "semantic segmentation cnn" to get ideas on how to structure my network, and decided to try making mine similar to the U-Net architecture from Ronneberger et al. 2015. After lots of experimentation, my CNN architecture is:

  1. convolutional layer (3 input channels, 64 output channels, 3x3 kernel) followed by max pooling (stride=2)
  2. convolutional layer (64 input channels, 128 output channels, 3x3 kernel) followed by max pooling (stride=2)
  3. convolutional layer (128 input channels, 256 output channels, 3x3 kernel) followed by max pooling (stride=2), and then upsampling (scale=2)
  4. convolutional layer (256 input channels, 128 output channels, 3x3 kernel) followed by upsampling (scale=2)
  5. convolutional layer (128 input channels, 64 output channels, 3x3 kernel) followed by upsampling (scale=2)
  6. convolutional layer (64 input channels, 5 output channels, 3x3 kernel)
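Assuming each 3x3 convolution uses padding of 1 (my assumption; without it the output mask would not line up with the input image), the spatial resolution is halved three times on the way down and doubled three times on the way back up, so the output is the same size as the input. A quick sanity check of the shapes, assuming 256x256 inputs:

```python
def trace_shapes(size=256):
    """Follow the spatial size through 3 stride-2 poolings, then 3 x2 upsamplings."""
    shapes = [size]
    for _ in range(3):   # layers 1-3: max pooling with stride 2 halves the size
        size //= 2
        shapes.append(size)
    for _ in range(3):   # layers 3-5: upsampling by scale 2 doubles it again
        size *= 2
        shapes.append(size)
    return shapes
```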

Loss Function and Optimizer

I used cross entropy loss and trained my network on Adam with a learning rate of 1e-3 and weight decay of 1e-5.
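For reference, cross entropy loss is just the negative log of the softmax probability assigned to the true class (here applied per pixel, over the 5 classes). A minimal version for a single prediction:

```python
import math

def cross_entropy(logits, target):
    """Negative log softmax probability of the target class."""
    m = max(logits)  # subtract the max for numerical stability
    exp = [math.exp(z - m) for z in logits]
    log_prob = (logits[target] - m) - math.log(sum(exp))
    return -log_prob
```

A confident, correct prediction gives a loss near zero; a confident, wrong one is heavily penalized.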

Results

I trained my network for 20 epochs. Below is a plot of the train and validation losses during the training process.

I was able to achieve an average precision of 47%. Here is the breakdown for each class:

Class Color AP
others black 0.578
facade blue 0.592
pillar green 0.093
window orange 0.797
balcony red 0.329

I tried running the trained model on some other pictures. These are two buildings I like in SF, the Phelan Building and an apartment on Union Street.

In the Phelan facade below, my network was able to recognize most of the windows (orange). However, the parts with darker shadows were classified as balconies (red), and decorative details that are actually part of the facade were labeled as other (black). I do not take credit for the image; I found it on Google Images.

facade of Phelan Building in SF labeled facade

The apartment building was not labeled as well, probably because of the irregular shapes of its bay windows: windows and balconies were often confused. The image is from Google Street View.

Google Maps street view of 1229 Union St, SF labeled facade

Here is an example from the test set. The left is the facade, the middle is the expected classification, and the right is the result. Clearly, my neural net is not good at distinguishing pillars and balconies, as indicated by their low APs.

facade ground truth for labeled facade labeled facade

Final thoughts

As someone with no ML experience coming into this class, some parts of this project were pretty challenging for me. For example, I had no idea where to start when I was adding layers to the neural net for segmentation. This was a nice introduction to PyTorch and Google Colab, but I still don't think I understand what is really going on in the middle of all those layers.