CS 194 - Project 4: Classification and Segmentation

Overview

In this assignment, I solve classification of images in the Fashion-MNIST dataset and semantic segmentation of images in the Mini Facade dataset using Deep Nets! I use PyTorch and run my code on the GPU provided by Google Colab.

Part 1: Image Classification

Dataloader

I used the torchvision FashionMNIST training dataset and the PyTorch DataLoader to extract the data and create a 90/10 train-validation split of the training data, while reserving the FashionMNIST test set for test data. The ToTensor transform converts each image to a tensor with pixel values scaled to [0, 1]. Each datum is a 28x28 grayscale image of a fashion item. The labels are as follows:

0: T-Shirt
1: Trousers
2: Pullover
3: Dress
4: Coat
5: Sandal
6: Shirt
7: Sneaker
8: Bag
9: Ankle Boot

I have displayed a large number of examples below.

Here are a few examples (left to right: sneaker, coat, sandal, dress)

CNN

I used a CNN as instructed, with two convolution layers followed by two fully connected layers. Note that I used LeakyReLU with negative slope 0.1 instead of ReLU in order to prevent dying ReLUs. The full structure is as follows:

Conv2d ( 1, 32, 5, 1 )
Leaky_ReLU ( 0.1 )
Conv2d ( 32, 32, 3, 1 )
Leaky_ReLU ( 0.1 )
MaxPool2d ( 2, 2 )
Reshape ( -1, 32 * 11 * 11 )
Linear ( 32 * 11 * 11, 120 )
Leaky_ReLU ( 0.1 )
Linear ( 120, 10 )
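The listing above can be written directly as an nn.Sequential; the hyperparameters are copied from the listing, and the shape comments track a 28x28 input through the network.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, stride=1),   # 1x28x28 -> 32x24x24
    nn.LeakyReLU(0.1),
    nn.Conv2d(32, 32, kernel_size=3, stride=1),  # -> 32x22x22
    nn.LeakyReLU(0.1),
    nn.MaxPool2d(2, 2),                          # -> 32x11x11
    nn.Flatten(),                                # -> 32 * 11 * 11 = 3872
    nn.Linear(32 * 11 * 11, 120),
    nn.LeakyReLU(0.1),
    nn.Linear(120, 10),                          # 10 class logits
)

logits = model(torch.randn(4, 1, 28, 28))
```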

Loss Function and Optimizer

Cross Entropy Loss for the loss function and Adam with learning rate=1e-4 for the optimizer.
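A minimal sketch of this setup with one training step; the linear model and random batch here are stand-ins for the CNN and real data.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(784, 10)  # stand-in for the CNN above
criterion = nn.CrossEntropyLoss()  # expects raw logits and integer labels
optimizer = optim.Adam(model.parameters(), lr=1e-4)

# one training step on a random batch
x = torch.randn(8, 784)
y = torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```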

Results

Test Set Accuracy!

The classifier in my notebook achieved 91.22% accuracy on the test set. An earlier classifier achieved 92.2%, but it was unfortunately lost.

Training & Validation Accuracy

There is not much overfitting, which is good to see. The last two epochs begin to show signs of it, but ultimately training accuracy ends up only a point and change above validation accuracy.

Per class Accuracy

Unsurprisingly, my classifier performs best on the most distinctive items (Trouser, Sandal, Sneaker, Ankle Boot, Bag), while the most nondescript items (Shirt, Coat, Pullover, T-Shirt), which are all quite similar in shape, give it the most trouble. Below the scores, I have placed, for each class, two correctly predicted test examples and two falsely predicted ones. The falsely predicted examples were predicted to be in the respective class, but in ground truth they belong to another class.

Class: validation accuracy | test accuracy
T-Shirt: 83.33 | 87.5
Trouser: 100.0 | 100.0
Pullover: 92.86 | 66.67
Dress: 100.0 | 77.78
Coat: 80.0 | 75.0
Sandal: 100.0 | 100.0
Shirt: 72.73 | 65.0
Sneaker: 83.33 | 100.0
Bag: 100.0 | 95.24
Ankle Boot: 100.0 | 94.12
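The per-class numbers above come down to accuracy restricted to each ground-truth class; a small sketch of that computation (the three-class toy tensors are made up for illustration):

```python
import torch

def per_class_accuracy(preds, labels, num_classes=10):
    """Accuracy within each ground-truth class, as percentages."""
    accs = []
    for c in range(num_classes):
        mask = labels == c                      # examples whose true class is c
        accs.append(100.0 * (preds[mask] == c).float().mean().item())
    return accs

# toy example with 3 classes
labels = torch.tensor([0, 0, 1, 1, 2, 2])
preds = torch.tensor([0, 1, 1, 1, 2, 0])
accs = per_class_accuracy(preds, labels, num_classes=3)  # [50.0, 100.0, 50.0]
```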

Class 0: T-Shirt

correct, correct, incorrect, incorrect

Class 1: Trouser

correct, correct, incorrect, incorrect

Class 2: Pullover

correct, correct, incorrect, incorrect

Class 3: Dress

correct, correct, incorrect, incorrect

Class 4: Coat

correct, correct, incorrect, incorrect

Class 5: Sandal

correct, correct, incorrect, incorrect

Class 6: Shirt

correct, correct, incorrect, incorrect

Class 7: Sneaker

correct, correct, incorrect, incorrect

Class 8: Bag

correct, correct, incorrect, incorrect

Class 9: Ankle Boot

correct, correct, incorrect, incorrect

Visualizing the filters

Here are visualizations of my filters, displayed with matplotlib and arranged into a grid. There are 32 of them, each with a 5x5 kernel. Pretty cool, right?

Part 2: Semantic Segmentation

Dataloader

I used the provided code to extract the dataset. I then created a 90/10 train-validation split of the training data, while reserving the test set for test data. Each datum is a 256x256 RGB image of a building. The pixel label classes and their colors are as follows:

0: other | black
1: facade | blue
2: pillar | green
3: window | orange
4: balcony | red

CNN

I used a CNN as instructed with fewer than 6 convolution layers. The design of the network was inspired by U-Net. To upsample the image, I used ConvTranspose2d layers. Note that I used LeakyReLU with negative slope 0.1 instead of ReLU in order to prevent dying ReLUs and speed up convergence. The full structure is as follows:

Conv2d ( 3, 32, 7, 1 )
Leaky_ReLU ( 0.1 )
Conv2d ( 32, 64, 5, 1 )
Leaky_ReLU ( 0.1 )
Conv2d ( 64, 128, 3, 1 )
Leaky_ReLU ( 0.1 )
MaxPool2d ( 2, 2 )
ConvTranspose2d ( 128, 64, 4, 2 )
Leaky_ReLU ( 0.1 )
ConvTranspose2d ( 64, 32, 4, 2 )
Leaky_ReLU ( 0.1 )
ConvTranspose2d ( 32, 3, 4, 2 )
Leaky_ReLU ( 0.1 )
Conv2d ( 3, 5, 3, 1, 1 )

Loss Function and Optimizer

Cross Entropy Loss for the loss function and Adam with learning rate=3e-3 and weight decay=1e-5 for the optimizer.
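For segmentation, CrossEntropyLoss is applied per pixel: it takes (N, C, H, W) logits and (N, H, W) integer labels. A sketch with random stand-in tensors and a dummy one-layer net in place of the real network:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# per-pixel cross entropy: (N, 5, H, W) logits vs (N, H, W) class indices
logits = torch.randn(2, 5, 256, 256)         # 5 facade classes
labels = torch.randint(0, 5, (2, 256, 256))
loss = criterion(logits, labels)

# optimizer settings from the write-up, on a stand-in network
net = nn.Conv2d(3, 5, 3, padding=1)
optimizer = torch.optim.Adam(net.parameters(), lr=3e-3, weight_decay=1e-5)
```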

Results

Training and Validation Loss


This is some good-looking loss. Training and validation loss were quite close, with no indicators of significant overfitting.

Average Precision on Test Set

The achieved average precision was approximately 48%, a good deal above the 45% target. My neural net performed best on facades and windows; given that those categories are rather distinctive, this result makes sense. Pillars were the most challenging. Perhaps our perception of what is a pillar requires experiential and contextual knowledge that an algorithm lacks (e.g., Greek buildings and McMansions are assumed to have a disgusting number of pillars). For a detailed breakdown:

0 (other): AP = 0.4934
1 (facade): AP = 0.7355
2 (pillar): AP = 0.0754
3 (window): AP = 0.7715
4 (balcony): AP = 0.3201
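One common way to compute such per-class AP scores is scikit-learn's average_precision_score on the per-pixel class probabilities; the random arrays below are stand-ins for real predictions.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# stand-in data: per-pixel probabilities (num_pixels, 5) and true labels
rng = np.random.default_rng(0)
probs = rng.random((1000, 5))
labels = rng.integers(0, 5, 1000)

# one-vs-rest AP for each of the 5 facade classes
aps = [average_precision_score(labels == c, probs[:, c]) for c in range(5)]
mean_ap = float(np.mean(aps))
```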

My own image

This is a picture I took in Lisbon at the Praça do Comércio. The building is much more impressive than this picture suggests, but I thought this small corner of it was perfect, as it has windows, balconies, pillars, and of course a facade. My algorithm had a lot of trouble. Granted, this is a challenging image: it involves a shadow on the bottom half of the building, cars and people at the bottom of the building, arches, two different colors of facade, electric wires cutting through it, and an inner facade with windows and doors.

The algorithm correctly identified the balconies, but also identified a bunch of tiny balconies situated amongst the cars and telephone wires. The engravings at the top of the pillars were also classified as balconies. The algorithm classified the bottoms of the pillars correctly, but the wires interrupted this and the tops were classified as facade. The arches created some confusion for the classification of windows, although the algorithm did classify the car windows correctly (I do not know if this is good or bad). Little spots of other were scattered throughout the photo in random places, though they were generally more common at the bottom of the building and where the line of the shadow extended across the building.