Project 4: Classification and Segmentation

Part 1: Image Classification

I trained a neural network with four convolution layers on the Fashion MNIST dataset, which has the following 10 classes:


Label Class
0 T-Shirt
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle Boot

I split the 60,000 images in the training set into a training set of 48,000 images and a validation set of 12,000 images (the split is sketched below). Here are some sample images from the training set:

Fashion MNIST Training Set Sample Images of a Pullover, Coat, Bag, and T-Shirt, respectively
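
A minimal sketch of the loading and 80/20 split, assuming torchvision's FashionMNIST dataset class and a seed I chose for reproducibility:

    import torch
    from torch.utils.data import DataLoader, random_split
    from torchvision import datasets, transforms

    # Load the 60,000-image Fashion MNIST training set as tensors in [0, 1].
    full_train = datasets.FashionMNIST(
        root="./data", train=True, download=True,
        transform=transforms.ToTensor())

    # 80/20 split: 48,000 training images, 12,000 validation images.
    # The seed is my assumption, for reproducibility.
    gen = torch.Generator().manual_seed(0)
    train_set, val_set = random_split(full_train, [48000, 12000], generator=gen)

    train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=64)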

Here is my neural network, applied sequentially from top to bottom:

Conv2d: 1-32 (3 by 3)
ReLU
Conv2d: 32-32 (3 by 3)
ReLU
MaxPool2D (2 by 2)
Conv2d: 32-64 (3 by 3)
ReLU
Conv2d: 64-64 (3 by 3)
ReLU
MaxPool2D (2 by 2)
FC: 1024-256
FC: 256-10
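
As a minimal PyTorch sketch of this architecture, under my assumption that the 3x3 convolutions are unpadded (which takes the 28x28 input to 64 * 4 * 4 = 1024 features after the second pool, matching the FC sizes above):

    import torch.nn as nn

    # Classifier sketch: two conv blocks, each ending in a 2x2 max pool,
    # followed by two fully connected layers. Unpadded 3x3 convolutions
    # shrink 28x28 -> 26 -> 24 -> pool -> 12 -> 10 -> 8 -> pool -> 4,
    # so the flattened feature size is 64 * 4 * 4 = 1024.
    classifier = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3), nn.ReLU(),
        nn.Conv2d(32, 32, kernel_size=3), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(1024, 256),
        nn.Linear(256, 10),
    )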

Here is a plot of the training and validation accuracy per epoch.

Training and Validation Accuracy for 10 Epochs

Below is the per-class accuracy on the validation and testing sets (a sketch of the computation follows the tables):

Validation Set

Label Class Accuracy
0 T-Shirt 85.2%
1 Trouser 99.0%
2 Pullover 89.0%
3 Dress 94.5%
4 Coat 86.3%
5 Sandal 98.1%
6 Shirt 72.9%
7 Sneaker 96.8%
8 Bag 98.8%
9 Ankle Boot 97.2%

Overall Accuracy: 91.8%

Testing Set

Label Class Accuracy
0 T-Shirt 84.3%
1 Trouser 98.9%
2 Pullover 88.1%
3 Dress 91.9%
4 Coat 87.0%
5 Sandal 98.3%
6 Shirt 71.9%
7 Sneaker 96.8%
8 Bag 98.6%
9 Ankle Boot 96.9%

Overall Accuracy: 91.2%
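
For reference, a minimal sketch of how these per-class numbers can be computed (the helper name and loop structure are my own):

    import torch

    @torch.no_grad()
    def per_class_accuracy(model, loader, num_classes=10):
        # Count correct predictions and totals per ground-truth class.
        correct = torch.zeros(num_classes)
        total = torch.zeros(num_classes)
        model.eval()
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            for c in range(num_classes):
                mask = labels == c
                correct[c] += (preds[mask] == c).sum()
                total[c] += mask.sum()
        return correct / total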


The network has the most trouble differentiating between T-shirts, pullovers, coats, and shirts. This is understandable because all four categories share a similar blocky, roughly rectangular silhouette.


For each class, here are some examples of correctly and incorrectly classified images:

Validation Set

Label Class Two Correct and Two Incorrect
0 T-Shirt
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle Boot

Testing Set

Label Class Two Correct and Two Incorrect
0 T-Shirt
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle Boot

Here are the visualized filters for the first layer of convolution. This layer has 32 output channels, each consisting of a single 3x3 filter (the input is grayscale, so each filter has one input channel).
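
A minimal sketch of producing such a grid with matplotlib, assuming the classifier sketch above (so the first conv layer is classifier[0]):

    import matplotlib.pyplot as plt

    # Weights of the first conv layer: shape (32, 1, 3, 3).
    weights = classifier[0].weight.detach().cpu()

    fig, axes = plt.subplots(4, 8, figsize=(8, 4))
    for i, ax in enumerate(axes.flat):
        ax.imshow(weights[i, 0], cmap="gray")  # one 3x3 filter per panel
        ax.axis("off")
    plt.show()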


Part 2: Semantic Segmentation

I trained a neural network with six convolution layers on the Mini Facade dataset, which has the following 5 classes:


Class Color Pixel Value
others black 0
facade blue 1
pillar green 2
window orange 3
balcony red 4

Here is my neural network, applied sequentially from top to bottom:

Conv2d: 3-32 (3 by 3)
ReLU
MaxPool2D (2 by 2)
Conv2d: 32-64 (3 by 3)
ReLU
MaxPool2D (2 by 2)
Conv2d: 64-128 (3 by 3)
ReLU
MaxPool2D (2 by 2)
ConvTranspose2d: 128
Conv2d: 128-64 (3 by 3)
ReLU
ConvTranspose2d: 64
Conv2d: 64-32 (3 by 3)
ReLU
ConvTranspose2d: 32
Conv2d: 32-5 (3 by 3)
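
A PyTorch sketch of this architecture, under my assumptions that the 3x3 convolutions use padding 1 and that each ConvTranspose2d uses a 2x2 kernel with stride 2 to undo one pooling step, so a 256x256 input produces 256x256 per-pixel scores for the 5 classes:

    import torch.nn as nn

    # Segmentation sketch: three downsampling conv blocks, then three
    # upsampling stages that each pair a ConvTranspose2d with a conv.
    # padding=1 keeps spatial size fixed through each 3x3 convolution,
    # and each stride-2 transposed convolution doubles it back.
    segmenter = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                        # 256 -> 128
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                        # 128 -> 64
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                        # 64 -> 32
        nn.ConvTranspose2d(128, 128, kernel_size=2, stride=2),  # 32 -> 64
        nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2),    # 64 -> 128
        nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2),    # 128 -> 256
        nn.Conv2d(32, 5, kernel_size=3, padding=1),             # per-pixel class scores
    )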

I modelled my neural network for this part after the one I created in the previous part, so I opted for the same loss function (CrossEntropyLoss) and the same pattern of a convolution layer followed by a rectified linear unit before max pooling. In order for the dimensions to work out, I followed the advice in the spec recommending a ConvTranspose2d for every MaxPool2d layer. I also noticed that I needed more layers to increase the validation accuracy, so I used the maximum number of convolution layers allowed by the spec, which stated 5-6. To select hyperparameters, I increased the number of epochs to 25, at which point the validation error plateaued. Additional experimentation showed that adding more channels to my convolution layers increased the validation accuracy up to a point, after which it started decreasing, most likely due to overfitting. A sketch of the training setup is below.
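
As a minimal sketch of that training setup, assuming the Adam optimizer with a learning rate of 1e-3 and a hypothetical seg_loader over (image, label-map) pairs (the report only specifies CrossEntropyLoss and 25 epochs):

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()  # per-pixel cross-entropy over the 5 classes
    optimizer = torch.optim.Adam(segmenter.parameters(), lr=1e-3)  # assumed optimizer and lr

    for epoch in range(25):
        segmenter.train()
        for images, labels in seg_loader:    # hypothetical loader: (N, 3, 256, 256), (N, 256, 256)
            optimizer.zero_grad()
            logits = segmenter(images)       # (N, 5, 256, 256) per-pixel class scores
            loss = criterion(logits, labels) # CrossEntropyLoss accepts (N, C, H, W) logits directly
            loss.backward()
            optimizer.step()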


Below is a plot of the training and validation loss per epoch.

Training and Validation Loss for 25 Epochs

Below is the average precision (AP) of my network calculated on the testing set; a sketch of the computation follows the table:


Average AP: 0.55261863535

Class Average Precision
others 0.6705710087253407
facade 0.7861797879915271
pillar 0.07096875925096052
window 0.8125927422378851
balcony 0.42278087856745156
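
As a sketch, each class's AP can be computed by treating it as a one-vs-rest problem over all test pixels, here with scikit-learn's average_precision_score (my choice of tooling; the assignment may provide its own AP code):

    import torch
    from sklearn.metrics import average_precision_score

    @torch.no_grad()
    def class_ap(model, loader, num_classes=5):
        # Collect per-pixel class probabilities and ground-truth labels.
        scores, labels = [], []
        model.eval()
        for images, targets in loader:
            probs = model(images).softmax(dim=1)  # (N, 5, H, W)
            scores.append(probs.permute(0, 2, 3, 1).reshape(-1, num_classes))
            labels.append(targets.reshape(-1))
        scores = torch.cat(scores).numpy()
        labels = torch.cat(labels).numpy()
        # One-vs-rest AP: class-c probability scored against binary ground truth.
        return [average_precision_score(labels == c, scores[:, c])
                for c in range(num_classes)]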

Here are some examples of my network's output from the testing set:


Input

Ground Truth

Output

Input

Ground Truth

Output

Input

Ground Truth

Output

Input

Output


The model performs well when identifying facade and windows. It does not do as well with balconies and does extremely poorly with pillars. This makes sense because most of each building in the test images is composed of facade and windows. As a result, the prior probabilities of those two classes are much higher, so the model is more likely to guess one of them even when the conditional probability of another class is higher. For a similar reason, balconies and pillars cover very little area in the training images, so their prior probabilities are low. This means the model is less likely to guess pillars or balconies, and the false negative rates for these two classes are very high.

The last input and output pair is a picture I found online by searching for building images, which I then cropped to 256 by 256 in RGB. The model classified the large windows quite well, but did poorly with the many little windows, which it classified either as entirely facade or as one large window with random balconies. It classified the sky as part of the facade, demonstrating the high false positive rate for the facade category. For the parts of the building alongside the windows, my model classified them as pillars, which makes sense because they share many pillar properties (i.e. long, vertical, and flanked by non-facade regions), even though they should just be facade.