Project 4: Classification and Segmentation

Part 1: Image Classification

I trained a neural network with four convolution layers on the Fashion MNIST dataset, which has the following 10 classes:


Label Class
0 T-Shirt
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle Boot

I split the 60,000 images in the training set into a training set of 48,000 images and a validation set of 12,000 images (the split is sketched below). Here are some sample images from the training set:

Fashion MNIST Training Set Sample Images of a Pullover, Coat, Bag, and T-Shirt, respectively
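
A minimal sketch of the loading and 80/20 split, assuming torchvision's FashionMNIST dataset class and a seed I chose for reproducibility:

    import torch
    from torch.utils.data import DataLoader, random_split
    from torchvision import datasets, transforms

    # Load the 60,000-image Fashion MNIST training set as tensors in [0, 1].
    full_train = datasets.FashionMNIST(
        root="./data", train=True, download=True,
        transform=transforms.ToTensor())

    # 80/20 split: 48,000 training images, 12,000 validation images.
    # The seed is my assumption, for reproducibility.
    gen = torch.Generator().manual_seed(0)
    train_set, val_set = random_split(full_train, [48000, 12000], generator=gen)

    train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=64)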

Here is my neural network, applied sequentially from top to bottom:

Conv2d: 1-32 (3 by 3)
ReLU
Conv2d: 32-32 (3 by 3)
ReLU
MaxPool2D (2 by 2)
Conv2d: 32-64 (3 by 3)
ReLU
Conv2d: 64-64 (3 by 3)
ReLU
MaxPool2D (2 by 2)
FC: 1024-256
FC: 256-10
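
As a minimal PyTorch sketch of this architecture, under my assumption that the 3x3 convolutions are unpadded (which takes the 28x28 input to 64 * 4 * 4 = 1024 features after the second pool, matching the FC sizes above):

    import torch.nn as nn

    # Classifier sketch: two conv blocks, each ending in a 2x2 max pool,
    # followed by two fully connected layers. Unpadded 3x3 convolutions
    # shrink 28x28 -> 26 -> 24 -> pool -> 12 -> 10 -> 8 -> pool -> 4,
    # so the flattened feature size is 64 * 4 * 4 = 1024.
    classifier = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3), nn.ReLU(),
        nn.Conv2d(32, 32, kernel_size=3), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(1024, 256),
        nn.Linear(256, 10),
    )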

Here is a plot of the training and validation accuracy per epoch.

Training and Validation Accuracy for 10 Epochs

Below is the per-class accuracy on the validation and testing sets (a sketch of the computation follows the tables):

Validation Set

Label Class Accuracy
0 T-Shirt 85.2%
1 Trouser 99.0%
2 Pullover 89.0%
3 Dress 94.5%
4 Coat 86.3%
5 Sandal 98.1%
6 Shirt 72.9%
7 Sneaker 96.8%
8 Bag 98.8%
9 Ankle Boot 97.2%

Overall Accuracy: 91.8%

Testing Set

Label Class Accuracy
0 T-Shirt 84.3%
1 Trouser 98.9%
2 Pullover 88.1%
3 Dress 91.9%
4 Coat 87.0%
5 Sandal 98.3%
6 Shirt 71.9%
7 Sneaker 96.8%
8 Bag 98.6%
9 Ankle Boot 96.9%

Overall Accuracy: 91.2%
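
For reference, a minimal sketch of how these per-class numbers can be computed (the helper name and loop structure are my own):

    import torch

    @torch.no_grad()
    def per_class_accuracy(model, loader, num_classes=10):
        # Count correct predictions and totals per ground-truth class.
        correct = torch.zeros(num_classes)
        total = torch.zeros(num_classes)
        model.eval()
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            for c in range(num_classes):
                mask = labels == c
                correct[c] += (preds[mask] == c).sum()
                total[c] += mask.sum()
        return correct / total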


The network has the most trouble differentiating between T-shirts, pullovers, coats, and shirts. This is understandable because all four categories share a similar blocky, roughly rectangular silhouette.


For each class, here are some examples of correctly and incorrectly classified images:

Validation Set

Label Class Two Correct and Two Incorrect
0 T-Shirt
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle Boot

Testing Set

Label Class Two Correct and Two Incorrect
0 T-Shirt
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle Boot

Here are the visualized filters for the first layer of convolution. This layer has 32 output channels, each consisting of a single 3x3 filter (the input is grayscale, so each filter has one input channel).
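
A minimal sketch of producing such a grid with matplotlib, assuming the classifier sketch above (so the first conv layer is classifier[0]):

    import matplotlib.pyplot as plt

    # Weights of the first conv layer: shape (32, 1, 3, 3).
    weights = classifier[0].weight.detach().cpu()

    fig, axes = plt.subplots(4, 8, figsize=(8, 4))
    for i, ax in enumerate(axes.flat):
        ax.imshow(weights[i, 0], cmap="gray")  # one 3x3 filter per panel
        ax.axis("off")
    plt.show()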


Part 2: Semantic Segmentation

I trained a neural network with six convolution layers on the Mini Facade dataset, which has the following 5 classes:


Class Color Pixel Value
others black 0
facade blue 1
pillar green 2
window orange 3
balcony red 4

Here is my neural network, applied sequentially from top to bottom:

Conv2d: 3-32 (3 by 3)
ReLU
MaxPool2D (2 by 2)
Conv2d: 32-64 (3 by 3)
ReLU
MaxPool2D (2 by 2)
Conv2d: 64-128 (3 by 3)
ReLU
MaxPool2D (2 by 2)
ConvTranspose2d: 128
Conv2d: 128-64 (3 by 3)
ReLU
ConvTranspose2d: 64
Conv2d: 64-32 (3 by 3)
ReLU
ConvTranspose2d: 32
Conv2d: 32-5 (3 by 3)
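
A PyTorch sketch of this architecture, under my assumptions that the 3x3 convolutions use padding 1 and that each ConvTranspose2d uses a 2x2 kernel with stride 2 to undo one pooling step, so a 256x256 input produces 256x256 per-pixel scores for the 5 classes:

    import torch.nn as nn

    # Segmentation sketch: three downsampling conv blocks, then three
    # upsampling stages that each pair a ConvTranspose2d with a conv.
    # padding=1 keeps spatial size fixed through each 3x3 convolution,
    # and each stride-2 transposed convolution doubles it back.
    segmenter = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                        # 256 -> 128
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                        # 128 -> 64
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                        # 64 -> 32
        nn.ConvTranspose2d(128, 128, kernel_size=2, stride=2),  # 32 -> 64
        nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2),    # 64 -> 128
        nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2),    # 128 -> 256
        nn.Conv2d(32, 5, kernel_size=3, padding=1),             # per-pixel class scores
    )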

I modelled my neural network for this part after the one I created in the previous part, so I opted for the same loss function (CrossEntropyLoss) and the same pattern of a convolution layer followed by a rectified linear unit before max pooling. In order for the dimensions to work out, I followed the advice in the spec recommending a ConvTranspose2d for every MaxPool2d layer. I also noticed that I needed more layers to increase the validation accuracy, so I used the maximum number of convolution layers allowed by the spec, which stated 5-6. To select hyperparameters, I increased the number of epochs to 25, at which point the validation error plateaued. Additional experimentation showed that adding more channels to my convolution layers increased the validation accuracy up to a point, after which it started decreasing, most likely due to overfitting. A sketch of the training setup is below.
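
As a minimal sketch of that training setup, assuming the Adam optimizer with a learning rate of 1e-3 and a hypothetical seg_loader over (image, label-map) pairs (the report only specifies CrossEntropyLoss and 25 epochs):

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()  # per-pixel cross-entropy over the 5 classes
    optimizer = torch.optim.Adam(segmenter.parameters(), lr=1e-3)  # assumed optimizer and lr

    for epoch in range(25):
        segmenter.train()
        for images, labels in seg_loader:    # hypothetical loader: (N, 3, 256, 256), (N, 256, 256)
            optimizer.zero_grad()
            logits = segmenter(images)       # (N, 5, 256, 256) per-pixel class scores
            loss = criterion(logits, labels) # CrossEntropyLoss accepts (N, C, H, W) logits directly
            loss.backward()
            optimizer.step()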


Below is a plot of the training and validation loss per epoch.

Training and Validation Loss for 25 Epochs

Below is the average precision (AP) of my network calculated on the testing set; a sketch of the computation follows the table:


Average AP: 0.55261863535

Class Average Precision
others 0.6705710087253407
facade 0.7861797879915271
pillar 0.07096875925096052
window 0.8125927422378851
balcony 0.42278087856745156
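
As a sketch, each class's AP can be computed by treating it as a one-vs-rest problem over all test pixels, here with scikit-learn's average_precision_score (my choice of tooling; the assignment may provide its own AP code):

    import torch
    from sklearn.metrics import average_precision_score

    @torch.no_grad()
    def class_ap(model, loader, num_classes=5):
        # Collect per-pixel class probabilities and ground-truth labels.
        scores, labels = [], []
        model.eval()
        for images, targets in loader:
            probs = model(images).softmax(dim=1)  # (N, 5, H, W)
            scores.append(probs.permute(0, 2, 3, 1).reshape(-1, num_classes))
            labels.append(targets.reshape(-1))
        scores = torch.cat(scores).numpy()
        labels = torch.cat(labels).numpy()
        # One-vs-rest AP: class-c probability scored against binary ground truth.
        return [average_precision_score(labels == c, scores[:, c])
                for c in range(num_classes)]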

Here are some examples of my network's output from the testing set:


Input

Ground Truth

Output

Input

Ground Truth

Output

Input

Ground Truth

Output

Input

Output


The model performs well when identifying facade and windows. It does not do as well with balconies and does extremely poorly with pillars. This makes sense because most of each building in the test images is composed of facade and windows. As a result, the prior probabilities of those two classes are much higher, so the model is more likely to guess one of them even when the conditional probability of another class is higher. For a similar reason, balconies and pillars cover very little area in the training images, so their prior probabilities are low. This means the model is less likely to guess pillars or balconies, and the false negative rates for these two classes are very high.

The last input and output pair is a picture I found online by searching for building images, which I then cropped to 256 by 256 in RGB. The model classified the large windows quite well, but did poorly with the many little windows, which it classified either as entirely facade or as one large window with random balconies. It classified the sky as part of the facade, demonstrating the high false positive rate for the facade category. For the parts of the building alongside the windows, my model classified them as pillars, which makes sense because they share many pillar properties (i.e. long, vertical, and flanked by non-facade regions), even though they should just be facade.