Project 4: Classification and Segmentation

Part 1: Image Classification

I used the Fashion MNIST dataset available in torchvision.datasets.FashionMNIST to train an image classification neural net. The dataset has 10 different classes:

Label  Class
0 T-Shirt
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle Boot

Dataloader

There are 60,000 images in the training set, and 10,000 images in the testing set. I split the training set into 50,000 training images and 10,000 validation images. Here are some sample images from the training set:

Fashion MNIST Training Set Sample Images
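
The split can be done with torch.utils.data.random_split; below is a minimal sketch of the data setup, assuming a plain ToTensor transform (my exact split code isn't shown here):

import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()
full_train = datasets.FashionMNIST(root='./data', train=True,
                                   download=True, transform=transform)
test_set = datasets.FashionMNIST(root='./data', train=False,
                                 download=True, transform=transform)

# Carve 10,000 validation images out of the 60,000 training images.
train_set, val_set = torch.utils.data.random_split(full_train, [50000, 10000])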

CNN Architecture

My architecture for this CNN is as follows:

Conv2d(1, 32, 5)
ReLU
MaxPool(2, 2)
Conv2d(32, 64, 5)
ReLU
MaxPool(2, 2)
Flatten
Linear(64 * 4 * 4, 128)
ReLU
Linear(128, 10)

I started out using the recommended two convolutional layers with 32 channels each, but to increase my accuracy, I increased the second convolutional layer to 64 channels.
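
For reference, here is a minimal PyTorch sketch of this architecture; the class name FashionNet and the forward-pass details are a reconstruction from the layer list above, not necessarily the original code:

import torch.nn as nn
import torch.nn.functional as F

class FashionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 5)    # 1x28x28 -> 32x24x24
        self.conv2 = nn.Conv2d(32, 64, 5)   # 32x12x12 -> 64x8x8
        self.pool = nn.MaxPool2d(2, 2)      # halves spatial size
        self.fc1 = nn.Linear(64 * 4 * 4, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # -> 32x12x12
        x = self.pool(F.relu(self.conv2(x)))   # -> 64x4x4
        x = x.view(x.size(0), -1)              # flatten to 64*4*4
        x = F.relu(self.fc1(x))
        return self.fc2(x)                     # logits for 10 classes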


Loss Function & Optimizer

I used the following loss function and optimizer:

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)


I trained the network on the 50,000 training images, and tuned the hyperparameters based on accuracy on the 10,000 validation images. I trained for 10 epochs with a batch size of 2,000.
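
A sketch of the training loop this implies, reusing net, criterion, and optimizer from above along with the train_set/val_set split from the dataloader sketch (loader names and the validation pass are assumptions):

import torch

train_loader = torch.utils.data.DataLoader(train_set, batch_size=2000, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=2000)

for epoch in range(10):
    net.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(images), labels)
        loss.backward()
        optimizer.step()

    # Measure validation accuracy after each epoch.
    net.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            preds = net(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")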


Results

Here is a plot of the training and validation accuracy per epoch.

Training and Validation Accuracy

Below is the per-class accuracy on the testing set & the validation set:

Testing Set

Label  Class  Accuracy
0 T-Shirt 88%
1 Trouser 97%
2 Pullover 86%
3 Dress 90%
4 Coat 85%
5 Sandal 96%
6 Shirt 69%
7 Sneaker 96%
8 Bag 98%
9 Ankle Boot 97%

Overall Accuracy: 90%

Validation Set

Label  Class  Accuracy
0 T-Shirt 91%
1 Trouser 98%
2 Pullover 85%
3 Dress 90%
4 Coat 87%
5 Sandal 97%
6 Shirt 72%
7 Sneaker 95%
8 Bag 97%
9 Ankle Boot 97%

Overall Accuracy: 91%


The network has the most trouble differentiating between T-Shirts, Coats, Shirts, and Pullovers. This is understandable, since these classes have similar silhouettes and are easily confused.
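
The per-class numbers above can be computed with a simple counting pass; a sketch, assuming the net and val_loader from the earlier snippets:

import torch

classes = ['T-Shirt', 'Trouser', 'Pullover', 'Dress', 'Coat',
           'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot']

correct = [0] * 10
total = [0] * 10
net.eval()
with torch.no_grad():
    for images, labels in val_loader:
        preds = net(images).argmax(dim=1)
        for t, p in zip(labels, preds):
            total[t.item()] += 1
            correct[t.item()] += int(p == t)

for name, c, n in zip(classes, correct, total):
    print(f"{name}: {100 * c / n:.0f}%")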


For each class, here are some examples of correctly and incorrectly classified images:

Classified As    Correct       Incorrect (true class)
0 T-Shirt        T-Shirt       Pullover, Dress
1 Trouser        Trouser       Dress, Shirt
2 Pullover       Pullover      Coat, Shirt
3 Dress          Dress         Trouser, Coat
4 Coat           Coat          Pullover, Dress
5 Sandal         Sandal        T-Shirt, Bag
6 Shirt          Shirt         T-Shirt, Pullover
7 Sneaker        Sneaker       Ankle Boot, Sandal
8 Bag            Bag           T-Shirt, Dress
9 Ankle Boot     Ankle Boot    Sandal, Sneaker


Here are the visualized filters for the first convolutional layer. It has 32 output channels, each a single 5x5 filter since the input is grayscale.
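
A sketch of this visualization, assuming net.conv1 matches the FashionNet sketch above and matplotlib for plotting:

import matplotlib.pyplot as plt

weights = net.conv1.weight.detach().cpu()    # shape (32, 1, 5, 5)
fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0], cmap='gray')             # each 5x5 filter as a grayscale image
    ax.axis('off')
plt.show()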


Part 2: Image Segmentation

I used the MiniFacade dataset to implement semantic segmentation, classifying each individual pixel in an image into one of the following classes:


Class Color Pixel Value
others black 0
facade blue 1
pillar green 2
window orange 3
balcony red 4
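
For display, the pixel values map to colors with a small lookup table. A sketch; the exact RGB values are an approximation of the color names above, not the dataset's official palette:

import numpy as np

PALETTE = np.array([
    [0, 0, 0],       # 0 others: black
    [0, 0, 255],     # 1 facade: blue
    [0, 255, 0],     # 2 pillar: green
    [255, 165, 0],   # 3 window: orange
    [255, 0, 0],     # 4 balcony: red
], dtype=np.uint8)

def colorize(mask):
    # Map an (H, W) array of class ids 0-4 to an (H, W, 3) RGB image.
    return PALETTE[mask]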

Dataloader

There are 906 images in the training set, and 114 images in the testing set. I split the training set data into 725 train images and 181 validation images. Here is an example of a training image, and its corresponding ground truth segmentation.


Original Image
Segmentation

CNN Architecture

My architecture for this CNN is as follows:

Conv2d(3, 16, 5)
ReLU
Conv2d(16, 32, 5)
ReLU
Conv2d(32, 64, 5)
ReLU
MaxPool(2, 2)
ConvTranspose2d(64, 32, 6, stride=2)
ReLU
ConvTranspose2d(32, 16, 5)
ReLU
ConvTranspose2d(16, 3, 5)
ReLU
Conv2d(3, 5, 1)

I chose this architecture after looking at how the network performed on the validation set. I decided to use only one MaxPool operation because MaxPool tends to collapse individual pixel features, which works well for image classification but not so well for image segmentation. I chose a U-Net style architecture similar to the one described in this paper.
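
Here is a minimal PyTorch sketch of this network; the class name SegNet is a placeholder, and the spatial sizes in the comments assume 256x256 inputs:

import torch.nn as nn

class SegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 16, 5), nn.ReLU(),                       # 256 -> 252
            nn.Conv2d(16, 32, 5), nn.ReLU(),                      # 252 -> 248
            nn.Conv2d(32, 64, 5), nn.ReLU(),                      # 248 -> 244
            nn.MaxPool2d(2, 2),                                   # 244 -> 122
            nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ReLU(),   # 122 -> 248
            nn.ConvTranspose2d(32, 16, 5), nn.ReLU(),             # 248 -> 252
            nn.ConvTranspose2d(16, 3, 5), nn.ReLU(),              # 252 -> 256
            nn.Conv2d(3, 5, 1),                                   # per-pixel logits for 5 classes
        )

    def forward(self, x):
        return self.layers(x)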


Loss Function & Optimizer

I used the following loss function and optimizer:

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)


I trained the network on the 725 training images, and tuned the hyperparameters based on accuracy on the 181 validation images. I trained for 50 epochs.
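
Note that nn.CrossEntropyLoss applies per pixel here: the network produces (N, 5, H, W) logits and the target is an (N, H, W) tensor of class ids 0-4. A shape check with dummy tensors:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(2, 5, 256, 256)          # (N, classes, H, W) network output
masks = torch.randint(0, 5, (2, 256, 256))    # (N, H, W) ground-truth class ids
loss = criterion(logits, masks)               # scalar averaged over all pixels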


Results

Below is a plot of the training and validation loss per epoch.

Training and Validation Loss

Below is the average precision of my network calculated on the testing set:


Average AP: 0.51897080071

Class Average Precision
others 0.6416514332595568
facade 0.7211610511400426
pillar 0.0618453133363771
window 0.818657223990157
balcony 0.35153898182387
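
One way to compute these numbers, assuming sklearn's average_precision_score (the course's provided evaluation code may compute AP differently):

import numpy as np
from sklearn.metrics import average_precision_score

def per_class_ap(probs, gt, num_classes=5):
    # probs: (num_pixels, num_classes) softmax scores over all test pixels
    # gt: (num_pixels,) ground-truth class ids
    return [average_precision_score((gt == c).astype(int), probs[:, c])
            for c in range(num_classes)]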

Here are some examples of my network's output from the testing set:


Input Image / Ground Truth / Network Output (three test-set examples)

Input Image / Network Output (Royal Palace photo, discussed below)


The network generally performs well when identifying the facade and windows. It does not do as well with pillars and balconies. This may be for a variety of reasons; for one, every training image had facade & windows, while only a smaller subset had balconies, and an even smaller subset had pillars. Additionally, windows have straighter edges & are much easier to visually identify, so it is expected that the network performs well in segmenting windows. Balconies and pillars are harder to identify because their shapes are not standardized.

The last image & network output pair is a picture I took of the Royal Palace in Madrid, Spain, cropped to (256, 256, 3) to run through the network. The network does not do well at identifying the pillars on the palace, and instead classifies them as part of the facade. It recognizes some of the balconies on the middle and top floors, but not very clearly. The network may also perform worse here because of the large amount of sky in the picture, which was not present in many of the training images.
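
Preparing such a photo only takes a resize or crop to 256x256 and a tensor conversion; a sketch, where the file name is illustrative and net is the segmentation model from above:

import numpy as np
import torch
from PIL import Image

img = Image.open('royal_palace.jpg').convert('RGB')     # illustrative file name
img = img.resize((256, 256))                            # or crop a 256x256 region
x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 255.0
pred = net(x.unsqueeze(0)).argmax(dim=1)[0]             # (256, 256) predicted class ids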