Project 4: Classification and Segmentation

Part 1: Image Classification

I used the Fashion MNIST dataset available in torchvision.datasets.FashionMNIST to train an image classification neural net. The dataset has 10 different classes:

Label  Class
0 T-Shirt
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle Boot

Dataloader

There are 60,000 images in the training set, and 10,000 images in the testing set. I split the training set into 50,000 training images and 10,000 validation images. Here are some sample images from the training set:

Fashion MNIST Training Set Sample Images
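
The split can be done with torch.utils.data.random_split; below is a minimal sketch of the data setup, assuming a plain ToTensor transform (my exact split code isn't shown here):

import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()
full_train = datasets.FashionMNIST(root='./data', train=True,
                                   download=True, transform=transform)
test_set = datasets.FashionMNIST(root='./data', train=False,
                                 download=True, transform=transform)

# Carve 10,000 validation images out of the 60,000 training images.
train_set, val_set = torch.utils.data.random_split(full_train, [50000, 10000])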

CNN Architecture

My architecture for this CNN is as follows:

Conv2d(1, 32, 5)
ReLU
MaxPool(2, 2)
Conv2d(32, 64, 5)
ReLU
MaxPool(2, 2)
Flatten
Linear(64 * 4 * 4, 128)
ReLU
Linear(128, 10)

I started out using the recommended two convolutional layers with 32 channels each, but to increase my accuracy, I increased the second convolutional layer to 64 channels.
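
For reference, here is a minimal PyTorch sketch of this architecture; the class name FashionNet and the forward-pass details are a reconstruction from the layer list above, not necessarily the original code:

import torch.nn as nn
import torch.nn.functional as F

class FashionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 5)    # 1x28x28 -> 32x24x24
        self.conv2 = nn.Conv2d(32, 64, 5)   # 32x12x12 -> 64x8x8
        self.pool = nn.MaxPool2d(2, 2)      # halves spatial size
        self.fc1 = nn.Linear(64 * 4 * 4, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # -> 32x12x12
        x = self.pool(F.relu(self.conv2(x)))   # -> 64x4x4
        x = x.view(x.size(0), -1)              # flatten to 64*4*4
        x = F.relu(self.fc1(x))
        return self.fc2(x)                     # logits for 10 classes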


Loss Function & Optimizer

I used the following loss function and optimizer:

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)


I trained the network on the 50,000 training images, and tuned the hyperparameters based on accuracy on the 10,000 validation images. I trained for 10 epochs with a batch size of 2,000.
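
A sketch of the training loop this implies, reusing net, criterion, and optimizer from above along with the train_set/val_set split from the dataloader sketch (loader names and the validation pass are assumptions):

import torch

train_loader = torch.utils.data.DataLoader(train_set, batch_size=2000, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=2000)

for epoch in range(10):
    net.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(images), labels)
        loss.backward()
        optimizer.step()

    # Measure validation accuracy after each epoch.
    net.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            preds = net(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")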


Results

Here is a plot of the training and validation accuracy per epoch.

Training and Validation Accuracy

Below is the per-class accuracy on the testing set & the validation set:

Testing Set

Label  Class  Accuracy
0 T-Shirt 88%
1 Trouser 97%
2 Pullover 86%
3 Dress 90%
4 Coat 85%
5 Sandal 96%
6 Shirt 69%
7 Sneaker 96%
8 Bag 98%
9 Ankle Boot 97%

Overall Accuracy: 90%

Validation Set

Label  Class  Accuracy
0 T-Shirt 91%
1 Trouser 98%
2 Pullover 85%
3 Dress 90%
4 Coat 87%
5 Sandal 97%
6 Shirt 72%
7 Sneaker 95%
8 Bag 97%
9 Ankle Boot 97%

Overall Accuracy: 91%


The network has the most trouble differentiating between T-Shirts, Coats, Shirts, and Pullovers. This is understandable, since these classes have similar silhouettes and are easily confused.
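
The per-class numbers above can be computed with a simple counting pass; a sketch, assuming the net and val_loader from the earlier snippets:

import torch

classes = ['T-Shirt', 'Trouser', 'Pullover', 'Dress', 'Coat',
           'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot']

correct = [0] * 10
total = [0] * 10
net.eval()
with torch.no_grad():
    for images, labels in val_loader:
        preds = net(images).argmax(dim=1)
        for t, p in zip(labels, preds):
            total[t.item()] += 1
            correct[t.item()] += int(p == t)

for name, c, n in zip(classes, correct, total):
    print(f"{name}: {100 * c / n:.0f}%")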


For each class, here are some examples of correctly and incorrectly classified images:

Classified As    Correct       Incorrect (true class)
0 T-Shirt        T-Shirt       Pullover, Dress
1 Trouser        Trouser       Dress, Shirt
2 Pullover       Pullover      Coat, Shirt
3 Dress          Dress         Trouser, Coat
4 Coat           Coat          Pullover, Dress
5 Sandal         Sandal        T-Shirt, Bag
6 Shirt          Shirt         T-Shirt, Pullover
7 Sneaker        Sneaker       Ankle Boot, Sandal
8 Bag            Bag           T-Shirt, Dress
9 Ankle Boot     Ankle Boot    Sandal, Sneaker


Here are the visualized filters for the first convolutional layer. It has 32 output channels, each a single 5x5 filter since the input is grayscale.
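
A sketch of this visualization, assuming net.conv1 matches the FashionNet sketch above and matplotlib for plotting:

import matplotlib.pyplot as plt

weights = net.conv1.weight.detach().cpu()    # shape (32, 1, 5, 5)
fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0], cmap='gray')             # each 5x5 filter as a grayscale image
    ax.axis('off')
plt.show()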


Part 2: Image Segmentation

I used the MiniFacade dataset to implement semantic segmentation, classifying each individual pixel in an image into one of the following classes:


Class Color Pixel Value
others black 0
facade blue 1
pillar green 2
window orange 3
balcony red 4
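
For display, the pixel values map to colors with a small lookup table. A sketch; the exact RGB values are an approximation of the color names above, not the dataset's official palette:

import numpy as np

PALETTE = np.array([
    [0, 0, 0],       # 0 others: black
    [0, 0, 255],     # 1 facade: blue
    [0, 255, 0],     # 2 pillar: green
    [255, 165, 0],   # 3 window: orange
    [255, 0, 0],     # 4 balcony: red
], dtype=np.uint8)

def colorize(mask):
    # Map an (H, W) array of class ids 0-4 to an (H, W, 3) RGB image.
    return PALETTE[mask]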

Dataloader

There are 906 images in the training set, and 114 images in the testing set. I split the training set data into 725 train images and 181 validation images. Here is an example of a training image, and its corresponding ground truth segmentation.


Original Image
Segmentation

CNN Architecture

My architecture for this CNN is as follows:

Conv2d(3, 16, 5)
ReLU
Conv2d(16, 32, 5)
ReLU
Conv2d(32, 64, 5)
ReLU
MaxPool(2, 2)
ConvTranspose2d(64, 32, 6, stride=2)
ReLU
ConvTranspose2d(32, 16, 5)
ReLU
ConvTranspose2d(16, 3, 5)
ReLU
Conv2d(3, 5, 1)

I chose this architecture after looking at how the network performed on the validation set. I decided to use only one MaxPool operation because MaxPool tends to collapse individual pixel features, which works well for image classification but not so well for image segmentation. I chose a U-Net style architecture similar to the one described in this paper.
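
Here is a minimal PyTorch sketch of this network; the class name SegNet is a placeholder, and the spatial sizes in the comments assume 256x256 inputs:

import torch.nn as nn

class SegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 16, 5), nn.ReLU(),                       # 256 -> 252
            nn.Conv2d(16, 32, 5), nn.ReLU(),                      # 252 -> 248
            nn.Conv2d(32, 64, 5), nn.ReLU(),                      # 248 -> 244
            nn.MaxPool2d(2, 2),                                   # 244 -> 122
            nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ReLU(),   # 122 -> 248
            nn.ConvTranspose2d(32, 16, 5), nn.ReLU(),             # 248 -> 252
            nn.ConvTranspose2d(16, 3, 5), nn.ReLU(),              # 252 -> 256
            nn.Conv2d(3, 5, 1),                                   # per-pixel logits for 5 classes
        )

    def forward(self, x):
        return self.layers(x)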


Loss Function & Optimizer

I used the following loss function and optimizer:

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)


I trained the network on the 725 training images, and tuned the hyperparameters based on accuracy on the 181 validation images. I trained for 50 epochs.
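
Note that nn.CrossEntropyLoss applies per pixel here: the network produces (N, 5, H, W) logits and the target is an (N, H, W) tensor of class ids 0-4. A shape check with dummy tensors:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(2, 5, 256, 256)          # (N, classes, H, W) network output
masks = torch.randint(0, 5, (2, 256, 256))    # (N, H, W) ground-truth class ids
loss = criterion(logits, masks)               # scalar averaged over all pixels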


Results

Below is a plot of the training and validation loss per epoch.

Training and Validation Loss

Below is the average precision of my network calculated on the testing set:


Average AP: 0.51897080071

Class Average Precision
others 0.6416514332595568
facade 0.7211610511400426
pillar 0.0618453133363771
window 0.818657223990157
balcony 0.35153898182387
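
One way to compute these numbers, assuming sklearn's average_precision_score (the course's provided evaluation code may compute AP differently):

import numpy as np
from sklearn.metrics import average_precision_score

def per_class_ap(probs, gt, num_classes=5):
    # probs: (num_pixels, num_classes) softmax scores over all test pixels
    # gt: (num_pixels,) ground-truth class ids
    return [average_precision_score((gt == c).astype(int), probs[:, c])
            for c in range(num_classes)]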

Here are some examples of my network's output from the testing set:


Input Image / Ground Truth / Network Output (three test-set examples)

Input Image / Network Output (Royal Palace photo, discussed below)


The network generally performs well when identifying the facade and windows. It does not do as well with pillars and balconies. This may be for a variety of reasons; for one, every training image had facade & windows, while only a smaller subset had balconies, and an even smaller subset had pillars. Additionally, windows have straighter edges & are much easier to visually identify, so it is expected that the network performs well in segmenting windows. Balconies and pillars are harder to identify because their shapes are not standardized.

The last image & network output pair is a picture I took of the Royal Palace in Madrid, Spain, cropped to (256, 256, 3) to run through the network. The network does not do well at identifying the pillars on the palace, and instead classifies them as part of the facade. It recognizes some of the balconies on the middle and top floors, but not very clearly. The network may also perform worse here because of the large amount of sky in the picture, which was not present in many of the training images.
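
Preparing such a photo only takes a resize or crop to 256x256 and a tensor conversion; a sketch, where the file name is illustrative and net is the segmentation model from above:

import numpy as np
import torch
from PIL import Image

img = Image.open('royal_palace.jpg').convert('RGB')     # illustrative file name
img = img.resize((256, 256))                            # or crop a 256x256 region
x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 255.0
pred = net(x.unsqueeze(0)).argmax(dim=1)[0]             # (256, 256) predicted class ids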