Classification and Segmentation

Chris Mitchell

Overview

For this project I applied PyTorch to the computer vision problems of classification and segmentation.

Classification

Using the Fashion MNIST dataset, I trained a neural network to classify clothing images into their different types.

Data from the Fashion MNIST dataset

For this neural network, I used two rounds of convolution and max pooling followed by two linear layers, with ReLU nonlinearities after the convolutions and between the linear layers. I settled on the following parameters (a PyTorch sketch follows the table):

Layer           Parameters
Convolution 1   1 to 32 channels, kernel size 7
Pooling 1       Kernel size 2
Convolution 2   32 to 64 channels, kernel size 5
Pooling 2       Kernel size 5
Linear 1        64 to 100
Linear 2        100 to number of classes

Global Parameters
Criterion       Cross-Entropy Loss
Optimizer       Adam
Learning Rate   0.0005
Weight Decay    0.001
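A minimal PyTorch sketch consistent with these parameters (the module layout and the absence of padding are my assumptions; without padding, a 28 x 28 input shrinks to 1 x 1 x 64 after the second pooling layer, which matches the 64 inputs of Linear 1):

```python
import torch
import torch.nn as nn

class FashionClassifier(nn.Module):
    """Hypothetical reconstruction of the classifier described above."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7),   # 28x28 -> 22x22
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),       # 22x22 -> 11x11
            nn.Conv2d(32, 64, kernel_size=5),  # 11x11 -> 7x7
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=5),       # 7x7 -> 1x1
        )
        self.classifier = nn.Sequential(
            nn.Linear(64, 100),   # 64 = 64 channels * 1 * 1
            nn.ReLU(),
            nn.Linear(100, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)   # (N, 64, 1, 1) -> (N, 64)
        return self.classifier(x)
```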

I decided on the layer parameters as a tradeoff between training time and accuracy. I increased the convolution kernel sizes and linear layer sizes until I no longer noticed much difference, then decreased the learning rate until the accuracy oscillated less between iterations. After that, I increased the number of epochs until the accuracy leveled off. Finally, I tuned the weight decay to balance overfitting: large enough that training accuracy did not pull far ahead of validation accuracy, but not so large that overall accuracy suffered from failing to properly learn the system. A sketch of this training setup follows.
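A hedged sketch of that training loop, assuming the torchvision FashionMNIST loader and the FashionClassifier sketch above (batch size and epoch count are my assumptions; the loss, optimizer, learning rate, and weight decay come from the table):

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.FashionMNIST("data", train=True, download=True,
                                  transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)  # batch size assumed

model = FashionClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0005, weight_decay=0.001)

for epoch in range(20):  # epoch count assumed; increased until accuracy leveled off
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```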

Results

Validation and Training Accuracy Over Time

We see the accuracy increasing over time. It started fairly high, around 82% on the first iteration, and rose to around 90% for training and 87% for validation. Adding more layers, using larger kernels, or trying other nonlinearities might improve this accuracy further.

Per Class Accuracy

Clothing      Accuracy
T-shirt/top   85%
Trouser       96%
Pullover      83%
Dress         90%
Coat          68%
Sandal        97%
Shirt         71%
Sneaker       95%
Bag           96%
Ankle boot    95%

These accuracies are all fairly high, with many above 90%. Looking at the sample results below, even I had difficulty identifying some of the images, so ambiguity inherent in the dataset may be preventing higher accuracy.
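For reference, the per-class numbers can be tallied from validation predictions along these lines (a sketch; val_loader is assumed to be a DataLoader over the validation split, built like the training loader above):

```python
import torch

num_classes = 10
correct = torch.zeros(num_classes)
total = torch.zeros(num_classes)

model.eval()
with torch.no_grad():
    for images, labels in val_loader:  # val_loader: assumed validation DataLoader
        preds = model(images).argmax(dim=1)
        for c in range(num_classes):
            mask = labels == c
            correct[c] += (preds[mask] == c).sum()
            total[c] += mask.sum()

print(correct / total)  # one accuracy per class
```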

Sample Results

[Image grid: two correctly classified and two misclassified samples per class]

Class         Misclassified as
T-shirt/top   Shirt, Pullover
Trouser       Dress, Dress
Pullover      Shirt, Coat
Dress         Shirt, Shirt
Coat          Shirt, Pullover
Sandal        Sneaker, Sneaker
Shirt         T-shirt/top, T-shirt/top
Sneaker       Ankle boot, Sandal
Bag           Trouser, Dress
Ankle boot    Sneaker, Sneaker

Filter Visualization

[Figure: visualization of the learned convolution filters]

Semantic Segmentation

Using the Mini Facade dataset, I trained a neural network to segment images of building facades into their various component classes.

Network Structure

This network has a six-layer convolution/max-pooling setup with the parameters below, designed to maintain the 256 x 256 image dimensions (a PyTorch sketch follows the table):

Layer Parameters
Kernel Size         7
Padding             3
Stride              1
Input Channels      3
Internal Channels   20
Output Channels     number of classes

Global Parameters
Criterion       Cross-Entropy Loss
Optimizer       Adam
Learning Rate   0.0005
Weight Decay    0.01
Nonlinearity    ELU
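A minimal PyTorch sketch consistent with these parameters (the exact layer arrangement is my assumption; explicit pooling is omitted here because stride-1 convolutions with padding 3 already preserve the 256 x 256 spatial size, which the text requires of every layer):

```python
import torch.nn as nn

def make_segmenter(num_classes, internal=20):
    """Six stride-1 convolutions with ELU between them; hypothetical layout."""
    channels = [3] + [internal] * 5 + [num_classes]
    layers = []
    for i in range(6):
        # kernel 7, padding 3, stride 1: 256x256 in -> 256x256 out at every layer
        layers.append(nn.Conv2d(channels[i], channels[i + 1],
                                kernel_size=7, padding=3, stride=1))
        if i < 5:
            layers.append(nn.ELU())
    return nn.Sequential(*layers)
```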

Tuning of the Network

I started by establishing the six layers and determining the kernel sizes, padding, and channels needed to maintain proper input/output dimensions. For simplicity, I kept the padding, kernel sizes, and internal channels constant. With an odd kernel size and padding = (kernel size - 1) / 2, each layer maintains its input dimensions (a worked check follows below). Having 3 input channels for RGB and the number of classes as output channels completed the structure. I then raised the learning rate as high as possible while avoiding oscillations in convergence, and increased the kernel size and internal channel count until I no longer noticed a change. I tuned the weight decay to offset the training bias without losing too much of the information in the training data. Average precision was still fairly low, but when I replaced the ReLU nonlinearities with ELU nonlinearities my average precision reached the 45% goal. I then trained on the entire training dataset and achieved the results below.
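As a check on the dimension-preservation claim, the standard convolution output-size formula gives, for a 256-pixel input with kernel size 7, padding 3, and stride 1:

    out = floor((in + 2 * padding - kernel) / stride) + 1 = floor((256 + 6 - 7) / 1) + 1 = 256

so each layer maps 256 x 256 inputs to 256 x 256 outputs, as intended.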

Results

Validation and Training Loss Over Time

Sample Images

[Image grid: input, ground truth, and network output]

Average Precision Values

Class     Color    Average Precision
Others    Black    0.6140
Facade    Blue     0.7309
Pillar    Green    0.1175
Window    Orange   0.7696
Balcony   Red      0.3960

Total Average Precision: 0.5256

While pillar segmentation shows very low average precision, facades, windows, and the others class are segmented reasonably well, helping us achieve a total average precision of 0.5256.
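For reference, per-class average precision values like those above can be computed along these lines (a sketch, assuming per-pixel softmax scores and ground-truth labels flattened into arrays; the actual evaluation script may differ):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_class_ap(probs, labels, num_classes=5):
    """probs: (N, C) per-pixel softmax scores; labels: (N,) ground-truth class ids.
    A hypothetical evaluation sketch, not the course's grading script."""
    aps = [average_precision_score(labels == c, probs[:, c])
           for c in range(num_classes)]
    return np.array(aps)  # the mean of these gives the total average precision
```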