Project 4: Classification and Segmentation

Violet Yao (cs194-26-asf)

Part 1: Image Classification

For part 1, we will use the Fashion MNIST dataset, available as torchvision.datasets.FashionMNIST, to train our model. Fashion MNIST has 10 classes, 60000 train + validation images, and 10000 test images.
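For reference, here is a minimal sketch of the data loading; the 50000/10000 train/validation split is an assumption, not necessarily the exact split used:

```python
import torch
from torchvision import datasets, transforms

# Fashion MNIST images are 28x28 grayscale; ToTensor scales pixels to [0, 1].
transform = transforms.ToTensor()

full_train = datasets.FashionMNIST(root="./data", train=True,
                                   download=True, transform=transform)
test_set = datasets.FashionMNIST(root="./data", train=False,
                                 download=True, transform=transform)

# Hold out part of the 60000 training images for validation.
train_set, val_set = torch.utils.data.random_split(full_train, [50000, 10000])

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=64)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64)
```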

Train & Validation Accuracy

My model reaches 91.59% validation accuracy and 90.99% test accuracy over 30 epochs, using a learning rate of 0.001 and the Adam optimizer with a weight decay of 0.0005.

Final Validation Accuracy   Final Test Accuracy
91.59%                      90.99%
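The training setup looks roughly as follows; `Net` is a placeholder for my CNN, whose exact definition is not reproduced here:

```python
import torch.nn as nn
import torch.optim as optim

model = Net()  # placeholder for the CNN used in this part
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)

for epoch in range(30):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```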

Per Class Accuracy

Class        Accuracy
T-shirt      0.836
Trouser      0.979
Pullover     0.871
Dress        0.896
Coat         0.853
Sandal       0.975
Shirt        0.771
Sneaker      0.968
Bag          0.980
Ankle Boot   0.970

Which classes were hardest to classify?

The model does not perform well on T-shirt (0.836) and Shirt (0.771). Looking at the incorrectly predicted images, there are many cases where a T-shirt is predicted as a Shirt and a Shirt as a T-shirt. This makes sense: both are tops with sleeves and are visually difficult to tell apart.
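The per-class numbers above (and the T-shirt/Shirt confusion) can be computed from a confusion matrix; a sketch, reusing `model` and `val_loader` from above:

```python
import torch

num_classes = 10
confusion = torch.zeros(num_classes, num_classes, dtype=torch.long)

model.eval()
with torch.no_grad():
    for images, labels in val_loader:
        preds = model(images).argmax(dim=1)
        for t, p in zip(labels, preds):
            confusion[t, p] += 1

# Per-class accuracy: diagonal entries over row sums.
per_class_acc = confusion.diag().float() / confusion.sum(dim=1).float()
# Off-diagonal entries such as confusion[0, 6] (T-shirt predicted as Shirt)
# show which class pairs the network mixes up.
```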

Correctly and Incorrectly Classified Images for Each Class

Below are two images from each class that the network classifies correctly, and two more that it classifies incorrectly; a sketch of how such examples are collected follows the grid.

[Image grid: for each class (Ankle Boot, Bag, Coat, Dress, Pullover, Sandal, Shirt, Sneaker, T-Shirt, Trouser), two correctly classified examples followed by two incorrectly classified examples.]
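Here is the promised sketch for gathering these examples, again reusing `model` and `val_loader`:

```python
import torch
from collections import defaultdict

correct, incorrect = defaultdict(list), defaultdict(list)

model.eval()
with torch.no_grad():
    for images, labels in val_loader:
        preds = model(images).argmax(dim=1)
        for img, t, p in zip(images, labels, preds):
            bucket = correct if p == t else incorrect
            if len(bucket[t.item()]) < 2:  # keep two examples per class
                bucket[t.item()].append(img)
```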

Learned Filter Visualization

Here is a visualization of the filters in the first convolutional layer:
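The filters can be pulled out of the trained network and plotted; a sketch, assuming the first conv layer is exposed as `model.conv1` (the attribute name and the count of 32 filters are assumptions):

```python
import matplotlib.pyplot as plt

# Weights of the first conv layer: shape (out_channels, in_channels, kH, kW).
filters = model.conv1.weight.data.cpu()

fig, axes = plt.subplots(4, 8, figsize=(8, 4))  # grid assumes 32 filters
for ax, f in zip(axes.flat, filters):
    ax.imshow(f[0].numpy(), cmap="gray")  # single input channel (grayscale)
    ax.axis("off")
plt.savefig("conv1_filters.png")
```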

Part 2: Semantic Segmentation

Semantic segmentation refers to labeling each pixel of an image with its object class. For this part, we will use the Mini Facade dataset, which consists of images of different cities around the world with diverse architectural styles (in .jpg format), shown on the left. It also contains semantic segmentation labels (in .png format) for 5 classes: balcony, window, pillar, facade, and others. I will train a network to convert the image on the left into the labels on the right.
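A sketch of a dataset class for these image/label pairs; the directory layout and the .jpg/.png naming convention are assumptions:

```python
import glob
import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class FacadeDataset(Dataset):
    """Loads (photo.jpg, labels.png) pairs; label pixels are class ids 0-4."""
    def __init__(self, root):
        self.img_paths = sorted(glob.glob(os.path.join(root, "*.jpg")))

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, i):
        img = Image.open(self.img_paths[i]).convert("RGB")
        label = Image.open(self.img_paths[i].replace(".jpg", ".png"))
        img = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 255.0
        label = torch.from_numpy(np.array(label)).long()  # (H, W) class ids
        return img, label
```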

Train & Validation Loss

My model's test loss decreases to 0.854 over 30 epochs, using a learning rate of 0.001 and the Adam optimizer with a weight decay of 0.0005.
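The training setup mirrors part 1, with a per-pixel cross-entropy loss; `SegNet` is a placeholder for the architecture sketched in the next section, and the data path is assumed:

```python
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

train_loader = DataLoader(FacadeDataset("facade/train"),  # assumed path
                          batch_size=8, shuffle=True)

model = SegNet()
criterion = nn.CrossEntropyLoss()  # per-pixel loss over the 5 classes
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)

for epoch in range(30):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        out = model(images)            # (N, 5, H, W) class scores per pixel
        loss = criterion(out, labels)  # labels: (N, H, W) class ids
        loss.backward()
        optimizer.step()
```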

Model Architecture

It turns out that a simple architecture of stacked conv, ReLU, and max-pool layers with a bit of dropout works best. The detailed structure is shown below.
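Since the structure diagram is an image, here is a minimal sketch of a network in that spirit; the channel counts, depth, and the bilinear upsampling at the end are assumptions:

```python
import torch.nn as nn

class SegNet(nn.Module):
    """Stacks of conv/ReLU/max-pool with a bit of dropout, then upsampling
    back to input resolution and a 1x1 conv giving 5 class-score channels."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # H/2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # H/4
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Dropout2d(0.1),
        )
        self.classifier = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(128, num_classes, 1),           # per-pixel class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```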

Average Precision on Test Set

The average precision, averaged over all 5 classes, is 0.53.
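A sketch of how per-class AP can be computed, treating each class one-vs-rest over all test pixels with scikit-learn (a `test_loader` over the facade test set is assumed):

```python
import numpy as np
import torch
from sklearn.metrics import average_precision_score

model.eval()
scores, gts = [], []
with torch.no_grad():
    for images, labels in test_loader:
        probs = torch.softmax(model(images), dim=1)   # (N, 5, H, W)
        scores.append(probs.permute(0, 2, 3, 1).reshape(-1, 5).numpy())
        gts.append(labels.reshape(-1).numpy())
scores, gts = np.concatenate(scores), np.concatenate(gts)

aps = [average_precision_score(gts == c, scores[:, c]) for c in range(5)]
print("per-class AP:", aps, "mean AP:", np.mean(aps))
```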

Model's Performance for My Own Collection

Overall the model does a good job on my collection. It is able to catch most of the pillars and windows. However, it predicts quite a bit of the facade as balcony (in the upper part of the original image) and some of the background as facade.

[Left: original image. Right: segmentation produced by my model.]
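For reference, a sketch of running the trained model on one of my own photos; the filename and the id-to-color palette are assumptions:

```python
import numpy as np
import torch
from PIL import Image

# Assumed colors for the 5 class ids (the dataset's actual mapping may differ).
PALETTE = np.array([[0, 0, 0], [0, 0, 255], [0, 255, 0],
                    [255, 128, 0], [255, 0, 0]], dtype=np.uint8)

img = Image.open("my_photo.jpg").convert("RGB")    # hypothetical filename
x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 255.0

model.eval()
with torch.no_grad():
    pred = model(x.unsqueeze(0)).argmax(dim=1)[0]  # (H, W) class ids

Image.fromarray(PALETTE[pred.numpy()]).save("my_photo_labels.png")
```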