Nancy Shaw | CS 194-26

Image Classification

We will use the FasionMNIST dataset available in torchvision.datasets.FashionMNIST for training our model. Fashion MNIST has 10 classes and 60000 train + validation images (50000 were used for training and 10000 for validation) and 10000 test images. Each example is a 28x28 grayscale image, associated with a label from 10 classes: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot.

Network Architecture

For our CNN, we used the recommended two convolution layers (3x3) with 32 channels and two fully connected networks, which worked quite well, netting an accuracy around 88%. To boost the accuracy to above 90%, we added padding to the convolutional layers. We used cross entropy loss with Adam at a learning rate of 0.001 and trained it on batch sizes of 64 for 10 epochs.

Results

This CNN was able to achieve a test set accuracy of \(91.2\%\).

Below are more class-categorized results.

Class	Accuracy	Incorrectly Identified
T-shirt/Top	Test: 87% Validation: 88%	Predicted T-Shirt/Top Predicted T-Shirt/Top
Trouser	Test: 98% Validation: 98%	Predicted Dress Predicted Dress
Pullover	Test: 88% Validation: 89%	Predicted Dress Predicted Shirt
Dress	Test: 93% Validation: 96%	Predicted Shirt Predicted T-Shirt/Top
Coat	Test: 89% Validation: 90%	Predicted Pullover Predicted Pullover
Sandal	Test: 98% Validation: 98%	Predicted Ankle Boot Predicted Sneaker
Shirt	Test: 62% Validation: 66%	Predicted Coat Predicted T-Shirt/Top
Sneaker	Test: 97% Validation: 97%	Predicted Ankle boot Predicted Ankle boot
Bag	Test: 98% Validation: 98%	Predicted T-Shirt/Top Predicted T-Shirt/Top
Ankle Boot	Test: 97% Validation: 97%	Predicted Sandal Predicted Sneaker

Filter Visualization

Below are the learned filters from the first convolutional layer

Semantic Segmentation

In this section, we will train a CNN to label each pixel in the image to its correct object class using the Mini Facade dataset. This dataset consists of images of different cities around the world and diverse architectural styles. We will try to segment these images into 5 different classes: balcony, window, pillar, facade and others.

Network Architecture

We split the training images into a 658 image train set and 165 image validation set. Then we used cross entropy loss with Adam at a learning rate of 0.001 and weight decay of 0.00001 and trained it on batch sizes of 1 for 10 epochs.

1. Use a 3x3 convolutional layer to go from 3 to 32 channels (padding 1).

2. Apply a ReLU.

3. Apply a 2x2 max pooling layer.

4. Use a 3x3 convolutional layer to go from 32 to 64 channels (padding 1).

5. Apply a ReLU.

6. Apply a 2x2 max pooling layer.

7. Use a 3x3 convolutional layer to go from 64 to 128 channels (padding 1).

8. Apply a ReLU.

9. Use a 3x3 transposed convolutional layer to go from 128 to 64 channels (stride 2).

10. Use a 3x3 convolutional layer to go from 64 to 64 channels (padding 1).

11. Apply a ReLU.

12. Use a 3x3 transposed convolutional layer to go from 64 to 32 channels (stride 2).

13. Use a 3x3 convolutional layer to go from 32 to 5 channels (padding 1).

Results

This CNN was able to achieve a test set accuracy of \(0.53\%\).

Below are the average precisions for each of the 5 classes:

The following table contains segmented images from the test set and an additional photo of Evans hall.

Original	Ground Truth	Segmented Image




	N / A

It seems like our classifier has a rather hard time judging between facade and others but is able to identify windows pretty well. Balcony and Pillar probably both have low average precision because there is comparatively less comapred to facades and windows and they maybe harder to identify from the image.

Classsification and Segmentation

Nancy Shaw, cs194-26-abo

CS 194-26 Spring 2020

Image Classification

Network Architecture

Results

Filter Visualization

Semantic Segmentation

Network Architecture

Results