CS 194-26: Image Manipulation, Computer Vision and Computational Photography

Project 4: Classification and Segmentation

Jason Qiao, CS194-26-aab



Overview

In this project, I implemented two CNNs. One was a image classification network using FashionMNIST, and the other was a semantic segmentation network using a facade dataset.

Part 1: Image Classification

I used a pretty basic four layer CNN architecture for this part; this included two convolutional layers and two fully connected layers. I used cross entropy loss and an Adam optimizer. I found the optimal learning rate (0.002882) and weight decay (0) through random search.

Training and Validation Accuracy
Class Validation Accuracy Testing Accuracy
T-Shirt/Top 0.917 0.842
Trouser 0.994 0.976
Pullover 0.926 0.857
Dress 0.966 0.922
Coat 0.898 0.841
Sandal 0.989 0.968
Shirt 0.842 0.679
Sneaker 0.964 0.950
Bag 0.988 0.973
Ankle Boot 0.989 0.972

Shirt seemed to be the hardest to classify followed by T-Shirt/Top and Coat. I attribute this to because all the classes are tops so aesthetically they look similar.

Below are some images my network classified correctly and incorrectly.

Class Correctly Labelled Image 1 Correctly Labelled Image 2 Incorrectly Labelled Image 1 Incorrectly Labelled Image 1
T-Shirt/Top
Labelled as: Shirt
Labelled as: Shirt
Trouser
Labelled as: Dress
Labelled as: Dress
Pullover
Labelled as: Coat
Labelled as: Coat
Dress
Labelled as: Coat
Labelled as: Coat
Coat
Labelled as: Pullover
Labelled as: Shirt
Sandal
Labelled as: Ankle-Boot
Labelled as: Ankle-Boot
Shirt
Labelled as: T-Shirt/Top
Labelled as: T-Shirt/Top
Sneaker
Labelled as: Ankle-Boot
Labelled as: Ankle-Boot
Bag
Labelled as: Shirt
Labelled as: T-Shirt/Top
Ankle Boot
Labelled as: Sneaker
Labelled as: Sneaker

Here are the learned filters from the first convolutional layer.

Part 2: Semantic Segmentation

For my network, I grouped four convolutional layers with a final transposed convolution layer to upsample back to the original dimension.

Input -> (Conv ReLU Pool) x 4 -> ConvTranspose2d -> Conv -> Output

I used 64, 128, 128, and 64 filters for the convolutional layers, respectively, and used default hyperparameters (lr=1e-3, wd=1e-5) for the Adam optimizer and cross entropy loss.

Training and Validation Loss

I achieved an average precision of 0.5096 on the test set (0.5920, 0.7070, 0.1026, 0.6641, 0.4823).

Here is my result from a photo I took.

Own Photo
Network Output

It seems to be very conservative with its non-facade labelling. It got ~70% of the windows. In hindsight, this photo was a poor choice because there isn't anything else other than the facade wall and windows. I was hoping my network would (incorrectly) label the awning under the top windows as balconies but it didn't.