CS 194-26 Spring 2020 Project 4: Classification and Segmentation

By: Annie Nguyen

Part 1: Image Classification

I loaded the Fashion MNIST dataset from torchvision.datasets.FashionMNIST to train the convolutional neural network. The dataset contains grayscale images of clothing with 10 labels: top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. I split the dataset into 50,000 training, 10,000 validation, and 10,000 test images. Here is the first batch of images with their labels.
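The split described above can be sketched with torch.utils.data.random_split; the dummy TensorDataset below stands in for the real torchvision FashionMNIST download, and the batch size of 8 matches the labeled batch shown below:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Stand-in for torchvision.datasets.FashionMNIST(train=True): 60,000
# 1x28x28 images. In the real project, load the torchvision dataset
# with a ToTensor transform instead of this dummy data.
full_train = TensorDataset(
    torch.zeros(60000, 1, 28, 28),
    torch.zeros(60000, dtype=torch.long),
)

# 50,000 for training, 10,000 held out for validation; the 10,000-image
# test split ships separately as FashionMNIST(train=False).
train_set, val_set = random_split(full_train, [50000, 10000])

train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
```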

Labels: shirt, sandal, dress, trouser, sneaker, sandal, shirt, pullover

Next, I wrote a CNN with the following layers:
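As a minimal PyTorch sketch of a network of this shape: only the first layer (32 filters of size 5x5, shown later in this write-up) is confirmed by the text; the second conv layer and the fully connected sizes below are assumptions.

```python
import torch
import torch.nn as nn

class FashionCNN(nn.Module):
    # Sketch only: layer sizes past the first conv are assumptions.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5),   # 28x28 -> 24x24; the 32 learned 5x5 filters
            nn.ReLU(),
            nn.MaxPool2d(2),       # 24x24 -> 12x12
            nn.Conv2d(32, 32, 5),  # 12x12 -> 8x8 (assumed second conv layer)
            nn.ReLU(),
            nn.MaxPool2d(2),       # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 120),  # assumed hidden size
            nn.ReLU(),
            nn.Linear(120, 10),          # 10 clothing classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```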

I chose cross-entropy loss and experimented with different learning rates, weight decays, and optimizers, evaluating each hyperparameter setting on the validation set. I trained the network for 10 epochs using Adam with a learning rate of 0.001; after 10 epochs, it had a training loss of 0.167. The two plots below show the training loss and training accuracy over the course of training. Each 6,000-iteration tick on the x-axis marks a new epoch.
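One training step with the settings described above (Adam at lr = 0.001, cross-entropy loss) can be sketched as follows; the tiny linear model here is a hypothetical stand-in for the CNN:

```python
import torch
import torch.nn as nn

# Stand-in model; in the project this is the CNN described above.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 1, 28, 28)       # one batch of 8 images
labels = torch.randint(0, 10, (8,))      # their class labels

optimizer.zero_grad()
loss = criterion(model(images), labels)  # cross-entropy on class scores
loss.backward()
optimizer.step()
```

The same three-line zero_grad / backward / step pattern repeats over every batch for the 10 epochs of training.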

Training loss
Training accuracy

Accuracy of the network on the 10000 validation images: 0.9016
Accuracy of the network on the 10000 test images: 0.8996
Here are the per class accuracies on the validation and test set. The hardest classes were shirt and pullover.

Validation Set
Test Set
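The per-class accuracies above come from a simple tally of predictions against labels; a minimal sketch (the helper name is my own):

```python
# Per-class accuracy: fraction of examples of each class that the
# network predicted correctly.
def per_class_accuracy(preds, labels, num_classes):
    correct = [0] * num_classes
    total = [0] * num_classes
    for p, y in zip(preds, labels):
        total[y] += 1
        if p == y:
            correct[y] += 1
    # None for classes that never appear in `labels`
    return [c / t if t else None for c, t in zip(correct, total)]

# Toy example with 3 classes
print(per_class_accuracy([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], 3))
```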

For each class, the left two photos are examples the network classified correctly, and the right two are examples it classified incorrectly, labeled with the incorrect prediction:

shirt: misclassified as dress, pullover
trouser: misclassified as top, dress
pullover: misclassified as shirt, shirt
dress: misclassified as bag, shirt
coat: misclassified as shirt, shirt
sandal: misclassified as sneaker, sneaker
shirt: misclassified as top, coat
sneaker: misclassified as sandal, ankle boot
bag: misclassified as top, sandal
ankle boot: misclassified as sandal, sneaker
Here are the 32 learned 5x5 filters from the first convolution layer.

Part 2: Semantic Segmentation

In this part of the project, we label each pixel of an image with the correct class using the Facade Dataset, which contains 5 classes: balcony, pillar, window, facade, and other. The dataset contains 905 training images and 113 test images; I split the training set into 800 images for training and 105 for validation.

Results

The CNN I wrote had the following layers:

I chose convolution layers with 3x3 filters and padding 1 to preserve the image dimensions from layer to layer. After applying max pooling, which halves the spatial dimensions, I used a transposed convolution (ConvTranspose2d) with stride 2 and filter size 2 to upsample back to the original image dimensions. After trying several optimizers and hyperparameter values, I chose the Adam optimizer with learning rate 1e-3 and weight decay 1e-5 because it gave the best results, and I used a batch size of 4. I placed the batch norm transformation after the ReLU non-linearity because, in practice and in recent papers, applying it after ReLU performs better and yields a higher AP score.
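The layer pattern described above can be sketched as follows. Only the layer types (3x3 convs with padding 1, max pooling, ConvTranspose2d with kernel 2 and stride 2, batch norm after ReLU) and the 5 output classes come from the write-up; the channel widths and depth are assumptions:

```python
import torch
import torch.nn as nn

class FacadeSegNet(nn.Module):
    # Sketch: 3x3 convs with padding 1 keep H x W, MaxPool halves it,
    # and ConvTranspose2d with kernel 2 / stride 2 restores it.
    def __init__(self, num_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),            # H x W preserved
            nn.ReLU(),
            nn.BatchNorm2d(32),                        # batch norm after ReLU
            nn.MaxPool2d(2),                           # H/2 x W/2
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(64),
            nn.ConvTranspose2d(64, 32, 2, stride=2),   # upsample back to H x W
            nn.Conv2d(32, num_classes, 3, padding=1),  # per-pixel class scores
        )

    def forward(self, x):
        return self.net(x)
```

Because every conv preserves spatial size and the transposed conv exactly undoes the pooling, the output score map has the same height and width as the input image, one channel per class.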

Here is the training and validation loss every epoch during the training process:

Class AP on test set:
other: 0.5706
facade: 0.6425
pillar: 0.0976
window: 0.7545
balcony: 0.3791

The mean AP score on the test set was 0.48886, or ~48.9%.
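As a sanity check, the mean is just the average of the five per-class AP scores:

```python
# Per-class AP values reported above
aps = {
    "other":   0.5706324898350593,
    "facade":  0.6424847422315797,
    "pillar":  0.09759549263535051,
    "window":  0.7544610725258761,
    "balcony": 0.37914373106545085,
}
mean_ap = sum(aps.values()) / len(aps)
print(mean_ap)  # ~0.4889
```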

Here is the model's output on a test image. The model is good at picking out windows and at labeling shutters as other. It fails to recognize whether the roof is facade or not: the blue at the top of the image means the model classified the roof as facade when it should be other (shaded black).

Test image
Model output on test image