CS 194-26 Spring 2020 Project 4: Classification and Segmentation

By: Annie Nguyen

Part 1: Image Classification

I loaded the Fashion MNIST dataset from torchvision.datasets.FashionMNIST to train the convolutional neural network. The dataset contains grayscale images of clothing with 10 labels: top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. I split the dataset into 50,000 training, 10,000 validation, and 10,000 test images. Here is the first batch of images with their labels.
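The split described above can be sketched with torch.utils.data.random_split; the dummy TensorDataset below stands in for the real torchvision FashionMNIST download, and the batch size of 8 matches the labeled batch shown below:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Stand-in for torchvision.datasets.FashionMNIST(train=True): 60,000
# 1x28x28 images. In the real project, load the torchvision dataset
# with a ToTensor transform instead of this dummy data.
full_train = TensorDataset(
    torch.zeros(60000, 1, 28, 28),
    torch.zeros(60000, dtype=torch.long),
)

# 50,000 for training, 10,000 held out for validation; the 10,000-image
# test split ships separately as FashionMNIST(train=False).
train_set, val_set = random_split(full_train, [50000, 10000])

train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
```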

Labels: shirt, sandal, dress, trouser, sneaker, sandal, shirt, pullover

Next, I wrote a CNN with the following layers:
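As a minimal PyTorch sketch of a network of this shape: only the first layer (32 filters of size 5x5, shown later in this write-up) is confirmed by the text; the second conv layer and the fully connected sizes below are assumptions.

```python
import torch
import torch.nn as nn

class FashionCNN(nn.Module):
    # Sketch only: layer sizes past the first conv are assumptions.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5),   # 28x28 -> 24x24; the 32 learned 5x5 filters
            nn.ReLU(),
            nn.MaxPool2d(2),       # 24x24 -> 12x12
            nn.Conv2d(32, 32, 5),  # 12x12 -> 8x8 (assumed second conv layer)
            nn.ReLU(),
            nn.MaxPool2d(2),       # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 120),  # assumed hidden size
            nn.ReLU(),
            nn.Linear(120, 10),          # 10 clothing classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```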

I chose cross-entropy loss and experimented with different learning rates, weight decays, and optimizers, evaluating each hyperparameter setting on the validation set. I trained the network for 10 epochs using Adam with a learning rate of 0.001; after 10 epochs, it had a training loss of 0.167. The two plots below show the training loss and training accuracy over the course of training. Each 6,000-iteration tick on the x-axis marks a new epoch.
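One training step with the settings described above (Adam at lr = 0.001, cross-entropy loss) can be sketched as follows; the tiny linear model here is a hypothetical stand-in for the CNN:

```python
import torch
import torch.nn as nn

# Stand-in model; in the project this is the CNN described above.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 1, 28, 28)       # one batch of 8 images
labels = torch.randint(0, 10, (8,))      # their class labels

optimizer.zero_grad()
loss = criterion(model(images), labels)  # cross-entropy on class scores
loss.backward()
optimizer.step()
```

The same three-line zero_grad / backward / step pattern repeats over every batch for the 10 epochs of training.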

Training loss
Training accuracy

Accuracy of the network on the 10000 validation images: 0.9016
Accuracy of the network on the 10000 test images: 0.8996
Here are the per class accuracies on the validation and test set. The hardest classes were shirt and pullover.

Validation Set
Test Set
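The per-class accuracies above come from a simple tally of predictions against labels; a minimal sketch (the helper name is my own):

```python
# Per-class accuracy: fraction of examples of each class that the
# network predicted correctly.
def per_class_accuracy(preds, labels, num_classes):
    correct = [0] * num_classes
    total = [0] * num_classes
    for p, y in zip(preds, labels):
        total[y] += 1
        if p == y:
            correct[y] += 1
    # None for classes that never appear in `labels`
    return [c / t if t else None for c, t in zip(correct, total)]

# Toy example with 3 classes
print(per_class_accuracy([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], 3))
```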

For each class, the left two photos are examples the network classified correctly, and the right two are examples it classified incorrectly, labeled with the incorrect prediction:

shirt: misclassified as dress, pullover
trouser: misclassified as top, dress
pullover: misclassified as shirt, shirt
dress: misclassified as bag, shirt
coat: misclassified as shirt, shirt
sandal: misclassified as sneaker, sneaker
shirt: misclassified as top, coat
sneaker: misclassified as sandal, ankle boot
bag: misclassified as top, sandal
ankle boot: misclassified as sandal, sneaker
Here are the 32 learned 5x5 filters from the first convolution layer.

Part 2: Semantic Segmentation

In this part of the project, we label each pixel of an image with the correct class using the Facade Dataset, which contains 5 classes: balcony, pillar, window, facade, and other. The dataset contains 905 training images and 113 test images; I split the training set into 800 images for training and 105 for validation.

Results

The CNN I wrote had the following layers:

I chose convolution layers with 3x3 filters and padding 1 to preserve the image dimensions from layer to layer. After applying max pooling, which halves the spatial dimensions, I used a transposed convolution (ConvTranspose2d) with stride 2 and filter size 2 to upsample back to the original image dimensions. After trying several optimizers and hyperparameter values, I chose the Adam optimizer with learning rate 1e-3 and weight decay 1e-5 because it gave the best results, and I used a batch size of 4. I placed the batch norm transformation after the ReLU non-linearity because, in practice and in recent papers, applying it after ReLU performs better and yields a higher AP score.
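The layer pattern described above can be sketched as follows. Only the layer types (3x3 convs with padding 1, max pooling, ConvTranspose2d with kernel 2 and stride 2, batch norm after ReLU) and the 5 output classes come from the write-up; the channel widths and depth are assumptions:

```python
import torch
import torch.nn as nn

class FacadeSegNet(nn.Module):
    # Sketch: 3x3 convs with padding 1 keep H x W, MaxPool halves it,
    # and ConvTranspose2d with kernel 2 / stride 2 restores it.
    def __init__(self, num_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),            # H x W preserved
            nn.ReLU(),
            nn.BatchNorm2d(32),                        # batch norm after ReLU
            nn.MaxPool2d(2),                           # H/2 x W/2
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(64),
            nn.ConvTranspose2d(64, 32, 2, stride=2),   # upsample back to H x W
            nn.Conv2d(32, num_classes, 3, padding=1),  # per-pixel class scores
        )

    def forward(self, x):
        return self.net(x)
```

Because every conv preserves spatial size and the transposed conv exactly undoes the pooling, the output score map has the same height and width as the input image, one channel per class.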

Here is the training and validation loss every epoch during the training process:

Class AP on test set:
other: 0.5706
facade: 0.6425
pillar: 0.0976
window: 0.7545
balcony: 0.3791

The mean AP score on the test set was 0.48886, or ~48.9%.
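As a sanity check, the mean is just the average of the five per-class AP scores:

```python
# Per-class AP values reported above
aps = {
    "other":   0.5706324898350593,
    "facade":  0.6424847422315797,
    "pillar":  0.09759549263535051,
    "window":  0.7544610725258761,
    "balcony": 0.37914373106545085,
}
mean_ap = sum(aps.values()) / len(aps)
print(mean_ap)  # ~0.4889
```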

Here is the model's output on a test image. The model is good at picking out windows and at labeling shutters as other. It fails to recognize whether the roof is facade or not: the blue at the top of the image means the model classified the roof as facade when it should be other (shaded black).

Test image
Model output on test image