CS 194-26 Spring 2020 Project 4: Classification and Segmentation
By: Annie Nguyen
Part 1: Image Classification
I loaded the Fashion MNIST dataset from torchvision.datasets.FashionMNIST
to train the convolutional neural network. The dataset contains images of
clothing and 10 labels: top, trouser, pullover, dress, coat, sandal, shirt,
sneaker, bag, and ankle boot. I split the dataset into 50,000 training,
10,000 validation, and 10,000 test images. Here is the first batch of
images with their labels.
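A hedged sketch of the loading-and-splitting step. In the actual project the 60,000-image training set comes from `torchvision.datasets.FashionMNIST`; here a dummy `TensorDataset` of the same shape stands in so the sketch runs without a download, and the helper name `split_train_val` is my own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

def split_train_val(full_train, n_val=10_000, seed=0):
    """Split the 60,000-image training set into 50,000 train / 10,000 val."""
    n_train = len(full_train) - n_val
    gen = torch.Generator().manual_seed(seed)
    return random_split(full_train, [n_train, n_val], generator=gen)

# Real project: full_train = torchvision.datasets.FashionMNIST(
#     "data", train=True, download=True,
#     transform=torchvision.transforms.ToTensor())
# Dummy stand-in of the same shape so this sketch runs without a download:
full_train = TensorDataset(torch.zeros(60_000, 1, 28, 28),
                           torch.zeros(60_000, dtype=torch.long))
train_set, val_set = split_train_val(full_train)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```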
Next, I wrote a CNN with the following layers:
- Conv Layer, 32 channels, 5x5 filter
- Relu
- Maxpool Layer, 2x2 filter
- Conv Layer, 32 channels, 5x5 filter
- Relu
- Maxpool Layer, 2x2 filter
- Fully connected layer
- Relu
- Fully connected layer
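The layer list above could be written as an `nn.Module` along these lines. The write-up does not state the hidden width of the fully connected layers, so the 128 units are my assumption; the 32 × 4 × 4 flattened size follows from 28×28 inputs with 5×5 valid convolutions and 2×2 pooling.

```python
import torch
import torch.nn as nn

class FashionCNN(nn.Module):
    """Two conv/ReLU/maxpool blocks followed by two fully connected layers,
    mirroring the layer list above (the 128-unit hidden width is a guess)."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5),   # 28x28 -> 24x24
            nn.ReLU(),
            nn.MaxPool2d(2),       # -> 12x12
            nn.Conv2d(32, 32, 5),  # -> 8x8
            nn.ReLU(),
            nn.MaxPool2d(2),       # -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 4 * 4, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

out = FashionCNN()(torch.zeros(8, 1, 28, 28))
```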
I chose cross-entropy loss and experimented with different learning rates,
weight decay, and optimizers. I evaluated my hyperparameters on the
validation set. I trained the network for 10 epochs using Adam and a
learning rate = 0.001. After 10 epochs, it had a training loss of 0.167.
The two plots below show the training loss and training accuracy over the
course of training. Each x-axis tick (every 6,000 iterations) marks a new
epoch.
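A minimal sketch of one training epoch as described: cross-entropy loss and the Adam optimizer with learning rate 0.001. The helper name `train_epoch` and the tiny stand-in model and single-batch loader are mine, not the project's actual code.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, opt, device="cpu"):
    """One epoch of the loop described above: cross-entropy loss with the
    Adam optimizer; returns the mean batch loss for the loss plot."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    total = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        opt.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        opt.step()
        total += loss.item()
    return total / len(loader)

# Tiny stand-in model and single-batch "loader" so the sketch runs:
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
opt = torch.optim.Adam(model.parameters(), lr=0.001)
loader = [(torch.randn(4, 1, 28, 28), torch.randint(0, 10, (4,)))]
mean_loss = train_epoch(model, loader, opt)
```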
Accuracy of the network on the 10000 validation images: 0.9016
Accuracy of the network on the 10000 test images: 0.8996
Here are the per class accuracies on the validation and test set. The
hardest classes were shirt and pullover.
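A per-class table like the one above can be computed by counting correct predictions per label. `per_class_accuracy` is an illustrative helper of my own, shown here with a stand-in model and batch rather than the trained network.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def per_class_accuracy(model, loader, n_classes=10):
    """Fraction of correctly predicted examples for each class."""
    correct = torch.zeros(n_classes)
    total = torch.zeros(n_classes)
    model.eval()
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        for c in range(n_classes):
            mask = labels == c
            total[c] += mask.sum()
            correct[c] += (preds[mask] == c).sum()
    return correct / total.clamp(min=1)  # avoid 0/0 for absent classes

# Stand-in model and batch so the sketch runs end to end:
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loader = [(torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,)))]
acc = per_class_accuracy(model, loader)
```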
For each class, the left 2 photos are 2 examples where the class was
predicted correctly and the right 2 photos are 2 examples where the class
was predicted incorrectly with its incorrect prediction.
top
trouser
pullover
dress
coat
sandal
shirt
sneaker
bag
ankle boot
Here are the 32 learned 5x5 filters from the first convolution layer.
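One way to arrange the 32 learned 5×5 kernels for display is to tile them into a single grid image (here 4×8), normalizing each filter to [0, 1]. `filter_grid` is my own helper, and the freshly initialized `nn.Conv2d` below stands in for the trained layer.

```python
import torch
import torch.nn as nn

def filter_grid(conv, rows=4, cols=8):
    """Tile the first conv layer's kernels into one (rows*5, cols*5) image,
    each filter min-max normalized to [0, 1] for display with imshow."""
    w = conv.weight.detach().cpu()                    # (32, 1, 5, 5)
    w = w - w.amin(dim=(2, 3), keepdim=True)
    w = w / w.amax(dim=(2, 3), keepdim=True).clamp(min=1e-8)
    grid = torch.zeros(rows * 5, cols * 5)
    for i in range(rows):
        for j in range(cols):
            grid[i * 5:(i + 1) * 5, j * 5:(j + 1) * 5] = w[i * cols + j, 0]
    return grid

# Freshly initialized layer stands in for the trained one:
grid = filter_grid(nn.Conv2d(1, 32, 5))
```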
Part 2: Semantic Segmentation
In this part of the project, we label each pixel of an image with its
correct class using the Facade Dataset, which contains 5 classes: balcony,
pillar, window, facade, and other. The dataset contains 905 training images
and 113 test images. I split the training set into 800 images for training
and 105 images for validation.
Results
The CNN I wrote had the following layers:
- Conv Layer, 32 channels, 3x3 filter, padding 1, Relu, Batch norm
- Conv Layer, 32 channels, 3x3 filter, padding 1, Relu, Batch norm
- Conv Layer, 32 channels, 3x3 filter, padding 1, Relu, Batch norm
- Maxpool Layer, 2x2 filter
- Conv Transpose Layer, 64 channels, 2x2 filter, stride 2, Relu, Batch norm
- Conv Layer, 64 channels, 3x3 filter, padding 1, Relu, Batch norm
- Conv Layer, 64 channels, 3x3 filter, padding 1, Relu, Batch norm
- Maxpool Layer, 2x2 filter
- Conv Transpose Layer, 32 channels, 2x2 filter, stride 2, Relu, Batch norm
- Conv Layer, 32 channels, 3x3 filter, padding 1, Relu, Batch norm
- Conv Layer, 32 channels, 3x3 filter, padding 1, Relu, Batch norm
- Maxpool Layer, 2x2 filter
- Conv Transpose Layer, 5 channels, 2x2 filter, stride 2
- Relu
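The layer list above could be assembled like this; `FacadeNet`, `conv_block`, and `up_block` are my names, and the final ReLU is kept exactly as listed. Each maxpool halves the spatial size and each 2×2 stride-2 transposed conv doubles it, so the 5-channel output matches the input resolution.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """3x3 conv with padding 1 (keeps H and W), then ReLU, then batch norm,
    matching the ordering described in the write-up."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.ReLU(), nn.BatchNorm2d(c_out))

def up_block(c_in, c_out):
    """2x2 transposed conv with stride 2: doubles H and W."""
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 2, stride=2),
                         nn.ReLU(), nn.BatchNorm2d(c_out))

class FacadeNet(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(3, 32), conv_block(32, 32), conv_block(32, 32),
            nn.MaxPool2d(2),                         # H/2 x W/2
            up_block(32, 64),                        # back to H x W
            conv_block(64, 64), conv_block(64, 64),
            nn.MaxPool2d(2),
            up_block(64, 32),
            conv_block(32, 32), conv_block(32, 32),
            nn.MaxPool2d(2),
            nn.ConvTranspose2d(32, n_classes, 2, stride=2),
            nn.ReLU(),  # final ReLU kept exactly as in the list above
        )

    def forward(self, x):
        return self.net(x)

out = FacadeNet()(torch.randn(2, 3, 64, 64))
```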
I chose to do convolution layers with 3x3 filters and padding 1 to keep
the same image dimensions from layer to layer. After applying Maxpool, which
halves the image dimensions, I use ConvTranspose to upsample and match
the original image dimensions with stride 2 and filter size 2. After trying
many optimizers and values for the hyperparameters I chose the Adam
optimizer with learning rate = 1e-3 and weight decay = 1e-5 because it
provided the best results. I used a batch size of 4. I decided to place
the batch norm transformation after the ReLU non-linearity because, in
practice and in recent papers, applying batch norm after ReLU performs
better and yields a higher AP score.
Here is the training and validation loss every epoch during the training
process:
Class   | AP on test set
other   | 0.5706324898350593
facade  | 0.6424847422315797
pillar  | 0.09759549263535051
window  | 0.7544610725258761
balcony | 0.37914373106545085
The mean AP score on the test set was 0.48886350565, or ~48.9%.
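Per-class AP scores like those in the table are typically computed one-vs-rest over all pixels. A sketch using `sklearn.metrics.average_precision_score` on per-pixel softmax scores, with random stand-in data; the function name `mean_ap` is mine.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_ap(scores, labels, n_classes=5):
    """One-vs-rest AP per class over flattened pixels, then the mean.
    scores: (N, n_classes) softmax outputs; labels: (N,) ground truth."""
    aps = [average_precision_score(labels == c, scores[:, c])
           for c in range(n_classes)]
    return aps, float(np.mean(aps))

# Random stand-in "pixels" so the sketch runs:
rng = np.random.default_rng(0)
scores = rng.random((200, 5))
scores /= scores.sum(axis=1, keepdims=True)
labels = rng.integers(0, 5, 200)
aps, mAP = mean_ap(scores, labels)
```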
Here's the model output on a test image. The model is good at picking out
windows and at labeling shutters as other. It fails to recognize whether
the roof is a facade: the blue at the top of the image means the model
thinks the roof is a facade when it should be classified as other (shaded
in black).