CS194-26 Project 4: Classification and Segmentation
by Heidi Dong
In this project, I learned about what CNNs are and how to use them in PyTorch!
Part 1: Image Classification
The goal is to classify images from the Fashion MNIST dataset. I followed PyTorch's classification and neural network tutorials to complete this part.
Dataloader
The dataset is already available in torchvision.datasets.FashionMNIST. There are 10 classes:
Value | Class |
---|---|
0 | T-shirt |
1 | Trouser |
2 | Pullover |
3 | Dress |
4 | Coat |
5 | Sandal |
6 | Shirt |
7 | Sneaker |
8 | Bag |
9 | Ankle boot |
Here are some sample images of an ankle boot, coat, bag, and sneaker:
CNN
My neural net architecture was:
- convolutional layer (1 input, 32 outputs, 5x5 kernel, stride of 1) followed by ReLU and max pooling
- convolutional layer (32 inputs, 32 outputs, 5x5 kernel, stride of 1) followed by ReLU and max pooling
- fully connected network (512 inputs, 120 outputs) followed by ReLU
- fully connected network (120 inputs, 10 outputs)
I experimented with the kernel size in the convolutional layers and the number of inputs and outputs in the first fully connected network layer, and found that the above values gave the highest average validation accuracy.
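The architecture above can be sketched in PyTorch as follows; the layer sizes come straight from the bullet list (each 28x28 image shrinks to 4x4 after two unpadded 5x5 convolutions and two 2x2 poolings, giving 32 × 4 × 4 = 512 inputs to the first fully connected layer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FashionNet(nn.Module):
    """Sketch of the classifier described above."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, stride=1)   # 28x28 -> 24x24
        self.conv2 = nn.Conv2d(32, 32, kernel_size=5, stride=1)  # 12x12 -> 8x8
        self.fc1 = nn.Linear(32 * 4 * 4, 120)  # 8x8 pooled to 4x4 -> 512 inputs
        self.fc2 = nn.Linear(120, 10)          # one output per class

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```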
Loss Function and Optimizer
I initially trained the neural network using Adam with cross entropy loss and a learning rate of 0.01. I experimented with the learning rate and number of epochs, and found that a learning rate of 0.001 and around 5-10 epochs worked best. At 9 epochs, my validation accuracy was 90.66%.
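A minimal sketch of one training epoch with this loss and optimizer (the function name is mine, for illustration):

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, lr=1e-3):
    """Run one epoch of training with Adam and cross entropy loss."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```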
Results
Below is a plot of the train and validation accuracy during the training process. Note that the y-axis does not start at zero. The train accuracy eventually pulls ahead of the validation accuracy because the model begins to overfit to the training data.
This is a breakdown of the per-class accuracies on the test data:
Value | Class | Accuracy |
---|---|---|
0 | T-shirt | 87% |
1 | Trouser | 96% |
2 | Pullover | 88% |
3 | Dress | 90% |
4 | Coat | 80% |
5 | Sandal | 96% |
6 | Shirt | 65% |
7 | Sneaker | 98% |
8 | Bag | 98% |
9 | Ankle boot | 95% |
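One way to tabulate these per-class accuracies is to count correct predictions separately for each label over the test loader, along these lines:

```python
import torch

def per_class_accuracy(model, loader, num_classes=10):
    """Fraction of correctly classified examples for each class."""
    correct = torch.zeros(num_classes)
    total = torch.zeros(num_classes)
    model.eval()
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            for c in range(num_classes):
                mask = labels == c
                correct[c] += (preds[mask] == c).sum()
                total[c] += mask.sum()
    return correct / total.clamp(min=1)  # avoid division by zero for empty classes
```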
Correctly/incorrectly classified images
Most of the class accuracies are fairly high. Coats and shirts were misclassified most often; my hypothesis is that shirts are frequently confused with t-shirts and pullovers, which look similar. This table shows examples of images in each class that were correctly classified and misclassified.
Class | Correctly classified images | Misclassified images |
---|---|---|
T-shirt | ||
Trouser | ||
Pullover | ||
Dress | ||
Coat | ||
Sandal | ||
Shirt | ||
Sneaker | ||
Bag | ||
Ankle boot |
Visualizing the Learned Filters
In general, looking at the learned filters can help explain why and how a piece of data is classified, although I couldn't discern an obvious pattern in mine. These are the learned filters of the first convolutional layer of the network:
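The filters can be rendered as a grid of small grayscale images. This sketch assumes the model stores its first convolutional layer in an attribute named `conv1` (as mine does), giving a weight tensor of shape (32, 1, 5, 5):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

def plot_first_layer_filters(model, path="filters.png"):
    """Save each 5x5 kernel of the first conv layer as a grayscale tile."""
    weights = model.conv1.weight.detach()  # shape (32, 1, 5, 5)
    fig, axes = plt.subplots(4, 8, figsize=(8, 4))
    for ax, w in zip(axes.flat, weights):
        ax.imshow(w[0], cmap="gray")  # single input channel
        ax.axis("off")
    fig.savefig(path)
```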
Part 2: Semantic Segmentation
Semantic segmentation refers to labeling each pixel in an image with its object class. In this second part, I attempted to train a neural network to label the different parts of a building facade.
Dataloader
I used 80% of the data for training and 20% for validation. These are the classes that will be used to label the parts of each image:
Class | Color | Value |
---|---|---|
others | black | 0 |
facade | blue | 1 |
pillar | green | 2 |
window | orange | 3 |
balcony | red | 4 |
Here is an example of a facade image and its expected, labeled version:
CNN
I googled "semantic segmentation cnn" to get ideas on how to structure my network, and decided to make mine similar to the U-Net architecture from Ronneberger et al. 2015. After lots of experimentation, my CNN architecture is:
- convolutional layer (3 inputs, 64 outputs, 3x3 kernel) followed by max pooling (stride=2)
- convolutional layer (64 inputs, 128 outputs, 3x3 kernel) followed by max pooling (stride=2)
- convolutional layer (128 inputs, 256 outputs, 3x3 kernel) followed by max pooling (stride=2), and then upsampling (scale=2)
- convolutional layer (256 inputs, 128 outputs, 3x3 kernel) followed by upsampling (scale=2)
- convolutional layer (128 inputs, 64 outputs, 3x3 kernel) followed by upsampling (scale=2)
- convolutional layer (64 inputs, 5 outputs, 3x3 kernel)
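The layer list above can be sketched as a Sequential model. The bullet list doesn't spell out activations or padding, so I've assumed ReLU between layers and padding=1 so the 3x3 convolutions preserve spatial size; the three poolings then the three upsamplings return the output to the input resolution (assuming height and width are divisible by 8):

```python
import torch.nn as nn

# Sketch of the segmentation network described above (ReLU/padding are my assumptions).
seg_net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, stride=2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, stride=2),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, stride=2),
    nn.Upsample(scale_factor=2),
    nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(), nn.Upsample(scale_factor=2),
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(), nn.Upsample(scale_factor=2),
    nn.Conv2d(64, 5, 3, padding=1),  # one output channel per class
)
```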
Loss Function and Optimizer
I used cross entropy loss and trained my network on Adam with a learning rate of 1e-3 and weight decay of 1e-5.
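Since the loss is computed per pixel, the output logits have shape (N, 5, H, W) and the labels shape (N, H, W); PyTorch's CrossEntropyLoss handles this directly. A minimal sketch, with a placeholder one-layer model standing in for the real network:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
model = nn.Conv2d(3, 5, 3, padding=1)  # placeholder model, for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

logits = model(torch.randn(2, 3, 16, 16))  # (2, 5, 16, 16): per-pixel class scores
labels = torch.randint(0, 5, (2, 16, 16))  # (2, 16, 16): per-pixel class labels
loss = criterion(logits, labels)           # scalar loss over all pixels
```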
Results
I trained my network for 20 epochs. Below is a plot of the train and validation losses during the training process.
I was able to achieve an average precision of 47%. Here is the breakdown for each class:
Class | Color | AP |
---|---|---|
others | black | 0.578 |
facade | blue | 0.592 |
pillar | green | 0.093 |
window | orange | 0.797 |
balcony | red | 0.329 |
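The per-class AP can be computed by treating each pixel as a binary detection for that class, scoring it with the class's softmax probability. A sketch using scikit-learn (the function name is mine):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_class_ap(probs, labels, num_classes=5):
    """AP per class: probs is (n_pixels, num_classes) scores, labels is (n_pixels,) ints."""
    return [average_precision_score(labels == c, probs[:, c])
            for c in range(num_classes)]
```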
I tried running the trained model on some other pictures. These are two buildings I like in SF, the Phelan Building and an apartment on Union Street.
In the Phelan facade below, my network was able to recognize most of the windows (orange), though parts with darker shadows were classified as balconies (red). Also, decorative details that are actually part of the facade were labelled as others (black). I do not take credit for the image; I found it on Google Images.
The apartment building was not labelled as well, probably because of the irregular shapes of its bay windows. Windows and balconies were confused. The image is from Google Street View.
Here is an example from the test set. The left image is the facade, the middle is the ground-truth labels, and the right is my network's output. Clearly, my neural net is not good at distinguishing pillars and balconies, as indicated by their low APs.
Final thoughts
As someone with no ML experience coming into this class, some parts of this project were pretty challenging for me. For example, I had no idea where to start when I was adding layers to the neural net for segmentation. This was a nice introduction to PyTorch and Google Colab, but I still don't think I understand what is really going on in the middle of all those layers.