CS 194-26 Project 4: Classification and Segmentation

cs194-26-aga

Part 1: Classification



Objective

Our first objective was to classify images of clothing using our predictor, a convolutional neural network (CNN). For this project, we worked with the Fashion MNIST dataset, which consists of 10 classes, 60,000 training + validation images, and 10,000 test images.



Convolutional Neural Network Architecture


My CNN's architecture consisted of two convolutional layers, each followed by a ReLU and a max pooling layer, followed by 3 fully connected layers. Given the CNN, the next step was to define a loss function and optimizer to train the predictor. After trying different optimizers, learning rates, and values of weight decay, I found that I got the best results using the Adam optimizer with a learning rate of lr=0.001. For the loss, I used CrossEntropyLoss. The batch size was 50 and I trained the network for 10 epochs. I also used an 80/20 split for the training and validation sets, and used the validation set to select hyperparameters. A more detailed architecture description is below:


Layer type Parameters
conv2d in_channels=3, out_channels=32, kernel_size=3
relu
max_pool2d filter_size=2
conv2d in_channels=32, out_channels=32, kernel_size=3
relu
max_pool2d filter_size=2
linear in_features=32*5*5, out_features=120
linear in_features=120, out_features=84
linear in_features=84, out_features=10
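The architecture and training setup above can be sketched in PyTorch roughly as follows. This is a minimal reconstruction from the table, not the actual project code: the class and variable names are my own, and the ReLUs between the fully connected layers are an assumption not listed in the table (the table also specifies in_channels=3, which implies the grayscale Fashion MNIST images were expanded to three channels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FashionCNN(nn.Module):
    """Sketch of the classifier described in the table above."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3)
        self.fc1 = nn.Linear(32 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # 28x28 -> 26x26 (conv, no padding) -> 13x13 (pool)
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        # 13x13 -> 11x11 (conv) -> 5x5 (pool, floor), giving 32*5*5 features
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.flatten(1)
        x = F.relu(self.fc1(x))  # inter-fc ReLUs assumed, not in the table
        x = F.relu(self.fc2(x))
        return self.fc3(x)

model = FashionCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Batch size 50, as used during training.
logits = model(torch.randn(50, 3, 28, 28))
```

The 32*5*5 input to the first linear layer falls out of the spatial arithmetic in the comments, which is why it matches the table.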

The training and validation accuracy, graphed with respect to the number of epochs, is below. We see that the training accuracy exceeds the validation accuracy after a certain number of epochs, indicating that we are overfitting to the training set.


Below is a more detailed per-class breakdown of accuracy, for both the training and validation sets. Training gives an overall accuracy of 90%. I found that class 4 (coat) and class 6 (shirt) had the lowest accuracies.


Class number Class name Training accuracy Validation accuracy
0 t-shirt 90% 93%
1 trouser 97% 98%
2 pullover 88% 91%
3 dress 92% 95%
4 coat 81% 84%
5 sandal 97% 99%
6 shirt 72% 78%
7 sneaker 94% 95%
8 bag 98% 98%
9 ankle boot 97% 98%
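Per-class accuracies like those in the table can be computed with a simple tally over predicted and true labels. A minimal sketch (the function name and toy inputs are my own, for illustration):

```python
from collections import defaultdict

def per_class_accuracy(preds, labels, num_classes=10):
    """Fraction of examples of each class that were predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y in zip(preds, labels):
        total[y] += 1
        if p == y:
            correct[y] += 1
    return {c: correct[c] / total[c] for c in range(num_classes) if total[c] > 0}

# Toy example: class 0 is right 2 out of 3 times, class 1 is always right.
accs = per_class_accuracy([0, 0, 1, 1, 1], [0, 0, 0, 1, 1], num_classes=2)
```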

Below, we can see some examples from each class where the item was classified correctly and incorrectly. Some classes, such as shirt and coat, seem to be misclassified more often because their shapes are more ambiguous.


[Image grid: for each of the 10 classes (t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, ankle boot), one correctly classified and one incorrectly classified example.]

Finally, the learned filters of the first layer can be seen as follows:



Part 2: Semantic Segmentation


Objective


Given a database of facades, the objective is to correctly label each pixel with its object class. In this project, the five semantic segmentation labels are as follows: balcony, window, pillar, facade, and others. The Mini Facade dataset consists of images of buildings from around the world in a variety of architectural styles. The challenge here is to train our CNN to convert an image of a facade into an image of its corresponding labels according to the given label map. The process is similar to part one, but the structure of our convolutional neural network will be a little more complicated.


Convolutional Neural Network Architecture


My CNN's architecture consisted of 5 convolutional layers, each followed by a ReLU. After the 2nd convolutional layer, I added a max pooling layer, and before some of the ReLUs I added a BatchNorm layer to stabilize gradients. I started out using the Adam optimizer with lr=1e-3 and weight_decay=1e-5. The loss I used was CrossEntropyLoss. For training and validation, I used an 80% / 20% split. Finally, I trained for 5 epochs due to time constraints. The full architecture of my network is below:


Layer type Parameters
conv2d in_channels=3, out_channels=16, kernel_size=3, padding=1
relu
conv2d in_channels=16, out_channels=32, kernel_size=3, padding=1
relu
max_pool2d filter_size=2
conv2d in_channels=32, out_channels=64, kernel_size=3, padding=1
batchNorm2d num_features=64
relu
conv2d in_channels=64, out_channels=128, kernel_size=3, padding=1
relu
convTranspose2d in_channels=128, out_channels=256, kernel_size=4, padding=1, stride=2
conv2d in_channels=256, out_channels=5, kernel_size=1
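The table above can be sketched as a PyTorch Sequential model. Again, this is a reconstruction from the listed parameters, not the actual project code, and the input resolution in the comment is just an example:

```python
import torch
import torch.nn as nn

# Sketch of the segmentation network from the table above.
seg_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                      # halves the spatial resolution
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    # kernel_size=4, stride=2, padding=1 exactly undoes the pooling's
    # downsampling, restoring the input resolution.
    nn.ConvTranspose2d(128, 256, kernel_size=4, stride=2, padding=1),
    nn.Conv2d(256, 5, kernel_size=1),     # 1x1 conv to the 5 class scores
)

# e.g. a 64x64 input comes back out at 64x64, with 5 channels of class scores
out = seg_net(torch.randn(1, 3, 64, 64))
```

Because every 3x3 convolution uses padding=1 and the transposed convolution inverts the pooling, the output label map has the same height and width as the input image, as required for per-pixel classification.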

Given the CNN, the next step was to define a loss function and optimizer to train the predictor. After tuning hyperparameters on the validation set, I found that cross entropy loss worked best as the loss function, and that Adam with a learning rate of 1e-3 and weight decay of 1e-5 worked best as the optimizer. Below is a plot showing both training and validation loss across epochs:
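One convenient detail: PyTorch's CrossEntropyLoss applies per-pixel when given (N, C, H, W) logits and an (N, H, W) integer label map, so no reshaping is needed for segmentation. A quick illustration with dummy shapes:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Dummy shapes for illustration: batch of 4, 5 classes, 32x32 label maps.
logits = torch.randn(4, 5, 32, 32)          # raw per-pixel class scores
labels = torch.randint(0, 5, (4, 32, 32))   # ground-truth label map
loss = criterion(logits, labels)            # scalar, averaged over all pixels
```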



The average precision for each class is as follows:


Class Average Precision
Others 64.25%
Facade 68.8%
Pillar 7.73%
Window 75.4%
Balcony 26.8%
Average AP 48.63%
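Average precision for a class is computed from the network's per-pixel scores for that class. One common definition averages the precision at each true positive as pixels are scanned in order of decreasing score; a minimal pure-Python sketch (the function name and toy inputs are my own):

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at each true positive, scanning
    items in order of decreasing score (one common definition of AP)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Positives ranked 1st and 3rd: AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1])
```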

I observed that windows and facades were easiest to distinguish while pillars were the hardest to distinguish. This would make sense since there are many more windows in the training set, and perhaps pillars are often obscured or ambiguously shaped in comparison to windows. Here are some sample outputs from the test set:


[Three test-set examples, each showing the original image, the ground truth segmentation, and the learned segmentation.]

As we can see, the network struggles with learning pillars, probably because pillars seem to be more ambiguously shaped or not as common. On the other hand, windows and facades generally tend to be classified correctly. Other objects that don't fall into the categorization defined by the facade dataset are also very hard for the network to segment.


Below, I show the results of the network on a few images I picked. They are houses in San Francisco.


[Two examples, each showing the original image and the learned segmentation.]

We can see that the network does well on objects such as windows and facades, but there are some unique failure cases. The network mistakes the tree in the first image for a balcony, and labels a large portion of the second image as balconies. I can see how the second house might be mistaken for one with large balconies, given that San Francisco's Victorian houses are known for their distinctive bay windows.


Reflection

It was fun to observe classes and labels with various accuracies and ponder why this might be: perhaps the training set did not have as many instances of a class, or two classes are very similar to each other. For future development, it would be cool to work on methods of mitigating these misclassifications.