CS 194-26 Project 4: Classification and Segmentation

cs194-26-aga

Part 1: Classification



Objective

Our first objective was to classify images of clothing using our predictor, a convolutional neural network (CNN). For this project, we worked with the Fashion MNIST dataset, which consists of 10 classes, 60,000 training + validation images, and 10,000 test images.



Convolutional Neural Network Architecture


My CNN's architecture consisted of two convolutional layers, each followed by a ReLU and a max pooling layer, followed by 3 fully connected layers. Given the CNN, the next step was to define a loss function and optimizer to train the predictor. After trying different optimizers, learning rates, and values of weight decay, I found that I got the best results using the Adam optimizer with a learning rate of lr=0.001. For the loss, I used CrossEntropyLoss. The batch size was 50 and I trained the network for 10 epochs. I also used an 80/20 split for the training and validation sets, and used the validation set to select hyperparameters. A more detailed architecture description is below:


Layer type Parameters
conv2d in_channels=3, out_channels=32, kernel_size=3
relu
max_pool2d filter_size=2
conv2d in_channels=32, out_channels=32, kernel_size=3
relu
max_pool2d filter_size=2
linear in_features=32*5*5, out_features=120
linear in_features=120, out_features=84
linear in_features=84, out_features=10
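The architecture and training setup above can be sketched in PyTorch roughly as follows. This is a minimal reconstruction from the table, not the actual project code: the class and variable names are my own, and the ReLUs between the fully connected layers are an assumption not listed in the table (the table also specifies in_channels=3, which implies the grayscale Fashion MNIST images were expanded to three channels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FashionCNN(nn.Module):
    """Sketch of the classifier described in the table above."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3)
        self.fc1 = nn.Linear(32 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # 28x28 -> 26x26 (conv, no padding) -> 13x13 (pool)
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        # 13x13 -> 11x11 (conv) -> 5x5 (pool, floor), giving 32*5*5 features
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.flatten(1)
        x = F.relu(self.fc1(x))  # inter-fc ReLUs assumed, not in the table
        x = F.relu(self.fc2(x))
        return self.fc3(x)

model = FashionCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Batch size 50, as used during training.
logits = model(torch.randn(50, 3, 28, 28))
```

The 32*5*5 input to the first linear layer falls out of the spatial arithmetic in the comments, which is why it matches the table.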

The training and validation accuracy, graphed with respect to the number of epochs, is below. We see that the training accuracy exceeds the validation accuracy after a certain number of epochs, indicating that we are overfitting to the training set.


Below is a more detailed per-class breakdown of accuracy, for both the training and validation sets. Training gives an overall accuracy of 90%. I found that class 4 (coat) and class 6 (shirt) had the lowest accuracies.


Class number Class name Training accuracy Validation accuracy
0 t-shirt 90% 93%
1 trouser 97% 98%
2 pullover 88% 91%
3 dress 92% 95%
4 coat 81% 84%
5 sandal 97% 99%
6 shirt 72% 78%
7 sneaker 94% 95%
8 bag 98% 98%
9 ankle boot 97% 98%
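Per-class accuracies like those in the table can be computed with a simple tally over predicted and true labels. A minimal sketch (the function name and toy inputs are my own, for illustration):

```python
from collections import defaultdict

def per_class_accuracy(preds, labels, num_classes=10):
    """Fraction of examples of each class that were predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y in zip(preds, labels):
        total[y] += 1
        if p == y:
            correct[y] += 1
    return {c: correct[c] / total[c] for c in range(num_classes) if total[c] > 0}

# Toy example: class 0 is right 2 out of 3 times, class 1 is always right.
accs = per_class_accuracy([0, 0, 1, 1, 1], [0, 0, 0, 1, 1], num_classes=2)
```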

Below, we can see some examples from each class where the item was classified correctly and incorrectly. Some classes, such as shirt and coat, seem to be misclassified more often because their shapes are more ambiguous.


[Image grid: for each of the 10 classes (t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, ankle boot), one correctly classified and one incorrectly classified example.]

Finally, the learned filters of the first layer can be seen as follows:



Part 2: Semantic Segmentation


Objective


Given a database of facades, the objective is to correctly label each pixel with its object class. In this project, the five semantic segmentation labels are as follows: balcony, window, pillar, facade, and others. The Mini Facade dataset consists of images of buildings from around the world in a variety of architectural styles. The challenge here is to train our CNN to convert an image of a facade into an image of its corresponding labels according to the given label map. The process is similar to part one, but the structure of our convolutional neural network will be a little more complicated.


Convolutional Neural Network Architecture


My CNN's architecture consisted of 5 convolutional layers, each followed by a ReLU. After the 2nd convolutional layer, I added a max pooling layer, and before some of the ReLUs I added a BatchNorm layer to stabilize gradients. I started out using the Adam optimizer with lr=1e-3 and weight_decay=1e-5. The loss I used was CrossEntropyLoss. For training and validation, I used an 80% / 20% split. Finally, I trained for 5 epochs due to time constraints. The full architecture of my network is below:


Layer type Parameters
conv2d in_channels=3, out_channels=16, kernel_size=3, padding=1
relu
conv2d in_channels=16, out_channels=32, kernel_size=3, padding=1
relu
max_pool2d filter_size=2
conv2d in_channels=32, out_channels=64, kernel_size=3, padding=1
batchNorm2d num_features=64
relu
conv2d in_channels=64, out_channels=128, kernel_size=3, padding=1
relu
convTranspose2d in_channels=128, out_channels=256, kernel_size=4, padding=1, stride=2
conv2d in_channels=256, out_channels=5, kernel_size=1
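The table above can be sketched as a PyTorch Sequential model. Again, this is a reconstruction from the listed parameters, not the actual project code, and the input resolution in the comment is just an example:

```python
import torch
import torch.nn as nn

# Sketch of the segmentation network from the table above.
seg_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                      # halves the spatial resolution
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    # kernel_size=4, stride=2, padding=1 exactly undoes the pooling's
    # downsampling, restoring the input resolution.
    nn.ConvTranspose2d(128, 256, kernel_size=4, stride=2, padding=1),
    nn.Conv2d(256, 5, kernel_size=1),     # 1x1 conv to the 5 class scores
)

# e.g. a 64x64 input comes back out at 64x64, with 5 channels of class scores
out = seg_net(torch.randn(1, 3, 64, 64))
```

Because every 3x3 convolution uses padding=1 and the transposed convolution inverts the pooling, the output label map has the same height and width as the input image, as required for per-pixel classification.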

Given the CNN, the next step was to define a loss function and optimizer to train the predictor. After tuning hyperparameters on the validation set, I found that cross entropy loss worked best as the loss function, and that Adam with a learning rate of 1e-3 and weight decay of 1e-5 worked best as the optimizer. Below is a plot showing both training and validation loss across epochs:
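One convenient detail: PyTorch's CrossEntropyLoss applies per-pixel when given (N, C, H, W) logits and an (N, H, W) integer label map, so no reshaping is needed for segmentation. A quick illustration with dummy shapes:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Dummy shapes for illustration: batch of 4, 5 classes, 32x32 label maps.
logits = torch.randn(4, 5, 32, 32)          # raw per-pixel class scores
labels = torch.randint(0, 5, (4, 32, 32))   # ground-truth label map
loss = criterion(logits, labels)            # scalar, averaged over all pixels
```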



The average precision for each class is as follows:


Class Average Precision
Others 64.25%
Facade 68.8%
Pillar 7.73%
Window 75.4%
Balcony 26.8%
Average AP 48.63%
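Average precision for a class is computed from the network's per-pixel scores for that class. One common definition averages the precision at each true positive as pixels are scanned in order of decreasing score; a minimal pure-Python sketch (the function name and toy inputs are my own):

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at each true positive, scanning
    items in order of decreasing score (one common definition of AP)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Positives ranked 1st and 3rd: AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1])
```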

I observed that windows and facades were easiest to distinguish while pillars were the hardest to distinguish. This would make sense since there are many more windows in the training set, and perhaps pillars are often obscured or ambiguously shaped in comparison to windows. Here are some sample outputs from the test set:


[Three test-set examples, each showing the original image, the ground truth segmentation, and the learned segmentation.]

As we can see, the network struggles with learning pillars, probably because pillars seem to be more ambiguously shaped or not as common. On the other hand, windows and facades generally tend to be classified correctly. Other objects that don't fall into the categorization defined by the facade dataset are also very hard for the network to segment.


Below, I show the results of the network on a few images I picked. They are houses in San Francisco.


[Two examples, each showing the original image and the learned segmentation.]

We can see that the network does well on objects such as windows and facades, but there are some unique failure cases. The network mistakes the tree in the first image for a balcony, and labels a large portion of the second image as balconies. I can see how the second house might be mistaken for one with large balconies, given that San Francisco's Victorian houses are known for their distinctive bay windows.


Reflection

It was fun to observe classes and labels with various accuracies and ponder why this might be: perhaps the training set did not have as many instances of a class, or two classes are very similar to each other. For future development, it would be cool to work on methods of mitigating these misclassifications.