Part One: Classification

In this project, we aimed to create a basic CNN classifier using the PyTorch framework. We sought to classify images in the FashionMNIST dataset into 10 categories: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot.

Random Examples of Items in the Training Data

Class Example
Bag
Pullover
T-shirt/top
Dress

Neural Network Design

For my neural network design, I have two convolutional layers (each with 64 output channels), separated by a SELU and a MaxPool layer. I then send the data through two fully connected layers, separated by a SELU. I increased the convolutional channels from the suggested 32 to 64, which gave me higher accuracy. I also experimented with a wide range of PyTorch's nonlinear activation functions; I found nn.SELU() gave the best results, so I replaced all the ReLU nonlinearities with SELU in my best model. My first fully connected layer maps CONV_CHANNELS * N_PIXELS_IN_IMAGE = 1600 inputs to 120 outputs, and the second maps 120 to 10 to reduce the number of outputs to one per class.

Network summary: input -> conv2d(3x3, 64 channels) -> SELU -> MaxPool -> conv2d(3x3, 64 channels) -> SELU -> MaxPool -> flatten -> fully_connected(1600, 120) -> SELU -> fully_connected(120, 10)
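The summary above can be sketched in PyTorch roughly as follows (class and attribute names are my own illustration; the exact code differs, but the layer shapes match the summary):

```python
import torch
import torch.nn as nn

class FashionCNN(nn.Module):
    """Sketch of the classifier described above."""
    def __init__(self, conv_channels=64, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, conv_channels, kernel_size=3),              # 28x28 -> 26x26
            nn.SELU(),
            nn.MaxPool2d(2),                                         # 26x26 -> 13x13
            nn.Conv2d(conv_channels, conv_channels, kernel_size=3),  # 13x13 -> 11x11
            nn.SELU(),
            nn.MaxPool2d(2),                                         # 11x11 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(conv_channels * 5 * 5, 120),  # 64 * 25 = 1600 inputs
            nn.SELU(),
            nn.Linear(120, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

Note that the 1600 comes from the 64 channels times the 5x5 spatial size left after two unpadded 3x3 convolutions and two 2x2 poolings.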

Hyperparameters

I trained my network with a batch size of 512 to speed up training via GPU acceleration. I used the nn.CrossEntropyLoss criterion and the Adam optimizer with a learning rate of 0.001. I experimented with different learning rates, optimizers (such as SGD) and batch sizes, but I got my best results with these. I also experimented with the number of epochs and settled on 10 (although the difference between all values from 6 to 10 was very marginal and mostly dependent on the random order in which the training data was loaded).
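As a sketch, one epoch of training with these hyperparameters looks roughly like the following (the function is my own simplified illustration, not the exact code I ran):

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, criterion, device="cpu"):
    """Run one pass over the training loader, returning the mean loss."""
    model.train()
    total_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
    return total_loss / len(loader.dataset)

# Used with the settings described above:
# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```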

I also split the 60,000-image training set into 5/6 training and 1/6 validation data, giving 50,000 training images and 10,000 validation images alongside the standard 10,000-image test set (roughly 71%, 14%, and 14% of the full 70,000-image dataset).
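A minimal sketch of that split, assuming torch.utils.data.random_split (the helper name and seed are my own illustration):

```python
import torch
from torch.utils.data import random_split

def split_train_val(train_set, val_fraction=1/6, seed=0):
    """Split the training set 5/6 train / 1/6 validation, reproducibly."""
    n_val = round(len(train_set) * val_fraction)
    n_train = len(train_set) - n_val
    gen = torch.Generator().manual_seed(seed)
    return random_split(train_set, [n_train, n_val], generator=gen)
```

With the 60,000-image FashionMNIST training set this yields the 50,000/10,000 split described above.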

Training and Validation Accuracies Over Time

Here we can see the model is overfitting slightly, as the training accuracy keeps increasing while the validation accuracy improves only marginally; still, with this number of epochs the validation loss was at its lowest.

Testing Accuracy

Over all the testing data, I ended up getting an accuracy of 90.52%. The following table details the per class accuracy of my best model.

Class Validation Accuracy Testing Accuracy
T-shirt/top 89.51% 86.1%
Trouser 98.7% 97.5%
Pullover 85.57% 86.5%
Dress 91.93% 90.9%
Coat 93.16% 89.2%
Sandal 98.7% 97.4%
Shirt 74.46% 68.0%
Sneaker 97.99% 97.4%
Bag 98.68% 98.0%
Ankle boot 97.56% 95.8%

From this table, we can see that Shirt is the hardest class to classify, likely due to its similarity to other classes such as T-shirt/top and Coat. Most of the other classes had fairly high accuracies.
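The per-class accuracies above can be computed with a sketch like the following (a simplified illustration of counting correct predictions per true label, not my exact evaluation code):

```python
import torch
import torch.nn as nn

def per_class_accuracy(model, loader, n_classes=10, device="cpu"):
    """Accuracy per class: correct predictions / examples of that class."""
    correct = torch.zeros(n_classes)
    total = torch.zeros(n_classes)
    model.eval()
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            for c in range(n_classes):
                mask = labels == c
                total[c] += mask.sum()
                correct[c] += (preds[mask] == c).sum()
    return correct / total.clamp(min=1)  # avoid division by zero
```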

Per Class Output Examples

The following table contains example inputs and outputs for each class. In it, there are two inputs the model classified correctly, and two inputs that it mistakenly classified as that class.

Class Correct 1 Correct 2 Incorrect 1 True Class Incorrect 2 True Class
T-shirt/top Shirt Shirt
Trouser Dress Dress
Pullover Shirt Dress
Dress Shirt Coat
Coat Pullover Shirt
Sandal Sneaker Ankle boot
Shirt Coat Coat
Sneaker Ankle boot Ankle boot
Bag Sandal T-shirt/top
Ankle boot Sandal Sneaker

Learned Convolutional Filters

The following images are representations of the 64 3x3 convolutional filters that my model learned. As I had two convolutional layers, I have two sets of 64 masks.

Conv1 Masks:

Conv2 Masks:
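The masks above come from the learned weights of the two convolutional layers, which can be extracted for plotting with a sketch like this (the helper is illustrative):

```python
import torch.nn as nn

def conv_filters(model):
    """Collect the learned filter weights of every Conv2d layer for plotting."""
    return [m.weight.detach().cpu() for m in model.modules()
            if isinstance(m, nn.Conv2d)]
```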

Part Two: Semantic Segmentation

Overview

In this part I aimed to separate different parts of an image of a facade of a building into the five classes pictured below. To do this, I used a deep convolutional neural network.

Network Architecture

My network has six convolutional layers with a kernel size of 5x5, each separated by a SELU, just as in the last section of this project. The first convolutional layer has 3 input channels and 32 output channels, the intermediate layers have 32 input and output channels, and the last layer has 5 output channels to match the number of classes.

Although I ended up using SELU, I also ran the model with ReLU and got a very similar result, with a much less drastic difference than in the first section.

I also experimented with downsampling (MaxPool, etc.), upsampling (ConvTranspose2d, etc.), BatchNorm and other layers, and changing the number of convolutional channels (either all smaller, all bigger, or gradual changes like 32->64->128->64->32). However, these all had only a minimal effect on the resulting AP score while slowing down the training process greatly. The biggest impact came from increasing the kernel size from 3x3 to 5x5. When I did this I also managed to decrease the convolutional channels from 64 to 32 without the AP score decreasing, which helped speed up training. For tuning my architecture, I calculated the AP score on the validation set.

Network summary: input -> conv2d(5x5, 32 channels) -> SELU -> conv2d(5x5, 32 channels) -> SELU -> conv2d(5x5, 32 channels) -> SELU -> conv2d(5x5, 32 channels) -> SELU -> conv2d(5x5, 32 channels) -> SELU -> conv2d(5x5, 5 channels)
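This architecture can be sketched in PyTorch as follows (a sketch under the assumption that each 5x5 convolution uses padding=2 so the output keeps the input's spatial size, which per-pixel segmentation requires; the builder function is my own illustration):

```python
import torch
import torch.nn as nn

def make_segnet(channels=32, n_classes=5):
    """Six 5x5 conv layers at full resolution; padding=2 keeps H x W unchanged."""
    layers = [nn.Conv2d(3, channels, kernel_size=5, padding=2), nn.SELU()]
    for _ in range(4):  # four intermediate 32 -> 32 layers
        layers += [nn.Conv2d(channels, channels, kernel_size=5, padding=2), nn.SELU()]
    layers.append(nn.Conv2d(channels, n_classes, kernel_size=5, padding=2))
    return nn.Sequential(*layers)
```

The output is a 5-channel score map the same height and width as the input, one channel per class.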

Hyperparameters

I used a batch size of 64 to increase training speed. I wanted to set it higher, but ran into sporadic out-of-GPU-memory errors on Google Colab; my results with 64 are still acceptable.

I kept the learning rate, weight decay, loss function and optimizer from the starter code (lr=1e-3, weight decay=1e-5, loss=cross-entropy, optimizer=Adam). I experimented with changing these values, but was unable to get a significantly different result from the one listed above. I also experimented with weighting the loss function for each class based on the inverse of its frequency, but this did not improve the result noticeably.
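The inverse-frequency weighting I tried can be sketched like this (the helper and its normalization are my own illustration of the idea, not the exact code):

```python
import torch
import torch.nn as nn

def inverse_frequency_weights(label_counts):
    """Per-class weights proportional to inverse frequency, scaled so they sum
    to the number of classes (so the overall loss scale is roughly unchanged)."""
    counts = torch.as_tensor(label_counts, dtype=torch.float)
    weights = counts.sum() / counts
    return weights / weights.sum() * len(counts)

# Passed to the criterion as:
# criterion = nn.CrossEntropyLoss(weight=inverse_frequency_weights(counts))
```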

I ended up using 50 epochs. This seemed to be the largest factor in the model's quality: the loss kept going down as the epochs increased, but by 50 epochs the improvement had become very gradual.

Results

AP Scores

Class AP
0 0.6228
1 0.7130
2 0.1055
3 0.7629
4 0.2945

Average AP Value: 0.4997
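For reference, the per-class AP scores can be computed from per-pixel class scores with a sketch like the following (assuming scikit-learn's average_precision_score; the function is my own illustration of the metric, not the grading code):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_class_ap(scores, labels, n_classes=5):
    """AP per class: one-vs-rest over all pixels, ranked by the class's score.

    scores: array of shape (n_classes, ...) of per-pixel class scores.
    labels: integer array of the same trailing shape with the true class ids.
    """
    return [average_precision_score((labels == c).ravel(), scores[c].ravel())
            for c in range(n_classes)]
```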

From these results, we can see my model struggled to classify balconies and pillars most of all. As these were some of the most infrequent classes in the training data, it’s possible that the model did not learn enough to differentiate them well from the others.

Training and Validation Loss Over Time

Example Images

Here are a few random examples of the input image, my output, and the ground truth images to give a feeling for how much my model learned.

My Own Image!

I ran my model on a photo I took of the Colosseum in Rome during my trip this winter break.

Input Photo Output Photo

Here we can see it did a decent job detecting the sky, the facade, and even the pillars between the windows. However, it only managed to detect the edges of the windows and classified their interiors as balcony (which, I suppose, is not too far off the physical structure). It also struggled more with the pillars as the curvature of the structure became more pronounced, likely due to the lack of curved/angled images in the training set.

Reflections

I enjoyed learning more about machine learning during this project, as I have largely avoided it during my time here at Cal. As a final note, I am colorblind and find it very hard to distinguish between the red and the orange chosen for Part 2. A yellow or dark green would have been much easier for my own internal neural network to classify :)