Project 4

Part 1

For this project we use convolutional neural networks to classify images. The first dataset is the Fashion MNIST dataset, where the task is to identify various clothing items and label them with the appropriate category. Here we have four random images sampled from the dataset, along with their labels.

The Model

The convolutional neural net consists of four layers: two convolutional layers followed by two fully connected layers. Every layer but the last uses ReLU as its activation function, and each convolutional layer is also followed by max pooling. After 10 epochs of training, an overall test accuracy of 0.9151 was achieved. Below is a graph of the training and validation accuracies during training. In the graph, accuracies are reported at the start of each epoch, so the first point at epoch 1 represents a net with no training at all.
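The architecture above can be sketched as follows. The first conv layer's 128 filters of size 3x3 match the filter visualization later in this report; the second conv layer's channel count and the hidden fully connected width are assumptions, not the exact values used.

```python
import torch
import torch.nn as nn

class FashionNet(nn.Module):
    """Sketch of the four-layer net: two conv layers (ReLU + max pool)
    followed by two fully connected layers, for 28x28 grayscale inputs."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 128, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(128, 64, kernel_size=3, padding=1)  # 64 channels is an assumption
        self.pool = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(64 * 7 * 7, 256)                      # 256 hidden units is an assumption
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))  # 28x28 -> 14x14
        x = self.pool(torch.relu(self.conv2(x)))  # 14x14 -> 7x7
        x = x.flatten(1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                        # no activation on the final layer
```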

If we calculate the accuracies by class, we find that shirts and pullovers are the hardest to classify correctly. Shirts are clearly the worst, as our accuracy for them is below 80%, unlike all the other classes. This is reflected in both the test and validation sets.
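The per-class breakdown can be computed with a small helper like the one below, operating on tensors of predicted and true labels (the names here are illustrative, not the report's actual code):

```python
import torch

def per_class_accuracy(preds, labels, num_classes):
    """Fraction of examples of each class that were predicted correctly."""
    accs = []
    for c in range(num_classes):
        mask = labels == c                               # examples whose true class is c
        accs.append((preds[mask] == c).float().mean().item())
    return accs
```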

We can also take a random sample of 2 images from each class which the network classifies correctly.

Naturally, we can also find 2 random images of each class that the network classifies incorrectly. Each image is titled with the network's incorrect prediction and the correct label.
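Both samplings can be done with one routine that picks random correctly or incorrectly classified example indices per class; this is a sketch with hypothetical names, assuming predictions and labels have been collected into tensors:

```python
import torch

def sample_by_correctness(preds, labels, correct=True, n=2):
    """Return up to n random example indices per class, restricted to
    correctly (or incorrectly) classified examples."""
    samples = {}
    for c in labels.unique().tolist():
        hit = (preds == labels) if correct else (preds != labels)
        idx = ((labels == c) & hit).nonzero().flatten()
        perm = idx[torch.randperm(len(idx))][:n]  # shuffle, then take the first n
        samples[c] = perm.tolist()
    return samples
```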

Finally, we can visualize all of the trained filters in the first convolutional layer. There are 128 of them in total, and each is 3x3 in size.
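The filter grid can be produced along these lines, assuming the trained model exposes its first convolutional layer (the 128-filter, 3x3 shape is from the text; the layout and grayscale colormap are choices of this sketch):

```python
import torch
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def plot_first_layer_filters(conv_layer, ncols=16):
    """Plot every 3x3 filter of the first conv layer in a grid."""
    w = conv_layer.weight.detach().cpu()      # shape (128, 1, 3, 3)
    nrows = w.shape[0] // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(ncols, nrows))
    for ax, filt in zip(axes.flat, w):
        ax.imshow(filt[0], cmap="gray")       # single input channel
        ax.axis("off")
    return fig
```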

Part 2

The second dataset is the Mini Facade dataset, which consists of images of building exteriors. The goal is to take an image of a building and label each pixel in the image with a class that represents what kind of object the pixel is a part of. This is known as semantic segmentation. Once the network is trained, it can be used to generate images with regions colored according to the parts of the building it has identified. There are 5 different classes possible: facade, pillar, window, balcony, and other.

The Model

Compared to the Fashion MNIST dataset, a more complex convolutional neural network is necessary. We use 5 convolutional layers, each with 3x3 kernels. The first layer outputs 500 channels, and the channel count gradually decreases with each layer until we reach the desired 5 classes; the progression is (500, 300, 200, 100, 5). All layers except the last use a ReLU activation function and batch normalization (in an attempt to speed up training). After every even-numbered layer, max pooling is applied and then upsampling, to compensate for the dimension loss from the pooling.
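The description above can be sketched as the module below. The channel progression, 3x3 kernels, batch norm, and pool-then-upsample placement come from the text; the padding choices and RGB input are assumptions.

```python
import torch
import torch.nn as nn

class FacadeNet(nn.Module):
    """Sketch of the five-layer segmentation net with channel
    progression (500, 300, 200, 100, 5)."""

    def __init__(self, num_classes=5):
        super().__init__()

        def block(c_in, c_out):
            # conv + batch norm + ReLU, used for all layers but the last
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
            )

        self.layer1 = block(3, 500)
        self.layer2 = block(500, 300)
        self.layer3 = block(300, 200)
        self.layer4 = block(200, 100)
        self.layer5 = nn.Conv2d(100, num_classes, 3, padding=1)  # no ReLU/BN on the last layer
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2)

    def forward(self, x):
        x = self.layer1(x)
        x = self.up(self.pool(self.layer2(x)))  # pool + upsample after layer 2
        x = self.layer3(x)
        x = self.up(self.pool(self.layer4(x)))  # pool + upsample after layer 4
        return self.layer5(x)                   # one logit map per class, full resolution
```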

For the optimizer, I first tried Adadelta, since that worked well for Fashion MNIST, but it did not do very well here out of the box. I stuck with Adam and tried a few learning rate and weight decay values across different orders of magnitude. Changing the learning rate generally hurt performance, but a lower weight decay of 1e-6 helped a bit; the learning rate remained at 1e-3. The maximum epoch count was tricky to tune, since it was hard to predict the point at which further training would stop helping validation performance and instead result in overfitting. A count that was too high would also make tuning the other hyperparameters take prohibitively long. I chose 40 epochs in an attempt to strike a balance.
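The winning settings from the sweep above translate to a one-line optimizer setup (the model here is a stand-in, not the actual network):

```python
import torch

# Stand-in model; in the report this would be the segmentation network.
model = torch.nn.Conv2d(3, 5, 3)

# Adam with the default 1e-3 learning rate and the reduced 1e-6 weight decay
# that the hyperparameter search above settled on.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
```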

To measure classification performance, I decided to calculate total loss over the training set in addition to running loss, which would make it easier to compare to the total loss calculated over the validation set. From the graphs, it appears that after 40 epochs the validation loss starts to level out, and the training loss goes below the validation loss. This indicates possible overfitting, and seems like a good stopping point.

In the graph below, total loss is measured at the end of each epoch, so the point at epoch 40 represents the final loss, and the point at epoch 0 is the loss before any training iterations have run.
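The total-loss measurement used for these curves can be sketched as a pass over an entire data loader with gradients disabled, averaging the per-batch loss (the function name and signature here are illustrative):

```python
import torch

@torch.no_grad()
def total_loss(model, loader, criterion):
    """Average per-batch loss over a whole dataset, e.g. the full
    training or validation set at the end of an epoch."""
    model.eval()
    losses = [criterion(model(x), y).item() for x, y in loader]
    return sum(losses) / len(losses)
```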

The average precision on the test set was 0.5089. For the individual classes (others, facade, pillar, window, balcony), the APs were 0.6459, 0.7130, 0.1096, 0.7074, and 0.3684 respectively. The net seems to struggle with pillars and balconies the most.
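For reference, average precision for one class can be computed from scores and binary ground-truth labels with the standard ranking formula below; this is a generic sketch, not the project's actual evaluation code, and in practice it would be applied one-vs-rest per class over all pixels.

```python
import numpy as np

def average_precision(scores, positives):
    """AP for one class: mean of the precision values at each true positive,
    with examples ranked by descending score."""
    order = np.argsort(-scores)                       # highest score first
    hits = positives[order]                           # 1 where the ranked example is a true positive
    cum_hits = np.cumsum(hits)
    precision = cum_hits / (np.arange(len(hits)) + 1) # precision at each rank
    return (precision * hits).sum() / hits.sum()
```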

Now let's run the trained network on an image of my own. Below is a photo of one of the campus residence halls.


We can see that in the output, the network labels most of the image as windows. Most of the exterior is indeed windows, but the network went a bit overboard with classifying things as windows. The trees and background parts in the image get mislabeled as balconies or facades. There are no pillars in the image, but there are spurious detections of them in the output. The few actual balconies in the image are somewhat labeled as such, albeit crudely. In conclusion, it seems the network could use some improvement.
