Classification and Segmentation

James Fong (cs194-abd)

CS194-26: Image Manipulation and Computational Photography Spring 2020

Image Classification

Here I trained a simple convolutional neural network to recognize different items of clothing. I used the torchvision.datasets.FashionMNIST dataset.

The human-readable class names are not included in PyTorch, so I used the names given in the official GitHub repository. I also inverted the images so that they display in the original grayscale.
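
A minimal sketch of these two steps, not the project code. The class list is the one from the official Fashion-MNIST README, in label order. The names `CLASS_NAMES` and `invert_image` are made up for illustration, and the sketch assumes 8-bit pixel values held in nested lists rather than the tensors a real pipeline would use.

```python
# Class names in label order (0-9), as listed in the official Fashion-MNIST README.
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def invert_image(pixels, max_value=255):
    """Return a copy of a 2D image with every pixel flipped: p -> max_value - p."""
    return [[max_value - p for p in row] for row in pixels]

# Tiny 2x2 example image (assumed 8-bit grayscale values).
tiny = [[0, 255], [100, 30]]
inverted = invert_image(tiny)

label = 6
print(CLASS_NAMES[label], inverted)
```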

Here are a few sample images and their classes:

I started with the recommended CNN architecture given in the spec. However, I found that increasing the number of channels from 32 to 64 raised the validation accuracy by 1-2%. With everything else held constant, ReLU turned out to be the best non-linearity. Lowering the learning rate from 0.01 to 0.001 also boosted performance.

I trained for a total of 100 epochs. Here are the validation and training accuracies, evaluated at the end of each epoch:

At the end of training, I ran my model on both the validation and test sets and found the following per-class accuracies:

Class Val Acc. Test Acc.
Shirt 71.20% 68.00%
Pullover 86.38% 85.40%
T-Shirt 90.25% 89.60%
Coat 89.36% 88.70%
Dress 94.88% 94.00%
Sneaker 95.33% 96.40%
Ankle boot 97.46% 97.30%
Trouser 98.21% 97.70%
Sandal 97.71% 97.70%
Bag 98.32% 97.70%
TOTAL 91.95% 91.25%
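
For reference, here is one way per-class and overall accuracy can be computed from label and prediction lists. This is a sketch only: the function name `per_class_accuracy` and the toy data are hypothetical and are not the project's evaluation code or results.

```python
from collections import Counter

def per_class_accuracy(labels, predictions):
    """Map each class to the fraction of its examples that were predicted correctly."""
    totals = Counter(labels)
    correct = Counter(y for y, p in zip(labels, predictions) if y == p)
    return {cls: correct[cls] / totals[cls] for cls in totals}

# Hypothetical toy data, not the project's results.
labels      = ["Shirt", "Shirt", "Shirt", "Shirt", "Bag", "Bag"]
predictions = ["Shirt", "Pullover", "Shirt", "T-Shirt", "Bag", "Bag"]

accuracy = per_class_accuracy(labels, predictions)
overall = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
print(accuracy, overall)
```

Overall accuracy equals the mean of the per-class accuracies only when every class has the same number of examples. In the table above, the test column averages to exactly the TOTAL of 91.25%, while the validation column averages to 91.91% against a TOTAL of 91.95%, which suggests the validation split is not perfectly balanced.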

As we can see, the most challenging class is “shirt”, which achieves only 68% accuracy on the test data. The next hardest is “pullover”, which achieves about 85%.

Looking at the examples below, it is easy to see why. The two classes are visually quite similar, so the CNN often confuses them.

Additionally, here are 2 examples per class from the test data that my model classified correctly:

Here are 2 examples per class from the test data that my model classified incorrectly:

Both the correct label and the incorrect prediction are shown in the title.

Learned filters

First layer. Darker values are negative and lighter values are positive:

Second layer. Since the second layer has 64 channels, only the first 3 are shown as RGB:

Semantic Segmentation

This model uses 6 convolutional layers:

Layer Input Channels Output Channels Kernel Size
1 3 16 17x17
2 16 32 17x17
3 32 64 17x17
4 64 128 17x17
5 128 128 17x17
6 128 5 17x17
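
To get a feel for the size of this model, the parameter count can be worked out from the table. The sketch below assumes ordinary dense 2D convolutions with a bias term per output channel. The writeup does not state either detail, so the totals are illustrative rather than exact.

```python
# (input channels, output channels) for the six layers in the table above.
LAYERS = [(3, 16), (16, 32), (32, 64), (64, 128), (128, 128), (128, 5)]
KERNEL = 17  # every layer uses a 17x17 kernel

def conv_params(c_in, c_out, k=KERNEL, bias=True):
    """Parameter count of one 2D convolution: k*k*c_in*c_out weights, plus c_out biases."""
    return k * k * c_in * c_out + (c_out if bias else 0)

per_layer = [conv_params(c_in, c_out) for c_in, c_out in LAYERS]
total = sum(per_layer)

for i, n in enumerate(per_layer, start=1):
    print(f"layer {i}: {n:,} parameters")
print(f"total: {total:,} parameters")
```

Under these assumptions the fifth layer (128 to 128 channels) accounts for more than half of the roughly 8 million parameters.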

To speed up training, I apply the “ResNet” technique described in lecture. That is, each time we apply a layer to \(x\), we add \(x\) back into the output:

\[x' = \text{layer}(\text{ReLU}(x), w) + x\]

(I do not add \(x\) back to itself for the final layer. Instead, I just apply ReLU again at the end. See the code on bCourses for the implementation.)
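
The update rule above can be illustrated without a deep learning framework. The sketch below is a toy, not the code on bCourses: `half` is a hypothetical stand-in for a convolution, and the data is a flat list instead of an image tensor. It only covers the case where the layer's output has the same shape as its input. How the real model handles layers that change the channel count (for example 3 to 16) is not described here. The padding of 8 is likewise an assumption: it is the value that keeps the spatial size unchanged for a 17x17 kernel at stride 1, which the addition requires.

```python
def relu(x):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, v) for v in x]

def residual_step(x, layer):
    """Compute x' = layer(ReLU(x)) + x; layer must return a list the same length as x."""
    out = layer(relu(x))
    if len(out) != len(x):
        raise ValueError("the skip connection needs matching shapes")
    return [o + v for o, v in zip(out, x)]

def same_padding(kernel_size):
    """Padding that keeps the spatial size unchanged for an odd kernel at stride 1."""
    return (kernel_size - 1) // 2

def half(v):
    """Hypothetical stand-in for a convolution: scale every element by 0.5."""
    return [0.5 * e for e in v]

x = [-2.0, 1.0, 4.0]
x_next = residual_step(x, half)
print(x_next, same_padding(17))
```

Note that the negative entry passes through the skip connection unchanged even though ReLU zeroes it on the layer path.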

For optimization, I use the provided Adam optimizer, except with a learning rate of 3e-5 instead of the default 1e-3. Weight decay is left at 1e-5.
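
To show what these settings mean in practice, here is a scalar sketch of a single Adam step with a learning rate of 3e-5 and a weight decay of 1e-5. It follows the standard Adam rule with weight decay folded into the gradient as an L2 penalty. The values of `beta1`, `beta2` and `eps` are the usual defaults and are assumed, since they are not mentioned above. The function name `adam_step` is made up for this example.

```python
import math

def adam_step(theta, grad, m, v, t, lr=3e-5, weight_decay=1e-5,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (theta, m, v)."""
    g = grad + weight_decay * theta          # L2-style weight decay added to the gradient
    m = beta1 * m + (1 - beta1) * g          # running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for step t (1-based)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta0 = 1.0
theta1, m1, v1 = adam_step(theta0, grad=0.2, m=0.0, v=0.0, t=1)
print(theta0 - theta1)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by almost exactly the learning rate regardless of the gradient's magnitude. This is why dropping the rate from 1e-3 to 3e-5 makes each update much smaller.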

I train with a minibatch size of \(1\) for \(2^4=16\) epochs.

Here is a graph showing the training and validation loss during training:

Here are the resulting per-class AP scores:

Class AP Score
others 0.5189
facade 0.6278
pillar 0.1057
window 0.7563
balcony 0.4417
Average 0.4901
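
As a quick sanity check, the Average row is the unweighted mean of the five per-class scores in the table. The dictionary name `AP_SCORES` is just for this example.

```python
# Per-class AP scores copied from the table above.
AP_SCORES = {
    "others": 0.5189,
    "facade": 0.6278,
    "pillar": 0.1057,
    "window": 0.7563,
    "balcony": 0.4417,
}

mean_ap = sum(AP_SCORES.values()) / len(AP_SCORES)
hardest = min(AP_SCORES, key=AP_SCORES.get)
print(f"mean AP = {mean_ap:.4f}, lowest class = {hardest}")
```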

Here, I ran the model on the familiar facade.jpg image from project 2. I applied a perspective transform beforehand to align the image like those in the test data. To visualize the resulting output, I took the argmax of the output tensor and applied the provided colors for each class.
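
The argmax-and-colorize step can be sketched as follows. This is an illustration only. The `PALETTE` colors are hypothetical, since the provided class colors are not listed in this writeup, and the class index order is assumed to follow the AP table (others, facade, pillar, window, balcony). The toy input stores scores per pixel as nested lists, whereas a real network output would typically be a tensor with the class dimension first.

```python
# Hypothetical palette: the project's provided class colors are not listed in this writeup.
PALETTE = {
    0: (0, 0, 0),        # others
    1: (0, 0, 255),      # facade
    2: (0, 255, 0),      # pillar
    3: (255, 128, 0),    # window
    4: (255, 0, 0),      # balcony
}

def argmax(scores):
    """Index of the largest value in a list of per-class scores."""
    return max(range(len(scores)), key=scores.__getitem__)

def colorize(score_map):
    """Turn an H x W grid of per-class score lists into an H x W grid of RGB tuples."""
    return [[PALETTE[argmax(pixel)] for pixel in row] for row in score_map]

# 1x2 toy output: the first pixel scores highest for class 3, the second for class 1.
toy = [[[0.1, 0.2, 0.0, 2.5, 0.3], [0.0, 1.7, 0.2, 0.1, 0.4]]]
rgb = colorize(toy)
print(rgb)
```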

As we can see, the large balcony is correctly classified, and all the windows have some amount of recognition. The six smaller balconies are harder for the model to see, likely due to the blue “cloud” effect. The model also has difficulty handling the sharp shadows next to the windows, and ends up classifying those shadows as “others” near the top of the image. There is also a spurious pillar detected near the bottom left.

For comparison, here are some results taken from the test set. Notice how the sharp shadows are absent in these photographs, which might help the model recognize windows.

Misc

HTML theme for pandoc found here: https://gist.github.com/killercup/5917178