In this section, I trained a convolutional neural network to classify the
Fashion MNIST dataset into 10 classes: top, trouser, pullover, dress,
coat, sandal, shirt, sneaker, bag, and ankle boot.
My CNN consists of 2 convolution layers with 32 channels each, each followed by a ReLU nonlinearity and a 2x2 max-pooling layer. These are followed by 2 fully connected layers, with a ReLU applied after the first. I trained the network with the Adam optimizer for 15 epochs, using a learning rate of 1e-3, a weight decay of 1e-5, and a batch size of 40.
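As a sanity check on the architecture above, we can trace the feature-map sizes through the network with the standard convolution output-size formula. The report doesn't state the kernel size or padding for this CNN, so this sketch assumes 3x3 convolutions with padding 1 (so the convolutions preserve spatial size and only the max-pools shrink it):

```python
def conv_out(size, kernel, padding=0, stride=1):
    # standard output-size formula for conv and pooling layers
    return (size + 2 * padding - kernel) // stride + 1

size = 28  # Fashion MNIST images are 28x28
for _ in range(2):  # two conv -> ReLU -> 2x2 maxpool blocks
    size = conv_out(size, kernel=3, padding=1)  # assumed 3x3 conv, pad 1: preserves size
    size = conv_out(size, kernel=2, stride=2)   # 2x2 maxpool: halves it
flat_features = 32 * size * size  # 32 channels going into the first fully connected layer
print(size, flat_features)  # prints: 7 1568
```

Under these assumptions the first fully connected layer sees a 32 x 7 x 7 = 1568-dimensional input.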
The plot below shows how the network improves over time. By the 8th epoch, the validation accuracy has largely converged, while the training accuracy continues to improve, suggesting the network is starting to overfit the training set.
The table below shows the per-class accuracy of the CNN on the validation and test sets. As you can see, the CNN does the worst job classifying shirts, probably because shirts vary more in shape and style than the other classes do. The overall accuracy on both the validation and test sets is 90%.
Class | Test Accuracy | Val Accuracy |
---|---|---|
Top | 82% | 84% |
Trouser | 98% | 99% |
Pullover | 91% | 91% |
Dress | 85% | 87% |
Coat | 84% | 82% |
Sandal | 97% | 97% |
Shirt | 72% | 73% |
Sneaker | 93% | 94% |
Bag | 97% | 97% |
Ankle Boot | 97% | 98% |
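The 90% overall figure can be recovered from the table: the Fashion MNIST test set is class-balanced, so the overall accuracy is just the mean of the per-class accuracies.

```python
# Per-class accuracies (in %) from the table above
test_acc = [82, 98, 91, 85, 84, 97, 72, 93, 97, 97]
val_acc  = [84, 99, 91, 87, 82, 97, 73, 94, 97, 98]

# With equally sized classes, overall accuracy = mean of per-class accuracies
overall_test = sum(test_acc) / len(test_acc)
overall_val = sum(val_acc) / len(val_acc)
print(round(overall_test), round(overall_val))  # prints: 90 90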
Here we have some examples of images that were classified either correctly or incorrectly by the network.
These are the 32 filters learned by the first convolution layer:
My model has 6 convolution layers with 64, 64, 128, 128, 256, and 5 channels respectively (one output channel per class), each with kernel size 3 and padding 1.
There are also two ConvTranspose layers, each with kernel size 2 and stride 2, at the end to handle upsampling.
The structure is as follows:
1. conv2d(64 channels) → ReLU
2. conv2d(64 channels) → ReLU → 2x2 MaxPool2d
3. conv2d(128 channels) → ReLU
4. conv2d(128 channels) → ReLU → 2x2 MaxPool2d
5. convTranspose2d(128 channels) → conv2d(256 channels)
6. convTranspose2d(256 channels) → conv2d(5 channels)
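The steps above can be checked with a quick shape trace. The padded 3x3 convolutions preserve spatial size, each 2x2 max-pool halves it, and each stride-2 transposed convolution with kernel 2 doubles it back (output size = (h - 1) * 2 + 2 = 2h). The report doesn't state the input resolution, so this sketch just assumes a side length divisible by 4:

```python
def trace(h):
    # blocks 1-4: padded 3x3 convs preserve size; the two 2x2 maxpools halve it
    h = h // 2   # after block 2's maxpool
    h = h // 2   # after block 4's maxpool
    # blocks 5-6: each stride-2, kernel-2 convTranspose2d doubles the size
    h = h * 2    # after the first convTranspose2d
    h = h * 2    # after the second convTranspose2d
    return h

print(trace(256))  # prints: 256
```

So the final 5-channel output has the same spatial resolution as the input, giving one per-pixel score per class.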
The mean average precision over my test set came out to 0.58. The per-class APs are shown below.

Class | AP |
---|---|
Other | 0.6628 |
Facade | 0.7748 |
Pillar | 0.1437 |
Window | 0.8231 |
Balcony | 0.5040 |
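The 0.58 figure is the unweighted mean of the five per-class APs:

```python
# Per-class average precisions from the evaluation
ap = {
    "Other": 0.6627658634014512,
    "Facade": 0.7748032505153853,
    "Pillar": 0.1436585109758356,
    "Window": 0.8230532488307044,
    "Balcony": 0.5039691926403682,
}
mean_ap = sum(ap.values()) / len(ap)
print(round(mean_ap, 2))  # prints: 0.58
```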
(Table of test-set examples shown here: input image, ground-truth segmentation, and the model's predicted segmentation.)
And here are some results on my own images:
(My own input images and their predicted segmentations shown here.)
The network does a pretty good job identifying windows and facades on the test set, as evidenced by their having the highest APs. With the lowest AP of 0.14, the pillar predictions were definitely the most variable. On my own inputs, you can see similar consistency in identifying windows and facades. However, there's some variability in Wheeler Hall's windows because some have their blinds shut while others are partially or fully open. The network doesn't seem to be familiar with these "different types of window" situations and therefore incorrectly labeled some parts as balconies. As on the test set, the pillars were the hardest to identify.
I thought this project was significantly more difficult than the other projects because I don't personally have any real background or knowledge of machine learning, so it took me a while to understand how everything really worked. Many thanks to the provided tutorials as well as the friends who helped me with all the concepts! It's pretty interesting that, given all the packages available these days, it's quite easy to write up a CNN that can perform these classifications without the person implementing it needing to understand all the underlying inner workings of the network.