The network I ended up using had the following structure:
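As a rough sketch, a small CNN of this kind can be written in PyTorch as follows. The layer sizes and channel counts here are illustrative assumptions, not necessarily the exact ones used:

```python
import torch
import torch.nn as nn

class FashionClassifier(nn.Module):
    """Small CNN for 28x28 grayscale Fashion-MNIST images (illustrative sizes)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(64 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = FashionClassifier()
logits = model(torch.zeros(1, 1, 28, 28))
print(logits.shape)  # torch.Size([1, 10])
```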
It trained pretty well. Here is the accuracy versus epoch curve.
I made a table to investigate what mistakes the classifier was making. Here it is!
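A table like this boils down to a confusion matrix with per-class accuracy on its diagonal. A minimal NumPy sketch of that computation (the toy labels below are placeholders, not the real predictions):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example with 3 classes; real use would pass the test-set labels.
y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2])
cm = confusion_matrix(y_true, y_pred, num_classes=3)

# Per-class accuracy: correct predictions over all examples of that class.
per_class_acc = cm.diagonal() / cm.sum(axis=1)
print(per_class_acc)  # [0.5 1.  1. ]
```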
Shirts and T-shirts are often mixed up with each other, and as a result have the two lowest accuracy numbers.
Here are some correct and incorrect classifications:

| Class | First Correct | Second Correct | First Wrong | Second Wrong |
| --- | --- | --- | --- | --- |
| Ankle Boot | (image) | (image) | (image) | (image) |
| Bag | (image) | (image) | (image) | (image) |
| Coat | (image) | (image) | (image) | (image) |
| Dress | (image) | (image) | (image) | (image) |
| Pullover | (image) | (image) | (image) | (image) |
| Sandals | (image) | (image) | (image) | (image) |
| Shirt | (image) | (image) | (image) | (image) |
| Sneaker | (image) | (image) | (image) | (image) |
| Trousers | (image) | (image) | (image) | (image) |
| T-shirts | (image) | (image) | (image) | (image) |
As you can see, the errors for each class tend to resemble one particular other class. For example, the misclassified sandals look like sneakers (and vice versa), as do the misclassified T-shirts and shirts.
Here are the learned filters for the first layer. Brighter colors indicate higher values.
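A sketch of how such a filter visualization can be produced: normalize each kernel to [0, 1] so brighter pixels correspond to higher weights, then tile them in a grid. The random weights below are a stand-in, since the trained weights aren't reproduced here:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Stand-in for the first conv layer's kernels, shape (out_ch, in_ch, k, k).
weights = np.random.randn(32, 1, 3, 3)

# Normalize each filter independently so brighter pixels mean higher values.
lo = weights.min(axis=(1, 2, 3), keepdims=True)
hi = weights.max(axis=(1, 2, 3), keepdims=True)
normed = (weights - lo) / (hi - lo)

fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for ax, f in zip(axes.flat, normed):
    ax.imshow(f[0], cmap="viridis", vmin=0, vmax=1)
    ax.axis("off")
fig.savefig("filters.png")
```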
Here is the architecture I ended up with. I first defined a building block that is standard in the literature:
These blocks were then arranged as follows. The channel size, which I experimented with, is denoted C.
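As a sketch, a common form of such a block is Conv-BatchNorm-ReLU, stacked with the channel size C as a parameter. Whether this matches the exact block and arrangement used is an assumption:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Conv -> BatchNorm -> ReLU, a common building block (assumed pattern)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

C = 32  # channel size, the hyperparameter experimented with
backbone = nn.Sequential(
    conv_block(3, C),
    conv_block(C, 2 * C),
    nn.MaxPool2d(2),        # halve spatial resolution
    conv_block(2 * C, 4 * C),
)

out = backbone(torch.zeros(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 128, 32, 32])
```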
I used the following hyperparameters:
Here is my average precision (AP), on the first try :)
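For reference, one common way to compute AP is to rank detections by confidence and accumulate precision at each recall step. The exact protocol (IoU threshold, interpolation scheme) used here is an assumption; this is the uninterpolated variant:

```python
import numpy as np

def average_precision(scores, is_positive, num_gt):
    """Uninterpolated AP over ranked detections.

    scores: confidence per detection; is_positive: 1 if it matched a
    ground-truth box; num_gt: total number of ground-truth boxes.
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    # Sum precision at each rank where a true positive occurs.
    return float(np.sum(precision * tp) / num_gt)

# Toy example: 3 detections, 2 ground-truth boxes.
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2)
print(ap)  # 0.8333333333333334
```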
| Image | Ground Truth | Prediction |
| --- | --- | --- |
| (image) | (image) | (image) |
| (image) | (image) | (image) |
| (image) | (image) | (image) |
As you can see, the predictions tend to struggle a bit near image borders but are otherwise very accurate.