Part 1: Classification
Training and Validation
The suggested CNN architecture was used with a learning rate and weight decay of 0.001, and after 20 epochs it reached ~88-90% accuracy on the test set. The training and validation loss are shown below, followed by per-class accuracies (one row per class). Distinguishing shirts from t-shirts was the hardest, with t-shirts classified correctly only 71% of the time; this is expected, as they are the most similar articles of clothing in the dataset.
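As a concrete illustration of the training setup above, the sketch below wires the reported learning rate and weight decay (both 0.001) into a PyTorch optimizer. The `SmallCNN` stand-in, the choice of Adam, and the 28x28 single-channel input size are assumptions, not the report's actual architecture:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Hypothetical stand-in for the suggested CNN classifier (10 classes)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 28x28 input -> two 2x pools -> 7x7 feature maps (assumed input size)
        self.classifier = nn.Linear(64 * 7 * 7, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = SmallCNN()
# LR and weight decay of 0.001 as reported; the optimizer choice is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
criterion = nn.CrossEntropyLoss()
# ...train for 20 epochs over the training loader...
```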
The layer-1 filters also exhibit signs of oriented gradients and Gaussian filters, matching the results expected as the network approaches optimal behavior.
Segmentation
CNN Architecture
Multiple architectures were attempted; many initially failed to drop the loss below 1.0, and other designs suffered from exploding gradients. The final architecture stacks the following layers:

- 3x3 Conv, 32 channels, followed by max pooling
- 3x3 Conv, 64 channels, followed by BatchNorm to control gradients
- 3x3 Conv, 128 channels
- 3x3 ConvTranspose, 64 channels, stride 2
- 3x3 Conv, 32 channels
- 3x3 ConvTranspose, 32 channels, stride 2
- 3x3 Conv, 5 channels (output)

All convolutions are padded to maintain spatial size, with a ReLU between all layers except the final output. Initially, padding was not used and UpSampling layers compensated for the size changes, but these likely caused large instabilities in gradient descent and were abandoned. Using multiple BatchNorms was also tested, but did not produce any notable change in results. Learning rates below 1e-5 are required, though the weight decay can vary between 1e-5 and 1e-6 and still produce similar results. The average test-set AP was 0.578 after 100 epochs using an 800-image training set.
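The layer list above can be sketched directly as a PyTorch `nn.Sequential`. The 3-channel RGB input, the `padding`/`output_padding` values (chosen so each transposed convolution exactly doubles the spatial resolution), and the BatchNorm-before-ReLU ordering are assumptions not fixed by the description; note that, as written, one max pool against two stride-2 transposed convolutions leaves the output at twice the input resolution:

```python
import torch.nn as nn

# Sketch of the described segmentation network; padding/output_padding
# values and the 3-channel input are assumptions.
seg_net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(),           # BatchNorm to control gradients
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
    nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
    nn.Conv2d(32, 5, 3, padding=1),          # 5 class logits, no final activation
)
```

A forward pass on a 64x64 image yields 5-channel logits at 128x128, confirming the 2x resolution change noted above.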
The facade model does reasonably well on facades, windows, and some balconies, but performs very poorly on pillars, and its boundaries are not well established enough to discern the "other" category with great fidelity.