I split the original training dataset 75%/25% into training data and validation data, respectively. I used a network with two convolutional layers, each with 32 filters, a kernel size of 5, and padding of 2 on each side. Each convolutional layer is followed by a max pooling layer with kernel size 2, stride 2, and padding 0. These are followed by two fully connected layers with 256 and 10 neurons, respectively; the first fully connected layer is followed by a ReLU. In short, the architecture of my conv net is:
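The architecture above can be sketched in PyTorch roughly as follows (a minimal sketch, assuming 1x28x28 Fashion-MNIST inputs; the ReLUs after the conv layers are my assumption, since the text only states the ReLU after the first fully connected layer explicitly):

```python
import torch
import torch.nn as nn

# Sketch of the conv net described above, assuming 1x28x28 inputs.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, padding=2),  # 32 filters, 5x5, pad 2
    nn.ReLU(),                                   # assumed activation
    nn.MaxPool2d(kernel_size=2, stride=2),       # 28x28 -> 14x14
    nn.Conv2d(32, 32, kernel_size=5, padding=2),
    nn.ReLU(),                                   # assumed activation
    nn.MaxPool2d(kernel_size=2, stride=2),       # 14x14 -> 7x7
    nn.Flatten(),                                # 32 * 7 * 7 = 1568 features
    nn.Linear(32 * 7 * 7, 256),
    nn.ReLU(),                                   # ReLU after first FC layer
    nn.Linear(256, 10),                          # 10 Fashion-MNIST classes
)

out = model(torch.zeros(4, 1, 28, 28))
print(out.shape)  # torch.Size([4, 10])
```

With padding 2 on a 5x5 kernel, each conv preserves spatial size, so only the two pooling layers shrink the 28x28 input, giving the 7x7x32 feature map that feeds the first fully connected layer.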
I started off with a learning rate of 0.01, and after the performance plateaued, I decreased the learning rate to further boost the performance. I did not use weight decay. I stopped training when the accuracy on the validation dataset plateaued while the training accuracy was still increasing, because I did not want my model to overfit the training data. Here is the graph of training and validation accuracy across iterations.
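One way to automate this kind of schedule is PyTorch's `ReduceLROnPlateau` combined with a simple early-stopping counter. This is a sketch of the idea rather than my actual training loop (I decayed the learning rate by hand); the model, the patience values, and the fake validation-accuracy curve below are placeholders:

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer; initial LR 0.01 as in the writeup.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Decay LR by 10x after 3 epochs without validation improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=3)

# Hypothetical validation-accuracy curve standing in for real evaluation.
fake_curve = [0.5, 0.7, 0.8, 0.84, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85]

best_acc, patience, bad_epochs = 0.0, 4, 0
for epoch, val_acc in enumerate(fake_curve):
    # ... training step would go here ...
    scheduler.step(val_acc)          # decays LR when val accuracy plateaus
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:       # early stopping to avoid overfitting
        break

print(best_acc, optimizer.param_groups[0]["lr"])
```

The scheduler handles the "decrease LR on plateau" part, and the counter handles the "stop before overfitting" part, mirroring the two decisions described above.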
Here I list the per-class accuracy on the validation and test datasets. The class "Shirt" is the hardest to get right, and I suspect this is because shirts are visually generic and are easily misclassified as T-shirts/tops or pullovers.
Class | Validation Accuracy | Test Accuracy |
---|---|---|
T-shirt/top | 0.8053 | 0.809 |
Trouser | 0.9679 | 0.972 |
Pullover | 0.8087 | 0.782 |
Dress | 0.8851 | 0.888 |
Coat | 0.8054 | 0.773 |
Sandal | 0.8054 | 0.958 |
Shirt | 0.6830 | 0.659 |
Sneaker | 0.9354 | 0.947 |
Bag | 0.9744 | 0.968 |
Ankle Boot | 0.9513 | 0.945 |
Here are examples of misclassified and correctly classified pictures for each class:
Class | Wrong #1 | Wrong #2 | Correct #1 | Correct #2 |
---|---|---|---|---|
T-shirt/top | ||||
Trouser | ||||
Pullover | ||||
Dress | ||||
Coat | ||||
Sandal | ||||
Shirt | ||||
Sneaker | ||||
Bag | ||||
Ankle Boot | ||||
Here are the learned filters of the first convolutional layer. As Professor Efros mentioned in class, filters in layers beyond the first are hard to interpret visually, so I omit them on this webpage.
My original intention was to use a UNet for semantic segmentation, with the architecture set up as in the following picture:
However, this model was relatively unstable; my guess is that it has too many layers for the amount of training data available, and such depth can also lead to vanishing/exploding gradients. Thus, after some consideration, I switched to the following architecture:
I first trained for 50 epochs with a learning rate of 1e-2 and weight decay of 1e-4. Then, after seeing the loss plateau, I changed the learning rate to 1e-3 and trained for another 100 epochs (epochs 50 to 150). After that, I further decreased the learning rate to 1e-4, increased the weight decay to 5e-4, and trained for 50 more epochs. Below is the graph of training and validation loss across iterations.
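The staged schedule above can be implemented by editing the optimizer's parameter groups at the stage boundaries. This is a sketch with a placeholder model; the stage boundaries and values follow the text (lr 1e-2 / wd 1e-4 for epochs 0-49, lr 1e-3 for epochs 50-149, lr 1e-4 / wd 5e-4 afterwards):

```python
import torch
import torch.nn as nn

# Placeholder model; the real one is the segmentation net described above.
model = nn.Conv2d(3, 8, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

def apply_stage(epoch):
    """Set lr/weight_decay for the training stage containing `epoch`."""
    for g in optimizer.param_groups:
        if epoch >= 150:
            g["lr"], g["weight_decay"] = 1e-4, 5e-4   # final stage
        elif epoch >= 50:
            g["lr"] = 1e-3                            # middle stage

# Called once per epoch at the top of the training loop, e.g.:
apply_stage(150)
print(optimizer.param_groups[0]["lr"])  # 0.0001
```

Mutating `param_groups` directly is the standard way to change hyperparameters mid-training in PyTorch when the schedule is decided by hand rather than by a built-in scheduler.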
The five AP scores for the classes "others", "facade", "pillar", "window", and "balcony" are 0.6441, 0.5844, 0.0347, 0.7403, and 0.2587, respectively, for an average of 0.4524 across all 5 classes. Below I perform semantic segmentation on two images I got from the Internet. However, the result for this model is highly suboptimal, as shown in the graph below.
After consulting my good friend Bob Cao, I decided to use a ResNet-style architecture and replaced the cross-entropy loss with dice loss, which handles unbalanced class distributions better. The architecture for my model is presented as follows:
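For reference, here is one common formulation of a soft multi-class Dice loss; the exact variant I used may differ in its smoothing term and class weighting, so treat this as a sketch:

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss averaged over classes.

    logits: (N, C, H, W) raw scores; target: (N, H, W) integer labels.
    """
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)
    # One-hot encode targets to (N, C, H, W) to match the probabilities.
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                       # sum over batch and pixels
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2 * intersection + eps) / (cardinality + eps)
    return 1 - dice.mean()                 # average over the C classes

# Toy example with 5 classes, matching the facade-segmentation setup.
logits = torch.randn(2, 5, 8, 8)
target = torch.randint(0, 5, (2, 8, 8))
loss = dice_loss(logits, target)
```

Because each class contributes equally to the mean regardless of how many pixels it occupies, a rare class like "pillar" is not drowned out by "facade" the way it is under plain cross-entropy.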
Each ResBlock has four 3x3 convolutions, and the ResBlocks have widths 32, 32, 32, 64, and 64, respectively. I trained the model with a 1e-3 learning rate and 1e-4 weight decay for the first 50 epochs, after which I decreased the learning rate to 1e-4. The AP scores are 0.7051, 0.7683, 0.0328, 0.8202, and 0.6240, with an average of 0.59. I then passed some pictures from the Internet into the network and got the following results.
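A ResBlock as described above might look like the following sketch; the BatchNorm placement and the 1x1 projection used when the width changes are my assumptions, since the writeup does not specify them:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Four 3x3 convolutions with a residual skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(4):                       # four 3x3 convs per block
            layers += [nn.Conv2d(ch, out_ch, 3, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU()]
            ch = out_ch
        self.body = nn.Sequential(*layers[:-1])  # drop ReLU before the add
        # 1x1 projection on the skip path when widths differ (assumed).
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

# Stack blocks with the widths from the text: 32, 32, 32, 64, 64.
widths, blocks, in_ch = [32, 32, 32, 64, 64], [], 3
for w in widths:
    blocks.append(ResBlock(in_ch, w))
    in_ch = w
net = nn.Sequential(*blocks)

out = net(torch.zeros(1, 3, 16, 16))
print(out.shape)  # torch.Size([1, 64, 16, 16])
```

The 3x3 convolutions with padding 1 preserve spatial resolution, so the stack only changes the channel width, ending at 64 channels as the widths list specifies.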
We can see that the net I trained correctly located most of the windows, facade, and balconies. However, it does not recognize the pillars at all, despite their presence in the second picture.