Project 4: Classification and Segmentation

Jie Qiu

Part 1: Image Classification

I split the original training dataset 75%/25% into training and validation data. I used a network with two convolutional layers, each with 32 filters, a kernel size of 5, and padding of 2 on each side. Each convolutional layer is followed by a max pooling layer with a kernel size of 2, stride 2, and no padding. These are followed by two fully connected layers with 256 and 10 neurons, respectively, with a ReLU after the first. In short, the architecture of my conv net is:

conv -> relu -> max pool -> conv -> relu -> max pool -> fc -> relu -> fc -> softmax
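In PyTorch, this architecture can be sketched roughly as follows. The 28x28 single-channel Fashion-MNIST input size is an assumption, and the softmax is folded into the cross-entropy loss as is conventional:

```python
import torch
import torch.nn as nn

# Sketch of the described architecture, assuming 28x28 grayscale inputs
# (1 channel); the layer sizes follow the text above.
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> 14x14
            nn.Conv2d(32, 32, kernel_size=5, padding=2),  # -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 7 * 7, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),  # softmax is applied by the loss
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# sanity check on a dummy batch
out = ConvNet()(torch.zeros(4, 1, 28, 28))
```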

I started off with a learning rate of 0.01, and after the performance plateaued, I decreased the learning rate to further boost the performance. I did not use weight decay. I stopped training when the accuracy on the validation set plateaued while the training accuracy was still increasing, because I did not want my model to overfit the training data. Here is the graph of training and validation accuracy across iterations.
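The schedule above can be sketched as a manual learning-rate drop between training phases. The optimizer and the size of the drop are not stated in the text, so plain SGD and a 10x reduction are assumptions here:

```python
import torch

# Sketch of the schedule described above: train, then manually drop the
# learning rate once validation accuracy plateaus. SGD and the 10x
# factor are assumptions; the text does not specify them.
model = torch.nn.Linear(784, 10)  # stand-in for the conv net
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def drop_lr(optimizer, factor=0.1):
    # scale the learning rate of every parameter group in place
    for group in optimizer.param_groups:
        group["lr"] *= factor

drop_lr(optimizer)
new_lr = optimizer.param_groups[0]["lr"]
```

PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` automates the same idea when the plateau check should happen every epoch.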

val_train_acc

Here I list the per-class accuracy on the validation and test datasets. The class "shirt" is the hardest to classify correctly, and I suspect this is because shirts are very generic and can easily be misclassified as tops or pullovers.

| Class | Validation Accuracy | Test Accuracy |
|---|---|---|
| T-shirt/top | 0.8053 | 0.809 |
| Trouser | 0.9679 | 0.972 |
| Pullover | 0.8087 | 0.782 |
| Dress | 0.8851 | 0.888 |
| Coat | 0.8054 | 0.773 |
| Sandal | 0.8054 | 0.958 |
| Shirt | 0.6830 | 0.659 |
| Sneaker | 0.9354 | 0.947 |
| Bag | 0.9744 | 0.968 |
| Ankle Boot | 0.9513 | 0.945 |
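The per-class accuracies above can be computed from predicted and ground-truth labels roughly as follows. This is a sketch with toy arrays; the function name is mine:

```python
import numpy as np

# Per-class accuracy: for each class c, the fraction of examples with
# true label c that were predicted as c.
def per_class_accuracy(preds, labels, num_classes):
    accs = []
    for c in range(num_classes):
        mask = labels == c
        accs.append((preds[mask] == c).mean() if mask.any() else float("nan"))
    return accs

# toy example with two classes
labels = np.array([0, 0, 1, 1, 1])
preds = np.array([0, 1, 1, 1, 0])
accs = per_class_accuracy(preds, labels, num_classes=2)
```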

Here are the misclassified and correctly classified images for each class:

| Class | Wrong | Wrong | Correct | Correct |
|---|---|---|---|---|
| T-shirt/top | 0_3_2 | 0_6_1 | 0_0_1 | 0_0_2 |
| Trouser | 1_3_1 | 1_3_2 | 1_1_1 | 1_1_2 |
| Pullover | 2_4_1 | 2_6_2 | 2_2_1 | 2_2_2 |
| Dress | 3_1_1 | 3_6_2 | 3_3_1 | 3_3_2 |
| Coat | 4_2_1 | 4_6_2 | 4_4_1 | 4_4_2 |
| Sandal | 5_7_1 | 5_9_2 | 5_5_1 | 2_2_2 |
| Shirt | 6_4_1 | 2_6_2 | 6_6_1 | 6_6_2 |
| Sneaker | 7_9_1 | 7_5_2 | 7_7_1 | 7_7_2 |
| Bag | 8_6_1 | 8_9_2 | 8_8_1 | 8_8_2 |
| Ankle Boot | 9_5_1 | 9_7_2 | 9_9_1 | 9_9_2 |

Here are the learned filters of the first convolutional layer. As Professor Efros mentioned in class, the filters of layers beyond the first are not really interpretable, so I omit those on this webpage.

filters
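A figure like the one above can be produced by pulling the first-layer weights out of the model and normalizing them for display. This is a sketch; `conv1` here is a fresh layer standing in for the trained one:

```python
import torch
import torch.nn as nn

# Extract first-layer conv filters for visualization. A freshly
# initialized layer stands in for the trained model's first layer.
conv1 = nn.Conv2d(1, 32, kernel_size=5, padding=2)
w = conv1.weight.detach().clone()        # shape (32, 1, 5, 5)
w = (w - w.min()) / (w.max() - w.min())  # normalize to [0, 1] for display
filters = w.squeeze(1).numpy()           # 32 grayscale 5x5 filters
```

Each 5x5 array can then be shown in a grid with, e.g., matplotlib's `imshow`.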

Part 2: Semantic Segmentation

My original intention was to use UNet for semantic segmentation, with the architecture set up as shown in the following picture:

unet

However, this model was relatively unstable, and my guess is that there are too many layers and not enough training data. Moreover, the large number of layers can potentially also lead to vanishing/exploding gradients. Thus, after some consideration, I switched to the following architecture:

Conv (64) -> ReLU -> Max Pool -> Conv (256) -> ReLU -> Max Pool -> Conv (1024) -> ReLU -> Max Pool -> Conv (4096) -> ReLU -> Max Pool -> (Deconv -> ReLU) * 4
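This encoder-decoder can be sketched in PyTorch as follows. The 3-channel input, 5 output classes, 3x3 conv kernels, and 2x2 stride-2 transposed convolutions for the "deconv" steps are all assumptions; the text gives only the channel widths:

```python
import torch
import torch.nn as nn

# Sketch of the encoder-decoder described above. Kernel sizes and the
# 3-channel input / 5-class output are assumptions.
def down(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

def up(cin, cout):
    # "deconv": a 2x2 stride-2 transposed conv that doubles resolution
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, 2, stride=2), nn.ReLU())

model = nn.Sequential(
    down(3, 64), down(64, 256), down(256, 1024), down(1024, 4096),
    up(4096, 1024), up(1024, 256), up(256, 64), up(64, 5),
)

# four pools halve the resolution 4x; four deconvs restore it
out = model(torch.zeros(1, 3, 32, 32))
```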

I first trained 50 epochs with a learning rate of 1e-2 and weight decay of 1e-4. Then, after seeing it plateauing, I changed the learning rate to 1e-3 and trained epochs 50 to 150. After that, I further decreased the learning rate to 1e-4, increased the weight decay to 5e-4, and trained 50 more epochs. Below is the graph of training and validation loss across iterations.

part2_loss

Here are the five AP scores for the classes "others", "facade", "pillar", "window", and "balcony": 0.6441, 0.5844, 0.0347, 0.7403, and 0.2587, with an average of 0.4524 across all 5 classes. Below I perform semantic segmentation on two images I found on the Internet. However, the result from this model is highly suboptimal, as shown below.

ex1_output

After consulting my good friend Bob Cao, I decided to use a ResNet-style architecture, and replaced the cross-entropy loss with dice loss, which handles unbalanced class distributions better. The architecture of my model is as follows:

conv (32 filters) -> ResBlock * 5 -> conv (5 filters) -> sigmoid

Each ResBlock has 4 3x3 convolutional layers, and the five ResBlocks have widths 32, 32, 32, 64, and 64, respectively. I trained the model with a learning rate of 1e-3 and weight decay of 1e-4 for the first 50 epochs, after which I decreased the learning rate to 1e-4. The AP scores are 0.7051, 0.7683, 0.0328, 0.8202, and 0.6240, with an average of 0.59. Here I passed some pictures from the Internet through the network and got the following results.
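The ResBlock and the dice loss can be sketched as follows. The 1x1 projection for width changes (32 to 64) and the placement of ReLUs are assumptions, as the text specifies neither; the dice loss assumes per-class probability maps from the final sigmoid and one-hot ground-truth masks:

```python
import torch
import torch.nn as nn

# Sketch of a ResBlock as described: four 3x3 conv layers plus a skip
# connection. The 1x1 projection when channel width changes is an
# assumption.
class ResBlock(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        layers, c = [], cin
        for _ in range(4):
            layers += [nn.Conv2d(c, cout, 3, padding=1), nn.ReLU()]
            c = cout
        self.body = nn.Sequential(*layers)
        self.skip = nn.Conv2d(cin, cout, 1) if cin != cout else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.skip(x)

def dice_loss(probs, target, eps=1e-6):
    """Soft dice loss over (N, C, H, W) probabilities and one-hot masks."""
    dims = (0, 2, 3)                      # reduce over batch and space
    inter = (probs * target).sum(dims)
    union = probs.sum(dims) + target.sum(dims)
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

# sanity checks: width change works; a perfect prediction gives ~0 loss
out = ResBlock(32, 64)(torch.zeros(1, 32, 16, 16))
t = torch.zeros(1, 2, 4, 4); t[:, 0] = 1.0
perfect = dice_loss(t.clone(), t)
```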

ex2
ex2_output
ex3
ex3_output
ex5
ex5_output

We can see that the trained network locates most of the windows, facade, and balconies correctly. However, it does not recognize pillars at all, despite their presence in the second picture.