Project 4: Classification and Segmentation

Wenlong Huang | cs194-26-agf


Part 1: Image Classification

Implementation and Training Details

We use a network with two convolution layers, each with 32 channels and a kernel size of 3, and each followed by a ReLU nonlinearity and a max-pooling layer of size 2. These are followed by three fully connected layers of size 120, 84, and 10 respectively. We use the cross-entropy loss and train the network with Adam at a learning rate of 0.001. 20% of the training set is held out as a validation set and the remaining 80% is used for training.
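To make the layer sizes concrete, the following sketch traces the feature-map size and parameter count through the architecture described above. It is only an illustration under assumptions the report does not state: a 28x28 single-channel input (the Fashion-MNIST image size), stride-1 convolutions without padding, and non-overlapping 2x2 pooling. The helper names are hypothetical.

```python
# Sketch: feature-map sizes and parameter counts for the classifier.
# Assumptions (not stated in the report): 28x28 single-channel input,
# stride-1 convolutions without padding, non-overlapping 2x2 max-pooling.

def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a square 2D convolution."""
    return in_ch * out_ch * kernel * kernel + out_ch

def fc_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def describe_classifier(input_size=28, in_ch=1, conv_ch=32, kernel=3,
                        pool=2, fc_sizes=(120, 84, 10)):
    size, channels, total = input_size, in_ch, 0
    for _ in range(2):  # two conv -> ReLU -> max-pool stages
        total += conv_params(channels, conv_ch, kernel)
        size = conv_out(size, kernel)             # convolution
        size = conv_out(size, pool, stride=pool)  # max-pooling
        channels = conv_ch
    flat = channels * size * size  # input width of the first FC layer
    features = flat
    for out_features in fc_sizes:
        total += fc_params(features, out_features)
        features = out_features
    return {"final_map": size, "flattened": flat, "parameters": total}

summary = describe_classifier()
print(summary)
```

Under these assumptions the first fully connected layer receives 32 x 5 x 5 = 800 features. With "same" padding the numbers would differ, so treat the output as one possible reading of the description.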

Learning Curve

We plot the training and validation accuracy during training. Note that as training progresses, the network slowly overfits to the training set: the accuracy on the validation set stops increasing.
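One common way to act on a plateau like this is to pick the epoch with the best validation accuracy and stop once it has not improved for a few epochs. The report does not say that early stopping was used; the sketch below only illustrates the idea. The function name, the `patience` value, and the accuracy values are made up for the example.

```python
def early_stopping_epoch(val_accuracy, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch accuracies.

    Training "stops" at the first epoch that is `patience` epochs past the
    best one without improvement, or at the last epoch if that never happens.
    """
    best_value = float("-inf")
    best_index = -1
    for epoch, value in enumerate(val_accuracy):
        if value > best_value:
            best_value, best_index = value, epoch
        elif epoch - best_index >= patience:
            return best_index, epoch
    return best_index, len(val_accuracy) - 1

# Made-up validation accuracies for illustration only (not the report's data).
history = [0.80, 0.85, 0.88, 0.89, 0.89, 0.888, 0.889, 0.887]
best, stopped = early_stopping_epoch(history)
print(best, stopped)
```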

Analysis of Different Classes

Accuracy by Classes

Class Name Accuracy (%)
T-shirt/top 76.9
Trouser 97.1
Pullover 80.7
Dress 93.2
Coat 87.7
Sandal 98.1
Shirt 78.5
Sneaker 95.7
Bag 97.6
Ankle boot 96.5
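A per-class accuracy table like the one above can be computed from the true labels and the predicted labels. The sketch below assumes integer labels in the standard Fashion-MNIST order (0 = T-shirt/top, ..., 9 = Ankle boot). The function name and the toy labels are illustrative and are not the report's data.

```python
from collections import Counter

CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

def per_class_accuracy(labels, predictions, num_classes):
    """Accuracy in percent for each class; NaN for classes with no samples."""
    total = Counter(labels)
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)
    return [100.0 * correct[c] / total[c] if total[c] else float("nan")
            for c in range(num_classes)]

# Toy labels and predictions for illustration only.
labels = [0, 0, 0, 0, 1, 1, 6, 6]
predictions = [0, 0, 0, 6, 1, 1, 6, 0]
acc = per_class_accuracy(labels, predictions, len(CLASS_NAMES))
for name, value in zip(CLASS_NAMES, acc):
    print(f"{name}: {value:.1f}")
```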

As shown above, T-shirt/top is the hardest class to classify (76.9%), followed closely by Shirt (78.5%), probably due to the large variation within the class. For each class we also show 2 images that the network classifies correctly and 2 images that it classifies incorrectly.

Class Name Classified Correctly Classified Incorrectly
T-shirt/top    
Trouser    
Pullover    
Dress    
Coat    
Sandal    
Shirt    
Sneaker    
Bag    
Ankle boot    

Below we visualize the learned filters as heat maps. Some of the filters appear to correspond to interesting structures in the training images.
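Before a filter can be drawn as a heat map, its weights are usually rescaled to a fixed range. The sketch below shows per-filter min-max normalization to [0, 1] on a single 3x3 kernel. It is an assumption about the preprocessing, not the report's actual plotting code, and the example kernel is made up.

```python
def normalize_filter(kernel):
    """Min-max scale a 2D kernel (list of rows) to the range [0, 1]."""
    flat = [w for row in kernel for w in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant kernel has no contrast; map it to mid-gray.
        return [[0.5 for _ in row] for row in kernel]
    return [[(w - lo) / span for w in row] for row in kernel]

# Made-up edge-like kernel for illustration only.
kernel = [[-0.5, 0.0, 0.5],
          [-1.0, 0.0, 1.0],
          [-0.5, 0.0, 0.5]]
heat = normalize_filter(kernel)
for row in heat:
    print(row)
```

The normalized values can then be passed to any plotting routine that maps [0, 1] to a color scale.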

Layer Filter Heatmap (32 each)
Conv1

Part 2: Semantic Segmentation

Implementation and Training Details

We use a network with seven convolution layers, with 64, 128, 256, 512, 4096, 4096, and 5 channels respectively, together with ReLU nonlinearities and two max-pooling layers of size 2. At the very end we have one transposed convolution layer (often, though inaccurately, called deconvolution) for upsampling. We also use two dropout layers before the [...]. We use the cross-entropy loss and train the network with Adam at a learning rate of 1e-3 and a weight decay of 1e-5.
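Because the two max-pooling layers reduce the resolution by a factor of 4, the transposed convolution has to upsample by the same factor for the output to match the input size. The sketch below checks this with the standard output-size formula for a transposed convolution (dilation 1). The 256x256 input size, the size-preserving convolutions, and the kernel/stride/padding choices are assumptions for illustration; the report does not give these values.

```python
def pooled_size(size, pool=2, num_pools=2):
    """Spatial size after `num_pools` non-overlapping pooling layers."""
    for _ in range(num_pools):
        size = size // pool
    return size

def conv_transpose_out(size, kernel, stride, padding=0, output_padding=0):
    """Output size of a transposed convolution with dilation 1."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# Assumed input size; convolutions are assumed to preserve spatial size.
input_size = 256
low_res = pooled_size(input_size)

# Two hypothetical settings that both upsample by a factor of 4.
out_a = conv_transpose_out(low_res, kernel=4, stride=4)
out_b = conv_transpose_out(low_res, kernel=8, stride=4, padding=2)
print(low_res, out_a, out_b)
```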

Learning Curve

We plot the training and validation loss during training. Note that as training progresses, the network slowly overfits to the training set: the loss on the validation set stops decreasing.

Analysis

We are able to achieve an average precision (AP) of 0.64 on the test set.
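
For reference, one way to compute such a score is to treat each class as a binary per-pixel problem, compute a non-interpolated average precision from the predicted scores, and average over classes. The report does not state how its AP was computed, so the sketch below is only one plausible definition. The function names and the toy scores are made up, and ties between scores are not handled specially.

```python
def average_precision(labels, scores):
    """Non-interpolated AP: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    if positives == 0:
        return float("nan")
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / positives

def mean_average_precision(per_class_labels, per_class_scores):
    """Average the per-class AP values over all classes."""
    aps = [average_precision(l, s)
           for l, s in zip(per_class_labels, per_class_scores)]
    return sum(aps) / len(aps)

# Toy binary labels and scores for two classes (illustration only).
labels_by_class = [[1, 0, 1, 0], [0, 1, 1, 0]]
scores_by_class = [[0.9, 0.8, 0.7, 0.1], [0.2, 0.9, 0.8, 0.1]]
ap_first = average_precision(labels_by_class[0], scores_by_class[0])
mean_ap = mean_average_precision(labels_by_class, scores_by_class)
print(ap_first, mean_ap)
```
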
Below we also show a photo of a building that I took in Macau and the trained network's segmentation of it. Note that the photo was taken at night and has a different structure from the images in the training set, so it shows how well the network generalizes to inputs outside the training data distribution. The network does relatively well at recognizing the windows (orange) but poorly at recognizing the pillars (green), which are prevalent in the test image.

Image Segmentation