Project 4 - Classification and Segmentation

Part 1: Image Classification


Part 2: Semantic Segmentation


CNN Architecture & Training Details


The CNN architecture has six convolution layers, having 1 * 48, 2 * 48, 4 * 48, 8 * 48, 8 * 48, 5 channels respectively and each followed by the ReLU activation except at the very end. The padding size is 1 and kernel size is 3 * 3. For the training, I use the cross-entropy loss and Adam optimizer with a learning rate of 1e-3 and weight decay 1e-5. The network is trained for 50 epochs, and 20% of training data is used as validation set.

Loss Curve


Below is the training loss and validation loss curve against the number of epochs during training for 50 epochs. Note that as training progresses, the network slowly overfits as the validation loss no longer decreases.

Average Precision


I achieve an average precision (AP) of 0.533 on the test set, with precision of 0.657, 0.751, 0.100, 0.781, 0.375 on each class. The pillar channel and balcony channel have relatively low precision.


Example


Below is an example from the test set.



CNN Architecture & Training Details


The CNN architecture has two convolution layers having 32 channels and kernel size of 5 respectively and each followed by the ReLU activation and a maxpool layer of size 2. At the very end we have two fully-connected layers of size 120 and 10 respectively. For the training, I use the cross-entropy loss and Adam optimizer with a learning rate of 0.001. The network is trained for 30 epochs, and 20% of training data is used as validation set.


Accuracy Curve


Below is the training accuracy and validation accuracy curve against the number of epochs during training for 30 epochs. Note that as training progresses, the network slowly overfits as the validation accuracy no longer increases.

Analysis of Different Classes


Below is the per class accuracy of the classifier on the validation and test set. Shirt class is the hardest to get, probably because it can be mixed up with T-shirt, Pullover or Coat, which also has relatively low accuracy.


Here is the image of Blackwell Hall and the trained network's segmentation result from it. From the segmentation, we can see that this network performs well on window prediction, which is the orange area. However, it does not predict well on pillar(green) and balcony(red) in the image, mainly due to the difference style between Blackwell Hall and the training sets.

Image

Label

Segmentation

Image

Segmentation

Class Name

Validation Acc (%)

Test Acc (%)

T-Shirt

84

85

Trouser

97

96

Pullover

77

79

Dress

88

87

Coat

88

88

Sandal

97

97

Shirt

67

65

Sneaker

95

95

Bag

97

96

Ankle Boot

96

95

Below in the left two columns are correctly classified images from each class, in the right two columns are misclassified images from each class, the incorrect prediction is written above the image.


Class Name

T-Shirt

Trouser

Pullover

Dress

Coat

Sandal

Shirt

Sneaker

Bag

Ankle Boot

Correctly Classified

Misclassified

Visualization of Learned Filters


Below are the learned filters by the first convolution layer. Some of the filters corresponds to some structure of images in the training set.