Project 4 - Classification and Segmentation

Part 1: Image Classification

Part 2: Semantic Segmentation

CNN Architecture & Training Details

The CNN has six convolution layers with 48, 96, 192, 384, 384, and 5 output channels respectively (that is, 1 × 48, 2 × 48, 4 × 48, 8 × 48, 8 × 48, and one channel per class). Every layer uses a 3 × 3 kernel with padding 1, and each layer except the last is followed by a ReLU activation. For training, I use the cross-entropy loss and the Adam optimizer with a learning rate of 1e-3 and a weight decay of 1e-5. The network is trained for 50 epochs, with 20% of the training data held out as a validation set.
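The layer sizes can be checked with a little arithmetic. The following is a minimal sketch rather than the project's code: it assumes a 3-channel RGB input, stride 1, and a bias term in every convolution, none of which the text states, and the helper names are hypothetical.

```python
# Size arithmetic for the six-layer segmentation network.
# Assumptions (not in the report): 3-channel input, stride 1, bias in each conv.
BASE = 48
NUM_CLASSES = 5
KERNEL, PADDING, STRIDE = 3, 1, 1

in_channels = 3  # assumed RGB input
out_channels = [1 * BASE, 2 * BASE, 4 * BASE, 8 * BASE, 8 * BASE, NUM_CLASSES]


def conv_out_size(n, kernel=KERNEL, padding=PADDING, stride=STRIDE):
    """Spatial size along one dimension after one convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_params(c_in, c_out, kernel=KERNEL):
    """Number of weights plus biases in one 2D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


layer_params = []
c_in = in_channels
for c_out in out_channels:
    layer_params.append(conv_params(c_in, c_out))
    c_in = c_out

total_params = sum(layer_params)
print(out_channels)
print(total_params)
print(conv_out_size(256))  # a 3 x 3 kernel with padding 1 keeps the size
```

Under these assumptions every layer preserves the spatial size, so the final 5-channel output holds one score per class at every input pixel, which is the shape a per-pixel cross-entropy loss expects.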

Loss Curve

Below are the training and validation loss curves over the 50 training epochs. As training progresses, the network gradually overfits: the validation loss stops decreasing.
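One common response to a validation loss that has stopped improving is early stopping. The sketch below is an illustration only, not something the report says was used: the function name `best_epoch`, the `patience` parameter, and the loss values are all hypothetical.

```python
def best_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. `best_epoch` is the epoch whose
    checkpoint would be kept.
    """
    best, best_idx, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_idx
    # Never ran out of patience: training used every epoch.
    return len(val_losses) - 1, best_idx


# Made-up validation losses for illustration (not the report's data).
losses = [1.20, 0.95, 0.80, 0.74, 0.75, 0.76, 0.74, 0.78, 0.79]
stop, best = best_epoch(losses, patience=3)
print(stop, best)
```

Keeping the checkpoint from the best epoch, rather than the last one, avoids evaluating a model that has already started to overfit.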

Average Precision

I achieve an average precision (AP) of 0.533 on the test set, averaged over the five classes, with per-class precisions of 0.657, 0.751, 0.100, 0.781, and 0.375. The pillar and balcony classes have relatively low precision.
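As a quick consistency check, the overall figure is the unweighted mean of the five per-class values. The snippet below only redoes that arithmetic. The text does not give the class order, so the indices are not mapped to class names here.

```python
from statistics import mean

# Per-class precisions as reported, in the order given in the text.
per_class_ap = [0.657, 0.751, 0.100, 0.781, 0.375]

mean_ap = mean(per_class_ap)
print(round(mean_ap, 3))

# Positions of the two weakest classes in the reported order.
weakest = sorted(range(len(per_class_ap)), key=per_class_ap.__getitem__)[:2]
print(weakest)
```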

Example

Below is an example from the test set.

CNN Architecture & Training Details

The CNN has two convolution layers, each with 32 channels and a kernel size of 5, and each followed by a ReLU activation and a max-pooling layer of size 2. These are followed by two fully connected layers of size 120 and 10 respectively. For training, I use the cross-entropy loss and the Adam optimizer with a learning rate of 0.001. The network is trained for 30 epochs, with 20% of the training data held out as a validation set.
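The feature-map sizes that connect the convolution layers to the fully connected layers can be worked out by hand. This is a sketch under assumptions the text does not state: a 28 × 28 single-channel input, stride-1 convolutions with no padding, pooling with stride equal to its size, and bias terms throughout.

```python
# Shape and parameter arithmetic for the classification network.
# Assumptions (not in the report): 28 x 28 grayscale input, no padding,
# stride-1 convolutions, pooling stride equal to the pool size, biases on.
def conv_out(n, kernel, padding=0, stride=1):
    """Spatial size along one dimension after a convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def pool_out(n, pool):
    """Spatial size along one dimension after non-overlapping pooling."""
    return n // pool


side, channels, params = 28, 1, 0
sides = [side]
for _ in range(2):  # two conv + ReLU + max-pool stages
    params += 5 * 5 * channels * 32 + 32
    channels = 32
    side = pool_out(conv_out(side, kernel=5), pool=2)
    sides.append(side)

flat = channels * side * side  # input width of the first fully connected layer
for fc_in, fc_out in [(flat, 120), (120, 10)]:
    params += fc_in * fc_out + fc_out

print(sides, flat, params)
```

Under these assumptions the first fully connected layer holds most of the parameters, which is typical for small networks of this shape.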

Accuracy Curve

Below are the training and validation accuracy curves over the 30 training epochs. As training progresses, the network gradually overfits: the validation accuracy stops increasing.

Analysis of Different Classes

Below is the per-class accuracy of the classifier on the validation and test sets. The Shirt class is the hardest to classify, probably because it is easily confused with T-Shirt, Pullover, or Coat, which also have relatively low accuracy.
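Per-class accuracy and the confusions behind it can be computed from the true and predicted labels alone. The following is a minimal sketch: the function names are hypothetical, the class indices are assumed to follow the order of the accuracy table, and the labels are toy data rather than the report's predictions.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred, class_names):
    """Accuracy in percent for every class that appears in y_true."""
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {class_names[c]: 100.0 * correct[c] / total[c] for c in sorted(total)}


def confusions(y_true, y_pred, class_names, target):
    """Count which classes samples of `target` are misclassified as."""
    idx = class_names.index(target)
    return Counter(
        class_names[p] for t, p in zip(y_true, y_pred) if t == idx and p != idx
    )


# Class order assumed to match the accuracy table.
names = ["T-Shirt", "Trouser", "Pullover", "Dress", "Coat",
         "Sandal", "Shirt", "Sneaker", "Bag", "Ankle Boot"]

# Toy labels for illustration only.
y_true = [6, 6, 6, 6, 0, 0, 1, 1]
y_pred = [6, 0, 2, 6, 0, 6, 1, 1]

acc = per_class_accuracy(y_true, y_pred, names)
shirt_errors = confusions(y_true, y_pred, names, "Shirt")
print(acc)
print(shirt_errors)
```

Running `confusions` on the real validation predictions would show whether Shirt errors actually land on T-Shirt, Pullover, and Coat, as the explanation above supposes.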

Here is an image of Blackwell Hall and the trained network's segmentation of it. The network performs well on windows (the orange area). However, it predicts pillars (green) and balconies (red) poorly, mainly because the style of Blackwell Hall differs from that of the training set.

[Figure panels: Image, Label, Segmentation; Image, Segmentation]

Class Name    Validation Acc (%)    Test Acc (%)
T-Shirt       84                    85
Trouser       97                    96
Pullover      77                    79
Dress         88                    87
Coat          88
Sandal        97
Shirt         67                    65
Sneaker       95
Bag           97                    96
Ankle Boot    96                    95

Below, the left two columns show correctly classified images from each class, and the right two columns show misclassified images from each class, with the incorrect prediction written above each image.
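A grid like this can be assembled by collecting, for each class, the indices of a few correct and a few misclassified samples. The sketch below shows one way to do that. The function name `pick_examples` and the toy labels are hypothetical, and plotting is left out.

```python
def pick_examples(y_true, y_pred, num_classes, per_group=2):
    """For each true class, collect up to `per_group` indices of correctly
    classified samples and up to `per_group` of misclassified ones."""
    grid = {c: {"correct": [], "wrong": []} for c in range(num_classes)}
    for i, (t, p) in enumerate(zip(y_true, y_pred)):
        key = "correct" if t == p else "wrong"
        if len(grid[t][key]) < per_group:
            grid[t][key].append(i)
    return grid


# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

grid = pick_examples(y_true, y_pred, num_classes=2)
print(grid)

# The title for each misclassified image is its (incorrect) prediction.
wrong_titles = {c: [y_pred[i] for i in grid[c]["wrong"]] for c in grid}
print(wrong_titles)
```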