Project 4 Classification and Segmentation

Yibin Li, March 2020

In this project, we use a state-of-the-art deep learning framework (PyTorch) to train an image classification CNN and a semantic segmentation CNN.

Part 1 Image Classification

The first part trains a two-convolution-layer CNN to classify images from the FashionMNIST dataset. The model structure is shown below.

Model architecture

fashionNet(
  (cnn_model): Sequential(
    (0): Conv2d(1, 32, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc_model): Sequential(
    (0): Linear(in_features=512, out_features=128, bias=True)
    (1): ReLU(inplace=True)
    (2): Linear(in_features=128, out_features=10, bias=True)
  )
)
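
For reference, a PyTorch module along the following lines reproduces the printout above; the flatten step in the forward pass is my assumption, inferred from the 512 input features of the first linear layer (32 channels x 4 x 4 after the two conv/pool blocks).

import torch.nn as nn

class fashionNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # 1x28x28 -> 32x24x24 -> 32x12x12 -> 32x8x8 -> 32x4x4
        self.cnn_model = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 32, kernel_size=5),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # 32 channels * 4 * 4 = 512 flattened features
        self.fc_model = nn.Sequential(
            nn.Linear(512, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.cnn_model(x)
        x = x.flatten(1)  # assumed flatten before the fully connected layers
        return self.fc_model(x)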

The model is trained for 30 epochs with the Adam optimizer, using a learning rate of 1e-3 and a weight decay of 5e-4, with cross-entropy loss as the criterion. A minimal sketch of this training setup is given below, followed by the training and validation curves for loss and accuracy.
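
In this sketch, train_loader and val_loader are assumed to be standard DataLoaders over the FashionMNIST training/validation split (their construction is not shown), and the bookkeeping used to draw the curves is omitted.

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = fashionNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

for epoch in range(30):
    model.train()
    for images, labels in train_loader:  # assumed FashionMNIST training DataLoader
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    # validation pass used for the loss/accuracy curves
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:  # assumed validation DataLoader
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch}: val acc {correct / total:.4f}")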

Model training and validation loss


Model training and validation accuracy


I achieve 91.5% accuracy on the validation dataset and 90.7% accuracy on the test dataset. The per-class accuracies on the validation dataset are listed below.

T-Shirt: 0.9182
Trouser: 0.9861
Pullover: 0.8750
Dress: 0.9404
Coat: 0.8404
Sandal: 0.9822
Shirt: 0.7048
Sneaker: 0.9561
Bag: 0.9858
Ankle Boot: 0.9640

The Trouser, Sandal, and Bag classes all have around 98% accuracy during training and validation; Trouser is the highest of all classes, and Shirt is the lowest.
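
The per-class numbers above come from bucketing correct predictions by ground-truth label; a sketch, where val_loader is the assumed validation DataLoader:

import torch

classes = ["T-Shirt", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle Boot"]
correct = torch.zeros(len(classes))
total = torch.zeros(len(classes))
model.eval()
with torch.no_grad():
    for images, labels in val_loader:  # assumed validation DataLoader
        preds = model(images.to(device)).argmax(dim=1).cpu()
        for c in range(len(classes)):
            mask = labels == c
            correct[c] += (preds[mask] == c).sum()
            total[c] += mask.sum()
for name, acc in zip(classes, (correct / total).tolist()):
    print(f"{name}: {acc:.4f}")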

Below are correct and incorrect predictions for each class in the FashionMNIST dataset. The first two images are correct predictions for that class, and the last two images are incorrect predictions.
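
To pick the examples shown below, it is enough to record, for each class, the indices of a couple of correctly and incorrectly classified validation images; a rough sketch (val_dataset is the assumed validation Dataset, and the plotting itself is omitted):

# record indices of two correct and two incorrect validation predictions per class
correct_idx = {c: [] for c in range(10)}
wrong_idx = {c: [] for c in range(10)}
model.eval()
with torch.no_grad():
    for idx, (image, label) in enumerate(val_dataset):  # assumed validation Dataset
        pred = model(image.unsqueeze(0).to(device)).argmax(dim=1).item()
        bucket = correct_idx if pred == label else wrong_idx
        if len(bucket[label]) < 2:  # keep two examples per class
            bucket[label].append(idx)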

T-Shirt


Trouser


Pullover


Dress


Coat


Sandal


Shirt


Sneaker


Bag


Ankle Boot


And finally, here are the 32 learned 5x5 filters from the first convolution layer.

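
The filters are simply the weight tensor of the first convolution layer, of shape (32, 1, 5, 5); a plotting sketch using matplotlib (the grid layout and grayscale colormap are my choices):

import matplotlib.pyplot as plt

filters = model.cnn_model[0].weight.detach().cpu()  # shape (32, 1, 5, 5)
fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.squeeze(0), cmap="gray")  # each 5x5 kernel as a grayscale image
    ax.axis("off")
plt.show()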

Part 2 Semantic Segmentation

In this part, I trained a semantic segmentation model on the Facade dataset. The dataset has a total of 905 training-time images and 113 test-time images. I split the 905 training-time images 80/20 for model training and validation, respectively; the remaining 113 images are used only for testing. The model structure is defined below.
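
The 80/20 split can be done with torch.utils.data.random_split; a sketch, assuming facade_train_dataset holds the 905 training-time image/label pairs (the batch size and seed here are placeholders):

import torch
from torch.utils.data import random_split, DataLoader

# assumed: facade_train_dataset holds the 905 training-time image/label pairs
n_train = int(0.8 * len(facade_train_dataset))  # 724 images for training
n_val = len(facade_train_dataset) - n_train     # 181 images for validation
train_set, val_set = random_split(
    facade_train_dataset, [n_train, n_val],
    generator=torch.Generator().manual_seed(0))  # placeholder seed
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)   # placeholder batch size
val_loader = DataLoader(val_set, batch_size=8, shuffle=False)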

Model Architecture

I used six convolution layers interleaved with batch normalization and dropout layers to prevent overfitting. Between the block2 and block3 convolution layers, I added an upsampling layer so that the output size matches the original input image size.

FacadeNet(
  (block1): Sequential(
    (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout2d(p=0.2, inplace=False)
    (3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): Dropout2d(p=0.2, inplace=False)
    (6): ReLU(inplace=True)
  )
  (pool): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
  (block2): Sequential(
    (0): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout2d(p=0.2, inplace=False)
    (3): ReLU(inplace=True)
  )
  (up): UpsamplingNearest2d(scale_factor=2.0, mode=nearest)
  (block3): Sequential(
    (0): Conv2d(96, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Dropout2d(p=0.2, inplace=False)
    (3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): Dropout2d(p=0.2, inplace=False)
    (6): ReLU(inplace=True)
  )
  (classifier): Sequential(
    (0): Dropout2d(p=0.2, inplace=False)
    (1): Conv2d(32, 5, kernel_size=(1, 1), stride=(1, 1))
  )
)
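
For reference, a module along these lines reproduces the printout above. Note that block3 expects 96 input channels, which suggests the upsampled block2 output (64 channels) is concatenated with the block1 output (32 channels) before block3; that skip connection is inferred from the shapes rather than stated explicitly, so treat the forward pass below as a sketch.

import torch
import torch.nn as nn

class FacadeNet(nn.Module):
    def __init__(self, n_class=5):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.Dropout2d(0.2),
            nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.Dropout2d(0.2),
            nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2, 2)
        self.block2 = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.Dropout2d(0.2),
            nn.ReLU(inplace=True))
        self.up = nn.UpsamplingNearest2d(scale_factor=2)
        self.block3 = nn.Sequential(
            nn.Conv2d(96, 32, 3, padding=1), nn.BatchNorm2d(32), nn.Dropout2d(0.2),
            nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.Dropout2d(0.2),
            nn.ReLU(inplace=True))
        self.classifier = nn.Sequential(nn.Dropout2d(0.2), nn.Conv2d(32, n_class, 1))

    def forward(self, x):
        x1 = self.block1(x)                       # 32 x H x W
        x2 = self.up(self.block2(self.pool(x1)))  # 64 x H x W after upsampling
        x3 = self.block3(torch.cat([x2, x1], dim=1))  # assumed skip connection: 64 + 32 = 96 channels
        return self.classifier(x3)                # 5-channel per-pixel class scores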

Model training and validation loss


I used cross-entropy loss as the model criterion and the Adam optimizer with a learning rate of 1e-3 and a weight decay of 1e-5, and trained the model for 100 epochs. Using the default calculate_AP function, the model achieves a mAP of 0.5 on the test dataset.

AP = 0.63634177927691
AP = 0.7176073269837082
AP = 0.10667374877478818
AP = 0.7724571373714736
AP = 0.2672858728373275
Average AP over 5 class = 0.5000731730488415
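
One detail worth noting about the criterion: nn.CrossEntropyLoss handles dense prediction directly, averaging over every pixel when given (N, 5, H, W) logits and an (N, H, W) map of integer labels. A minimal sketch of one optimization step under the hyperparameters above (train_loader is the facade training loader from the split sketch earlier):

import torch
import torch.nn as nn

model = FacadeNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

images, labels = next(iter(train_loader))   # labels: (N, H, W) integer class map
optimizer.zero_grad()
logits = model(images.to(device))           # logits: (N, 5, H, W)
loss = criterion(logits, labels.long().to(device))  # averaged over all pixels
loss.backward()
optimizer.step()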

Testing the model on my own images


I intentionally chose these images from the internet to see how well the model generalizes to the real world. As we can see above, the model does not perform well on pillars. This might be due to the blurring and color distortion after resizing, but it is also consistent with the pillar class's low AP value. Although the predicted layout is noisy as well, the model successfully identifies windows, facade, and balconies in all three images.
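
The resizing mentioned above happens in preprocessing; a rough sketch of how a downloaded image is fed to the network (the 256x256 target size, the transform, and the file name are illustrative assumptions rather than the exact pipeline):

from PIL import Image
import torch
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((256, 256)),  # resizing is what introduces the blur/color artifacts
    T.ToTensor()])
img = Image.open("my_building.jpg").convert("RGB")  # hypothetical downloaded image
x = preprocess(img).unsqueeze(0).to(device)
model.eval()
with torch.no_grad():
    pred = model(x).argmax(dim=1).squeeze(0).cpu()  # (H, W) map of predicted class ids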