Project 4: Classification and Segmentation

Spring 2020

Image Classification

The Neural Network Structure

The neural network has the following structure.

  • 2 convolution layers with 32 channels. The first convolution layer uses a $5\times 5$ kernel, and the second convolution layer uses a $3\times 3$ kernel.
  • Each convolution layer is followed by a ReLU and then a max pool of size (2, 2).
  • 2 fully connected layers. The first FC layer has 120 nodes and is followed by a ReLU, and the second FC layer has 10 nodes.
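
A minimal PyTorch sketch of this architecture is given below. It assumes 1-channel $28\times 28$ Fashion MNIST inputs and no padding in the convolutions (the padding used in the project is not stated), which makes the flattened feature size $32 \times 5 \times 5$.

```python
import torch
import torch.nn as nn

class FashionCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5),   # 28x28 -> 24x24
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 24x24 -> 12x12
            nn.Conv2d(32, 32, kernel_size=3),  # 12x12 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 10),                # 10 Fashion MNIST classes
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)
```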

Data

The data used to train the model is the Fashion MNIST dataset, which has 10 classes, 60000 training/validation images, and 10000 test images. Of these, 50000 images are used for training and 10000 for validation. The batch size is 100. Five example images are plotted below, with labels 8, 7, 5, 4, and 8 respectively.
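
A sketch of the data loading, assuming the torchvision FashionMNIST dataset; the loader names are illustrative. The training/validation loader is kept unshuffled so that batch indices stay fixed, which matches the index-based validation split described in the training section below.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_val_set = datasets.FashionMNIST("data", train=True, download=True, transform=transform)
test_set = datasets.FashionMNIST("data", train=False, download=True, transform=transform)

# 600 batches of 100 images; in any given epoch 500 batches are used for
# training and 100 batches for validation (see the training section below).
train_val_loader = DataLoader(train_val_set, batch_size=100, shuffle=False)
test_loader = DataLoader(test_set, batch_size=100)
```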

Training

To train the neural network, 6-fold cross validation is used. The number of epochs is 60 (10 full rounds over the 6 folds). With $n$ denoting the epoch number, the validation set consists of the batches with indices in $[(n \bmod 6) \cdot 100,\ (n \bmod 6 + 1) \cdot 100 - 1]$. The loss function is cross entropy, and the optimizer is Adam with learning rate 0.002. The training and validation accuracy are plotted below; the validation accuracy is almost saturated after 25 epochs.
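
A sketch of this training scheme, reusing FashionCNN and train_val_loader from the sketches above; the exact bookkeeping in the original code may differ.

```python
import torch
import torch.nn as nn

model = FashionCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)

for epoch in range(60):
    fold = epoch % 6
    val_batches = range(fold * 100, (fold + 1) * 100)  # held-out batch indices this epoch

    model.train()
    for i, (images, labels) in enumerate(train_val_loader):
        if i in val_batches:
            continue  # skip the validation fold
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # Validation accuracy on the held-out fold.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for i, (images, labels) in enumerate(train_val_loader):
            if i not in val_batches:
                continue
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")
```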

Result

The per-class test accuracy (in %) is $[87.4, 98.3, 84.4, 88.6, 86.6, 98.8, 71.5, 96.9, 96.3, 96.5]$ for classes 0-9 respectively.

The per-class validation accuracy (in %) is $[99.9, 99.7, 97.9, 99.3, 98.9, 99.9, 96.8, 99.1, 99.3, 99.6]$ for classes 0-9 respectively.
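
The per-class accuracies above can be computed with a loop like the following sketch, assuming the trained model and the test_loader from the earlier sketches.

```python
import torch

correct = torch.zeros(10)
total = torch.zeros(10)
model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images).argmax(dim=1)
        for c in range(10):
            mask = labels == c
            correct[c] += (preds[mask] == c).sum()
            total[c] += mask.sum()
print((100 * correct / total).tolist())  # per-class accuracy in percent
```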

The hardest class to predict is class 6: even though its validation accuracy is nearly perfect, its test accuracy is low. One possible reason is overfitting. Two examples of correct predictions and two examples of incorrect predictions are shown below. By inspecting the incorrect predictions, it is clear that some of the images in the test set were mislabelled, e.g., as class 5.

The learned kernels of the first convolution layer are also plotted below. Although they capture some characteristics of the original images, the meaning of each individual kernel is hard to interpret.
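
A sketch of how the 32 first-layer kernels can be visualized as a grid of $5\times 5$ grayscale images; model.features[0] refers to the first convolution in the FashionCNN sketch above.

```python
import matplotlib.pyplot as plt

kernels = model.features[0].weight.detach().cpu().numpy()  # shape (32, 1, 5, 5)
fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for ax, kernel in zip(axes.flat, kernels):
    ax.imshow(kernel[0], cmap="gray")
    ax.axis("off")
plt.show()
```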

Semantic Segmentation

The Neural Network Structure

The neural network has the following structure:

  • Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2), followed by
    • ReLU
    • MaxPool2d(2)
  • ConvTranspose2d(in_channels=16, out_channels=64, kernel_size=7, stride=2, padding=3, output_padding=1), followed by
    • ReLU
  • Conv2d(in_channels=64, out_channels=128, kernel_size=5, padding=2), followed by
    • ReLU
  • Conv2d(in_channels=128, out_channels=128, kernel_size=7, padding=3), followed by
    • ReLU
    • MaxPool2d(2)
  • ConvTranspose2d(in_channels=128, out_channels=16, kernel_size=5, stride=2, padding=2, output_padding=1), followed by
    • ReLU
  • Conv2d(in_channels=16, out_channels=5, kernel_size=3, padding=1)
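
The layer list above maps directly onto the following PyTorch sketch; whether the original implementation used nn.Sequential or a custom module is an assumption.

```python
import torch.nn as nn

seg_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.ConvTranspose2d(16, 64, kernel_size=7, stride=2, padding=3, output_padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.ConvTranspose2d(128, 16, kernel_size=5, stride=2, padding=2, output_padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 5, kernel_size=3, padding=1),  # 5 output classes per pixel
)
```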

To train the network, the loss is cross entropy, and the optimizer is Adam with a learning rate of 1e-3 and a weight decay of 1e-5. The batch size for both the training and the validation set is 6 (so there are 151 batches in total, of which 30 are used as the validation set). The number of epochs is 20. The training and validation loss during training are plotted below.
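
A sketch of this training configuration; seg_train_loader is a hypothetical loader that yields images and per-pixel integer label masks.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(seg_net.parameters(), lr=1e-3, weight_decay=1e-5)

for epoch in range(20):
    seg_net.train()
    for images, masks in seg_train_loader:  # batch size 6; masks are (N, H, W) class ids
        optimizer.zero_grad()
        logits = seg_net(images)            # (N, 5, H, W)
        loss = criterion(logits, masks)     # per-pixel cross entropy
        loss.backward()
        optimizer.step()
```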

Average Precision

The average precision on the test set is:

  • Others: 0.673
  • Facade: 0.772
  • Pillar: 0.171
  • Window: 0.828
  • Balcony: 0.412
  • Mean: 0.571
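
The exact evaluation protocol is not stated in this report; one common way to obtain per-class AP for segmentation is a per-pixel, one-vs-rest computation with scikit-learn, sketched below with illustrative names.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_class_ap(scores, labels, num_classes=5):
    """scores: (num_pixels, num_classes) softmax scores; labels: (num_pixels,) ground-truth ids."""
    return [average_precision_score(labels == c, scores[:, c]) for c in range(num_classes)]

# Example with random data, just to show the expected shapes.
scores = np.random.rand(1000, 5)
labels = np.random.randint(0, 5, size=1000)
print(per_class_ap(scores, labels))
```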

Examples

Two sets of examples are shown below. In both examples, the windows in orange and the facades in blue are predicted correctly, whereas the pillars in green and the balconies in red are failure cases. The example images agree with the average precision results reported above.