In the first part of the project we classify the Fashion-MNIST dataset into ten classes. First, the dataset is loaded and converted to tensors, and several sample images are displayed. We then build a CNN consisting of two convolutional layers, two fully connected linear layers, and one output layer. Each convolutional layer is followed by a ReLU and a max-pooling layer, and each hidden linear layer by a ReLU. Cross-entropy loss and the Adam optimizer are used for training.
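The classifier described above can be sketched as follows. The exact channel and hidden-layer sizes are not given in the report, so the values below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FashionCNN(nn.Module):
    """Sketch of the Fashion-MNIST classifier: 2 conv layers (each with
    ReLU + max-pool), 2 hidden linear layers (each with ReLU), and an
    output layer. Channel/hidden sizes are assumptions, not the report's."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> 14x14
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),      # raw scores; CrossEntropyLoss applies log-softmax
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = FashionCNN()
out = model(torch.randn(8, 1, 28, 28))        # a dummy batch of 8 grayscale images
```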
The second part of the project aims to achieve semantic segmentation of Mini Facade Dataset which is able to label each pixel in the image to its correct object class. There are five classes encoded by different colors. A CNN made of 5 convolutional layer and 1 tansposed convolutional layer is used for training. Cross entropy loss and Adam optimizer are used here to achieve an Average Preciseion (AP) over 0.45.
The batch size is 32, and 20% of the training dataset is held out as the validation set.
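The split can be done with `torch.utils.data.random_split`. A minimal sketch (a random tensor dataset stands in for the real data, which would normally come from `torchvision.datasets.FashionMNIST` with a `ToTensor()` transform):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in for the Fashion-MNIST training set (1000 dummy 28x28 images).
full_train = TensorDataset(torch.randn(1000, 1, 28, 28),
                           torch.randint(0, 10, (1000,)))

# Hold out 20% of the training data as the validation set.
n_val = int(0.2 * len(full_train))
train_set, val_set = random_split(full_train, [len(full_train) - n_val, n_val])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)
```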
Sample images and their labels: (figure)
Train and validation accuracy during training: (figure)
Class | Validation | Test
---|---|---
T-shirt/top | 87% | 86%
Trouser | 95% | 97%
Pullover | 64% | 69%
Dress | 88% | 86%
Coat | 87% | 85%
Sandal | 96% | 96%
Shirt | 38% | 39%
Sneaker | 96% | 96%
Bag | 94% | 95%
Ankle boot | 91% | 88%
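Per-class accuracies like those above can be obtained by masking the predictions on each ground-truth class. A minimal sketch with hypothetical predictions and targets:

```python
import torch

# Hypothetical predictions/targets for a 3-class toy example; per-class
# accuracy = correct predictions / samples of that class.
preds = torch.tensor([0, 0, 1, 1, 2, 2])
targets = torch.tensor([0, 1, 1, 1, 2, 0])

num_classes = 3
accs = []
for c in range(num_classes):
    mask = targets == c                        # samples whose true class is c
    accs.append((preds[mask] == targets[mask]).float().mean().item())
```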
Correct and incorrect prediction examples for each class (T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot): (images omitted)
Five convolutional layers with two maxpool layers and one transposed convolutional layer for upsampling are used for training:
Conv2d(3,128,3,1,1) -> ReLU -> Conv2d(128,256,3,1,1) -> ReLU -> Maxpool(2,2) -> Conv2d(256,128,3,1,1) -> ReLU -> Conv2d(128,128,3,1,1) -> ReLU -> Maxpool(2,2) -> ConvTranspose2d(128,64,6,4,1) -> Conv2d(64,5,3,1,1)
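This layer sequence renders directly as an `nn.Sequential`; with two stride-2 max-pools downsampling by 4x, the `ConvTranspose2d(128, 64, 6, 4, 1)` upsamples by 4x and restores the input resolution:

```python
import torch
import torch.nn as nn

# Direct rendering of the layer sequence above: 3 input channels (RGB),
# 5 output channels (one score map per class).
seg_net = nn.Sequential(
    nn.Conv2d(3, 128, 3, 1, 1), nn.ReLU(),
    nn.Conv2d(128, 256, 3, 1, 1), nn.ReLU(),
    nn.MaxPool2d(2, 2),                       # H x W -> H/2 x W/2
    nn.Conv2d(256, 128, 3, 1, 1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, 1, 1), nn.ReLU(),
    nn.MaxPool2d(2, 2),                       # -> H/4 x W/4
    nn.ConvTranspose2d(128, 64, 6, 4, 1),     # 4x upsampling back to H x W
    nn.Conv2d(64, 5, 3, 1, 1),                # per-pixel class scores
)

scores = seg_net(torch.randn(1, 3, 64, 64))   # dummy 64x64 RGB image
```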
20% of the training dataset is held out as the validation set. Cross-entropy loss is used as the prediction loss, and the Adam optimizer with learning rate 1e-3 and weight decay 1e-5 is used for training.
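A single training step under these hyper-parameters looks like the following sketch (a one-layer stand-in network and a dummy batch replace the real model and data loader; `nn.CrossEntropyLoss` on 4-D score maps averages over pixels):

```python
import torch
import torch.nn as nn

# Stand-in for the segmentation network; real code would use the
# five-conv + transposed-conv model described above.
model = nn.Conv2d(3, 5, 3, 1, 1)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

images = torch.randn(4, 3, 32, 32)            # dummy RGB batch
labels = torch.randint(0, 5, (4, 32, 32))     # per-pixel class indices

optimizer.zero_grad()
loss = criterion(model(images), labels)       # cross-entropy over all pixels
loss.backward()
optimizer.step()
```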
Train and validation loss during training: (figure)
Class | AP
---|---
Others | 0.6619 |
Facade | 0.7851 |
Pillar | 0.1336 |
Window | 0.8087 |
Balcony | 0.3766 |
Average | 0.5532 |
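The reported average is the arithmetic mean of the five per-class APs:

```python
# Per-class APs from the table above; the mean matches the reported 0.5532.
aps = {"Others": 0.6619, "Facade": 0.7851, "Pillar": 0.1336,
       "Window": 0.8087, "Balcony": 0.3766}
mean_ap = sum(aps.values()) / len(aps)
```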
Validation examples (input image, ground truth, predicted label): (images omitted)
Test examples (input image, predicted label): (images omitted)
As can be seen from the examples above, Window (orange) and Facade (blue) are predicted well, while Pillar (green) is difficult to identify. Furthermore, because the training images are always confined to the boundaries of a building, the model fails to predict surroundings outside the building, as in the Railway Station case above.