CS 194-26 Project 4 [acc id: aez]

Overview

Part 1: Image Classification

Dataloader and CNN: Used torch.utils.data.DataLoader and torch.nn.Module respectively
Loss function and Optimizer: nn.CrossEntropyLoss and optim.Adam respectively. See parameters below.

CNN model specifics

Dataset sizes
- Training = 40000
- Validation = 5000
- Test = 5000
Hyperparameters
- batch_size = 32
- n_epochs = 10
- learning_rate = 0.001
Layers
- As given in the project spec: 2 conv layers with 32 channels each, each followed by ReLU and max pool.
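
The setup listed above could be sketched in PyTorch roughly as follows. This is a minimal sketch, not the submitted code: the class name FashionCNN, the 3x3 kernels, the padding, the fully connected sizes, and the 28x28 grayscale input (Fashion-MNIST-style) are all assumptions. Only the two 32-channel conv layers with ReLU and max pool, the loss, the optimizer, the batch size, and the learning rate come from the report.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class FashionCNN(nn.Module):
    """Two conv layers (32 channels each), each followed by ReLU and 2x2 max pool."""

    def __init__(self, n_classes=10):
        super().__init__()
        # Kernel size and padding are assumed; the report does not state them
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        # 28x28 input -> 14x14 -> 7x7 after the two pools (assumed input size)
        self.fc1 = nn.Linear(32 * 7 * 7, 128)  # hidden size is assumed
        self.fc2 = nn.Linear(128, n_classes)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)  # raw logits; CrossEntropyLoss applies log-softmax itself


model = FashionCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# One training step on a dummy batch of batch_size = 32
images = torch.randn(32, 1, 28, 28)
labels = torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

In a real run the dummy batch would be replaced by batches drawn from a torch.utils.data.DataLoader, repeated for n_epochs = 10.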

Results

Train and validation accuracy

Accuracy

Overall

Accuracy of the network on the 60000 train images: 93.19 %
Accuracy of the network on the 5000 validation images: 91.34 %
Accuracy of the network on the 10000 test images: 90.52 %

Per-class

Class      Accuracy (%)
0            85.40   
1            98.10   
2            88.90   
3            94.20   
4            82.10   
5            96.90   
6            67.00   
7            98.90   
8            98.40   
9            95.30
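
Per-class accuracy is the fraction of images of each true class that the network labels correctly. A minimal sketch in plain Python; the function name per_class_accuracy and the toy labels are illustrative and not taken from the project code.

```python
def per_class_accuracy(labels, predictions, n_classes):
    """Return accuracy (%) for each class, indexed by true label."""
    correct = [0] * n_classes
    total = [0] * n_classes
    for y, y_hat in zip(labels, predictions):
        total[y] += 1
        if y == y_hat:
            correct[y] += 1
    # Classes with no examples get None rather than a division by zero
    return [100.0 * c / t if t else None for c, t in zip(correct, total)]


# Toy example with 3 classes (made-up labels, not project data)
labels      = [0, 0, 1, 1, 2, 2, 2, 2]
predictions = [0, 1, 1, 1, 2, 2, 0, 2]
acc = per_class_accuracy(labels, predictions, n_classes=3)
print(acc)  # [50.0, 100.0, 75.0]
```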

Classified images

Correctly classified	Wrongly classified

Class 6 had the lowest accuracy (67.00%) and class 7 the highest (98.90%).
- Class 6 corresponds to shirt, which may be hard to categorize because several other categories also look like a shirt (t-shirt, pullover, dress, top).
- Class 7 may have done best because it looks more distinct than the other footwear classes, or because all footwear classes in this dataset have distinctive features.

Visualization of filter

The convolution filters were extracted and plotted, as seen below:
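
The extraction step could be sketched as follows. This is a hedged illustration: the layer conv1 here is an untrained stand-in with an assumed shape (1 input channel, 32 filters, 3x3), and the per-filter rescaling to [0, 1] is one common choice for display, not necessarily what was done for the plots.

```python
import torch
import torch.nn as nn

# Stand-in for the trained first conv layer (shape is assumed)
conv1 = nn.Conv2d(1, 32, kernel_size=3)

# Conv weights have shape (out_channels, in_channels, kH, kW)
filters = conv1.weight.detach().clone()

# Rescale each filter to [0, 1] so it can be shown as a grayscale image
flat = filters.view(filters.size(0), -1)
lo = flat.min(dim=1).values.view(-1, 1, 1, 1)
hi = flat.max(dim=1).values.view(-1, 1, 1, 1)
images = (filters - lo) / (hi - lo + 1e-8)

# Each images[i, 0] is a 2-D tensor that can be handed to an image plotting routine
```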

Part 2: Semantic Segmentation

Dataloader and CNN: Used torch.utils.data.DataLoader and torch.nn.Module respectively
Loss function and Optimizer: nn.CrossEntropyLoss and optim.Adam respectively. See parameters below.

CNN model specifics

Dataset sizes
- Training = 800
- Validation = 106
- Test = 114
Hyperparameters
- batch_size = 8
- n_epochs = 30
- learning_rate = 0.001
- weight_decay = 0.00001
Layers
- 6 convolution layers, with at most 128 channels and kernel sizes ranging from 3x3 to 7x7.
- ReLU is applied after every convolution except the last.
- 3 max-pooling and upsampling operations with scale factor (2,2) were applied after the first three convolutions.
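
One possible reading of this architecture is sketched below. It is an assumption-heavy illustration rather than the submitted network: the class name FacadeNet, the helper conv_block, the exact channel widths and kernel order, the placement of the three upsampling steps in the second half of the network, the 3-channel input, and the 5 output classes (inferred from the five AP values reported) are all guesses. The layer count, the 128-channel maximum, the 3x3 to 7x7 kernel range, the missing ReLU after the last convolution, and the optimizer settings follow the report.

```python
import torch
import torch.nn as nn
import torch.optim as optim


def conv_block(c_in, c_out, k):
    # padding = k // 2 keeps the spatial size unchanged for odd kernel sizes
    return nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2)


class FacadeNet(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            # Downsampling half: conv -> ReLU -> 2x2 max pool, three times
            conv_block(3, 32, 7), nn.ReLU(), nn.MaxPool2d(2),
            conv_block(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
            conv_block(64, 128, 3), nn.ReLU(), nn.MaxPool2d(2),
            # Upsampling half (assumed placement): restores the input resolution
            conv_block(128, 64, 3), nn.ReLU(), nn.Upsample(scale_factor=2),
            conv_block(64, 32, 5), nn.ReLU(), nn.Upsample(scale_factor=2),
            # Last convolution has no ReLU, so the output is raw logits
            conv_block(32, n_classes, 7), nn.Upsample(scale_factor=2),
        )

    def forward(self, x):
        return self.net(x)  # per-pixel class logits, shape (N, n_classes, H, W)


model = FacadeNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.00001)

# Dummy batch; height and width must be divisible by 8 for the sizes to match
x = torch.randn(2, 3, 64, 64)
target = torch.randint(0, 5, (2, 64, 64))
logits = model(x)
loss = criterion(logits, target)
```

nn.CrossEntropyLoss accepts logits of shape (N, C, H, W) with integer targets of shape (N, H, W), which is why the same loss works for per-pixel labels here.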

Results

Train and validation accuracy

Average Precision (AP)

AP = 0.663781527613688
AP = 0.7659576334469919
AP = 0.13393651712680177
AP = 0.8007558902959213
AP = 0.1794153896020382
Average Loss: 0.5087693916170883
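
For context, average precision for one class summarizes how well the predicted scores for that class rank the true pixels of that class above all other pixels. A minimal sketch in plain Python of the non-interpolated definition (mean of the precision at each true positive in the ranking); the report does not say whether this exact variant or a library routine was used, and the function name and toy inputs are illustrative.

```python
def average_precision(scores, is_positive):
    """Non-interpolated AP: mean of precision@k over the ranks k of the positives."""
    # Rank items from highest to lowest score (ties keep their input order)
    ranked = sorted(zip(scores, is_positive), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy example (made-up scores): positives land at ranks 1, 3 and 4
scores      = [0.9, 0.8, 0.7, 0.6, 0.5]
is_positive = [1,   0,   1,   1,   0]
ap = average_precision(scores, is_positive)  # (1/1 + 2/3 + 3/4) / 3
```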

Own image

Input	Output (output model)	Output (with 20 epochs)

Everything looks pretty much in place! The model did well at recognizing the general facade and the windows, and it also picked up pillars in the appropriate places.
One noticeable shortfall is the identification of balconies. Even though all instances of balconies were identified, some balconies were also predicted in the middle of windows.
- This could be due to the dark colour in the middle of the window, which led the model to think it is a balcony (a balcony usually casts a dark shadow around it).
- It is also possible that, because the training images have balconies directly underneath the windows, the model learned to predict balconies very close to repeated windows.
To test my hypothesis of overtraining on balconies near windows, I tried training my model for only 20 epochs. That model identified more balconies in the correct locations and fewer on the windows. However, it was less sharp on other things that the original model did better (windows and facade).