CS 194 Project 4

Part 1: Image Classification

Training and Validation Accuracy

Training and Validation Accuracy

Per class accuracy

T-shirt/top 84%
Trouser/pants 98%
Pullover shirt 79%
Dress 91%
Coat 88%
Sandal 97%
Shirt 67%
Sneaker 98%
Bag 98%
Ankle boot 94%
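As a rough illustration of how a per-class table like this can be produced, the sketch below counts correct predictions for each true label. The function name `per_class_accuracy` and the toy labels are hypothetical and are not taken from the project code. The last lines take the unweighted mean of the ten reported figures, which matches overall accuracy only if every class has the same number of test images.

```python
from collections import Counter

def per_class_accuracy(labels, preds, class_names):
    """Return {class name: fraction of that class's samples predicted correctly}."""
    correct = Counter()
    total = Counter()
    for y, p in zip(labels, preds):
        total[y] += 1
        if y == p:
            correct[y] += 1
    return {class_names[c]: correct[c] / total[c] for c in sorted(total)}

# Toy data (hypothetical): 3 classes, 8 samples.
names = ["T-shirt/top", "Trouser/pants", "Shirt"]
labels = [0, 0, 1, 1, 2, 2, 2, 2]
preds = [0, 2, 1, 1, 2, 0, 0, 2]
acc = per_class_accuracy(labels, preds, names)

# Unweighted mean of the ten per-class accuracies reported above (in percent).
reported = [84, 98, 79, 91, 88, 97, 67, 98, 98, 94]
macro_avg = sum(reported) / len(reported)
```

On the toy data, the shirt samples that are predicted as T-shirts lower only the shirt row, which is the kind of confusion the table above suggests.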

I found that shirts were the hardest to classify (67%), while pants, sneakers, and bags were fairly easy (98% each). The following are correctly and incorrectly classified images for each class.

T-Shirt Correct
T-Shirt Incorrect
Pants Correct
Pants Incorrect
Pullover Correct
Pullover Incorrect
Dress Correct
Dress Incorrect
Coat Correct
Coat Incorrect
Sandal Correct
Sandal Incorrect
Shirt Correct
Shirt Incorrect
Sneaker Correct
Sneaker Incorrect
Bag Correct
Bag Incorrect
Ankle Boot Correct
Ankle Boot Incorrect

Learned Filters

Conv Layer 1
Conv Layer 2 (only 9 of the filters are shown)

Part 2: Semantic Segmentation

Model architecture

UNet Model Architecture

I used a UNet with 14 convolutional layers, 3 pooling layers, and 3 transpose convolutional layers. Further details of the model architecture are listed here:

Layer Output Shape # Learnable Parameters
Conv2d-1 [-1, 16, 256, 256] 448
Conv2d-2 [-1, 24, 256, 256] 3,480
MaxPool2d-3 [-1, 24, 128, 128] 0
Conv2d-4 [-1, 36, 128, 128] 7,812
Conv2d-5 [-1, 52, 128, 128] 16,900
MaxPool2d-6 [-1, 52, 64, 64] 0
Conv2d-7 [-1, 72, 64, 64] 33,768
Conv2d-8 [-1, 96, 64, 64] 62,304
MaxPool2d-9 [-1, 96, 32, 32] 0
Conv2d-10 [-1, 96, 32, 32] 83,040
Conv2d-11 [-1, 96, 32, 32] 83,040
ConvTranspose2d-12 [-1, 96, 64, 64] 36,960
Conv2d-13 [-1, 72, 64, 64] 62,280
Conv2d-14 [-1, 52, 64, 64] 33,748
ConvTranspose2d-15 [-1, 52, 128, 128] 10,868
Conv2d-16 [-1, 36, 128, 128] 16,884
Conv2d-17 [-1, 24, 128, 128] 7,800
ConvTranspose2d-18 [-1, 24, 256, 256] 2,328
Conv2d-19 [-1, 16, 256, 256] 3,472
Conv2d-20 [-1, 5, 256, 256] 725
ReLU-21 [-1, 5, 256, 256] 0

Model statistics

Total params 465,857
Trainable params 465,857
Non-trainable params 0
Input size (MB) 0.75
Forward/backward pass size (MB) 89.00
Params size (MB) 1.78
Estimated Total Size (MB) 91.53
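The parameter counts in the table can be checked by hand. The sketch below assumes 3x3 kernels for every Conv2d layer, 2x2 kernels for every ConvTranspose2d layer, a 3-channel input, and a bias term per output channel. These values are not stated in the write-up; they are inferred because they reproduce the reported counts. The helper name `conv_params` is hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights (c_in * c_out * k * k) plus one bias per output channel."""
    return c_in * c_out * k * k + c_out

# (layer, in channels, out channels, assumed kernel size, reported count)
layers = [
    ("Conv2d-1", 3, 16, 3, 448),
    ("Conv2d-2", 16, 24, 3, 3480),
    ("Conv2d-4", 24, 36, 3, 7812),
    ("Conv2d-5", 36, 52, 3, 16900),
    ("Conv2d-7", 52, 72, 3, 33768),
    ("Conv2d-8", 72, 96, 3, 62304),
    ("Conv2d-10", 96, 96, 3, 83040),
    ("Conv2d-11", 96, 96, 3, 83040),
    ("ConvTranspose2d-12", 96, 96, 2, 36960),
    ("Conv2d-13", 96, 72, 3, 62280),
    ("Conv2d-14", 72, 52, 3, 33748),
    ("ConvTranspose2d-15", 52, 52, 2, 10868),
    ("Conv2d-16", 52, 36, 3, 16884),
    ("Conv2d-17", 36, 24, 3, 7800),
    ("ConvTranspose2d-18", 24, 24, 2, 2328),
    ("Conv2d-19", 24, 16, 3, 3472),
    ("Conv2d-20", 16, 5, 3, 725),
]

# Layers whose computed count disagrees with the table (expected to be empty).
mismatches = [name for name, i, o, k, rep in layers if conv_params(i, o, k) != rep]
total = sum(conv_params(i, o, k) for _, i, o, k, _ in layers)

# Sizes in MB, assuming 4-byte float32 values.
params_mb = total * 4 / 1024**2
input_mb = 3 * 256 * 256 * 4 / 1024**2
```

Under these assumptions the per-layer counts, the 465,857 total, the 1.78 MB parameter size, and the 0.75 MB input size all agree with the figures listed above.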

I used cross-entropy as my loss and Adam as my optimizer, with a learning rate of 1e-4 and a weight decay of 1e-4.
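For reference, here is a minimal pure-Python sketch of the cross-entropy computation for a segmentation output, where each pixel has one score per class (five classes here, matching the 5-channel final layer). It only illustrates the formula and is not the project's training code. The function names and the toy pixel values are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one pixel."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def mean_pixel_loss(pixel_logits, pixel_targets):
    """Average the per-pixel cross-entropy over all pixels."""
    losses = [cross_entropy(l, t) for l, t in zip(pixel_logits, pixel_targets)]
    return sum(losses) / len(losses)

# Toy example (hypothetical): two pixels, five class scores each.
pixel_logits = [
    [0.0, 0.0, 0.0, 0.0, 0.0],   # no preference: loss is log(5)
    [5.0, 0.0, 0.0, 0.0, 0.0],   # confident in class 0
]
pixel_targets = [2, 0]
loss = mean_pixel_loss(pixel_logits, pixel_targets)
```

A pixel with equal scores for all five classes contributes log(5), roughly 1.61, which is a useful baseline when reading a loss curve.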

Training and Validation Loss

Training and Validation Loss

Average Precision

I got an average precision (AP) of 0.474 with this model.
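The evaluation code is not shown here, so the following is only a sketch of one common definition of AP for a single class: rank the pixels by predicted score, then average the precision measured at each true positive. The function name `average_precision` and the toy inputs are hypothetical, and tied scores are not treated specially.

```python
def average_precision(scores, labels):
    """AP for one class: mean of precision at the rank of each positive sample.

    scores: predicted score for the class, one per pixel
    labels: 1 if the pixel truly belongs to the class, else 0
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    # Indices sorted from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            ap += true_pos / rank  # precision at this rank
    return ap / total_pos

# Toy example (hypothetical): positives are ranked 1st and 3rd.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Averaging such per-class values over all classes gives a single summary number. A class the model rarely predicts correctly pulls that mean down sharply.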

Result Example

My segmentation captured the windows and ledges well. However, it clearly failed to recognize walls (the wall class also had a very low AP) and did poorly on railings.

Input Image
Ground Truth Segmentation
Output Segmentation