CS 194 Project 4 Soham Kale

Part 1: Image Classification

Training and Validation Accuracy

Per class accuracy

Class	Accuracy
T-shirt/top	84%
Trouser/pants	98%
Pullover shirt	79%
Dress	91%
Coat	88%
Sandal	97%
Shirt	67%
Sneaker	98%
Bag	98%
Ankle boot	94%
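
As a rough illustration of how per-class accuracies like these can be computed, here is a minimal sketch in plain Python. The function name `per_class_accuracy`, the class-index order, and the toy labels are assumptions made for the example; they are not taken from the project code.

```python
from collections import Counter

# Class names as listed in the table above; the index order is an assumption.
CLASSES = ["T-shirt/top", "Trouser/pants", "Pullover shirt", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

def per_class_accuracy(labels, predictions, num_classes):
    """Return the fraction of correctly classified samples for each class."""
    total = Counter(labels)
    correct = Counter(y for y, p in zip(labels, predictions) if y == p)
    # Classes with no samples get NaN instead of raising a division error.
    return [correct[c] / total[c] if total[c] else float("nan")
            for c in range(num_classes)]

# Toy labels and predictions (hypothetical, not the project's data).
labels      = [0, 0, 6, 6, 6, 1]
predictions = [0, 6, 6, 0, 6, 1]
acc = per_class_accuracy(labels, predictions, len(CLASSES))
for name, a in zip(CLASSES, acc):
    if a == a:  # skip classes absent from the toy data (NaN != NaN)
        print(f"{name}: {a:.0%}")
```

In the toy data, T-shirts and shirts are confused with each other, which is the kind of pattern that pulls a single class's accuracy down even when overall accuracy is high.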

I found that shirts were the hardest to classify (67%), while pants, sneakers, and bags were the easiest (98% each). The following are correctly and incorrectly classified images for each class.

Learned Filters

Part 2: Semantic Segmentation

Model architecture

I used a UNet with 14 convolutional layers, 3 max-pooling layers, and 3 transposed convolutional layers. The full layer-by-layer summary is below:

Layer	Output Shape	# Learnable Parameters
Conv2d-1	[-1, 16, 256, 256]	448
Conv2d-2	[-1, 24, 256, 256]	3,480
MaxPool2d-3	[-1, 24, 128, 128]	0
Conv2d-4	[-1, 36, 128, 128]	7,812
Conv2d-5	[-1, 52, 128, 128]	16,900
MaxPool2d-6	[-1, 52, 64, 64]	0
Conv2d-7	[-1, 72, 64, 64]	33,768
Conv2d-8	[-1, 96, 64, 64]	62,304
MaxPool2d-9	[-1, 96, 32, 32]	0
Conv2d-10	[-1, 96, 32, 32]	83,040
Conv2d-11	[-1, 96, 32, 32]	83,040
ConvTranspose2d-12	[-1, 96, 64, 64]	36,960
Conv2d-13	[-1, 72, 64, 64]	62,280
Conv2d-14	[-1, 52, 64, 64]	33,748
ConvTranspose2d-15	[-1, 52, 128, 128]	10,868
Conv2d-16	[-1, 36, 128, 128]	16,884
Conv2d-17	[-1, 24, 128, 128]	7,800
ConvTranspose2d-18	[-1, 24, 256, 256]	2,328
Conv2d-19	[-1, 16, 256, 256]	3,472
Conv2d-20	[-1, 5, 256, 256]	725
ReLU-21	[-1, 5, 256, 256]	0
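
The parameter counts in the table can be sanity-checked with a short calculation. The sketch below assumes 3x3 kernels for every `Conv2d`, 2x2 kernels for every `ConvTranspose2d`, a 3-channel input, and a bias term in each layer. None of these are stated in the write-up; they are inferred because they reproduce the listed counts, and each layer's input channels are taken from the previous layer's output shape.

```python
def conv_params(in_ch, out_ch, kernel):
    """kernel*kernel*in_ch weights per output channel, plus one bias each."""
    return (kernel * kernel * in_ch + 1) * out_ch

# (layer name, in channels, out channels, assumed kernel size)
layers = [
    ("Conv2d-1", 3, 16, 3), ("Conv2d-2", 16, 24, 3),
    ("Conv2d-4", 24, 36, 3), ("Conv2d-5", 36, 52, 3),
    ("Conv2d-7", 52, 72, 3), ("Conv2d-8", 72, 96, 3),
    ("Conv2d-10", 96, 96, 3), ("Conv2d-11", 96, 96, 3),
    ("ConvTranspose2d-12", 96, 96, 2),
    ("Conv2d-13", 96, 72, 3), ("Conv2d-14", 72, 52, 3),
    ("ConvTranspose2d-15", 52, 52, 2),
    ("Conv2d-16", 52, 36, 3), ("Conv2d-17", 36, 24, 3),
    ("ConvTranspose2d-18", 24, 24, 2),
    ("Conv2d-19", 24, 16, 3), ("Conv2d-20", 16, 5, 3),
]

counts = {name: conv_params(i, o, k) for name, i, o, k in layers}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}\t{n:,}")
print(f"Total\t{total:,}")
```

Under these assumptions every per-layer count agrees with the table, and the sum agrees with the reported total of 465,857 parameters.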

Model statistics

I used cross-entropy as my loss and Adam as my optimizer, with a learning rate of 1e-4 and a weight decay of 1e-4.
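
To make the training setup concrete, here is a small plain-Python sketch of the two ingredients: the cross-entropy loss for a single pixel with 5 class scores, and one Adam update for a single scalar weight. The helper names `cross_entropy` and `adam_step` are made up for this example, and the `beta1`, `beta2`, and `eps` values are the commonly used Adam defaults, which the write-up does not specify. Weight decay is applied here as an L2 penalty added to the gradient, which is also an assumption about how it was used.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target class for one pixel."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def adam_step(param, grad, m, v, t, lr=1e-4, weight_decay=1e-4,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; returns (new_param, m, v)."""
    grad = grad + weight_decay * param           # L2 penalty (assumed form)
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Five equal scores (one per output class) give a loss of ln(5).
loss = cross_entropy([0.0] * 5, target=2)
# First update (t=1) on a hypothetical weight of 1.0 with gradient 0.5.
new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(loss, new_param)
```

On the first step the bias-corrected update is close to the learning rate times the sign of the gradient, so the weight moves by roughly 1e-4 regardless of the gradient's magnitude.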

Training and Validation Loss

Average Precision

Result Example

My segmentation captured the windows and ledges well. However, it clearly did not learn what a wall was (the AP for that class was also very low), and it was not very good at railings either.

Total params	465,857
Trainable params	465,857
Non-trainable params	0
Input size (MB)	0.75
Forward/backward pass size (MB)	89.00
Params size (MB)	1.78
Estimated Total Size (MB)	91.53
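
The size estimates in this table can be reproduced from the layer output shapes and the parameter count. The sketch below assumes 32-bit (4-byte) floats, a 3-channel 256x256 input, and that each layer's output is counted twice (once for the forward activation and once for its gradient). These assumptions are not stated in the write-up, but they are consistent with the reported numbers.

```python
BYTES_PER_FLOAT = 4           # assuming 32-bit floats
MB = 1024 ** 2

input_elems = 3 * 256 * 256   # assumed 3-channel 256x256 input
total_params = 465_857        # "Total params" from the table

# (channels, height = width) of each layer's output, in table order.
layer_outputs = [(16, 256), (24, 256), (24, 128), (36, 128), (52, 128),
                 (52, 64), (72, 64), (96, 64), (96, 32), (96, 32), (96, 32),
                 (96, 64), (72, 64), (52, 64), (52, 128), (36, 128),
                 (24, 128), (24, 256), (16, 256), (5, 256), (5, 256)]
activation_elems = sum(c * s * s for c, s in layer_outputs)

input_mb = input_elems * BYTES_PER_FLOAT / MB
# Each activation is counted twice: forward value plus its gradient.
pass_mb = 2 * activation_elems * BYTES_PER_FLOAT / MB
params_mb = total_params * BYTES_PER_FLOAT / MB
total_mb = input_mb + pass_mb + params_mb

print(f"Input size (MB)\t{input_mb:.2f}")
print(f"Forward/backward pass size (MB)\t{pass_mb:.2f}")
print(f"Params size (MB)\t{params_mb:.2f}")
print(f"Estimated Total Size (MB)\t{total_mb:.2f}")
```

The activations dominate the estimate: the parameters take under 2 MB, while the intermediate feature maps at 256x256 resolution account for most of the 89 MB.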