Classification

Intro

I decided to stick with the recommended structure for the network, although better architectures certainly exist. When testing with a small number of epochs to compare speed of convergence, my results were best with the default settings. I tried changing the learning rate, the number of channels in the second conv layer (increasing it seems to improve results in case studies like VGG-16), the activation function, and regularization, but the default settings converged faster and proved sufficient.

Training

Here are our training and validation accuracies over time.

Results

Accuracy

The accuracies, per class, output by my network are below:

Class #	Label	Accuracy (%)
0	T-shirt/top	78.1
1	Trouser	95.8
2	Pullover	66.3
3	Dress	83.9
4	Coat	85.3
5	Sandal	92.7
6	Shirt	29.1
7	Sneaker	94.0
8	Bag	95.1
9	Ankle boot	93.8
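
As a rough sketch of how a table like the one above could be produced, assuming "accuracy" for a class means the percentage of that class's test images that were predicted correctly (i.e. per-class recall). The function name and the toy labels are illustrative only and are not taken from my training code.

```python
from collections import Counter

# Label names copied from the table above.
LABELS = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
          "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]


def per_class_accuracy(y_true, y_pred, num_classes):
    """Percentage of each class's true samples that were predicted correctly."""
    total = Counter(y_true)  # number of samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Classes with no samples get NaN instead of a division by zero.
    return [100.0 * correct[c] / total[c] if total[c] else float("nan")
            for c in range(num_classes)]


# Toy data, made up for illustration (not the predictions behind the table).
y_true = [0, 0, 6, 6, 6, 6, 1, 1]
y_pred = [0, 6, 6, 2, 4, 0, 1, 1]
acc = per_class_accuracy(y_true, y_pred, len(LABELS))
for c in (0, 1, 6):
    print(f"{c}\t{LABELS[c]}\t{acc[c]:.1f}")
```

With real data, y_true and y_pred would be the test labels and the argmax of the network outputs, flattened to plain lists.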

Success and Failure Cases

Below are two success and two failure cases for my model. The lowest accuracy is on the Shirt class, which makes sense given its heavy visual similarity to Pullovers and Coats.
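
One way to check the similarity explanation rather than eyeball it is to tally what the network predicts for images whose true label is Shirt. This is a minimal sketch with made-up labels; confusions_for is a hypothetical helper, not part of my code.

```python
from collections import Counter


def confusions_for(y_true, y_pred, target):
    """Count the predicted labels for samples whose true label is `target`."""
    return Counter(p for t, p in zip(y_true, y_pred) if t == target)


# Toy labels for illustration only: 6 = Shirt, 0 = T-shirt/top, 2 = Pullover, 4 = Coat.
y_true = [6, 6, 6, 6, 6, 2, 4]
y_pred = [6, 2, 2, 4, 0, 2, 4]
shirt = confusions_for(y_true, y_pred, target=6)
print(shirt.most_common())  # predicted class ids for true Shirts, most frequent first
```

If the explanation holds, most of the off-diagonal mass for class 6 should land on classes 2 and 4.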

Filters

Below are the 32 3x3 filters learned by the first convolutional layer.

Segmentation

Architecture

My model is based on a standard symmetric U-Net architecture. Ideally, we would add skip connections between each pair of corresponding convolution and deconvolution layers, but we can achieve satisfactory results without them. My final architecture is made up of three convolution and three deconvolution layers, followed by a final 1x1 convolution that yields the output labels. Each convolution layer consists of a 3x3 stride-1 convolution (with padding to maintain size), a ReLU activation, and a 2x2 max-pool that downsamples the features by 2x. Each convolution layer also roughly doubles the number of channels (the first goes from 3 to 32). Each deconvolution layer does the opposite with the same structure: a 3x3 deconvolution (with padding to maintain size), a ReLU, a 2x upsample, and a halving of the number of channels (except the last, which stays at 32). The hyperparameters for each layer are detailed in the table below:

Layer #	Layer Type	Kernel Size	Padding	Activation	Input Dims (WxHxC)	Output Dims (WxHxC)
1	Convolution	3x3	1	ReLU	256x256x3	128x128x32
2	Convolution	3x3	1	ReLU	128x128x32	64x64x64
3	Convolution	3x3	1	ReLU	64x64x64	32x32x128
4	Deconvolution	3x3	1	ReLU	32x32x128	64x64x64
5	Deconvolution	3x3	1	ReLU	64x64x64	128x128x32
6	Deconvolution	3x3	1	ReLU	128x128x32	256x256x32
7	Final "Convolution"	1x1	None	None	256x256x32	256x256x5
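
The dimensions in the table can be sanity-checked with a little shape arithmetic. The sketch below is not the model itself; it only traces sizes, assuming stride 1 everywhere, the standard output-size formulas for convolution and transposed convolution, and the channel counts from the table. The helper names are my own for illustration.

```python
def conv_out(size, kernel, padding, stride=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, padding, stride=1):
    """Spatial output size of a transposed convolution: (n - 1) * s - 2p + k."""
    return (size - 1) * stride - 2 * padding + kernel


def trace_shapes(size, channels, encoder, decoder, num_classes):
    """Return (width, height, channels) after every layer, input first."""
    shapes = [(size, size, channels)]
    for out_ch in encoder:   # 3x3 conv (pad 1) + ReLU, then 2x2 max-pool
        size = conv_out(size, kernel=3, padding=1) // 2
        shapes.append((size, size, out_ch))
    for out_ch in decoder:   # 3x3 deconv (pad 1) + ReLU, then 2x upsample
        size = deconv_out(size, kernel=3, padding=1) * 2
        shapes.append((size, size, out_ch))
    size = conv_out(size, kernel=1, padding=0)  # final 1x1 convolution
    shapes.append((size, size, num_classes))
    return shapes


# Channel counts taken from the table above.
shapes = trace_shapes(256, 3, encoder=[32, 64, 128], decoder=[64, 32, 32],
                      num_classes=5)
for i in range(1, len(shapes)):
    print(i, shapes[i - 1], "->", shapes[i])
```

With kernel 3 and padding 1 at stride 1, both formulas leave the spatial size unchanged, so all of the resizing comes from the pooling and upsampling steps.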

I experimented a bit with hyperparameters such as learning rate and regularization, but since my model took a while to train, it was difficult to compare results. Surprisingly, regularization did not seem to improve the final results in this case either. Regardless, the results are satisfactory. I ran an experiment with 100 epochs, but the training and validation losses seemed to converge around epoch 30, so I stopped there to prevent overfitting.
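
I picked the stopping point by looking at the loss curves, but the same decision could be automated with patience-based early stopping. The sketch below is that swapped-in technique, not what I actually ran; the function name, the patience value, and the loss curve are all assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch with the best validation loss, giving up once
    `patience` epochs pass without an improvement of more than `min_delta`."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch   # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                 # no recent improvement: stop
    return best_epoch


# Synthetic validation curve (not my data): improves, then drifts upward.
val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.68, 0.70, 0.72]
stop_at = early_stopping_epoch(val_losses, patience=3)
print(stop_at)
```

In a real training loop the weights from the best epoch would be checkpointed and restored after stopping.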

Training

Here are our training and validation accuracies over time.

Results

Accuracy

The average precision (AP) per class for my network is below:

Class #	Label	AP
0	Others	0.6745672042556656
1	Facade	0.7712309942334139
2	Pillar	0.21752339148767363
3	Window	0.8418040619619384
4	Balcony	0.617856161443973

This puts the mean AP across the five classes at about 0.625 (62.5%).
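
For reference, here is a minimal sketch of how a per-class AP and the mean over classes can be computed. It assumes AP is the usual ranking-based definition (precision averaged over the ranks at which positives occur, treating each class one-vs-rest and ignoring score ties); the exact routine behind the table above may differ. The average_precision helper and the toy scores are illustrative, while the AP values are copied from the table.

```python
def average_precision(y_true, scores):
    """AP for one class: mean of precision@rank over the ranks of the positives.

    y_true holds 0/1 labels (1 = pixel belongs to the class) and scores holds
    the model's confidence for that class. Assumes at least one positive.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank   # precision at this positive's rank
    return total / hits


# Toy one-vs-rest example (made up): positives ranked 1st and 3rd.
toy_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# Per-class APs copied from the table above.
ap = {
    "Others": 0.6745672042556656,
    "Facade": 0.7712309942334139,
    "Pillar": 0.21752339148767363,
    "Window": 0.8418040619619384,
    "Balcony": 0.617856161443973,
}
mean_ap = sum(ap.values()) / len(ap)
print(f"toy AP = {toy_ap:.4f}, mean AP = {mean_ap:.4f}")
```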

Example Output

Below is an example input image, our model output, and the corresponding ground truth:

My model seems to most readily misclassify pixels as balconies, especially in occluded regions. The rest of the classes have reasonable accuracy.