Project 4: Classification and Segmentation

Sean Farhat

cs194-26-afb

Part 1: Image Classification

In this part, we had to set up a Convolutional Neural Network to build a classifier for the Fashion MNIST dataset. We had 60,000 images to train on and 10,000 to test on. For hyperparameter tuning, I split the training set into 48,000 images (80%) for training and 12,000 (20%) for validation. The batch size was 50. It was a nice and gentle introduction to setting up and tuning light neural networks. Here are some examples of an image from each class:

[Example images, one per class: T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot]
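
As a concrete sketch of the data setup described above (this assumes torchvision's built-in FashionMNIST loader and PyTorch's random_split; the actual loading code isn't shown in this report and may differ):

    import torch
    from torch.utils.data import DataLoader, random_split
    from torchvision import datasets, transforms

    # 60,000 training images, split 80/20 into train/validation as described.
    full_train = datasets.FashionMNIST(root="./data", train=True, download=True,
                                       transform=transforms.ToTensor())
    train_set, val_set = random_split(full_train, [48000, 12000])

    # Batch size of 50, as noted above.
    train_loader = DataLoader(train_set, batch_size=50, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=50)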

1.1 Model Architecture and Results

Since this was just a simple classification task on low resolution images, we didn't need a very deep or fancy network. The network architecture was:

1. Convolution (32 channels, 3x3 kernel, padding=1) → ReLU → MaxPool (2x2 kernel, stride=2)
2. Convolution (32 channels, 3x3 kernel, padding=1) → ReLU → MaxPool (2x2 kernel, stride=2)
3. Fully Connected Layer (120 nodes) → ReLU
4. Fully Connected Layer (10 nodes)
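
As a reference, here is a minimal PyTorch sketch of this architecture. The flatten dimension (32·7·7) is my assumption based on 28x28 Fashion MNIST inputs being halved by each max-pool; the actual implementation may differ in minor details.

    import torch.nn as nn

    class FashionNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1),   # layer 1
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),        # 28x28 -> 14x14
                nn.Conv2d(32, 32, kernel_size=3, padding=1),  # layer 2
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),        # 14x14 -> 7x7
            )
            self.classifier = nn.Sequential(
                nn.Linear(32 * 7 * 7, 120),                   # layer 3
                nn.ReLU(),
                nn.Linear(120, 10),                           # layer 4
            )

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))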

For the loss, I used cross-entropy, which is the standard loss for multi-class classification, and for the optimizer, I used Adam with a learning rate of 0.001 and no weight decay. This gave us a good accuracy after training for many epochs.
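
A minimal sketch of this training setup, using the FashionNet sketch above (`train_loader` comes from the earlier data-loading sketch):

    import torch
    import torch.nn as nn

    model = FashionNet()
    criterion = nn.CrossEntropyLoss()   # standard multi-class loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # no weight decay

    for images, labels in train_loader:   # one training epoch
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()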

1.2 Qualitative Results

To get a better idea of how good our model is, let's look at the per-class classification accuracy:


Class         Validation Accuracy   Test Accuracy
T-shirt/top   0.881                 0.891
Trouser       0.986                 0.985
Pullover      0.870                 0.881
Dress         0.902                 0.890
Coat          0.853                 0.854
Sandal        0.976                 0.969
Shirt         0.757                 0.742
Sneaker       0.972                 0.982
Bag           0.984                 0.981
Ankle boot    0.950                 0.948

Overall, the model classified them pretty well, though it's worth noting that it had trouble with the "Shirt" class. I suspect this is because it is quite a general category and is therefore susceptible to false positives (such as t-shirts or pullovers being classified as shirts). Below you can see, for each class, a correctly classified example and two misclassified ones.


Class         Misclassified as   Misclassified as
T-shirt/top   Bag                Shirt
Trouser       Dress              Pullover
Pullover      Coat               Dress
Dress         Bag                Shirt
Coat          Dress              Shirt
Sandal        Ankle boot         Sneaker
Shirt         Pullover           T-shirt/top
Sneaker       Ankle boot         Sandal
Bag           Dress              T-shirt/top
Ankle boot    Sneaker            Sneaker

[Images omitted: each row originally showed a correctly classified example alongside the two misclassified examples.]

1.3 Visualizing the Learned Filters

Sometimes, it is informative to look at the learned filters. At the lower levels of a convolutional network, the filters usually correspond to low-level features such as edges and blobs, whereas filters at later levels respond to higher-level, more abstract patterns. Since our first layer has 32 channels, we have 32 filters that we can look at.
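
One way to pull these out (a sketch assuming the FashionNet model from earlier, whose first conv layer is `features[0]`; the grid layout is arbitrary):

    import matplotlib.pyplot as plt

    # First-layer weights have shape (32, 1, 3, 3): 32 filters, 1 input channel.
    weights = model.features[0].weight.detach().cpu()

    fig, axes = plt.subplots(4, 8, figsize=(8, 4))
    for ax, w in zip(axes.flat, weights):
        ax.imshow(w[0], cmap="gray")   # each 3x3 filter as a grayscale image
        ax.axis("off")
    plt.show()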

Part 2: Semantic Segmentation

Semantic segmentation is a much harder task than classification: we now need to label different parts of an image with different classes. This is difficult because the network must not only decide "there's an object of interest in this part of the image," but also assign that part to a class. This combination of "what" and "where" requires a more robust network. While semantic segmentation is commonly seen in applications such as self-driving cars, we were interested in segmenting building facades from the Mini Facade dataset, labeling each part of a building face as one of

  1. Balconies (red)
  2. Windows (orange)
  3. Pillars (green)
  4. Facades (blue)
  5. Other (black)
Here's an example of an image and what our model is expected to do:

[Example facade image and its ground-truth segmentation]


2.1 Model Architecture and Results

This time, we unfortunately had much less training data. Out of 907 training examples, 86 (~9.5%) were used for validation, as I prioritized having more data to train the actual network over hyperparameter tuning. The batch size was 3. As the task here is more complex, we need a deeper network. My network architecture was:

1. Convolution (64 channels, 3x3 kernel, padding=1) → ReLU
2. Convolution (128 channels, 3x3 kernel, padding=1) → ReLU
3. Convolution (256 channels, 3x3 kernel, padding=1) → ReLU
4. Convolution (256 channels, 3x3 kernel, padding=1) → ReLU
5. Convolution (64 channels, 3x3 kernel, padding=1) → ReLU
6. Convolution (5 channels, 1x1 kernel)
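
As a reference, a minimal PyTorch sketch of this fully convolutional network (assuming 3-channel RGB input and the 5 output classes listed above):

    import torch.nn as nn

    seg_net = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(256, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 5, kernel_size=1),   # per-pixel scores for the 5 classes
    )

Because every layer preserves spatial resolution, the output is a 5-channel score map the same size as the input, with no upsampling needed.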

A couple of notes on my design decisions: I couldn't make the network too deep (more than the 6 suggested layers) since I don't think we had enough data to train on. This was evident from the fact that overfitting did not occur over the 50 epochs of training: the validation loss kept decreasing. In addition, downsampling is usually done to get an encoding (as in U-Net, for example), followed by upsampling to decode, but I don't think there was enough data to learn such an encoder-decoder model. If I had more time/resources/training data, I would probably explore implementing U-Net, FCN-VGG16, or even a pretrained model with transfer learning. However, I was interested to see if I could get a simple network, without any fancy layers, to get the job done.

I trained my model for 50 epochs, using cross-entropy loss and the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.00001.
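
A sketch of this configuration (`seg_net` is the model sketch above; `train_loader` is a hypothetical DataLoader over the facade images). Conveniently, PyTorch's cross-entropy loss accepts (N, C, H, W) logits and (N, H, W) integer label maps directly, so per-pixel classification needs no reshaping:

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(seg_net.parameters(), lr=1e-3, weight_decay=1e-5)

    for images, labels in train_loader:   # labels: (N, H, W) class indices
        optimizer.zero_grad()
        logits = seg_net(images)          # (N, 5, H, W) per-pixel scores
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()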

2.2 Average Precision

Similar to Part 1, it is more informative to look at a per-class metric; the analogue here is the Average Precision (AP) for each category. Below are the results, followed by a sketch of how AP can be computed:

Class     AP
Other     0.613
Facade    0.700
Pillar    0.109
Window    0.760
Balcony   0.359
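
A sketch of the per-class AP computation (hypothetical shapes: `probs` is the (N, 5, H, W) softmax output of the network and `labels` the (N, H, W) ground truth, both as numpy arrays; the course-provided evaluation code may differ):

    import numpy as np
    from sklearn.metrics import average_precision_score

    def per_class_ap(probs, labels, n_classes=5):
        # Treat each class as a binary, pixel-wise detection problem.
        aps = []
        for c in range(n_classes):
            y_true = (labels == c).ravel().astype(int)   # 1 where pixel is class c
            y_score = probs[:, c].ravel()                # predicted prob of class c
            aps.append(average_precision_score(y_true, y_score))
        return aps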

The mean AP over the entire test set was 0.508. As we can see, pillars were the toughest to segment correctly, followed by balconies. This can be seen in our results on some of the test images:

[Test images shown with ground truth and model prediction side by side]
In addition, we can see how it works on some other images:

[Additional facade images and the model's predicted segmentations]
The model does a great job classifying windows and balconies, as seen in the second image, but a poor job on pillars, as seen on the facade of Sproul Hall.

Conclusion

Overall, it was a cool project, but nothing too fascinating, as I've worked with CNNs before. I think I'd be more interested in trying to create a GAN instead, since those results are cooler, for lack of a better word, or even doing something like style transfer. It has definitely inspired what I might attempt for my final project in the course!