CS194-26: Image Manipulation and Computational Photography

Spring 2020

Project 4: Classification and Segmentation

Kamyar Salahi

Overview

This project leverages the magic of Convolutional Neural Networks to both classify images and segment them.


Image classification refers to the ability to determine what an image depicts. Humans are able to look at a photograph of a dog and

identify it as a dog. In our case, we will use a convolutional neural network to perform this task.


Image segmentation refers to the ability to determine which pixels in an image belong to which class. For example, the

pixels corresponding to a child can be marked as child by a segmentation network.


What is a CNN?

A CNN is a type of deep neural network built around the mathematical operation known as convolution. CNNs are

generally used for image data. Due to their large number of parameters, they are often prone to overfitting. To help avoid this, a procedure known as

batch normalization is often employed in CNNs. Batch normalization improves performance and training time by normalizing the

inputs to each layer. We will employ several different layers to optimize our network.
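As a small illustrative sketch (not part of either network below), PyTorch's `nn.BatchNorm2d` standardizes each channel using the statistics of the current batch:

```python
import torch
import torch.nn as nn

# Batch norm (in training mode) standardizes each channel using the
# mean and variance computed over the current batch.
bn = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 28, 28) * 5 + 10   # far from zero mean / unit variance
y = bn(x)
print(round(y.mean().item(), 3), round(y.std().item(), 3))  # ~0.0 and ~1.0
```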

Part 1: Image Classification

In this project, we will be classifying the images in the Fashion MNIST Dataset. This is a dataset of 60,000 28×28 grayscale images of

clothing and accessories.







From Left to Right:

Ankle Boot (9), Ankle Boot (9), Dress (3), Coat (4)



CNN Architecture:

1. Convolutional Layer to go from one grayscale channel to 32 channels with a kernel size of 5

2. ReLU

3. Max Pooling with a kernel size of 2 and stride of 2

4. Convolutional Layer to go from 32 channels to 32 channels with a kernel size of 5

5. ReLU

6. Max Pooling with a kernel size of 2 and stride of 2

7. Fully Connected Network Connecting 512 nodes to 120 nodes

8. ReLU

9. Fully Connected Network Connecting 120 nodes to 10 nodes
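
The nine layers above can be written as a single PyTorch module. This is a sketch based on my reading of the list (the class name `FashionCNN` is just illustrative); note that two 5×5 convolutions and two 2×2 poolings take a 28×28 input down to 32 × 4 × 4 = 512 features, which is where the 512-node fully connected layer comes from:

```python
import torch
import torch.nn as nn

class FashionCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5),        # 1x28x28 -> 32x24x24
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # -> 32x12x12
            nn.Conv2d(32, 32, kernel_size=5),       # -> 32x8x8
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # -> 32x4x4 = 512 features
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 4 * 4, 120),  # 512 -> 120
            nn.ReLU(),
            nn.Linear(120, 10),          # 120 -> 10 classes
        )

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```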


No Batch Normalization was used here since I was able to get decent results without it.

I used Adam as the optimizer with a learning rate of 0.001 and CrossEntropyLoss as the loss function. The network was trained for

60 iterations with a batch size of 100.
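The training setup can be sketched as follows; to keep the example self-contained and offline, a toy linear model and a single random batch stand in for the CNN and the Fashion MNIST data loader (both are placeholders, not the actual experiment):

```python
import torch
import torch.nn as nn

# Toy stand-in for the classifier; the optimizer/loss wiring is what matters.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# One fake batch of 100 grayscale 28x28 images with random labels.
images = torch.randn(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))

loss_history = []
for step in range(20):  # the real run used 60 iterations over real batches
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    loss_history.append(loss.item())
```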

It appears that the shirt and T-shirt were consistently the hardest to classify correctly. The coat and the pullover also gave the network some trouble.

This is most likely because these categories are all quite similar to one another. This is evident below: the misclassified

pictures are (for the most part) ambiguous enough to confuse even a human.

Loss and Accuracy:

Learned Filters:

Layer 1:

Layer 2:

Part 2: Semantic Segmentation

Here we will create a network to semantically segment an image. This means that we will give every pixel a classification.


U-Net essentially first downsamples an image and then subsequently upsamples it. This implementation is quite similar to U-Net in that it

downsamples then upsamples, but it uses a residual block rather than standard convolutional layers. This improves both performance and training time.

Architecture:

1. Convolutional Layer to go from one grayscale channel to 256 channels with a kernel size of 5 and padding of 2

2. Batch Normalization

3. ReLU

4. Max Pooling with a kernel size of 3, stride of 2, and padding of 1

5. Convolutional Layer to go from 256 channels to 128 with a kernel size of 1

6. Batch Normalization

7. ReLU

8. Convolutional Layer to go from 128 channels to 128 with a kernel size of 5, padding of 2, and dilation of 1

9. Batch Normalization

10. ReLU

11. Convolutional Layer to go from 128 channels to 256 with a kernel size of 1

12. Batch Normalization, plus a residual (skip) connection adding the output of layer 4

13. ReLU

14. Convolutional Layer to go from 256 channels to 512 with a kernel size of 5 and padding of 2

15. Batch Normalization

16. ReLU

17. Convolutional Transpose Layer to upsample from 512 channels to 256 with a kernel size of 3, stride of 2, output padding of 1, and padding of 1

18. Batch Normalization

19. ReLU

20. Convolutional Layer to go from 256 channels to 5 classification channels with a kernel size of 1
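
The 20 steps above can be written as a single PyTorch module. This is a sketch under my reading of the list (`SegNet` is just an illustrative name): steps 5–12 form a 1×1 → 5×5 → 1×1 bottleneck whose output is added back to the max-pooled features, and the single transposed convolution undoes the single 2× downsampling, so the output matches the input resolution:

```python
import torch
import torch.nn as nn

class SegNet(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        # Steps 1-4: initial conv + 2x downsample.
        self.stem = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=5, padding=2),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Steps 5-12: bottleneck residual branch (1x1 -> 5x5 -> 1x1).
        self.res = nn.Sequential(
            nn.Conv2d(256, 128, kernel_size=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=5, padding=2, dilation=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=1),
            nn.BatchNorm2d(256),
        )
        # Steps 14-19: widen, then transposed conv back to input resolution.
        self.up = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=5, padding=2),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.ConvTranspose2d(512, 256, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
        )
        # Step 20: per-pixel class scores.
        self.head = nn.Conv2d(256, n_classes, kernel_size=1)

    def forward(self, x):
        x = self.stem(x)
        x = torch.relu(self.res(x) + x)  # steps 12-13: residual add, then ReLU
        return self.head(self.up(x))
```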

I used a batch size of 16, a learning rate of 1e-4, and a weight decay of 1e-4, with Adam as the optimizer and cross-entropy loss as the loss function. I trained for 233 epochs.

Loss:

Results:

AP Values

Others: 0.683

Facade: 0.776

Pillar: 0.149

Window: 0.845

Balcony: 0.576

Average: 0.600

Image from my collection

Segmentation

To compare my results with those of state-of-the-art semantic segmentation, I also created and trained an implementation of DeepLab that utilizes a ResNet-101 backbone pretrained on ImageNet.

Original Image from Test Set

Original

ResNet-101

Ground Truth

Original Image from Test Set

My Design

Ground Truth

AP Values For ResNet-101 Implementation

Others: 0.783

Facade: 0.842

Pillar: 0.441

Window: 0.903

Balcony: 0.821

Average: 0.757

Evidently, windows and facades are the easiest to find, and pillars and balconies are the hardest. Looking at the test data, it becomes clear why this is the case, as some cases are again difficult for even humans to distinguish.

Conclusion

Although this project was a lot of work, I really enjoyed experimenting with different designs and reading machine learning papers that discuss various

approaches to this problem. Given more layers and more compute, I'm confident I could train a neural network to 0.80+ AP.