CS194-26: Image Manipulation and Computational Photography
Spring 2020
Project 4: Classification and Segmentation
Kamyar Salahi
Overview
This project leverages the magic of Convolutional Neural Networks (CNNs) to correctly classify images as well
as segment them.
Image classification refers to the ability to determine what an image depicts. Humans can look at a photograph of a dog and
identify it as a dog; in our case, we will use a convolutional neural network to perform this task.
Image segmentation refers to the ability to determine which pixels in an image belong to which class. For example, the
pixels corresponding to a child can be marked as "child" by a segmentation network.
What is a CNN?
A CNN is a type of deep neural network that leverages the mathematical operation known as convolution. CNNs are
generally used for image data. Due to their depth and large number of parameters, they are often prone to overfitting. To help avoid this, a procedure known as
batch normalization is often employed in CNNs. Batch normalization improves performance and training time by normalizing the
inputs to each layer. We will employ several different layers to optimize our network.
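For example, assuming PyTorch as the framework (which the Adam and CrossEntropyLoss references below suggest), a batch-normalized convolutional block might look like this sketch; the channel counts here are arbitrary:

```python
import torch.nn as nn

# A convolutional block with batch normalization: BatchNorm2d normalizes each
# channel's activations over the batch before the nonlinearity, which tends to
# stabilize and speed up training. The layer sizes are illustrative only.
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5),
    nn.BatchNorm2d(32),
    nn.ReLU(),
)
```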
Part 1: Image Classification
In this project, we will be classifying the images in the Fashion-MNIST dataset, a set of 60,000 28×28 grayscale images of
clothing and accessories.
From Left to Right:
Ankle Boot (9), Ankle Boot (9), Dress (3), Coat (4)
CNN Architecture:
1. Convolutional Layer to go from one grayscale channel to 32 channels with a kernel size of 5
2. ReLU
3. Max Pooling with a kernel size of 2 and stride of 2
4. Convolutional Layer to go from 32 channels to 32 channels with a kernel size of 5
5. ReLU
6. Max Pooling with a kernel size of 2 and stride of 2
7. Fully Connected Layer connecting 512 nodes to 120 nodes
8. ReLU
9. Fully Connected Layer connecting 120 nodes to 10 nodes (one per class)
No Batch Normalization was used here since I was able to get decent results without it.
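A minimal PyTorch sketch of this architecture (the class name and the 28×28 input size annotations are mine) might look like:

```python
import torch.nn as nn

class FashionClassifier(nn.Module):
    """CNN for 28x28 grayscale Fashion-MNIST images, following the layers above."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5),        # 1. 1 -> 32 channels, 28x28 -> 24x24
            nn.ReLU(),                              # 2.
            nn.MaxPool2d(kernel_size=2, stride=2),  # 3. 24x24 -> 12x12
            nn.Conv2d(32, 32, kernel_size=5),       # 4. 32 -> 32 channels, 12x12 -> 8x8
            nn.ReLU(),                              # 5.
            nn.MaxPool2d(kernel_size=2, stride=2),  # 6. 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                           # 32 channels * 4 * 4 = 512 features
            nn.Linear(512, 120),                    # 7.
            nn.ReLU(),                              # 8.
            nn.Linear(120, 10),                     # 9. one logit per class
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```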
I used Adam as the optimizer with a learning rate of 0.001 and cross-entropy loss as the loss function. The network was trained for
60 iterations with a batch size of 100.
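A sketch of the corresponding training setup; the Fashion-MNIST data loading shown here is an assumption, since only the optimizer, loss, and batch size are stated above:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Fashion-MNIST training data with a batch size of 100 (loader details assumed).
train_set = datasets.FashionMNIST(root="./data", train=True, download=True,
                                  transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=100, shuffle=True)

model = FashionClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```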
It appears that the shirt and T-shirt classes were consistently the hardest to distinguish. The coat and the pullover also gave the network some trouble.
This is most likely because these categories are all quite similar to one another, as is evident below: most of the misclassified
images are ambiguous enough to confuse a human.
Loss and Accuracy:
Learned Filters:
Layer 1:
Layer 2:
Part 2: Semantic Segmentation
Here we will be creating a network to semantically segment an image, meaning every pixel is assigned a class label.
Like U-Net, this implementation first downsamples the image and then upsamples it, but it uses a residual block rather than standard
convolutional layers, which improves both performance and training time.
Architecture:
1. Convolutional Layer to go from one grayscale channel to 256 channels with a kernel size of 5 and padding of 2
2. Batch Normalization
3. ReLU
4. Max Pooling with a kernel size of 3, stride of 2, and padding of 1
5. Convolutional Layer to go from 256 channels to 128 with a kernel size of 1
6. Batch Normalization
7. ReLU
8. Convolutional Layer to go from 128 channels to 128 with a kernel size of 5, padding of 2, and dilation of 1
9. Batch Normalization
10. ReLU
11. Convolutional Layer to go from 128 channels to 256 with a kernel size of 1
12. Batch Normalization, then add the output of layer 4 (residual skip connection)
13. ReLU
14. Convolutional Layer to go from 256 channels to 512 with a kernel size of 5 and padding of 2
15. Batch Normalization
16. ReLU
17. Convolutional Transpose Layer to upsample from 512 channels to 256 with a kernel size of 3, stride of 2, output padding of 1, and padding of 1
18. Batch Normalization
19. ReLU
20. Convolutional Layer to go from 256 channels to 5 classification channels with a kernel size of 1
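A minimal PyTorch sketch of this architecture (the class and module names are mine; the single input channel follows the description in step 1):

```python
import torch.nn as nn
import torch.nn.functional as F

class FacadeSegNet(nn.Module):
    """Downsample-then-upsample segmentation net with one residual block (sketch)."""
    def __init__(self, n_classes=5):
        super().__init__()
        # Steps 1-4: initial convolution and downsampling.
        self.stem = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=5, padding=2),  # 1 input channel, per step 1
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Steps 5-12: bottleneck residual block (1x1 -> 5x5 -> 1x1 convolutions).
        self.residual = nn.Sequential(
            nn.Conv2d(256, 128, kernel_size=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=5, padding=2, dilation=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=1),
            nn.BatchNorm2d(256),
        )
        # Steps 14-20: widen, upsample back to input resolution, and classify.
        self.head = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=5, padding=2),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.ConvTranspose2d(512, 256, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Conv2d(256, n_classes, kernel_size=1),
        )

    def forward(self, x):
        skip = self.stem(x)                          # output of step 4
        x = F.relu(self.residual(skip) + skip)       # steps 12-13: add skip, then ReLU
        return self.head(x)                          # per-pixel class scores
```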
I used a batch size of 16, a learning rate of 1e-4, and a weight decay of 1e-4, with Adam as the optimizer and cross-entropy loss as the loss function. I trained for 233 epochs.
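As a sketch, the corresponding optimizer and loss setup might look like this; the facade data loader is an assumption:

```python
import torch
import torch.nn as nn

model = FacadeSegNet(n_classes=5)
# CrossEntropyLoss is applied per pixel: logits (N, 5, H, W) vs. labels (N, H, W).
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

for epoch in range(233):
    for images, labels in train_loader:  # batch size 16; facade loader assumed
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```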
Loss:
Results:
AP Values
Others: 0.683
Facade: 0.776
Pillar: 0.149
Window: 0.845
Balcony: 0.576
Average: 0.600
Image from my collection
Segmentation
To compare my results with those of state-of-the-art semantic segmentation, I also created and trained an implementation of DeepLab that uses a ResNet-101 backbone pretrained on ImageNet.
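One way to set up such a model, not necessarily the exact implementation trained here, is torchvision's DeepLabV3 with an ImageNet-pretrained ResNet-101 backbone and a fresh 5-class head:

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet101

# pretrained=False leaves the DeepLab head untrained, while the ResNet-101
# backbone is still initialized with ImageNet weights (the default); the head
# is built with 5 output channels, one per facade class.
model = deeplabv3_resnet101(pretrained=False, num_classes=5)

model.eval()                            # inference mode for this shape check
images = torch.randn(1, 3, 256, 256)    # dummy RGB batch (size is arbitrary)
logits = model(images)["out"]           # (1, 5, 256, 256) per-pixel class scores
```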
Original Image from Test Set
Original
ResNet-101
Ground Truth
Original Image from Test Set
My Design
Ground Truth
AP Values For ResNet-101 Implementation
Others: 0.783
Facade: 0.842
Pillar: 0.441
Window: 0.903
Balcony: 0.821
Average: 0.757
Evidently, windows and facades are the easiest to find, and pillars and balconies are the hardest. Looking at the test data, it becomes clear why this is the case, as some examples are again difficult even for humans to distinguish.
Conclusion
Although this project was a lot of work, I really enjoyed experimenting with different designs and reading machine learning papers that discuss the various
approaches to this problem. Given more layers and more compute, I'm confident I could train a neural network to 0.80+ AP.