Project 4 - Classification and Segmentation

Mark Presten - cs194-26-ada

Part 1 - Image Classification

For the first part of this assignment, we want to create a convolutional neural network (CNN) to identify 10 different clothing items. To do so, we train our network on the FashionMNIST dataset, which contains 60,000 labeled training images. The classes are T-Shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle Boot.

CNN Architecture

The architecture of my CNN has two convolutional layers as well as two fully connected layers. Both convolutional layers output 32 channels, and each is followed by a ReLU and then a MaxPool. After these two conv-ReLU-MaxPool blocks, I apply the two fully connected layers, with a ReLU between them.
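As a rough illustration, here is a minimal PyTorch sketch of this architecture. The 3x3 kernel size matches the filter visualizations shown later; the padding, the 2x2 pooling window, and the width of the first fully connected layer (128) are assumptions I am filling in, not values from this write-up.

```python
import torch.nn as nn

class FashionCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(32, 32, kernel_size=3, padding=1),  # 14x14 -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 7 * 7, 128),  # hidden width of 128 is an assumed value
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)
```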

I used CrossEntropyLoss as my loss function and Adam as my optimizer with a learning rate of 0.002 and a weight decay of 0. I trained for 5 epochs with a batch size of 100.
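The training setup could look roughly like the following sketch, reusing the FashionCNN class from the snippet above; the torchvision FashionMNIST loader and the plain ToTensor transform are assumptions about the data pipeline.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# FashionMNIST training data (download location and transform are assumptions)
transform = transforms.ToTensor()
train_set = datasets.FashionMNIST("./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=100, shuffle=True)

model = FashionCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.002, weight_decay=0)

for epoch in range(5):
    epoch_loss = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"epoch {epoch + 1}: mean loss {epoch_loss / len(train_loader):.4f}")
```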

CNN Results by Class

On the 10,000-image test set, my network was 89% accurate. Below is the accuracy per class. As you can see, the Shirt class is the hardest to identify, followed by the Coat class. This is likely because several other classes share characteristics with the Shirt and Coat classes. For example, since the Shirt class encompasses long-sleeve shirts, an image of a shirt, a coat, a pullover, or even a t-shirt can all have roughly the same silhouette. Because the Shirt and Coat classes overlap so much with other classes, they have the lowest accuracy.
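The per-class accuracies below could be computed with a sketch like this one, assuming a test_loader built the same way as the training loader and the trained model from the earlier snippets.

```python
import torch

class_names = ["T-Shirt", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle Boot"]
correct = torch.zeros(10)
total = torch.zeros(10)

model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images).argmax(dim=1)
        for cls in range(10):
            mask = labels == cls           # all test images of this class
            total[cls] += mask.sum()
            correct[cls] += (preds[mask] == cls).sum()

for cls, name in enumerate(class_names):
    print(f"{name}: {100 * correct[cls] / total[cls]:.1f}%")
```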

Accuracy per Class


Loss Graphs

Below are graphs of the loss per iteration and the loss per epoch. I include both to show that the loss generally trends downward from one epoch to the next. However, the per-iteration graph on the left shows that the loss still spikes occasionally, which is normal during training.
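The two plots could be produced with a small matplotlib sketch like this, assuming the per-iteration losses and per-epoch means were recorded during training in the hypothetical lists iter_losses and epoch_losses.

```python
import matplotlib.pyplot as plt

# iter_losses and epoch_losses are hypothetical lists filled in during training
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(iter_losses)
ax1.set_xlabel("iteration")
ax1.set_ylabel("loss")
ax1.set_title("Loss per Iteration")
ax2.plot(range(1, len(epoch_losses) + 1), epoch_losses)
ax2.set_xlabel("epoch")
ax2.set_ylabel("loss")
ax2.set_title("Loss per Epoch")
plt.show()
```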

Loss per Iteration


Loss per Epoch


Example Output Images

Here you can see two examples from each class where the network correctly classified the image, shown on the left. On the right are two misclassifications, where the network predicted the given class but the image actually belongs to a different class. Visualizing results this way helps us interpret why some classes have lower accuracy and why particular objects were misclassified, whether because of their shape, texture, or other features of the image.
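One way to collect these grids is sketched below, storing up to two correct and two misclassified test images under each predicted class; model and test_loader are assumed from the earlier sketches.

```python
import torch

correct_examples = {cls: [] for cls in range(10)}
wrong_examples = {cls: [] for cls in range(10)}

model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images).argmax(dim=1)
        for img, label, pred in zip(images, labels, preds):
            pred, label = pred.item(), label.item()
            if pred == label and len(correct_examples[pred]) < 2:
                correct_examples[pred].append(img)
            elif pred != label and len(wrong_examples[pred]) < 2:
                # stored under the *predicted* class, i.e. "misclassified as"
                wrong_examples[pred].append(img)
```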

Correctly Classified as T-SHIRT

Misclassified as T-SHIRT

Correctly Classified as TROUSER

Misclassified as TROUSER

Correctly Classified as PULLOVER

Misclassified as PULLOVER

Correctly Classified as DRESS

Misclassified as DRESS

Correctly Classified as COAT

Misclassified as COAT

Correctly Classified as SANDAL

Misclassified as SANDAL

Correctly Classified as SHIRT

Misclassified as SHIRT

Correctly Classified as SNEAKER

Misclassified as SNEAKER

Correctly Classified as BAG

Misclassified as BAG

Correctly Classified as ANKLE BOOT

Misclassified as ANKLE BOOT

Visualizing Learned Filters

As mentioned before, my CNN uses two convolutional layers. Below are visualizations of the 3x3 kernels from the first and second layers. Because each layer has 32 output channels, there are 32 learned filters to visualize per layer.
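The filter grids can be produced by reading the convolution weights directly off the trained model, roughly as in this sketch; the layer indices assume the nn.Sequential layout from the architecture sketch above, and averaging the second layer's filters over their input channels is my own choice for display.

```python
import matplotlib.pyplot as plt

def show_filters(conv_layer, title):
    weights = conv_layer.weight.data  # shape: (out_channels, in_channels, 3, 3)
    fig, axes = plt.subplots(4, 8, figsize=(8, 4))
    fig.suptitle(title)
    for i, ax in enumerate(axes.flat):
        # average over input channels so each filter displays as a single 3x3 image
        ax.imshow(weights[i].mean(dim=0).cpu(), cmap="gray")
        ax.axis("off")
    plt.show()

show_filters(model.features[0], "Layer 1 Learned Filters")
show_filters(model.features[3], "Layer 2 Learned Filters")
```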

Layer 1 Learned Filters


Layer 2 Learned Filters


Part 2 - Semantic Segmentation

In Part 2 of this project, we want to assign a label to every pixel of an image, where each label corresponds to a different feature. For this assignment, we specifically want to identify features of a building: facade, pillar, window, balcony, and other. To train our neural network, we use the Mini Facade dataset.

CNN Architecture

For my CNN, I used 6 convolutional layers, each with a kernel size of 3. The layers take in 3, 64, 128, 128, 64, and 64 channels respectively. Each layer is followed by a ReLU, except that the last two layers are separated by a MaxPool instead. I trained the network over the entire training set, using CrossEntropyLoss as my loss function and Adam as my optimizer with a learning rate of 1e-3 and a weight decay of 1e-5, for 10 epochs.
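A minimal PyTorch sketch of this network is below. The input channel counts follow the description above; the 5 output channels (one per class), the padding of 1, and the final bilinear upsample back to the input resolution (so the per-pixel scores match the label size after the MaxPool) are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class FacadeNet(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.MaxPool2d(2),  # the MaxPool separating the last two layers
            nn.Conv2d(64, num_classes, kernel_size=3, padding=1),
        )

    def forward(self, x):
        out = self.convs(x)
        # upsample so the per-pixel scores match the label resolution (assumption)
        return F.interpolate(out, size=x.shape[-2:], mode="bilinear", align_corners=False)
```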

Below is a graph of my loss per epoch during training. As we would expect, and similar to my results from Part 1, the loss decreases from one epoch to the next.

The loss for each of the 10 epochs was:

Epoch 1: 1.0970
Epoch 2: 1.0106
Epoch 3: 0.9793
Epoch 4: 0.9596
Epoch 5: 0.9460
Epoch 6: 0.9278
Epoch 7: 0.9116
Epoch 8: 0.9073
Epoch 9: 0.9019
Epoch 10: 0.8902

Loss per Epoch


CNN Results

My overall average precision (AP) was 0.517, but performance varied widely across the different features. The network was best at labeling facade and window pixels, and worst at labeling pillar pixels. This could be because pillars closely resemble the facade of a building. See below for the average precision for each class.

AP for Other: 0.665712

AP for Facade: 0.754356

AP for Pillar: 0.108690

AP for Window: 0.755854

AP for Balcony: 0.302059
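These per-class values could be computed along the lines of the following sketch, which treats each pixel as a sample and scores one class at a time with scikit-learn; the validation loader, the class ordering, and the use of softmax scores are assumptions, and the course's own evaluation code may differ.

```python
import numpy as np
import torch
from sklearn.metrics import average_precision_score

class_names = ["Other", "Facade", "Pillar", "Window", "Balcony"]
scores, targets = [], []

model.eval()
with torch.no_grad():
    for images, labels in val_loader:
        probs = torch.softmax(model(images), dim=1)               # (N, 5, H, W)
        scores.append(probs.permute(0, 2, 3, 1).reshape(-1, 5).cpu().numpy())
        targets.append(labels.reshape(-1).cpu().numpy())
scores = np.concatenate(scores)
targets = np.concatenate(targets)

for cls, name in enumerate(class_names):
    ap = average_precision_score((targets == cls).astype(int), scores[:, cls])
    print(f"AP for {name}: {ap:.6f}")
```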

CNN Example

Here is an example of the CNN's output on an image of a building. As we can see, it does worst at recognizing pillars, but it is pretty good with respect to the facade and the windows! Because there is no balcony in the image, we should not see any red. However, because of some features of the windows, my network labeled some window areas as balcony pixels. The black areas of the image correspond to the "other" class, and I think the network did pretty well there.
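The segmented image below could be rendered from the network output roughly as follows; the class-to-color palette is hypothetical apart from black for "other" and red for balcony, which the text mentions, and image_tensor stands for the preprocessed input image.

```python
import numpy as np
import torch

# hypothetical palette: other=black, facade=blue, pillar=green, window=orange, balcony=red
palette = np.array([[0, 0, 0], [0, 0, 255], [0, 255, 0],
                    [255, 165, 0], [255, 0, 0]], dtype=np.uint8)

model.eval()
with torch.no_grad():
    # image_tensor is an assumed (3, H, W) float tensor for one building image
    pred = model(image_tensor.unsqueeze(0)).argmax(dim=1)[0]  # (H, W) class ids
segmented = palette[pred.cpu().numpy()]                       # (H, W, 3) RGB image
```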

Original Image


Segmented Image
