CS 194 Classification and Segmentation

Part 1: Image Classification

Dataset

I built a CNN in PyTorch to classify images of fashion items, using Fashion MNIST as the dataset. Fashion MNIST has 10 classes, with 60000 images for training and validation and 10000 images for testing. Here are some images from the dataset:

Bag, Ankle boot, Pullover, Coat

Implementation Details

The network has two convolutional layers, each with 40 channels and a 3x3 kernel. Each convolutional layer is followed by a ReLU and then a max pool with a 2x2 kernel and a stride of 2. These are followed by two fully connected layers with output sizes 120 and 10 respectively. The first fully connected layer is followed by a ReLU, but the second is not. I used cross entropy loss as the prediction loss and Adam as the optimizer, with a learning rate of 1e-3 and a weight decay of 1e-5.

Network: input -> conv -> relu -> pool -> conv -> relu -> pool -> fc1 -> relu -> fc2 -> outputs

Optimizer: Adam (lr=1e-3, weight_decay=1e-5)
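The architecture and optimizer above can be sketched in PyTorch roughly as follows. This is a minimal sketch, not the exact code used for the results: the class name `FashionCNN` is hypothetical, and it assumes the convolutions use no padding, which gives a 5x5 feature map (28 -> 26 -> 13 -> 11 -> 5) going into the first fully connected layer.

```python
import torch
import torch.nn as nn


class FashionCNN(nn.Module):
    """Sketch of the classifier described above (the name is hypothetical)."""

    def __init__(self, channels: int = 40, num_classes: int = 10):
        super().__init__()
        # Assumption: no padding, so 28x28 -> 26 -> 13 -> 11 -> 5 spatially.
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.fc1 = nn.Linear(channels * 5 * 5, 120)
        self.fc2 = nn.Linear(120, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        x = torch.relu(self.fc1(x))
        # No ReLU after fc2: CrossEntropyLoss expects raw logits.
        return self.fc2(x)


model = FashionCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

# Sanity check on a dummy batch of eight 28x28 grayscale images.
logits = model(torch.zeros(8, 1, 28, 28))
print(logits.shape)  # torch.Size([8, 10])
```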

I tried different activation functions (sigmoid, ELU), numbers of channels (16, 32, 64), and convolution kernel sizes (5x5), as well as other learning rates (1e-2, 1e-4), optimizers (SGD), and weight decays (0, 1e-4, 1e-6), and got the best results with the structure detailed above.

Results

The network is trained on 80% of the entire training set, and the rest is used for validation. I calculated accuracies after 100, 200, 500, 1000, 2000, 5000, 12000, 24000, and 48000 training samples.
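One way to produce an 80/20 split like this is `torch.utils.data.random_split`. The sketch below is illustrative only: it uses a stand-in dataset of 60000 indices rather than the real images, and the fixed seed is an assumption made so the split is reproducible.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in for the 60000-image training set (indices only, not real images).
full_train = TensorDataset(torch.arange(60000))

# 80% for training, the remainder for validation.
n_train = len(full_train) * 8 // 10
n_val = len(full_train) - n_train

# Assumption: a fixed seed, so the same split is produced on every run.
generator = torch.Generator().manual_seed(0)
train_set, val_set = random_split(full_train, [n_train, n_val], generator=generator)

print(len(train_set), len(val_set))  # 48000 12000
```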

Training and Validation Accuracy during training

Training Samples	Training Accuracy	Validation Accuracy
100	70%	58.1%
200	74.5%	71.1%
500	81.8%	75.1%
1000	85.5%	78.8%
2000	86.7%	82.0%
5000	88.5%	84.9%
12000	91.2%	87.8%
24000	92.2%	89.2%
48000	92.5%	90.2%

I trained the network for 30 epochs. After about 5 epochs, the validation accuracy starts to decline as the model begins to overfit the training data.

Training and Validation Accuracy Across Epochs

Here are the per-class accuracies of the network on both the validation and test sets. The hardest classes to classify are shirt and coat.

Class	Class Name	Validation Accuracy	Test Accuracy
0	T-shirt/top	88.3%	86.8%
1	Trouser	98.6%	98.0%
2	Pullover	85.2%	84.4%
3	Dress	93.3%	93.5%
4	Coat	81.6%	79.1%
5	Sandal	97.3%	97.8%
6	Shirt	68.2%	65.9%
7	Sneaker	95.9%	95.6%
8	Bag	97.5%	98.0%
9	Ankle boot	95.8%	95.7%
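Per-class accuracies like those in the table can be computed by counting, for each true class, how many of its samples were predicted correctly. The helper below is a small sketch with a hypothetical name (`per_class_accuracy`), shown on toy labels rather than the real model outputs.

```python
import torch


def per_class_accuracy(preds: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Fraction of samples of each true class that were predicted correctly."""
    correct = torch.bincount(labels[preds == labels], minlength=num_classes)
    total = torch.bincount(labels, minlength=num_classes)
    # clamp avoids division by zero for classes that never appear in labels.
    return correct.float() / total.clamp(min=1).float()


# Toy example with 3 classes and 2 samples per class.
labels = torch.tensor([0, 0, 1, 1, 2, 2])
preds = torch.tensor([0, 1, 1, 1, 2, 0])
acc = per_class_accuracy(preds, labels, num_classes=3)
print(acc)  # tensor([0.5000, 1.0000, 0.5000])
```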

Validation Set Per Class Accuracy

Test Set Per Class Accuracy

The table below shows, for each class, some images that were classified correctly and some that were not. The first column gives the true class of the images in the middle columns, and the last column gives the classes the model predicted for the two incorrectly classified images.

Class Name	Classified Correctly		Classified Incorrectly		Predicted Class Name
T-shirt/top					Dress, Shirt
Trouser					Dress, Dress
Pullover					Shirt, Shirt
Dress					Shirt, Shirt
Coat					Shirt, Pullover
Sandal					Ankle boot, Sneaker
Shirt					T-shirt/top, Coat
Sneaker					Ankle boot, Ankle boot
Bag					Sandal, Sandal
Ankle boot					Sandal, Sneaker

The first convolution layer has forty 3x3 filters. The learned filters are displayed below.

Visualizing Learned Filters
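A visualization like this can be produced by reading the weights of the first convolutional layer, rescaling each filter to [0, 1], and tiling the filters into a grid. The sketch below uses a freshly initialized layer as a stand-in for the trained one, and the 5x8 grid layout is an assumption.

```python
import torch
import torch.nn as nn

# Stand-in for the trained first layer: forty 3x3 filters over 1 input channel.
conv1 = nn.Conv2d(1, 40, kernel_size=3)
weights = conv1.weight.detach()  # shape (40, 1, 3, 3)

# Rescale each filter independently to [0, 1] so it can be shown as an image.
w_min = weights.amin(dim=(1, 2, 3), keepdim=True)
w_max = weights.amax(dim=(1, 2, 3), keepdim=True)
normalized = (weights - w_min) / (w_max - w_min + 1e-8)

# Tile the 40 filters into a 5x8 grid of 3x3 patches, giving a 15x24 image.
grid = normalized.view(5, 8, 3, 3).permute(0, 2, 1, 3).reshape(15, 24)
print(grid.shape)  # torch.Size([15, 24])
```

The resulting `grid` tensor can be passed to any image display routine, for example `matplotlib.pyplot.imshow` with a grayscale colormap.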

Part 2: Semantic Segmentation

Dataset

In this part we use a CNN for semantic segmentation. We use the Mini Facade dataset to label structural elements in building facades. The dataset consists of images (in .jpg format) from different cities around the world, covering diverse architectural styles. It also contains semantic segmentation labels (in .png format) for 5 classes: balcony, window, pillar, facade, and others.

Example of image data and corresponding segmentation labels

Implementation Details

The network has four convolutional layers each with 32 channels, a 3x3 kernel, and padding=1 on all sides of the image. Each convolutional layer is followed by a ReLU and then a max pool with a 2x2 kernel and a stride of 2. Each max pool halves the resolution, so we upsample with four transposed convolutions each with 32 channels, a 3x3 kernel, padding=1, and a stride of 2 to double the size of the input. The network has a final convolutional layer with 5 channels, a 3x3 kernel, and padding=1. This layer is not followed by ReLU or max pool.

I used cross entropy loss as the prediction loss and Adam as the optimizer with a learning rate of 1e-3 and a weight decay of 1e-5.

Network: input -> [conv -> relu -> pool] (4 times) -> upsample (4 times) -> conv -> outputs

Optimizer: Adam (lr=1e-3, weight_decay=1e-5)
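The segmentation network above can be sketched in PyTorch as follows. This is a rough sketch under stated assumptions, not the exact code used: the class name `FacadeNet` is hypothetical, the input is assumed to be a 3-channel 256x256 image, and `output_padding=1` is assumed on the transposed convolutions, since with a 3x3 kernel, padding=1, and stride 2 that is what makes each one exactly double the spatial size.

```python
import torch
import torch.nn as nn


class FacadeNet(nn.Module):
    """Sketch of the segmentation network described above (name is hypothetical)."""

    def __init__(self, channels: int = 32, num_classes: int = 5):
        super().__init__()
        layers = []
        in_ch = 3  # assumption: RGB input
        # Encoder: [conv -> relu -> pool] x 4, each pool halves the resolution.
        for _ in range(4):
            layers += [
                nn.Conv2d(in_ch, channels, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),
            ]
            in_ch = channels
        # Decoder: 4 transposed convolutions, each doubling the resolution.
        # Assumption: output_padding=1 so that the output is exactly 2x the input.
        for _ in range(4):
            layers.append(
                nn.ConvTranspose2d(channels, channels, kernel_size=3,
                                   stride=2, padding=1, output_padding=1)
            )
        # Final per-pixel classifier, with no ReLU or pooling after it.
        layers.append(nn.Conv2d(channels, num_classes, kernel_size=3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


model = FacadeNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

# Dummy batch: two 256x256 images with per-pixel integer labels in [0, 5).
images = torch.zeros(2, 3, 256, 256)
targets = torch.zeros(2, 256, 256, dtype=torch.long)
logits = model(images)
loss = criterion(logits, targets)
print(logits.shape)  # torch.Size([2, 5, 256, 256])
```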

I tried different hyperparameters (e.g. learning rate, weight decay), optimizers, and network architectures, and got the best efficiency and results with these settings.

Results

The network was trained on 80% of the training data and the rest was used for validation. It trained for 50 epochs, and the loss per epoch is displayed below.

Training and Validation Loss Across Epochs

We also use Average Precision (AP) on the test set to evaluate the learned model. The per-class AP and overall AP are shown below.

Class	Class Name	Average Precision
0	others	0.640
1	facade	0.752
2	pillar	0.114
3	window	0.808
4	balcony	0.526
Overall Average Precision		0.568
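For reference, AP for a single class can be computed by ranking pixels by predicted score and averaging the precision at the rank of each true positive. The function below is a simplified NumPy sketch with a hypothetical name; it does not treat tied scores specially, so it may differ slightly from a library implementation, and it is not necessarily the evaluation code used for the table. The overall AP in the table equals the unweighted mean of the five per-class values.

```python
import numpy as np


def average_precision(scores: np.ndarray, targets: np.ndarray) -> float:
    """AP for one class: mean precision at the rank of each positive sample."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = targets[order].astype(bool)
    if not hits.any():
        return float("nan")  # AP is undefined without any positives
    cum_hits = np.cumsum(hits)
    precision_at_k = cum_hits / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())


# Toy example: positives are ranked 1st and 3rd, so AP = (1/1 + 2/3) / 2.
scores = np.array([0.9, 0.8, 0.7, 0.6])
targets = np.array([1, 0, 1, 0])
ap = average_precision(scores, targets)

# The overall AP is the mean of the per-class APs from the table.
per_class_ap = [0.640, 0.752, 0.114, 0.808, 0.526]
overall_ap = float(np.mean(per_class_ap))
print(round(overall_ap, 3))  # 0.568
```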

Here is an image I took of the Radcliffe Camera in Oxford, padded with black to make it 256x256. On the right is the model's segmentation of the Camera. Notice that some classes, like window, are segmented well, while others, like balcony and pillar, aren't. I think this may be because the facade of the building is relatively small in the image compared to the ones in the training set, so it's harder for the model to distinctly recognize the more detailed building features.

Camera (original)

Camera (segmentation)
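Padding an image with black to reach 256x256 can be done along these lines. This is a sketch with a hypothetical helper name; it assumes the image has already been resized to fit within 256x256 and that it is centered on the canvas, which may not match how the image above was actually padded.

```python
import numpy as np


def pad_to_square(image: np.ndarray, size: int = 256) -> np.ndarray:
    """Center an HxWxC image on a black size x size canvas."""
    h, w = image.shape[:2]
    if h > size or w > size:
        raise ValueError("image must be resized to fit before padding")
    top = (size - h) // 2
    left = (size - w) // 2
    # Zeros are black for the usual uint8 or float image encodings.
    canvas = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    canvas[top:top + h, left:left + w] = image
    return canvas


# Toy example: a white 180x256 RGB image becomes 256x256 with black bars.
photo = np.full((180, 256, 3), 255, dtype=np.uint8)
padded = pad_to_square(photo)
print(padded.shape)  # (256, 256, 3)
```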

Project 4: Classification and Segmentation

Alex Yang
