CS194 Project #4: Classification and Segmentation

By: Scott Shao

Part I: Image Classification


Fashion MNIST dataset

For this problem, I used keras with tensorflow as my backend. I defined my model as the following:

Model summary
Model visualization
Model classification correspondence

The loss was categorical crossentropy and the optimizer was Adam with learning rate of 0.001.

First I evaluated the model using 5-fold cross validation on training data with 10 epochs to see how well the model will perform before actually fit the model with more epochs. The results of the 5-fold cross validation are 90.467, 92.575, 96.058, 98.342, and 99.025. Overall the model seems pretty good. I have also plot the 5-fold cross validation cross entropy loss and Classification accuracy, as well as the model confidence interval as shown below.

5-Fold cross validation cross entropy loss
and Classification accuracy
5-Fold cross validation confidence interval

For the actual training I fit the model with 300 epochs with checkpoints to save the model with the best validation loss and the best validation accuracy. The training loss and the training accuracy are shown below. The loss keeps increases after about 20 epochs while the accuracy plateaued at around 92%. This suggested that the model might be overfitting to the training data. For the next section, I will present the results of the best accuracy model and the best loss model side by side. The best accuracy model predicted 92.210% of the test data correctly and 100% of the training data correctly. The best loss model predicted 90.960% of the test data correctly and 94.255% of the training data correctly.

Model training loss after 300 epochs
Model training accuracy after 300 epochs
The best accuracy model test data confusion matrix
The best loss model test data confusion matrix
The best accuracy model test data classification report
The best loss model test data classification report
The best accuracy model train data confusion matrix
The best loss model train data confusion matrix
The best accuracy model train data classification report
The best loss model train data classification report

The hardest classes to get are class 0 T-shirt/top and class 6 Shirt. It is especially difficult to distinguish between T-shirt/top and Shirt.

For the following ten sets of images. Each set represent a class. The left two images are the correct classification and the two on the right are the wrong classification where the images were classified as some other classes.

Tshirt/top correct classification Tshirt/top wrong classification
Trouser correct classification Trouser wrong classification
Pullover correct classification Pullover wrong classification
Dress correct classification Dress wrong classification
Coat correct classification Coat wrong classification
Sandal correct classification Sandal wrong classification
Shirt correct classification Shirt wrong classification
Sneaker correct classification Sneaker wrong classification
Bag correct classification Bag wrong classification
Ankle boot correct classification Ankle boot wrong classification

The next two images showed the learned filter for the first 2D convolutional layer of the best accuracy model and the best loss model. Since the accuracy did not improve much between the two models, the learned filters should be very similar. However, the best loss model has a loss that was much smaller than the best accuracy model. This shows that loss is a better metrics compared to the accuracy in training the model.

The best accuracy model first convolutional layer learned filter
The best loss model first convolutional layer learned filter

As expected, they are generally the same.



Part II: Semantic Segmentation

Sample input image Sample ground truth
Classification correspondence

The dataset used is the Mini Facade dataset. PyTorch was used as the backend. The model is defined as the follows with crossentropy loss and Adam optimizer with 0.001 learning rate and 1e-5 weight decay:

Model summary

The average precision on the test set output from the training. The average AP is 0.518073757:

Average precision (AP) on the test set after training
Class Average Precision
Others 0.6330090515953898
Facade 0.7532332824894723
Pillar 0.0892949022197207
Window 0.7755921759864337
Balcony 0.33923937436496526
Average 0.518073757

The model was trained on the training data with 115 images reserved for validation. The model was trained for 100 epochs.

Training and validation loss

The following are some examples from the test output. The difference between the ground truth and the predicted was calculated using SSD to get the best and the worst predictions and some in betweens.

Best prediction
Worst prediction
Average prediction
Average prediction
Average prediction
25
27
6
8

In terms of the model's ability to predict classes, window > facade > otehrs > balcony > pillar. Windows and the facade are the easier ones to get right and the pillar are the most difficult. Upon examine the results, others is also very hard to predict as it always gets predicted as the facade.