Project 4: Classification and Segmentation

Shide Li

Part 1: Image Classification

In this part, we train a model to classify images in the FashionMNIST dataset into 10 classes. Here are some sample images and their respective classes:

Sample images

CNN Structure

I trained a convolutional neural network consisting of two conv layers, each followed by a 2 x 2 max pool and a ReLU, and two fully connected linear layers. Printing out the net object gives the following structure:

Network Structure for FashionMNIST Classification

As shown above, the first conv layer takes in 1 channel and outputs 32 channels, and the second conv layer takes in those 32 channels and outputs 64 channels. Both layers use a 5 x 5 filter. For the fully connected layers, the first takes the flattened 64 x 4 x 4 feature map (1024 values) and outputs 1024 features, and the last layer maps those to scores for the 10 classes.
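The architecture described above can be sketched in PyTorch as follows. The class name `FashionCNN` and the exact layer ordering are my own reconstruction from the description, not the actual code:

```python
import torch
import torch.nn as nn

class FashionCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5)   # 1 -> 32 channels
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)  # 32 -> 64 channels
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 4 * 4, 1024)
        self.fc2 = nn.Linear(1024, 10)

    def forward(self, x):
        # 28x28 -> conv 5x5 -> 24x24 -> pool -> 12x12
        x = torch.relu(self.pool(self.conv1(x)))
        # 12x12 -> conv 5x5 -> 8x8 -> pool -> 4x4
        x = torch.relu(self.pool(self.conv2(x)))
        x = x.flatten(1)  # 64 * 4 * 4 = 1024 values per image
        x = torch.relu(self.fc1(x))
        return self.fc2(x)
```

Note that with no padding, a 5 x 5 filter shrinks each spatial dimension by 4, which is how the 28 x 28 input ends up as a 64 x 4 x 4 feature map before the fully connected layers.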

I used cross-entropy loss as the prediction loss and the Adam optimizer with a learning rate of 0.001.
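A single training step with this loss/optimizer pairing looks roughly like the following; the tiny linear model and random batch here are stand-ins for the actual network and data loader:

```python
import torch
import torch.nn as nn

# stand-in model; the real one is the CNN described above
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# one stand-in batch of 8 images with random labels
images = torch.randn(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```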

Accuracy

For this problem, I used a training set of size 50000 and a validation set of size 10000. The validation set was used to tune hyperparameters such as the step size. Here the respective accuracies are plotted against the number of epochs:

Step size 0.01
Step size 0.001 (Chosen model)
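The 50000/10000 train/validation split above can be sketched with `torch.utils.data.random_split`; the dataset here is a dummy stand-in for the 60000-sample FashionMNIST training set:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# stand-in for torchvision.datasets.FashionMNIST (60000 training samples)
dataset = TensorDataset(torch.arange(60000))

# randomly split into 50000 training and 10000 validation samples
train_set, val_set = random_split(dataset, [50000, 10000])
```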

After training the CNN, I tested the network against the test set and got the following results:

Test accuracy of trained model: 90.95%

In addition, here are the per-class accuracies on the validation and test sets. We can see that the shirt class is the hardest to classify, and the coat class is the second hardest:
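Per-class accuracy can be computed with a sketch like the following; the function name is my own, and it assumes predictions and labels are given as 1-D tensors:

```python
import torch

def per_class_accuracy(preds, labels, num_classes=10):
    # fraction of samples of each class that were predicted correctly;
    # assumes every class appears at least once in `labels`
    accs = []
    for c in range(num_classes):
        mask = labels == c
        accs.append((preds[mask] == c).float().mean().item())
    return accs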

Examples

For each class, two images classified correctly and two misclassified are shown below:

Visualizing filters for the first conv layer

32 5x5 filters for first layer
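A filter grid like the one above can be rendered with matplotlib; the random weights below are a stand-in for `conv1.weight` of the trained network, and the output filename is my own choice:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import torch

# stand-in for model.conv1.weight.detach(): 32 filters of shape 1x5x5
weights = torch.randn(32, 1, 5, 5)

fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(weights[i, 0].numpy(), cmap="gray")
    ax.axis("off")
fig.savefig("filters.png")
plt.close(fig)
```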

Part 2: Image Segmentation

Model Architecture

Before training the model, I split the provided training set into training and validation sets, using the first 700 samples for training and the remaining 206 samples for validation.
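This index-based split can be expressed with `torch.utils.data.Subset`; the dataset below is a dummy stand-in for the facade dataset of 906 samples:

```python
import torch
from torch.utils.data import TensorDataset, Subset

# stand-in for the facade dataset (906 samples total)
dataset = TensorDataset(torch.arange(906))

# deterministic split: first 700 for training, last 206 for validation
train_set = Subset(dataset, range(700))
val_set = Subset(dataset, range(700, 906))
```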

Below is the structure of the neural net:

As shown above, there are 6 convolution layers, each with filters of varying size (from 3 x 3 to 7 x 7) and each followed by a ReLU (except the final layer). The channel counts are 3 -> 32 -> 64 -> 128 -> 64 -> 32 -> 5. All conv layers use padding = 2 and stride = 1.
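The segmentation net can be sketched as follows, under the simplifying assumption that every kernel is 5 x 5 (the actual net mixes sizes from 3 x 3 to 7 x 7; with padding = 2 and stride = 1, a 5 x 5 kernel happens to preserve spatial resolution exactly):

```python
import torch
import torch.nn as nn

# channel progression from the report: 3 -> 32 -> 64 -> 128 -> 64 -> 32 -> 5
channels = [3, 32, 64, 128, 64, 32, 5]

layers = []
for i in range(6):
    # kernel_size=5 is an assumption; the real net varies it per layer
    layers.append(nn.Conv2d(channels[i], channels[i + 1],
                            kernel_size=5, padding=2, stride=1))
    if i < 5:  # ReLU after every conv except the final one
        layers.append(nn.ReLU())

seg_net = nn.Sequential(*layers)
```

The final layer outputs 5 channels, one score map per segmentation class.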

In addition, I tuned the following hyperparameters using the validation set: number of epochs = 20, batch size = 4. For the Adam optimizer, I ended up using the suggested parameters (a learning rate of 1e-3 and a weight decay of 1e-5).

The loss across iterations is plotted below:

AP on Test Set

Here is the test result: the average precision I got was 49%.
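Average precision for one class can be computed from flattened per-pixel scores and binary ground truth, for example with scikit-learn; the four-pixel arrays below are a toy illustration, not the actual evaluation data:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# hypothetical flattened per-pixel scores and ground truth for one class
scores = np.array([0.9, 0.8, 0.3, 0.2])
truth = np.array([1, 1, 0, 1])

ap = average_precision_score(truth, scores)
```

Averaging this quantity over the five classes gives the reported mean AP.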

Examples

Here are the results of applying the model on a few test samples:
Sample image
Result
Ground Truth
Sample image
Result
Ground Truth
Finally, here is an example of running the trained model on an image of a building outside of the testset:
Image
Result

In this picture, we can see that the model gets the windows and facades right most of the time and fails on balconies. In other test samples, it also fails on pillars. This is consistent with the difference in AP for each class.