import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import skimage.io as io
import skimage as sk
from glob import glob

from IPython.display import HTML
HTML("""
<style>

div.cell { /* Tunes the space between cells */
margin-top:1em;
margin-bottom:1em;
}

div.text_cell_render h1 { /* Main titles bigger, centered */
font-size: 2.2em;
line-height:0.9em;
}

div.text_cell_render h2 { /*  Parts names nearer from text */
margin-bottom: -0.4em;
}


div.text_cell_render { /* Customize text cells */
font-family: 'Georgia';
font-size:1.2em;
line-height:1.4em;
padding-left:3em;
padding-right:3em;
}

.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}

</style>

<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.

""")

Project 4: Classification and Segmentation

Daniel Zhu, CS194-26-abh

Part 1: Image Classification

In this part of the project I trained a convolutional neural net to classify images from the FashionMNIST dataset. The dataset contains images of 10 different types of clothing: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. I split the 60,000 training images 0.9/0.1 into a training set (54,000 images) and a validation set (6,000 images). The test set contains 10,000 images.
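The 0.9/0.1 split can be sketched with a shuffled index array. This is a minimal illustration, not necessarily how the split was actually done; the helper name `split_indices` and the fixed seed are my own assumptions.

```python
import numpy as np

def split_indices(n_samples, val_fraction=0.1, seed=0):
    """Shuffle sample indices and split them into train and validation sets."""
    rng = np.random.default_rng(seed)  # the seed is an arbitrary, assumed choice
    order = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_fraction))
    # The first n_val shuffled indices form the validation set, the rest train.
    return order[n_val:], order[:n_val]

train_idx, val_idx = split_indices(60_000, val_fraction=0.1)
print(len(train_idx), len(val_idx))  # 54000 6000
```

Shuffling before splitting keeps the class mix roughly equal in both subsets, which matters when the per-class validation accuracies are compared later.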

Architecture

Conv2d(in: 1 channel, out: 32 channels): filter size=3x3 --> ReLU --> MaxPool2d: filter size=2x2
Conv2d(in: 32 channels, out: 32 channels): filter size=3x3 --> ReLU --> MaxPool2d: filter size=2x2
Linear(in: 800 features, out: 120 features) --> ReLU
Linear(in: 120 features, out: 84 features) --> ReLU
Linear(in: 84 features, out: 10 features)
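The 800 input features of the first linear layer follow from the spatial sizes of the layers above. The sketch below works through that arithmetic, assuming the convolutions use no padding and stride 1 and the pooling uses stride 2 (the PyTorch defaults; the write-up does not state them). The helper `conv_out` is a name I made up.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                      # FashionMNIST images are 28x28
size = conv_out(size, 3)       # first 3x3 conv, no padding -> 26
size = conv_out(size, 2, 2)    # 2x2 max pool, stride 2     -> 13
size = conv_out(size, 3)       # second 3x3 conv            -> 11
size = conv_out(size, 2, 2)    # 2x2 max pool, stride 2     -> 5
flat_features = 32 * size * size  # 32 channels of 5x5 maps
print(size, flat_features)     # 5 800
```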

Hyperparameters

batch_size: 50
optimizer: optim.Adam, learning rate = 0.002
loss function: nn.CrossEntropyLoss
training epochs: 5
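For reference, the loss above is the mean negative log-probability of the correct class after a softmax over the 10 outputs. Here is a small NumPy sketch of that computation; it is an illustration of the formula, not the training code, and `cross_entropy` is a hypothetical helper name.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy of integer class targets given raw logits."""
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class for every sample.
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()

logits = np.zeros((4, 10))            # equal scores for all 10 classes
targets = np.array([0, 3, 5, 9])
loss = cross_entropy(logits, targets)
print(round(float(loss), 4))          # 2.3026, i.e. ln(10)
```

A loss near ln(10) therefore means the network is doing no better than guessing uniformly among the 10 classes.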

Results

Accuracy

The plot below shows the train and validation accuracies over the 5 epochs of training (there are 10 batches per epoch).

im1=plt.imread("train_val_accuracy.jpg")
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(1,1,1)
ax1.imshow(im1)
plt.axis("off");

Accuracy of the network on the 10,000 test images: 90.23%

Class Accuracies

For both the validation and test sets, "Trouser" had the highest accuracy and "Shirt" had the lowest. Shirts and T-shirts/tops look very similar, which makes the "Shirt" class harder to classify correctly.

Class	Validation Accuracy	Test Accuracy
T-shirt/top	90.9%	86.9%
Trouser	98.7%	97.7%
Pullover	83.4%	85.6%
Dress	93.8%	90.9%
Coat	84.7%	83.3%
Sandal	98.0%	95.7%
Shirt	77.6%	72.4%
Sneaker	97.4%	96.9%
Bag	98.5%	97.0%
Ankle boot	97.2%	95.9%
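Per-class accuracies like those in the table can be computed from arrays of true and predicted labels. The sketch below uses made-up toy labels, and the function name `per_class_accuracy` is my own; it is not the evaluation code that produced the table.

```python
import numpy as np

CLASSES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

def per_class_accuracy(y_true, y_pred, n_classes):
    """Fraction of samples of each true class that were predicted correctly."""
    acc = np.full(n_classes, np.nan)  # NaN marks classes with no samples
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            acc[c] = (y_pred[mask] == c).mean()
    return acc

# Toy labels, made up for illustration only.
y_true = np.array([0, 0, 6, 6, 6, 1])
y_pred = np.array([0, 6, 6, 0, 0, 1])
acc = per_class_accuracy(y_true, y_pred, len(CLASSES))
for name, a in zip(CLASSES, acc):
    if not np.isnan(a):
        print(f"{name}: {100 * a:.1f}%")
```

In the toy data, most "Shirt" samples are predicted as "T-shirt/top", which is the kind of confusion described above.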

Classified Correctly:

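# Collect the saved example images; the file names say whether the prediction was correct.
imgs = np.array(glob("*.jpg"))
correct = imgs[["_correct_" in x for x in imgs]]
incorrect = imgs[["_incorrect_" in x for x in imgs]]

# Show the first 20 correctly classified examples in a 5x4 grid.
fig = plt.figure(figsize=(10, 10))
columns = 4
rows = 5
for i in range(columns * rows):
    img = plt.imread(correct[i])
    fig.add_subplot(rows, columns, i + 1)
    # The class name is the part of the file name before the first underscore.
    plt.title(correct[i][0:correct[i].find("_")])
    plt.axis('off')
    plt.imshow(img, cmap='gray')
plt.show()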

Classified Incorrectly (Actual Class:Predicted Class)

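# Show the first 20 misclassified examples, titled "actual:predicted".
fig = plt.figure(figsize=(10, 10))
columns = 4
rows = 5
for i in range(columns * rows):
    img = plt.imread(incorrect[i])
    fig.add_subplot(rows, columns, i + 1)
    temp = incorrect[i].split("_")
    # An actual class of "T-shirt" takes up two fields of the file name
    # (presumably "T-shirt" and "top"), so the predicted class is one field later.
    if temp[0] == "T-shirt":
        plt.title(temp[0] + ":" + temp[2])
    else:
        plt.title(temp[0] + ":" + temp[1])
    plt.axis('off')
    plt.imshow(img, cmap='gray')
plt.show()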

Visualized filters

These are the learned 3x3 filters of the first convolutional layer (before nonlinearities were applied).

im1=plt.imread("learned_filters.jpg")
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(1,1,1)
ax1.imshow(im1)
plt.axis("off");
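A figure like the one above can be built by tiling the first layer's weights into a single grayscale image. This is a sketch under assumptions: the weights are taken to be an array of shape (32, 1, 3, 3), random values stand in for the learned filters, and `tile_filters` is a hypothetical helper, not the code that produced learned_filters.jpg.

```python
import numpy as np

def tile_filters(weights, columns=8, pad=1):
    """Arrange (n, 1, h, w) filters into one 2-D image for display."""
    n, _, h, w = weights.shape
    rows = -(-n // columns)  # ceiling division
    # Rescale all weights jointly to [0, 1] so they share one gray scale.
    lo, hi = weights.min(), weights.max()
    scaled = (weights - lo) / (hi - lo + 1e-12)
    # White background with a pad-pixel border around every filter.
    grid = np.ones((rows * (h + pad) + pad, columns * (w + pad) + pad))
    for i in range(n):
        r, c = divmod(i, columns)
        top, left = pad + r * (h + pad), pad + c * (w + pad)
        grid[top:top + h, left:left + w] = scaled[i, 0]
    return grid

# Random stand-in for the first layer's weights, shape (32, 1, 3, 3).
weights = np.random.default_rng(0).normal(size=(32, 1, 3, 3))
grid = tile_filters(weights)
print(grid.shape)  # (17, 33)
```

The result can be shown with `plt.imshow(grid, cmap='gray')`. Scaling all filters jointly rather than one by one keeps their relative magnitudes comparable.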

Part 2: Semantic Segmentation

In this part of the project I trained a convolutional neural net to "semantically separate" an image, assigning each pixel of the image to its correct class. The dataset uses 5 labels, each with a corresponding color: others (black), facade (blue), pillar (green), window (orange), balcony (red). I split the training dataset of 906 images with a 0.9/0.1 train-validation split. The test set contains 114 images.
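The label-to-color convention can be sketched as a palette lookup that turns a map of class indices into an RGB image. The class index order and the exact RGB values below are assumptions for illustration; only the class names and color names come from the dataset description.

```python
import numpy as np

# Class index -> (name, RGB). Indices and RGB values are assumed, not from the dataset.
PALETTE = {
    0: ("others", (0, 0, 0)),        # black
    1: ("facade", (0, 0, 255)),      # blue
    2: ("pillar", (0, 255, 0)),      # green
    3: ("window", (255, 165, 0)),    # orange
    4: ("balcony", (255, 0, 0)),     # red
}

def colorize(label_map):
    """Turn an (H, W) array of class indices into an (H, W, 3) RGB image."""
    colors = np.array([PALETTE[i][1] for i in range(len(PALETTE))], dtype=np.uint8)
    return colors[label_map]  # fancy indexing looks up one color per pixel

labels = np.array([[0, 1], [3, 4]])
rgb = colorize(labels)
print(rgb.shape)  # (2, 2, 3)
```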

Architecture

The architecture is based on this paper on deconvolution networks for semantic segmentation. The underlying idea is to train the network like an "autoencoder": it first compresses the original image and then decompresses it to reproduce the labels, i.e. the way the image is segmented.

Compression:
Conv2d(in: 3 channels, out: 32 channels): filter size=3x3, padding=1 --> MaxPool2d: filter size=2x2 --> BatchNorm2d --> ReLU
Conv2d(in: 32 channels, out: 64 channels): filter size=3x3, padding=1 --> MaxPool2d: filter size=2x2 --> BatchNorm2d --> ReLU
Conv2d(in: 64 channels, out: 128 channels): filter size=3x3, padding=1 --> MaxPool2d: filter size=2x2 --> BatchNorm2d --> ReLU
Conv2d(in: 128 channels, out: 128 channels): filter size=3x3, padding=1 --> MaxPool2d: filter size=2x2 --> BatchNorm2d --> ReLU

Decompression/Deconvolution:
ConvTranspose2d(in: 128 channels, out: 128 channels): filter size=3x3, stride=2, padding=1 --> BatchNorm2d --> ReLU
ConvTranspose2d(in: 128 channels, out: 64 channels): filter size=3x3, stride=2, padding=0 --> BatchNorm2d --> ReLU
ConvTranspose2d(in: 64 channels, out: 32 channels): filter size=3x3, stride=2, padding=0 --> BatchNorm2d --> ReLU
ConvTranspose2d(in: 32 channels, out: 16 channels): filter size=3x3, stride=2, padding=0, output_padding=1 --> BatchNorm2d --> ReLU
Conv2d(in: 16 channels, out: 5 channels): filter size=3x3, padding=1
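The padding and output_padding choices in the decompression layers make the output the same size as the input. The sketch below checks that arithmetic for an assumed 256x256 input (the write-up does not state the image size) using the standard output-size formulas for convolution and transposed convolution. The helper names are my own.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of Conv2d or MaxPool2d along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Output size of ConvTranspose2d along one spatial axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 256  # assumed input side length; not stated in the write-up
for _ in range(4):
    size = conv_out(size, 3, padding=1)   # 3x3 conv with padding=1 keeps the size
    size = conv_out(size, 2, stride=2)    # 2x2 max pool halves it
bottleneck = size

size = deconv_out(size, 3, stride=2, padding=1)         # 16  -> 31
size = deconv_out(size, 3, stride=2)                    # 31  -> 63
size = deconv_out(size, 3, stride=2)                    # 63  -> 127
size = deconv_out(size, 3, stride=2, output_padding=1)  # 127 -> 256
size = conv_out(size, 3, padding=1)  # final 3x3 conv to the 5 class channels
print(bottleneck, size)  # 16 256
```

Under this assumption the compressed representation is 16x16 with 128 channels, and the final layer produces one 256x256 score map per class.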

Hyperparameters

I kept the default hyperparameters because they worked well, except for batch_size, which I changed for faster training.

batch_size: 10
optimizer: optim.Adam, learning rate = 0.001, weight decay = 0.00001
loss function: nn.CrossEntropyLoss
training epochs: 10

Results

Loss

The plot below shows the train and validation losses over 20 epochs of training. Overfitting begins around epoch 10, so I trained the final model for only 10 epochs.
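One way to pick the stopping point from such a curve is to take the epoch with the lowest validation loss. The sketch below uses made-up loss values, not the ones from the plot, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# Made-up validation losses for illustration; not the values from the plot.
val_losses = [1.20, 1.05, 0.98, 0.93, 0.90, 0.88, 0.87, 0.86, 0.855, 0.85,
              0.86, 0.87, 0.89, 0.90, 0.92, 0.94, 0.95, 0.97, 0.99, 1.00]
print(best_epoch(val_losses))  # 10
```

Once the validation loss starts rising while the training loss keeps falling, further epochs mostly fit the training set rather than improve generalization.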

im1=plt.imread("part2/train_val_loss.jpg")
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(1,1,1)
ax1.imshow(im1)
plt.axis("off");

Average Precision

Class	AP Score
others	0.7194
facade	0.7930
pillar	0.1864
window	0.8590
balcony	0.6684

Average	0.6453
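For one class, average precision can be computed by ranking all pixels by the predicted score for that class and averaging the precision at the rank of each true positive. The NumPy sketch below shows this on a toy example; it is an illustration of the metric, not necessarily the scoring code used for the table, and tied scores are ranked in their original order.

```python
import numpy as np

def average_precision(y_true, scores):
    """AP for one class: mean precision at the rank of each positive."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = y_true[order].astype(bool)
    if not hits.any():
        return float("nan")
    # Precision after each ranked prediction.
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision[hits].mean())

# Toy example: positives are ranked 1st and 3rd.
y_true = np.array([1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.1])
ap = average_precision(y_true, scores)
print(round(ap, 4))  # 0.8333

# Averaging the rounded per-class scores from the table.
reported = [0.7194, 0.7930, 0.1864, 0.8590, 0.6684]
mean_ap = sum(reported) / len(reported)
print(round(mean_ap, 4))  # 0.6452
```

The mean of the rounded per-class values agrees with the reported average of 0.6453 up to rounding in the last digit.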

Results on own images

I picked a few images of buildings in Berkeley to run the model on. For Evans and Dwinelle, the model labels the windows and facade very well and even gets the balcony on Dwinelle right, so I'd say it did well overall on these two buildings. For Doe, it does worse: the windows are labeled facade, and most of the pillars are labeled window rather than pillar. The windows are very dark and the individual panes are hard to see, which could have made that area harder for the net to recognize. For the pillars, the shadows may have caused the model to misinterpret them.

# Show each photo (left column) next to its label image (right column).
fig = plt.figure(figsize=(10, 10))
columns = 2
rows = 3
images = ["doe.jpg", "doe_label.png", "evans.jpg", "evans_label.png", "dwinelle.jpg", "dwinelle_label.png"]
for i in range(columns * rows):
    img = plt.imread("part2/" + images[i])
    fig.add_subplot(rows, columns, i + 1)
    # Title each panel with the file name minus its extension.
    plt.title(images[i].split(".")[0])
    plt.axis('off')
    plt.imshow(img, cmap='gray')
plt.show()