import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import skimage.io as io
import skimage as sk
from glob import glob
from IPython.display import HTML
HTML("""
<style>
div.cell { /* Tunes the space between cells */
margin-top:1em;
margin-bottom:1em;
}
div.text_cell_render h1 { /* Main titles bigger, centered */
font-size: 2.2em;
line-height:0.9em;
}
div.text_cell_render h2 { /* Parts names nearer from text */
margin-bottom: -0.4em;
}
div.text_cell_render { /* Customize text cells */
font-family: 'Georgia';
font-size:1.2em;
line-height:1.4em;
padding-left:3em;
padding-right:3em;
}
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.
""")
This part of the project trains a convolutional neural network to classify images from the FashionMNIST dataset. The dataset contains images of 10 types of clothing: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. I split the 60,000 training images 0.9/0.1 into training and validation sets (54,000 and 6,000 images). The test set contains 10,000 images.
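The notebook does not show how the split was made, so the following is only a minimal sketch of one way to produce a 0.9/0.1 split by shuffling indices. The function name `split_indices` and the fixed seed are assumptions for illustration, not the code used for the results here.

```python
import random

def split_indices(n_samples, val_fraction=0.1, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into train/validation lists."""
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible between runs (assumed choice).
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(60000, val_fraction=0.1)
print(len(train_idx), len(val_idx))  # 54000 6000
```

The two index lists can then be used to select the training and validation subsets from the full training set.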
Conv2d(in: 1 channel, out: 32 channels): filter size=3x3 --> ReLU --> MaxPool2d: filter size=2x2
Conv2d(in: 32 channels, out: 32 channels): filter size=3x3 --> ReLU --> MaxPool2d: filter size=2x2
Linear(in: 800 features, out: 120 features) --> ReLU
Linear(in: 120 features, out: 84 features) --> ReLU
Linear(in: 84 features, out: 10 features)
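The 800 inputs of the first linear layer follow from the layer sizes above. As a worked check, this sketch traces the spatial size of a 28x28 FashionMNIST image through the two conv/pool stages. It assumes stride 1 and no padding for the convolutions and a pooling stride equal to the pooling window, since the listing does not state them; the helper names are made up for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel):
    """Output width/height of max pooling with stride equal to the window size."""
    return (size - kernel) // kernel + 1

size = 28  # FashionMNIST images are 28x28 pixels
for _ in range(2):
    # Each stage is a 3x3 convolution followed by 2x2 max pooling.
    size = pool_out(conv_out(size, kernel=3), kernel=2)

flat_features = 32 * size * size  # 32 output channels from the second convolution
print(size, flat_features)  # 5 800
```

The image shrinks 28 -> 26 -> 13 -> 11 -> 5, so the flattened feature vector has 32 * 5 * 5 = 800 entries.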
batch_size: 50
optimizer: optim.Adam, learning rate = 0.002
loss function: nn.CrossEntropyLoss
training epochs: 5
im1=plt.imread("train_val_accuracy.jpg")
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(1,1,1)
ax1.imshow(im1)
plt.axis("off");
On both the validation and test sets, "Trouser" had the highest accuracy and "Shirt" the lowest. Shirt and T-shirt/top look very similar, which makes shirts harder to classify correctly.
Class | Validation Accuracy | Test Accuracy |
---|---|---|
T-shirt/top | 90.9% | 86.9% |
Trouser | 98.7% | 97.7% |
Pullover | 83.4% | 85.6% |
Dress | 93.8% | 90.9% |
Coat | 84.7% | 83.3% |
Sandal | 98.0% | 95.7% |
Shirt | 77.6% | 72.4% |
Sneaker | 97.4% | 96.9% |
Bag | 98.5% | 97.0% |
Ankle boot | 97.2% | 95.9% |
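Per-class accuracy is the fraction of images of a given class that the model labels correctly. The sketch below shows one way to compute it from lists of true and predicted labels. The function name and the toy labels are hypothetical and only illustrate the calculation; they are not the predictions behind the table above.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred, class_names):
    """Return {class name: correctly classified / total} for each class in y_true."""
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {class_names[c]: hits[c] / total[c] for c in sorted(total)}

# Toy example with three classes (made-up labels).
names = ["T-shirt/top", "Trouser", "Shirt"]
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 2, 1, 1, 2, 0, 2, 0]
acc = per_class_accuracy(y_true, y_pred, names)
for name, value in acc.items():
    print(f"{name}: {100 * value:.1f}%")
```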
Classified Correctly:
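# Sort the file names so the grid shows the same images on every run.
imgs = np.array(sorted(glob("*.jpg")))
# File names contain "_correct_" or "_incorrect_" depending on the prediction.
correct = imgs[["_correct_" in x for x in imgs]]
incorrect = imgs[["_incorrect_" in x for x in imgs]]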
fig=plt.figure(figsize=(10, 10))
columns = 4
rows = 5
for i in range(0, columns*rows):
    img = plt.imread(correct[i])
    fig.add_subplot(rows, columns, i+1)
    # The class name is the part of the file name before the first underscore.
    plt.title(correct[i][0:correct[i].find("_")])
    plt.axis('off')
    plt.imshow(img, cmap='gray')
plt.show()
Classified Incorrectly (Actual Class:Predicted Class)
fig=plt.figure(figsize=(10, 10))
columns = 4
rows = 5
for i in range(0, columns*rows):
    img = plt.imread(incorrect[i])
    fig.add_subplot(rows, columns, i+1)
    temp = incorrect[i].split("_")
    # For "T-shirt" file names the predicted class is one field further along.
    if temp[0]=="T-shirt":
        plt.title(temp[0] + ":" + temp[2])
    else:
        plt.title(temp[0] + ":" + temp[1])
    plt.axis('off')
    plt.imshow(img, cmap='gray')
plt.show()
These are the learned 3x3 filters of the first convolutional layer (before nonlinearities were applied).
im1=plt.imread("learned_filters.jpg")
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(1,1,1)
ax1.imshow(im1)
plt.axis("off");
This part of the project trains a convolutional neural network for semantic segmentation: labeling each pixel of an image with its correct class. The dataset has 5 classes, each shown in its own color: others (black), facade (blue), pillar (green), window (orange), balcony (red). I split the 906 training images 0.9/0.1 into training and validation sets. The test set contains 114 images.
The architecture is based on this paper on deconvolution networks for semantic segmentation. The underlying idea is to train the network like an "autoencoder": it first compresses the original image and then decompresses it to recover the per-pixel labels, i.e. the way the image is segmented.
Compression:
Conv2d(in: 3 channels, out: 32 channels): filter size=3x3, padding=1 --> MaxPool2d: filter size=2x2 --> BatchNorm2d --> ReLU
Conv2d(in: 32 channels, out: 64 channels): filter size=3x3, padding=1 --> MaxPool2d: filter size=2x2 --> BatchNorm2d --> ReLU
Conv2d(in: 64 channels, out: 128 channels): filter size=3x3, padding=1 --> MaxPool2d: filter size=2x2 --> BatchNorm2d --> ReLU
Conv2d(in: 128 channels, out: 128 channels): filter size=3x3, padding=1 --> MaxPool2d: filter size=2x2 --> BatchNorm2d --> ReLU
Decompression/Deconvolution:
ConvTranspose2d(in: 128 channels, out: 128 channels): filter size=3x3, stride=2, padding=1 --> BatchNorm2d --> ReLU
ConvTranspose2d(in: 128 channels, out: 64 channels): filter size=3x3, stride=2, padding=0 --> BatchNorm2d --> ReLU
ConvTranspose2d(in: 64 channels, out: 32 channels): filter size=3x3, stride=2, padding=0 --> BatchNorm2d --> ReLU
ConvTranspose2d(in: 32 channels, out: 16 channels): filter size=3x3, stride=2, padding=0, output_padding=1 --> BatchNorm2d --> ReLU
Conv2d(in: 16 channels, out: 5 channels): filter size=3x3, padding=1
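The padding and output_padding values in the decompression layers are what bring the output back to the input resolution. The sketch below traces the spatial size through the layers listed above. The 256x256 input size is an assumption (it is not stated in this notebook), as are stride 1 for the convolutions and a pooling stride equal to the pooling window; the helper names are made up for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel):
    """Output width/height of max pooling with stride equal to the window size."""
    return (size - kernel) // kernel + 1

def deconv_out(size, kernel, stride, padding=0, output_padding=0):
    """Output width/height of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 256  # assumed input resolution
for _ in range(4):
    # Compression: 3x3 convolution with padding=1 keeps the size, 2x2 pooling halves it.
    size = pool_out(conv_out(size, kernel=3, padding=1), kernel=2)
encoded = size

# (padding, output_padding) for the four ConvTranspose2d layers listed above.
deconv_settings = [(1, 0), (0, 0), (0, 0), (0, 1)]
decoded_sizes = []
for padding, output_padding in deconv_settings:
    size = deconv_out(size, kernel=3, stride=2, padding=padding,
                      output_padding=output_padding)
    decoded_sizes.append(size)

print(encoded, decoded_sizes)  # 16 [31, 63, 127, 256]
```

Under these assumptions the final 3x3 convolution with padding=1 keeps the 256x256 size, so the network outputs one 5-class score vector per input pixel.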
I kept the default hyperparameters because they worked well, except for batch_size, which I changed for faster training.
batch_size: 10
optimizer: optim.Adam, learning rate = 0.001, weight decay = 0.00001
loss function: nn.CrossEntropyLoss
training epochs: 10
im1=plt.imread("part2/train_val_loss.jpg")
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(1,1,1)
ax1.imshow(im1)
plt.axis("off");
Class | AP Score |
---|---|
others | 0.7194 |
facade | 0.7930 |
pillar | 0.1864 |
window | 0.8590 |
balcony | 0.6684 |
Average | 0.6453 |
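For reference, average precision (AP) for one class ranks all pixels by the model's score for that class and averages the precision at each rank where a true pixel of that class appears. The sketch below is a plain-Python version of that definition, assuming no tied scores. It is not necessarily the routine that produced the table, and the toy labels and scores are made up.

```python
def average_precision(y_true, scores):
    """AP for one class: mean of precision@k over the ranks k of the positive samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0  # no positives: AP is undefined, reported as 0 here by convention
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos

# Toy example: 1 marks a pixel that truly belongs to the class.
toy_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(toy_ap, 4))  # 0.8333

# The "Average" row is the mean of the per-class scores.
ap_scores = {"others": 0.7194, "facade": 0.7930, "pillar": 0.1864,
             "window": 0.8590, "balcony": 0.6684}
mean_ap = sum(ap_scores.values()) / len(ap_scores)
print(round(mean_ap, 4))  # 0.6452
```

The mean of the rounded per-class values is 0.6452; the 0.6453 in the table presumably comes from averaging the unrounded scores.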
I picked a few images of buildings in Berkeley to run the model on. For Evans and Dwinelle, the model does a very good job of labeling the windows and facade, and it even labels the balcony on Dwinelle correctly. For these two buildings I'd say it did pretty well overall. For Doe, it does noticeably worse: the windows are labeled facade, and most of the pillars are labeled window rather than pillar. The windows are very dark and it's hard to see the individual panes, so the net could have had a harder time discerning that area. For the pillars, the shadows may have caused the model to misinterpret them.
fig=plt.figure(figsize=(10, 10))
columns = 2
rows = 3
images = ["doe.jpg","doe_label.png","evans.jpg","evans_label.png","dwinelle.jpg","dwinelle_label.png"]
for i in range(0, columns*rows):
    img = plt.imread("part2/" + images[i])
    fig.add_subplot(rows, columns, i+1)
    # Title each panel with the file name without its extension.
    plt.title(images[i].split(".")[0])
    plt.axis('off')
    plt.imshow(img, cmap='gray')
plt.show()