CS194-26 Project 4 (something about nets)

Hello, welcome to my webpage :).

Part 1

For the first part of the project, I began by implementing what the spec advised:

conv (32) -> relu -> maxpool -> conv (32) -> relu -> maxpool -> fc -> relu -> fc

trained with cross-entropy loss, using Adam with a learning rate of 1e-2.
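
For concreteness, here's roughly what that looks like in PyTorch. This is a sketch, not my exact code: the 5x5 kernels match the filter size I mention in the visualization below, but the "same" padding and the width of the hidden fc layer are illustrative guesses.

    import torch.nn as nn

    # A minimal sketch of the Part 1 architecture. The 5x5 kernels match the
    # filter size shown later; the "same" padding and the 120-unit hidden fc
    # layer are illustrative guesses. Inputs are 3-channel 28x28 images, per
    # the channel counts discussed below.
    class Net(nn.Module):
        def __init__(self, n_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(2),  # 28x28 -> 14x14
                nn.Conv2d(32, 32, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(2),  # 14x14 -> 7x7
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 7 * 7, 120),
                nn.ReLU(),
                nn.Linear(120, n_classes),
            )

        def forward(self, x):
            return self.classifier(self.features(x))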

Frankly, this worked out pretty well, so I wasn't very motivated to play around that much. I did experiment with the channel counts (e.g. going 3 -> 16 -> 32 instead of 3 -> 32 -> 32, as well as going up to 64 and higher), adding some new layers (BatchNorm2d), and removing some of the existing ones, but these didn't yield notably better results, and in some cases got worse. The same applied when tuning some of the other hyperparameters (optimizer/learning rate/weight decay). The main change I ended up making was dropping the learning rate to 1e-3 and increasing the number of epochs a bit, which yielded somewhat better accuracy while not taking too long to converge (practically, I didn't want the thing to need to run for a significant amount of time). At that point I was also reaching the stated accuracy goals (>90%, and possibly >92%), so I didn't experiment much further.
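
The final training setup, sketched out (the dataloader and the epoch count are placeholders; the loss and the lowered 1e-3 learning rate are the ones described above):

    import torch
    import torch.nn as nn

    # Sketch of the final training setup: cross-entropy loss and Adam at the
    # lowered 1e-3 learning rate. train_loader is assumed to exist, and the
    # epoch count here is a placeholder.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    net = Net().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

    n_epochs = 20  # placeholder; chosen by watching the validation curves below
    for epoch in range(n_epochs):
        net.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(net(images), labels)
            loss.backward()
            optimizer.step()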

Below are the training/validation losses and accuracies plotted against the number of epochs. Notice that at a certain point, the validation results stop improving while the training results keep getting better, which suggests overfitting to the training set. This is generally how I decided on the number of epochs: past a certain point, the validation results would stabilize and then start getting worse.
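
For reference, here's a sketch of the epoch-end evaluation behind curves like these (names reference the snippets above; the training-side histories are accumulated the same way):

    import torch

    # Sketch of the epoch-end evaluation used to build the curves; net,
    # val_loader, and device reference the snippets above.
    def accuracy(net, loader):
        net.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in loader:
                preds = net(images.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        return 100.0 * correct / total

    val_acc_history = []  # append accuracy(net, val_loader) after each epoch, then plot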

Here are the per-class accuracies on the validation and test sets:

Type of Object   Validation Accuracy (%)   Test Accuracy (%)
t-shirt/top      86.95                     86.2
trouser          98.69                     98.7
pullover         88.02                     87.9
dress            93.69                     93.5
coat             88.50                     88.3
sandal           97.92                     97.5
shirt            80.30                     76.1
sneaker          99.07                     97.9
bag              98.19                     98.4
ankle boot       96.80                     96.3
overall          92.86                     92.08

We can see that by far the worst-classified class was shirts. I suspect this is fairly natural, given the high variation within the class, and especially how similar shirts are to some of the other classes, specifically t-shirt/top and pullover (and probably even coat), since they all have generally similar shapes; notice that the accuracy for all of these classes was somewhat lower than the rest. On the other hand, classes that were noticeably different from all others (sneaker, bag, trouser, etc.) had very high accuracy since they didn't have to deal with that problem.
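
For reference, a sketch of how the per-class accuracies in the table can be computed (again reusing names from the snippets above):

    import torch

    # Sketch of the per-class accuracy computation behind the table above.
    # net, test_loader, and device are assumed from the earlier snippets;
    # correct/total counts are accumulated per ground-truth label.
    n_classes = 10
    correct = torch.zeros(n_classes)
    total = torch.zeros(n_classes)
    net.eval()
    with torch.no_grad():
        for images, labels in test_loader:
            preds = net(images.to(device)).argmax(dim=1).cpu()
            for c in range(n_classes):
                mask = labels == c
                correct[c] += (preds[mask] == c).sum()
                total[c] += mask.sum()
    per_class_acc = 100 * correct / total
    overall_acc = 100 * correct.sum() / total.sum()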

Here are some examples of images that the classifier got right/wrong.

They go in order of (correct) (correct) (incorrect) (incorrect), where each row corresponds to a class (in the same order as the table above).

Here are the filters of the first layer of the CNN, as requested by the spec and clarified on Piazza. I used a 5x5 kernel with 32 output channels, which is why there are 32 learned 5x5 filters here.

Or, if you prefer them in black and white instead of plt's default color scheme... (update: I realized this is what the Piazza post asks for):
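
For the curious, a sketch of how a filter grid like this can be rendered. Averaging the 3 input channels down to one 5x5 image per filter is just an illustrative choice; plotting each channel separately works too.

    import matplotlib.pyplot as plt

    # Sketch of the filter visualization. The first conv layer's weights have
    # shape (32, 3, 5, 5); averaging over the 3 input channels (an
    # illustrative choice) gives one 5x5 image per filter. cmap="gray" gives
    # the black-and-white version; dropping it gives plt's default colormap.
    weights = net.features[0].weight.detach().cpu().mean(dim=1).numpy()
    fig, axes = plt.subplots(4, 8, figsize=(8, 4))
    for i, ax in enumerate(axes.flat):
        ax.imshow(weights[i], cmap="gray")
        ax.axis("off")
    plt.show()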

Part 2

This is the architecture I ended up using for this part of the project:

conv (32) -> relu -> conv (128) -> relu -> conv (256) -> relu -> conv (128) -> relu -> conv (32) -> relu -> conv (N_CLASS)
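
Sketched in PyTorch, under the assumption of 3x3 "same" convolutions (the kernel sizes, including the final 1x1, are guesses for illustration; the channel progression is the one listed):

    import torch.nn as nn

    N_CLASS = 5  # Others, Facade, Pillar, Window, Balcony

    # Sketch of the fully convolutional stack above. The 3x3 kernels with
    # padding 1 (preserving spatial size) and the final 1x1 conv are
    # assumptions; the channel progression is as listed.
    seg_net = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(32, 128, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(128, 256, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(256, 128, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(128, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(32, N_CLASS, kernel_size=1),  # per-pixel class scores
    )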

where conv (x) represents a convolutional layer with x output channels (same notation as in the previous part). Again, my main priority with the architecture and hyperparameters was a reasonable runtime (at worst, slightly under an hour) without trading off too much accuracy, since Colab lags my computer and I refuse to do that much waiting, nor do I particularly enjoy randomly guessing numbers in hopes of improvement. (I found the general process annoying and tedious. Maybe I should have built some intuition for how to improve things, but based on this assignment and the neural net one for CS 189, I have failed to build any intuition for choosing layer counts and have resorted to guessing, since I'm fairly certain I'm not supposed to just rip an architecture off of known models, which is, in my opinion, even less worthwhile than arbitrary guessing.) The process for arriving at this architecture was basically guessing based on my prior experience with CNNs back in CS 189.

(Sorry about having to read my complaints, I'm just, as you can tell, not very pleased to be working with these.)

Regarding hyperparameters, I ended up just using the provided optimizer/learning rate/weight decay (Adam/1e-3/1e-5). I had actually spent a significant amount of time playing with these (fortunately I had a bit more intuition about LR/decay than about the number of layers), mostly lowering the learning rate and weight decay and training for longer in an attempt to improve accuracy, but it turned out that toying with the layer architecture was far more useful in this case. The final setup ran in a reasonable amount of time with reasonable accuracy (comfortably above the specified 0.45 AP), so I was pretty satisfied and didn't linger much longer on it. It currently runs for 25 epochs, since it times out on Colab if I run it much longer, I don't want to pay for Pro, and it already passes the threshold.
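
For completeness, that setup in code form (the values are the provided ones; the training loop itself mirrors the Part 1 sketch, just over per-pixel labels):

    import torch
    import torch.nn as nn

    # The provided hyperparameters, as used: Adam with learning rate 1e-3 and
    # weight decay 1e-5, cross-entropy over the per-pixel class scores.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(seg_net.parameters(), lr=1e-3, weight_decay=1e-5)
    n_epochs = 25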

As before, below are the requested training/validation losses plotted against the number of epochs. Unlike the previous graphs, the loss doesn't quite converge, but that's because I'd time out if I ran it any longer on Colab (I tried 30 epochs and it still timed out).

Here are my reported AP values on the test set as requested:

Label          AP
0 (Others)     0.658
1 (Facade)     0.776
2 (Pillar)     0.122
3 (Window)     0.780
4 (Balcony)    0.403
AVERAGE        0.548  (> 0.45 !!)
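
For reference, a sketch of how per-class APs like these can be computed with scikit-learn, assuming the per-pixel class scores and ground-truth labels over the test set have been flattened into arrays (the names scores and gt are hypothetical):

    import numpy as np
    from sklearn.metrics import average_precision_score

    # Sketch of the AP computation behind the table above. scores holds
    # per-pixel class probabilities flattened to shape (n_pixels, N_CLASS) and
    # gt holds the corresponding integer labels; both names are hypothetical
    # and assumed to have been collected over the test set.
    aps = [average_precision_score(gt == c, scores[:, c]) for c in range(N_CLASS)]
    print("per-class AP:", np.round(aps, 3))
    print("average AP:", np.mean(aps))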

Here's an individual image, its manual labeling, and then my model's results on it.

Order:

original -> manual label -> model's results

It actually looks like it does a decent job with most of the image. The main failure is that it tends to see some of the "other" regions as "balcony", which seems somewhat understandable, since balconies aren't that well defined beyond usually being around windows; it also mislabels some "other" as "facade".