COMPSCI 194-26: Computational Photography & Computer Vision
Professors Alexei Efros & Angjoo Kanazawa
October 31st, 2021
Nick Kisel
I trained a convolutional neural network, NoseNet, to detect noses by processing face images and their keypoints, with the following architecture:
NoseNet(
(conv1): Conv2d(1, 12, kernel_size=(7, 7), stride=(1, 1))
(conv2): Conv2d(12, 24, kernel_size=(5, 5), stride=(1, 1))
(conv3): Conv2d(24, 32, kernel_size=(3, 3), stride=(1, 1))
(fc1): Linear(in_features=896, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=2, bias=True)
)
In the hyperparameter-tuned model, conv1 has 16 output channels instead of 12, and conv2 correspondingly takes 16 input channels.
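For reference, here's a minimal PyTorch sketch consistent with that printout. The 2x2 max-pool after each convolution and the 60x80 grayscale input are assumptions; they're what make fc1's 896 input features work out (32 channels x a 4x7 feature map).

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    def __init__(self, conv1_channels=12):  # 16 in the hyperparameter-tuned model
        super().__init__()
        self.conv1 = nn.Conv2d(1, conv1_channels, kernel_size=7)
        self.conv2 = nn.Conv2d(conv1_channels, 24, kernel_size=5)
        self.conv3 = nn.Conv2d(24, 32, kernel_size=3)
        self.fc1 = nn.Linear(896, 120)
        self.fc2 = nn.Linear(120, 2)  # (x, y) of the nose keypoint

    def forward(self, x):                            # x: (N, 1, 60, 80), assumed input size
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # -> (N, c1, 27, 37)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # -> (N, 24, 11, 16)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)   # -> (N, 32, 4, 7)
        x = torch.flatten(x, 1)                      # -> (N, 896)
        return self.fc2(F.relu(self.fc1(x)))         # -> (N, 2)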
Some hits:
Here, the X marks my prediction, and the green O marks the actual point!
Some misses:
Before I explain the misses, let's be clear that we're detecting the base of the nose, rather than the tip.
I think these are misses for two main reasons. First, the model tends to assume that the nose should be centered in any photo, since every photo it has learned from has the face, and therefore the nose, centered. This particularly explains why the left photo is off: the nose is expected near the center of the frame, and the top of the cheek is what most closely emulates the shading of a nose in that region of the photo.
Second, the lighting of the faces is fairly consistent across the dataset, so the lighting around each nose is roughly the same; in particular, every nose casts a shadow. The network thus learns to look for dark areas, or transitions from light to dark, but sometimes lacks the context to distinguish the shadow of a nose from the shadow of, say, the mouth, an eye socket, or a chin. That's how we end up with predictions like the middle photo's.
The right photo is a combination of the two errors: the prediction biases toward the center and gets caught on the side of the nose rather than the base.
After a few trial runs where I modified the learning rate, I settled on 5e-5. I had initially found that the model converged on a "solution" quickly, meaning it would come close to a minimum-loss point, but then consistently overshoot when it tried to correct itself and hit the minimum (see the graph at right, "Default learning rate run"). Thus, I decided to slow the learning rate down so that the model would approach that minimum more gracefully and closely as I kept training.
Notice that the default 1e-3 learning rate begins very favorably, and its first loss is low enough to bring the scale down to 0.002 per vertical axis bar. However, this is misleading: because the learning rate is so high compared to the other two models, the losses down the line keep fluctuating instead of decreasing. While this model reached predictions similar to the low-LR model's in a shorter amount of time, it ultimately stopped improving after ten or so epochs. I observed that the validation loss for both 5e-5 models dropped below 0.002 as the epochs went on, while the default model plateaued at 0.002 and didn't sink any lower, regardless of how long I trained.
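A minimal sketch of these training runs follows; Adam and MSE loss are assumptions, since the write-up only fixes the learning rates and the batch size of 1.

import torch

def train(model, train_loader, val_loader, lr=5e-5, epochs=25):
    # Adam and MSE loss are assumptions; only the learning rate
    # and batch size (1) are specified above.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    for epoch in range(epochs):
        model.train()
        for image, keypoints in train_loader:  # batch size 1
            optimizer.zero_grad()
            loss = criterion(model(image), keypoints)
            loss.backward()
            optimizer.step()
        # Track validation loss each epoch to compare runs, as in the graphs.
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(im), kp).item()
                           for im, kp in val_loader) / len(val_loader)
        print(f"epoch {epoch}: validation loss {val_loss:.4f}")

The same loop applies to the later runs, with only the learning rate and epoch count swapped.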
Low (5e-5) learning rate run
Hyperparameter-tuned run (conv1 increased from 12 to 16 channels, 5e-5 learning rate)
Default learning rate run
I first ran conv1 with 12 channels; increasing that number to 16 appears to reduce the "bumpiness" and "variability" of the validation loss, according to the graph. In other words, the network trains more consistently overall, but the final results after many epochs aren't actually much different between the two 5e-5 models.
I used a batch size of 1 for all runs.
Next, we included all of the face points. A few runs showed that the learning rate should be lowered again, so I dropped it to 0.00002 (2e-5), again with a batch size of 1. In all, I trained for 140 epochs, until my validation loss stopped decreasing.
FaceNet(
(conv1): Conv2d(1, 12, kernel_size=(7, 7), stride=(1, 1))
(conv2): Conv2d(12, 20, kernel_size=(5, 5), stride=(1, 1))
(conv3): Conv2d(20, 25, kernel_size=(5, 5), stride=(1, 1))
(conv4): Conv2d(25, 32, kernel_size=(5, 5), stride=(1, 1))
(conv5): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=4032, out_features=300, bias=True)
(fc2): Linear(in_features=300, out_features=116, bias=True)
(features): Sequential(
(0): Conv2d(1, 12, kernel_size=(7, 7), stride=(1, 1))
(1): ReLU()
(2): Conv2d(12, 20, kernel_size=(5, 5), stride=(1, 1))
(3): ReLU()
(4): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(20, 25, kernel_size=(5, 5), stride=(1, 1))
(6): ReLU()
(7): Conv2d(25, 32, kernel_size=(5, 5), stride=(1, 1))
(8): ReLU()
(9): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(10): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
(11): ReLU()
(12): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
)
)
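Since the printout lists both the individual layers and the features Sequential, here's a sketch of how the forward pass presumably fits together. The 120x160 grayscale input size is an assumption; it's what makes fc1's 4032 input features work out (32 channels x a 9x14 feature map).

import torch
import torch.nn.functional as F

def forward(self, x):          # x: (N, 1, 120, 160); the input size is an assumption
    x = self.features(x)       # -> (N, 32, 9, 14)
    x = torch.flatten(x, 1)    # -> (N, 4032)
    x = F.relu(self.fc1(x))    # -> (N, 300)
    return self.fc2(x)         # -> (N, 116): 58 keypoints times (x, y)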
Some hits:
Some misses:
A miss while looking away.
A miss while looking away.
A miss while looking away.
Firstly, there are a lot more points to predict this time. One difference this invites is that once a single point is chosen, the positions of the other points generally fall into place relative to it; in effect, the points are dependent. This invites a lot of errors based on how we've trained the model: if some point is thought to be the tip of the nose, it only makes sense for the model to map a few points around it as the sides of the nose, a few points diagonally up and to the left as the right eye, and so on. When one such prediction goes wrong, a number of others may be influenced to fall in line and map to the wrong points.
While there are some extreme transformations that wouldn't make sense (e.g. flipping a face horizontally without re-marking the points), we haven't drastically modified most of the faces the model trains on, beyond a few degrees of rotation and a few translations (keypoint-aware augmentations like the sketch below). That means that, taking an extreme transformation like an upside-down face as an example, the model would probably still try to map points right-side-up. That case doesn't exist in our dataset, but it's replaced by a similar "rotated face" problem, where a face turned away from the camera isn't mapped particularly well by the model.
However, not all is lost, because the first slightly turned face is quite impressively
mapped!
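For reference, a keypoint-aware rotation of the kind described above might look like the hypothetical helper below; the angle range and the scipy-based implementation are assumptions.

import numpy as np
from scipy.ndimage import rotate

def rotate_face(image, keypoints, max_degrees=5):
    # Hypothetical helper sketching "a few degrees of rotation"; the angle
    # range and the use of scipy are assumptions, not the original code.
    angle = np.random.uniform(-max_degrees, max_degrees)
    rotated = rotate(image, angle, reshape=False)  # CCW on screen for angle > 0
    # Rotate the (x, y) keypoints about the image center by the same amount.
    # With y pointing down, an on-screen CCW rotation uses this matrix:
    theta = np.deg2rad(angle)
    m = np.array([[np.cos(theta),  np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    h, w = image.shape[:2]
    center = np.array([(w - 1) / 2, (h - 1) / 2])
    return rotated, (keypoints - center) @ m.T + center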
Lastly, I downloaded the large ibug dataset and trained on all the annotated faces, cropping the images to their face bounding boxes and making slight augmentations (rotations, translations, etc.). I took the advice on the project page and used ResNet18 as a base, modifying the first convolutional layer conv1 to accept just one color channel, and the last fully connected layer to produce 136 outputs - one for each of the 68 points' x & y coordinates.
I chose a learning rate of 0.0001 (1e-4) and trained for 25 epochs.
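Concretely, the two modifications look something like this, using torchvision's resnet18 (whether to start from pretrained weights isn't specified above, so the default is left in place):

import torch.nn as nn
from torchvision import models

model = models.resnet18()
# Accept one grayscale channel instead of three RGB channels.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Regress 68 keypoints * (x, y) = 136 outputs.
model.fc = nn.Linear(512, 136)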
ResNet(
(conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer3): Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=512, out_features=136, bias=True)
)
Loss per epoch over the 25 training epochs:
153.71, 128.89, 91.88, 64.59, 41.33, 34.25, 27.97, 25.27, 28.61, 22.46, 20.84, 19.39, 20.76, 19.84, 19.18, 18.67, 18.38, 18.79, 17.67, 17.34, 17.94, 17.45, 17.89, 18.24, 17.91
Some hits:
Some misses:
He's not that chubby.
I don't think there's any one strong reason for the misidentifications in this set. In fact, many of the "good" images above share characteristics with these "bad" ones. Personally, I think a lack of strength in the image augmentation led to these errors - maybe I should have focused on more examples where the faces are in unnatural orientations and lighting conditions.
The most obvious errors are those where the edge of the face is mapped off of the face entirely. This was probably caused by a number of face points lying outside the bounding boxes, or being obscured, in the original photos, and the model remembering that some face points can be identified even when the full face isn't visible. As you can see from the "bad" predictions, and even many of the "good" ones, my model tends to overestimate how large a face really is.
You'll notice the same overshoot problem as with the regular images.
Obama's face actually turned out quite well!
My face, though, is a little bit of a mixed bag. The model latched onto some of my facial features correctly, but it mistook the forest for the side of my face.
As for Gigachad, the neural network has never seen such gorgeousness in a single photo, and especially not with his head turned. I suspect that this misidentification is due to both the lower contrast with the background and the uncharacteristic squareness of his face (most faces are rounder).