Facial Keypoint Detection with Neural Networks
Roma Desai | CS-194
Project 4
OVERVIEW
For this project, I implemented facial keypoint detection using convolutional neural networks, in three parts. First, I created a net to detect just the noses of people in the IMM Face Dataset. Next, I modified this net to learn the points outlining an entire face. Finally, I scaled these ideas to a much larger dataset and used the standard ResNet CNN architecture to learn the facial keypoints of those faces.
PART 1: NOSE TIP DETECTION
For the first part, I began by creating a PyTorch DataLoader class to load all my images along with their keypoints. This dataloader abstraction made it easy to fetch data and interface with other PyTorch utilities. I scaled the images to 60 x 80. Here are some images with the nose point shown (before being resized):
[Figure: sample IMM images with the ground-truth nose point]
Next, I created the actual neural network. I used 3 convolutional layers, each followed by a max pool layer, and then 2 fully connected layers whose final output size is 2, representing the (x, y) coordinates of the nose point. Here are the specifications of my net:
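In code, a net with this shape might look like the minimal sketch below. The three conv layers each followed by max pooling, the two fully connected layers, and the 2-value output come from the description above; the channel counts, 3x3 kernels, hidden width, and the 1 x 60 x 80 grayscale input are illustrative assumptions rather than my exact specification.

import torch.nn as nn
import torch.nn.functional as F

class NoseNet(nn.Module):
    """3 conv layers (each followed by a 2x2 max pool) + 2 FC layers -> (x, y)."""

    def __init__(self):
        super().__init__()
        # channel counts and kernel sizes below are assumptions
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3)   # 1x60x80  -> 16x58x78
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3)  # 16x29x39 -> 32x27x37
        self.conv3 = nn.Conv2d(32, 32, kernel_size=3)  # 32x13x18 -> 32x11x16
        self.fc1 = nn.Linear(32 * 5 * 8, 128)          # third pool leaves 32x5x8
        self.fc2 = nn.Linear(128, 2)                   # (x, y) of the nose tip

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)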
Finally, I trained the neural network on my training data and tested on my validation data. My training function calculated the loss using mean squared error and updated the weights accordingly. I trained my net for 15 epochs. Below is the plot of the training and validation loss per epoch: the x-axis represents epochs and the y-axis represents loss. (I did not normalize my points, which is why the loss values are so large.) The blue line is the training loss and the orange line is the validation loss. As you can see, the loss decreased as the net trained on more data and had more time to update its weights.
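A minimal sketch of this training loop is below. The MSE loss and 15 epochs come from the description above; the Adam optimizer and learning rate are assumptions, and net, train_loader, and val_loader stand for the network and dataloaders from the earlier steps.

import torch

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # optimizer/lr assumed

for epoch in range(15):
    net.train()
    for images, keypoints in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(images), keypoints)  # mean squared error on (x, y)
        loss.backward()
        optimizer.step()

    # validation loss, computed without gradient tracking
    net.eval()
    with torch.no_grad():
        val_loss = sum(criterion(net(x), y).item()
                       for x, y in val_loader) / len(val_loader)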
Here are some examples of my learned points. The green points are the ground truth and the red points are the outputs of the net.
[Figure: predicted (red) vs. ground-truth (green) nose points]
As you can see, not all the nose points were successfully identified. I believe this is because of our small dataset and variation in facial view/expression. The images that failed were usually side views, which were hard to handle since the majority of the training images were not taken from sharp angles. As you can see above, one of the images also has a different facial expression, which may have thrown off the results. While a lack of data is always a problem in machine learning, I believe more varied data or data augmentation could have improved the results.
PART 2: FULL FACIAL KEYPOINTS DETECTION
This section was very similar to part one, except I modified my neural network to learn a full set of facial keypoints instead of just the nose. This time, I loaded all 58 facial keypoints in my dataloader. However, to learn a full set of 58 points, we need more data. Since we do not have new images, we can augment the current dataset to produce similar but different images that will help with training. I implemented random brightness/contrast changes, random rotations, and random shifts, sketched below.
Here are some samples:
[Figure: augmented training samples]
Next, I modified my net from part 1 to predict 58 * 2 = 116 output values instead of just 2. I added more convolutional layers, more max pool layers, and changed the inputs/outputs of the various layers. Here is what my net looked like:
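Since the exact layer specification lives in the figure, the sketch below only illustrates the idea: a deeper conv stack than part 1, ending in 58 * 2 = 116 outputs. The channel counts are assumptions, and nn.LazyLinear (PyTorch 1.8+) is used so the sketch works without committing to a particular input resolution.

import torch.nn as nn

# deeper conv stack than part 1; channel counts are assumptions
face_net = nn.Sequential(
    nn.Conv2d(1, 16, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(256), nn.ReLU(),  # infers the flattened size on first use
    nn.Linear(256, 116),            # 58 keypoints x (x, y)
)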
After training for 15 epochs, I plotted the loss. Below is the plot of the training (blue) and validation (orange) loss per epoch, with epochs on the x-axis and loss on the y-axis. (Again, I did not normalize my points, which is why the loss values are so large.) As before, the loss decreased as the net trained on more data and had more time to update its weights.
Here are some of my results. As you can see, some did well while others did not. Again, I think this is due to a lack of varied data, even with the data augmentation. The net did not do well on faces that were angled away from the center or tilted; for the most part its predictions looked like head-on faces. Surprisingly, the net got the shape of the face roughly correct; it is just the location that seems a little off. This may indicate that the net assumes most faces are forward facing.
[Figure: predicted (red) vs. ground-truth (green) facial keypoints]
In addition, I visualized the learned filters. Here are the filters for the first and second layers; the number of filters grows quickly with depth, so I only display the first 2 layers.
First Layer:
Second Layer:
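The filters are read straight out of the trained net's weights; a typical way to extract and display them looks like the sketch below (the attribute name conv1 assumes the first conv layer is exposed that way, and only the first 8 filters are drawn).

import matplotlib.pyplot as plt

# first-layer weights: (out_channels, in_channels, kH, kW)
filters = net.conv1.weight.data.cpu().numpy()[:8]

fig, axes = plt.subplots(1, len(filters), figsize=(2 * len(filters), 2))
for ax, f in zip(axes, filters):
    ax.imshow(f[0], cmap="gray")  # single input channel, so show f[0]
    ax.axis("off")
plt.show()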
PART 3: TRAIN WITH LARGER DATASET
For part 3, I used the larger ibug dataset with over 6,000 images. Building on the previous parts, I modified my dataloader to read in the new images, crop them according to the bounding box around each face, and resize the crops to 224 x 224. I also implemented data augmentation for better results. Here are a few images along with the ground-truth keypoints:
[Figure: ibug samples with ground-truth keypoints]
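A sketch of the per-sample loading step described above follows. The bounding-box format (x, y, w, h in pixels) and the OpenCV calls are assumptions; the keypoints are shifted into the crop's frame and scaled along with the resize.

import cv2
import numpy as np

def load_sample(image_path, bbox, keypoints):
    """Crop a face to its bounding box, resize to 224x224, and rescale
    the keypoints to match. bbox assumed to be (x, y, w, h) in pixels."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    x, y, w, h = [int(v) for v in bbox]
    crop = cv2.resize(image[y:y + h, x:x + w], (224, 224))

    # move keypoints into the crop's coordinate frame, then scale to 224x224
    kp = (keypoints - np.array([x, y])) * np.array([224 / w, 224 / h])
    return crop, kp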
Next, I loaded a standard ResNet18 PyTorch model to train. I modified this net to take a single-channel (grayscale) input and to output 136 values for the 68 facial keypoints. Below is my net in detail.
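Concretely, the two changes to the stock torchvision model might look like the sketch below (whether to start from pretrained weights is an assumption here; the first-layer kernel/stride/padding values mirror the stock layer).

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=False)

# accept 1-channel (grayscale) input instead of RGB
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# replace the classification head: 68 keypoints x (x, y) = 136 outputs
model.fc = nn.Linear(model.fc.in_features, 136)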
Below is the plot of the training and validation loss per epoch. My training set consisted of 5,000 images and my validation set of the rest. As before, the x-axis represents epochs, the y-axis represents the (unnormalized, hence large) loss, and the blue and orange lines are the training and validation losses. The loss again decreased as the net trained on more data and had more time to update its weights.
Here are some resulting images from the validation set, compared against the ground-truth points in green.
[Figure: validation-set predictions vs. green ground-truth points]
Here are some resulting images from the test set.
[Figure: test-set predictions]
I also ran the model on three of my own images; the results are below. Faces that were cropped similarly to the training data produced better results, which I think may be a sign of overfitting. I also noticed that the keypoints for the tilted face were slightly harder for the model to determine.
[Figure: predictions on my own photos]
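For reference, the inference pass on a new photo might look like the sketch below, under the same preprocessing assumptions as training (a grayscale face crop resized to 224 x 224; any pixel normalization used during training is omitted here).

import cv2
import numpy as np
import torch

def predict_keypoints(model, image_path, bbox):
    """Run the trained model on one photo; bbox = (x, y, w, h) face crop."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    x, y, w, h = bbox
    crop = cv2.resize(image[y:y + h, x:x + w], (224, 224)).astype(np.float32)

    model.eval()
    with torch.no_grad():
        inp = torch.from_numpy(crop).unsqueeze(0).unsqueeze(0)  # 1x1x224x224
        points = model(inp).view(68, 2).numpy()

    # map the 224x224 predictions back into the original photo
    return points * np.array([w / 224, h / 224]) + np.array([x, y])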
REFLECTIONS:
Overall, I think this was a very interesting project that showed me the depth of possibilities in machine learning. I really enjoyed reading about the material in detail, finally learning PyTorch (a long-time goal of mine that I had never been disciplined enough to start), and playing with parameters to try to perfect the network. While I had no prior machine learning experience, I am really excited to apply my newly learned skills to other projects. This project also made me realize how difficult even simple face detection is, and what that means for the field of computer vision as a whole. By struggling with this simple facial feature recognition task, I came to appreciate, and wonder, how these concepts scale to applications such as autonomous vehicles and beyond.
SOURCES:
I used the following resources to learn more about convolutional neural networks and various aspects of using PyTorch.
· https://towardsdatascience.com/how-to-cook-neural-nets-with-pytorch-7954c1e62e16
· https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
· https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#define-the-network
· https://www.youtube.com/watch?v=Zvd276j9sZ8&ab_channel=AladdinPersson
· https://pytorch.org/docs/stable/notes/cuda.html
· https://stackoverflow.com/questions/24659814/how-to-write-a-numpy-array-to-a-csv-file