Author: Isaac Bae
Class: CS 194-26 (UC Berkeley)
Date: 12/11/21


Poor Man's Augmented Reality


Setup


The first step of this project is to find a flat surface and place a box on it. The box should have a clean, regular pattern. Then, you must decide on at least 20 points based on the box's pattern and get the 3D world coordinates of those points through manual measurements. For example, the part of the box that I worked with had dimensions 32cm x 20cm x 8cm, and I just needed to set some origin (like a corner) to get the 3D points in cm. Then, you must take a video with the box in the center while you move around it. Depending on the tracker you will be using (which will be discussed shortly), it may help to avoid any sudden and/or quick camera movements. Here is the link to my video.
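To make this concrete, here is a minimal sketch (not my exact points) of how the measured 3D world coordinates might be stored, assuming the origin is at one corner of the box and the axes run along its edges:

```python
import numpy as np

# Hypothetical world coordinates (in cm) for a few corners of the
# 32cm x 20cm x 8cm box, with the origin at one bottom corner and
# the axes along its edges. The actual 20+ points come from your
# own measurements of the box's pattern.
world_points = np.array([
    [0.0,  0.0,  0.0],    # origin corner
    [32.0, 0.0,  0.0],    # corner along the 32cm edge
    [0.0,  20.0, 0.0],    # corner along the 20cm edge
    [0.0,  0.0,  8.0],    # corner along the 8cm (height) edge
    [32.0, 20.0, 8.0],    # opposite top corner
    # ... remaining measured pattern points ...
], dtype=np.float64)
```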


Selecting 2D Keypoints with Known 3D World Coordinates


Once you have everything set up, you have to select the 20+ keypoints that you used before in every 2D video frame. Sounds tedious, doesn't it? Well, it doesn't need to be that way, thanks to some useful tools such as off-the-shelf trackers. The idea is to select the keypoints only in the first frame, and let the tools propagate those points to the rest of the video. I used a CSRT tracker, which tracked each point within an 8 x 8 bounding box. You will be able to see the results of the tracker in the output video (which is in the last section).
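As a rough illustration, here is how the per-point tracking might look with OpenCV's CSRT tracker. This is a sketch, assuming `frames` is the list of video frames and `keypoints` holds the points selected in the first frame; depending on your OpenCV build, the constructor may be cv2.legacy.TrackerCSRT_create instead.

```python
import cv2

PATCH = 8  # side length of the small box tracked around each point

trackers = []
for (x, y) in keypoints:
    tracker = cv2.TrackerCSRT_create()
    # initialize each tracker on an 8 x 8 box centered on the clicked point
    tracker.init(frames[0], (int(x - PATCH / 2), int(y - PATCH / 2), PATCH, PATCH))
    trackers.append(tracker)

tracked = [list(keypoints)]  # 2D coordinates of every keypoint, per frame
for frame in frames[1:]:
    pts = []
    for tracker in trackers:
        ok, (bx, by, bw, bh) = tracker.update(frame)
        pts.append((bx + bw / 2, by + bh / 2))  # box center = tracked point
    tracked.append(pts)
```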


Calibrating the Camera


Using the 2D image coordinates (for every video frame) and their corresponding 3D world coordinates, you must now solve a least-squares problem to fit the camera projection matrix, which converts 4D (homogeneous) real-world coordinates to 3D (homogeneous) image coordinates. I will not go into detail on this procedure, but here is a diagram that I used to get myself started:


From this link

cpm
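For reference, here is a minimal sketch of the least-squares fit for one frame, assuming `pts3d` is an (N, 3) array of measured world coordinates and `pts2d` is the (N, 2) array of tracked pixel coordinates. Fixing the bottom-right entry of the matrix to 1 is just one common way to remove the scale ambiguity.

```python
import numpy as np

def fit_projection_matrix(pts3d, pts2d):
    # Build two linear equations per 3D-2D correspondence, with the
    # last entry of the 3x4 projection matrix fixed to 1.
    A, b = [], []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z])
        b.extend([u, v])
    p, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(p, 1.0).reshape(3, 4)
```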

Projecting a Cube in the Scene


With the camera projection matrix (for every video frame), projecting a cube is now as simple as converting the cube's specified axes points (i.e., its 3D world coordinates) into 2D image coordinates and drawing the cube onto the frame. After going through all the video frames, you have produced a brand new video with a rendered cube in the scene. Here is my output video.
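As a sketch of that step for a single frame, assuming `P` is the fitted 3x4 projection matrix for that frame and the cube is (hypothetically) an 8cm cube sitting on top of the box:

```python
import cv2
import numpy as np

# 3D world coordinates (in cm) of the cube's eight corners, and the
# pairs of corners that form its twelve edges
cube = np.array([[0, 0, 8], [8, 0, 8], [8, 8, 8], [0, 8, 8],
                 [0, 0, 16], [8, 0, 16], [8, 8, 16], [0, 8, 16]], dtype=np.float64)
edges = [(0, 1), (1, 2), (2, 3), (3, 0),
         (4, 5), (5, 6), (6, 7), (7, 4),
         (0, 4), (1, 5), (2, 6), (3, 7)]

homog = np.hstack([cube, np.ones((8, 1))])     # 4D homogeneous world coords
proj = (P @ homog.T).T                         # 3D homogeneous image coords
pix = proj[:, :2] / proj[:, 2:]                # divide by w to get pixels

for i, j in edges:
    p1 = tuple(int(c) for c in pix[i])
    p2 = tuple(int(c) for c in pix[j])
    cv2.line(frame, p1, p2, (0, 255, 0), 2)    # draw one cube edge
```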


Reimplement: A Neural Algorithm of Artistic Style


Author's Note


I will be following the paper “A Neural Algorithm of Artistic Style” by Gatys et al. very closely. When I refer to the "paper", this is what I am talking about. This paper and the project description below explain how to use a modified version of a famous neural network, VGG-19, to essentially combine the content of one image with the style of another. When I say "input image", I (usually) mean a white noise image that will be optimized to contain the content and style mix. All images involving mathematical formulas are in the paper, so look to the paper if you need more information. All other images were taken from Google.


Choosing and Preparing Images


There is no real guideline for choosing the content and style images; let your imagination fly free! There is some preparation that needs to be done on those images before they can be used in the network, which includes making the images square (if needed), resizing them, and applying other transformations. Other than that, the prerequisites are quite tame.
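For example, here is one possible preparation pipeline using torchvision, assuming 512 x 512 squares and the standard ImageNet normalization that VGG-19 expects; the exact size and transforms are a choice, not a requirement, and the filenames are placeholders:

```python
from PIL import Image
import torchvision.transforms as T

prep = T.Compose([
    T.Resize(512),
    T.CenterCrop(512),                        # make the image square
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

content = prep(Image.open("content.jpg").convert("RGB")).unsqueeze(0)
style = prep(Image.open("style.jpg").convert("RGB")).unsqueeze(0)
```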


Dealing with Content


In the paper, there was a discussion on how content is represented in (effectively) every layer of the network. It was stated that the higher (i.e., the closer to the end) a layer is, the more pixel detail is lost, which intuitively makes sense. Since the content shouldn't overtake the input image during optimization, the authors chose 'conv4_2', a layer near the end of the network, as the content representation.

To compute the content loss, we have to get the feature maps from the chosen layer for both the input and content images, and perform the calculation below:


content_loss
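In code, this amounts to a squared-error comparison of the two feature maps. A minimal sketch, assuming `input_feats` and `content_feats` are the 'conv4_2' feature maps of the input image and the content image:

```python
import torch

def content_loss(input_feats, content_feats):
    # 1/2 * sum of squared differences between the feature maps,
    # matching the form of the content loss in the paper
    return 0.5 * torch.sum((input_feats - content_feats) ** 2)
```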

Dealing with Style


Style is a bit more interesting, as it is not exactly clear what style means from the start. However, the authors of the paper were able to express it mathematically by using Gram matrices. In other words, style is measured by the covariance between feature maps in a given layer.


gram_matrix

The authors found that using 'conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', and 'conv5_1' to represent style produced a smoother and more continuous visual experience.

To compute the style loss, we have to get the feature maps in those layers for both the input and style images, and perform the calculation below:

style_loss
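A minimal sketch of both pieces, assuming each feature map is a tensor of shape (1, C, H, W) and the layers and weights follow the choices listed above:

```python
import torch

def gram_matrix(feats):
    _, c, h, w = feats.shape
    f = feats.view(c, h * w)     # flatten each feature map to a row
    return f @ f.t()             # covariance between feature maps

def style_loss(input_feats, style_feats, weights):
    # input_feats / style_feats: lists of feature maps, one per style layer
    loss = 0.0
    for x, s, w_l in zip(input_feats, style_feats, weights):
        _, c, h, w = x.shape
        g_x, g_s = gram_matrix(x), gram_matrix(s)
        # per-layer term normalized as in the paper, then weighted
        loss = loss + w_l * torch.sum((g_x - g_s) ** 2) / (4 * c**2 * (h * w)**2)
    return loss
```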

Modifying VGG-19 Network


As stated before, the VGG-19 network was used to produce the results in the paper, and the authors had a few concrete suggestions: don't use any of the fully connected layers (or any other unneeded layers), and replace the max pooling layers with average pooling layers for improved gradient flow and slightly more appealing results.
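A sketch of those modifications in PyTorch (keeping only the convolutional feature extractor, swapping max pooling for average pooling, and freezing the weights):

```python
import torch.nn as nn
from torchvision.models import vgg19

# .features drops the fully connected classifier entirely
vgg = vgg19(pretrained=True).features.eval()

# replace every max pooling layer with average pooling
for i, layer in enumerate(vgg):
    if isinstance(layer, nn.MaxPool2d):
        vgg[i] = nn.AvgPool2d(kernel_size=2, stride=2)

# the network itself should never be trained
for p in vgg.parameters():
    p.requires_grad_(False)
```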


Optimizing Content and Style Mix for Output


The optimizer that I used was L-BFGS, which many others have found to work well for the task at hand. As an important note, make sure that the network parameters are not being optimized; the input image itself should be the trainable set of parameters.
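Concretely, the setup might look like this (a sketch, assuming `content` is the prepared content image tensor): the white noise image is the only tensor with requires_grad enabled, and it is what L-BFGS optimizes.

```python
import torch

input_img = torch.randn_like(content).requires_grad_(True)  # white noise start
optimizer = torch.optim.LBFGS([input_img])                   # optimize the image, not the network
```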

So, how do you combine the content and style losses? Here is the total loss function:


total_loss

The ratio between alpha and beta determines how much weight is given to the content versus the style.
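Putting it together, one L-BFGS step might look like the sketch below. Here `get_features` is a hypothetical helper that runs the modified VGG and returns the content-layer and style-layer feature maps of the input image; `content_feats`, `style_feats`, and `style_weights` are precomputed as described above.

```python
def step():
    optimizer.zero_grad()
    c_feats, s_feats = get_features(input_img)
    # total loss = alpha * content loss + beta * style loss
    loss = alpha * content_loss(c_feats, content_feats) \
         + beta * style_loss(s_feats, style_feats, style_weights)
    loss.backward()
    return loss

optimizer.step(step)   # L-BFGS calls the closure internally
```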

Overall, the hyperparameters that were (at least somewhat) relevant were the learning rate, the layers used for the content and style losses, the alpha/beta ratio, the style weights, and the number of epochs. I will give an example of how changing one of these affects the results in the next section, but before that, let's look at some results! The first output was produced with an alpha/beta ratio of 3e-9, while the second used a ratio of 8e-7. All other hyperparameters were the same; in particular, the number of epochs was 15000.


Content Image

dancing

Style Image

picasso
im1_1

Content Image

home

Style Image

starry_night
im2_1

Changing Hyperparameters


Here I varied the alpha/beta ratio to be 1e-3, 3e-9, and 1e-11 respectively.


1e-3

im1_3

3e-9

im1_1

1e-11

im1_2