CS 194-26: Final Project

Vinay Chadalavada and Gavin Fure

Part 1: Augmented Reality

Input Video

Here is the video we used as our input; later we will project a cube on top of it.

Tracking Points

In order to calibrate the camera for each frame, we need a set of points with known 3D coordinates, and for each frame we need to find their corresponding 2D image locations. To automate the tracking of these points we used a CSRT tracker. Here is a video of the tracked points.
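Below is a minimal sketch of how this tracking could be set up with OpenCV. The video filename, the initial corner locations, and the box size are placeholders, and depending on the OpenCV build the constructor may live under cv2.legacy instead of cv2.

import cv2

# Sketch only: hypothetical filename and hand-picked starting corners.
cap = cv2.VideoCapture("input_video.mp4")
ok, first_frame = cap.read()

initial_points = [(412, 310), (530, 298), (605, 402)]  # placeholder (x, y) corners
box_size = 30  # small bounding box centered on each tracked corner

trackers = []
for (x, y) in initial_points:
    tracker = cv2.TrackerCSRT_create()  # cv2.legacy.TrackerCSRT_create() on some builds
    tracker.init(first_frame, (x - box_size // 2, y - box_size // 2, box_size, box_size))
    trackers.append(tracker)

tracked_2d = []  # per-frame list of 2D point locations
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_points = []
    for tracker in trackers:
        success, (bx, by, bw, bh) = tracker.update(frame)
        # Use the center of the tracked box as the point location.
        frame_points.append((bx + bw / 2, by + bh / 2))
    tracked_2d.append(frame_points)
cap.release()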

Calibration

At this point we can use least squares to solve for the projection matrix, which is very similar to recovering homographies. The lecture slides say to solve it in the form Ax = 0; however, we fixed the last parameter m_34 to 1, so the system took the form Ax = 2d_points. Each point generates two equations, one for the x coordinate and one for the y coordinate. The x equations took the form [X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z] * parameters = u and the y equations took the form [0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z] * parameters = v, where (u, v) are the 2D image coordinates and (X, Y, Z) the 3D coordinates.
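Here is a small sketch of that least-squares setup in numpy, assuming pts_3d and pts_2d hold the corresponding 3D points and tracked 2D points for a single frame:

import numpy as np

def calibrate(pts_3d, pts_2d):
    A, b = [], []
    for (X, Y, Z), (u, v) in zip(pts_3d, pts_2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z])
        b.append(u)
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z])
        b.append(v)
    # Solve for the 11 unknown entries; m_34 is fixed to 1.
    m, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    P = np.append(m, 1).reshape(3, 4)  # full 3x4 projection matrix
    return P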

Projecting The Cube

Camera calibration allows us to convert 3D points into 2D image coordinates. We can now project our 3D cube into our video using the calibration we found for each frame. We placed our AR cube on top of our box. The results are below.
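A sketch of the projection step, using the per-frame matrix P recovered by the calibrate() sketch above; the cube corner coordinates are placeholders for wherever the cube sits in the chosen world frame:

import numpy as np

def project(P, pts_3d):
    pts_h = np.hstack([pts_3d, np.ones((len(pts_3d), 1))])  # homogeneous coordinates
    proj = pts_h @ P.T
    return proj[:, :2] / proj[:, 2:]  # divide by w to get pixel coordinates

# Placeholder: axis-aligned unit cube resting on a box whose top face is at z = 1.
cube = np.array([[0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1],
                 [0, 0, 2], [1, 0, 2], [1, 1, 2], [0, 1, 2]], dtype=float)
# corners_2d = project(P, cube)   # then connect the corners with cv2.line per frame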

Part 2: Lightfield Camera

Overview

Lightfield cameras use a microlens array to capture a few hundred tiny little pictures at once that provide slightly different views of a subject. This setup allows for fun, interesting effects in post-processing like refocusing, vantage point shifting, and aperture adjustment. We will be using data from the Stanford Light Field Archive.

Getting our bearings

First, let's get our bearings. We can traverse across the image grid in two directions to see what the view is from different angles:

Depth Refocusing

If we average all of the images in a grid together, we get an image that looks like this:



The background is in focus, but the foreground is blurry. This is because the foreground shifts more than the background between the different sub-aperture images. If we can line up each image's foreground, we can synthetically shift the focus onto the foreground. We are able to do this because the microlens array is regularly spaced, and the offsets between subapertures are included in the dataset. Each image is labelled with a (u, v) pair that represents its position on the camera plane. We chose a center cell [9, 9] to compare each image to. For each cell, we calculate the difference between that image's (u, v) values and the center's, multiply this difference by a scalar alpha, and then shift the image using np.roll. We then average all of the resulting images. By sweeping alpha, we can change the depth at which the image is focused! Here are our results. The chess gif used an alpha range of [-.2, .7], and the orb image used an alpha range of [-.7, .7].
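A sketch of this refocusing step, assuming the sub-aperture images are stored in a dict keyed by their (u, v) grid position with the center view at (9, 9); the axis order and sign of the shift are conventions that depend on how the dataset is indexed:

import numpy as np

def refocus(images, alpha, center=(9, 9)):
    shifted = []
    for (u, v), img in images.items():
        du, dv = u - center[0], v - center[1]
        # Shift each sub-aperture image by its offset from the center, scaled by alpha.
        shifted.append(np.roll(img, (int(round(alpha * du)), int(round(alpha * dv))), axis=(0, 1)))
    return np.mean(shifted, axis=0)

# Sweeping alpha moves the plane of focus, e.g.:
# frames = [refocus(images, a) for a in np.linspace(-0.2, 0.7, 30)]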



Aperture Shift

Averaging adjacent images together in the light field grid mimics a camera with a larger aperture: we are essentially letting more rays in by averaging more pictures together. We chose a center image (again [9, 9]) and averaged it with the images within a surrounding radius. By increasing the radius, we mimic a larger aperture, since we are effectively widening the cone of light we collect. Here are our results, which iterate from radius=0 to radius=12:
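A sketch of this aperture simulation, reusing the same (u, v)-keyed image dict; interpreting "radius" as a square window around the center view is an assumption about the exact neighborhood used:

import numpy as np

def adjust_aperture(images, radius, center=(9, 9)):
    # Average only the sub-aperture images within `radius` grid cells of the center.
    selected = [img for (u, v), img in images.items()
                if abs(u - center[0]) <= radius and abs(v - center[1]) <= radius]
    return np.mean(selected, axis=0)

# radius=0 reproduces the single center view (pinhole-like aperture);
# radius=12 averages nearly the whole grid, mimicking a much larger aperture.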

Lightfield Summary

Lightfield cameras are pretty interesting. This kind of data is very versatile and opens up a lot of cool possibilities in post, but it can be pretty difficult to work with. It requires much more data than a traditional photo and is pretty slow to process. However, it is very powerful, essentially giving you a lot of control over a virtual camera that has captured much more of the scene than a normal camera would!


Part 3: A Neural Algorithm of Style Transfer

For this project we want to combine the structure of a photo with the style elements of an art piece to create new art. We start from a white noise image and use gradient descent to optimize it against a loss function. The structure loss was defined as the SSD between the activation maps produced at conv3_1. For style, we also take activation maps and multiply each by its transpose to get a Gram matrix, which captures the correlations between activation maps and serves as a higher-order statistic of the style. For the style loss the paper uses the SSD between Gram matrices, but MAE loss worked much better for us. To summarize, we used the activation maps from conv3_1 for structure and the Gram matrices produced by conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 for style. To combine both losses, we multiply the structure loss by alpha and the style loss by beta, then sum them. The ratio we ended up choosing was alpha/beta = 10^-2 and our learning rate was 10^-4.
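Below is a rough sketch of this loss setup in PyTorch. Using VGG-19 is an assumption (it is the network the original paper uses), the layer indices simply pick out conv1_1 through conv5_1 in torchvision's vgg19.features, and the mean-reduced MSE/L1 losses stand in for the SSD/MAE described above.

import torch
import torch.nn.functional as F
import torchvision.models as models

# Assumed backbone: VGG-19, as in the Gatys et al. paper.
vgg = models.vgg19(pretrained=True).features.eval()
style_layers = [0, 5, 10, 19, 28]   # conv1_1, conv2_1, conv3_1, conv4_1, conv5_1
content_layer = 10                  # conv3_1

def get_features(x):
    # Run the image through VGG and keep the activation maps we need.
    feats = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in style_layers or i == content_layer:
            feats[i] = x
    return feats

def gram(feat):
    # Correlations between activation maps; assumes batch size 1.
    b, c, h, w = feat.shape
    f = feat.view(c, h * w)
    return f @ f.t()

def total_loss(gen, content_feats, style_grams, alpha=1e-2, beta=1.0):
    gen_feats = get_features(gen)
    content_loss = F.mse_loss(gen_feats[content_layer], content_feats[content_layer])
    style_loss = sum(F.l1_loss(gram(gen_feats[i]), style_grams[i]) for i in style_layers)
    return alpha * content_loss + beta * style_loss

The image being optimized needs requires_grad=True, while the content features and style Gram matrices are computed once from the photo and the artwork and detached, so that only the generated image receives gradients.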

Here Are Some Results

Above are the structure image and art images we used as inputs.

Above are three different results, each generated from a different starting image: white noise, the structure image, and the style image, respectively, were used as the initial image to be optimized by gradient descent.

Here we created a dog that just committed murder! The dog wojak was used as the structure and the art (by Cy Twombly) was used as the style.

Here we merge two pieces of art: as you can see, the natural circular brushstroke structure of the first and the disorienting texture of the second are both present in the result.

Here the structure was the Statue of Liberty and the style was Starry Night. The result came out really well! This is probably our best style transfer. The pairing works especially well since the two images have similar structures and compositions. Not only is the background almost as gorgeous as the original Starry Night, but the texture of the statue itself looks painted as well!

Citations

Dog
Starry Night
Red Circle Art
Album Art
Neural Style Transfer Paper