Final Project

CS 194-26 Fall 2021

Bhuvan Basireddy and Vikranth Srivatsa



Augmented Reality

Setup

We recorded multiple videos and chose the one that performed the best. We noticed that the slower the movement, the better the results were. This is a link to the video we captured and decided to use for our results: https://drive.google.com/file/d/1PxPuNHClUNBsiu9nmwNbT-sTEJRu0lSV/view?usp=sharing . We made sure to mark large points in a grid on the pizza box.

Keypoints with Known 3D Coordinates

We manually added keypoints to the pizza box; this is an example of the kind of points we labeled.
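Below is a minimal sketch of how the 2D clicks on the first frame could be paired with their known 3D coordinates on the box; the grid size, spacing, and file names are illustrative assumptions rather than our exact setup.

```python
import numpy as np
import matplotlib.pyplot as plt
import skimage.io as skio

# Sketch: click the marked grid points on the first frame and pair them with
# their known 3D coordinates on the box. The 5x4 grid and unit spacing below
# are assumptions for illustration.
frame0 = skio.imread("frame0000.jpg")  # hypothetical first frame

plt.imshow(frame0)
pts_2d = np.array(plt.ginput(n=20, timeout=0))  # click points in a fixed order
plt.close()

# Known 3D coordinates for the same points, in the same clicking order
# (top face of the box taken as the z = 0 plane, one grid cell = 1 unit).
xs, ys = np.meshgrid(np.arange(5), np.arange(4))
pts_3d = np.stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)], axis=1)

np.save("pts_2d_frame0.npy", pts_2d)
np.save("pts_3d.npy", pts_3d)
```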

Propagating Keypoints to other Images in the Video

Now, we need to propagate the keypoints to the other frames of the video. We used one CSRT tracker (TrackerCSRT) per point, initialized with a bounding box around each point in the first frame, to keep track of the points. Some points do disappear, but it works decently well. The following are two example videos that display the points being tracked with labels: (points_tracked2.mp4) https://drive.google.com/file/d/15j1U41znIMFrLTzDUX2FC1l-sWSingCv/view?usp=sharing and https://drive.google.com/file/d/15781h_9VHWPVtWKTSj-Nhm7th57c09aY/view?usp=sharing
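A minimal sketch of the per-point CSRT tracking, assuming OpenCV's contrib trackers are available; the bounding-box size and the handling of lost points are illustrative choices.

```python
import cv2
import numpy as np

def track_points(video_path, init_pts_2d, box_size=30):
    """Propagate labeled keypoints through a video using one CSRT tracker per
    point. box_size is an assumed patch size around each point."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()

    half = box_size // 2
    trackers = []
    for x, y in init_pts_2d:
        t = cv2.TrackerCSRT_create()
        t.init(frame, (int(x) - half, int(y) - half, box_size, box_size))
        trackers.append(t)

    tracked = [np.asarray(init_pts_2d, dtype=float)]
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        pts = []
        for i, t in enumerate(trackers):
            success, (x, y, w, h) = t.update(frame)
            # Use the patch center; fall back to the previous location if lost.
            pts.append([x + w / 2, y + h / 2] if success else tracked[-1][i])
        tracked.append(np.asarray(pts, dtype=float))
    cap.release()
    return tracked  # list with one (N, 2) array of point locations per frame
```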

Calibrating the Camera

Now, we need to calibrate the camera. For each frame, we compute the camera projection matrix that transforms the 3D world coordinates of the box points into their 2D image coordinates.
We also manually labeled these points with their axis coordinates.
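The sketch below fits the 3x4 projection matrix for one frame with a least-squares (DLT-style) solve over the 3D-2D correspondences; the function name, array shapes, and the normalization of the last entry to 1 are illustrative assumptions.

```python
import numpy as np

def compute_camera_matrix(pts_3d, pts_2d):
    """Least-squares fit of the 3x4 projection matrix P that maps homogeneous
    3D points to 2D image points (pts_3d is Nx3, pts_2d is Nx2)."""
    n = len(pts_3d)
    A = np.zeros((2 * n, 11))
    b = np.zeros(2 * n)
    for i, ((X, Y, Z), (u, v)) in enumerate(zip(pts_3d, pts_2d)):
        # From u = (p00 X + p01 Y + p02 Z + p03) / (p20 X + p21 Y + p22 Z + 1),
        # and similarly for v.
        A[2 * i]     = [X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z]
        A[2 * i + 1] = [0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z]
        b[2 * i], b[2 * i + 1] = u, v
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.append(p, 1.0).reshape(3, 4)  # last entry fixed to 1
```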

Projecting a Cube in the Scene

Now, we can draw the cube on the box. We define the corners of the cube in world coordinates. For each frame, we apply the camera projection matrix to the 3D corner points to get the 2D image points, which can then be plotted on the frame. Thus, we can visualize the cube on the box throughout the video. It mostly works well, but the cube jumps around on a few frames, likely because the tracker is not perfect. The following is a link to a video of the cube projected into the scene: https://drive.google.com/file/d/1iqH0udFUjBCh05YBOJEIBSkhPl3zKdFK/view?usp=sharing
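A minimal sketch of projecting the cube corners with the per-frame projection matrix; the cube placement and the OpenCV drawing call in the comment are illustrative assumptions.

```python
import numpy as np

def project(P, pts_3d):
    """Apply a 3x4 camera matrix P to Nx3 world points; return Nx2 pixel coords."""
    homog = np.hstack([pts_3d, np.ones((len(pts_3d), 1))])
    proj = homog @ P.T
    return proj[:, :2] / proj[:, 2:3]

# Hypothetical unit cube resting on the box (world coordinates are assumptions).
cube = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
                 [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=float)
edges = [(0, 1), (1, 2), (2, 3), (3, 0),
         (4, 5), (5, 6), (6, 7), (7, 4),
         (0, 4), (1, 5), (2, 6), (3, 7)]

# For each frame, compute corners = project(P_frame, cube) and draw each edge,
# e.g. with cv2.line(frame, tuple(corners[i].astype(int)),
#                    tuple(corners[j].astype(int)), (0, 255, 0), 2).
```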

A Neural Algorithm of Artistic Style

For this part, we implemented a neural algorithm for style transfer using a pretrained VGG-19 network. For the content layer, we used layer 21, and for the style layers we used layers 0, 5, 8, 19, and 28. We run the content image and the style image through the network and extract the features at these layers, then run gradient descent on the generated image itself, updating its pixels directly.

The content loss is the sum of squared differences between the content-layer features of the content image and of the generated image, multiplied by a content weight. For the style loss, we compute the Gram matrix of the generated image's features at each of the 5 style layers, and likewise for the original style image; for each style layer, we take the sum of squared differences between the two Gram matrices and multiply by that layer's style weight. The content and style losses are combined into the total loss with a simple linear combination: alpha * content loss + beta * style loss. We used alpha = 1 and beta = 1e12 because this converged more quickly. We optimized with LBFGS at a learning rate of 1 for 50 epochs. To keep the runtime reasonable, the images were preprocessed and resized to 256. A sketch of these loss terms is shown after the example images below. Here are a couple of content images we used:
Here are some styles we used:



And here are some results:
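As mentioned above, here is a minimal PyTorch sketch of the Gram matrix and the content/style loss terms; the function names and per-layer weights are illustrative rather than our exact code.

```python
import torch

def gram_matrix(features):
    """Gram matrix of a (1, C, H, W) feature map, used for the style loss."""
    _, c, h, w = features.shape
    f = features.view(c, h * w)
    return f @ f.t()

def content_loss(gen_feat, content_feat):
    """Sum of squared differences at the content layer."""
    return torch.sum((gen_feat - content_feat) ** 2)

def style_loss(gen_feats, style_feats, layer_weights):
    """Weighted sum over style layers of squared Gram-matrix differences."""
    loss = 0.0
    for gf, sf, w in zip(gen_feats, style_feats, layer_weights):
        loss = loss + w * torch.sum((gram_matrix(gf) - gram_matrix(sf)) ** 2)
    return loss

# total_loss = alpha * content_loss(...) + beta * style_loss(...), minimized
# over the pixels of the generated image with torch.optim.LBFGS (lr=1),
# which requires wrapping the loss computation in a closure.
```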

Light Field Camera

Depth Refocusing

Using the jelly bean dataset, we first read the images into a 2D grid. Then, by averaging shifted versions of the images on the grid, we are able to focus on different parts of the scene: each image is shifted by depth * (its grid x/y position - the grid center) before averaging. We can see from the following images that this focuses on the foreground and the background. We also created a gif using these results:
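A minimal sketch of the shift-and-average refocusing, assuming the sub-aperture images are already loaded into a 2D grid indexed as grid[v][u] with a color channel; the shift sign convention may need flipping depending on the dataset.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def refocus(grid, depth):
    """Average all sub-aperture images after shifting each one by
    depth * (its grid position - the grid center)."""
    rows, cols = len(grid), len(grid[0])
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    acc = np.zeros_like(grid[0][0], dtype=float)
    for v in range(rows):
        for u in range(cols):
            dy, dx = depth * (v - cy), depth * (u - cx)
            # Shift rows/cols only; leave the color channel untouched.
            acc += nd_shift(grid[v][u].astype(float), (dy, dx, 0), order=1)
    return acc / (rows * cols)
```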

Aperture Adjustment

We can replicate the change in aperture by averaging around the center of the grid: we only average the images within some radius of the center, and we try this for multiple radii. We can see the aperture difference in the following gif as the focus changes.
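A sketch of the aperture adjustment, reusing the same shifting as above but averaging only the grid images within a given radius of the center; the radius and the optional depth are illustrative parameters.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def adjust_aperture(grid, radius, depth=0.0):
    """Average only the sub-aperture images within `radius` grid cells of the
    center; larger radii mimic a larger aperture (shallower depth of field)."""
    rows, cols = len(grid), len(grid[0])
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    acc = np.zeros_like(grid[0][0], dtype=float)
    count = 0
    for v in range(rows):
        for u in range(cols):
            if (v - cy) ** 2 + (u - cx) ** 2 <= radius ** 2:
                dy, dx = depth * (v - cy), depth * (u - cx)
                acc += nd_shift(grid[v][u].astype(float), (dy, dx, 0), order=1)
                count += 1
    return acc / max(count, 1)
```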

Summary

We learned about cameras and how they work with regard to focus and aperture, which was really cool to see in action. We were able to focus on different parts of the scene.

Bells and Whistles

For the Bells and Whistles, we used a pizza box and tried to take a 5x5 grid of photos at regular intervals to imitate the grids in the dataset. The following is an example of the grid we used. We can see the results for depth refocusing and aperture adjustment below. The results weren't very good, probably because the shifts were off: we took the photos manually with a phone, so the spacing wasn't consistent. The results could be improved by measuring the distances to keep the spacing the same.