CS 194-26: Computational Photography, Fall 2021

A Neural Algorithm of Artistic Style

Daniel Jae Im, CS194-26-afp

Overview

In this project, I recreated an artificial system that uses deep neural networks to separate and recombine the content and style of two images, creating entirely new works of art.

Algorithm

The algorithm takes two images: a content image that provides the overall structure of the result, and a style image that provides its visual character. Given these two images, the algorithm iteratively adjusts the pixel values of a generated image to minimize a combined loss measuring how far its current state is from the content image and from the style image.

I chose to leverage a CNN that has already been trained on a large image dataset. The content and style images can each be fed through this CNN, and its intermediate activations provide the feature representations of content and style that we want in the resulting image.
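As a minimal sketch of what this looks like with torchvision (the layer indices here are illustrative placeholders, not the exact layers used in my model):

```python
import torch
import torchvision.models as models

# Pretrained VGG-19 feature stack; the weights stay frozen because we only
# ever optimize the pixels of the generated image.
vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def extract_features(image, layer_indices):
    """Collect activations at the chosen layers for a (1, 3, H, W) image tensor."""
    feats = {}
    x = image
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layer_indices:
            feats[i] = x
    return feats
```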


Architecture

As the paper outlines, we use a VGG-19 network with pretrained weights. Following the paper's suggestions, we replaced the MaxPool layers with AvgPool layers and used normalization to smooth out the representations. We added a content loss layer after conv4_2 and style loss layers after conv_1 through conv_5.
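A minimal sketch of the pooling swap, assuming torchvision's VGG-19; the content and style loss modules (not shown) are inserted after the layers listed above:

```python
import copy
import torch.nn as nn
import torchvision.models as models

# Start from the pretrained VGG-19 feature stack and freeze its weights.
model = copy.deepcopy(models.vgg19(pretrained=True).features.eval())
for p in model.parameters():
    p.requires_grad_(False)

# Replace each MaxPool2d with an AvgPool2d of the same geometry, which the
# paper suggests produces smoother image reconstructions.
for i, layer in enumerate(model):
    if isinstance(layer, nn.MaxPool2d):
        model[i] = nn.AvgPool2d(kernel_size=layer.kernel_size,
                                stride=layer.stride,
                                padding=layer.padding)
```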

Loss Function

The main heuristic we use to judge progress between iterations is a well-defined loss function, starting with the content loss.
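Following the paper, the content loss between the generated image and the content image at a layer l is:

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2}\sum_{i,j}\left(F^{l}_{ij} - P^{l}_{ij}\right)^{2}$$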


Here, the vector p is the original content image and the vector x is the resulting image. P is the feature representation of the original image and F that of the resulting image, both taken at layer l of the model.
To calculate the style loss, we use the correlations between different feature maps, captured by the Gram matrix.
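From the paper, the Gram matrix of the layer-l features and the resulting per-layer style contribution are:

$$G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}, \qquad E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j}\left(G^{l}_{ij} - A^{l}_{ij}\right)^{2}$$

where N_l is the number of feature maps at layer l and M_l is their spatial size.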


We let the vector a be the original style image and the vector x be the resulting image; A_l and G_l are their respective style representations (Gram matrices) at layer l of the model. The paper then combines the content and style terms into the total loss function:
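With layer weights w_l and content/style trade-off weights alpha and beta:

$$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l} w_{l} E_{l}, \qquad \mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha\, \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta\, \mathcal{L}_{style}(\vec{a}, \vec{x})$$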

This is the final objective we minimize when generating our images.


Training

Lastly, we minimized the loss function above by optimizing the pixel values of the generated image directly; the network weights stay fixed. We used the L-BFGS optimizer, which the paper also uses, and ran it for 250 epochs. A sketch of this loop is below, followed by our results.
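A minimal sketch of that loop in PyTorch; content_img and total_loss are placeholders for the preprocessed content image and the combined loss described above:

```python
import torch

# Start the generated image from a copy of the content image and optimize
# its pixels directly; the VGG weights never change.
generated = content_img.clone().requires_grad_(True)
optimizer = torch.optim.LBFGS([generated])

for epoch in range(250):
    def closure():
        optimizer.zero_grad()
        loss = total_loss(generated)   # alpha * content loss + beta * style loss
        loss.backward()
        return loss
    optimizer.step(closure)
    with torch.no_grad():
        generated.clamp_(0, 1)         # keep pixel values in a displayable range
```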

Results

Neckarfront

El Capitan

Arcane Vi

Failures

I was unable to produce a good style transfer from this Voronoi-pattern image onto this CrazyFrog image. I believe the randomness of the Voronoi image does not transfer well onto the round edges of the frog image, and its white regions combine with the frog's bright orange pixel values to produce some odd pinks.

Overall, this project was very difficult, but it helped a lot in getting me up to speed on CNNs and on how to set up models that train themselves. It was an enjoyable project.

CS 194-26: Computational Photography, Fall 2021

Augmented Reality

Daniel Jae Im, CS194-26-afp

Overview

Augmented Reality has always been an interest of mine, and this project was a great introduction to a rudimentary AR pipeline. It walks through video capture, point tracking, and camera projection to generate an elementary augmented reality scene.

Video Capture

We first captured a video to superimpose our overlay onto. I chose a rectangular box with grid intersections as the points of interest. The point spacing was only estimated, but for the sake of point tracking it worked out.


Collecting World Points

Once we have our video, we have to match each tracked point's position on screen to a canonical world-space coordinate system.

I chose to assign each point in the image a world coordinate consistent with the three axes labeled above. The origin point (0, 0, 0, 1) sits at the mid-left point, where the three axes meet. From there, I selected 25 intersection points on the first frame using ginput(25) and mapped each one to its corresponding world coordinate.
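A sketch of that step; the filename and the world coordinates listed are placeholders for my actual clicking order and box layout:

```python
import numpy as np
import cv2
import matplotlib.pyplot as plt

# Grab the first frame of the captured video.
cap = cv2.VideoCapture("box.mp4")
ok, first_frame = cap.read()

# Click the 25 grid intersections on the first frame in a fixed order.
plt.imshow(cv2.cvtColor(first_frame, cv2.COLOR_BGR2RGB))
image_pts = np.array(plt.ginput(25, timeout=0))   # (25, 2) pixel coordinates

# Hand-assigned world coordinates (in grid units), matching the clicking order,
# with the origin at the mid-left corner where the three axes meet.
world_pts = np.array([
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    # ... one entry per clicked intersection
])
```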


Point Propagation using cv2.TrackerMedianFlow

Once each of our 25 screen points had a world coordinate, I also allocated each of them a cv2.TrackerMedianFlow that would update the intersection point from frame to frame. Each screen point's tracker searches an 8-pixel bounding box around where the point was in the last frame to find where the intersection lies in the next frame. These frame-to-frame points were recorded and propagated forward by iterating over the video.
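A sketch of the tracking loop; depending on the OpenCV build, the factory function may be cv2.TrackerMedianFlow_create() or cv2.legacy.TrackerMedianFlow_create(), and image_pts, first_frame, and cap come from the previous sketch:

```python
import cv2

# One MedianFlow tracker per clicked point, each initialized on a small
# 8-pixel bounding box around the point in the first frame.
trackers = []
for (x, y) in image_pts:
    tracker = cv2.TrackerMedianFlow_create()
    tracker.init(first_frame, (int(x) - 4, int(y) - 4, 8, 8))
    trackers.append(tracker)

# Walk the video, recording every frame and where each tracker lands in it.
frames, tracked = [], []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    pts = []
    for tracker in trackers:
        found, (bx, by, bw, bh) = tracker.update(frame)
        pts.append((bx + bw / 2, by + bh / 2))   # center of the tracked box
    frames.append(frame)
    tracked.append(pts)
```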
Here is the box, with its intersections tracked across every frame of the video.

Notice that cv2 was unable to consistently track the third-from-bottom point from frame to frame (most likely because that intersection is too subtle against the desk mat).

Camera Projection

With our points tracked, the next step was to produce the matrix that maps a homogeneous point in 3D world space to a homogeneous pixel coordinate in screen space. I computed an approximation of this matrix using least squares, following the outline in this slide.
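A sketch of one common way to set up that least-squares system (fixing the bottom-right entry of the 3x4 matrix to 1), which may differ slightly from the exact formulation on the slide:

```python
import numpy as np

def compute_projection(world_pts, image_pts):
    """Fit a 3x4 matrix P so that P @ [X, Y, Z, 1] ~ [u, v, 1] in homogeneous terms."""
    A, b = [], []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z])
        b.extend([u, v])
    m, *_ = np.linalg.lstsq(np.array(A, dtype=float),
                            np.array(b, dtype=float), rcond=None)
    return np.append(m, 1.0).reshape(3, 4)
```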

Translation Matrix

Then, for each frame of the video, I computed the projection matrix that best maps 3D world space to 2D screen space. Using this per-frame projection matrix, I projected the 8 corners of a cube placed at the origin into every frame.
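A sketch of the per-frame projection and drawing, reusing compute_projection, world_pts, frames, and tracked from the earlier sketches; the cube size and line color are arbitrary choices:

```python
import numpy as np
import cv2

# Cube corners in homogeneous world coordinates, anchored at the origin.
cube = np.array([[x, y, z, 1.0]
                 for x in (0, 2) for y in (0, 2) for z in (0, 2)])
EDGES = [(0, 1), (0, 2), (1, 3), (2, 3),      # edges of the x = 0 face
         (4, 5), (4, 6), (5, 7), (6, 7),      # edges of the x = 2 face
         (0, 4), (1, 5), (2, 6), (3, 7)]      # edges connecting the two faces

def draw_cube(frame, frame_pts):
    """Fit this frame's projection matrix and draw the projected cube edges."""
    P = compute_projection(world_pts, frame_pts)
    proj = cube @ P.T                          # (8, 3) homogeneous image points
    corners = proj[:, :2] / proj[:, 2:3]
    for i, j in EDGES:
        p1 = (int(corners[i][0]), int(corners[i][1]))
        p2 = (int(corners[j][0]), int(corners[j][1]))
        cv2.line(frame, p1, p2, (0, 0, 255), 2)
    return frame

annotated_frames = [draw_cube(frame, pts) for frame, pts in zip(frames, tracked)]
```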

Line from axis to 8 corners of cube.

All that was left was to draw the correctly colored cube edges on each frame and compile those frames into a gif that we could present.
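A sketch of that final compile step, assuming the annotated frames from the previous sketch and the imageio package; the filename and frame rate are placeholders:

```python
import cv2
import imageio

# imageio expects RGB frames, while OpenCV frames are BGR.
rgb_frames = [cv2.cvtColor(f, cv2.COLOR_BGR2RGB) for f in annotated_frames]
imageio.mimsave("ar_cube.gif", rgb_frames, fps=30)
```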

Final Gif
The best thing I learned from this project was how our lessons on homographies carry over to producing projection matrices from 3D to 2D points. It was striking how intuitive this project felt after what we learned in this class, and even a rudimentary AR system like this one was impressive to work on and show to others.