Final Project I - Lightfield Camera

Final Project II - A Neural Algorithm for Style Transfer

Lightfield Camera

Depth Refocusing and Aperture Adjustment with Light Field Data

Ron Wang

CS 180, Fall 2023, UC Berkeley

As demonstrated in Ren Ng's 2005 paper, Light Field Photography with a Hand-held Plenoptic Camera, one can achieve depth refocusing and aperture adjustment using very simple techniques such as shifting and averaging. In this project, I reproduced these effects using real lightfield data from the Stanford Light Field Archive.

1. Depth Refocusing

The Stanford Light Field Archive provides us with datasets of images taken over a regularly spaced grid. Let's first explore the effect of simply averaging all the images.

As we can see, the objects far away from the camera are sharp, because they do not vary their position significantly when the camera moves around (the optical axis direction is unchanged). In contrast, objects closer to the camera appear blurry.

To refocus at a different depth, we shift each image relative to the center image, which sits at coordinates (8, 8) on the 17 x 17 grid. The filename of each image gives the location of its view. Subtracting the center view's location from it gives the offset of that view. We then introduce a parameter, depth, that scales this offset into a shift and thus controls the depth we focus on. Averaging the shifted images brings the region at the corresponding depth into focus.
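The shift-and-average step can be sketched as follows. This is a minimal NumPy sketch, not the project's actual refocus_im function: the name refocus, the (row, col) grid positions, the sign of the shift, and the use of whole-pixel np.roll shifts (which wrap around at the borders) are all assumptions made for illustration.

```python
import numpy as np

def refocus(images, positions, depth, center=(8, 8)):
    """Shift each view by depth * (offset from the center view), then average.

    images:    list of H x W (or H x W x C) float arrays, one per view
    positions: list of (row, col) view locations, one per view (assumed layout)
    depth:     scalar that scales the per-view shift
    """
    acc = np.zeros_like(images[0], dtype=np.float64)
    for img, (r, c) in zip(images, positions):
        # Offset of this view from the center view.
        dr, dc = r - center[0], c - center[1]
        # Whole-pixel shift; np.roll wraps pixels around the image border.
        shift = (int(round(depth * dr)), int(round(depth * dc)))
        acc += np.roll(img, shift, axis=(0, 1))
    return acc / len(images)
```

With depth = 0 every shift is zero, so the function reduces to the plain average of all views. A sub-pixel shift (for example via interpolation) would give smoother results than np.roll, at the cost of more code.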

Depth = -0.1

Depth = 0.0

Depth = 0.1

Depth = 0.2

Depth = 0.3

Depth = 0.4

Depth = 0.5

Here's a GIF that visualizes the change in depth of our focus:

More examples (GIF):

2. Aperture Adjustment

Another cool manipulation we can do here is to mimic a camera with a larger aperture by sampling images over the grid perpendicular to the optical axis. We choose a radius (e.g. 50) and sample all images within that radius of the center image. We then shift them appropriately, as we did in Section 1, and average the shifted images. Note that the base case r = 0 means we sample only the center image. Here are the results (setting depth=0.20 to focus on the center region):
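The view selection can be sketched like this. It is a minimal sketch under assumptions: the helper names select_within_radius and simulate_aperture are hypothetical, distances are Euclidean in whatever units the view locations use, and the images passed in are assumed to be already shifted for the chosen depth.

```python
import numpy as np

def select_within_radius(positions, center, radius):
    """Return the indices of views whose location is within `radius` of `center`."""
    pos = np.asarray(positions, dtype=np.float64)
    dist = np.linalg.norm(pos - np.asarray(center, dtype=np.float64), axis=1)
    return np.flatnonzero(dist <= radius)

def simulate_aperture(shifted_images, positions, center, radius):
    """Average only the (already shifted) views inside the chosen radius."""
    idx = select_within_radius(positions, center, radius)
    return np.mean([shifted_images[i] for i in idx], axis=0)
```

A radius of 0 keeps only the center view, which matches the r = 0 base case; a larger radius averages more views and so blurs everything away from the focused depth more strongly.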

r = 0

r = 10

r = 20

r = 30

r = 40

r = 50

More examples (GIF):

Bells & Whistles: Interactive Refocusing

Building on the refocus_im function implemented in Part 1, I also created a function that allows us to refocus on any point of the image. The function takes in the (u, v) coordinates of the point, which can be easily read from skio's image output. It then calculates the optimal depth for refocusing. For example, we have the following blurry image:

We can specify to the function that we'd like to refocus on the point at (1200, 700). Now the image is refocused!

Here's another demo: let's refocus on the point at (1000, 200).
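The write-up does not say how the optimal depth is computed, so the following is only one plausible approach, sketched under assumptions: try a list of candidate depths and keep the one for which the shifted views agree best (lowest variance across views) in a small patch around the chosen point. The name best_depth_at, the (row, col) point convention, the patch half-width, and the whole-pixel np.roll shifts are all hypothetical.

```python
import numpy as np

def best_depth_at(images, positions, point, depths, center=(8, 8), half=4):
    """Pick the candidate depth whose shifted views agree best around `point`."""
    row, col = point
    best, best_score = None, np.inf
    for d in depths:
        patches = []
        for img, (r, c) in zip(images, positions):
            shift = (int(round(d * (r - center[0]))),
                     int(round(d * (c - center[1]))))
            shifted = np.roll(img, shift, axis=(0, 1))
            patches.append(shifted[row - half:row + half + 1,
                                   col - half:col + half + 1])
        # Low variance across views means the patch is well aligned, i.e. in focus.
        score = np.var(np.stack(patches), axis=0).mean()
        if score < best_score:
            best, best_score = d, score
    return best
```

The returned depth can then be passed to the refocusing function. The patch must lie fully inside the image for the slicing above to behave as intended.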

3. Reflections

This is a really fun demo that showcases our ability to "refocus" or adjust the aperture after the images were taken. The method is simple, but the results are rather amazing. However, I do think the current approach is limited in that we need to know the camera coordinates of the images. In addition, we used a total of 17 x 17 = 289 images for each scene, and capturing that many might not always be possible. It would be more practical if we could develop algorithms that learn the camera coordinates and generate these results from fewer samples. I plan to read up more on this.


A Neural Algorithm of Artistic Style

Reimplementation + Explorations

1. Introduction

A Neural Algorithm of Artistic Style by Gatys et al. was a seminal work in the field of neural style transfer. At a high level, the approach they introduced uses CNNs to separate and recombine the content and style of images. I am especially impressed by the novelty of the approach and the aesthetic appeal of the synthesized images. In this part of the final project, I reimplemented their approach while introducing some personal experimentation.

Here are links to the original paper and two PyTorch tutorials (PyTorch and d2l) that were helpful in guiding my implementation.

In order to use GPU compute, I set up my environment on Azure ML Notebook, utilizing some free credits I had from previous work (thanks Microsoft!). I also set up a project in Weights & Biases to log my results in each run. It turned out logging hyperparameters and synthesized images was extremely important - when I was stuck at some uninteresting results, looking at experiment data helped me figure out the correct parameters to choose.

Reading the paper, the most interesting part comes from realizing that we are not training the neural network itself in the traditional sense. What is being optimized instead is the synthesized image. As discussed in the paper, there usually exists no image that perfectly captures the content of one image and the style of the other. But as we minimize the loss function, the synthetic image we generate becomes more perceptually appealing.
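A toy example may make this concrete. In the sketch below, the matrix W is a hypothetical stand-in for the frozen network, and gradient descent updates only the "image" x so that its features W @ x approach a target. This is an illustration of the idea under simplifying assumptions (a linear feature map and a plain MSE loss), not the actual style transfer pipeline.

```python
import numpy as np

def optimise_image(W, target_feat, x0, lr=0.1, steps=200):
    """Gradient descent on the image x; the 'network' W is never updated."""
    x = x0.copy()
    losses = []
    for _ in range(steps):
        residual = W @ x - target_feat
        losses.append(float(np.mean(residual ** 2)))
        # Gradient of mean((W x - t)^2) with respect to x.
        grad = 2.0 * W.T @ residual / residual.size
        x -= lr * grad
    return x, losses
```

In a PyTorch implementation the same idea usually shows up as an optimizer that is given the image tensor rather than the network's parameters.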

2. The VGG Network

As described in the paper, we use a pre-trained convolutional neural network called VGG-19. The network was originally trained for object recognition and surpassed many previous baseline results on ImageNet. This makes it suitable for our task - with its object recognition abilities, it can "capture the high-level content in terms of objects and their arrangement in the input image but do not constrain the exact pixel values of the reconstruction." We initialize the network with pre-trained weights from VGG19_Weights.IMAGENET1K_V1.

We use the VGG network to extract features from the content image and the style image. Using these "feature maps" we can calculate content and style loss.
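Feature extraction by layer index can be sketched as below. This is a minimal, framework-free sketch: layers is assumed to be any ordered sequence of callables (for example, the convolutional layers of a pre-trained VGG-19), and the function name and argument names are hypothetical rather than taken from the project code.

```python
def extract_features(layers, x, content_layers, style_layers):
    """Run x through the layers in order, keeping outputs at the chosen indices."""
    contents, styles = [], []
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in style_layers:
            styles.append(x)
        if i in content_layers:
            contents.append(x)
    return contents, styles
```

The same function is called once on the content image, once on the style image, and at every optimization step on the synthesized image, so that the three sets of feature maps can be compared in the loss.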

3. Loss Functions

The content part is easy - we define content loss as the MSE between the synthesized image's features and the original content image's features. Style loss, on the other hand, is more involved. For each layer used in the style representation, we need to compute the Gram matrix. The elements of the Gram matrix represent the correlations between the activations of different filters in the layer. These correlations capture the texture and visual patterns that are characteristic of the style of the image.
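The two losses can be sketched for a single layer as follows. This is a NumPy sketch for illustration rather than the project's PyTorch code: feature maps are assumed to have shape C x H x W, and dividing the Gram matrix by C * H * W is one common normalization that is an assumption here (the paper uses a different constant).

```python
import numpy as np

def content_loss(feat, target_feat):
    """MSE between synthesized and content feature maps."""
    return float(np.mean((feat - target_feat) ** 2))

def gram_matrix(feat):
    """feat: C x H x W feature map -> C x C matrix of filter correlations."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)       # one row per filter
    return flat @ flat.T / (c * h * w)  # assumed normalization

def style_loss(feat, target_gram):
    """MSE between the Gram matrix of feat and a precomputed style Gram matrix."""
    return float(np.mean((gram_matrix(feat) - target_gram) ** 2))
```

Because the Gram matrix sums over all spatial positions, it discards where a pattern occurs and keeps only which filters fire together, which is why it describes texture rather than layout.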

In addition to content and style loss described in the original paper, total variation loss (tv_loss) was also computed. This is meant to minimize the amount of high frequency artifacts in the synthetic image.
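One common form of total variation loss is sketched below. The exact formula used in the project is not given in the write-up, so the mean absolute difference between neighboring pixels used here is an assumption.

```python
import numpy as np

def tv_loss(img):
    """img: H x W or H x W x C array. Penalizes differences between neighbors."""
    dv = np.abs(img[1:, :] - img[:-1, :]).mean()  # vertical neighbors
    dh = np.abs(img[:, 1:] - img[:, :-1]).mean()  # horizontal neighbors
    return 0.5 * float(dv + dh)
```

A perfectly flat image has zero total variation, while pixel-level noise makes it large, so adding this term to the objective pushes the synthesized image toward smoother regions.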

4. Training

In earlier training runs, I failed to get satisfying results even after 1,500 epochs. By examining my hyperparameters and the way the images changed, I noticed the content features were reconstructed well, but the stylistic elements were not showing. I assigned a heavier weight to style (1e6 instead of 1e3 or 1e4) and the model started generating interesting results.
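The effect of the style weight can be seen from the arithmetic of the weighted sum. In the sketch below, only the style weights 1e3 and 1e6 come from the text; the function name, the content and TV weights, and the raw loss magnitudes in the usage note are hypothetical values chosen for illustration.

```python
def total_loss(content_l, style_l, tv_l,
               content_weight=1.0, style_weight=1e6, tv_weight=1.0):
    """Weighted sum of the three loss terms (weights other than style are assumed)."""
    return content_weight * content_l + style_weight * style_l + tv_weight * tv_l
```

For example, if the raw content loss were around 1 and the raw style loss around 1e-5, a style weight of 1e3 would make the style term contribute only about 0.01 to the total, so the optimizer would mostly reconstruct content. With a style weight of 1e6 the same style term contributes about 10 and dominates the objective.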

5. Results

Content Images

Berkeley, CA

Golden Gate Bridge 1

Golden Gate Bridge 2

Style Images

The Starry Night, Vincent van Gogh

Impression, Sunrise, Claude Monet

Haystacks, Claude Monet

The Scream, Edvard Munch

Figure, Pablo Picasso

Journey to the East, Bukang Y. Kim

Input (Synthetic) Images

All of these are GIFs showing the gradual synthesis of the combined image. You might have to wait a bit to see the entire process!

Berkeley + Starry

Berkeley + Scream

Berkeley + Sunrise

Berkeley + Cubism

Bridge + Scream

Bridge 2 + Haystacks

Berkeley + Ink Wash

My comment on the last synthetic image, which combines the Berkeley content with the ink wash style, is that the model captures the style only very superficially. It does not abstract away details the way most ink wash paintings do. But of course, this level of reasoning or artistic interpretation isn't something we should expect from the current model.

I also recorded hyperparameters and losses for each run. Here are the results for the Bridge + Scream example shown above.

Bells & Whistles: Shuffling Layers

I also compared the results from different selections of layers from the VGG network.

Approach 1

style_layers = [0, 5, 10, 19, 28]

content_layers = [25]

Approach 2

style_layers = [2, 7, 12, 21, 30]

content_layers = [22, 25]

I like the result from Approach 2 better because (1) the content features seem more detailed, and (2) it is more artistically expressive.

For the following example, there isn't a huge difference between the two approaches except for the colors.

Approach 1

style_layers = [0, 5, 10, 19, 28]

content_layers = [25]

Approach 2

style_layers = [2, 7, 12, 21, 30]

content_layers = [22, 25]
