Style Transfer & Light Field Cameras!

A catalogue of the steps I took to build the two-part final project for Berkeley's CS194-26. The first part is an implementation of "A Neural Algorithm of Artistic Style", and the second involves depth refocusing and aperture adjustment using light field cameras, inspired by "Light Field Photography with a Hand-Held Plenoptic Camera".

Jazz Singh // December 2020


1. A Neural Algorithm of Artistic Style

For this project, I re-implemented the famous neural style transfer paper, "A Neural Algorithm of Artistic Style" (Gatys et al.)!

1.1. Summary of Final Approach

Here's a summary of the architecture, hyperparameter settings, and other details that worked for me. I used the same feature extractor architecture as the paper -- VGG-19 without the fully connected layers, and with max pooling replaced with average pooling.

  • (Adapted from: https://github.com/tejaslodaya/neural-style-transfer)
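
Below is a minimal sketch of this feature-extractor setup in PyTorch, assuming torchvision's pretrained VGG-19 weights; the function name and structure are illustrative rather than the exact project code.

    import torch.nn as nn
    import torchvision.models as models

    def build_feature_extractor():
        # VGG-19 convolutional stack only -- the FC classifier is dropped.
        vgg = models.vgg19(pretrained=True).features.eval()
        layers = []
        for layer in vgg.children():
            if isinstance(layer, nn.MaxPool2d):
                # Swap max pooling for average pooling, as in the paper.
                layers.append(nn.AvgPool2d(kernel_size=2, stride=2))
            else:
                layers.append(layer)
        extractor = nn.Sequential(*layers)
        for p in extractor.parameters():
            p.requires_grad_(False)  # only the generated image is optimized
        return extractor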

In terms of places where I diverged from the paper:

  • White noise image: Instead of starting from a white noise image and minimizing content and style loss from there, I started from the content image (and minimized both losses). This choice stemmed from noticing that starting from white noise often left noise in the generated image, which could only be fixed by using a very early content layer (i.e. the first convolutional layer in the feature extractor); however, such an early content layer carries too much low-level pixel information, so the generated image couldn't quite capture the richness and vibrancy of the style image's colors, although it did capture some of its texture. Feeding in the content image as the initial image instead of noise yielded really pretty results.
  • Ratio of content loss weight to style loss weight (alpha/beta): 1e-9, instead of 1e-3 or 1e-4. For me, the relatively high content loss weights recommended by the paper put far too much emphasis on optimizing for content and not enough on style. On the other hand, values below 1e-9 distorted the content too much for my taste; 1e-9 balanced the two concerns.
  • Optimizer: Adam. The paper used a variant of standard gradient descent, but I found it easier to obtain good results with Adam. (A sketch of the resulting optimization loop follows this list.)
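
To make these choices concrete, here is a minimal sketch of the optimization loop, assuming hypothetical content_loss_fn and style_loss_fn callables that compare features from the chosen layers against the fixed content and style targets (a style-loss sketch appears further below); the default learning rate and iteration count match the values listed in the next list.

    import torch

    def stylize(content_img, content_loss_fn, style_loss_fn,
                n_iters=5000, lr=1e-1, alpha=1.0, beta=1e9):
        # Start from the content image rather than from white noise.
        generated = content_img.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([generated], lr=lr)
        for _ in range(n_iters):
            optimizer.zero_grad()
            # alpha / beta = 1e-9: heavy emphasis on style relative to content.
            loss = alpha * content_loss_fn(generated) + beta * style_loss_fn(generated)
            loss.backward()
            optimizer.step()
        return generated.detach()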

Not mentioned in the paper:

  • Image resolution: 448 x 448. The lowest image resolution accepted by pretrained PyTorch models is 224 x 224, but I wanted higher-resolution results while maintaining a relatively fast optimization speed, so I doubled the resolution.
  • Image normalization: I normalized images following the normalization scheme that PyTorch pretrained models expect (see the preprocessing sketch after this list). I experimented with implementing the paper without this normalization, and although the results were pretty good, the colors weren't as rich and vibrant as desired. Normalization solved this issue! Note that I did not need to clip the range of values at each iteration -- the values were not pushed far out of bounds, and in fact experiments with normalization plus per-iteration clipping led to worse, duller colors.
  • Learning rate: 1e-1. This was the fastest learning rate that converged.
  • Number of optimization iterations: 5000. Although fewer iterations yielded pretty good results, I found that increasing the number of iterations to ~5000 (for this optimizer, learning rate, etc.) fine-tuned the details of the generated image.
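
For reference, a preprocessing pipeline along these lines might look as follows; the mean and standard deviation are the standard ImageNet statistics that torchvision's pretrained models expect, and the transform names are illustrative.

    from torchvision import transforms

    # Standard ImageNet statistics expected by torchvision's pretrained models.
    IMAGENET_MEAN = [0.485, 0.456, 0.406]
    IMAGENET_STD = [0.229, 0.224, 0.225]

    preprocess = transforms.Compose([
        transforms.Resize((448, 448)),                      # double the 224 x 224 minimum
        transforms.ToTensor(),                              # PIL image -> CHW float tensor in [0, 1]
        transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),  # normalization used during VGG pretraining
    ])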

Same as the paper:

  • Feature extractor network: VGG-19, without the fully connected layers (these are not necessary for feature extraction), and with max pooling replaced by average pooling (the authors found this pooling yielded better results).
  • Content layer: 10th convolutional layer ('conv4_2'). The authors found this layer captured enough high-level information about the content image. I experimented with lower layers, which worked better when starting from a white noise image as mentioned above, but ultimately the paper's suggestion worked best.
  • Style layers: 1st, 3rd, 5th, 9th, and 13th convolutional layers ('conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', and 'conv5_1'). The authors found that sampling style layers at multiple places across the feature extractor led to smoother style transfer, from fine-grained details to higher-level style structure. I experimented with lower and higher subsets of conv layers, but ultimately the paper's suggestion worked best. (A sketch of the style loss over these layers follows this list.)
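
As a concrete reference, here is a minimal sketch of the style loss over these layers, using the Gram-matrix style representation from the paper; the dict-of-feature-maps interface (e.g. filled via forward hooks) and the normalization constant are my own illustrative choices.

    import torch.nn.functional as F

    CONTENT_LAYER = 'conv4_2'
    STYLE_LAYERS = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']

    def gram_matrix(features):
        # Gram matrix of a (1, C, H, W) feature map -- the paper's style representation.
        _, c, h, w = features.shape
        f = features.view(c, h * w)
        return (f @ f.t()) / (c * h * w)  # the normalization is an implementation choice

    def style_loss(gen_feats, style_feats):
        # MSE between Gram matrices, summed over the style layers.
        # `gen_feats` / `style_feats` map layer names to feature maps.
        return sum(F.mse_loss(gram_matrix(gen_feats[l]), gram_matrix(style_feats[l]))
                   for l in STYLE_LAYERS)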

1.2. Style Transfer on "Neckarfront"

I transferred 3 different styles onto the same image of houses ("Neckarfront") that the paper used, so that I could compare my results to the paper's. Below is the original image:

  • Neckarfront

And here are my results!

  • The Starry Night (Vincent van Gogh)

  • Transferred to The Starry Night


  • The Scream (Edvard Munch)

  • Transferred to The Scream


  • The Shipwreck of the Minotaur (Joseph Mallord William Turner)

  • Transferred to The Shipwreck of the Minotaur

My results seem to balance shape and structure from the content image with vibrancy and texture from the style image! In terms of room for improvement, the paper's results often include very tiny, sharp details that seem to preserve all of the brushwork of the original style; I think further refinement (including optimizing for longer) would bring my results even closer.

1.3. Even More Style Transfer!

I applied my method to more pictures and styles, and below are some of the results!

  • Boba & the Campanile, transferred to Rain Rustle (Leonid Afremov)

  • Flowers, transferred to Bouquet of Flowers (Cézanne)

I included an example of a "failure" case below -- there are artifacts, like what appears to be a person's head near the skyline. I think this stems from too large a content gap between the content and style images, since the style image contains quite a few people.

  • San Luis Obispo, transferred to A Sunday Afternoon on the Island of La Grande Jatte (Georges Seurat)


2. Fun with Light Field Cameras

Light field cameras make it possible to apply camera-like effects, such as refocusing and changing the aperture, in post-processing; for this project, I used data from the Stanford Light Field Archive to alter the apparent depth of focus and aperture of images with simple shifting and averaging operations. I also implemented refocusing at an arbitrary depth chosen via a user-specified (x, y) position.

2.1. Changing Depth of Focus

To change the apparent depth of focus, I calculated the difference between the position of the center camera in the light field and the camera position of each of the other images, then shifted each image by the factor alpha multiplied by this difference before averaging. Varying alpha varies the depth that appears in focus: objects at one depth become aligned across the shifted images (and thus stay sharp in the average), while the rest of the scene stays misaligned and blurs out.
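
Here is a minimal NumPy sketch of this shift-and-average operation, under the assumption that the sub-aperture views are stacked in an (N, H, W, 3) array and that their (u, v) grid coordinates have been parsed from the archive's metadata; the names and the use of the mean position as the center camera are illustrative.

    import numpy as np
    import scipy.ndimage

    def refocus(images, positions, alpha):
        # images: (N, H, W, 3) sub-aperture views; positions: (N, 2) camera (u, v) coordinates.
        center = positions.mean(axis=0)  # stand-in for the center camera's position
        out = np.zeros(images[0].shape, dtype=np.float64)
        for img, pos in zip(images, positions):
            du, dv = alpha * (center - pos)  # shift proportional to the offset from the center
            out += scipy.ndimage.shift(img.astype(np.float64), (dv, du, 0))  # sub-pixel shift in y, x
        return out / len(images)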

  • Varying alpha

2.2. Changing Aperture

In an actual camera, reducing the size of the pinhole (up to a certain extent, of course) leads to a sharper image, while increasing its size produces circles of confusion (and thus blur). So, to implement an apparent aperture change around a fixed center image, I averaged only the images whose camera positions lie within a smaller or larger radius of the center. The smaller the radius, the sharper the result, since the contributing camera positions are closer together.
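
Continuing the (assumed) array conventions from the refocusing sketch above, the aperture adjustment might look like this:

    import numpy as np

    def adjust_aperture(images, positions, radius):
        # Average only the views whose camera position lies within `radius` of the center;
        # a larger radius pulls in more widely spaced views, mimicking a larger aperture.
        center = positions.mean(axis=0)
        dists = np.linalg.norm(positions - center, axis=1)
        return images[dists <= radius].mean(axis=0)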

2.3. Interactive Refocusing

To refocus an image at a user-defined location, I sampled a patch of a given size at that location from the center camera image, sampled the same patch from an arbitrary second image in the light field, and minimized the SSD between the two patches across different alpha values. Since a single alpha parameter applies to all images and both axes, I then changed the depth of focus (see 2.1) using this alpha value, aligning all of the images so that the patch at the user-defined location appears in focus. The result below comes from picking a point near the bottom middle of the image, hence the part of the scene that appears closest to us is in focus.
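
A minimal sketch of this alpha search, again under the assumed conventions from 2.1; the patch half-size, the range of candidate alphas, and the choice of the second view are illustrative.

    import numpy as np
    import scipy.ndimage

    def best_alpha(center_img, other_img, offset, x, y, half=40,
                   alphas=np.linspace(-1.0, 1.0, 100)):
        # offset: difference between the center camera's position and the second
        # camera's position, matching the shift convention used in refocus().
        ref = center_img[y - half:y + half, x - half:x + half].astype(np.float64)
        best_a, best_ssd = None, np.inf
        for a in alphas:
            du, dv = a * offset
            shifted = scipy.ndimage.shift(other_img.astype(np.float64), (dv, du, 0))
            patch = shifted[y - half:y + half, x - half:x + half]
            ssd = np.sum((ref - patch) ** 2)
            if ssd < best_ssd:
                best_a, best_ssd = a, ssd
        return best_a  # then refocus the whole light field with this alpha (see 2.1)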