CS 194: Computational Photography, Spring 2020

Final Project: HDR | Neural Style Transfer | Lightfield Camera

Wenlong Huang cs194-26-agf | Chelsea Ye cs194-26-agb


Project 1: Multi-Shot High Dynamic Range

Overview

This project creates HDR photos by automatically combining multiple exposures into a single high-dynamic-range radiance map, and then converting this radiance map into an image suitable for display through tone mapping. The first step, constructing the radiance map, follows the algorithm presented by Paul E. Debevec and Jitendra Malik in Recovering High Dynamic Range Radiance Maps from Photographs, and the second step, tone mapping, follows the method of Frédo Durand and Julie Dorsey in Fast Bilateral Filtering for the Display of High-Dynamic-Range Images.

Radiance Map Construction

The observed pixel value \(Z_{ij}\) for pixel \(i\) in image \(j\) is a function of the unknown scene radiance and the known exposure duration: \[Z_{ij} = f(E_i \Delta t_j)\] Here \(E_i\) is the unknown radiance at pixel \(i\), \(\Delta t_j\) is the exposure time of image \(j\), and their product is the exposure received at that pixel. The function \(f\) is an unknown, complicated pixel response curve. Instead of solving for \(f\), we solve for \(g = \ln(f^{-1})\), which maps a discrete pixel value (0 to 255) to the log exposure at that pixel: \[g(Z_{ij}) = \ln(E_i) + \ln(\Delta t_j)\] Since the scene radiance remains constant across the multiple images that we take and we know the exposure times, we can solve for \(g\) by minimizing the quadratic objective \[O = \sum_{i=1}^{N} \sum_{j=1}^{P} \left(g(Z_{ij}) - \ln(E_i) - \ln(\Delta t_j)\right)^2 + \lambda \sum_{z=1}^{254} g''(z)^2\] where the first term enforces the equation above as closely as possible and the second term is a smoothness penalty whose effect we examine below. Having obtained \(g\), we recover the scene radiance by rearranging the equation above: \[\ln(E_i) = g(Z_{ij}) - \ln(\Delta t_j)\] During the estimation we also apply the tent weighting function suggested in the paper. All exposures contribute to the final image, but darker pixels tend to be noisier and brighter pixels saturate, so we give those pixels less weight. The weighting function suggested in the paper is \[ w(z) = \begin{cases} z & \text{for } z \leq 127 \\[.5em] 255 - z & \text{for } z > 127 \end{cases} \] We show the reference image (one image selected from each scene) and the reconstructed radiance map below.
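For concreteness, here is a minimal NumPy sketch of this least-squares setup, in the spirit of Debevec and Malik's gsolve routine; the sampling of pixel locations, the small epsilon on the weights, and the pivot constraint \(g(128) = 0\) are our own assumptions rather than details from the description above.

```python
import numpy as np

def tent_weight(z):
    """Tent weighting: low confidence for very dark (noisy) and very bright (saturated) pixels."""
    return np.where(z <= 127, z, 255 - z).astype(np.float64)

def solve_g(Z, log_t, lam=100.0):
    """Recover the log inverse response curve g from sampled pixel values.

    Z      : (N, P) array of pixel values (0-255) at N sampled locations in P exposures.
    log_t  : (P,) array of log exposure times.
    lam    : weight of the smoothness penalty on g''.
    Returns g (256,) and the log radiance of the sampled pixels (N,).
    """
    N, P = Z.shape
    n = 256
    A = np.zeros((N * P + n - 1, n + N))
    b = np.zeros(A.shape[0])

    k = 0
    for i in range(N):                 # data term: w * (g(Z_ij) - ln(E_i)) = w * ln(t_j)
        for j in range(P):
            w = tent_weight(Z[i, j]) + 1e-6   # epsilon avoids all-zero rows at z = 0 or 255
            A[k, Z[i, j]] = w
            A[k, n + i] = -w
            b[k] = w * log_t[j]
            k += 1
    A[k, 128] = 1.0                    # pin the middle of the curve: g(128) = 0
    k += 1
    for z in range(1, n - 1):          # smoothness term: lambda * w(z) * g''(z)
        w = tent_weight(z)
        A[k, z - 1] = lam * w
        A[k, z] = -2 * lam * w
        A[k, z + 1] = lam * w
        k += 1

    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:n], x[n:]
```

Given \(g\), the full radiance map is assembled per pixel as a weighted average over the exposures, \(\ln E_i = \frac{\sum_j w(Z_{ij})\,(g(Z_{ij}) - \ln \Delta t_j)}{\sum_j w(Z_{ij})}\), which is what the radiance maps shown below visualize.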


Chapel

Radiance Map

House

Radiance Map
We also provide plots of the recovered relationship between exposure and pixel value for the Chapel scene, both without and with the second-derivative smoothing term and the tent weighting function. We observe that the smoothing term plays a major role in suppressing the noise, while the tent weighting function does not contribute significantly to the result for this set of images.

\(\lambda = 0\), identity weighting

\(\lambda = 100\), identity weighting

\(\lambda = 0\), tent weighting

\(\lambda = 100\), tent weighting
Tone Mapping

Obtaining the radiance map is only the first step of creating a great HDR image. The next step is to display details in both the dark and bright regions of the scene on a low-dynamic-range display. In this step, we implement both a global tone-mapping operator using gamma compression and a local tone-mapping operator following the paper by Frédo Durand and Julie Dorsey. The local tone-mapping algorithm is based on bilateral filtering, which decomposes the radiance map into high-frequency details and low-frequency structure.
The key insight of this local tone-mapping algorithm is that global tone-mapping operators often produce images that lack detail, and one way around this is to separate the large-scale variations from the details. The algorithm starts by extracting the intensity of the original image \(I\) as the mean across the three color channels. We then apply a bilateral filter to the intensity to obtain a large-scale (structure) image and a detail image. To preserve the details in the original image, instead of shrinking the dynamic range globally, we reduce the contrast only on the large-scale image. This produces an image that is displayable on a low-dynamic-range display while preserving the details.
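As a rough illustration of this pipeline (not our exact implementation), the following sketch uses OpenCV's bilateral filter on the log intensity; the filter sigmas, the target contrast, and the display gamma are assumed values in the spirit of Durand and Dorsey's recommendations.

```python
import numpy as np
import cv2  # assumed available for its bilateral filter

def local_tone_map(radiance, contrast=5.0, gamma=0.5,
                   sigma_color=0.4, sigma_space_frac=0.02):
    """Local tone mapping in the spirit of Durand & Dorsey (2002).

    radiance : (H, W, 3) float HDR radiance map.
    contrast : target dynamic range (in log10 units) of the large-scale layer.
    """
    eps = 1e-6
    intensity = radiance.mean(axis=2) + eps            # mean of the three color channels
    chroma = radiance / intensity[..., None]           # per-channel color ratios
    log_i = np.log10(intensity).astype(np.float32)

    # Bilateral filter separates large-scale structure from fine detail.
    h, w = log_i.shape
    base = cv2.bilateralFilter(log_i, d=-1,
                               sigmaColor=sigma_color,
                               sigmaSpace=sigma_space_frac * max(h, w))
    detail = log_i - base

    # Compress only the large-scale layer; the detail layer is kept untouched.
    scale = contrast / (base.max() - base.min())
    base_compressed = (base - base.max()) * scale

    out = chroma * (10.0 ** (base_compressed + detail))[..., None]
    return np.clip(out, 0, 1) ** gamma                 # gamma for display
```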

Results

We show the final images obtained by both tone-mapping methods and the bilateral decomposition below.
Please zoom in to see the bilateral-filtered detail; the pixel values have small variance and may not be visible at a small scale.

Chapel

T = 8 seconds

T = 4 seconds

T = 2 seconds

T = 1 second

T = 1/2 seconds

T = 1/4 seconds

T = 1/8 seconds

T = 1/16 seconds

T = 1/32 seconds

T = 1/64 seconds

T = 1/128 seconds

T = 1/256 seconds

Global Tone-Mapped HDR

Bilateral Filtered Detail

Bilateral Filtered Structure

Local Tone-Mapped HDR
Bonsai

T = 1/2 seconds

T = 1/4 seconds

T = 1/10 seconds

T = 1/25 seconds

Global Tone-Mapped HDR

Bilateral Filtered Detail

Bilateral Filtered Structure

Local Tone-Mapped HDR
Arch

T = 17 seconds

T = 3 seconds

T = 1/4 seconds

T = 1/25 seconds

Global Tone-Mapped HDR

Bilateral Filtered Detail

Bilateral Filtered Structure

Local Tone-Mapped HDR
Garage

T = 1/40 seconds

T = 1/160 seconds

T = 1/640 seconds

Global Tone-Mapped HDR

Bilateral Filtered Detail

Bilateral Filtered Structure

Local Tone-Mapped HDR
Garden

T = 1/160 seconds

T = 1/320 seconds

T = 1/800 seconds

T = 1/1600 seconds

T = 1/3200 seconds

Global Tone-Mapped HDR

Bilateral Filtered Detail

Bilateral Filtered Structure

Local Tone-Mapped HDR
House

T = 1/320 seconds

T = 1/640 seconds

T = 1/1250 seconds

Global Tone-Mapped HDR

Bilateral Filtered Detail

Bilateral Filtered Structure

Local Tone-Mapped HDR
Mug

T = 1/8 seconds

T = 1/20 seconds

T = 1/40 seconds

T = 1/80 seconds

T = 1/160 seconds

Global Tone-Mapped HDR

Bilateral Filtered Detail

Bilateral Filtered Structure

Local Tone-Mapped HDR
Window

T = 4 seconds

T = 1 second

T = 1/4 seconds

T = 1/15 seconds

T = 1/60 seconds

Global Tone-Mapped HDR

Bilateral Filtered Detail

Bilateral Filtered Structure

Local Tone-Mapped HDR
Bells & Whistles
1. Try the algorithm on your own photos!
We went out to Treasure Island and took sets of images with different exposures. We applied the HDR pipeline to two sets of images and show the results below. Note that the method does not give compelling results on the first set, since the image with a \(1/15\) second exposure already has no overexposed or underexposed regions. However, the method performs very well on the second set: the rock and the sky cannot both be correctly exposed in any single original image, yet the final HDR image contains the details of both.
San Francisco

T = 1/5 seconds

T = 1/10 seconds

T = 1/15 seconds

T = 1/20 seconds

T = 1/40 seconds

T = 1/60 seconds

Global Tone-Mapped HDR

Local Tone-Mapped HDR

Sunset

T = 1/30 seconds

T = 1/50 seconds

T = 1/60 seconds

T = 1/100 seconds

T = 1/160 seconds

T = 1/200 seconds

Global Tone-Mapped HDR

Local Tone-Mapped HDR

2. Implement any other local tone mapping algorithm
We implement another local tone-mapping algorithm based on a template image. Observe that in most scenarios we have a stack of differently exposed images in order of increasing exposure, and the middle image (the template image) is usually the correctly exposed one. This image may not contain all the details in the bright and dark regions, but it contains the fewest overexposed or underexposed regions. We can therefore treat it as a template and stretch the intensities of the normalized radiance map according to it. We show a comparison on the Chapel image below. The image obtained with this method also achieves a reasonable tone-mapping result, but it has higher contrast and is not as bright as the bilateral-filter-based local tone mapping shown above.

Bilateral Tone-Mapping

Template Tone-Mapping

Project 2: A Neural Algorithm of Artistic Style

Overview
In this project, we reimplement the Neural Style Transfer algorithm introduced by Gatys et al. in A Neural Algorithm of Artistic Style. It uses an artificial system based on a deep neural network to create artistic images of high perceptual quality. The input is two images, one providing the content and the other the style, which we combine in the output image. The method uses the activations at different layers of the neural network to recombine the content and the style of the original images.
Model and Implementation Details
We directly adopt the model used in the original paper, VGG-19, an architecture proven effective at extracting meaningful features in the image domain. We use the 16 convolutional layers and 5 pooling layers of the 19-layer network; unlike the conventional model used for image recognition, we do not use the fully connected layers. We visualize the model below. Note that we replace the max-pooling layers with average-pooling layers, which improve the gradient flow and yield slightly more appealing results.

The input is resized to a resolution of \(512 \times 512\) with three color channels. We use \(\alpha = 1.00\) and \(\beta = 100.00\). We use the L-BFGS optimizer with a learning rate of \(1.00\). All images are generated with \(20\) optimization steps.
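As a rough sketch of the feature extractor described above (assuming PyTorch and torchvision, which may differ from our actual setup), max pooling is swapped for average pooling and the classifier head is dropped; the layer-naming scheme and the set of returned layers are our own conventions.

```python
import torch.nn as nn
from torchvision.models import vgg19

def build_vgg_features():
    """VGG-19 convolutional part with max pooling replaced by average pooling."""
    cnn = vgg19(weights="IMAGENET1K_V1").features.eval()  # torchvision >= 0.13; older versions use pretrained=True
    names, layers = [], []
    block, conv = 1, 0
    for layer in cnn.children():
        if isinstance(layer, nn.Conv2d):
            conv += 1
            names.append(f"conv{block}_{conv}")
        elif isinstance(layer, nn.ReLU):
            layer = nn.ReLU(inplace=False)                 # out-of-place so stored activations are not overwritten
            names.append(f"relu{block}_{conv}")
        elif isinstance(layer, nn.MaxPool2d):
            layer = nn.AvgPool2d(kernel_size=2, stride=2)  # average pooling, as described above
            names.append(f"pool{block}")
            block, conv = block + 1, 0
        layers.append(layer)
    model = nn.Sequential(*layers)
    for p in model.parameters():
        p.requires_grad_(False)                            # network weights stay fixed; only the image is optimized
    return model, names

def extract_features(img, model, names,
                     wanted=("conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1")):
    """Return a dict mapping the requested layer names to their activations for `img`."""
    feats, x = {}, img
    for name, layer in zip(names, model):
        x = layer(x)
        if name in wanted:
            feats[name] = x
    return feats
```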
Extracting the Content
To extract the content from the input content image, following the original paper, we define a squared-error loss so that the generated image and the content image are similar in their feature representations at the selected layer. Let \(p\) and \(x\) be the original image and the generated image, and \(P^l\) and \(F^l\) their respective feature representations at layer \(l\). The squared-error loss is: \[ L_{content} (p, x, l) = \frac{1}{2} \sum_{i,j} (F_{ij}^l - P_{ij}^l)^2 \quad \text{where \(i\) is the filter index and \(j\) is the position index.} \] In this project, we empirically use 'conv2_1' as the layer for content extraction.
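A minimal PyTorch sketch of this content loss, assuming the activation dictionaries produced by a helper like extract_features above:

```python
def content_loss(gen_feats, content_feats, layer="conv2_1"):
    """Squared-error loss between feature maps of the generated and content images."""
    F_l = gen_feats[layer]               # activations of the generated image, shape (1, C, H, W)
    P_l = content_feats[layer].detach()  # activations of the content image are held fixed
    return 0.5 * ((F_l - P_l) ** 2).sum()
```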
Extracting the Style
Instead of directly computing the squared error of the features between the style image and the generated image, we first compute the correlations between the different filter responses. This is done by computing the Gram matrix \(G^l\), where \(G_{ij}^l\) is the inner product between feature maps \(i\) and \(j\) in layer \(l\): \[ G_{ij}^l = \sum_k F_{ik}^l F_{jk}^l \] Then, similar to the procedure defined earlier for content extraction, we define a squared-error loss on the Gram matrices of the style image and the generated image. Following the paper, let \(a\) and \(x\) be the original style image and the generated image, and \(A^l\) and \(G^l\) their respective Gram matrices, where \(N_l\) is the number of filters and \(M_l\) the size of each feature map in layer \(l\). The loss for a particular layer \(l\) is: \[ E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} (G_{ij}^l - A_{ij}^l)^2 \] To form the total style loss, we additionally weight each layer's loss by \(w_l\): \[ L_{style} (a, x) = \sum_{l=0}^L w_l E_l \]
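The corresponding Gram-matrix and style-loss computation could look like the following sketch (again assuming the activation dictionaries from the extractor above):

```python
def gram_matrix(feat):
    """Gram matrix of a (1, C, H, W) feature map: correlations between filter responses."""
    _, c, h, w = feat.shape
    F_l = feat.reshape(c, h * w)          # N_l filters x M_l spatial positions
    return F_l @ F_l.t()

def style_loss(gen_feats, style_feats, layers, weights):
    """Weighted sum of per-layer Gram-matrix losses."""
    loss = 0.0
    for l, w_l in zip(layers, weights):
        _, c, h, w = gen_feats[l].shape
        G = gram_matrix(gen_feats[l])
        A = gram_matrix(style_feats[l]).detach()
        loss = loss + w_l * ((G - A) ** 2).sum() / (4.0 * c ** 2 * (h * w) ** 2)
    return loss
```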
Mixing Content and Style
To mix the content from the content image and the style from the style image in the final output, we linearly combine the two losses with weights \(\alpha\) and \(\beta\). Our final objective is therefore: \[ L_{total} (p, a, x) = \alpha L_{content} (p, x) + \beta L_{style}(a, x) \]
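Putting the pieces together, here is a sketch of the optimization loop with the hyperparameters listed earlier (\(\alpha = 1\), \(\beta = 100\), L-BFGS with learning rate 1, 20 steps); the style layers and their equal weights follow the original paper and are assumptions about our exact configuration.

```python
import torch

# Style layers with equal weights w_l = 1/5, as in the original paper (assumed here).
STYLE_LAYERS = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]
STYLE_WEIGHTS = [0.2] * 5

def transfer(content_img, style_img, features, alpha=1.0, beta=100.0, steps=20):
    """Optimize the pixels of the generated image with L-BFGS against the combined loss.

    features : single-argument callable returning a dict of layer activations for an image.
    """
    x = content_img.clone().requires_grad_(True)       # initialize the output from the content image
    content_feats = {k: v.detach() for k, v in features(content_img).items()}
    style_feats = {k: v.detach() for k, v in features(style_img).items()}
    optimizer = torch.optim.LBFGS([x], lr=1.0)

    for _ in range(steps):
        def closure():
            optimizer.zero_grad()
            gen_feats = features(x)
            loss = (alpha * content_loss(gen_feats, content_feats)
                    + beta * style_loss(gen_feats, style_feats, STYLE_LAYERS, STYLE_WEIGHTS))
            loss.backward()
            return loss
        optimizer.step(closure)
    return x.detach()
```

Usage would look roughly like model, names = build_vgg_features() followed by transfer(content, style, lambda im: extract_features(im, model, names)).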
Neckarfront Style Transfer
To qualitatively compare our results to those in the paper, we transfer the Neckarfront image to three styles that are also used in the paper. As shown below, we are relatively successful in applying the style of each style image to the content. However, the results from the paper are smoother and contain more refined details (e.g., the sky inherits the style of the style image). We believe the difference is due to differences in hyperparameters, and our results could improve with more optimization iterations.
Content Image Style Image Our Results Paper Results

Custom Style Transfer
In addition, we apply the method to our own photographs with the styles used in the paper. As shown below, the first three rows are successful examples, where the network captures the fine details of the original style and content images and combines them in the output image. However, the last one is less successful. Looking closely, we can see that the network transfers the style of the sky onto the sea (rather than the sky) in the content image. This is likely because the sea resembles the sky in color and appearance, and the style image is predominantly sky. This might be resolved with more optimization iterations or a stronger architecture that better captures high-level features.
Content Image Style Image Our Results

Project 3: Lightfield Camera

Overview

This project reproduces the lightfield effect proposed in this paper by Ng et al., using shifting and averaging operations over multiple images taken on a plane orthogonal to the optical axis. We use datasets from the Stanford Light Field Archive, each comprising 289 images taken over a regularly spaced grid.

Depth Refocusing

Objects far away from the camera do not change position significantly when the camera moves around with the optical-axis direction kept fixed; nearby objects, on the other hand, shift significantly across images. Averaging all the images in the grid without any shifting therefore produces an image that is sharp around the far-away objects but blurry around the nearby ones, as shown in the following image:


Shifting the images 'appropriately' and then averaging allows us to focus on objects at different depths. To find the 'appropriate' shift for each image, we extract the camera positions and grid indices from the image file names and build a 17x17 image grid using the grid indices. We then use the center image (with index [8, 8]) as the reference and compute each image's distance from the center image along both the x and y axes. Multiplying these distances by a scalar factor scale gives the shifts that refocus the averaged image at different depths; a smaller scale focuses closer. A sketch of this procedure is shown below.
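A minimal NumPy/SciPy sketch of this shift-and-average step; the sub-pixel interpolation, boundary handling, and sign convention of the shift are assumptions that depend on how the grid positions are parsed.

```python
import numpy as np
from scipy.ndimage import shift as subpixel_shift

def refocus(images, positions, scale, center=(8, 8)):
    """Shift-and-average refocusing over the 17x17 light-field grid.

    images    : list of (H, W, 3) float sub-aperture images.
    positions : list of (row, col) grid indices parsed from the file names.
    scale     : controls the focal depth; a smaller scale focuses closer.
    """
    out = np.zeros_like(images[0], dtype=np.float64)
    for img, (row, col) in zip(images, positions):
        dy = (row - center[0]) * scale            # shift proportional to distance from the center image
        dx = (col - center[1]) * scale
        out += subpixel_shift(img, (dy, dx, 0), order=1, mode="nearest")
    return out / len(images)
```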


Below are the results for different shift scale factors; the last GIF shows the transition across focus depths.


scale = -0.27

scale = -0.19

scale = -0.11

scale = -0.03

Aperture Adjustment

In this part we reproduce the aperture effect in lightfield photos by adjusting the number of images that are averaged. Averaging a large number of images sampled over the grid mimics a camera with a much larger aperture, while using fewer images simulates a smaller aperture. We define a radius parameter that represents the aperture and determines which images are selected: we average the images whose grid index is within radius of the center image, so radius = 0 selects only the center image and radius = 8 selects all images on the grid. Below we show a sketch of this selection step, followed by results for different apertures.
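A short sketch of the selection step, reusing the refocus routine sketched above; the Chebyshev (grid-step) distance is an assumption inferred from the fact that radius = 8 selects the full 17x17 grid.

```python
def adjust_aperture(images, positions, radius, scale=0.0, center=(8, 8)):
    """Average only the sub-aperture images within `radius` grid steps of the center image."""
    keep = [(img, pos) for img, pos in zip(images, positions)
            if max(abs(pos[0] - center[0]), abs(pos[1] - center[1])) <= radius]
    return refocus([img for img, _ in keep], [pos for _, pos in keep], scale, center)
```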


radius = 2

radius = 4

radius = 6

radius = 8

References

[1] Debevec, Paul E., and Jitendra Malik. "Recovering high dynamic range radiance maps from photographs." ACM SIGGRAPH 2008 classes. 2008. 1-10.
[2] Durand, Frédo, and Julie Dorsey. "Fast bilateral filtering for the display of high-dynamic-range images." Proceedings of the 29th annual conference on Computer graphics and interactive techniques. 2002.
[3] HDR Dataset from the HDR assignment from the Computational Photography course at Brown University.
[4] Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "A neural algorithm of artistic style." arXiv preprint arXiv:1508.06576 (2015).
[5] Ng, Ren, et al. "Light field photography with a hand-held plenoptic camera." Computer Science Technical Report CSTR 2.11 (2005): 1-11.