Final Project 1: Reimplementing A Neural Algorithm of Artistic Style

1. Introduction

In this project, we reimplemented the paper A Neural Algorithm of Artistic Style. Given two images, a content image and a style image, the goal is to automatically transfer the artistic style of the style image onto the content image using a neural network.

2. Method

We use exactly the same method introduced in the paper. To achieve this goal, we minimize the following loss through back-propagation:

L_total(p, a, x) = ALPHA * L_content(p, x) + BETA * L_style(a, x)

where ALPHA and BETA are weights. L_content is the content loss that preserves the content, and L_style is the style loss that transfers the style. Here p is the content image, a is the style image, and the input image x is initialized to random noise. We back-propagate to x to minimize the loss, while all parameters of the neural network are kept fixed.

The output of each layer of a CNN can be viewed as a feature map. The output can be stored in a matrix F. The content loss is then defined as

L_content(p, x, l) = 1/2 * sum_{i,j} (F_ij^l - P_ij^l)^2

where F_ij^l is the ij-th entry of the feature map of the input image x at layer l and P_ij^l is the ij-th entry of the feature map of the content image p at layer l.

For the feature map F at layer l, we also define the following Gram matrix G:

G_ij^l = sum_k F_ik^l * F_jk^l

Then we define the style loss at layer l to be

E_l = 1 / (4 * N_l^2 * M_l^2) * sum_{i,j} (G_ij^l - A_ij^l)^2

where N_l is the number of filters at layer l and M_l is the height times the width of the feature map at layer l. G is the Gram matrix of the input image x and A is the Gram matrix of the style image a.

The style losses at each layer are then combined to calculate the total style loss

L_style(a, x) = sum_l w_l * E_l

where w_l is the weight for the style loss at each layer.
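As a concrete illustration, here is a minimal sketch of how these losses could be computed (assuming PyTorch, with each feature map already reshaped into an N_l x M_l matrix; the helper names are illustrative and not taken from the original implementation):

```python
import torch

def content_loss(F, P):
    # 1/2 * sum_ij (F_ij - P_ij)^2, with F, P of shape (N_l, M_l)
    return 0.5 * ((F - P) ** 2).sum()

def gram_matrix(F):
    # G = F F^T, an N_l x N_l matrix of filter correlations
    return F @ F.t()

def style_layer_loss(F, A_feat):
    # E_l = (1 / (4 N_l^2 M_l^2)) * sum_ij (G_ij - A_ij)^2 at a single layer
    N_l, M_l = F.shape
    G = gram_matrix(F)
    A = gram_matrix(A_feat)
    return ((G - A) ** 2).sum() / (4 * N_l ** 2 * M_l ** 2)

def style_loss(Fs, As, ws):
    # Weighted sum of the per-layer style losses: sum_l w_l * E_l
    return sum(w * style_layer_loss(F, A) for F, A, w in zip(Fs, As, ws))
```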

3. Network Architecture

We used exactly the same network architecture as VGG-19 with pre-trained ImageNet weights, except that to calculate the losses we only take the outputs of some of its convolutional layers. Specifically, we used the conv4_2 layer to calculate the content loss, and conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 to calculate the style loss. Here is a visualization of the network architecture used in the project.

Following the paper, we used average pooling with a 2x2 kernel instead of the max pooling in the original VGG-19. Each convolutional layer is followed by a ReLU activation. The labeling in the picture is read as follows: [layer name]: [kernel size] [layer type], [number of filters]. All convolutional layers use stride 1. The input image size can be arbitrary; however, we use the size of the content image as the final output size, so we resized the style image to match the size of the content image.
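Below is a minimal sketch of how such a feature extractor could be set up with torchvision (PyTorch assumed; the numeric layer indices follow torchvision's VGG-19 ordering and are an assumption rather than something specified in the report):

```python
import torch.nn as nn
from torchvision import models

# Load pre-trained VGG-19 and swap every max-pooling layer for 2x2 average pooling.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for i, layer in enumerate(vgg):
    if isinstance(layer, nn.MaxPool2d):
        vgg[i] = nn.AvgPool2d(kernel_size=2, stride=2)
for p in vgg.parameters():
    p.requires_grad_(False)   # the network weights stay fixed

# Layer indices used for the losses (torchvision VGG-19 ordering, assumed here).
STYLE_LAYERS = [0, 5, 10, 19, 28]   # conv1_1, conv2_1, conv3_1, conv4_1, conv5_1
CONTENT_LAYER = 21                  # conv4_2

def extract_features(img, layers):
    # img: tensor of shape (1, 3, H, W); returns {layer index: feature map}
    feats, x = {}, img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            feats[i] = x
    return feats
```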

For hyperparameters, we set ALPHA_BETA_RATIO = 1e-3, BETA = 0.5, ALPHA = BETA * ALPHA_BETA_RATIO, and LEARNING_RATE = 0.1. The final output is picked visually from snapshots at different iterations.
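Putting the pieces together, a rough version of the optimization loop could look like the following. It reuses the helpers sketched above; the choice of Adam, the number of iterations, and the equal style-layer weights are assumptions, since the report only specifies the learning rate and the ALPHA/BETA values:

```python
import torch

ALPHA_BETA_RATIO = 1e-3
BETA = 0.5
ALPHA = BETA * ALPHA_BETA_RATIO
LEARNING_RATE = 0.1
STYLE_WEIGHTS = [0.2] * len(STYLE_LAYERS)   # w_l; equal weights as one possible choice

def flatten_fm(fm):
    # (1, N_l, H, W) feature map -> (N_l, M_l) matrix with M_l = H * W
    return fm.squeeze(0).flatten(start_dim=1)

# content_img and style_img are assumed to be (1, 3, H, W) tensors.
x = torch.randn_like(content_img).requires_grad_(True)   # start from random noise
optimizer = torch.optim.Adam([x], lr=LEARNING_RATE)

# Targets: content feature map P and style feature maps used for the Gram matrices.
P = flatten_fm(extract_features(content_img, [CONTENT_LAYER])[CONTENT_LAYER]).detach()
A_feats = [flatten_fm(f).detach()
           for f in extract_features(style_img, STYLE_LAYERS).values()]

for step in range(1000):
    optimizer.zero_grad()
    feats = extract_features(x, STYLE_LAYERS + [CONTENT_LAYER])
    c_loss = content_loss(flatten_fm(feats[CONTENT_LAYER]), P)
    s_loss = style_loss([flatten_fm(feats[i]) for i in STYLE_LAYERS],
                        A_feats, STYLE_WEIGHTS)
    loss = ALPHA * c_loss + BETA * s_loss
    loss.backward()     # gradients flow only to x; the VGG weights are frozen
    optimizer.step()
```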

4. Neckarfront and Artworks

In this section, we show the results of transferring the styles of three artworks mentioned in the paper onto a photograph of the Neckarfront.

4.1 Neckarfront and chosen artworks

Neckarfront

We chose Vincent van Gogh's Starry Night, J.M.W. Turner's The Shipwreck of the Minotaur, and Pablo Picasso's Femme nue assise.

Artworks

4.2 Own Results

Here are the results:

Results

4.3 Comparison with Results from Paper

Here we can compare them with the results from the paper:

Paper Results

For Starry Night, we can see that the result from the paper also captured the "stars" in the sky.

For The Shipwreck of the Minotaur, we used a different scanned copy of the artwork from the internet that does not have the greenish tint. In our result, the style is most visible near the boundary between the buildings and the river.

For Femme nue assise, we also successfully captured the "blocky" style of the artwork.

We believe the results could be improved further by fine-tuning the hyperparameters.

5. Own Pictures and Self-chosen Style

5.1 Successful Cases

Golden Gate Bridge and Vincent van Gogh's Sunflowers

Own Picture Style Result

Sather Gate and Vincent van Gogh's Time in Britain

Own Picture Style Result

5.2 Failed Case

Here we used a picture of New York City as the content image and a traditional Chinese painting as the style image.

Own Picture Style Result

The style image is black-and-white but the content image is colorful. If we want to transfer the style of the traditional Chinese painting to NYC, we would also want the final result to be black-and-white. However, the content loss prevents the final result from becoming black-and-white, and thus the algorithm fails: the Statue of Liberty is still a bit greenish, and colors remain in the sky and other parts of the image.

Final Project 2: Lightfield Camera

1. Overview

In this project, using images captured by a camera at various positions on a grid, we can synthesize an image focused at a different depth or with a different aperture. This project aims to reproduce such results on the Stanford Light Field Archive dataset.

2. Approach

The camera position is encoded in the filename of each image in the dataset. By using these camera positions as parameters, we can change the focus depth and the aperture. Specifically,

For depth change: we average all images after shifting each one by c * [u - mean_u, v - mean_v], where u and v are the real-world positions of the camera and mean_u and mean_v are the averages of all camera positions. The parameter c controls how much the focus depth changes.

For aperture change: we average only the images satisfying (u - mean_u)^2 + (v - mean_v)^2 <= r^2, where r controls the size of the aperture.
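Both operations can be summarized by a short sketch like the one below (NumPy/SciPy assumed; the image list and the (u, v) positions parsed from the filenames are taken as given, and the mapping of u and v onto image rows and columns may need to be flipped depending on the dataset convention):

```python
import numpy as np
from scipy.ndimage import shift

def refocus(images, positions, c):
    # Shift every view by c * (u - mean_u, v - mean_v) and average, changing the
    # plane of focus. images: list of HxWx3 float arrays; positions: list of (u, v).
    uv = np.asarray(positions, dtype=float)
    mean_u, mean_v = uv.mean(axis=0)
    acc = np.zeros_like(images[0], dtype=float)
    for img, (u, v) in zip(images, uv):
        acc += shift(img, (c * (v - mean_v), c * (u - mean_u), 0),
                     order=1, mode='nearest')
    return acc / len(images)

def change_aperture(images, positions, r):
    # Average only the views whose camera position lies within radius r of the
    # mean position, simulating a larger or smaller aperture.
    uv = np.asarray(positions, dtype=float)
    mean_u, mean_v = uv.mean(axis=0)
    kept = [img for img, (u, v) in zip(images, uv)
            if (u - mean_u) ** 2 + (v - mean_v) ** 2 <= r ** 2]
    return np.mean(kept, axis=0)
```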

3. Results

3.1 Depth Change

Image 1
Image 2

3.2 Aperture Change

Image 1
Image 2

4. Summary

I am truly fascinated by how much we can do with some simple calculations and shifting of images. In addition, I learned how to use a real-world dataset released by experts in the field and how to interpret its annotation structure.

5. My own data

The images were taken on a 5x5 grid with equal spacing between positions.

Own Image Depth
Own Image Aperture

You can see from the results that the algorithm tries to refocus, but it fails. This might be because the physical spacing of the grid (approximately 1 cm between adjacent positions) is too small and therefore very susceptible to noise: small mismatches in the distances turn into large errors when multiplied by the constant c. The errors in the distances were caused by unsteady hands and imprecise grid positioning. The small grid size (5x5, versus 17x17 in the Stanford dataset) also contributed to the significant errors visible in the GIFs.