CS 194 Final Project

Neural Style Transfer

Introduction

In this project we perform style transfer with deep neural networks: we take the style of one image and transfer it onto the content of another. We use features from the batch-normalized VGG-19 model to reconstruct an image's content and its style from various layers of the network, and then combine the two objectives to construct an image whose content matches one image and whose style matches the other. All code for this project can be found in this Colab notebook here.
The drive folder for the data can be found here. Note that this project used this tutorial from PyTorch online for reference.

Creating Loss Functions

The premise of this algorithm is to use a pre-existing neural network, which we define as $N(x)$, that takes in an input image and computes features at each of its layers. We want loss functions that take an arbitrary noisy image and, via gradient descent, produce an image whose features match those of a reference image in some sense. This is the premise of content and style loss.
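As a minimal sketch of this setup, assuming PyTorch and torchvision (the helper name `layer_features` is ours, introduced for illustration):

```python
import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load batch-normalized VGG-19 and freeze its weights; during style
# transfer only the input image receives gradients.
vgg = models.vgg19_bn(pretrained=True).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def layer_features(x, layers):
    """Run x through the network, collecting activations at the
    requested module indices of the Sequential feature extractor."""
    feats = {}
    for i, module in enumerate(vgg):
        x = module(x)
        if i in layers:
            feats[i] = x
    return feats
```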

Content Loss

Content loss is very simple. For a given layer $l$, we compute the SSD (sum of squared differences) between the features of the reference image and the features of the input image. The network parameters and the reference features are static; the only gradients computed are those of the input image itself. $$ \mathscr{L}_{\text{Content}}(x, r, l) = SSD(N(x)[l], N(r)[l]) $$ where $r$ is the reference image, $x$ is our noisy image, $l$ is our layer, and $N$ is the neural network function.
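A short sketch of this loss on top of the helper above, with `torch.sum` of squared differences implementing the SSD:

```python
def content_loss(x_feat, r_feat):
    # SSD between input-image features and (detached) reference
    # features at a single layer; gradients flow only through x_feat.
    return torch.sum((x_feat - r_feat.detach()) ** 2)
```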

Style Loss

Style loss is a bit different. Two images with similar styles will not have the same pixel values at each location. They will, however, have similar correlations between the feature maps (or channels) of a given layer. We can think of a layer as $c$ channels worth of ``images'', each with height $h$ and width $w$. Thus, we compute the Gram matrix $G$ of a layer with itself, which is a $c \times c$ matrix such that $G_{i, j}$ is the dot product of the image at the $i$th channel with the image at the $j$th channel. We then normalize the Gram matrix by dividing by the number of entries ($c \cdot h \cdot w$) in the layer. We define our style loss as the SSD between the Gram matrix of the reference and that of the input. $$ \mathscr{L}_{\text{Style}}(x, r, l) = SSD(G(N(x)[l]), G(N(r)[l])) $$ where $G$ is the function that takes in a $c \times (h \times w)$ layer and returns a $c \times c$ matrix.
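In code, a sketch of the normalized Gram matrix and the resulting style loss (assuming features come in as $1 \times c \times h \times w$ tensors):

```python
def gram_matrix(feat):
    # Flatten the 1 x c x h x w feature map to c x (h*w), take all
    # pairwise channel dot products, and normalize by c*h*w.
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    return (f @ f.t()) / (c * h * w)

def style_loss(x_feat, r_feat):
    # SSD between the Gram matrices of the input and reference features.
    return torch.sum((gram_matrix(x_feat) - gram_matrix(r_feat).detach()) ** 2)
```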

Final Loss

The final loss function depends on how much content versus style we want in a given image. Suppose our content reference is $c$ and our style reference is $s$. We have $\alpha$ and $\beta$ as parameters in our algorithm that determine the relative weight of content versus style. We also define $C_L$ and $S_L$ as the sets of layers on which we wish to compute content and style loss. Our final loss function is then: $$ \alpha \sum_{l_c \in C_L} \mathscr{L}_{\text{Content}}(x, c, l_c) + \beta \sum_{l_s \in S_L} \mathscr{L}_{\text{Style}}(x, s, l_s) $$ What we now need to do is determine which layers are best for content and which are best for style.
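Putting the pieces together, a sketch of the combined objective (reference features could be cached once outside the optimization loop; they are recomputed here only to keep the sketch short):

```python
def total_loss(x, content_img, style_img, C_L, S_L, alpha, beta):
    xf = layer_features(x, set(C_L) | set(S_L))
    cf = layer_features(content_img, set(C_L))
    sf = layer_features(style_img, set(S_L))
    loss = alpha * sum(content_loss(xf[l], cf[l]) for l in C_L)
    return loss + beta * sum(style_loss(xf[l], sf[l]) for l in S_L)
```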

Running our Final Algorithm

To run our final algorithm, we simply choose values of $\alpha$, $\beta$, $C_L$, and $S_L$ and have the algorithm backpropagate errors onto an image of white noise. Once this is done, the image turns into a blend of the content from one image and the style from the other.
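The referenced PyTorch tutorial optimizes the image with L-BFGS; a sketch of that loop under the same assumption, with `content_img`, `style_img`, `C_L`, `S_L`, `alpha`, and `beta` as hypothetical names defined as above:

```python
# Start from white noise shaped like the content image (assumed to be
# a 1 x 3 x H x W tensor with values in [0, 1], as in the tutorial).
x = torch.randn_like(content_img, requires_grad=True)
optimizer = torch.optim.LBFGS([x])

for _ in range(50):
    def closure():
        optimizer.zero_grad()
        loss = total_loss(x, content_img, style_img, C_L, S_L, alpha, beta)
        loss.backward()
        return loss
    optimizer.step(closure)
    with torch.no_grad():
        x.clamp_(0, 1)  # keep the image in a displayable range
```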

Content and Style Reconstruction

To determine which layers are best for each purpose, we run experiments on the content and style reconstruction of a given image.

Content Reconstruction

For the given content images we run content reconstruction with a single layer. In this case we set $S_L = \emptyset$, set $C_L$ to a single layer, and run the algorithm. Original Images:
We now show our reconstruction for the first, fourth, sixth, and eighth layers.
Some takeaways are that content reconstruction holds up well even at later layers, and that if we want to strip the style away from these images we should look at later layers, where the style starts to break down (hence the more pixelated and wavy shapes). It is there that the content will not be as overpowering as it is at earlier layers.

Style Reconstruction

Similarly, we run style reconstruction. This time we set $\alpha$ to 0 and use only one layer for the style loss. Original Images:
Reconstructed Style:
Stylistically, what is interesting is that the first layer is probably the worst at capturing the style of the image. I think this is because the model cannot nail the style when the features are computed directly from just three input channels; the network needs to be as expressive as possible, and that is only true of the later layers. The real perceptual style seems to come out only in layers 6-8, so those are the layers we will use in our style transfer construction.

Style Transfer Results

Now that we have done our reconstructions, we have a good idea of which layers we want for content and style. For the content loss, we use the 5th convolutional layer, which preserves enough structure in the image without overpowering the style loss. For the style loss, we take an average over the 7th, 8th, and 9th convolutional layers to get the most realistic style.
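One plausible way to express this layer choice in code, continuing the sketch above (the exact Sequential indices depend on how the conv layers interleave with the BN/ReLU/pool modules, so these are illustrative; the averaging over style layers can be folded into $\beta$):

```python
# Pick out the Conv2d modules inside vgg19_bn's feature extractor;
# conv_idx[k] is the Sequential index of the (k+1)-th conv layer.
conv_idx = [i for i, m in enumerate(vgg) if isinstance(m, torch.nn.Conv2d)]

C_L = [conv_idx[4]]                            # 5th conv layer: content
S_L = [conv_idx[6], conv_idx[7], conv_idx[8]]  # 7th-9th conv layers: style
```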

Results

Starry Campanile
The first one used Starry Night and the Campanile with later layers of the network. The second one used earlier layers of the network.
Wild Campanile
Wild Chicago
Simpsons Chicago
Starry Arizona
Apocalypse Arizona
Whales
As part of my student org, I also made whales in different styles.