COMPSCI 194-26: Final Project

Kaijie Xu

nortrom@berkeley.edu

Project 1: Neural Art Style Transfer

The first project is a reimplementation of the paper "A Neural Algorithm of Artistic Style" by Gatys et al., which describes a neural algorithm for transferring artistic styles.

In this project I generate an image that takes its style from an artwork and its content from a photograph.

As described in the paper, we use the pre-trained VGG19 network.

Model Architecture:

(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) *
(1): ReLU(inplace=True)
(2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace=True)
(4): AvgPool2d(kernel_size=2, stride=2, padding=0)
(5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) *
(6): ReLU(inplace=True)
(7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU(inplace=True)
(9): AvgPool2d(kernel_size=2, stride=2, padding=0)
(10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) *
(11): ReLU(inplace=True)
(12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): ReLU(inplace=True)
(14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): ReLU(inplace=True)
(16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(17): ReLU(inplace=True)
(18): AvgPool2d(kernel_size=2, stride=2, padding=0)
(19): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) *
(20): ReLU(inplace=True)
(21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(22): ReLU(inplace=True)
(23): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(24): ReLU(inplace=True)
(25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(26): ReLU(inplace=True)
(27): AvgPool2d(kernel_size=2, stride=2, padding=0)
(28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) *
(29): ReLU(inplace=True)
(30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(31): ReLU(inplace=True)
(32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(33): ReLU(inplace=True)
(34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(35): ReLU(inplace=True)
(36): AvgPool2d(kernel_size=2, stride=2, padding=0)

We replace the max pooling layers with average pooling layers and remove all fully connected layers, as the paper suggests.
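A minimal sketch of how this modified network can be built with torchvision (illustrative, not my exact code): keeping only the convolutional part drops every fully connected layer, and each MaxPool2d is swapped for an AvgPool2d.

import torch.nn as nn
from torchvision import models

def build_vgg19_avgpool(device="cpu"):
    # Keep only the convolutional part; dropping the classifier removes the FC layers.
    vgg = models.vgg19(pretrained=True).features.to(device).eval()
    # Replace every max pooling layer with average pooling, as the paper suggests.
    for i, layer in enumerate(vgg):
        if isinstance(layer, nn.MaxPool2d):
            vgg[i] = nn.AvgPool2d(kernel_size=2, stride=2)
    # The network weights stay frozen; only the generated image is optimised.
    for p in vgg.parameters():
        p.requires_grad_(False)
    return vgg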

For the content part of the image, we take the outputs of the convolutional layers; as suggested in the paper, we only use the content representation at layer conv4_2.
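In the paper's notation, with F^l and P^l the feature maps of the generated image x and of the content photograph p at layer l, the content loss is simply the squared error between them:

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}$$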

For the style part, we take the outputs of several of the early convolutional layers and compute their Gram matrices, which are similar to covariance matrices of the different filter outputs.
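A minimal sketch of the Gram matrix computation (function and variable names are illustrative): each feature map is flattened into a row, and the matrix of inner products between all pairs of rows gives the correlations between filter responses.

import torch

def gram_matrix(features):
    # features: (batch, channels, height, width) activations from one layer
    b, c, h, w = features.size()
    f = features.view(b * c, h * w)   # one flattened row per feature map
    gram = f @ f.t()                  # inner products between all channel pairs
    return gram / (b * c * h * w)     # normalise by the number of elements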

We define two losses: the style loss and the content loss.

The style loss is a multi-scale representation that refers to the input artwork: it is a summation over layers from conv1_1 (a lower layer) to conv5_1 (a higher layer).

Computing the style loss across multiple layers captures everything from low-level features such as points and edges at the lower layers to higher-level stylistic structure at the upper layers.

We use the layers advised in the paper: conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1. Here are the formulas for the style loss:
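In the paper's notation, G^l and A^l are the Gram matrices of the generated image and of the artwork at layer l, N_l is the number of feature maps at that layer, M_l is their spatial size, and w_l are per-layer weights:

$$E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}, \qquad \mathcal{L}_{style} = \sum_{l} w_{l} E_{l}$$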

We then weight the two losses differently, and the total loss we minimise is a weighted sum of the two.
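In the paper's notation, with p the content photograph, a the artwork and x the generated image, this is

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \, \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \, \mathcal{L}_{style}(\vec{a}, \vec{x})$$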

where α and β are the weighting factors for content and style reconstruction respectively.
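A minimal sketch of the optimisation loop, assuming hypothetical helpers content_loss and style_loss that compare the VGG activations of the generated image against precomputed targets from the content photo and the style artwork (the weights and iteration count below are only example values):

import torch

def style_transfer(content_img, content_loss, style_loss,
                   num_iterations=500, alpha=1.0, beta=1e6):
    # The pixels of the generated image are the only parameters being optimised;
    # we start from a copy of the content photograph.
    generated = content_img.clone().detach().requires_grad_(True)
    optimizer = torch.optim.LBFGS([generated])

    def closure():
        optimizer.zero_grad()
        loss = alpha * content_loss(generated) + beta * style_loss(generated)
        loss.backward()
        return loss

    for _ in range(num_iterations):
        optimizer.step(closure)
    return generated.detach()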

Here is my experiment on an image of UCB at different iteration counts:

The original content image and the original style image

The generated image at 200 iterations and 500 iterations

The generated image at 1000 iterations and 2000 iterations

The final image at 5000 iterations.

Here are the results of transferring the styles of different artworks to my hometown:

Project 2: Eulerian Video Magnification

This project explores Eulerian Video Magnification (EVM), a technique that reveals temporal variations in a video that are too subtle for the human eye to detect.

The first step in magnifying a video is to compute a Laplacian pyramid for every single frame.

We are already familiar with the Laplacian pyramid, since we built something similar in previous projects.

The pyramid is constructed by taking the difference between adjacent levels of a Gaussian pyramid; it approximates the second derivative of the image and highlights regions of rapid intensity change.
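A minimal sketch of the per-frame pyramid construction with OpenCV (names and the number of levels are illustrative): each Laplacian level is the difference between a Gaussian level and the upsampled version of the next coarser one, and the coarsest Gaussian level is kept for reconstruction.

import cv2
import numpy as np

def laplacian_pyramid(frame, levels=4):
    frame = frame.astype(np.float32)
    gaussian = [frame]
    for _ in range(levels):
        gaussian.append(cv2.pyrDown(gaussian[-1]))
    pyramid = []
    for i in range(levels):
        # Upsample the next coarser level back to this level's size and subtract.
        up = cv2.pyrUp(gaussian[i + 1],
                       dstsize=(gaussian[i].shape[1], gaussian[i].shape[0]))
        pyramid.append(gaussian[i] - up)
    pyramid.append(gaussian[-1])  # coarsest Gaussian level, needed to rebuild the frame
    return pyramid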

With the Laplacian pyramid in hand, we can apply a Butterworth band-pass filter along the time axis to extract the temporal frequency band of interest at each pyramid level.
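A sketch of this temporal filtering with SciPy, applied independently to each pyramid level stacked over time (the cut-off frequencies and filter order here are illustrative, not the values I actually used):

import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_temporal(level_stack, fps, low_hz=0.4, high_hz=3.0, order=1):
    # level_stack: array of shape (num_frames, H, W[, C]) holding one pyramid
    # level across the whole video; filtering is done along the time axis.
    b, a = butter(order, [low_hz, high_hz], btype="bandpass", fs=fps)
    return filtfilt(b, a, level_stack, axis=0)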

And after extracting the frequency band of interest, we need to amplify it and add the result back to the original signal.

Specifically, to reconstruct the filtered and amplified image, we collapse the scaled Laplacian pyramid across its levels and add back the last (coarsest) level of the scaled Gaussian pyramid to obtain the final frame.
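A sketch of that reconstruction for a single frame, assuming the same pyramid layout as laplacian_pyramid() above (coarsest Gaussian level stored last) and an illustrative amplification factor:

import cv2

def reconstruct(original_pyramid, filtered_pyramid, alpha=20.0):
    # Add the amplified band-passed signal back onto each original level.
    boosted = [orig + alpha * filt
               for orig, filt in zip(original_pyramid, filtered_pyramid)]
    # Collapse the pyramid: start from the coarsest level, then repeatedly
    # upsample and add the next finer Laplacian level.
    image = boosted[-1]
    for level in reversed(boosted[:-1]):
        image = cv2.pyrUp(image, dstsize=(level.shape[1], level.shape[0])) + level
    return image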

After that, we create the output video by assembling the processed frames in sequence.
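A sketch of that last step with OpenCV's VideoWriter (the file name, codec and frame rate are placeholders):

import cv2
import numpy as np

def write_video(frames, path="output.mp4", fps=30):
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(np.clip(frame, 0, 255).astype(np.uint8))
    writer.release()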

Results

Here is an example of the amplified filtered video:

Here are some results of how EVM amplifies motions:

And here is a video of myself.