neural style transfer & lightfield camera

Nadia Hyder

PART 1: NEURAL STYLE TRANSFER

For my first project, I reimplemented the neural style transfer algorithm described in A Neural Algorithm of Artistic Style (Gatys et al.). The algorithm takes two inputs, a style image and a content image, and uses a deep convolutional neural network to render the content image in the artistic style of the style image.

 

METHODOLOGY

I first tested my algorithm on two input images from the paper: Femme nue assise by Pablo Picasso as the style image, and Neckarfront by Andreas Praefcke as the content image. The third image is the output, with Picasso's style applied to the content image.

[Figures: style image (Femme nue assise), content image (Neckarfront), and the stylized output]

 

I then used a convolutional neural network to create style and content reconstructions while minimizing both content and style loss. I will define these more clearly in the sections to come.

 

CONVOLUTIONAL NEURAL NETWORK

I used the features module of PyTorch's pre-trained VGG-19 to extract the feature maps at each convolutional layer. The network has the following architecture:

Sequential(

  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (1): ReLU(inplace=True)

  (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (3): ReLU(inplace=True)

  (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)

  (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (6): ReLU(inplace=True)

  (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (8): ReLU(inplace=True)

  (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)

  (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (11): ReLU(inplace=True)

  (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (13): ReLU(inplace=True)

  (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (15): ReLU(inplace=True)

  (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (17): ReLU(inplace=True)

  (18): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)

  (19): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (20): ReLU(inplace=True)

  (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (22): ReLU(inplace=True)

  (23): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (24): ReLU(inplace=True)

  (25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (26): ReLU(inplace=True)

  (27): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)

  (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (29): ReLU(inplace=True)

  (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (31): ReLU(inplace=True)

  (32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (33): ReLU(inplace=True)

  (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

  (35): ReLU(inplace=True)

  (36): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)

)

[Figure: illustration of the VGG-19 network architecture]

 

 

 

I normalized the image tensors with μ = [0.485, 0.456, 0.406] and σ = [0.229, 0.224, 0.225] (the ImageNet statistics), as required by pre-trained VGG networks.
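Concretely, the setup can be sketched as follows. This is a minimal illustrative sketch rather than my exact code: the helper name get_features and the idea of indexing layers by their position in the features module are assumptions.

import torch
import torchvision.models as models
import torchvision.transforms as transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pre-trained VGG-19 feature extractor, frozen (only the image is optimized)
vgg = models.vgg19(pretrained=True).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# ImageNet normalization expected by pre-trained VGG networks
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

def get_features(image, layers):
    # Collect activations at the requested indices of vgg.features,
    # e.g. {21: 'conv4_2'} for a content layer (index choice is an assumption).
    feats, x = {}, image
    for idx, layer in vgg._modules.items():
        x = layer(x)
        if int(idx) in layers:
            feats[layers[int(idx)]] = x
    return feats

Reconstruction then happens by optimizing the input image itself against losses computed on these activations, as described next.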

 

CONTENT AND STYLE LOSS

As the paper suggests, I used two loss functions to analyze the network's performance: content loss and style loss. Content loss is essentially the mean squared error between the feature map of the content image and the feature map of the generated image at a chosen content layer of the network. Content loss is calculated as follows, where F^l_ij is the activation of the i-th filter at position j in layer l, p and x are the original (content) image and the generated image, and P^l and F^l are their respective feature representations in layer l:

L_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}
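In code, the content loss for a layer reduces to half the sum of squared differences between the two feature maps; a minimal sketch (the function name is mine):

def content_loss(gen_feat, content_feat):
    # 1/2 * sum over filters i and positions j of (F_ij - P_ij)^2
    return 0.5 * ((gen_feat - content_feat) ** 2).sum()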

To compute style loss, I created a Gram matrix that contains feature correlations:

G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}

We then try to minimize the mean-squared distance between the Gram matrix of the original image and that of the generated image. Here N_l is the number of distinct filters in layer l and M_l is the size (height times width) of each feature map. The style loss is calculated using the following formulas, where a and x are the original (style) image and the generated image, A^l and G^l are their respective Gram matrices in layer l, and w_l weights each layer's contribution:

E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}

L_{style}(a, x) = \sum_{l=0}^{L} w_{l} E_{l}
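A corresponding sketch of the Gram matrix and per-layer style loss (again, the function names are mine, and the feature tensor is assumed to have shape 1 x N_l x H x W):

def gram_matrix(feat):
    # Flatten to N_l x M_l and take F F^T to get the feature correlations
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    return f @ f.t()

def style_loss(gen_feat, style_feat):
    # E_l = 1 / (4 * N_l^2 * M_l^2) * sum_ij (G_ij - A_ij)^2
    _, c, h, w = gen_feat.shape
    G = gram_matrix(gen_feat)
    A = gram_matrix(style_feat)
    return ((G - A) ** 2).sum() / (4 * c ** 2 * (h * w) ** 2)

The total style loss is then the weighted sum of these per-layer terms over the chosen style layers.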

The total loss is calculated using hyperparameters α and β, which weight the content and style terms:

L_{total}(p, a, x) = \alpha L_{content}(p, x) + \beta L_{style}(a, x)

Using the images above, I got the following content and style loss:

[Figure: content and style loss during optimization]

 

NETWORK PARAMETERS

For the above images, I chose hyperparameters α =  and β = 1. I used L-BFGS to optimize.
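As a rough sketch of the optimization loop, using the helpers sketched earlier (the layer indices follow the paper's usual choices of conv1_1 through conv5_1 for style and conv4_2 for content, and the placeholder weights and iteration count are assumptions rather than my exact settings):

import torch

# content_img and style_img are normalized 1 x 3 x H x W tensors on device
style_layers = {0: 'conv1_1', 5: 'conv2_1', 10: 'conv3_1', 19: 'conv4_1', 28: 'conv5_1'}
content_layers = {21: 'conv4_2'}
content_feats = get_features(content_img, content_layers)
style_feats = get_features(style_img, style_layers)

alpha, beta = 1, 1e6               # placeholder weighting; the ratio is a tuning choice
generated = content_img.clone().requires_grad_(True)   # optimize the image itself
optimizer = torch.optim.LBFGS([generated])

def closure():
    optimizer.zero_grad()
    feats = get_features(generated, {**content_layers, **style_layers})
    c_loss = content_loss(feats['conv4_2'], content_feats['conv4_2'])
    s_loss = sum(style_loss(feats[name], style_feats[name])
                 for name in style_layers.values())
    total = alpha * c_loss + beta * s_loss
    total.backward()
    return total

for _ in range(20):                # number of outer L-BFGS steps is an assumption
    optimizer.step(closure)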

 

MORE RESULTS

Finally, I tested the algorithm on more images (some of my own, and some from Google). Here are the results:

 

[Figures: additional style transfer results]

 

PART 2: LIGHTFIELD CAMERA

For my second project, I implemented depth refocusing and aperture adjustment using images from the Stanford Light Field Archive. Capturing multiple images over a plane orthogonal to the optical axis allows us to achieve effects such as refocusing and aperture adjustment through simple shifting and averaging. In this project, I produce these effects using lightfield data.

 

DEPTH REFOCUSING

Objects far from the camera do not significantly change their position when the camera moves within a plane perpendicular to the optical axis, while objects closer to the camera shift significantly between views. As a result, using the jellybean dataset from the Stanford Light Field Archive, simply averaging all of the images produces an image that is sharp for the far-away jellybeans and blurry for the nearby ones, since the nearby ones move the most.

[Figure: unshifted average of the jellybean lightfield images]
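For reference, this unshifted average is just a stack-and-mean over all of the sub-aperture images; a sketch (the file path and image library are illustrative):

import glob
import numpy as np
from skimage import io

files = sorted(glob.glob('jellybeans/*.png'))   # illustrative path to the rectified images
stack = np.stack([io.imread(f).astype(np.float64) for f in files])
average = stack.mean(axis=0)                    # sharp far away, blurry up close
io.imsave('average.png', average.astype(np.uint8))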

 

 

Extending this idea, we can generate multiple images that focus at different depths. This is achieved by shifting each image toward the center image by an amount proportional to its offset on the camera grid, scaled by a weight w; varying w moves the focal plane.

[Figures: refocused results for w = 1, 2, 3, and 4]

Here is an animation of the depth refocusing:

[Animation: depth refocusing sweeping through focal depths as w varies]
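A rough sketch of the shift-and-average step behind these results (the refocus helper, its sign convention, and the interpolation settings are illustrative assumptions; the (u, v) positions come from the dataset's camera grid):

import numpy as np
from scipy.ndimage import shift as nd_shift

def refocus(images, positions, w):
    # images:    list of H x W x 3 arrays (sub-aperture views)
    # positions: (u, v) camera-grid coordinates for each view
    # w:         focus weight; w = 0 reproduces the plain unshifted average
    centre = np.mean(positions, axis=0)
    acc = np.zeros_like(images[0], dtype=np.float64)
    for img, (u, v) in zip(images, positions):
        du, dv = w * (centre[0] - u), w * (centre[1] - v)
        acc += nd_shift(img, (dv, du, 0), order=1, mode='nearest')  # leave channels unshifted
    return acc / len(images)

Sweeping w over a range and saving each output is what produces the animation above.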

 

APERTURE ADJUSTMENT

If we compare the result of averaging a large number of images sampled over the grid perpendicular to the optical axis with the result of averaging only a few, the output using more images resembles a photograph taken with a larger aperture, while the other looks like it was taken with a smaller aperture. This is because a larger aperture produces a shallower depth of field. To show this effect, I generated images corresponding to different apertures by averaging only the images within some radius of the grid center, incrementally increasing that radius. This works because expanding the radius means integrating light from a wider range of viewpoints. A sketch of this step is below, followed by the result of my aperture adjustment:
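This select-and-average step can be sketched as follows (illustrative; the radius is measured in the same units as the grid positions):

import numpy as np

def adjust_aperture(images, positions, radius):
    # Average only the views whose grid position lies within the given
    # radius of the centre; a larger radius mimics a larger aperture.
    centre = np.mean(positions, axis=0)
    chosen = [img for img, p in zip(images, positions)
              if np.linalg.norm(np.asarray(p) - centre) <= radius]
    return np.mean(chosen, axis=0)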

 

 

 

BELLS AND WHISTLES: USING MY OWN IMAGES

I attempted to apply the same depth refocusing and aperture adjustment algorithms to images I took of keychains. I captured these images by trying to move the camera in regular steps over a grid perpendicular to the optical axis. Unfortunately, the motion was neither constant nor consistent, so the results are not as good as those produced using the Stanford Light Field data.

depth refocused

[Figure: depth-refocused result on my keychain images]

 

aperture adjusted

[Figure: aperture-adjusted result on my keychain images]

 

 

CONCLUSION

In these final projects, I learned several new principles of computer vision. From learning more about the capabilities of convolutional neural networks to seeing the power of the lightfield camera and lightfield images, these two projects made me realize there is so much left to learn about computational photography and computer vision. This is definitely my favorite course I have taken, and I really hope to continue working on computer vision projects and keep learning about the space! It is incredibly satisfying to have a tangible, visual representation of the power of your code.