neural style transfer & lightfield camera
Nadia Hyder
PART 1: NEURAL STYLE TRANSFER
For my first project, I reimplemented the neural style transfer algorithm described in "A Neural Algorithm of Artistic Style." The algorithm takes two inputs, a style image and a content image, and uses a deep convolutional neural network to output the content image rendered in the artistic style of the style image.
METHODOLOGY
I first tested my algorithm on two input
images from the paper: Femme nue assise by
Pablo Picasso as the style image, and Neckarfront by Andreas Praefcke
as the content image. The image on the right is the output with the style added
to the content image.
[Figures: Femme nue assise (style) | Neckarfront (content) | stylized output]
I then used a convolutional neural network
to create style and content reconstructions while minimizing both content and
style loss. I will define these more clearly in the sections to come.
CONVOLUTIONAL NEURAL NETWORK
I used the features module of PyTorch's pre-trained VGG-19 to extract the feature maps at each convolutional layer. The network has the following architecture:
Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU(inplace=True)
  (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (6): ReLU(inplace=True)
  (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): ReLU(inplace=True)
  (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (13): ReLU(inplace=True)
  (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): ReLU(inplace=True)
  (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (17): ReLU(inplace=True)
  (18): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (19): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (20): ReLU(inplace=True)
  (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (22): ReLU(inplace=True)
  (23): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (24): ReLU(inplace=True)
  (25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (26): ReLU(inplace=True)
  (27): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (29): ReLU(inplace=True)
  (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (31): ReLU(inplace=True)
  (32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (33): ReLU(inplace=True)
  (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (35): ReLU(inplace=True)
  (36): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
I normalized the image tensors by μ = [0.485, 0.456, 0.406] and σ = [0.229, 0.224, 0.225], as required by VGG networks.
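As a reference, here is a minimal sketch of how the feature extractor and normalization can be set up with torchvision; the variable names are my own:

import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Convolutional feature extractor of pre-trained VGG-19; we never train it.
vgg = models.vgg19(pretrained=True).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# ImageNet statistics expected by VGG networks.
mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)

def normalize(img):
    # img: (1, 3, H, W) tensor with values in [0, 1]
    return (img - mean) / std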
CONTENT AND STYLE LOSS
As the paper suggests, I used two loss functions to analyze the network's performance: content loss and style loss. Content loss is essentially the squared error between the feature map of the content image and the feature map of the generated image at a chosen content layer of the network. Content loss is calculated as follows, where $F^l_{ij}$ is the activation of the $i$-th filter at position $j$ in layer $l$, $\vec{p}$ and $\vec{x}$ are the original image and the image that is generated, and $P^l$ and $F^l$ are their respective feature representations in layer $l$:

$$L_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left(F^l_{ij} - P^l_{ij}\right)^2$$
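A minimal PyTorch sketch of this loss, under my own naming conventions:

import torch

def content_loss(gen_feat, content_feat):
    # 1/2 * sum of squared differences between F^l and P^l
    return 0.5 * torch.sum((gen_feat - content_feat) ** 2)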
To compute style loss, I created a Gram matrix that contains the feature correlations:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

We then try to minimize the mean-squared distance between the Gram matrix for the original image and that of the image to be generated. Here $N_l$ is the number of distinct filters in layer $l$ and $M_l$ is the size of the feature map (height times width). The style loss is calculated using the following formulas, where $\vec{a}$ and $\vec{x}$ are the original style image and the image that is generated, and $A^l$ and $G^l$ are their respective style representations in layer $l$:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2 \qquad L_{style}(\vec{a}, \vec{x}) = \sum_l w_l E_l$$
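A sketch of the Gram matrix and the per-layer style loss, again with my own naming:

import torch

def gram_matrix(feat):
    # feat: (1, C, H, W). Rows of f are the N_l filter responses and
    # columns the M_l = H*W spatial positions, so G = f f^T gives G^l_ij.
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    return f @ f.t()

def style_layer_loss(gen_feat, style_feat):
    _, c, h, w = gen_feat.shape
    n_l, m_l = c, h * w
    G, A = gram_matrix(gen_feat), gram_matrix(style_feat)
    # E_l = 1 / (4 N_l^2 M_l^2) * sum_ij (G_ij - A_ij)^2
    return torch.sum((G - A) ** 2) / (4 * n_l ** 2 * m_l ** 2)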
The total loss is calculated using hyperparameters α and β, which weight the content and style terms:

$$L_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha L_{content}(\vec{p}, \vec{x}) + \beta L_{style}(\vec{a}, \vec{x})$$
Using the images above, I got the following content and style loss:

[Figure: content and style loss]
NETWORK PARAMETERS
For the above images, I chose
hyperparameters α = and β = 1. I used L-BFGS to optimize.
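The optimization loop then looks roughly like the following sketch; get_features, the layer choices, and the target dictionaries are hypothetical stand-ins for my actual bookkeeping:

# Optimize the pixels of the generated image, starting from the content image.
gen_img = content_img.clone().requires_grad_(True)
optimizer = torch.optim.LBFGS([gen_img])

def closure():
    optimizer.zero_grad()
    feats = get_features(vgg, normalize(gen_img))  # hypothetical helper
    c_loss = sum(content_loss(feats[l], content_targets[l])
                 for l in content_layers)
    s_loss = sum(w_l * style_layer_loss(feats[l], style_targets[l])
                 for l, w_l in style_layer_weights.items())
    loss = alpha * c_loss + beta * s_loss  # L_total = α·L_content + β·L_style
    loss.backward()
    return loss

for _ in range(num_steps):
    optimizer.step(closure)  # L-BFGS requires the closure at each step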
MORE RESULTS
Finally, I tested the algorithm on more images (some of my own, and some from Google). Here are the results:
[Figures: additional style transfer results]
PART 2: LIGHTFIELD CAMERA
For my second project, I implemented depth refocusing and aperture adjustment using images from the Stanford Light Field Archive. Capturing multiple images over a plane orthogonal to the optical axis allows us to achieve effects such as refocusing and aperture adjustment with simple shifting and averaging. In this project, I produce these effects using lightfield data.
DEPTH REFOCUSING
Objects far from the camera do not significantly change their position when the camera moves over the grid while the optical axis stays unchanged, whereas objects closer to the camera vary their positions significantly. Consequently, using the jellybean dataset from the Stanford Light Field Archive, averaging the images as-is produces an image that is sharp for the faraway jellybeans and blurry for the nearby ones, since the nearby beans move the most between views.
Extending this idea, we can generate multiple images that focus at different depths. This is achieved by shifting each image relative to the center image in proportion to its offset in the camera grid, scaled by a weight w; varying w changes the depth that is in focus.
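A minimal NumPy/SciPy sketch of this shift-and-average step, assuming each sub-aperture image comes with its (u, v) grid coordinates; the interface is my own, and the axis convention for the shift may differ by dataset:

import numpy as np
from scipy.ndimage import shift as nd_shift

def refocus(images, uv, w):
    # images: list of (H, W, 3) float arrays in [0, 1];
    # uv: matching list of (u, v) grid coordinates.
    center = np.mean(uv, axis=0)
    acc = np.zeros_like(images[0], dtype=np.float64)
    for img, (u, v) in zip(images, uv):
        du, dv = w * (center[0] - u), w * (center[1] - v)
        # Shift each view toward the center view before averaging.
        acc += nd_shift(img, shift=(dv, du, 0), order=1)
    return acc / len(images)

With w = 0 this reduces to the plain average discussed above; sweeping w moves the plane of focus, producing the results below.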
[Figures: refocused results for shift weights w = 1, 2, 3, 4]
Here is an animation of the depth refocusing:

[Animation: depth refocusing]
APERTURE ADJUSTMENT
If we compare the result of averaging a large number of images sampled over the grid perpendicular to the optical axis against the result of averaging fewer of them, the output using more images resembles a photograph taken with a larger aperture, while the other looks like it was taken with a smaller aperture. This is because a larger aperture produces a shallower depth of field. To show this effect, I generated images corresponding to different apertures by averaging only the images within some radius of the grid center, a radius that I incrementally increase. This works because expanding the radius integrates light from a larger region of the sampling plane, just as a wider aperture would. The following is the result of my aperture adjustment:

[Figures: aperture adjustment results]
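For reference, a sketch of this selection-and-average step, reusing the hypothetical refocus helper from the depth refocusing section:

import numpy as np

def adjust_aperture(images, uv, radius, w=0.0):
    # Keep only the views within `radius` of the grid center; a larger
    # radius integrates more views, mimicking a larger aperture.
    center = np.mean(uv, axis=0)
    kept = [(img, pos) for img, pos in zip(images, uv)
            if np.linalg.norm(np.asarray(pos) - center) <= radius]
    imgs, poss = zip(*kept)
    return refocus(list(imgs), list(poss), w)

Sweeping the radius upward from zero then produces the sequence of increasingly wide synthetic apertures shown above.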
BELLS AND WHISTLES: USING MY OWN IMAGES
I attempted to apply the same depth refocusing and aperture adjustment algorithms to images I took of keychains. I took these images by attempting to move only vertically, capturing the images over the grid perpendicular to the optical axis. Unfortunately, the motion was neither constant nor consistent, so the results are not as good as those produced using the Stanford Light Field data.
[Figures: depth refocused result | aperture adjusted result]
CONCLUSION
In these final projects, I learned several new principles of computer vision. From learning more about the capabilities of convolutional neural networks to discovering the power of the lightfield camera and lightfield images, these two projects made me realize there is so much left to learn about computational photography and computer vision. This is definitely my favorite course I have ever taken, and I really hope to continue working on computer vision projects and learning about the space! It is incredibly satisfying to have a tangible, visual representation of the power of your code.