CS 194-26

Image Manipulation and Computational Photography

Final Project: Interactive Style Transfer

Hemang Jangle, cs194-26-acv

Richard Zhang, Shiry Ginosar, Jun-Yan Zhu, Alexei Efros


Overview

Neural style transfer is a technique that lets us create beautiful art by transferring the style of paintings by famous artists onto our own photographs and images. However, current techniques do not let a user interactively and spatially control the output when multiple styles are involved. We attempt to solve this problem.

Introduction and Related Work

The technique of neural style transfer is based on [1], which develops a method that takes two images, a 'content' image and a 'style' image, and produces an output image that combines the stylistic properties of the style image with the natural features of the content image. To do this, they define a perceptual loss that describes the similarity of the two images in both a content sense and a stylistic sense. Content similarity is evaluated by comparing the feature responses of the two images in a pretrained convolutional neural network, typically the VGG-19 network [2]. Stylistic similarity is evaluated by comparing the Gram matrices of those feature responses, which capture second-order statistics of the image. They then produce an output image that minimizes the combined content and style loss.
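To make these losses concrete, below is a minimal sketch of the Gram matrix and of the content and style terms, assuming features are extracted from a pretrained torchvision VGG (we show VGG-16, the loss network used later in this report). The layer indices and normalization are illustrative choices, not the exact configuration of [1].

```python
import torch
import torch.nn.functional as nnf
import torchvision

# Pretrained loss network (newer torchvision versions use the `weights=` argument).
_vgg = torchvision.models.vgg16(pretrained=True).features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

# Indices of relu1_2, relu2_2, relu3_3, relu4_3 in vgg16().features.
LAYERS = [3, 8, 15, 22]

def vgg_features(x):
    # Feature responses of the loss network at the chosen layers.
    feats, h = [], x
    for i, layer in enumerate(_vgg):
        h = layer(h)
        if i in LAYERS:
            feats.append(h)
    return feats

def gram_matrix(feat):
    # Second-order statistics of a (B, C, H, W) feature response:
    # channel-by-channel inner products, normalized by the number of elements.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def content_loss(feat_out, feat_content):
    # Content similarity: squared error between feature responses.
    return nnf.mse_loss(feat_out, feat_content)

def style_loss(feats_out, feats_style):
    # Stylistic similarity: squared error between Gram matrices, summed over layers.
    return sum(nnf.mse_loss(gram_matrix(fo), gram_matrix(fs))
               for fo, fs in zip(feats_out, feats_style))
```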

The output image $G$ is then computed through an optimization process that minimizes this loss. [3] builds on this result by introducing ways to control the stylization. In particular, they introduce a method to spatially control the transfer over different regions of the output image and to use multiple style images per content image. To achieve this, the user provides (style, mask) pairs in addition to the content image. The algorithm then creates an individual style loss for each style and evaluates it only on the masked portion of the feature responses. This technique allows full and precise spatial control, but it also requires the user to provide a complete segmentation as input.

[4] introduces a faster technique for the style transfer process. Instead of performing an iterative optimization for each (style, content) pair, they train a feedforward neural network $F(c; s)$, parametrized by a style $s$, that maps a content image $c$ to $G$, the version of $c$ stylized by $s$. This enables real-time creation of stylized images.
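As a rough sketch of how such a feedforward network is trained, one optimization step might look like the following. Here `transform_net` is a deliberately tiny stand-in for the image transformation network of [4] (the real architecture uses residual blocks and up/downsampling), and the code reuses the `vgg_features`, `content_loss`, and `style_loss` helpers sketched above; loss weights are placeholders and ImageNet input normalization is omitted for brevity.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the image transformation network of [4] (illustration only).
transform_net = nn.Sequential(
    nn.Conv2d(3, 32, 9, padding=4), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 9, padding=4), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(transform_net.parameters(), lr=1e-3)

def train_step(c, style_feats, content_w=1.0, style_w=5.0):
    # One training step of F(c; s): the generator weights, not the image pixels,
    # are updated to minimize the perceptual loss, so inference is a single
    # forward pass.
    g = transform_net(c)                               # G = F(c; s)
    feats_g, feats_c = vgg_features(g), vgg_features(c)
    loss = (content_w * content_loss(feats_g[2], feats_c[2])
            + style_w * style_loss(feats_g, style_feats))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```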

Our method combines these two recent works: we train a feedforward neural network to stylize a content image with multiple styles, given segmentation guidance for each style. However, we also want real-time performance, so instead of requiring the user to specify complete masks as input, we let the user place points to provide semantic guidance. This gives the user immediate feedback and allows them to iteratively refine the stylized output.

Method

We train a feedforward style network $F(c, r_1, r_2; s_1, s_2)$ that takes in a content image $c$ and two sets of points on the image, $r_1$ and $r_2$, which provide spatial guidance for styles $s_1$ and $s_2$ (for an example of $r_1$ and $r_2$, refer to the overview picture). We train the network with the same perceptual loss as above, except that the style term is modified so that each style is matched only on its masked region of the output, i.e. (up to normalization)

$$\mathcal{L}_{\text{style}} = \sum_{i} \sum_{l} w_l \left\lVert G_l\big(S_l(T_i) \odot \Phi_l(\hat{y})\big) - G_l\big(\Phi_l(s_i)\big) \right\rVert_F^2,$$

where $\hat{y} = F(c, r_1, r_2; s_1, s_2)$ is the network output, $\Phi_l(\cdot)$ denotes the feature responses at layer $l$ of the loss network, $T_i$ is the mask of the $i$th style, $G_l(\cdot)$ is the Gram matrix of those feature responses, $\odot$ multiplies the mask into every feature channel, $w_l$ is a per-layer weight, and $S_l$ is the corresponding downsampling operation that resizes the mask to the spatial size of the feature tensor at layer $l$. We use a VGG-16 network as our loss network.
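As an illustration, this guided style term can be implemented by downsampling each mask to a layer's spatial resolution and masking the output features before the Gram computation. The sketch below reuses `gram_matrix` and `vgg_features` from above; it is one plausible implementation of the loss, not our exact training code (in particular, per-region normalization is omitted).

```python
import torch
import torch.nn.functional as nnf

def guided_style_loss(output_img, style_imgs, masks, weights=None):
    # output_img: network output y_hat = F(c, r1, r2; s1, s2), shape (B, 3, H, W)
    # style_imgs: [s_1, s_2, ...] style images
    # masks:      [T_1, T_2, ...] guidance masks, each (B, 1, H, W) in [0, 1]
    feats_out = vgg_features(output_img)
    loss = 0.0
    for T_i, s_i in zip(masks, style_imgs):
        feats_s = vgg_features(s_i)
        for l, (f_out, f_s) in enumerate(zip(feats_out, feats_s)):
            # S_l: downsample the mask to the spatial size of layer l.
            T_l = nnf.interpolate(T_i, size=f_out.shape[-2:], mode='bilinear',
                                  align_corners=False)
            w_l = 1.0 if weights is None else weights[l]
            # Gram matrix of the masked output features vs. the style's Gram matrix.
            loss = loss + w_l * nnf.mse_loss(gram_matrix(f_out * T_l),
                                             gram_matrix(f_s))
    return loss
```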

More specifically, we train a ResNet-based network on the Portraits dataset [5], which provides ground truth foreground and background segmentations. We also attempt to train the network on the Cityscapes dataset [6].

We initially train a network that takes in the full masks $\{T_i\}$ instead of the sampled masks $\{r_i\}$. This network learns the style transfer function and performs the guided transfer well. However, when we train a network to perform the stylization from the sampled masks $\{r_i\}$ alone, it fails to learn the segmentation properly. To fix this issue we try a number of techniques, including: swapping which style is assigned to foreground and background during training; fine-tuning the ground-truth-mask model on samples; stochastically providing the network $F$ with either samples or full masks $\{T_i\}$ during training; changing the network architecture (U-Net [7]); drawing the point samples from different distributions (e.g. geometric); distance transforming, Gaussian blurring, and otherwise transforming the sampled masks (see the sketch below); and penalizing the wrong style in each region of the mask. We also attempt to train separate segmentation and stylization models to make the task easier for the network to solve. In the end, fine-tuning a joint segmentation-stylization model gives the best results, though at the expense of true artistic quality. We detail the more effective of these techniques in the Experiments section.
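Below is one plausible implementation of the mask transforms mentioned above, turning a handful of user clicks into a soft guidance map via a distance transform followed by a Gaussian blur. The decay constant and blur width are illustrative, not the exact values used in our experiments.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, gaussian_filter

def transform_sampled_mask(points, shape, sigma=5.0):
    # points: list of (row, col) user clicks for one style region
    # shape:  (H, W) of the content image
    # Returns a soft guidance map in [0, 1]: 1 at the clicked points,
    # decaying with distance, then smoothed with a Gaussian.
    r = np.zeros(shape, dtype=np.float32)
    for (i, j) in points:
        r[i, j] = 1.0
    # Distance from every pixel to the nearest clicked point.
    dist = distance_transform_edt(r == 0)
    # Convert distance to a soft "closeness" map and smooth it.
    soft = np.exp(-dist / (0.1 * max(shape)))
    return gaussian_filter(soft, sigma=sigma)
```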

Experiments

We show five comparisons in this section, a subset of the techniques mentioned above. As a baseline, we compare against instance normalization stylization networks [8], which are unguided feedforward stylization networks; note that all of the networks we use employ instance normalization. We use these baseline networks to stylize the different regions of the content image separately, and then alpha blend the results appropriately to produce the final image. Next, we show the upper bound for an interactive system that uses sampled masks as input: the output of a network that receives the ground truth segmentation. After that, we show a network trained purely by randomly sampling the ground truth segmentation. This method produces splotchy outputs, showing that the network learns only to associate the style with the sampled points themselves and tends to default to the other style for the rest of the image. Subsequently, we show that distance transforms, Gaussian blurring, and other transforms on the sampled masks make it somewhat easier for the network to learn the segmentation. Finally, we train a joint segmentation-stylization model, which performs the best of these approaches but still does not maintain satisfactory artistic quality. We generally compare quality on validation images from our dataset.

We evaluate the quality of our algorithm on the following style images:
The Scream, by Edvard Munch
The Great Wave, by Hokusai

Feedforward Networks Composited

This is the result of stylizing the content image twice with different style images and alpha blending the results. It serves as a reasonable upper bound on quality for our technique with these styles.
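For reference, the compositing step of this baseline is a simple per-pixel alpha blend. A minimal sketch, assuming the two stylized outputs and an alpha matte are already available (names are illustrative):

```python
import numpy as np

def composite(stylized_fg, stylized_bg, mask):
    # stylized_fg, stylized_bg: (H, W, 3) outputs of two separately run
    # unguided feedforward stylization networks on the same content image.
    # mask: (H, W) alpha matte in [0, 1]; 1 where the foreground style applies.
    alpha = mask[..., None]                     # broadcast over color channels
    return alpha * stylized_fg + (1.0 - alpha) * stylized_bg
```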

Feedforward Network Trained With Ground Truth Segmentation

This is the result of using a single network $F(c, T_1, T_2; s_1, s_2)$ that receives the complete ground truth masks as input.

While both this result and the one above leave much to be desired in terms of visual quality, the results in the second figure seem to match, and even slightly improve on, those in the first, particularly by avoiding edge artifacts.

Feedforward Network with Sampled Masks

This is the result of using a single network $F(c, r_1, r_2; s_1, s_2)$ that receives the sampled masks as input. The wave style is supposed to appear on the foreground, but the network fails to learn this behavior.

Feedforward Network with Transformed Masks

By transforming the input sampled masks, we can learn a better segmentation, but the outputs still suffer from splotchiness.
This is the result of using a single network $F(c, r_1, r_2; s_1, s_2)$ that receives the sampled masks as input; we distance transform and Gaussian blur the input masks before feeding them to the network. We also train the network with an auxiliary segmentation loss to help it learn the underlying segmentation. Its predicted segmentations, along with the input masks, are displayed next to each image.
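One plausible form of the auxiliary segmentation loss is a per-pixel binary cross-entropy between a mask predicted by the network and the ground truth mask, added to the perceptual loss with a small weight; the names and weighting below are assumptions for illustration.

```python
import torch
import torch.nn.functional as nnf

def auxiliary_segmentation_loss(pred_mask_logits, gt_mask):
    # pred_mask_logits: (B, 1, H, W) segmentation prediction of the style network
    # gt_mask:          (B, 1, H, W) ground truth foreground mask in {0, 1}
    return nnf.binary_cross_entropy_with_logits(pred_mask_logits, gt_mask)

# total_loss = content_term + style_term + seg_weight * auxiliary_segmentation_loss(...)
```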

Joint Segmentation-Stylization Model

We get the best performance by training two separate networks and then jointly fine-tuning them. The first is an interactive segmentation network that takes in the image and the two sampled masks and is trained to produce a segmentation mask. The second is the stylization network trained with ground truth masks, described above.
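At inference time the two networks are simply chained. Below is a sketch of the pipeline, assuming the segmentation network takes the image and both point maps as concatenated channels and the stylization network takes the image and both masks; the exact input format is an assumption.

```python
import torch

def stylize_interactive(seg_net, style_net, c, r1, r2):
    # seg_net:   interactive segmentation network, (c, r_1, r_2) -> foreground mask logits
    # style_net: stylization network trained with ground truth masks (as above)
    # After separate training, the whole pipeline is fine-tuned end to end with
    # the same guided perceptual loss.
    T1 = torch.sigmoid(seg_net(torch.cat([c, r1, r2], dim=1)))
    T2 = 1.0 - T1
    return style_net(torch.cat([c, T1, T2], dim=1))
```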

This is the result of training two separate networks: one interactive segmentation network and one stylization network.

Conclusions

We have found throughout this process that it is quite challenging to teach a feedforward neural network to perform stylization and interactive segmentation at the same time. By assisting the network with various tricks that make the internal segmentation of an image easier to learn, we can improve how well the stylization respects the semantics of the image. However, the end results still leave much to be desired in terms of quality, and further work must be done to make guided feedforward techniques more visually appealing.

References

1. Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. CoRR abs/1508.06576 (2015)
2. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
3. Gatys, L.A., Ecker, A.S., Bethge, M., Hertzmann, A., Shechtman, E.: Controlling perceptual factors in neural style transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (Jul 2017)
4. Johnson, J., Alahi, A., Li, F.: Perceptual losses for real-time style transfer and super-resolution. CoRR abs/1603.08155 (2016)
5. Shen, X., Hertzmann, A., Jia, J., Paris, S., Price, B., Shechtman, E., Sachs, I.: Automatic portrait segmentation for image stylization. Computer Graphics Forum 35(2) (2016) 93–102
6. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. CoRR abs/1604.01685 (2016)
7. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. CoRR abs/1505.04597 (2015)
8. Ulyanov, D., Vedaldi, A., Lempitsky, V.S.: Instance normalization: The missing ingredient for fast stylization. CoRR abs/1607.08022 (2016)