Creative Adversarial Networks

Phillip Kuznetsov

CS194-26 Final Project

cs194-26-aea

A gif of images sampled from the generator

Abstract

In this project, I reimplement the paper "CAN: Creative Adversarial Networks". The basic premise of the paper arises from Colin Martindale's hypothesis that "creative artists try to increase the arousal potential of their art to push against habituation. However, this increase has to be minimal to avoid negative reaction by the observers".

The original authors capture this objective by adding two auxiliary loss functions to the traditional GAN formulation. I implemented this formulation by adapting DCGAN, while also adding a few modifications to ensure that training proceeded successfully. The github repo can be found here: https://github.com/mlberkeley/Creative-Adversarial-Networks.

Paper Details

Creative Adversarial Networks is a paper that came out earlier this year that claims to create completely novel pieces of art by learning from a dataset of art and attempting to deviate from the styles it contains. The authors achieve this by adding two components to the traditional GAN objective (eq. 1). The first adds a style classification objective to the discriminator: the discriminator learns to classify real pieces of art into their respective styles. The second adds a "style ambiguity" term to the generator: the generator attempts to create images that fall entirely outside the class distribution, effectively being trained against a target label that maximizes the entropy over the styles. In other words, the generator tries to generate images that cause the discriminator's style classifier to assign equal probability to all possible styles (see eq. 2).

$$\newcommand{\E}{\mathbb{E}}$$
\begin{equation} \label{eq:gan_loss} \begin{split} \min_G\max_D V(D,G) =& \E_{x \sim p_{data}} {[ \log D(x)]} + \\ &\E_{z \sim p_z} {[ \log (1 - D(G(z))) ]} \end{split} \end{equation}
\begin{equation} \label{eq:can_loss} \begin{split} \min_G\max_D V(D,G) &= \E_{x, \hat{c} \sim p_{data}} [ \log D_r(x) + \log D_c(c = \hat{c} \mid x)] \\ &+ \E_{z \sim p_z} \Big[ \log (1 - D_r(G(z))) \\ &\quad - \sum_{k=1}^K \Big( \frac 1K \log D_c(c_k \mid G(z)) + \Big(1-\frac 1K\Big) \log\big(1-D_c(c_k \mid G(z))\big) \Big) \Big] \end{split} \end{equation}

The original paper associates this design decision with Colin Martindale's hypothesis on the origin of creativity: "He hypothesized that at any point in time, creative artists try to increase the arousal potential of their art to push against habituation. However, this increase has to be minimal to avoid negative reaction by the observers".
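To make eq. 2 concrete, the sketch below shows how the two extra terms can be expressed in TensorFlow. This is a minimal illustration rather than the exact code in the repo: the tensor and function names are assumptions, and the style-ambiguity term is written as softmax cross-entropy against a uniform target over the $K$ styles, which captures the same intent as the expanded multi-label sum in eq. 2.

```python
import tensorflow as tf

def can_losses(d_real_logits, d_fake_logits, c_real_logits, c_fake_logits,
               style_labels, K):
    """Illustrative CAN losses; all argument names are assumptions."""
    # Standard GAN real/fake terms.
    d_loss_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        logits=d_real_logits, labels=tf.ones_like(d_real_logits)))
    d_loss_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        logits=d_fake_logits, labels=tf.zeros_like(d_fake_logits)))
    g_loss_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        logits=d_fake_logits, labels=tf.ones_like(d_fake_logits)))

    # First extra term: the discriminator classifies real art into its style.
    d_loss_class = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        logits=c_real_logits, labels=tf.one_hot(style_labels, K)))

    # Second extra term: push the style classifier's output on generated
    # images toward the uniform distribution over the K styles.
    uniform = tf.fill(tf.shape(c_fake_logits), 1.0 / K)
    g_loss_class = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        logits=c_fake_logits, labels=uniform))

    d_loss = d_loss_real + d_loss_fake + d_loss_class
    g_loss = g_loss_fake + g_loss_class
    return d_loss, g_loss
```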

Implementation

This project was coded in Python using the TensorFlow deep learning library. Experiments were run on an NVIDIA GTX 1080 GPU, which has 8 GB of VRAM. To monitor progress, we relied heavily on TensorBoard, using the image and scalar summary features the most to check whether the model had converged and whether it was producing somewhat art-like images. You can find the repo here: https://github.com/mlberkeley/Creative-Adversarial-Networks.
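The kind of TensorBoard hooks we relied on look roughly like the snippet below; the placeholder shapes and names are stand-ins for the real loss tensors and sample batch in our graph, not the exact code in the repo.

```python
import tensorflow as tf

# Stand-ins for the real tensors produced by the model graph.
g_loss = tf.placeholder(tf.float32, name='g_loss')
d_loss = tf.placeholder(tf.float32, name='d_loss')
samples = tf.placeholder(tf.float32, [None, 256, 256, 3], name='samples')

# Scalar summaries track the losses; image summaries show generated samples.
tf.summary.scalar('g_loss', g_loss)
tf.summary.scalar('d_loss', d_loss)
tf.summary.image('generated_samples', samples, max_outputs=8)
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('./logs')
# In the training loop: writer.add_summary(sess.run(merged, feed_dict=...), step)
```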

Data

The data used for this project was scraped from WikiArt, a website that hosts a large database of freely available artwork. I happened to find an already-scraped version of the dataset on GitHub, which made it simple to repurpose for this project. The dataset contains 80,000 images. A few of the sampled images and their styles are shown below.
Sample Images from wikiart (a) Cubism style artwork (b) Color Field Painting style artwork (c) Baroque style artwork

Architecture

The architecture used in the project was based on DCGAN. The particular implementation we built on was carpedm20's DCGAN-tensorflow repo.
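As a rough illustration, the discriminator in this setup is a DCGAN-style convolutional network with two output heads: a single real/fake logit and a $K$-way style classifier. The sketch below is illustrative only; the filter counts, the omission of batch normalization, and the layer names do not match the carpedm20/DCGAN-tensorflow code exactly.

```python
import tensorflow as tf

def discriminator(x, K, reuse=False):
    """Sketch of a DCGAN-style discriminator with real/fake and style heads."""
    with tf.variable_scope('discriminator', reuse=reuse):
        h = x
        # Strided convolutions downsample the image; depths are illustrative.
        for depth in [64, 128, 256, 512]:
            h = tf.layers.conv2d(h, depth, 5, strides=2, padding='same')
            h = tf.nn.leaky_relu(h, alpha=0.2)
        h = tf.layers.flatten(h)
        rf_logit = tf.layers.dense(h, 1)      # real vs. fake head
        style_logits = tf.layers.dense(h, K)  # style classification head
    return rf_logit, style_logits
```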

Deviation from Paper

Most of what's listed above remains in line with the original paper. However, after initially seeing no success, we deviated from the paper in a few ways. In line with GAN hacks, we (rough sketches of these tricks follow the list):
  1. Sampled the input $z$ vector on the hypersphere
  2. Utilized label smoothing to reduce the "hardness" that's associated with one-hot encoding of labels for the discriminator
  3. Added a replay buffer of generated samples that is periodically fed back to the discriminator
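The sketch below gives illustrative versions of these three tricks. The helper names, buffer size, and smoothing value are assumptions for exposition, not the exact code in our repo.

```python
import numpy as np

def sample_z(batch_size, z_dim):
    """Draw z from a Gaussian and project it onto the unit hypersphere."""
    z = np.random.normal(size=(batch_size, z_dim))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def smooth_labels(one_hot, smoothing=0.1):
    """Soften one-hot style labels so the discriminator never sees hard 0/1 targets."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - smoothing) + smoothing / k

class ReplayBuffer(object):
    """Keep a pool of past generated samples and mix them into later discriminator batches."""
    def __init__(self, capacity=500):
        self.capacity = capacity
        self.pool = []

    def push_and_sample(self, fake_batch, mix_prob=0.5):
        out = []
        for img in fake_batch:
            if len(self.pool) < self.capacity:
                self.pool.append(img)
                out.append(img)
            elif np.random.rand() < mix_prob:
                # Return an old sample and store the new one in its place.
                idx = np.random.randint(len(self.pool))
                out.append(self.pool[idx])
                self.pool[idx] = img
            else:
                out.append(img)
        return np.array(out)
```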
On top of these changes, we were forced to run with a batch size of 15 rather than the batch size of 128 reported in the paper. Even when I experimented with a 16 GB card on AWS, I was only able to fit a batch size of 35 images in VRAM before running out of memory. I am sure that this has an effect on the results, possibly causing the high variance in samples generated over training time. The authors did not release information about the hardware setup of their experiments; my best guess is that they either distributed training across multiple GPUs or reported the batch size used for a smaller-resolution output. Additionally, I have no idea how similar their dataset was to ours. We can assume there is sizable overlap, but it's possible that they selected a different subset or used different class labels, which also means we don't know whether our epochs were the same size. To top it off, we were only able to run training for 25 epochs. As mentioned in the following section, training the 256x256 model already took 5 days; training for the 100 epochs suggested in the paper would take 20 days, which I did not have time for in the course of this project.

Experimentation Process

MNIST Baseline

The experimentation/debugging process was fairly simple. First we trained a network to generate MNIST digits using the CAN objective. Although the traditional GAN formulation was successful at generating handwritten digits, we found that the CAN formulation did not produce satisfactory results. Instead, the CAN MNIST generator produced blob-like approximations of digits. Looking at the results in the original paper, the CAN artwork also appears to lack significant structure, which offers a possible explanation for why this algorithm failed to create novel MNIST-like digits.

Scaled Down Baselines

Building on this, we decided to tackle the WikiArt problem directly. Our first attempt failed miserably: we tried to generate 256x256 images directly and did not get decent results after several days. To fix this, we started over at a lower resolution, 64x64. The benefits were twofold: time to convergence was much faster, which let us debug more quickly, and we got a scaled-down preview of what the losses would look like as the model learned. Training at this scale took only half a day, compared to 5 days for the 256-pixel-resolution models. Once we started getting "creative"-looking results at this scale, we reran the model at 128x128, reaching the size of images actually generated by the model in the paper. These models took around 2 days to train to convergence.
Sample outputs of 128x128 resolution CAN.

Full Scale

Once the model successfully produced something that looked art-like, we went for full-scale training. At this point we could only fit a batch size of 15 in the GPU's VRAM. Training to 25 epochs took roughly 5 days from start to completion.

Results

During training, the model saved sample images at a fixed interval of training steps. I've cherry-picked the best ones to display here.
256x256 images generated throughout the training process of a network trained to 25 epochs. (a), (b), (c), and (d) are from epoch 11. (e) is from epoch 13. (f) is from epoch 18.
However, the final results after 25 epochs (pictured above) were highly unsatisfactory. I was unable to get the nice distribution of clear images that I was getting at earlier steps in the training process. Observing the loss curves, especially g_loss_class_fake (see below), there is a sudden change in the nature of the losses at around 60,000 iterations (epoch 11). I suspect that coupling the classification loss with the normal discriminator loss caused this problem: the discriminator's classification probabilities are constantly changing throughout training, so the generator must constantly chase a new style distribution, which causes numerous problems. Decoupling the style classification and freezing it during training might help stabilize training and produce better-looking images.

It's also important to note that we had no objective way to compare the different images. The authors used human judges to identify the most aesthetically pleasing images among those generated, but this lacks the scale and accessibility that I needed. The evaluation metric proposed in Progressive Growing of GANs shows some potential here; I leave this discussion for the last section of this work.
The output of a network trained to 25 epochs
The tensorboard loss curves for 256x256 resolution training
On top of all this, the network also experienced a large amount of mode collapse, especially at the end of training; you can see this illustrated in the losses above. Each image was generated from an i.i.d. randomly sampled $z$ vector, yet each seems to fall into one of two categories.

What's next

The idea that we can write algorithms to create art, or learn how to do so, fascinates me. I think GANs are a particularly promising research direction toward this end, especially given the recent success of Progressive Growing of GANs.

Under Development

We're currently in the process of testing the feasibility of using the "improved Wasserstein GAN" objective (eq. 3) combined with the style ambiguity objective. This has effectively become the state of the art among GAN losses: it stabilizes training, improves sample quality, and generally seems to be a good idea.

\begin{equation} \label{eq:wgan_loss} \begin{split} L &= \E_{z\sim p_z} [D(G(z))] - \E_{x\sim p_{data}} [D(x)] \\ &+ \lambda \, \E_{\hat{x} \sim p_{\hat{x}}} \big[ (\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2 \big] \end{split} \end{equation}

Here $\hat{x}$ is sampled along straight lines between real and generated images. Additionally, we are looking for ways to decouple the discriminator's classification loss from the generator. It's not clear to me why the class distribution needs to be learned at the same time as the real/fake classification, and, as mentioned in the results, this coupling may be why the network's image quality drops in later training iterations. Instead, what if we used a stationary class distribution by pretraining a network to convergence for the purpose of classifying artwork? That way, the generator could spend less time optimizing against a moving objective and instead focus on creating novel-looking art.
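A rough sketch of how the critic side of eq. 3 might look in TensorFlow is shown below. It is an illustration only, not the exact code in our branch; `critic` is an assumed function that returns a scalar score per image and reuses its variables across calls.

```python
import tensorflow as tf

def wgan_gp_critic_loss(critic, real, fake, batch_size, lam=10.0):
    """Illustrative WGAN-GP critic loss; argument names are assumptions."""
    # Wasserstein critic term.
    loss = tf.reduce_mean(critic(fake)) - tf.reduce_mean(critic(real))

    # Gradient penalty on random interpolates between real and generated images.
    eps = tf.random_uniform([batch_size, 1, 1, 1], minval=0.0, maxval=1.0)
    interp = eps * real + (1.0 - eps) * fake
    grads = tf.gradients(critic(interp), [interp])[0]
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    penalty = tf.reduce_mean(tf.square(norms - 1.0))
    return loss + lam * penalty
```

The decoupling idea could look like the hypothetical sketch below, where the style-ambiguity loss comes from a separately pretrained classifier whose weights are never updated, rather than from the discriminator's own moving style head. `style_classifier` is an assumed function over frozen pretrained weights, and the optimizer hyperparameters in the usage comment are placeholders.

```python
import tensorflow as tf

def frozen_ambiguity_loss(fake_images, style_classifier, K):
    # Gradients flow through the frozen classifier to the generator's pixels,
    # but only generator variables are handed to the optimizer below.
    logits = style_classifier(fake_images)
    uniform = tf.fill(tf.shape(logits), 1.0 / K)
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=uniform))

# Usage (assuming the generator lives in a 'generator' variable scope):
# g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='generator')
# g_train = tf.train.AdamOptimizer(2e-4, beta1=0.5).minimize(
#     g_loss_fake + frozen_ambiguity_loss(fake, style_classifier, K),
#     var_list=g_vars)
```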

Future Work

I'd mainly like to take a few of the ideas newly proposed in Progressive Growing of GANs and apply them to this particular problem. I'm most excited about the architectural and training advancements from that paper, specifically the progressive layer-wise training. The paper's Laplacian-pyramid-based quality metric might also serve as a good proxy for human evaluation and enable faster iteration on hyperparameters.