Our final project tackles the problem of text-to-image generation by
leveraging recent advances in natural language processing (NLP).
Text-to-image generation is an interesting problem with great potential
in art and design. Recent approaches to this problem use GANs
conditioned on text: a text encoder maps the caption into feature
representations, and a generator and a discriminator are trained
adversarially to produce realistic images. The natural first step is to
encode the whole piece of text into a single global vector and use it
as the condition for image generation.
However, this method ignores information at the word level. AttnGAN
addresses this problem by using word features in addition to sentence
features, and by using the Deep Attentional Multimodal Similarity Model
(DAMSM) to compute a fine-grained loss that is incorporated into GAN
training. With recent advances in natural language processing, more and
more powerful tools have become available to us, and BERT is a prime
example. We take advantage of these tools and modify AttnGAN to use
them.
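To make the difference between a global sentence vector and word-level
features concrete, here is a minimal sketch in plain Python. It is not
AttnGAN's exact formulation: the toy 3-dimensional features, the mean
pooling used for the sentence vector, and the function names
(global_condition, word_context) are all assumptions made for
illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_condition(word_feats):
    """One vector for the whole caption (assumed here: mean of word features)."""
    dim = len(word_feats[0])
    return [sum(w[d] for w in word_feats) / len(word_feats) for d in range(dim)]


def word_context(region_feat, word_feats):
    """Attention-weighted sum of word features for a single image region."""
    weights = softmax([dot(region_feat, w) for w in word_feats])
    dim = len(word_feats[0])
    context = [sum(a * w[d] for a, w in zip(weights, word_feats)) for d in range(dim)]
    return weights, context


# Toy, hand-made features (hypothetical values, not learned embeddings).
words = ["brown", "wings", "belly"]
word_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_feat = [0.0, 0.0, 4.0]  # a region whose feature aligns with "belly"

sentence_vec = global_condition(word_feats)  # identical for every region
weights, context = word_context(region_feat, word_feats)
focus = words[weights.index(max(weights))]  # word this region attends to most
```

The global vector is the same for every image region, whereas the
word-context vector changes with the region, which is the extra signal
that word-level conditioning provides.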
2. Method
We implement AttnGAN and replace certain modules, namely the text
encoder and the DAMSM module, with pre-trained image caption networks
and BERT. Below is the modified AttnGAN architecture and some results
generated by AttnGAN. More details on the internal workings of AttnGAN
and on how we modified it to take advantage of the pre-trained modules
can be found in the paper.
This bird is brown and yellow in color with a stubby beak.
This bird has wings that are brown and a white belly.
This bird has a bill and a large black eye with a yellow throat
and a grey breast.
A brown colored bird with a long tail and a very small bill in
comparison to its body.
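The module swap described in the Method section can be pictured as
coding against a small encoder interface, so that the original text
encoder and a BERT-based one are interchangeable. The sketch below is
only an illustration under stated assumptions: TextFeatures,
TextEncoder, and ToyEncoder are hypothetical names, and the toy
embedding stands in for a real pre-trained model. It is not the
project's actual code.

```python
from dataclasses import dataclass
from typing import List, Protocol

Vector = List[float]


@dataclass
class TextFeatures:
    words: List[Vector]  # one feature vector per token (word-level condition)
    sentence: Vector     # one vector for the whole caption (global condition)


class TextEncoder(Protocol):
    """Hypothetical interface that any pluggable text encoder would satisfy."""

    def encode(self, caption: str) -> TextFeatures: ...


class ToyEncoder:
    """Stand-in encoder; a real one would wrap an RNN or a pre-trained BERT."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def _embed(self, token: str) -> Vector:
        # Deterministic toy embedding derived from character codes.
        code = sum(ord(c) for c in token)
        return [float(code % (d + 2)) for d in range(self.dim)]

    def encode(self, caption: str) -> TextFeatures:
        tokens = caption.lower().split()
        if not tokens:
            raise ValueError("caption must contain at least one token")
        words = [self._embed(t) for t in tokens]
        # Assumed pooling: the sentence vector is the mean of the word vectors.
        sentence = [sum(w[d] for w in words) / len(words) for d in range(self.dim)]
        return TextFeatures(words=words, sentence=sentence)


def text_condition(encoder: TextEncoder, caption: str) -> TextFeatures:
    """The rest of the pipeline depends only on the interface, not the encoder."""
    return encoder.encode(caption)


feats = text_condition(ToyEncoder(dim=4), "This bird has a white belly")
```

Because the downstream code only sees word and sentence features,
changing the encoder does not require changing how those features are
consumed, provided the feature dimensions match.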