Fantastic GANs and where to find them II

19 Nov 2017

Hello again! This is the follow-up to the original Fantastic GANs and where to find them. If you haven’t read that article or you are completely new to GANs, consider giving it a quick read, although there’s a brief summary of the previous post ahead. It has been 8 months since the last post and GANs aren’t exactly known for being a field with few publications. In fact, I don’t think we are very far from having more GAN names than Pokémon. Even Andrej Karpathy himself finds it difficult to keep up to date:

So, with this in mind, let’s see what relevant advances have happened in these last months.

What this post is not about

This is what you won’t find in this post:

What this post is about

Index

  1. Refresher
  2. GANs: the evolution (part II)
    1. Improved WGANs
    2. BEGANs
    3. ProGANs
    4. Honorable mention: CycleGANs
  3. Other useful resources
  4. Closing

Refresher

Let’s get a brief refresher from the last post.

You don't need to design a loss function if a discriminator can design one for you.

GANs in a nutshell.

GANs: the evolution (part II)

Here I’m going to describe in chronological order the most relevant GAN articles that have been published lately.

Improved WGANs (WGAN-GP)

March 2017

TL;DR: take Wasserstein GANs and replace weight clipping, which is the cause of some undesirable behaviours, with a gradient penalty. This results in faster convergence, higher quality samples and more stable training.

[Article] [Code]

The problem. WGANs sometimes generate poor quality samples or fail to converge in some settings. This is mainly caused by the weight clipping (clamping all weights into a range [min, max]) performed in WGANs as a measure to satisfy the Lipschitz constraint. If you don’t know about this constraint, just keep in mind that it’s a requirement for WGANs to work properly. Why is weight clipping a problem? Because it biases the WGAN towards much simpler functions. This means that the WGAN might not be able to model complex data, since it can only offer simple approximations (see image below). Additionally, weight clipping makes vanishing or exploding gradients more likely.

WGAN-GP 8 Gaussians toy example

Here you can see how a WGAN fails to model 8 Gaussians (left) because it uses simple functions. On the other hand, a WGAN-GP correctly models them using more complex functions (right).
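To make the weight clipping from "The problem" concrete, here is a minimal sketch in plain Python. The function name `clip_weights` and the flat list of weights are just for illustration, and the default range of 0.01 is the value I recall from the original WGAN paper, so treat it as an assumption.

```python
def clip_weights(weights, c=0.01):
    """Clamp every weight into the range [-c, c] (illustrative helper)."""
    return [min(max(w, -c), c) for w in weights]


# Weights outside the range are pushed onto its boundary,
# weights already inside it are left untouched.
clipped = clip_weights([0.5, -0.3, 0.004])
```

Notice how the large weights end up sitting exactly on the limits of the range, which hints at why a clipped critic tends towards very simple functions.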

Gradient penalty. So how do we get rid of weight clipping? The authors of the WGAN-GP (where GP stands for gradient penalty) propose enforcing the Lipschitz constraint using another method called gradient penalty. Basically, GP consists of restricting some gradients to have a norm of 1. This is why it’s called gradient penalty, as it penalizes gradients whose norms deviate from 1.
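Here is a rough sketch of what that penalty could look like, under a few assumptions of mine. As I understand the paper, the gradients being penalized are those of the critic with respect to points interpolated at random between a real and a generated sample, and the penalty weight is 10. To keep the example dependency-free, the gradient is estimated with finite differences and the critics are toy linear functions. A real implementation would use the automatic differentiation of your deep learning framework instead.

```python
import math
import random


def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at point x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def gradient_penalty(critic, real_batch, fake_batch, lam=10.0, rng=random):
    """Batch average of lam * (||grad critic(x_hat)|| - 1)^2."""
    total = 0.0
    for real, fake in zip(real_batch, fake_batch):
        t = rng.random()  # random mixing weight in [0, 1)
        # x_hat lies on the straight line between the real and the fake sample
        x_hat = [t * r + (1 - t) * f for r, f in zip(real, fake)]
        grad = numerical_gradient(critic, x_hat)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return lam * total / len(real_batch)


# Toy critics (hypothetical): a linear critic has the same gradient everywhere.
def unit_slope_critic(x):
    return 0.6 * x[0] + 0.8 * x[1]  # gradient norm is 1, so no penalty


def steep_critic(x):
    return 3.0 * x[0] + 4.0 * x[1]  # gradient norm is 5, so it is penalized


real = [[1.0, 2.0], [0.5, -1.0]]
fake = [[0.0, 0.0], [2.0, 1.0]]
penalty_unit = gradient_penalty(unit_slope_critic, real, fake)    # close to 0
penalty_steep = gradient_penalty(steep_critic, real, fake)        # close to 10 * (5 - 1)^2 = 160
```

In training, this term would be added to the usual critic loss, so the critic is nudged towards gradients of norm 1 instead of having its weights clamped.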

Advantages. As a result, WGANs trained using GP rather than weight clipping have faster convergence. Additionally, the training is much more stable, to the extent that almost no hyperparameter tuning is required and the architecture used is not as critical. WGANs-GP also generate high-quality samples, but it is difficult to tell by how much. On proven and tested architectures, the quality of these samples is very similar to the baseline WGAN:

WGAN-GP baseline comparison

Where WGAN-GP is clearly superior is in generating high-quality samples on architectures where other GANs are prone to fail. For example, to the authors’ knowledge, this is the first time that a GAN setting has worked on residual network architectures:

WGAN-GP other architectures comparison

There are a lot of other interesting details that I haven’t mentioned, as they’d go far beyond the scope of this post. For those who want to know more (e.g. why the gradient penalty is applied just to “some” gradients or how to apply this model to text), I recommend taking a look at the article.

You might want to use WGANs-GP if

you want an improved version of the WGAN which

Boundary Equilibrium GANs (BEGANs)

March 2017

TL;DR: GANs using an auto-encoder as the discriminator. They can be successfully trained with simple architectures. They incorporate a dynamic term that balances both discriminator and generator during training.

[Article]

Fun fact: BEGANs were published on the very same day as the WGAN-GP paper.

Idea. What sets BEGANs apart from other GANs is that they use an auto-encoder architecture for the discriminator (similarly to EBGANs) and a special loss adapted for this scenario. What is the reason behind this choice? Are auto-encoders not the devil as they force us to have a pixel reconstruction loss that makes blurry generated samples? To answer these questions we need to consider these two points:

  1. Why reconstruction loss? The explanation from the authors is that we can rely on the assumption that, by matching the reconstruction loss distribution, we will also end up matching the real sample distributions.

  2. Which leads us to: how? An important remark is that the reconstruction loss from the auto-encoder/discriminator (i.e. given this input image, give me the best reconstruction) is not the final loss that BEGANs are minimizing. This reconstruction loss is just a step to calculate the final loss. And the final loss is calculated using the Wasserstein distance (yes, it’s everywhere now) between the reconstruction loss on real and generated data.

This might be a lot of information at once but, once we see how this loss function is applied to the generator and discriminator, it’ll be much clearer:

Diversity factor. Another interesting contribution is what they call the diversity factor. This factor controls how much you want the discriminator to focus on getting a perfect reconstruction on real images (quality) vs. distinguishing real images from generated ones (diversity). Then, they go one step further and use this diversity factor to maintain a balance between the generator and discriminator during training. Similarly to WGANs, they use this equilibrium between both networks as a measure of convergence that correlates with image quality. However, unlike WGANs (and WGANs-GP), they use the Wasserstein distance in such a way that the Lipschitz constraint is not required.
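To give an idea of how the losses and the balancing term fit together, here is a small sketch based on my reading of the BEGAN paper. The function `began_step` and its argument names are hypothetical. It takes the auto-encoder reconstruction loss on real images and on generated images as plain numbers, `gamma` plays the role of the diversity factor (0.5 is just an illustrative choice), `k` is the dynamic balancing term and `lambda_k` its learning rate (0.001 is the value I recall from the paper).

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    """One bookkeeping step from the two reconstruction losses.

    loss_real: reconstruction loss of the discriminator on real images
    loss_fake: reconstruction loss of the discriminator on generated images
    """
    # The discriminator reconstructs real images well and generated ones badly;
    # k controls how much weight the generated images get.
    d_loss = loss_real - k * loss_fake
    # The generator wants its samples to be reconstructed well.
    g_loss = loss_fake
    # Positive balance: the generator is ahead, so k grows and the
    # discriminator pays more attention to generated images.
    balance = gamma * loss_real - loss_fake
    k_next = min(max(k + lambda_k * balance, 0.0), 1.0)  # keep k in [0, 1]
    # Convergence measure: low when real reconstructions are good
    # and both networks are in equilibrium.
    convergence = loss_real + abs(balance)
    return d_loss, g_loss, k_next, convergence


d_loss, g_loss, k_next, convergence = began_step(loss_real=0.2, loss_fake=0.05, k=0.0)
```

A lower `gamma` makes the discriminator care more about reconstructing real images (quality), while a higher one gives more room to diversity.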

Results. BEGANs do not need any fancy architecture to train properly; as mentioned in the paper: “no batch normalization, no dropout, no transpose convolutions and no exponential growth for convolution filters”. The quality of the generated samples (128x128) is quite impressive*:

BEGAN face samples

*However, there’s an important detail to be considered in this paper. They are using an unpublished dataset which is almost twice the size of the widely used CelebA dataset. So, for a more realistic qualitative comparison, I invite you to check any public implementation using CelebA and see the generated samples.

As a final note, if you want to know more about BEGANs, I recommend reading this blog post, which goes much more into detail.

You might want to use BEGANs…

… for the same reasons you would use WGANs-GP. They both offer very similar results (stable training, simple architecture, loss function correlated to image quality) and mainly differ in their approach. Because generative models are inherently hard to evaluate, it’s difficult to say which is better. As Theis et al. say in their paper, you should choose one evaluation method or another depending on the application. In this case, WGAN-GP has a better Inception score and yet BEGANs generate very high-quality samples. Both are innovative and promising.

Progressive growing of GANs (ProGANs)

October 2017

TL;DR: Progressively add new high-resolution layers during training that generate incredibly realistic images. Other improvements and a new evaluation method are also proposed. The quality of the generated images is astonishing.

[Article] [Code]

Generating high-resolution images is a big challenge. The larger the image, the easier it is for the network to fail because it needs to learn to generate more subtle and complex details. To give a little bit of context, before this article, realistic generated images were around 256x256. Progressive GANs (ProGANs) take this to a whole new level by successfully generating completely realistic 1024x1024 images. Let’s see how.

Idea. ProGANs, which are built upon WGANs-GP, introduce a smart way to progressively add new layers during training. Each one of these layers upsamples the images to a higher resolution for both the discriminator and generator. Let’s go step by step:

  1. Start with the generator and discriminator training with low-resolution images.
  2. At some point (e.g. when they start to converge) increase the resolution. This is done very elegantly with a “transition period” / smoothing:

ProGANs smoothing

Instead of just adding a new layer directly, it's added in small linear steps controlled by α.

Let's see what happens in the generator. At the beginning, when α = 0, nothing changes. All the contribution of the output is from the previous low-resolution layer (16x16). Then, as α is increased, the new layer (32x32) will start getting its weights adjusted through backpropagation. By the end, α will be equal to 1, meaning that we can totally drop the "shortcut" used to skip the 32x32 layer. The same happens to the discriminator, but the other way around: instead of making the image larger, we make it smaller.

  3. Once the transition is done, keep training the generator and discriminator. Go to step 2 if the resolution of currently generated images is not the target resolution.
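The transition in step 2 can be sketched as a simple blend between the upsampled low-resolution output and the output of the new layer. This is only a toy version on my side: images are single-channel lists of rows, the helper names are made up, and I'm assuming nearest-neighbour upsampling and a linear schedule for α, which is how I read the paper.

```python
def upsample_nearest(image):
    """Double the resolution of a 2D image by repeating every pixel."""
    out = []
    for row in image:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def transition_alpha(step, transition_steps):
    """Linear schedule: alpha goes from 0 to 1 and then stays at 1."""
    return min(1.0, step / transition_steps)


def fade_in(low_res, high_res, alpha):
    """Blend the old (upsampled) output with the new high-resolution output."""
    shortcut = upsample_nearest(low_res)
    return [
        [(1 - alpha) * s + alpha * h for s, h in zip(s_row, h_row)]
        for s_row, h_row in zip(shortcut, high_res)
    ]


low = [[1.0, 2.0], [3.0, 4.0]]             # output of the old 2x2 layer
high = [[0.0] * 4 for _ in range(4)]       # output of the new 4x4 layer
start = fade_in(low, high, transition_alpha(0, 100))     # only the shortcut
middle = fade_in(low, high, transition_alpha(50, 100))   # half and half
end = fade_in(low, high, transition_alpha(100, 100))     # only the new layer
```

With α = 0 the new layer contributes nothing, and with α = 1 the shortcut can be dropped, which matches the description of the generator above.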

But, wait a moment… isn’t this upsampling and concatenation of new high-resolution images something already done in StackGANs (and the new StackGANs++)? Well, yes and no. First of all, StackGANs are text-to-image conditional GANs that use text descriptions as an additional input while ProGANs don’t use any kind of conditional information. But, more interestingly, despite both StackGANs and ProGANs using concatenation of higher resolution images, StackGANs require one independent generator/discriminator pair, which needs to be trained separately, per upsampling step. Do you want to upsample 3 times? Train 3 GANs. On the other hand, in ProGANs only a single GAN is trained. During this training, more upsampling layers are progressively added to upsample the images. So, the cost of upsampling 3 times is just adding more layers during training, as opposed to training 3 new GANs from scratch. In summary, ProGANs use a similar idea to StackGANs and they manage to pull it off more elegantly, with better results and without extra conditional information.

Results. As a result of this progressive training, generated images in ProGANs are of higher quality and training time is reduced by 5.4x on 1024x1024 images. The reasoning behind this is that a ProGAN doesn’t need to learn all large-scale and small-scale representations at once. In a ProGAN, first the large-scale structures are learnt (i.e. low-resolution layers converge) and then the model is free to focus on refining purely the small-scale details (i.e. new high-resolution layers converge).

Other improvements. Additionally, the paper proposes new design decisions to further improve the performance of the model. I’ll briefly describe them:

CelebA-HQ. As a side note, it is worth mentioning that the authors enhanced and prepared the original CelebA for high-resolution training. In a nutshell, they remove artifacts, apply Gaussian filtering to produce a depth-of-field effect, and detect landmarks on the face to finally get a 1024x1024 crop. After this process, they only keep the best 30k images out of 202k.

Evaluation. Last but not least, they introduce a new evaluation method:

You might want to use ProGANs…

Honorable mention: Cycle GANs

[Article] [Code]

Cycle GANs are, at the moment of writing these words, the most advanced approach to image-to-image translation using GANs. Tired of your horse not being a zebra? Or maybe that Instagram photo needs more winter? Say no more.


These GANs don’t require paired datasets to learn to translate between domains, which is good because this kind of data is very difficult to obtain. However, Cycle GANs still need to be trained with data from two different domains X and Y (e.g. X: horses, Y: zebras). In order to constrain the translation from one domain to another, they use what they call a “cycle consistency loss”. This basically means that if you translate a horse A into a zebra A, transforming the zebra A back to a horse should give you the original horse A as a result.
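The horse-to-zebra-and-back idea can be written down in a few lines. This is a toy sketch with assumptions of mine: samples are plain lists of numbers, `G` maps X to Y and `F` maps Y to X (here simple hypothetical functions instead of networks), and the distance is a mean absolute error, which I believe is what the paper uses. The weighting of this term against the adversarial losses is left out.

```python
def l1_distance(a, b):
    """Mean absolute difference between two samples."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(G, F, batch_x, batch_y):
    """Penalize x -> G -> F not returning x, and y -> F -> G not returning y."""
    forward = sum(l1_distance(F(G(x)), x) for x in batch_x) / len(batch_x)
    backward = sum(l1_distance(G(F(y)), y) for y in batch_y) / len(batch_y)
    return forward + backward


# Toy "translators" (hypothetical): doubling and halving are exact inverses.
def double(sample):
    return [2.0 * v for v in sample]


def halve(sample):
    return [0.5 * v for v in sample]


def identity(sample):
    return list(sample)


horses = [[1.0, 2.0]]
zebras = [[2.0, 4.0]]
consistent = cycle_consistency_loss(double, halve, horses, zebras)       # 0.0
inconsistent = cycle_consistency_loss(double, identity, horses, zebras)  # 4.5
```

When the two mappings undo each other the loss is zero, and any information lost on the round trip shows up as a positive value.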

This mapping from one domain to another is different from the also popular neural style transfer. The latter combines the content of one image with the style of another, whilst Cycle GANs learn a high level feature mapping from one domain to another. As a consequence, Cycle GANs are more general and can also be used for all sorts of mappings such as converting a sketch of an object into a real object.

Let’s recap. We have had two major improvements, WGANs-GP and BEGANs. Despite following different research directions, they both offer similar advantages. Then, we have ProGANs (based on WGANs-GP), which unlock a clear path to generate realistic high-resolution images. Meanwhile, CycleGANs remind us of the power of GANs to extract meaningful information from a dataset and of how this information can be transferred to another unrelated data distribution.

Other useful resources

Here are a bunch of links to other interesting posts:



Hope this post has been useful and thanks for reading! I want to also say thanks to Blair Young for his feedback on this post. If you think there’s something wrong, inaccurate or want to make any suggestion, please let me know in the comment section below or in this reddit thread.

Oh, and I have just created my new twitter account. I’ll be sharing my new blog posts there.
