GANs comparison without cherry-picking
Implementations of some theoretical generative adversarial nets: DCGAN, EBGAN, LSGAN, WGAN, WGAN-GP, BEGAN, DRAGAN and CoulombGAN.
I implemented each model with the same structure as described in its paper and compared the models on the CelebA and LSUN datasets without cherry-picking.
Table of Contents
- Similar works
- Model architectures are the same as the architectures proposed in each paper
- The models were not heavily tuned, so the results can likely be improved
- Well-structured code (this was my goal at the start, but I don't know whether I succeeded!)
- The TensorFlow queue runner is used for the input pipeline
- Single trainer (and single evaluator) with a multi-model structure
- Training logs and configurations are recorded in TensorBoard
The family of conditional GANs is excluded (CGAN, acGAN, and so on).
- All experiments were performed on the 64x64 CelebA dataset
- The CelebA dataset has 202599 images
- 1 epoch consists of about 1.58k iterations for batch size 128
- The LSUN dataset has 3033042 images
- 1 epoch consists of about 23.7k iterations for batch size 128
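The iteration counts per epoch follow directly from the dataset sizes and the batch size. A minimal sketch of that arithmetic, assuming the first image count refers to CelebA and the second to LSUN (the helper name `iterations_per_epoch` is made up for illustration and is not part of this repository):

```python
def iterations_per_epoch(num_images, batch_size=128):
    """Return the (fractional) number of batches needed to see every image once."""
    return num_images / float(batch_size)

# Image counts quoted in the lists above; the dataset labels are assumptions.
celeba_iters = iterations_per_epoch(202599)
lsun_iters = iterations_per_epoch(3033042)

print("CelebA: %.2fk iterations per epoch" % (celeba_iters / 1000.0))
print("LSUN:   %.2fk iterations per epoch" % (lsun_iters / 1000.0))
```
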
- I implemented each model as proposed in its paper, but ignored some details (or the paper did not describe them)
- Granted, small details can make a large difference in the results because GAN training is very unstable
- So if you get better results, let me know the settings 🙂
- Defaults are batch_size=128 and z_dim=100 (from DCGAN)
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
- Relatively simple networks
- The learning rate for the discriminator (D_lr) is 2e-4; the learning rate for the generator (G_lr) is either 2e-4 (proposed in the paper) or 1e-3
The second row (50k, 30k) indicates the number of training iterations for each result.
The higher generator learning rate (1e-3) gave better results. In this case, however, the generator sometimes collapsed because of its large learning rate. Lowering both learning rates may bring stability, as in https://ajolicoeur.wordpress.com/cats/, which suggests D_lr=5e-5 and G_lr=2e-4.
Zhao, Junbo, Michael Mathieu, and Yann LeCun. "Energy-based generative adversarial network." arXiv preprint arXiv:1609.03126 (2016).
- I like the energy concept, so this paper is very interesting to me :)
- But there is criticism: Are Energy-Based GANs any more energy-based than normal GANs?
- Anyway, the energy concept and the autoencoder-based loss function are impressive, and the results are also fine
- But I have a question about the Pulling-away Term (PT), which theoretically prevents mode collapse. It follows the same idea as minibatch discrimination (T. Salimans et al.).
|pt weight = 0.1||No pt loss|
The model using PT generates slightly better samples visually. However, it is not clear from these results whether PT prevents mode collapse. Furthermore, repeated experiments did not let me tell which setting is better.
|pt weight = 0.1||No pt loss|
pt_loss decreases a little faster on the left, which used pt_weight=0.1, but there is no big difference, and at the end the right, which used no pt_loss, even showed a lower pt_loss. So I wonder: does the PT loss really prevent mode collapse as described in the paper?
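For reference, the pulling-away term in the EBGAN paper is the mean squared cosine similarity between all distinct pairs of encoder outputs in a batch. The NumPy sketch below illustrates that formula only; it is not the TensorFlow code used in this project, and the function name and the `eps` stabilizer are assumptions.

```python
import numpy as np

def pulling_away_term(codes, eps=1e-8):
    """Mean squared cosine similarity over all distinct pairs of rows in `codes`."""
    codes = np.asarray(codes, dtype=np.float64)
    n = codes.shape[0]
    # Normalize every encoder output to (almost exactly) unit length.
    normalized = codes / (np.linalg.norm(codes, axis=1, keepdims=True) + eps)
    # Pairwise cosine similarities, squared.
    squared_cos = np.dot(normalized, normalized.T) ** 2
    # Drop the diagonal (each vector compared with itself) and average the rest.
    off_diagonal_sum = squared_cos.sum() - np.trace(squared_cos)
    return off_diagonal_sum / (n * (n - 1))

# Mutually orthogonal codes give a term near 0; identical codes give a term near 1.
pt_orthogonal = pulling_away_term(np.eye(4))
pt_collapsed = pulling_away_term(np.ones((4, 8)))
```

A batch whose codes all point in the same direction is penalized the most, which is why the term is expected to discourage collapsed batches.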
Mao, Xudong, et al. "Least squares generative adversarial networks." arXiv preprint arXiv:1611.04076 (2016).
- Unusually, LSGAN uses a large latent space dimension (z_dim=1024)
- But in my experiments, z_dim=100 gave better results than z_dim=1024, which is the value originally used in the paper
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein gan." arXiv preprint arXiv:1701.07875 (2017).
- The samples from WGAN are not that impressive, especially compared to the very impressive theory
- Also, no specific network structure is proposed, so the DCGAN architecture was used for the experiments
- In the authors' implementation, a higher n_critic is used in the early stage of training and once every 500 iterations
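The n_critic schedule mentioned in the last bullet can be sketched as a small function. The concrete numbers here (100 critic steps during the first 25 generator iterations, 5 otherwise) are my recollection of the authors' reference code and should be treated as assumptions; only the "early stage and every 500 iterations" rule comes from the text above.

```python
def n_critic_schedule(gen_iteration, default_n_critic=5, boosted_n_critic=100,
                      warmup_iterations=25, boost_every=500):
    """Return how many critic updates to run before the next generator update."""
    # Train the critic harder early on and periodically afterwards.
    if gen_iteration < warmup_iterations or gen_iteration % boost_every == 0:
        return boosted_n_critic
    return default_n_critic

# Critic steps for a few generator iterations.
schedule_preview = [(i, n_critic_schedule(i)) for i in (0, 24, 25, 499, 500, 501)]
print(schedule_preview)
```
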
Gulrajani, Ishaan, et al. "Improved training of wasserstein gans." arXiv preprint arXiv:1704.00028 (2017).
- I tried two network architectures: the DCGAN architecture and the ResNet architecture from appendix C
- The ResNet has a more complicated architecture and performs better than the DCGAN architecture
- Interestingly, the visual quality of samples improves very quickly (ResNet WGAN-GP has its best samples at 7k iterations) and gets worse as training continues
- According to DRAGAN, the constraints of WGAN are too restrictive to learn a good generator
|DCGAN architecture||ResNet architecture|
|30k||7k, batch size = 64|
|100k, ResNet architecture|
Face collapse phenomenon
WGAN-GP collapsed more than the other models as the number of iterations increased.
The ResNet architecture showed its best visual sample quality at a very early stage, 7k iterations by my criteria. This may be due to the residual architecture.
Regardless of the face collapse phenomenon, the Wasserstein distance decreased steadily. This is probably because the critic (discriminator) network failed to find the supremum over K-Lipschitz functions.
|DCGAN architecture||ResNet architecture|
The plots in the last row of the table are just expanded versions of the plots in the second row.
It is interesting that W_dist < 0 at the end of training. This indicates that E[fake] > E[real] and, from the point of view of the original GAN, it means the generator dominates the discriminator.
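The sign argument above assumes that W_dist is estimated as the mean critic output on real samples minus the mean critic output on fake samples. A minimal NumPy sketch of that estimate (the function name and the toy scores are made up for illustration):

```python
import numpy as np

def wasserstein_estimate(critic_real, critic_fake):
    """Estimate W_dist as E[critic(real)] - E[critic(fake)] over one batch."""
    return float(np.mean(critic_real) - np.mean(critic_fake))

# Toy critic scores: the fake batch scores higher than the real batch,
# so the estimate becomes negative.
real_scores = np.array([0.2, 0.4])
fake_scores = np.array([0.5, 0.7])
w_dist = wasserstein_estimate(real_scores, fake_scores)
print("W_dist = %.2f" % w_dist)
```
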
Berthelot, David, Tom Schumm, and Luke Metz. "Began: Boundary equilibrium generative adversarial networks." arXiv preprint arXiv:1703.10717 (2017).
- As far as I know, this is the model that generates samples with the best visual quality
- It also showed the best performance in this project
- This is even though the optional improvements (section 3.5.1 in the paper) were not implemented
- However, the samples generated by BEGAN feel slightly different from those of other models: details seem to disappear
- So I wonder what the results look like on different datasets
batch_size=16, z_dim=64, gamma=0.5.
|Convergence measure M|
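For context, the BEGAN paper defines the convergence measure as M = L(x) + |gamma * L(x) - L(G(z))| and updates the balance variable k by proportional control. The sketch below follows those formulas with plain floats. The step size `lambda_k=0.001` is the value I recall from the paper and the clipping of k to [0, 1] is likewise an assumption, not something read from this repository's code.

```python
def began_step(k, loss_real, loss_fake, gamma=0.5, lambda_k=0.001):
    """Return the updated balance variable k and the convergence measure M."""
    # How far the current losses are from the target ratio gamma.
    balance = gamma * loss_real - loss_fake
    # Proportional control update, kept inside [0, 1].
    new_k = min(max(k + lambda_k * balance, 0.0), 1.0)
    # Convergence measure: reconstruction quality plus distance from equilibrium.
    convergence_m = loss_real + abs(balance)
    return new_k, convergence_m

# Toy autoencoder losses for one step.
k_next, m_value = began_step(k=0.0, loss_real=0.2, loss_fake=0.05)
print("k = %.5f, M = %.3f" % (k_next, m_value))
```
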
I also tried to reduce the speck-like artifacts as suggested in Heumi/BEGAN-tensorflow, but they did not go away.
BEGAN works terribly on the LSUN dataset. Not only was severe mode collapse observed, but the generated images were also not realistic.
Kodali, Naveen, et al. "How to Train Your DRAGAN." arXiv preprint arXiv:1705.07215 (2017).
- Unlike other papers, DRAGAN is motivated by game theory as a way to improve GAN performance
- This game-theoretic approach is highly unique and interesting
- But, IMHO, there is not much real contribution. The algorithm is similar to WGAN-GP
The original paper has some bugs. One of them is that the image x is perturbed only on the positive side. I applied two-sided perturbation, since the author acknowledged this bug on GitHub.
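The difference between one-sided and two-sided perturbation can be illustrated in NumPy. This is a sketch under assumptions: scaling the noise by `0.5 * x.std()` is my recollection of the reference code, and the function names are hypothetical; the project itself builds the perturbation in TensorFlow.

```python
import numpy as np

def perturb_one_sided(x, rng, scale=0.5):
    """Positive-only noise in [0, 1): the behavior described as a bug above."""
    noise = rng.uniform(0.0, 1.0, size=x.shape)
    return x + scale * x.std() * noise

def perturb_two_sided(x, rng, scale=0.5):
    """Noise in [-1, 1): perturbs x in both directions around the data point."""
    noise = rng.uniform(-1.0, 1.0, size=x.shape)
    return x + scale * x.std() * noise

rng = np.random.RandomState(0)
x = rng.uniform(-1.0, 1.0, size=(8, 64, 64, 3))  # toy batch of "images"
delta_one = perturb_one_sided(x, rng) - x
delta_two = perturb_two_sided(x, rng) - x
print("one-sided min/max: %.3f / %.3f" % (delta_one.min(), delta_one.max()))
print("two-sided min/max: %.3f / %.3f" % (delta_two.min(), delta_two.max()))
```
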
Unterthiner, Thomas, et al. "Coulomb GANs: Provably Optimal Nash Equilibria via Potential Fields." arXiv preprint arXiv:1708.08819 (2017).
- CoulombGAN also has a very interesting perspective: the "Coulomb potential"
- It is very interesting, but I don't know whether it is a GAN
- CoulombGAN tries to solve the diversity problem (mode collapse)
G_lr=5e-4, D_lr=25e-5, z_dim=32.
The disadvantage of this model is that it takes a very long time to train despite the simplicity of the network architecture. Further, like the original GAN, it has no convergence measure. I thought that the potentials of fake samples would serve as a convergence measure, but they did not.
Download the CelebA and LSUN datasets:
$ python download.py celebA
$ python download.py lsun
Convert images to tfrecords format:
Options for converting are hard-coded, so make sure to modify them before running convert.py. In particular, the LSUN dataset is provided in LMDB format.
$ python convert.py
If you want to change the settings of each model, you must also modify the code directly.
$ python train.py --help
usage: train.py [-h] [--num_epochs NUM_EPOCHS] [--batch_size BATCH_SIZE]
                [--num_threads NUM_THREADS] --model MODEL [--name NAME]
                --dataset DATASET [--ckpt_step CKPT_STEP] [--renew]

optional arguments:
  -h, --help            show this help message and exit
  --num_epochs NUM_EPOCHS
                        default: 20
  --batch_size BATCH_SIZE
                        default: 128
  --num_threads NUM_THREADS
                        # of data read threads (default: 4)
  --model MODEL         DCGAN / LSGAN / WGAN / WGAN-GP / EBGAN / BEGAN /
                        DRAGAN / CoulombGAN
  --name NAME           default: name=model
  --dataset DATASET, -D DATASET
                        CelebA / LSUN
  --ckpt_step CKPT_STEP
                        # of steps for saving checkpoint (default: 5000)
  --renew               train model from scratch - clean saved checkpoints
                        and summaries
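A command-line interface like the one shown above can be declared with argparse. The sketch below is reconstructed only from the help text and is not the actual train.py; the argument types and the way `--name` falls back to the model name are assumptions.

```python
import argparse

def build_train_parser():
    """Build a parser that mirrors the options listed in the help output."""
    parser = argparse.ArgumentParser(prog='train.py')
    parser.add_argument('--num_epochs', type=int, default=20, help='default: 20')
    parser.add_argument('--batch_size', type=int, default=128, help='default: 128')
    parser.add_argument('--num_threads', type=int, default=4,
                        help='# of data read threads (default: 4)')
    parser.add_argument('--model', required=True,
                        help='DCGAN / LSGAN / WGAN / WGAN-GP / EBGAN / BEGAN / DRAGAN / CoulombGAN')
    parser.add_argument('--name', default=None, help='default: name=model')
    parser.add_argument('--dataset', '-D', required=True, help='CelebA / LSUN')
    parser.add_argument('--ckpt_step', type=int, default=5000,
                        help='# of steps for saving checkpoint (default: 5000)')
    parser.add_argument('--renew', action='store_true',
                        help='train model from scratch - clean saved checkpoints and summaries')
    return parser

# Parse an example command line instead of sys.argv.
args = build_train_parser().parse_args(['--model', 'WGAN-GP', '--dataset', 'CelebA'])
run_name = args.name or args.model  # assumed fallback: the run is named after the model
print(run_name, args.num_epochs, args.batch_size)
```
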
Monitor through TensorBoard:
$ tensorboard --logdir=summary/dataset/name
Evaluate (generate fake samples):
$ python eval.py --help
usage: eval.py [-h] --model MODEL [--name NAME] --dataset DATASET
               [--sample_size SAMPLE_SIZE]

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         DCGAN / LSGAN / WGAN / WGAN-GP / EBGAN / BEGAN /
                        DRAGAN / CoulombGAN
  --name NAME           default: name=model
  --dataset DATASET, -D DATASET
                        CelebA / LSUN
  --sample_size SAMPLE_SIZE, -N SAMPLE_SIZE
                        # of samples. It should be a square number.
                        (default: 16)
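The sample size presumably has to be a square number because the generated images are arranged in a square grid. A minimal NumPy sketch of such tiling, assuming samples shaped (N, height, width, channels); the function `tile_samples` is hypothetical and not taken from eval.py:

```python
import math
import numpy as np

def tile_samples(samples):
    """Arrange N = side * side images into one (side*H, side*W, C) grid image."""
    samples = np.asarray(samples)
    n, h, w, c = samples.shape
    side = int(round(math.sqrt(n)))
    if side * side != n:
        raise ValueError("sample_size must be a square number, got %d" % n)
    # (row, col, H, W, C) -> (row, H, col, W, C) -> (row*H, col*W, C)
    grid = samples.reshape(side, side, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(side * h, side * w, c)

# 16 random 64x64 RGB "samples" become one 256x256 image.
fake_samples = np.random.RandomState(0).rand(16, 64, 64, 3)
grid_image = tile_samples(fake_samples)
print(grid_image.shape)
```
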
- python 2.7
- tensorflow >= 1.2 (verified on 1.2 and 1.3)
- (optional) pynvml - for automatic gpu selection