
Unsolved Challenges in Generative Adversarial Networks That Still Define the Field

Apr 09, 2019

Generative Adversarial Networks have produced some of the most visually striking outputs in modern machine learning — photorealistic faces, high-resolution scenes, convincing domain transfers. But beneath the impressive demos, a set of foundational questions remains stubbornly unanswered. A closer look at the current state of GAN research reveals not a field running out of problems, but one that hasn't yet agreed on what the right problems are.

GANs vs. Flow Models vs. Autoregressive Models: Who Pays the Computational Price?

Three generative model families currently dominate the research conversation: GANs, Flow Models, and Autoregressive Models. Each comes with a distinct set of trade-offs that aren't yet fully understood.

Flow Models apply a stack of invertible transformations to a prior sample, enabling exact log-likelihood computation. Autoregressive Models factorize a distribution into conditional components and process observations sequentially — one pixel at a time in the image domain. GANs, by contrast, pit a generator against a discriminator in an adversarial training loop that sidesteps explicit likelihood computation entirely.
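For reference, the three families described above can be written compactly. These are the standard textbook forms rather than notation taken from this article; the symbols f, z, p_Z, G, D and the dimension n are introduced here purely for illustration.

```latex
% Flow model: z = f(x) with f invertible and z ~ p_Z (change of variables)
\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|

% Autoregressive model over x = (x_1, \dots, x_n), e.g. one pixel per factor
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})

% GAN: minimax game between generator G and discriminator D, no explicit p(x)
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
             + \mathbb{E}_{z \sim p_Z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The first two expressions give a likelihood that can be evaluated exactly. The third defines only a training game, which is why likelihood never appears in it.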

On paper, Flow Models look like they could make GANs redundant. Exact inference and exact log-likelihoods are genuinely useful properties. But the computational cost comparison tells a different story. The GLOW Flow Model, trained to generate 256×256 celebrity faces, required 40 GPUs running for two weeks and roughly 200 million parameters. Progressive GANs, trained on a comparable face dataset, used 8 GPUs for four days and around 46 million parameters — and produced 1024×1024 images. That's approximately 17 times more GPU-days and four times more parameters for the Flow Model, yielding images with 16 times fewer pixels.
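The ratios quoted above follow from simple arithmetic. The sketch below reproduces them from the figures in the paragraph, assuming "two weeks" means 14 days; the dictionary keys and the helper name gpu_days are illustrative only.

```python
# Figures quoted in the text (GLOW vs. Progressive GAN).
# Assumption: "two weeks" is taken as 14 days.
glow = {"gpus": 40, "days": 14, "params_m": 200, "side_px": 256}
pgan = {"gpus": 8, "days": 4, "params_m": 46, "side_px": 1024}


def gpu_days(run):
    """Total GPU-days for a training run."""
    return run["gpus"] * run["days"]


gpu_day_ratio = gpu_days(glow) / gpu_days(pgan)            # 560 / 32
param_ratio = glow["params_m"] / pgan["params_m"]          # 200 / 46
pixel_ratio = pgan["side_px"] ** 2 / glow["side_px"] ** 2  # 1024^2 / 256^2

print(f"GPU-days: {gpu_days(glow)} vs {gpu_days(pgan)} ({gpu_day_ratio:.1f}x)")
print(f"Parameters: {param_ratio:.1f}x")
print(f"Pixels per image: {pixel_ratio:.0f}x in favor of the GAN")
```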

Two explanations are plausible. Maximum likelihood training may be inherently harder: if a generative model assigns zero probability to any training point, the penalty is infinite. A GAN generator is penalized for the same failure only indirectly, through the discriminator, and less harshly. Alternatively, normalizing flows may simply be an inefficient way to represent certain functions, a hypothesis that has not yet been analyzed rigorously.
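The infinite penalty is easy to see numerically. The following is a minimal sketch with made-up probabilities, not a real model; the function name nll and both example lists are hypothetical.

```python
import math


def nll(probs):
    """Average negative log-likelihood of the training points.

    Returns infinity as soon as one point has probability zero.
    """
    total = 0.0
    for p in probs:
        if p == 0.0:
            return math.inf
        total -= math.log(p)
    return total / len(probs)


# Hypothetical model probabilities for four training points.
well_covered = [0.2, 0.1, 0.3, 0.25]
one_miss = [0.2, 0.1, 0.3, 0.0]  # zero mass on a single training point

print(nll(well_covered))  # finite
print(nll(one_miss))      # inf
```

A single uncovered training point dominates the whole objective, regardless of how well the model fits everything else.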

Autoregressive Models add another dimension. They can be expressed as non-parallelizable Flow Models, and they tend to be more time- and parameter-efficient than Flow Models. The resulting picture looks something like this: GANs are parallel and efficient but not reversible; Flow Models are reversible and parallel but not efficient; Autoregressive Models are reversible and efficient but not parallel. Whether that triangle of trade-offs reflects something fundamental — analogous to the CAP theorem in distributed systems — is an open question worth pursuing.

What Makes a Dataset Hard for a GAN to Model?

Most GAN benchmarking happens on a small set of standard image datasets: MNIST, CIFAR-10, STL-10, CelebA, and ImageNet. There's accumulated folklore about which are easier — MNIST and CelebA are considered more tractable due to their regularity, while ImageNet's high class count is widely cited as a core difficulty. The empirical results back this up: state-of-the-art synthesis on CelebA looks noticeably more convincing than the best results on ImageNet.

The problem is that these conclusions have been reached through slow, noisy trial-and-error — training GANs on progressively larger and more complex datasets and observing what breaks. The datasets themselves weren't designed for generative modeling; they were built for object recognition and happened to be available.

A more principled approach would let researchers examine a dataset and predict, without ever training a model, how difficult it will be for a GAN to learn. Some early work has touched on this, but the field lacks a clean theoretical framework. Key questions remain open: What does it actually mean to "model" a distribution — is a low-support approximation acceptable, or is a true density model required? Are there distributions a GAN fundamentally cannot learn? Are there distributions that are learnable in principle but not efficiently learnable given realistic resource constraints? And critically — are the answers to any of these questions actually different for GANs than for other generative model families?

Beyond Images: The Domain Generalization Problem

The vast majority of GAN research lives in the image domain. Attempts to extend GANs to text, audio, and structured data exist, but images remain by far the most tractable target. The gap in performance between image synthesis and other domains isn't just a matter of scale — it likely reflects something deeper about the implicit priors baked into current GAN architectures.

For continuous non-image data, the expectation is that GANs will eventually reach image-synthesis-level performance, but getting there will require domain-specific priors developed through careful thinking about what's computationally feasible in each setting. For discrete or structured data, the path is less clear. One candidate approach involves training both generator and discriminator as reinforcement learning agents, though that route may demand substantial computational resources and possibly fundamental research advances that don't yet exist.

Why the Gaps in GAN Theory Actually Matter

The practical progress in GAN image synthesis over the past two years has been real and rapid. But the theoretical scaffolding hasn't kept pace. There's still no consensus on how GANs should be evaluated. Current image synthesis benchmarks are showing signs of saturation. And the field is drawing conclusions about model behavior from datasets that were never designed to test generative models in the first place.

This matters because without a cleaner theoretical understanding — of training dynamics, of what makes distributions hard to learn, of where the fundamental trade-offs between model families actually lie — progress risks becoming increasingly empirical and increasingly expensive. The researchers pushing GAN capabilities forward are doing so largely by scaling compute and iterating on architectures, without a strong predictive theory to guide those choices.

The open problems here aren't niche academic puzzles. A CAP-theorem-style characterization of generative model trade-offs, a principled way to predict dataset difficulty, and a reliable path to non-image domains would each have direct consequences for how the field allocates research effort — and for which applications become tractable next.

GAN research is at a point where the benchmarks are getting easier to beat and the underlying questions are getting harder to ignore. The next meaningful leap probably won't come from another architecture tweak on CelebA.

Generative Adversarial Networks have matured enough to produce photorealistic faces and convincing synthetic video, yet the theoretical scaffolding underneath them remains surprisingly shaky. Several foundational questions — about convergence, evaluation, scaling, and robustness — are still open, and the field's momentum has largely outpaced its rigor.

The Convergence Problem Nobody Has Fully Solved

Proving that a GAN will reliably reach a good solution is genuinely hard. The loss functions for both the generator and discriminator are non-convex with respect to their parameters, which is a problem shared by neural networks broadly. But GANs add a layer of difficulty: both networks are being optimized simultaneously, and that dynamic interaction creates instabilities that standard convergence theory wasn't built to handle.
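A common toy illustration of this instability, not taken from the article, is simultaneous gradient descent-ascent on the bilinear game f(x, y) = x * y. Its only equilibrium is (0, 0), yet the iterates spiral away from it. The function name and the step size below are assumptions chosen for the sketch.

```python
import math


def simultaneous_gda(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y.

    x tries to minimize f, y tries to maximize it, and both players
    update from the same current iterate.
    """
    for _ in range(steps):
        grad_x, grad_y = y, x  # df/dx = y, df/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
    return x, y


x0, y0 = 1.0, 1.0
x1, y1 = simultaneous_gda(x0, y0)
start = math.hypot(x0, y0)
end = math.hypot(x1, y1)
# Each step multiplies the squared distance from (0, 0) by (1 + lr**2),
# so the iterates move away from the equilibrium instead of toward it.
print(start, end)
```

Each player follows its own gradient exactly, and the pair still diverges. That is the sense in which results for single-objective optimization do not carry over directly.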

Three broad lines of attack have shown promise: analyzing simplified GANs under strong assumptions, adapting convergence techniques developed for ordinary neural networks, and modeling training explicitly as a game. None has yet been studied to the point of producing a clean, general answer. When GANs can be proven to converge globally, and which neural network convergence results transfer to the adversarial training setting, remain among the more pressing open problems in the area.

Evaluation Is Unsettled Because the Use Case Is Unsettled

There's no shortage of proposed ways to measure GAN performance. The Inception Score and Fréchet Inception Distance have become relatively common benchmarks, but neither commands universal confidence, and the list of competing proposals is long. The deeper issue is that disagreement about how to evaluate GANs reflects a more fundamental disagreement about what GANs are actually for.
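To make the Fréchet Inception Distance concrete, the sketch below computes the Fréchet distance between Gaussians fitted to two sets of feature vectors. In the real metric the features come from a pretrained Inception network. Here they are placeholder arrays, the function name frechet_distance is hypothetical, and the matrix square root is handled through eigenvalues to keep the example NumPy-only.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_features).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (up to numerical error).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))            # stand-in for real features
fake_feats = real_feats + np.array([3.0, 0, 0, 0])  # same spread, shifted mean
print(frechet_distance(real_feats, real_feats))
print(frechet_distance(real_feats, fake_feats))
```

The score compares only means and covariances of the features, so two sample sets that differ in ways those statistics miss receive the same value. That is one reason the metric does not command universal confidence.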

If the goal is a proper density model — something that assigns meaningful likelihoods across a full data distribution — GANs are probably the wrong tool. Experimental evidence suggests they learn a low-support representation of the target dataset, meaning large portions of a test set may receive effectively zero likelihood under the model. That's a significant limitation for certain applications.

Where GANs do seem well-suited is in perceptual tasks: image synthesis, style transfer, attribute manipulation, infilling. These are domains where human judgment is the real benchmark, even if it's expensive to collect. Classifier two-sample tests offer a cheaper proxy, but they have a known weakness — any systematic defect in the generator, however minor, tends to dominate the result and skew the evaluation. A more robust approach might involve iteratively constructing critics that are blind to the most prominent defect, then the next most prominent, and so on — something analogous to a Gram-Schmidt orthogonalization applied to evaluation. Alternatively, human evaluation costs can be reduced by using predictive models that only escalate to a real human judge when confidence is low.
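A classifier two-sample test can be sketched in a few lines: train a critic to separate real from generated samples and report its held-out accuracy. The version below is a simplified illustration that uses a logistic-regression critic trained by plain gradient descent, where practical versions use a neural network. The function name, hyperparameters, and synthetic data are all assumptions.

```python
import numpy as np


def c2st_accuracy(real, fake, seed=0, epochs=300, lr=0.1):
    """Held-out accuracy of a logistic-regression critic on real vs. fake.

    Accuracy near 0.5 means the critic cannot tell the two sets apart.
    """
    rng = np.random.default_rng(seed)
    x = np.vstack([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    order = rng.permutation(len(x))
    x, y = x[order], y[order]
    split = len(x) // 2
    x_tr, y_tr, x_te, y_te = x[:split], y[:split], x[split:], y[split:]

    # Standardize with training statistics, then append a bias column.
    mean, std = x_tr.mean(axis=0), x_tr.std(axis=0) + 1e-8

    def prep(a):
        return np.hstack([(a - mean) / std, np.ones((len(a), 1))])

    x_tr, x_te = prep(x_tr), prep(x_te)

    w = np.zeros(x_tr.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x_tr @ w)))
        w -= lr * x_tr.T @ (p - y_tr) / len(x_tr)
    preds = (x_te @ w) > 0
    return float((preds == (y_te == 1)).mean())


rng = np.random.default_rng(1)
real = rng.normal(size=(400, 5))
same = rng.normal(size=(400, 5))              # same distribution as real
shifted = rng.normal(loc=2.0, size=(400, 5))  # one systematic defect
acc_same = c2st_accuracy(real, same)
acc_shifted = c2st_accuracy(real, shifted)
print(acc_same, acc_shifted)
```

The shifted case shows the weakness described above: one consistent defect is enough for the critic to separate the sets almost perfectly, whatever the quality of the samples in every other respect.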

Scaling with Large Batches: Promising but Not Proven

Large minibatch training has driven significant gains in image classification. The question of whether the same approach can accelerate GAN training is less clear-cut. On the surface it seems reasonable — the discriminator is, after all, just a classifier. Larger batches reduce gradient noise, which can speed up training when noise is the bottleneck.

The complication is that GANs have an additional failure mode classifiers don't: training divergence. There's some evidence that bigger batches improve quantitative results and cut training time, which would imply gradient noise is a dominant factor. But this hasn't been studied systematically enough to draw firm conclusions.
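The link between batch size and gradient noise can be checked on synthetic data. The sketch below uses made-up scalar "per-example gradients" (a true gradient of 1.0 plus unit Gaussian noise) and measures how the spread of the minibatch mean shrinks as the batch grows. It says nothing about divergence, which is the part specific to GANs. All names and constants are assumptions.

```python
import numpy as np


def minibatch_grad_std(per_example_grads, batch_size, n_batches=2000, seed=0):
    """Empirical std of the minibatch-mean gradient at a given batch size."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(per_example_grads), size=(n_batches, batch_size))
    return float(per_example_grads[idx].mean(axis=1).std())


rng = np.random.default_rng(42)
# Synthetic scalar per-example gradients: true gradient 1.0 plus unit noise.
grads = 1.0 + rng.normal(size=100_000)

for b in (16, 64, 256):
    # The std should fall roughly as 1 / sqrt(batch_size).
    print(b, round(minibatch_grad_std(grads, b), 4))
```

A 16-fold increase in batch size buys only about a 4-fold reduction in noise, so large batches help most when noise really is the bottleneck.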

Optimal Transport GANs are one candidate worth watching here — they theoretically offer better convergence properties and are structurally designed around aligning batches of samples, which makes them a natural fit for very large batch regimes. Asynchronous SGD is another angle worth exploring. GANs appear to benefit from training on past parameter snapshots, which might interact in an interesting way with the stale-gradient dynamics that typically characterize asynchronous training.

What Adversarial Examples Mean for the Discriminator

The discriminator in a GAN is an image classifier, and image classifiers are known to be vulnerable to adversarial examples — small, human-imperceptible perturbations that flip the model's output. Despite the substantial literature on both GANs and adversarial robustness, the intersection of the two has received surprisingly little attention.
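For intuition about how small such perturbations can be, the sketch below attacks a toy linear classifier with a sign-of-gradient step, in the style of the fast gradient sign method. It is not a GAN discriminator. The weights and input are random placeholders, and the function name min_linf_flip and the 1.1 margin are assumptions.

```python
import numpy as np


def min_linf_flip(x, w, b, margin=1.1):
    """Per-coordinate perturbation that flips a linear score s(x) = w @ x + b.

    The gradient of the score with respect to x is w, so stepping each
    coordinate by eps against sign(w) changes the score by eps * sum(|w|).
    eps is chosen just large enough (times margin) to cross zero.
    Assumes the score is non-zero.
    """
    score = w @ x + b
    eps = margin * abs(score) / np.abs(w).sum()
    x_adv = x - eps * np.sign(score) * np.sign(w)
    return x_adv, eps


rng = np.random.default_rng(0)
w = rng.normal(size=1000)  # placeholder classifier weights
x = rng.normal(size=1000)  # placeholder input
x_adv, eps = min_linf_flip(x, w, 0.0)

print("per-coordinate change:", eps)
print("score before:", w @ x, "score after:", w @ x_adv)
```

Because the change in score scales with the sum of absolute weights, the per-coordinate budget needed to flip the decision shrinks as the input dimension grows. This is one common explanation for why high-dimensional image classifiers are vulnerable.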

The specific concern is whether the generator's gradient updates could inadvertently produce outputs that exploit weaknesses in the discriminator — not through a deliberate attack, but as a byproduct of optimization. There are reasons to think this "accidental attack" scenario is less likely than a targeted one: the generator only gets one gradient update before the discriminator is retrained, the batch of prior samples changes at every step, and optimization happens in parameter space rather than pixel space. None of those factors definitively rules it out, though. Deliberate attacks on generative models have been shown to work, and the question of whether something similar emerges organically during training is worth taking seriously.

Taken together, these open problems sketch a field that is technically impressive but theoretically underdeveloped. Progress on convergence proofs, evaluation standards, batch scaling, and adversarial dynamics would each independently strengthen the foundation, and an answer to any one of them would likely reshape how the others are approached.

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from …".

Source: https://distill.pub/2019/gan-open-problems
