Why Adversarial Examples Are Features, Not Flaws: What Mislabeled Data Reveals About How Neural Networks Really Learn

Aug 06, 2019

What happens when you train a neural network on data it was never supposed to learn from: images with wrong labels, labels predicted by a model trained on an entirely different dataset, or examples specifically designed to fool a model? The answer, as a recent set of experiments demonstrates, is stranger and more revealing than most researchers expected.

Training on Mistakes: The Mislabeled Dataset Experiment

The starting point is a deceptively simple challenge to intuition. A ResNet-18 model is trained on CIFAR-10 for two epochs, reaching roughly 63% accuracy on both training and test sets. The model is then used to relabel all 50,000 training images according to its own predictions — but only the wrong predictions are kept. This leaves a dataset of 18,768 images, each carrying an incorrect label, with the original pixel data untouched.
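The relabel-and-filter step described above can be sketched in a few lines. This is a minimal illustration rather than the researchers' code: `build_mislabeled_set`, the toy inputs, and the `toy_predict` stand-in for the partially trained ResNet-18 are all hypothetical names introduced here. The final lines check the reported numbers: keeping 18,768 of 50,000 images implies a training accuracy of about 62.5%, consistent with the "roughly 63%" figure.

```python
from typing import Callable, List, Sequence, Tuple


def build_mislabeled_set(
    inputs: Sequence[object],
    true_labels: Sequence[int],
    predict: Callable[[object], int],
) -> List[Tuple[object, int]]:
    """Relabel each input with the model's prediction and keep only the wrong ones.

    The inputs themselves are left untouched; only the labels change.
    """
    mislabeled = []
    for x, y in zip(inputs, true_labels):
        y_hat = predict(x)
        if y_hat != y:  # discard every correct prediction
            mislabeled.append((x, y_hat))
    return mislabeled


# Hypothetical toy stand-in for a partially trained classifier.
def toy_predict(x: int) -> int:
    return x % 3


toy_inputs = list(range(10))
toy_labels = [0, 1, 2, 0, 1, 1, 0, 0, 2, 2]
toy_mislabeled = build_mislabeled_set(toy_inputs, toy_labels, toy_predict)

# Sanity check on the figures quoted in the article:
# 18,768 wrong predictions out of 50,000 training images.
kept_fraction = 18_768 / 50_000
implied_train_accuracy = 1 - kept_fraction  # about 0.625
```

In a real run, `predict` would wrap the two-epoch model's argmax output, and every pair in the returned list would carry a label that disagrees with the ground truth.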

A fresh ResNet-18, initialized from scratch and trained exclusively on this corrupted dataset for 50 epochs, achieves 49.7% accuracy on the original test set. The model has never encountered a correctly labeled example, yet it generalizes well above chance on a test set built around the correct labels. That result alone demands explanation.

To stress-test the phenomenon further, the researchers extended it to a cross-domain setting. A CNN trained on MNIST to 99.1% accuracy was used to label images from FashionMNIST — producing combinations like a dress assigned the digit "8." A new CNN trained only on this semantically nonsensical FashionMNIST data reached 91% accuracy on the MNIST test set. With basic normalization applied, that figure rose to 94.5%. The original task's images were never used. The original labels were never seen. And yet the knowledge transferred.
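The cross-domain setup is, mechanically, hard-label distillation on inputs the teacher was never trained on. The toy below is a deliberately tiny one-dimensional analogue, not a reproduction of the MNIST and FashionMNIST experiment: the teacher is a fixed threshold rule, the "transfer" inputs are an arbitrary grid, and the student is a threshold fitted only to the teacher's labels on that grid. All names and values (`teacher`, `fit_threshold`, `TEACHER_THRESHOLD`, the input lists) are assumptions made for illustration.

```python
from typing import List, Sequence

TEACHER_THRESHOLD = 0.33  # hypothetical decision rule the teacher has "learned"


def teacher(x: float) -> int:
    """Stand-in for a trained model's argmax prediction."""
    return 1 if x > TEACHER_THRESHOLD else 0


def fit_threshold(xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fit a 1-D threshold student: midpoint between the two labeled groups."""
    highest_zero = max(x for x, y in zip(xs, ys) if y == 0)
    lowest_one = min(x for x, y in zip(xs, ys) if y == 1)
    return (highest_zero + lowest_one) / 2


# Transfer inputs: a grid unrelated to the original task's data.
transfer_inputs: List[float] = [i / 10 for i in range(-50, 51)]
transfer_labels: List[int] = [teacher(x) for x in transfer_inputs]

# The student sees only (transfer input, teacher label) pairs.
student_threshold = fit_threshold(transfer_inputs, transfer_labels)


def student(x: float) -> int:
    return 1 if x > student_threshold else 0


# Evaluate on points standing in for the original task, never used in training.
original_task_inputs = [-2.0, -0.5, 0.1, 0.9, 1.7, 4.2]
agreement = sum(
    student(x) == teacher(x) for x in original_task_inputs
) / len(original_task_inputs)
```

The student recovers the teacher's decision rule to within the grid spacing, even though it never saw the teacher's training data or any ground-truth label. The experiments in the article show the same effect at much larger scale, with image classifiers in place of a threshold.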

Why This Works: Model Distillation Through the Back Door

The explanation the researchers offer reframes what looks like a paradox as a known, if indirect, mechanism: model distillation. When a trained model generates incorrect labels, those labels are not random noise. They encode information about the features the model relies on to make decisions. A model that associates "green backgrounds" with frogs will mislabel many green images as frogs. A new model trained on those mislabeled examples will pick up exactly that association, and that association, imperfect as it is, still carries enough signal about the real world to generalize.

This is distillation without the usual scaffolding. There is no explicit teacher-student framework, no soft probability outputs, no carefully designed transfer procedure. The knowledge leaks through the errors themselves. The perturbations added to adversarial examples — the focus of the original Ilyas et al. paper — turn out to be unnecessary for this transfer to occur. The core mechanism works even on unmodified images with purely mislabeled targets.

A clean two-dimensional illustration reinforces the point. A small feed-forward network trained on 32 two-dimensional points with random binary labels achieves perfect training accuracy. Adversarial examples, points perturbed just enough to flip the model's predictions, are generated and used to train a second network. That second network, despite never seeing an input paired with its original label, classifies 23 of the 32 original points in agreement with their original labels, and its decision boundary loosely mirrors the original model's. The geometry of the learned space has been transferred through wrong answers.
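To make "perturbed just enough to flip the model's predictions" concrete, here is a minimal sketch for a linear classifier, where the smallest such perturbation has a closed form. This swaps in a linear model for the small feed-forward network used in the experiment, so it illustrates the idea rather than the researchers' procedure. The weights, bias, points, and helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed distance-like score of a linear classifier: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    return 1 if score(w, b, x) > 0 else 0


def minimal_flip(
    w: Sequence[float], b: float, x: Sequence[float], overshoot: float = 0.01
) -> List[float]:
    """Move x along -w (or +w) just past the decision boundary.

    Projecting onto the boundary takes a step of score / ||w||^2 along w;
    the small overshoot pushes the point to the other side.
    Assumes x does not already lie exactly on the boundary.
    """
    norm_sq = sum(wi * wi for wi in w)
    step = (1 + overshoot) * score(w, b, x) / norm_sq
    return [xi - step * wi for xi, wi in zip(x, w)]


# Hypothetical trained linear model and a handful of 2-D points.
w, b = [1.0, -2.0], 0.5
points = [[2.0, 0.0], [-1.0, 1.0], [0.0, 3.0], [1.0, 0.5]]

# Each adversarial example is paired with the flipped prediction as its label.
adversarial_set: List[Tuple[List[float], int]] = []
for x in points:
    x_adv = minimal_flip(w, b, x)
    adversarial_set.append((x_adv, predict(w, b, x_adv)))
```

A second model trained on `adversarial_set` would see only perturbed points carrying the flipped labels, which is the training signal the second network receives in the two-dimensional experiment.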

What This Means for the Adversarial Robustness Debate

The original Ilyas et al. paper argued that adversarial examples are not bugs in neural networks but features — that the perturbations which fool models are exploiting real, predictive structure in the data, even if that structure is imperceptible to humans. The experiments discussed here extend that argument into new territory while also complicating it.

The original authors' response engages directly with the distillation framing, clarifying that since these results hold across different architectures, the transfer cannot be happening at the level of shared weights. What is being distilled are features — the patterns a model has learned to associate with categories. And crucially, the original authors argue, this feature distillation only works because adversarial examples are constructed by flipping features that are genuinely useful for classification. If adversarial perturbations targeted arbitrary or classification-irrelevant directions in the input space, the distilled model would recover nothing meaningful and would fail to generalize. The fact that it does generalize is, in their reading, further evidence that adversarial vulnerability is tied to real predictive structure.

Both sets of findings converge on the same uncomfortable conclusion: what a model gets wrong can be almost as informative as what it gets right. The errors are structured, the structure reflects learned features, and those features — for better or worse — describe something true about the data. That has direct implications for how researchers think about model evaluation, dataset curation, and the relationship between robustness and generalization in deep learning systems.

The mislabeled FashionMNIST experiment, perhaps more than any other result here, captures the depth of the puzzle. A model trained only on images of clothing, each tagged with a digit by a network that had learned MNIST, recovers handwritten digit recognition without ever seeing a digit. The knowledge was always in the errors; it just needed somewhere to go.
