A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Adversarial Example Researchers Need to Expand What is Meant by 'Robustness'
This article is part of a discussion of the Ilyas et al. paper “Adversarial examples are not bugs, they are features”. You can learn more in the main discussion article.
The hypothesis in Ilyas et al. is a special case of a more general principle that is well accepted in the distributional robustness literature — models lack robustness to distribution shift because they latch onto superficial correlations in the data. Naturally, the same principle also explains adversarial examples, because they arise from a worst-case analysis of distribution shift. To obtain a more complete understanding of robustness, adversarial example researchers should connect their work to the more general problem of distributional robustness rather than remaining solely fixated on small gradient perturbations.
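To make the "worst-case analysis of distribution shift" concrete, here is a minimal sketch of the fast-gradient-sign step on a toy linear classifier. The model, inputs, and budget below are illustrative assumptions, not taken from the paper; for a linear model the worst-case ℓ∞ step has a closed form, which the final check verifies.

```python
import numpy as np

# Toy linear classifier: all numbers here are illustrative assumptions.
rng = np.random.default_rng(0)
w = rng.normal(size=5)   # weights of a toy linear model
x = rng.normal(size=5)   # a "clean" input
y = 1.0                  # its label in {-1, +1}

def margin(x):
    # Positive margin = correct classification.
    return y * (w @ x)

# For a linear model, the loss gradient w.r.t. x is proportional to -y*w,
# so the worst case within an l_inf ball of radius eps is the sign step:
eps = 0.1
x_adv = x - eps * np.sign(y * w)   # move against the margin, coordinate-wise

# For a linear model the margin drops by exactly eps * ||w||_1.
drop = margin(x) - margin(x_adv)
```

The point of the closed form is that the "adversary" here is just the worst direction inside a tiny, fixed perturbation set — a very narrow slice of the distribution shifts a deployed model actually faces.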
The main hypothesis in Ilyas et al. (2019) happens to be a special case of a more general principle that is commonly accepted in the robustness to distributional shift literature. Given the plethora of useful correlations that exist in natural data, we should expect that our models will learn to exploit them. However, models relying on superficial statistics can generalize poorly should these same statistics become corrupted after deployment. To obtain a more complete understanding of model robustness, adversarial example researchers should connect their work to the more general problem of distributional robustness.
How, then, can the research community create models that robustly generalize in the real world, given that adversarial training can harm robustness to distributional shift? To do so, the research community must take a broader view of robustness and accept that adversarial robustness is highly limited and mostly detached from security and real-world robustness.
Response Summary: The demonstration of models that learn from high-frequency components of the data is interesting and nicely aligns with our findings. Even though susceptibility to noise could indeed arise from non-robust useful features, this kind of brittleness (akin to adversarial examples) of ML models has so far been predominantly viewed as a consequence of model “bugs” that will be eliminated by “better” models. Finally, we agree that our models need to be robust to a much broader set of perturbations — expanding the set of relevant perturbations will help identify even more non-robust features and further distill the useful features we actually want our models to rely on.
Response: The fact that models can learn to classify correctly based purely on the high-frequency component of the training set is neat! This nicely complements one of our takeaways: models will rely on useful features even if these features appear incomprehensible to humans.
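As a sketch of what "classifying from the high-frequency component" means, the split can be done with a radial mask in Fourier space. The cutoff radius and random image below are illustrative assumptions; the only claim verified is that the low- and high-frequency parts add back up to the original image.

```python
import numpy as np

def split_frequencies(img, radius):
    """Split a grayscale image into low/high-frequency parts via a radial FFT mask.
    The cutoff `radius` is an illustrative free parameter."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    low_mask = np.hypot(yy - h // 2, xx - w // 2) <= radius  # disk around DC
    lo = np.real(np.fft.ifft2(np.fft.ifftshift(np.where(low_mask, f, 0))))
    hi = np.real(np.fft.ifft2(np.fft.ifftshift(np.where(low_mask, 0, f))))
    return lo, hi

img = np.random.default_rng(0).random((32, 32))
lo, hi = split_frequencies(img, radius=4)
```

A training set built from `hi` alone discards everything a human would recognize, yet — per the finding discussed above — can still carry label-relevant, useful features.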
Also, while non-robustness to noise can be an indicator of models using non-robust useful features, this is not how the phenomenon was predominantly viewed. More often than not, the brittleness of ML models to noise was instead regarded as an innate shortcoming of the models, e.g., due to poor margins. (This view is even more prevalent in the adversarial robustness community.) Thus, it was often expected that progress towards “better”/”bug-free” models will lead to them being more robust to noise and adversarial examples.
Finally, we fully agree that the set of ℓ_p-bounded perturbations is a very small subset of the perturbations we want our models to be robust to. Note, however, that the focus of our work is human alignment — to that end, we demonstrate that models rely on features sensitive to patterns that are imperceptible to humans. Thus, the existence of other families of incomprehensible but useful features would provide even more support for our thesis — identifying and characterizing such features is an interesting area for future research.
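For reference, an ℓ∞-bounded perturbation set is simply a box constraint, and enforcing it is a one-line projection (clipping). The values of `eps` and the example points below are illustrative assumptions.

```python
import numpy as np

def project_linf(x_adv, x_clean, eps):
    """Project a perturbed point back into the l_inf ball of radius eps
    around x_clean, i.e. clip each coordinate of the perturbation."""
    return x_clean + np.clip(x_adv - x_clean, -eps, eps)

x = np.zeros(4)                                  # illustrative clean input
x_adv = np.array([0.5, -0.02, 0.03, -0.5])       # illustrative perturbed input
x_proj = project_linf(x_adv, x, eps=0.05)
```

The simplicity of this constraint set is exactly the point of the exchange above: it is easy to optimize over, but it captures only a sliver of the perturbations real deployments encounter.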
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.