A Discussion of 'Adversarial Examples Are Not Bugs, They ...

This article is part of a discussion of the Ilyas et al. paper “Adversarial examples are not bugs, they are features”. You can learn more in the main discussion article .

Ilyas et al. define a feature as a function $f$ that takes $x$ from the data distribution $(x,y) \sim \mathcal{D}$ into a real number, restricted to have mean zero and unit variance. A feature is said to be useful if it has high correlation with the label. But in the presence of an adversary Ilyas et al. argues the metric that truly matters is a feature’s robust usefulness,

its correlation with the label while under attack. Ilyas et al. suggests that in addition to the pedestrian, robust features we know and love (such as the color of the sky), our models may also be taking advantage of useful, non-robust features, some of which may even lie beyond the threshold of human intuition. This begs the question: what might such non-robust features look like?

Our search is simplified when we realize the following: non-robust features are not unique to the complex, nonlinear models encountered in deep learning. As Ilyas et al observe, they arise even in the humblest of models — the linear one. Thus, we restrict our attention to linear features of the form:

The robust usefulness of a linear feature admits an elegant decomposition This $\begin{aligned} \mathbf{E}\left[\inf_{\|\delta\|\leq\epsilon}yf(x+\delta)\right] & =\mathbf{E}\left[yf(x)+\inf_{\|\delta\|\leq\epsilon}yf(\delta)\right]\\ & =\mathbf{E}\left[yf(x)+\inf_{\|\delta\|\leq\epsilon}y\frac{a^{T}\delta}{\|a\|_{\Sigma}}\right]\\ & =\mathbf{E}\left[yf(x)+\frac{\inf_{\|\delta\|\leq\epsilon}a^{T}\delta}{\|a\|_{\Sigma}}\right]=\mathop{\mathbf{E}[yf(x)]}-\epsilon\frac{\|a\|_{*}}{\|a\|_{\Sigma}} \end{aligned}$ into two terms:

In the above equation $\|\cdot\|_*$ deontes the dual norm of $\|\cdot\|$ . This decomposition gives us an instrument for visualizing any set of linear features $a_i$ in a two dimensional plot.

The elusive non-robust useful features, however, seem conspicuously absent in the above plot. Fortunately, we can construct such features by strategically combining elements of this basis.

It is surprising, thus, that the experiments of Madry et al. (with deterministic perturbations) do distinguish between the non-robust useful features generated from ensembles and containments. A succinct definition of a robust feature that peels these two worlds apart is yet to exist, and remains an open problem for the machine learning community.

Response Summary: The construction of explicit non-robust features is very interesting and makes progress towards the challenge of visualizing some of the useful non-robust features detected by our experiments. We also agree that non-robust features arising as “distractors” is indeed not precluded by our theoretical framework, even if it is precluded by our experiments. This simple theoretical framework sufficed for reasoning about and predicting the outcomes of our experiments We also presented a theoretical setting where we can analyze things fully rigorously in Section 4 of our paper.. However, this comment rightly identifies finding a more comprehensive definition of feature as an important future research direction.

Response: These experiments (visualizing the robustness and usefulness of different linear features) are very interesting! They both further corroborate the existence of useful, non-robust features and make progress towards visualizing what these non-robust features actually look like.

We also appreciate the point made by the provided construction of non-robust features (as defined in our theoretical framework) that are combinations of useful+robust and useless+non-robust features. Our theoretical framework indeed enables such a scenario, even if — as the commenter already notes — our experimental results do not. (In this sense, the experimental results and our main takeaway are actually stronger than our theoretical framework technically captures.) Specifically, in such a scenario, during the construction of the $\widehat{\mathcal{D}}_{det}$ dataset, only the non-robust and useless term of the feature would be flipped. Thus, a classifier trained on such a dataset would associate the predictive robust feature with the wrong label and would thus not generalize on the test set. In contrast, our experiments show that classifiers trained on $\widehat{\mathcal{D}}_{det}$ do generalize.

Overall, our focus while developing our theoretical framework was on enabling us to formally describe and predict the outcomes of our experiments. As the comment points out, putting forth a theoretical framework that captures non-robust features in a very precise way is an important future research direction in itself.

Shan Carter (design overhaul), Preetum (technical discussion), Chris Olah (technical discussion), Ludwig (overall feedback), Ria (feedback) Aditiya (feedback)

Writing & Diagrams: The text was initially drafted by…

If you see mistakes or want to suggest changes, please create an issue on GitHub.

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

For attribution in academic contexts, please cite this work as

A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Two Examples of Useful, Non-Robust Features

Related Articles

AI ‘actor’ Tilly Norwood put out the worst song I’ve ever heard

Ford’s new AI assistant will help fleet owners know if seatbelts are being used

Zendesk acquires agentic customer service startup Forethought