Tech Frontier

Weight Banding

Apr 08, 2021

This article is part of the Circuits thread, an experimental format collecting invited short articles and critical commentary delving into the inner workings of neural networks.

Open up any ImageNet conv net and look at the weights in the last layer. You’ll find a uniform spatial pattern to them, dramatically unlike anything we see elsewhere in the network. No individual weight is unusual, but the uniformity is so striking that when we first discovered it we thought it must be a bug. Just as different biological tissue types jump out as distinct under a microscope, the weights in this final layer jump out as distinct when visualized with NMF. We call this phenomenon weight banding.

1. When visualized with NMF, the weight banding in layer mixed_5b is as visually striking compared to any other layer in InceptionV1 (here shown: mixed_3a) as the smooth, regular striation of muscle tissue is when compared to any other tissue (here shown: cardiac muscle tissue and epithelial tissue).

So far, the Circuits thread has mostly focused on studying very small pieces of neural networks: individual neurons and small circuits. In contrast, weight banding is an example of what we call a “structural phenomenon,” a larger-scale pattern in the circuits and features of a neural network. Other examples of structural phenomena are the recurring symmetries we see in equivariance motifs and the specialized slices of neural networks we see in branch specialization. We think of weight banding as a structural phenomenon because the pattern appears at the scale of an entire layer.

Weight banding also seems similar in flavor to the checkerboard artifacts that form during deconvolution.

In addition to describing weight banding, we’ll explore when and why it occurs. We find that there appears to be a causal link between weight banding and whether a model uses global average pooling or fully connected layers at the end, suggesting that weight banding is part of an algorithm for preserving information about larger-scale structure in images. Establishing causal links like this is a step towards closing the loop between practical decisions in training neural networks and the phenomena we observe inside them.

Weight banding consistently forms in the final convolutional layer of vision models with global average pooling.

In order to see the bands, we need to visualize the spatial structure of the weights, as shown below. We typically do this using NMF, as described in Visualizing Weights. For each neuron, we take the weights connecting it to the previous layer. We then use NMF to reduce the number of dimensions corresponding to channels in the previous layer down to 3 factors, which we can map to RGB channels. Since which factor is which is arbitrary, we use a heuristic to make the mapping consistent across neurons. This reveals a very prominent pattern of horizontal stripes. (The stripes aren’t always perfectly horizontal; sometimes they exhibit a slight preference for extra weight in the center of the central band, as seen in some examples below.)
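The reduction described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the figures: the function names `nmf` and `weights_to_rgb` are hypothetical, the NMF is a plain multiplicative-update implementation, and keeping only the positive part of the weights is an assumption made so that the input is non-negative. The heuristic for keeping the factor-to-color mapping consistent across neurons is not reproduced here.

```python
import numpy as np

def nmf(V, rank=3, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) as W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def weights_to_rgb(kernel):
    """Map one neuron's weights [height, width, in_channels] to a [height, width, 3] image."""
    h, w, c = kernel.shape
    # Assumption: NMF needs non-negative input, so only the positive part is kept.
    positive = np.maximum(kernel, 0.0).reshape(h * w, c)
    spatial, _ = nmf(positive, rank=3)          # one 3-vector per spatial position
    spatial = spatial / (spatial.max() + 1e-9)  # scale into [0, 1] for display
    return spatial.reshape(h, w, 3)

# Toy stand-in for the weights of a single neuron with 64 input channels.
rng = np.random.default_rng(1)
kernel = rng.normal(size=(5, 5, 64))
rgb = weights_to_rgb(kernel)
```

Each spatial position of the kernel ends up with three non-negative factor activations, which can be displayed directly as an RGB pixel.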

2. These common networks have pooling operations before their fully connected layers and consistently show banding at their last convolutional layers.

Interestingly, AlexNet does not exhibit this phenomenon.

3. AlexNet does not have a pooling operation before its fully connected layers and does not show banding at its last convolutional layer.

To make it easier to look for groups of similar weights, we sorted the neurons at each layer by similarity of their reduced forms.

Unlike most modern vision models, AlexNet does not use global average pooling. Instead, it has a fully connected layer directly connected to its final convolutional layer, allowing it to treat different positions differently. If one looks at the weights of this fully connected layer, the weights strongly vary as a function of the global y position.

The horizontal stripes in weight banding mean that the filters don’t care about horizontal position, but are strongly encoding relative vertical position. Our hypothesis is that weight banding is a learned way to preserve spatial information as it gets lost through various pooling operations.
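One way to make "cares about vertical but not horizontal position" concrete is to ask how much of a filter's spatial variance is explained by the row index alone. The `banding_score` below is a hypothetical measure introduced only for illustration; it is not a metric used in this article.

```python
import numpy as np

def banding_score(kernel):
    """Fraction of spatial variance explained by vertical (row) position alone.

    kernel: 2D array [height, width]. A score near 1.0 means the values are
    constant along each row (horizontal bands); near 0.0 means the row averages
    carry no information.
    """
    total = kernel.var()
    if total == 0:
        return 0.0
    row_means = kernel.mean(axis=1)  # average over horizontal position
    return float(row_means.var() / total)

# A perfectly banded 5x5 filter: every row is constant.
banded = np.array([[1.0] * 5, [0.0] * 5, [-1.0] * 5, [0.0] * 5, [1.0] * 5])
horizontal_score = banding_score(banded)    # bands run horizontally
vertical_score = banding_score(banded.T)    # same pattern, rotated
```

Here the banded filter scores close to 1, while its transpose, whose stripes run vertically, scores close to 0.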

In the next section, we will construct our own simplified vision network and investigate variations on its architecture in order to understand exactly which conditions are necessary to produce weight banding.

We’d like to understand which architectural decisions affect weight banding. This will involve trying out different architectures and seeing whether weight banding persists. Since we will only want to change a single architectural parameter at a time, we will need a consistent baseline to apply our modifications to. Ideally, this baseline would be as simple as possible.

We created a simplified network architecture with 6 groups of convolutions, separated by L2 pooling layers. At the end, it has a global average pooling operation that reduces the input to 512 values that are then fed to a fully connected layer with 1001 outputs.
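The end of this network can be sketched in NumPy as below. The helper names, the 2x2 non-overlapping pooling window, and the use of a root-mean-square for L2 pooling are all assumptions made for illustration; only the 512 pooled values and the 1001 outputs come from the description above.

```python
import numpy as np

def l2_pool(x, size=2):
    """L2 pooling over non-overlapping size x size windows; x is [H, W, C].

    Assumption: L2 pooling is taken as the root of the mean squared activation.
    """
    h, w, c = x.shape
    x = x[: h - h % size, : w - w % size]
    windows = x.reshape(h // size, size, w // size, size, c)
    return np.sqrt((windows ** 2).mean(axis=(1, 3)))

def gap_head(features, fc_weights, fc_bias):
    """Global average pooling followed by a fully connected layer."""
    pooled = features.mean(axis=(0, 1))  # [H, W, C] -> [C]; position is discarded
    return pooled @ fc_weights + fc_bias

rng = np.random.default_rng(0)
features = l2_pool(rng.normal(size=(14, 14, 512)))  # -> [7, 7, 512]
fc_weights = rng.normal(size=(512, 1001)) * 0.01
fc_bias = np.zeros(1001)

logits = gap_head(features, fc_weights, fc_bias)
# Flipping the feature map vertically does not change the output of this head.
flipped_logits = gap_head(features[::-1], fc_weights, fc_bias)
```

Because the head averages over all positions, it cannot tell a feature map from its vertical flip. Any positional information the classifier uses must already be encoded in the channels before pooling.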

This simplified network reliably produces weight banding in its last layer (and usually in the two preceding layers as well).

5. NMF of the weights in the last layer of the simplified model shows clear weight banding.

In the rest of this section, we’ll experiment with modifying this architecture and its training settings and seeing if weight banding is preserved.

To rule out bugs in training or some strange numerical problem, we decided to do a training run with the input rotated by 90 degrees. This sanity check yielded a very clear result: the resulting weights showed vertical banding instead of horizontal banding. This is a clear indication that banding is a result of properties within the ImageNet dataset which make vertical spatial position (or, in the case of the rotated dataset, horizontal spatial position) relevant.
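The rotation itself is a simple preprocessing step. The sketch below assumes a batch layout of [N, H, W, C] and a counter-clockwise rotation; neither the function name `rotate_batch_90` nor the rotation direction is specified in this article.

```python
import numpy as np

def rotate_batch_90(images):
    """Rotate a batch of images [N, H, W, C] by 90 degrees counter-clockwise."""
    return np.rot90(images, k=1, axes=(1, 2))

# A toy image whose value depends only on vertical position (the row index).
rows = np.arange(4, dtype=np.float32).reshape(1, 4, 1, 1)
image = np.broadcast_to(rows, (1, 4, 4, 1))

rotated = rotate_batch_90(image)
# After rotation, the value depends only on horizontal position (the column index).
```

Structure that was tied to vertical position in the original data becomes tied to horizontal position in the rotated data, which matches the switch from horizontal to vertical bands.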

We removed the global average pooling step in our simplified model, allowing the fully connected layer to see all spatial positions at once. This model did not exhibit weight banding, but it used 49x more parameters in the fully connected layer and overfit to the training set. This is fairly strong evidence that the use of aggressive pooling after the last convolutions in common models causes weight banding. This result is also consistent with AlexNet not showing banding, since it also lacks global average pooling.

We averaged out each row of the final convolutional layer, so that absolute vertical position is preserved but absolute horizontal position is not. (Since this model has 7x7 spatial positions in the final convolutional layer, this modification increases the number of parameters in the fully connected layer by 7x, rather than the 49x of a fully connected layer with no pooling at all.) The banding at the last layer seems to go away, but on closer investigation, clear banding is still visible in layer 5a, similar to the baseline model’s 5b. We found this result surprising.
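The 7x and 49x figures follow directly from the shapes involved. The short calculation below counts fully connected weights only (biases excluded) and uses the 512 channels, 7x7 spatial positions, and 1001 outputs stated in the text; the constant and function names are ours.

```python
CHANNELS = 512   # values per spatial position after the last convolution
SPATIAL = 7      # the final feature map is 7x7
CLASSES = 1001   # outputs of the fully connected layer

def fc_weight_count(inputs, outputs=CLASSES):
    """Number of weights in a fully connected layer (biases excluded)."""
    return inputs * outputs

global_pool = fc_weight_count(CHANNELS)                    # baseline: average over x and y
row_pool = fc_weight_count(SPATIAL * CHANNELS)             # average over x only
no_pool = fc_weight_count(SPATIAL * SPATIAL * CHANNELS)    # no pooling at all

print(global_pool, row_pool // global_pool, no_pool // global_pool)
# prints: 512512 7 49
```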

8. NMF of weights in 5a and 5b in a version of the simplified model modified to have pooling only along the x-axis. Banding is gone from 5b but reappears in 5a!

We tried each of the modifications below, and found that weight banding was still present in each of these variants.

An interactive diagram allowing you to explore the weights for these experiments and more can be found in the appendix.

In the previous section, we observed two interventions that clearly affected weight banding: rotating the dataset by 90º and removing the global average pooling before the fully connected layer. To confirm that these effects hold beyond our simplified model, we decided to make the same interventions to three common architectures (InceptionV1, ResNet50, VGG19) and train them from scratch.

With one exception, the effect holds in all three models.

The one exception is VGG19, where the removal of the pooling operation before its set of fully connected layers did not eliminate weight banding as expected; these weights look fairly similar to the baseline. However, it clearly responds to rotation.

Once we really understand neural networks, one would expect us to be able to leverage that understanding to design more effective neural network architectures. Early papers, like Zeiler et al, emphasized this quite strongly, but it’s unclear whether there have yet been any significant successes in doing this. This hints at significant limitations in our work. It may also be a missed opportunity: it seems likely that if interpretability were useful in advancing neural network capabilities, it would become more integrated into other research and get attention from a wider range of researchers.

It’s unclear whether weight banding is “good” or “bad.” On one hand, the 90º rotation experiment shows that weight banding is a product of the dataset and is encoding useful information into the weights. On the other hand, if spatial information could flow through the network in a different, more efficient way, then perhaps the channels would be able to focus on encoding relationships between features without needing to track spatial positions. We don’t have any recommendation or action to take away from it. However, it is an example of a consistent link between architecture decisions and the resulting trained weights. It has the right sort of flavor for something that could inform architectural design, even if it isn’t particularly actionable itself.

More generally, weight banding is an example of a large-scale structure. One of the major limitations of circuits has been how small-scale it is. We’re hopeful that larger scale structures like weight banding may help circuits form a higher-level story of neural networks.

The simplified network used to study this phenomenon was trained on ImageNet (1.2 million images) for 90 epochs. Training was done on 8 GPUs with a global batch size of 512 for the first 30 epochs and 1024 for the remaining 60 epochs. The network was built using TF-Slim. Batch norm was used on convolutional layers and fully connected layers, except for the last fully connected layer with 1001 outputs.
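As a rough illustration of this schedule, the sketch below computes the batch size per epoch and the approximate number of optimizer steps. It assumes exactly 1,200,000 images (the text gives only "1.2 million"), a final partial batch in each epoch, and an even split of the global batch across the 8 GPUs; the function name is hypothetical.

```python
import math

IMAGES = 1_200_000  # approximation taken from "1.2 million images"
EPOCHS = 90
GPUS = 8

def global_batch_size(epoch):
    """Global batch size for a 0-indexed epoch: 512 for the first 30, then 1024."""
    return 512 if epoch < 30 else 1024

# Assumption: the global batch is split evenly across GPUs.
per_gpu_batches = sorted({global_batch_size(e) // GPUS for e in range(EPOCHS)})

# Assumption: each epoch ends with one partial batch, hence the ceiling.
total_steps = sum(math.ceil(IMAGES / global_batch_size(e)) for e in range(EPOCHS))

print(per_gpu_batches, total_steps)
# prints: [64, 128] 140640
```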

To explore how layer weights are affected by the various attempts to affect banding, we clustered a normalized form of the weights in the experiments discussed above. In this figure, you can explore how the proportion and type of banding changes with the various experiments.

Highlighted labels indicate experiments where weight banding no longer persisted for the given intervention and layer.

The following experiments were discussed in various conversations but have not been run at this time:

As with many scientific collaborations, the contributions are difficult to separate because it was a collaborative effort that we wrote together.

Research. Ludwig Schubert accidentally discovered weight banding, thinking it was a bug. Michael Petrov performed an array of systematic investigations into when it occurs and how architectural decisions affect it. This investigation was done in the context of and informed by collaborative research into circuits by Nick Cammarata, Gabe Goh, Chelsea Voss, Chris Olah, and Ludwig.

Writing and Diagrams. Michael wrote and illustrated a first version of this article. Chelsea improved the text and illustrations, and thought about big picture framing. Chris helped with editing.

We are grateful to participants of #circuits in the Distill Slack for their engagement on this article, and especially to Alex Bäuerle, Ben Egan, Patrick Mineault, Vincent Tjeng, and David Valdman for their remarks on a first draft.

If you see mistakes or want to suggest changes, please create an issue on GitHub.

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

For attribution in academic contexts, please cite this work as
