How InceptionV1 Processes Images in Its Early Vision Layers

Apr 01, 2020

Neural networks process images through layers of increasingly abstract feature detectors — but what exactly are those features, and how do they build on each other? A new research effort from the Circuits project takes a systematic look at the first five convolutional layers of InceptionV1, mapping out over a thousand neurons into organized "neuron groups" to give researchers a navigable map of early visual processing.

This article is part of the Circuits thread, a collection of short articles and commentary by an open scientific collaboration exploring the inner workings of neural networks.

From raw pixels to proto-heads in five layers

The five convolutional layers leading up to InceptionV1's third pooling layer cover a surprising amount of ground. Starting from raw pixel input, the network progressively builds up to boundary detection, basic shape recognition — including curves, circles, spirals, and triangles — and even crude detectors for very small heads. Along the way, researchers found Complex Gabor detectors that echo classic neuroscience findings on "complex cells," alongside more unexpected features like black-and-white versus color detectors and small circle formation from curves.

Early vision is a natural starting point for this kind of investigation. The circuits are shallow, the neuron count is manageable, and the features are relatively interpretable. There's also a stronger case for universality here — the hypothesis that the same features and circuits emerge across different architectures and training tasks seems most likely to hold in these early layers.

Why taxonomy matters before theory

The researchers draw an explicit parallel to Dmitri Mendeleev's development of the Periodic Table — an act of patient organization that preceded any deep theoretical understanding of atomic structure. The same logic applies in biology, where species taxonomies long predated genetics or evolutionary theory. Organizing phenomena has value even without a complete explanatory framework behind it.

That's the spirit behind the neuron group categorization here. Many neurons in vision models cluster into families — a dozen units detecting the same feature at different orientations or colors, for instance. More striking is that these families appear to recur across different models entirely. Gabor filters and color contrast detectors in the first layer are well-established, but seeing this pattern persist into later layers was unexpected.

The categorization is explicitly ad-hoc and human-defined. Some families are suspected to reflect genuine structure in the network; others are categories of convenience or low-confidence groupings. The primary goal is orientation — giving researchers a foothold when confronting a thousand unfamiliar neurons.

How this differs from automated approaches like Net Dissect

The manual approach taken here contrasts with tools like Net Dissect, which correlates neurons with a pre-defined feature taxonomy and groups them into broad categories like color, texture, and object. That method scales well and removes subjectivity, but it has real limitations. Correlation can mislead, and a fixed taxonomy will always miss unanticipated feature types by definition.

A telling example: in Net Dissect's analysis of a related model, a unit in mixed3b that appears to detect left-oriented whiskers ends up classified primarily as a cat detector. The correlation is technically defensible, but it obscures what the neuron is actually doing. For novel features — like high-low frequency detectors — no pre-defined taxonomy can surface them at all. Manual investigation remains the only reliable discovery method.
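
The whisker case can be reproduced in miniature. The sketch below is a toy, not Net Dissect's code or data: it assumes units are labeled by an overlap score (intersection over union) between where a unit fires and where each pre-defined concept appears, and every mask and name in it is made up for illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def best_label(unit_mask, concept_masks):
    """Return the concept whose mask overlaps the unit's active region most."""
    scores = {name: iou(unit_mask, mask) for name, mask in concept_masks.items()}
    return max(scores, key=scores.get), scores

# Toy 6x6 "image": a cat occupies the left half, grass the right half.
cat = np.zeros((6, 6), dtype=bool)
cat[:, :3] = True
grass = ~cat

# Hypothetical unit that fires only on a small whisker region inside the cat.
whisker_unit = np.zeros((6, 6), dtype=bool)
whisker_unit[2:4, 0:2] = True

# The fixed taxonomy has no "whisker" concept, so "cat" is the best match on offer.
label, scores = best_label(whisker_unit, {"cat": cat, "grass": grass})
print(label, scores)
```

The unit gets the label "cat" with a low overlap score, because the right answer was never among the candidates.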

The researchers suggest a future hybrid approach: known features get sorted automatically as the taxonomy grows, while human investigators focus attention on genuinely novel units. This becomes especially practical if the universality hypothesis holds across architectures.

Reading the circuits: feature visualizations and weight maps

To represent individual neurons without relying on opaque index numbers, the project uses feature visualizations — optimized images that maximally stimulate a given neuron. These aren't claimed to fully capture a neuron's behavior; they function more like variable names in code, replacing an arbitrary number with a meaningful symbol.
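
As a rough illustration of what "optimized to maximally stimulate a neuron" means, here is a minimal sketch under heavy assumptions: the "neuron" is a single linear filter over a flattened 5x5 patch with random weights, and the image is kept at unit norm. Real feature visualization optimizes through the whole network and adds regularization, none of which is shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "neuron": a linear filter over a flattened 5x5 patch.
weights = rng.normal(size=25)

def activation(image):
    return float(weights @ image)

# Start from unit-norm noise.
image = rng.normal(size=25)
image /= np.linalg.norm(image)
start = activation(image)

# Gradient ascent on the activation with respect to the image.
for _ in range(100):
    grad = weights                     # d(activation)/d(image) for a linear unit
    image = image + 0.1 * grad
    image /= np.linalg.norm(image)     # project back onto the unit sphere

end = activation(image)
alignment = float(image @ weights / np.linalg.norm(weights))
print(start, end, alignment)
```

For this toy unit the optimized image simply converges to the filter itself, which is why `alignment` approaches 1. For a neuron deep in a real network the result is far less obvious, which is what makes the visualization informative.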

Circuits are presented as a neuron alongside the previous-layer units it connects to most strongly, with weights drawn in a color map where red is positive and blue is negative. In the original article, clicking any neuron's visualization shows the 50 units it connects to most strongly in both the previous and the next layer, giving an unfiltered view of the weight structure. A worked example traces how a circle detector in mixed3a is assembled from earlier curve detectors and a primitive circle unit; a deeper discussion is deferred to later in the series.
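
One plausible way to rank connections by strength is sketched below. The ranking metric (the L2 norm of each spatial kernel), the weight layout, and the tiny random layer are all assumptions made for illustration; the article does not specify how "strongest" is computed.

```python
import numpy as np

def strongest_inputs(weights, unit, k=50):
    """Rank previous-layer channels by the L2 norm of their kernel into `unit`.

    `weights` is assumed to have shape (out_channels, in_channels, height, width).
    """
    kernels = weights[unit]                                # (in, h, w)
    strength = np.sqrt((kernels ** 2).sum(axis=(1, 2)))    # one norm per input channel
    order = np.argsort(-strength)[:k]
    return [(int(i), float(strength[i])) for i in order]

rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=(4, 8, 3, 3))   # hypothetical tiny layer
w[2, 5] = 1.0                                    # plant a strong positive kernel
w[2, 1] = -0.5                                   # and a strong negative one

top = strongest_inputs(w, unit=2, k=3)
print(top)
```

Looking forward instead of backward would mean slicing the next layer's weights as `weights[:, unit]`. The sign of each kernel entry is what a red and blue color map would display.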

The first convolutional layer, conv2d0, follows the pattern seen across virtually every vision model the team has examined: units split predominantly between color-contrast detectors and Gabor filters. What happens in the layers that follow is where things get more interesting — and considerably harder to map without a guide like this one.

Neural networks learn to see the world in stages — and tracing those stages through InceptionV1 reveals something surprisingly close to how biological vision works, built not by design but by gradient descent chasing a loss function.

Why early layers look messy, and what that tells us

The first convolutional layer of InceptionV1 doesn't produce the clean, textbook-perfect Gabor filters and color contrast detectors you'd expect from a well-trained vision model. The features are, frankly, noisy. The most likely explanation is that gradients struggled to propagate cleanly to the early layers during training, a known problem in deep networks that predate techniques like batch normalization and the Adam optimizer. When you compare InceptionV1 to its TF-Slim rewrite, which does use BatchNorm, the difference is stark: the rewritten version produces crisp, clearly separated Gabor filters, color detectors, and center-surround units. On that evidence, the messiness looks like an artifact of training conditions rather than something intrinsic to the architecture.

One structural detail worth understanding here: Gabor filters almost universally appear in pairs of opposing polarity. A single Gabor can only detect edges at certain offsets, but its negative counterpart fills the gaps. Together, they enable the next layer to build more complex, position-tolerant edge detectors — which is exactly what starts happening in conv2d1.
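
A small numerical sketch makes the pairing concrete. The filter size, wavelength, and envelope width below are arbitrary choices, not values taken from InceptionV1; the point is only that a rectified Gabor responds to one edge polarity and its negative responds to the other.

```python
import numpy as np

def gabor(size=9, wavelength=6.0, sigma=2.5, phase=0.0):
    """Vertically oriented Gabor: a Gaussian envelope times a sinusoid along x."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * np.sin(2 * np.pi * x / wavelength + phase)

def response(filt, patch):
    """Dot product followed by ReLU, like one conv unit at one position."""
    return float(max((filt * patch).sum(), 0.0))

positive = gabor()
negative = -positive                      # same filter, opposite polarity

# Step edges through the patch centre (illustrative values).
columns = np.sign(np.arange(-4, 5))       # -1 on the left, +1 on the right
dark_to_light = np.tile(columns, (9, 1)).astype(float)
light_to_dark = -dark_to_light

for name, patch in [("dark_to_light", dark_to_light), ("light_to_dark", light_to_dark)]:
    print(name, response(positive, patch), response(negative, patch))
```

Each edge activates exactly one member of the pair. A later unit that sums both therefore sees the edge regardless of its polarity.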

From simple edges to shape precursors: the first three layers

conv2d1 is where the network starts producing what visual neuroscience calls complex cell behavior — neurons that respond to similar patterns as the layer before them, but with less sensitivity to exact position or orientation. The clearest example is the "Complex Gabor" family: unlike simple Gabors, these units don't care which side of an edge is dark or light, or precisely where the edge falls. They achieve this by pooling responses from multiple Gabor filters of similar orientation, including reciprocal pairs with inverted contrast. It's an early instance of what researchers describe as a "union over cases" computation.
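
The "union over cases" idea can be sketched with a handful of hand-built filters. Everything here is an assumption for illustration: four same-orientation Gabors a quarter cycle apart, equal positive pooling weights, and a sinusoidal grating as the stimulus. The network's learned weights will not be this tidy.

```python
import numpy as np

def grid(size=9):
    half = size // 2
    return np.mgrid[-half:half + 1, -half:half + 1]

def gabor(phase, wavelength=6.0, sigma=2.5):
    y, x = grid()
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * np.sin(2 * np.pi * x / wavelength + phase)

def grating(phase, wavelength=6.0):
    _, x = grid()
    return np.sin(2 * np.pi * x / wavelength + phase)

def simple(filt, patch):
    """Rectified response of one simple Gabor unit."""
    return max(float((filt * patch).sum()), 0.0)

# Phases 0 and pi are negatives of each other, as are pi/2 and 3*pi/2.
bank = [gabor(p) for p in (0.0, np.pi / 2, np.pi, 3 * np.pi / 2)]

def complex_gabor(patch):
    """Union over cases: sum the rectified responses with equal positive weights."""
    return sum(simple(f, patch) for f in bank)

# Slide the grating through one full cycle and record both responses.
shifts = np.linspace(0.0, 2 * np.pi, 8, endpoint=False)
simple_curve = [simple(bank[0], grating(s)) for s in shifts]
complex_curve = [complex_gabor(grating(s)) for s in shifts]
print(simple_curve)
print(complex_curve)
```

The single Gabor falls silent for roughly half of the shifts, while the pooled unit stays active at every shift. That is the position and contrast tolerance the paragraph above describes.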

Because conv2d1 is a 1x1 convolution, each of its channels connects to each channel of the previous layer through a single scalar weight, with no spatial extent. That constraint shapes what features can emerge: a unit here can only recombine the previous layer's responses at one position. In models with larger second-layer convolutions, cruder approximations of higher-level features tend to appear earlier.
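
In array terms, a 1x1 convolution is a per-pixel mixing of channels. The sketch below uses made-up sizes (6 input channels, 3 output channels, a 4x4 map) and random values, and it omits bias and nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
activations = rng.normal(size=(6, 4, 4))   # (in_channels, height, width)
weights = rng.normal(size=(3, 6))          # one scalar per (output, input) channel pair

# A 1x1 convolution mixes channels independently at every spatial position.
output = np.einsum("oi,ihw->ohw", weights, activations)

# Equivalent view: at any one pixel it is an ordinary matrix-vector product.
pixel = weights @ activations[:, 2, 1]
print(output.shape, np.allclose(output[:, 2, 1], pixel))
```

Because the weight matrix has no height or width, a unit in such a layer cannot look at neighboring positions. Any spatial structure has to come from the layer below.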

By conv2d2, the network is assembling Gabor responses into rudimentary shapes. About 25% of units in this layer behave as line detectors — preferring a single extended line over a repeating Gabor pattern. Alongside them appear tiny curve detectors, corner detectors, divergence detectors, and even a single primitive circle detector. You can actually see the assembly process in feature visualizations: curves are built from small piecewise Gabor segments stitched together. Texture and color detectors also start becoming a significant presence here, including units that look for different textures on opposite sides of their receptive field.

Mixed layers and the emergence of real structure

mixed3a marks a meaningful jump in feature diversity. Curve detectors and high-low frequency detectors — both discussed in prior work — appear here, but so do several less-documented circuit types. Black and white detectors emerge for the first time: rather than comparing one color to its complement, these units detect the absence of color entirely, computing something close to a logical NOT over a set of color features. This matters practically — greyscale images correlate with specific ImageNet categories, and these detectors appear to be part of how the network handles them.
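
The "logical NOT over color features" can be sketched as a unit with a positive bias and inhibitory weights from every color detector. The weights, bias, and activation values below are hypothetical and chosen only to show the shape of the computation.

```python
import numpy as np

def black_and_white_unit(color_activations, bias=1.0):
    """Fire only when all color features are quiet: ReLU(bias - sum of responses)."""
    inhibition = -np.ones(len(color_activations))   # hypothetical all-negative weights
    return float(max(bias + inhibition @ color_activations, 0.0))

grey_patch = np.array([0.0, 0.0, 0.0])   # no color detector responds
red_patch = np.array([2.0, 0.1, 0.0])    # a red-sensitive detector responds strongly

print(black_and_white_unit(grey_patch))  # 1.0
print(black_and_white_unit(red_patch))   # 0.0
```

Any sufficiently strong color response pushes the sum below zero and the ReLU silences the unit, so it stays active only in the absence of color.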

Shape complexity also increases. Small circle and eye detectors form by combining the early curve and circle detectors from conv2d2. Triangle detectors assemble from line and shifted-line detectors. In practice, though, those triangle detectors often end up functioning as general multi-edge detectors downstream, or as components in detecting convex boundaries — their "triangleness" is less important than their sensitivity to angular junctions.

mixed3b sits at an awkward boundary between low-level and mid-level vision. It contains color center-surround units that still feel primitive alongside object boundary detectors and early head detectors that clearly don't. The boundary detectors are particularly interesting: they're not just more refined edge detectors. They integrate multiple cues — including the high-low frequency detectors from mixed3a — to signal transitions between objects, largely independent of the direction of that transition. The high-low frequency detectors from the previous layer appear to exist specifically to feed this computation.

Curve-based features also grow more elaborate in mixed3b: circles, S-shapes, spirals, divots, and what the researchers call "evolutes" — units tuned to curves facing away from the center of the receptive field. And in a detail that reflects ImageNet's heavy dog content, oriented fur detectors appear, built by assembling fur precursors from mixed3a so their responses converge in a directionally specific way.

What this architecture reveals about how vision gets built

The progression through these layers isn't arbitrary. Gradient descent has no reason to keep features that nothing uses; it reinforces what later layers find useful. Read that way, a feature in early vision exists because something downstream relies on it. The high-low frequency detectors in mixed3a appear to exist because boundary detectors in mixed3b use them. The Gabor pairs in the first layer feed the complex Gabors in conv2d1. Tracing these dependencies backward gives a functional explanation for why each feature exists, not just a description of what it looks like.
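
Tracing dependencies backward amounts to a walk over the weight graph. The sketch below assumes the weights have already been summarized as one number per connection. The unit names and values are invented placeholders, apart from the boundary and high-low frequency relationship described in the text.

```python
def upstream(unit, graph, k=2):
    """Follow the k largest-magnitude inputs of each unit back toward the input.

    `graph` maps a unit name to {input_name: weight}; units without an entry
    are treated as leaves.
    """
    tree = {}
    frontier = [unit]
    while frontier:
        current = frontier.pop()
        inputs = graph.get(current, {})
        strongest = sorted(inputs, key=lambda name: abs(inputs[name]), reverse=True)[:k]
        if strongest and current not in tree:
            tree[current] = strongest
            frontier.extend(strongest)
    return tree

# Hypothetical weights, for illustration only.
graph = {
    "mixed3b:boundary": {
        "mixed3a:high_low_frequency": 0.9,
        "mixed3a:other_cue": 0.3,
        "mixed3a:weak_input": 0.05,
    },
    "mixed3a:high_low_frequency": {
        "conv2d2:unit_a": 0.6,
        "conv2d2:unit_b": -0.7,
        "conv2d2:unit_c": 0.05,
    },
}

tree = upstream("mixed3b:boundary", graph)
for unit, inputs in tree.items():
    print(unit, "<-", inputs)
```

Ranking by absolute value keeps strong inhibitory inputs in the trace, since a large negative weight is as much a part of the circuit as a large positive one.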

This also means the messiness in InceptionV1's earliest layers isn't just a training artifact to dismiss — it's a signal about what happens when optimization pressure doesn't reach far enough back. Better training techniques don't change the logic of what gets built; they just make the construction cleaner. The underlying computational motifs — edge pairs, union-over-cases, spatial assembly of simpler features — appear to be robust properties of how convolutional networks solve visual recognition, not quirks of a particular architecture.

Each feature family discussed here — curves, boundaries, fur detectors, black-and-white units — represents a thread that can be pulled much further. The overview is useful for orientation, but the real questions are specific: exactly what curve geometries trigger a curve detector, how it behaves on edge cases, and precisely how it's assembled from the layer below. Those are the questions that turn a taxonomy into a mechanistic understanding of vision.

What does it actually mean to "understand" a neural network? That question sits at the heart of a research effort that has spent considerable time mapping the earliest layers of InceptionV1, one of the most studied convolutional neural networks in computer vision. The result is a working taxonomy of low-level visual features — organized, downloadable, and openly debated.

A taxonomy built neuron by neuron

The project originated with a deceptively simple goal: make sure every single neuron in InceptionV1 had received at least some deliberate attention. Chris Olah drove that effort, working through the network's early layers and drawing on a shared body of knowledge the Clarity team had built up over time around InceptionV1's internal behavior. Nick Cammarata contributed particularly detailed investigations of specific units, while infrastructure work by Ludwig Schubert and Michael Petrov made the whole analysis possible. The resulting taxonomy covers five layers (conv2d0, conv2d1, conv2d2, mixed3a, and mixed3b) and is available as downloadable JSON files for anyone who wants to dig in directly.
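
For readers who want to work with such files, a first step might look like the sketch below. The schema is an assumption: the layout of the published JSON is not described here, so the `layer`, `units`, `index`, and `family` keys and the sample entries are hypothetical stand-ins.

```python
import json
from collections import Counter

# Hypothetical schema for illustration only; the published files may differ.
SAMPLE = """
{
  "layer": "conv2d2",
  "units": [
    {"index": 0, "family": "line"},
    {"index": 1, "family": "line"},
    {"index": 2, "family": "tiny curve"},
    {"index": 3, "family": "color contrast"}
  ]
}
"""

def family_shares(taxonomy):
    """Fraction of a layer's units assigned to each feature family."""
    counts = Counter(unit["family"] for unit in taxonomy["units"])
    total = sum(counts.values())
    return {family: count / total for family, count in counts.items()}

shares = family_shares(json.loads(SAMPLE))
print(shares)
```

With a real file, `json.loads(SAMPLE)` would be replaced by `json.load` on an open file handle, and the key names adjusted to whatever the file actually contains.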

The taxonomy groups neurons into what the researchers call "feature families" — clusters of units that seem to detect related visual patterns. But the team is candid about the limits of that framing. These groupings may reflect something real and fundamental about how low-level vision works, or they may simply be a categorization that feels intuitive to human observers without mapping cleanly onto the network's actual computational structure.

The open questions that matter more than the answers

What makes this work interesting beyond its immediate findings is the list of questions it leaves unresolved. Do feature families form because of something deep about the geometry of visual information, or are they an artifact of how humans like to organize things? Do the same families reliably appear across different model architectures, or is this taxonomy specific to InceptionV1? And perhaps most provocatively — is there something like a "periodic table" of low-level visual features waiting to be discovered?

That last question carries real weight. The periodic table analogy isn't just rhetorical. It implies the possibility of a principled, predictive structure — one where knowing a feature exists in one model would let you anticipate related features, or where gaps in the taxonomy would point toward features not yet identified. Whether that kind of structure exists in neural networks is genuinely unknown, and the researchers frame it as a direction worth pursuing rather than a claim they're making.

There's also the question of depth. This work focuses on early vision — the first few layers where networks tend to learn edges, textures, and simple frequency patterns. Whether later layers, which encode far more abstract representations, admit a similar taxonomic treatment is an open problem. The complexity scales quickly, and the interpretability tools that work reasonably well on conv2d0 may not transfer cleanly to deeper, more entangled representations.

Why interpretability research at this level still matters

Mechanistic interpretability — the project of understanding not just what a network does but how it does it — remains one of the harder and less glamorous corners of AI research. Most of the field's attention goes to capabilities: what models can do, how large they can scale, what benchmarks they can clear. Work like this sits in a different register. It asks whether we can build a legible account of the internal structure of a trained network, feature by feature, layer by layer.

The Circuits thread, of which this article is a part, is an ongoing open scientific collaboration aimed at exactly that. The work is published through Distill, with diagrams and text available under a Creative Commons Attribution (CC BY 4.0) license and source code on GitHub. Errors and suggested changes can be submitted directly via GitHub issues, a workflow that reflects the collaborative, iterative nature of the project.

The researchers acknowledge a long list of contributors and reviewers, including Brice Menard, Sophia Sanborn, Daniel Filan, and others who shaped the final text. For academic citation, the work follows standard attribution practices for Distill publications. The taxonomy files themselves — conv2d0 through mixed3b — are the most concrete artifact the project leaves behind, a starting point for anyone who wants to push these questions further rather than treat them as settled.
