Getting Started with Circuits: A Beginner's Guide to How Neural Networks Work

Mar 10, 2020

Somewhere between the raw pixels a camera captures and the confident label a neural network outputs lies a vast, largely unmapped territory. A small group of researchers believes that territory can be charted — not by building better explainability dashboards or summary statistics, but by doing something far more painstaking: tracing individual neurons, one connection at a time.

The "zoom in" hypothesis and what it borrows from biology

Science has a recurring pattern. A new tool arrives — a microscope, an x-ray crystallography rig, a particle accelerator — and suddenly the objects of inquiry change entirely. Cellular biology didn't emerge because zoologists got more careful. It emerged because researchers could finally see something they couldn't see before, and that visibility created a new discipline with new questions and new methods.

The researchers behind the Circuits project are betting that deep learning is approaching a similar inflection point. Visualizations of artificial neural networks have already revealed enough structure to raise the question seriously: what if individual neurons, and the weighted connections between them, are not just computational noise but meaningful, interpretable units — the cells of a new biology?

That framing is deliberate. In 1839, Theodor Schwann articulated an early version of cell theory as a set of explicit claims, two of which survived into modern biology and one of which turned out to be completely wrong. In the same spirit, the Circuits researchers put forward three explicit claims about neural networks, knowing full well that some may not hold. The value, they argue, is in stating something falsifiable and testable rather than hedging indefinitely.

Three claims, thousands of hours, and what the neurons actually show

The first and most contested claim is that neural networks are built from meaningful, understandable features. Early layers appear to contain edge and curve detectors. Deeper layers seem to hold representations of floppy ears, wheels, snouts, and fur. The research community is genuinely split on this. One camp treats meaningful neurons as almost self-evident, pointing to a body of literature demonstrating interpretable units across vision and language models. Another camp argues that what looks like semantic understanding is really texture sensitivity or responses to imperceptible statistical patterns — and that the seemingly interpretable neurons researchers celebrate may simply be misread.

After thousands of hours examining individual neurons in InceptionV1, the Circuits team lands firmly in the first camp, while acknowledging the complexity. Neurons that initially appeared inscrutable, they report, typically resolve into something natural and elegant on closer inspection. High-low frequency detectors, for instance, seemed strange at first glance but turned out to encode a straightforward and useful visual distinction.

Beyond individual neurons, the connections between them appear to implement recognizable logic. A circle detector can be watched assembling itself from curve detectors. A dog head representation emerges from eyes, snout, fur, and tongue. The network, in some cases, appears to be running something close to AND, OR, and XOR operations over high-level visual features — not as a design choice by engineers, but as a structure that emerged from training.
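To make the idea of logic "emerging from weights" concrete, here is a minimal toy sketch of how weighted sums followed by a ReLU can behave like AND, OR, and XOR over feature activations in the range 0 to 1. The function names and the specific weights and biases are illustrative assumptions, not values read from InceptionV1.

```python
def relu(x):
    return max(0.0, x)

def soft_and(a, b):
    # Bias of -1 cancels a single active input, so both must be on.
    return relu(a + b - 1.0)

def soft_or(a, b):
    # Sum of inputs, clipped at 1 by subtracting the overshoot.
    return relu(a + b) - relu(a + b - 1.0)

def soft_xor(a, b):
    # Two-layer construction: excited by either input, inhibited by both.
    return relu(a + b) - 2.0 * relu(a + b - 1.0)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, soft_and(a, b), soft_or(a, b), soft_xor(a, b))
```

XOR is the only one of the three that needs two ReLU units, which matches the intuition that it requires both excitation and inhibition working together.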

Why the interpretability field may need to change its unit of analysis

Most interpretability research operates at the level of the whole network, trying to produce compact explanations of overall behavior. The Circuits approach inverts that priority. Rather than asking what the network does in aggregate, it asks what each neuron does specifically — and whether the answer is something a human can actually understand and verify.

This matters because the two approaches have different failure modes. High-level explanations can be plausible-sounding while missing the actual computational mechanisms entirely. Neuron-level analysis is slower and more labor-intensive, but it produces claims that are, in principle, directly checkable. If a neuron is claimed to detect curves, you can test that claim against its actual activation patterns.

The deeper implication is that if features are genuinely understandable at the individual level, then interpretability research has a tractable path forward that doesn't require solving the entire black-box problem at once. Working through a network neuron by neuron is a massive undertaking, but the effort grows roughly in proportion to the number of neurons rather than exploding combinatorially. That's a meaningful difference.

An open collaboration at the frontier of a new field

The Circuits project is framed explicitly as an open scientific collaboration, with the introductory essay serving as a foundation for a series of detailed explorations to follow. The researchers acknowledge they've barely scratched the surface of understanding a single vision model, and they're inviting others to join the work through the Distill Slack community.

The scope of what remains unknown is part of the point. The early microscopists didn't have a complete map of cellular biology — they had a tool, a few striking observations, and a conviction that the level of detail they were now able to access would eventually yield a coherent picture. The Circuits researchers are making a similar bet: that the inner world of neural networks, examined closely enough, will turn out to be structured, legible, and worth the effort it takes to read.

Seven lines of evidence. That's what researchers at Distill needed to convince themselves — and the broader machine learning community — that a cluster of neurons inside InceptionV1 was genuinely doing what it appeared to be doing: detecting curves.

What curve detectors actually are

Inside mixed3b, an early layer of the InceptionV1 architecture, sits a family of neurons that fire in response to curved lines and boundaries with a radius of roughly 60 pixels. They show a mild additional preference for perpendicular lines running along the curve's edge, and they respond more strongly when the two sides of a curve differ in color. No single neuron covers all orientations. Instead, the family divides the work, with each member tuned to a different angle so that together they tile the full 360-degree range.

The distinction matters: these are not neurons that use curves as a component to recognize circles, spirals, or S-shapes. Those exist too, but they're a separate category. A curve detector, as defined here, responds to the curve itself — not to any higher-order shape that happens to contain one.

Building a seven-argument case for what a neuron does

Proving that a neuron detects curves turns out to require more rigor than it might seem. The researchers assembled seven independent lines of evidence, none of which are specific to curves — they form a reusable toolkit applicable to any feature under investigation.

Feature visualization comes first: optimizing an input image to maximally activate a curve detector reliably produces curved shapes. Because every pixel in the resulting image was placed there to drive the neuron's response, this establishes a direct causal link. Dataset examples reinforce this — the ImageNet images that trigger the strongest responses are consistently well-formed curves at the expected orientation, while moderate activations correspond to imperfect or slightly rotated curves.
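The optimization behind feature visualization can be sketched on a toy scale. The example below assumes a hypothetical six-pixel "neuron" that is just a weighted sum of its inputs, and runs gradient ascent on the input while keeping it at unit length. Real feature visualization backpropagates through a deep network and adds regularization, so this is only an illustration of the principle.

```python
import math
import random

def activation(weights, image):
    # Toy neuron: a weighted sum of pixel values.
    return sum(w * p for w, p in zip(weights, image))

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def visualize(weights, steps=200, lr=0.1, seed=0):
    """Gradient ascent on the input, starting from random noise."""
    rng = random.Random(seed)
    image = normalize([rng.gauss(0.0, 1.0) for _ in weights])
    for _ in range(steps):
        # For this linear neuron, d(activation)/d(pixel_i) is weights[i].
        image = [p + lr * w for p, w in zip(image, weights)]
        image = normalize(image)  # keep the input on the unit sphere
    return image

weights = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0]  # hypothetical hand-picked weights
optimized = visualize(weights)
print("activation on optimized input:", activation(weights, optimized))
print("optimized input:", [round(p, 2) for p in optimized])
```

The optimized input ends up pointing in the same direction as the weight vector, which is the toy analogue of a curve detector's visualization looking like a curve.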

Synthetic stimuli add further precision. Curves rendered at varying orientations, curvatures, and against different backgrounds confirm that these neurons fire only near their preferred orientation and stay quiet for straight lines or sharp corners. A rotation test closes another potential gap: rotating a dataset example that fires a given curve detector causes its activation to drop while the detector tuned to the new orientation picks up the signal — exactly what you'd expect from a family of orientation-specific detectors collectively covering all angles.
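The synthetic-stimulus and rotation tests can be illustrated with a small stand-in. The sketch below assumes a family of hand-made arc templates on a 21 by 21 grid, one per preferred orientation, and measures how much of each template a stimulus covers. The grid size, the radius, and every function name are hypothetical choices for illustration. These are not InceptionV1 neurons.

```python
import math

SIZE = 21           # toy image is SIZE x SIZE pixels
CENTER = SIZE // 2
RADIUS = 8.0        # toy radius, far smaller than the real detectors' scale

def blank():
    return [[0.0] * SIZE for _ in range(SIZE)]

def render_arc(theta_deg, span_deg=60):
    """Binary image of a circular arc whose midpoint sits at theta_deg."""
    img = blank()
    for step in range(-span_deg, span_deg + 1):
        phi = math.radians(theta_deg + step)
        x = int(round(CENTER + RADIUS * math.cos(phi)))
        y = int(round(CENTER + RADIUS * math.sin(phi)))
        img[y][x] = 1.0
    return img

def render_line(theta_deg, half_length=8):
    """Straight segment tangent to the arc's midpoint: a zero-curvature control."""
    img = blank()
    phi = math.radians(theta_deg)
    px = CENTER + RADIUS * math.cos(phi)
    py = CENTER + RADIUS * math.sin(phi)
    for t in range(-half_length, half_length + 1):
        x = int(round(px - t * math.sin(phi)))
        y = int(round(py + t * math.cos(phi)))
        if 0 <= x < SIZE and 0 <= y < SIZE:
            img[y][x] = 1.0
    return img

def response(preferred_deg, stimulus):
    """Fraction of the detector's template that the stimulus covers."""
    template = render_arc(preferred_deg)
    overlap = sum(t * s for trow, srow in zip(template, stimulus)
                  for t, s in zip(trow, srow))
    return overlap / sum(sum(row) for row in template)

ORIENTATIONS = range(0, 360, 45)  # one toy detector per preferred angle

def best_detector(stimulus):
    return max(ORIENTATIONS, key=lambda deg: response(deg, stimulus))

curve, rotated = render_arc(0), render_arc(90)
print("detector 0 on its preferred curve:", response(0, curve))
print("detector 0 on a straight line:    ", response(0, render_line(0)))
print("detector 0 on the rotated curve:  ", response(0, rotated))
print("detector that takes over:         ", best_detector(rotated))
```

Rotating the stimulus lowers the response of the original detector and hands the signal to the detector tuned to the new angle, which is the signature the rotation test looks for.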

The final three arguments come from circuit analysis. Reading the weights directly reveals a curve-detection algorithm embedded in the network's connections, with no obvious secondary cause of activation visible in the structure. The downstream neurons that receive input from curve detectors are features that naturally involve curves — circles, spirals, 3D curvature — and they use those inputs in ways consistent with curve detection. Finally, a cleanroom reimplementation, with all weights set by hand according to the researchers' understanding of the algorithm, produces units that meaningfully replicate the behavior of the original curve detectors.

Taken together, these arguments establish three things: curves cause these neurons to fire, each unit is tuned to a specific angular orientation, and any other stimuli capable of triggering them appear to be rare or produce weaker responses. The evidentiary standard, the researchers note, is deliberately aligned with what visual neuroscience has used for decades to characterize neurons in biological brains.

When neurons resist clean interpretation

Curve detectors are, by the researchers' own admission, the kind of feature you might have predicted would exist before looking. High-low frequency detectors are a more instructive case precisely because they weren't anticipated. Found in the same early layers of InceptionV1, they scan one side of their receptive field for low-frequency patterns and the other for high-frequency ones — a simple operation that turns out to be a useful heuristic for detecting object boundaries, particularly when backgrounds are blurred. The same seven-argument framework applies here with minor adjustments, and the researchers suggest this type of feature represents a small but genuine payoff from interpretability work: a natural visual concept the field hadn't thought to name before the network revealed it.

Higher-level features, like a neuron the researchers identify as a pose-invariant dog detector, stretch some of these methods considerably. Feature visualization and dataset examples still carry weight: the visualization produces geometrically impossible but informationally rich dog-head imagery, and dataset examples confirm the pattern. Synthetic 3D-rendered dog heads at varying angles can substitute for the rotation test. But circuit-based analysis becomes increasingly valuable at this level of abstraction, because reading weights does not become harder as features grow more abstract in the way that constructing perceptual stimuli does.

The harder problem is polysemantic neurons: units that respond to multiple, genuinely unrelated inputs. One neuron in InceptionV1 fires for cat faces, cat legs, and the fronts of cars. Feature visualization rules out the possibility that the network has found some subtle shared property: it's looking for whiskers and eyes, furry limbs, and shiny automotive surfaces as distinct cases. These neurons don't break interpretability entirely, but they impose a real ceiling on how far circuit-based reasoning can go. If a neuron carries five meanings and connects to another neuron carrying five meanings, the single weight between them stands in for 5 × 5 = 25 possible feature-to-feature relationships that can't be cleanly separated. The researchers' working hypothesis is that polysemanticity emerges from superposition, a mechanism by which a circuit spreads each feature across many neurons so that it can represent more features than it has neurons available, trading clarity for capacity.

Why circuits change what's possible in neural network analysis

The deeper implication of this work isn't about curves or frequency detectors specifically — it's about what becomes legible when you treat a neural network's internal structure as something worth reading carefully. Neurons connect through weights that form sub-graphs the researchers call circuits, and those circuits, against initial expectations, turn out to be structured and meaningful rather than opaque. Symmetries appear. Algorithms become readable directly from floating-point weights. The gap between "a number in a matrix" and "a step in a recognizable computation" closes in ways that weren't obvious before anyone looked closely enough to find out.

The circuits agenda doesn't promise that every neuron resolves into a clean concept — polysemanticity is a real and unresolved obstacle. But the evidence from curve detectors, high-low frequency detectors, and even the messier high-level features suggests that systematic, multi-method analysis of individual neurons and their connections can yield genuine understanding of what neural networks are computing, not just what they output.

Neural networks, it turns out, may not be the black boxes researchers once feared. A closer look at the internal wiring of vision models like InceptionV1 reveals something unexpected: structured, interpretable circuits that follow consistent logic and, in some cases, elegant geometry.

How Curve Detectors Build on Each Other

At the lower layers of a vision model, curve detectors don't emerge from nothing. They're constructed from simpler line detectors and earlier, less refined curve detectors. The weights connecting an early curve detector to a more sophisticated one tell a clear story: strong positive values arranged along the shape of the curve itself, as if the network is asking, at each point along a curve, "is there a tangent curve here?"

This isn't just a rough approximation. The weight structure is precise enough that curves in the opposite orientation actively inhibit the detector — negative weights suppress conflicting signals. And when two curve detectors share a similar but not identical orientation, the weights shift accordingly, with stronger excitation on whichever side is more aligned. The geometry of the problem is reflected directly in the geometry of the weights.

This phenomenon — where the symmetry of a task is mirrored in the structure of the circuit — is what researchers call an "equivariant circuit." The weights rotate with the orientation of the curve detector, a property that will be explored in greater depth in a dedicated follow-up. For now, the key point is that these weights are not noise. They are meaningful, and they reward careful reading.
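Equivariance can be checked directly on a toy example. The sketch below assumes a hypothetical 3 by 3 weight template with excitation along one diagonal corner and inhibition on the opposite one, then rotates both the weights and the input by a quarter turn. The weights are invented for illustration and only quarter-turn rotations are covered, since those are exact on a pixel grid.

```python
def rot90(grid):
    """Rotate a square grid by a quarter turn."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]

def response(weights, patch):
    # Weighted sum of the input patch (the pre-activation of a toy unit).
    return sum(w * p for wrow, prow in zip(weights, patch)
               for w, p in zip(wrow, prow))

# Hypothetical weights: excitation where the curve should be,
# inhibition where an opposite-facing curve would be.
weights = [[1, 1, 0],
           [1, 0, -1],
           [0, -1, -1]]

# A stimulus that matches the excitatory corner.
patch = [[1, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]

print("original detector, original input:", response(weights, patch))
print("rotated detector, rotated input:  ", response(rot90(weights), rot90(patch)))
print("rotated detector, original input: ", response(rot90(weights), patch))
```

Rotating the weights and the input together leaves the response unchanged, while rotating only the weights produces a detector for a different orientation. That is the sense in which the weights "rotate with" the detector.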

Dog Head Detection and the Logic of "Unioning Over Cases"

Move up a few layers and the circuits become more elaborate. ImageNet models must distinguish between more than a hundred dog breeds, which means they develop substantial internal machinery for recognizing dog-specific features, including heads. One particular circuit stands out for its sophistication.

Across three layers, the network maintains two parallel pathways: one for dog heads facing left, one for dog heads facing right. These pathways don't just run in parallel — they actively suppress each other, sharpening the contrast between the two orientations. Then, at the final step, invariant neurons emerge that respond to both. The network has effectively learned to handle left and right as separate cases, then take a union over them.

What makes this striking is the alternative the network didn't take. It could have built a sloppy invariant detector — one that simply looked for a loose collection of eyes, fur, and snout without caring about their arrangement. Instead, gradient descent produced something far more structured: a circuit with XOR-like properties, where the two orientations inhibit each other before being unified. Even the spatial details hold up under scrutiny. The regions of excitation extend outward from the center in orientation-specific directions, allowing snouts from both left- and right-facing heads to converge at the same point in the invariant representation.
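The difference between "unioning over cases" and a sloppy invariant detector can be sketched with two scalar inputs. The example below assumes hypothetical evidence scores for a left-facing and a right-facing head. The mutual inhibition and the final sum are a simplified stand-in for the three-layer circuit, not a reconstruction of its actual weights.

```python
def relu(x):
    return max(0.0, x)

def pose_invariant_head(left_evidence, right_evidence):
    # Each pathway is excited by its own orientation and inhibited by the other.
    left_head = relu(left_evidence - right_evidence)
    right_head = relu(right_evidence - left_evidence)
    # Union over cases: fire if either orientation-specific pathway survives.
    return relu(left_head + right_head)

def sloppy_head(left_evidence, right_evidence):
    # The alternative: add up parts without caring how they are arranged.
    return relu(left_evidence + right_evidence)

cases = {"clear left": (1.0, 0.0), "clear right": (0.0, 1.0), "jumbled": (1.0, 1.0)}
for name, (left, right) in cases.items():
    print(name, pose_invariant_head(left, right), sloppy_head(left, right))
```

The structured version responds to either orientation but stays silent when the evidence for both is equally strong, while the sloppy version responds most strongly to exactly that jumbled case.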

Superposition: Why Pure Neurons Get Deliberately Blurred

In the mixed4c layer of InceptionV1, there's a neuron that cleanly detects cars — wheels at the bottom of its convolutional window, windows at the top. Clean, interpretable, exactly what you'd hope to find. But what happens next is counterintuitive.

Rather than passing that clean car signal forward into another dedicated car detector, the model distributes the feature across neurons that are primarily doing something else — dog detection, specifically. The car feature gets folded into polysemantic neurons that serve double duty.

The implication is significant: polysemantic neurons aren't just an accidental byproduct of training. The model had a pure representation and chose, in effect, to mix it. The likely reason is efficiency. Cars and dogs rarely appear in the same image, which means the model can store both features in overlapping neural real estate without meaningful interference. This is superposition — a consequence of high-dimensional geometry, where a space that supports only n orthogonal vectors can accommodate exponentially more vectors that are merely close to orthogonal.
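The geometric point can be probed numerically. The sketch below assumes random unit vectors as stand-ins for feature directions (a trained model's directions are learned, not random) and places twice as many of them as there are dimensions, then reports the worst-case overlap between any pair. The dimension and feature counts are arbitrary illustrative choices.

```python
import math
import random

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_cosine(vectors):
    """Largest |cosine| between any two distinct unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst

rng = random.Random(0)
DIM = 100          # number of "neurons"
N_FEATURES = 200   # twice as many feature directions as neurons
features = [random_unit_vector(DIM, rng) for _ in range(N_FEATURES)]
interference = max_abs_cosine(features)
print(f"{N_FEATURES} directions in {DIM} dims, worst |cosine| = {interference:.2f}")
```

Only 100 directions can be exactly orthogonal in this space, yet 200 random ones stay far from parallel. If two features rarely fire together, as with cars and dogs, that residual overlap seldom causes trouble.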

What Recurring Patterns Suggest About Neural Network Universality

Across InceptionV1 and other architectures, the same abstract patterns keep surfacing: equivariance in curve detectors, unioning over cases in pose-invariant detectors, superposition in feature storage. These aren't isolated quirks — they look more like circuit motifs, a concept borrowed from biology, where recurring structural patterns in transcription networks or biological neural circuits give researchers leverage across many different systems at once.

The broader question this raises is whether these motifs are universal — whether the same features and circuits form reliably across different model architectures and training datasets. Prior work has shown that different networks develop highly correlated neurons and learn similar hidden representations, which is suggestive. But correlation between neurons isn't the same as demonstrating that the same specific features are forming. A fur texture detector and a dog body detector might be highly correlated without being the same thing.

The honest answer, for now, is that the evidence is anecdotal. Researchers have observed consistent low-level features forming across AlexNet, InceptionV1, InceptionV3, residual networks, and models trained on Places365 rather than ImageNet. That's a meaningful spread of architectures and training conditions. But a rigorous comparative study — one that characterizes specific features and tracks their weight structure across multiple models — hasn't been done yet.

The universality hypothesis looks plausible, maybe even likely. Whether it holds for high-level features as reliably as it does for low-level ones remains an open question, and probably the most important one this line of research will need to answer.

Robert Hooke didn't set out to revolutionize science when he pointed a microscope at a piece of cork. He drew what he saw. That simple act — careful observation, rigorously documented — gave the world its first picture of a cell and, eventually, an entirely new branch of biology. Researchers studying the internal mechanics of neural networks find themselves in a strikingly similar position today.

The Universality Question and What It Means for Neuroscience

One of the more provocative threads running through neural network interpretability research is the universality hypothesis: the idea that different models, trained independently on different tasks, tend to develop the same internal features. If that holds broadly, the implications stretch well beyond machine learning.

Researchers working at the boundary of neuroscience and deep learning have already demonstrated that units inside artificial vision models can serve as useful proxies for modeling biological neurons. Some features identified in artificial networks — curve detectors, for instance — appear to have direct counterparts in biological visual systems. That parallel is hard to dismiss. One particularly striking possibility raised by researchers in this space: artificial networks might predict the existence of biological features that haven't been found yet. High-low frequency detectors have been floated as a candidate. If a prediction like that were confirmed experimentally, it would constitute unusually strong evidence that the universality hypothesis reflects something real and deep about how learning systems organize information, regardless of substrate.

For circuits research specifically, universality isn't just philosophically interesting — it determines what kind of work is even worth doing. A world where every model develops its own idiosyncratic internal structure would force researchers to study a small number of commercially important models and hope those models don't change too fast. A world where universality holds strongly, by contrast, opens the door to something more ambitious: a kind of periodic table of visual features, catalogued systematically across architectures. The truth is probably somewhere between those poles, but where exactly matters enormously for how the field allocates its effort.

Interpretability as a Pre-Paradigmatic Field

Thomas Kuhn's The Structure of Scientific Revolutions draws a sharp distinction between "normal science," where a community shares a paradigm, agreed-upon methods, and common standards of evaluation, and the often-frustrating pre-paradigmatic period before any of that exists. Kuhn didn't mean "pre-paradigmatic" as a compliment. Such fields are characterized by researchers struggling to agree on what they're even studying, let alone how to study it.

That description maps uncomfortably well onto interpretability research right now. There's no consensus on the core objects of study, no shared methodology, and no standard for evaluating results. Ian Goodfellow put it plainly in a recent interview: "For interpretability, I don't think we even have the right definitions."

Two proposals for evaluation have emerged from adjacent fields. Researchers with deep learning backgrounds tend to push for interpretability benchmarks — quantitative measures of how well a given method works. Those coming from human-computer interaction lean toward user studies. Both approaches import standards from elsewhere rather than developing ones native to the problem.

A third option, less commonly pursued, is to treat neural networks the way a biologist treats an organism: as objects of empirical investigation, where claims are specific, testable, and falsifiable. The difficulty is that making robustly true statements about a neural network as a whole is genuinely hard. These are extraordinarily complex systems, and it's not obvious what the most interesting empirical questions about them even are. As a result, evaluation tends to drift toward usefulness — does this interpretability method help practitioners? — rather than truth: are we actually learning accurate things about how the network works?

Why Circuits Offer a Tractable Path Forward

The circuits approach sidesteps some of these problems by narrowing the scope of inquiry dramatically. Rather than attempting to characterize an entire model, circuits research focuses on small subgraphs — specific nodes and weighted edges — where rigorous empirical claims become tractable. A circuit is falsifiable in a concrete sense: if you genuinely understand one, you should be able to predict what changes when you edit the weights. For small enough circuits, the analysis shades into mathematical reasoning rather than empirical approximation.

The tradeoff is obvious. Statements about circuits are narrow. They don't immediately tell you how a whole model behaves. But the bet implicit in this research program is that model-level behavior can, with enough work, be decomposed into circuit-level statements — and that circuits could therefore serve as an epistemic foundation for the field, the way cells became a foundation for biology.

There's an anxiety in the interpretability community, not entirely unfounded, that the work isn't taken seriously enough — that it's too qualitative, too descriptive, not sufficiently rigorous. The history of the microscope suggests that anxiety might be misplaced. Hooke's Micrographia was, at its core, a collection of drawings. The discovery of cells was a qualitative result. It still changed everything.

Source: https://distill.pub/2020/circuits/zoom-in
