
Exploring Neural Networks in Motion: How the Grand Tour Makes High-Dimensional Data Visible

Mar 16, 2020

Visualizing what happens inside a neural network has long been one of the harder problems in machine learning research — not because the tools don't exist, but because most of them sacrifice interpretability for visual appeal. A team of researchers is now making the case that an older, largely overlooked technique called the Grand Tour deserves a second look, particularly when it comes to understanding how deep networks learn, layer by layer.

Why the Grand Tour Still Has Something to Offer

First introduced in 1985, the Grand Tour is a visualization method designed for high-dimensional point clouds. It works by generating a smoothly changing, random rotation of a dataset and projecting it down to two dimensions — both fundamentally linear operations. That linearity is precisely what makes it unfashionable by today's standards. Methods like t-SNE and UMAP dominate modern practice because they handle complex, nonlinear structure in data more aggressively. But that aggressiveness comes at a cost: when the visualization changes, it's often unclear what change in the underlying data caused it.

The researchers argue this is a meaningful tradeoff, not just a theoretical one. Their core claim is that the Grand Tour's linearity enables a specific kind of reasoning — one where you can predict how the visualization would shift if the data had been different in a particular way. That property, which they describe as data-visual correspondence, is what makes the method especially suited to neural network analysis.

Three Windows Into Neural Network Behavior

The paper presents three concrete use cases, each targeting a different aspect of how neural networks operate. The first focuses on the training process itself: watching how the network's activations on the data evolve across epochs when projected through the Grand Tour. The second examines layer-to-layer behavior, tracking how input data transforms as it moves through successive layers of a network. The third tackles adversarial examples: how they're constructed, and why they're so effective at fooling a classifier that otherwise performs well.

To ground the work empirically, the team trained deep neural network models on three standard image classification benchmarks: MNIST, which contains grayscale images of the 10 handwritten digits; Fashion-MNIST, which contains grayscale images of ten categories of clothing and accessories; and CIFAR-10, a set of RGB images spanning ten object classes. The architectures used are intentionally simpler than state-of-the-art models, but the researchers contend they're representative enough to demonstrate both the strengths of the approach and the limitations of more conventional visualization methods.

What This Means for Neural Network Interpretability

Deep neural networks have a well-documented interpretability problem. They routinely top benchmarks like the ImageNet Large Scale Visual Recognition Challenge, yet the reasoning behind their decisions remains opaque — a gap that makes debugging training runs and auditing model behavior genuinely difficult. Most existing visualization approaches address this by examining how a network responds to individual inputs, whether real images or synthesized ones. That's useful up to a point, but it doesn't say much about the relationship between examples, or how the network's internal state shifts over time.

What the Grand Tour approach offers is context. Rather than isolating a single activation pattern, it frames the network's behavior relative to everything around it — how one training epoch compares to the next, how classification confidence builds or collapses as data flows through layers. Deep networks aren't linear systems, but the researchers point out that their nonlinearity is often concentrated in a small number of operations. That leaves enough linear structure to make the Grand Tour's consistency guarantees meaningful in practice.

The broader implication is that interpretability research doesn't always need newer, more complex tools. Sometimes the right move is revisiting an older method and asking whether its constraints — the ones that made it seem limited — are actually the properties that make it trustworthy for a specific job.

Neural networks can look intimidating on paper, but the architecture described here follows a logical, layered progression that becomes intuitive once you break it down component by component. The model in question is a classic image classification network — the kind that forms the backbone of computer vision systems — built from a small set of well-understood operations stacked in sequence.

The Building Blocks: Convolutions, Pooling, and Fully-Connected Layers

The network opens with convolutional layers, which are the workhorses of any vision-oriented model. Rather than processing an entire image at once, a convolutional layer slides a small kernel — a grid of learnable weights — across the input, computing weighted sums of local regions at each position. This gives the network a way to detect spatial patterns like edges, textures, and shapes regardless of where they appear in the image.
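As a rough illustration of the sliding-kernel idea, the sketch below computes an unpadded, stride-1 2D convolution in plain Python (strictly a cross-correlation, which is what deep learning libraries typically compute under that name). The function name `conv2d_valid`, the toy image, and the edge-detecting kernel are all assumptions made up for this example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the local region currently under the kernel.
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 1x2 kernel that responds to left-to-right increases in intensity.
kernel = [[-1, 1]]

edges = conv2d_valid(image, kernel)
```

Every row of `edges` is nonzero only at the position where the kernel straddles the edge, which is the sense in which the same weights detect a pattern wherever it appears.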

Interspersed with the convolutional layers are max-pooling operations. These downsample the feature maps by taking the maximum value within small windows, reducing spatial resolution while retaining the most prominent activations. The practical effect is twofold: the network becomes more computationally efficient, and it gains a degree of translation invariance — small shifts in the input don't dramatically change the output.
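A minimal sketch of max-pooling, assuming non-overlapping 2x2 windows (a common choice, though the article does not state the window size). The helper name `max_pool2d` and the sample feature map are illustrative only.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max-pooling: the stride equals the window size."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            # Collect the values in the current window and keep the largest.
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool2d(feature_map)  # 4x4 input becomes a 2x2 output
```

Here the 4x4 map shrinks to `[[4, 2], [2, 8]]`: a quarter of the values, with the strongest activation in each window kept.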

Fully-connected layers appear later in the pipeline. Where convolutional layers exploit local spatial structure, fully-connected layers treat every input neuron as potentially relevant to every output neuron. Mathematically, this is a matrix multiplication: the input vector gets linearly transformed into an output vector of a different dimension. By the time data reaches these layers, it has already been compressed into a rich feature representation by the earlier convolutional and pooling stages.
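The matrix-multiplication view can be sketched in a few lines. The weight values below are arbitrary, and the sketch omits the bias term that practical layers usually add, to stay close to the description above.

```python
def fully_connected(weights, x):
    """Compute W x, where `weights` holds one row per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Hypothetical layer mapping a 3-dimensional input to a 2-dimensional output.
W = [
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
]
y = fully_connected(W, [2.0, 4.0, 6.0])
```

Each output entry is a weighted sum over every input entry, which is what "every input neuron is potentially relevant to every output neuron" means in practice.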

Why ReLU Is the Activation Function That Stuck

Sandwiched between the linear layers are ReLU activations: Rectified Linear Units, popularized for deep networks by Nair and Hinton. The function itself is almost disarmingly simple: f(x) = max(0, x). Negative values get zeroed out; positive values pass through unchanged.

That simplicity is precisely the point. Earlier activation functions like sigmoid and tanh were prone to vanishing gradients — a problem where error signals shrink to near-zero as they propagate back through many layers, effectively stalling learning. ReLU sidesteps this by maintaining a constant gradient of 1 for positive inputs. The result is faster training and better performance on deep architectures, which is why it became the default choice across the field.
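To make the gradient argument concrete, here is a toy comparison of how local derivatives multiply across depth. It is a deliberately simplified sketch: it ignores the weights entirely and assumes every layer sees the same pre-activation value of 2.0, neither of which holds in a real network.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Derivative is 1 for positive inputs and 0 otherwise.
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # never exceeds 0.25


# Product of the activation derivatives along a chain of 10 layers.
depth = 10
relu_chain = relu_grad(2.0) ** depth
sigmoid_chain = sigmoid_grad(2.0) ** depth
```

Under these assumptions `relu_chain` stays at 1.0 while `sigmoid_chain` shrinks by many orders of magnitude, which is the vanishing-gradient effect in miniature.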

Softmax at the Output: Turning Scores Into Probabilities

The network terminates in a softmax layer, which converts raw output scores into a proper probability distribution. For each class score $y_i$, softmax computes $e^{y_i} / \sum_j e^{y_j}$: the exponential of that score divided by the sum of exponentials across all classes. Every output value ends up between 0 and 1, and the values sum to 1, making them directly interpretable as class probabilities.

This matters because the network needs to commit to a prediction. A raw score of 4.2 for "cat" and 3.8 for "dog" doesn't tell you much about confidence. After softmax, if those are the only two classes, the scores translate to roughly 60% cat and 40% dog, a much more useful signal, both for making predictions and for computing cross-entropy loss during training.
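The cat-versus-dog numbers can be checked with a short softmax sketch. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result; treating the problem as having only these two classes is an assumption of the example.

```python
import math


def softmax(scores):
    # Shift by the max so exp() never sees a large positive argument.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Raw scores for "cat" and "dog" from the example above.
probs = softmax([4.2, 3.8])
```

The outputs depend only on the difference between scores (0.4 here), so adding the same constant to both raw scores would leave the probabilities unchanged.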

What This Architecture Tells Us About How Deep Learning Actually Works

The sequence — convolution, pooling, ReLU, fully-connected, softmax — isn't arbitrary. Each component addresses a specific limitation of the one before it. Convolutions handle spatial structure efficiently. Pooling manages scale and computation. ReLU keeps gradients alive through depth. Fully-connected layers integrate global information. Softmax produces actionable outputs.

This modular, composable design philosophy is what made deep learning so productive as a research paradigm. Individual components can be swapped, scaled, or rearranged, and the overall framework still holds. Understanding this base architecture is essentially a prerequisite for making sense of every more complex variant that followed — from ResNets to transformers — because they all extend or modify these same fundamental ideas rather than replacing them wholesale.

The diagram accompanying this architecture isn't just a visual aid — it's a map of how raw pixel data gets progressively abstracted into a confident classification, one operation at a time.

Neural networks may dazzle with their classification abilities, but strip away the complexity and what you find underneath is a sequence of relatively straightforward mathematical operations — and that simplicity is exactly what makes visualizing them both tractable and surprisingly revealing.

How Neural Networks Represent Data Internally

At the input level, images enter a network as 2D arrays, with scalar values for grayscale and RGB triples for color. These can always be flattened into a single vector of dimension $w \cdot h \cdot c$. The same vector framing applies to intermediate activations: the outputs of any given layer can be treated as points in $\mathbb{R}^n$, where $n$ is the neuron count for that layer. The softmax output, for instance, is a 10-dimensional vector of positive reals that sum to one, a compact probabilistic summary of the network's confidence across classes.
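Flattening is simple enough to show directly. The sketch assumes the image is stored as nested lists indexed by row, column, and channel; the 2x2 RGB image is made up for illustration.

```python
def flatten_image(image):
    """Flatten an h x w x c nested list into one vector of length w * h * c."""
    return [value for row in image for pixel in row for value in pixel]


# Hypothetical 2x2 RGB image, so h = 2, w = 2, c = 3.
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]
vector = flatten_image(image)  # a single point in 12-dimensional space
```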

Most operations inside a network belong to one of two families: linear transformations like fully-connected or convolutional layers, and simple component-wise nonlinearities like sigmoid and ReLU. Sigmoid computes $S(x)=\frac{e^{x}}{e^{x}+1}$ for each entry $x$ of its input vector, which traces an S-shaped curve. A handful of operations, among them max-pooling (which takes the maximum over a region of its input) and softmax, don't fit neatly into either category and deserve separate treatment.

What Training Loss Curves Actually Tell You — and What They Hide

Training a deep neural network means iteratively adjusting parameters to push a scalar loss function downward, typically via gradient descent. Plotting that loss over epochs gives researchers a quick pulse check on whether learning is progressing. For an MNIST digit classifier, the overall loss curve trends downward as expected — but two anomalies stand out: the curve flattens noticeably around epochs 14 and 21 before resuming its descent.

Breaking the loss down by class explains everything. The network doesn't learn all digit classes on the same schedule. Digits 0, 2, 3, 4, 5, 6, 8, and 9 are picked up early. Digit 1 doesn't click until around epoch 14. Digit 7 holds out until epoch 21. Those plateaus in the aggregate curve are the model's struggle with specific classes, invisible until you disaggregate the data. The aggregate view is useful for a quick sanity check, but it can mask exactly the kind of class-specific dynamics that matter most during debugging.
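Disaggregating a loss by class takes only a few lines. The sketch below assumes cross-entropy loss and uses invented predictions for a 3-class problem in which one class has not yet been learned; none of the numbers come from the MNIST experiment.

```python
import math
from collections import defaultdict


def per_class_loss(probabilities, labels):
    """Return (mean cross-entropy per true class, overall mean cross-entropy)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for probs, label in zip(probabilities, labels):
        sums[label] += -math.log(probs[label])
        counts[label] += 1
    by_class = {c: sums[c] / counts[c] for c in sums}
    overall = sum(sums.values()) / len(labels)
    return by_class, overall


# Made-up predicted probabilities; the third example (true class 1) is misclassified.
probabilities = [
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.6, 0.2, 0.2],
    [0.1, 0.1, 0.8],
]
labels = [0, 0, 1, 2]
by_class, overall = per_class_loss(probabilities, labels)
```

The overall mean looks moderate, while `by_class[1]` is several times larger than the other classes: the same kind of signal that the aggregate MNIST curve hides.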

The Limits of Dimensionality Reduction for Fine-Grained Behavior

When per-class loss plots aren't enough, or when you don't yet know which classes to watch, researchers turn to visualizations of neuron activations across all examples simultaneously. If the final layer had only two neurons, a scatter plot would do the job. But softmax layers in typical classification problems are 10-dimensional or higher, which rules out direct plotting. Showing pairs of dimensions doesn't scale: the number of possible charts grows quadratically with dimensionality, so even a 10-dimensional layer already has 45 distinct pairs of axes.

The standard answer is dimensionality reduction — techniques like t-SNE and UMAP that compress high-dimensional data into two dimensions while trying to preserve local structure. These methods are genuinely impressive at clustering similar points together in a readable 2D projection. But they have a real blind spot when it comes to tracking fine-grained behavioral changes over time.

Take the MNIST digit 1 and 7 phenomenon again. Projections using t-SNE, Dynamic t-SNE, and UMAP were computed at the epochs where the class-specific learning transitions occur. Even knowing exactly what to look for — a shift from misclassification to correct classification for digits 1 and 7 around epochs 14 and 21 — the signal is barely detectable. In the UMAP projection, digit 1 forms a new tentacle-like cluster between epoch 13 and 14, but only a careful, targeted inspection reveals it. Without prior knowledge of which classes to scrutinize, a researcher scanning these plots would almost certainly miss the transition entirely.

Why the Visualization Gap Matters for Understanding Deep Learning

The core tension here isn't just aesthetic — it has practical consequences for how researchers diagnose and improve models. Aggregate metrics compress away the very heterogeneity that often signals something interesting or problematic. Dimensionality reduction recovers some of that structure but trades away temporal and class-level resolution in the process. Neither approach, on its own, gives a complete picture of what a network is actually learning and when.

The MNIST example is deliberately simple, which makes the gap more striking: if standard visualization tools struggle to surface a two-class learning delay in a 10-class toy problem, the challenge only compounds in production-scale models with hundreds of output classes and billions of parameters. Building better diagnostic tools — ones that can surface class-specific dynamics without requiring the analyst to already know what they're looking for — remains an open and consequential problem in the field.

The vector representation of neural network data is mathematically elegant and computationally convenient, but translating that representation into human-readable insight requires more than projecting points onto a plane. The gap between what these models learn and what current visualizations can show us is, itself, worth studying carefully.

Watching a neural network learn is one thing. Actually seeing it learn — in a way that maps cleanly to what's happening inside the model — turns out to be a much harder problem than it first appears.

Why popular embedding methods break down during training

Techniques like t-SNE and UMAP have become standard tools for visualizing high-dimensional data, but they carry a structural flaw when applied to tracking model training over time. The core issue is what researchers call data-visual correspondence: the idea that a meaningful change in data should produce a proportional, interpretable change in the visualization — and vice versa.

Non-linear embeddings routinely violate this. When only a subset of data points shifts — say, all representations of the digit "1" between training epochs 13 and 14 — both UMAP and t-SNE can cause every point in the visualization to move dramatically. That's because each point's position depends non-trivially on the entire data distribution. A localized change in the model gets amplified into a global visual disruption, making it nearly impossible to trace cause and effect.

Sensitivity to initial conditions compounds the problem. Even after a network stabilizes around epoch 30 on MNIST, t-SNE and UMAP can continue generating substantially different projections all the way through epoch 99. Temporal regularization approaches like Dynamic t-SNE reduce some of this instability, but introduce their own interpretability trade-offs.

Where linear projections have a concrete edge

Linear dimensionality reduction methods don't suffer from the same instability. Because changes in data produce predictable, visually salient changes in the output, they're better suited to tasks where you need to track what a model is actually learning — not just what its activations look like at a single snapshot.
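A small sketch of why a fixed linear projection keeps this correspondence: each point's 2D position depends only on that point, so changing one example moves one dot. The projection matrix and the two "epochs" of predictions below are invented for illustration.

```python
def project(points, basis):
    """Apply a fixed linear map; row k of `basis` is the 2D image of axis k."""
    return [
        tuple(sum(p[k] * basis[k][d] for k in range(len(p))) for d in range(2))
        for p in points
    ]


# Hypothetical fixed projection from 3 dimensions down to 2.
basis = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]

epoch_13 = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
# Only the second example changes between the two epochs.
epoch_14 = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.1, 0.1, 0.8]]

before = project(epoch_13, basis)
after = project(epoch_14, basis)
moved = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
```

Only index 1 appears in `moved`. With t-SNE or UMAP, by contrast, re-running the embedding on `epoch_14` could reposition all three points.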

A clear example comes from Fashion-MNIST, where a classifier struggles to distinguish sandals, sneakers, and ankle boots. A well-chosen linear projection reveals this confusion as a triangle: data points cluster near the vertices when the model is fairly confident, and drift toward the edges or center depending on whether the confusion is two-way or three-way. UMAP can isolate the three classes visually, but it can't distinguish between a model that's confused between all three simultaneously and one that's making separate pairwise errors — a distinction that matters a great deal for diagnosing model behavior.

The same approach applied to pullovers, coats, and shirts reveals a different pattern: examples fill the interior of the triangle, not just its edges. That's the signature of genuine three-way confusion, as opposed to the two-way mixing seen with footwear. In CIFAR-10, the same logic applies to dogs and cats, and to airplanes and ships — though the mixing is messier because the overall misclassification rate is higher.

Direct manipulation and the mechanics of linear projection

One capability that's exclusive to linear methods is direct manipulation. A linear projection from n dimensions to 2D can be represented as n two-dimensional vectors — each one being the destination of a canonical basis vector from the original space. In the context of a classification layer, this has a clean interpretation: each vector is where a perfectly confident prediction for a given class gets projected.

Because the softmax outputs are non-negative and sum to one, every prediction is projected to a weighted average of these class vectors, which act as draggable handles. A prediction that's split evenly between two classes therefore lands exactly halfway between their respective handles in the visualization. This linearity is what makes the triangular patterns interpretable rather than incidental: the geometry of the projection directly encodes the geometry of the model's uncertainty.
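The halfway property follows directly from the arithmetic, as this sketch shows. The handle coordinates are hypothetical positions chosen to form a triangle; the function name `project_prediction` is likewise an assumption of the example.

```python
def project_prediction(prediction, handles):
    """Project a class-probability vector to 2D as a weighted sum of handles."""
    x = sum(p * hx for p, (hx, hy) in zip(prediction, handles))
    y = sum(p * hy for p, (hx, hy) in zip(prediction, handles))
    return (x, y)


# Hypothetical 2D handle positions for three classes.
handles = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]

confident = project_prediction([1.0, 0.0, 0.0], handles)    # lands on handle 0
two_way = project_prediction([0.5, 0.5, 0.0], handles)      # midpoint of handles 0 and 1
three_way = project_prediction([0.25, 0.25, 0.5], handles)  # inside the triangle
```

Two-way confusion places a point on an edge of the triangle and three-way confusion places it in the interior, which is what separates the footwear pattern from the pullover, coat, and shirt pattern.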

Giving users the ability to drag these handles around lets them construct new projections interactively, without needing to specify a mathematical objective upfront. It's a practical way to explore the data when you already have a hypothesis about which classes might be getting confused.

The Grand Tour as a solution to the "which projection?" problem

Direct manipulation works well when you know what you're looking for. But when you don't, you need a different strategy. Principal Component Analysis is the obvious candidate, but it has a specific weakness with softmax layers: because each class axis concentrates a similar number of examples, the variance is spread fairly evenly across dimensions. The first two principal components aren't meaningfully better than the third or fourth, which makes the choice of projection somewhat arbitrary.

The Grand Tour sidesteps this by animating through random projections continuously. Starting from a random velocity, it smoothly rotates the data around the origin in high-dimensional space and projects the result down to 2D. Rather than committing to a single view, it lets patterns emerge across many angles — which is particularly useful for softmax layers, where each axis corresponds directly to the model's confidence about a specific class.
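One way to sketch this animation is to keep an orthogonal matrix and nudge it each frame with small rotations in every pair of coordinate planes, using fixed random angular velocities. This is an illustrative approach under those assumptions and not necessarily how the paper's implementation works; the function names, step size `dt`, and random seed are all made up.

```python
import math
import random


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def rotate_in_plane(matrix, i, j, angle):
    """Rotate columns i and j of `matrix` by `angle`, in place (a Givens rotation)."""
    c, s = math.cos(angle), math.sin(angle)
    for row in matrix:
        a, b = row[i], row[j]
        row[i] = c * a - s * b
        row[j] = s * a + c * b


def grand_tour_step(rotation, velocities, dt):
    """Advance the rotation by a small angle in every coordinate plane."""
    n = len(rotation)
    for i in range(n):
        for j in range(i + 1, n):
            rotate_in_plane(rotation, i, j, velocities[i][j] * dt)


def project_2d(points, rotation):
    """Rotate each point, then keep only the first two coordinates."""
    return [
        tuple(sum(p[k] * rotation[k][d] for k in range(len(p))) for d in range(2))
        for p in points
    ]


random.seed(0)
n = 10  # e.g. a 10-class softmax layer
velocities = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
rotation = identity(n)
points = identity(n)  # one perfectly confident prediction per class

frames = []
for _ in range(3):
    grand_tour_step(rotation, velocities, dt=0.05)
    frames.append(project_2d(points, rotation))
```

Because each step composes rotations, `rotation` stays orthogonal, so the view changes smoothly from frame to frame without stretching or distorting the data.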

Applied across MNIST, Fashion-MNIST, and CIFAR-10 using comparable architectures, the Grand Tour makes the relative difficulty of each dataset immediately visible. MNIST data points cluster tightly near the corners of the softmax space, reflecting high classification confidence. Fashion-MNIST and CIFAR-10 show progressively more points drifting into the interior — a direct visual indicator of lower confidence and higher confusion rates.

Because linear projections are defined independently of the input data, the same projection can be held fixed while the data evolves across training epochs. That stability is what makes it possible to watch a network actually learn — not just compare static snapshots of where it ended up.


Source: https://distill.pub/2020/grand-tour
