How Value Learning Actually Works: A Paths-Based Perspective
Value estimation sits at the heart of reinforcement learning — and how you go about it shapes everything from sample efficiency to the stability of your training runs. Temporal Difference (TD) learning offers a compelling answer to a problem that Monte Carlo methods only partially solve: how do you learn reliably when the future is noisy and episodes are long?
Why Monte Carlo Value Estimation Has a Ceiling
The most straightforward way to estimate the value of a state is to simply average the returns you observe from it across episodes. That's Monte Carlo estimation in a nutshell. If an agent visits a state once, the value is whatever return that single episode produced. Visit it multiple times, and the estimate becomes an average over all those returns.
Formally, Monte Carlo uses an update rule — a way of adjusting estimates incrementally as new episodes come in. Using an "updates toward" operator to keep the math clean, each new episode nudges the value estimate closer to the observed return. This notation also generalizes cleanly to parametric function approximators like neural networks, where a gradient step can't always be expressed as a simple running average.
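For the tabular case, the "updates toward" idea can be sketched in a few lines of Python. The function name `mc_update`, the step size `alpha`, and the dictionary-based value table are illustrative assumptions, not notation taken from the article.

```python
from collections import defaultdict


def mc_update(values, state, observed_return, alpha=0.1):
    """Nudge V(state) a fraction alpha of the way toward one observed return."""
    values[state] += alpha * (observed_return - values[state])
    return values[state]


# Hypothetical example: one state, three episode returns.
V = defaultdict(float)  # unseen states start at 0.0
for G in [1.0, 0.0, 1.0]:
    mc_update(V, "s0", G, alpha=0.5)
```

If `alpha` is set to 1/n on the n-th visit, this update reduces to a plain running average of the returns; a fixed `alpha` instead weights recent episodes more heavily.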
The problem is variance. Future returns are noisy by nature — shaped by randomness, by the actions of the agent, and by everything else happening in the environment between now and the end of the episode. The further out you look, the more that noise compounds. Monte Carlo waits until the end of an episode to update, which means every estimate carries the full weight of that accumulated uncertainty.
Where Temporal Difference Learning Changes the Equation
TD learning takes a different approach: instead of waiting for a complete episode, it bootstraps — using current value estimates to update other value estimates along the way. Rather than measuring the full return from a state to the end of an episode, TD methods look one (or a few) steps ahead and use the estimated value of the next state as a proxy for everything beyond it.
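A minimal sketch of the one-step version of this idea, assuming a tabular value function stored in a dictionary. The names `td0_update`, `alpha`, and `gamma` are hypothetical choices for illustration.

```python
def td0_update(values, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One-step TD: move V(state) toward reward + gamma * V(next_state)."""
    # A terminal transition has nothing beyond it to bootstrap from.
    bootstrap = 0.0 if done else values.get(next_state, 0.0)
    target = reward + gamma * bootstrap
    current = values.get(state, 0.0)
    values[state] = current + alpha * (target - current)
    return target


V = {"a": 0.0, "b": 2.0}
target = td0_update(V, "a", reward=1.0, next_state="b", done=False, alpha=0.5, gamma=0.9)
```

The update needs only a single transition, so it can run after every step rather than at the end of the episode.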
This is where the "merging paths of experience" framing becomes meaningful. TD learning effectively combines the direct signal from immediate rewards with the indirect signal encoded in existing value estimates. The result is a method that can learn from incomplete episodes and that tends to exhibit lower variance than Monte Carlo — at the cost of introducing some bias from the bootstrapped estimates.
The tradeoff between bias and variance is one of the central tensions in RL algorithm design, and TD learning sits at a specific point on that spectrum. Methods like TD(λ) use eligibility traces to blend TD and Monte Carlo updates, letting practitioners tune where on that spectrum they want to operate depending on the problem.
What This Means for Reinforcement Learning at Scale
The practical significance of TD learning becomes clearer when you consider what modern RL has achieved. Systems that beat world-champion Go players, control robotic hands with fine motor precision, and generate images from scratch all depend on value estimation working reliably at scale. Monte Carlo methods alone would struggle in these settings — the episodes are too long, the state spaces too large, and the variance too high to learn efficiently.
TD learning's statistical efficiency — its ability to extract useful signal from partial trajectories — is a core reason why deep RL has been able to tackle problems of this complexity. By updating value estimates continuously rather than waiting for terminal states, TD methods make better use of every interaction the agent has with its environment.
For many RL approaches, particularly policy-value iteration methods, getting value estimation right essentially solves the whole problem. For actor-critic architectures, accurate value estimates are what keep policy gradient updates from drowning in noise. Either way, the choice of how to estimate value isn't a footnote — it's foundational, and TD learning's ability to balance bias against variance while learning incrementally is what makes it the workhorse of modern reinforcement learning.
The elegance of TD learning is that it treats every step of an episode as a learning opportunity rather than just a data point to be collected and averaged later — a shift in perspective that turns out to matter enormously when you're trying to teach an agent to navigate a complex world.
Most introductions to reinforcement learning treat Monte Carlo and Temporal Difference learning as two separate philosophies — pick one, understand it, move on. The more revealing way to look at them is as two points on a spectrum defined by how much of an agent's experience gets folded into each value estimate.
The return and why it's only part of the story
Value, at its core, is expected return. The return itself is a discounted sum of future rewards, R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ..., where the discount factor γ controls how much weight near-term rewards carry relative to distant ones. Monte Carlo methods lean directly on this: they update value estimates by averaging over complete trajectories, using the actual return observed at the end of each episode. The logic is sound: if value is expected return, why not just measure return directly?
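As a concrete illustration, the discounted return of a finished episode can be computed by folding the rewards from the last step backward. The helper name `discounted_return` and the sample rewards are assumptions made for this sketch.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for one finished episode."""
    total = 0.0
    # Working backward applies one extra factor of gamma per step of delay.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


G = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

Here the final reward of 2.0 is two steps away, so it contributes 0.5^2 * 2.0 = 0.5 to the return.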
The limitation is subtle but consequential. When an agent follows a trajectory, it only experiences one possible future from each state it visits. Other futures — equally valid branches the agent could have taken — go unobserved. Monte Carlo has no mechanism to incorporate them.
How TD learning exploits trajectory intersections
Temporal Difference learning takes a different approach. Rather than waiting for a full episode to complete, TD methods bootstrap: they update the value of a state using the estimated value of the next state, propagating information backward through the sequence incrementally.
The more interesting consequence shows up at trajectory intersections. When two trajectories pass through the same state, Monte Carlo treats them as separate data points — each trajectory contributes its own return independently. TD learning, by contrast, merges those intersections. The return flows backward through all preceding states regardless of which trajectory originally led there. A state visited by multiple trajectories effectively benefits from all of them.
Expanding the TD update rule recursively reveals a sum of nested expectation values. At first glance this looks more complicated than the clean Monte Carlo formulation. But rewriting both updates in terms of raw rewards puts them side by side in a way that makes the structural difference clear: Monte Carlo averages over real trajectories, while TD learning averages over all possible paths — including simulated ones constructed by combining segments from different trajectories.
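Written out, the side-by-side comparison looks roughly as follows. This is a sketch in assumed notation: the hooked arrow stands for the "updates toward" operator, and each expectation in the TD line is taken over the transitions observed from the corresponding state.

```latex
\begin{align*}
\text{Monte Carlo:}\quad V(s_t) &\hookleftarrow r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \dots \\
\text{TD:}\quad V(s_t) &\hookleftarrow r_t + \gamma\, V(s_{t+1}) \\
 &\simeq r_t + \gamma\, \mathbb{E}\big[r_{t+1}\big] + \gamma^2\, \mathbb{E}\big[\mathbb{E}[r_{t+2}]\big] + \dots
\end{align*}
```

The Monte Carlo line contains only rewards from the trajectory actually followed, while each nested expectation in the TD line averages over every observed continuation from that point.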
Why averaging over more paths produces better estimates
This is where the paths perspective becomes analytically useful. Wherever two trajectories intersect, both outcomes represent valid futures for the agent from that point forward. Even if the agent actually followed one branch, the other branch is still informative. TD learning implicitly constructs these simulated paths and folds them into the value estimate.
For tabular environments, the quality of a value estimate comes down to variance, and variance decreases as you average over more samples. TD learning never averages over fewer items than Monte Carlo — the set of simulated paths always includes the real trajectories as a subset. When additional simulated paths exist, TD has strictly more data to work with. This is the core reason TD learning tends to outperform Monte Carlo in tabular settings: it extracts more signal from the same underlying experience.
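A toy example makes the difference visible. The sketch below assumes two hypothetical episodes that intersect at state "C", no discounting, and a batch form of TD that repeatedly replaces each value with the mean of its observed one-step targets; none of these specifics come from the article.

```python
from collections import defaultdict

# Each step is (state, reward, next_state); next_state None marks termination.
episodes = [
    [("A", 0.0, "C"), ("C", 1.0, None)],
    [("B", 0.0, "C"), ("C", 0.0, None)],
]


def monte_carlo_values(episodes):
    """Average the actual return that followed each visit to a state."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        for state, reward, _ in reversed(episode):
            G += reward
            returns[state].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}


def batch_td_values(episodes, sweeps=50):
    """Repeatedly set V(s) to the mean one-step target r + V(s') seen from s."""
    transitions = defaultdict(list)
    for episode in episodes:
        for state, reward, next_state in episode:
            transitions[state].append((reward, next_state))
    V = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        for s, outcomes in transitions.items():
            targets = [r + (V[nxt] if nxt is not None else 0.0) for r, nxt in outcomes]
            V[s] = sum(targets) / len(targets)
    return V


mc = monte_carlo_values(episodes)
td = batch_td_values(episodes)
```

Monte Carlo assigns "A" a value of 1.0 and "B" a value of 0.0, because each start state saw only one outcome. The TD estimate gives both 0.5: the two outcomes observed at "C" are shared by every path that leads into it, including the simulated paths A-to-C-to-0 and B-to-C-to-1 that no episode actually followed.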
Extending the framework to Q-functions
Value functions estimate how good it is to be in a given state. Q-functions go one level deeper, estimating the value of being in a state and taking a specific action. The practical payoff is direct: Q-functions let an agent compare its available actions at any decision point, which is essential for policy improvement.
The update mechanics carry over cleanly. The Monte Carlo Q-function update mirrors the value function version almost exactly — instead of updating toward the return of occupying a state, the agent updates toward the return of occupying a state and committing to a particular action. The TD update follows the same structural logic, bootstrapping off the estimated Q-value of the next state-action pair rather than waiting for a terminal return.
The paths perspective applies here too. The TD update for Q-functions still averages over a richer set of paths than Monte Carlo, and the variance argument holds. What Q-functions add is the ability to reason about counterfactual actions: not just "how good is this state" but "how much better or worse would a different choice have been." That distinction becomes the foundation for most practical deep reinforcement learning algorithms built on top of these ideas.
The gap between Monte Carlo and TD learning is narrower than it first appears — both are empirical averages targeting the same quantity. The difference is in scope: one counts what actually happened, the other accounts for what could have happened too.
Temporal difference learning has a subtle flaw baked into one of its most foundational algorithms — and understanding that flaw is the key to grasping why more sophisticated variants were developed.
Why Sarsa Falls Short of Its Own Goal
The Sarsa algorithm gets its name from the five-element tuple it operates on: state, action, reward, next state, next action. That tuple structure is also where the trouble starts. When Sarsa updates its Q-value estimate, it reaches for Q(s_{t+1}, a_{t+1}) — the value of the specific action actually taken in the next state — to approximate what the next state is worth. But that's not quite the right quantity. What the update rule really needs is V(s_{t+1}), the true value of the next state under the current policy, not the value of one particular action sampled from it. Using a single sampled action as a stand-in for the full state value introduces noise, and over many updates, that noise compounds.
The distinction matters because Q(s_{t+1}, a_{t+1}) is just one draw from the distribution of actions the policy might take. It could be a high-value action, a low-value one, or anything in between. V(s_{t+1}), by contrast, is the expectation over all of them — a much more stable target to learn toward.
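The sensitivity to the sampled next action is easy to see in code. This is a minimal tabular sketch; the function name `sarsa_update`, the state and action labels, and the Q-values are all made up for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """Sarsa: bootstrap from the Q-value of the action actually taken next."""
    bootstrap = 0.0 if done else Q[(s_next, a_next)]
    target = r + gamma * bootstrap
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return target


Q = {("s0", "left"): 0.0, ("s1", "left"): 1.0, ("s1", "right"): 3.0}
# The same transition yields different targets depending on the sampled next action.
# dict(Q) passes a copy so the two calls start from identical tables.
target_left = sarsa_update(dict(Q), "s0", "left", 0.0, "s1", "left", False, gamma=1.0)
target_right = sarsa_update(dict(Q), "s0", "left", 0.0, "s1", "right", False, gamma=1.0)
```

The two targets differ by 2.0 even though the state, action, reward, and next state are identical; that spread is the single-sample noise described above.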
Expected Sarsa: Averaging Over What Could Have Happened
Expected Sarsa addresses this directly by replacing the single next-action Q-value with a weighted sum across all possible next actions, where the weights come from the policy's action probabilities. The result is a direct estimate of V(s_{t+1}) derived from the Q-function itself, rather than a noisy single-sample approximation.
The mechanics are straightforward: instead of waiting to see which action the agent actually takes next and using that action's Q-value, Expected Sarsa computes the average Q-value at the next state, weighted by how likely the policy is to choose each action. That average is, by definition, the state value under the current policy.
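In code, the change from Sarsa amounts to replacing one lookup with a weighted sum. The sketch below assumes the policy's action probabilities at the next state are available as a dictionary; `expected_sarsa_target` and `pi_s1` are hypothetical names.

```python
def expected_sarsa_target(Q, policy_probs, r, s_next, done, gamma=0.99):
    """Bootstrap from sum_a pi(a | s_next) * Q(s_next, a) instead of one sample."""
    if done:
        return r
    expected_q = sum(p * Q[(s_next, a)] for a, p in policy_probs.items())
    return r + gamma * expected_q


Q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
pi_s1 = {"left": 0.25, "right": 0.75}  # assumed action probabilities at s1
target = expected_sarsa_target(Q, pi_s1, r=0.0, s_next="s1", done=False, gamma=1.0)
```

The target is 0.25 * 1.0 + 0.75 * 3.0 = 2.5 regardless of which action the agent goes on to take, at the cost of one pass over the action set per update.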
The Counterintuitive Accuracy Advantage
Here's where Expected Sarsa gets genuinely interesting. The value estimate it produces is often more accurate than one derived directly from observed experience. That sounds paradoxical — how can a computed average beat actual data? — but the explanation is clean once you see it.
When an agent acts in an environment, the actions it takes are drawn from the policy, but any finite sequence of actions will reflect the empirical distribution of those draws, not the true underlying policy distribution. There's always sampling noise. Expected Sarsa sidesteps this by weighting Q-values using the true policy probabilities directly, rather than whatever the agent happened to do. In doing so, it effectively corrects for the gap between what the policy actually prescribes and what the agent empirically did — a gap that Sarsa never closes.
This correction is subtle but consequential. It means Expected Sarsa can extract more signal from the same experience, making it a more data-efficient algorithm even though it requires slightly more computation per update step.
The progression from Sarsa to Expected Sarsa illustrates a recurring theme in reinforcement learning: the most impactful improvements often come not from collecting more data, but from making smarter use of the data already available. Replacing a noisy single-sample estimate with a properly weighted expectation is a small algorithmic change with outsized practical benefits — and it opens the door to a broader family of TD methods that each take a different approach to recovering state values from Q-functions.
Reinforcement learning algorithms are often taught as a taxonomy — a list of named methods with distinct update rules. But strip away the labels, and Sarsa, Expected Sarsa, Q-learning, and Double Q-learning are all solving the same problem: how do you estimate the value of the next state in a temporal difference (TD) update? The differences between them come down to a single design choice — how you weight the paths of experience that pass through a given state.
Off-Policy Learning and the Problem of Whose Policy You're Following
Standard TD methods estimate value under the policy the agent is currently following. Off-policy learning breaks that constraint. By re-weighting Q-values according to an arbitrary target policy π^{off}, an agent can estimate value under any policy, not just the one generating its experience. Expected Sarsa, interestingly, sits at the boundary here: it functions as a special case of off-policy learning that happens to be used for on-policy estimation.
The paths perspective makes this concrete. At any state where multiple trajectories of experience intersect, re-weighting those paths by the target distribution means high-probability paths contribute more to the value estimate, while low-probability ones fade into the background. The value estimate reflects the policy you care about, not the one you happened to use during data collection.
Q-learning takes this to an extreme. Rather than weighting paths by any distribution, it prunes everything except the highest-valued path — the one the agent would actually follow at test time. This focus often produces faster convergence than on-policy methods, because the agent stops paying attention to suboptimal trajectories it will never use again.
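One way to see the relationship is to write the off-policy target as a weighted sum over next actions and then plug in a greedy target policy. The helper names and numbers below are assumptions for this sketch.

```python
def weighted_target(Q, target_probs, r, s_next, gamma):
    """Off-policy target: weight next-state Q-values by a chosen target policy."""
    return r + gamma * sum(p * Q[(s_next, a)] for a, p in target_probs.items())


def q_learning_target(Q, actions, r, s_next, gamma):
    """Q-learning target: keep only the highest-valued next action."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)


Q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
actions = ["left", "right"]

# Greedy target policy: all probability mass on the argmax action.
best = max(actions, key=lambda a: Q[("s1", a)])
greedy = {a: (1.0 if a == best else 0.0) for a in actions}

off_policy = weighted_target(Q, greedy, r=0.5, s_next="s1", gamma=0.9)
q_learning = q_learning_target(Q, actions, r=0.5, s_next="s1", gamma=0.9)
```

Both calls return the same value, which is the sense in which taking the maximum is the limiting case of re-weighting: every path except the highest-valued one receives weight zero.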
Why Q-Learning's Optimism Becomes a Liability
The efficiency of Q-learning comes with a cost. By always selecting the maximum Q-value, it introduces a systematic upward bias in value estimates — particularly when rewards are noisy. The slot machine analogy captures this cleanly: if you play a hundred machines and happen to hit a jackpot on one, Q-learning will treat that lucky outcome as representative of the casino's true value. The estimate isn't wrong because the data is bad; it's wrong because the selection rule is optimistic by construction.
Double Q-learning addresses this by decoupling action selection from value estimation. Instead of using the same data to both identify the best action and estimate its value, you use two independent estimates — effectively asking a second observer what they saw at the same machine. The probability that both observers got lucky on the same outcome is low, so the combined estimate regresses toward something more realistic. The bias doesn't disappear entirely, but it shrinks considerably.
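The selection effect can be reproduced in a small simulation. The sketch below is a bandit-style illustration of the underlying estimator, not a full Double Q-learning implementation: it assumes 20 hypothetical arms whose true mean payoff is exactly 0, Gaussian noise, and a fixed random seed.

```python
import random
import statistics


def estimate_best_arm(num_arms=20, samples_per_arm=10, rng=random):
    """All arms have true mean 0, so the true value of the best arm is 0."""
    def sample_means():
        return [
            statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(samples_per_arm))
            for _ in range(num_arms)
        ]

    means_a = sample_means()
    means_b = sample_means()  # independent second set of observations
    single = max(means_a)  # select and evaluate with the same data
    best = max(range(num_arms), key=lambda i: means_a[i])
    double = means_b[best]  # select with A, evaluate with B
    return single, double


rng = random.Random(0)
trials = [estimate_best_arm(rng=rng) for _ in range(500)]
single_avg = statistics.fmean(s for s, _ in trials)
double_avg = statistics.fmean(d for _, d in trials)
```

Averaged over the trials, the single estimator lands well above zero even though no arm is actually better than any other, while the decoupled estimator stays close to zero. In this symmetric toy case the second estimate removes the optimism almost completely; with bootstrapped Q-values the two estimates are rarely this independent.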
Function Approximation: When Merging Paths Goes Wrong
All of the above assumes the agent can store a separate value estimate for every state. That works fine in small environments like Cliff World, but most real RL problems have state spaces that are too large — or continuous — to enumerate. Function approximation solves the storage problem by forcing the value estimator to generalize across states using fewer parameters than there are states. Linear models, decision trees, and neural networks all qualify.
From the paths perspective, function approximation is equivalent to merging nearby paths of experience. The critical question is what "nearby" means. A Euclidean distance metric is a reasonable default in open environments, but it breaks down the moment the geometry of the state space stops reflecting the dynamics of the agent's transitions. Add a wall between two regions, and states that are close in Euclidean space may be completely unreachable from one another — merging their paths produces bad value estimates.
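A linear approximator shows the merging effect in its simplest form. The sketch below assumes a hypothetical two-feature encoding in which states s0 and s1 share their first feature, and uses a semi-gradient one-step TD update; the names and numbers are illustrative only.

```python
def linear_value(weights, features):
    """V(s) approximated as a dot product of weights and state features."""
    return sum(w * x for w, x in zip(weights, features))


def semi_gradient_td0(weights, features, reward, next_features, done,
                      alpha=0.1, gamma=0.99):
    """Update shared weights; every state with overlapping features moves too."""
    bootstrap = 0.0 if done else linear_value(weights, next_features)
    td_error = reward + gamma * bootstrap - linear_value(weights, features)
    return [w + alpha * td_error * x for w, x in zip(weights, features)]


# Hypothetical encoding: s0 and s1 overlap on the first feature.
phi = {"s0": [1.0, 0.0], "s1": [1.0, 1.0]}
w = [0.0, 0.0]
w = semi_gradient_td0(w, phi["s0"], reward=1.0, next_features=phi["s1"],
                      done=True, alpha=0.5)
v_s1 = linear_value(w, phi["s1"])
```

Only s0 was updated, yet the estimate for s1 moved from 0.0 to 0.5 because the two states share a weight. Whether that spillover helps or hurts depends on whether the shared feature reflects states that really do lead to similar futures.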
This is where TD learning's efficiency becomes a liability. TD amplifies generalization errors significantly more than Monte Carlo does, because its updates propagate through the value function in ways that compound mistakes. Monte Carlo, which waits for full episode returns before updating, is more forgiving of bad generalization — but it pays for that robustness with slower learning.
Deep neural networks offer a partial resolution to this tension. Unlike fixed-metric approximators, neural networks don't hardcode assumptions about which states are similar. Early in training they make the same kinds of generalization errors as any other approximator. But given enough experience, they can learn the structure of the state space from data — discovering, for instance, that value updates on one side of a barrier should never influence estimates on the other side. That capacity for learned generalization is a large part of what makes deep reinforcement learning worth the computational cost.
What the TD vs. Monte Carlo Debate Actually Reveals
The practical dominance of TD methods over the past few decades reflects a real empirical pattern: in most environments, the efficiency gains from bootstrapping outweigh the sensitivity to generalization errors. But that dominance isn't absolute. Monte Carlo remains relevant — not just as a theoretical baseline, but as a practical tool for policy selection, as demonstrated in work like AlphaGo. The deeper point is that neither approach is universally superior. The right choice depends on the structure of the environment, the quality of the function approximator, and how much the agent can afford to be wrong early in training.
Thinking about these algorithms through the lens of path re-weighting — rather than as separate named methods — makes those tradeoffs easier to reason about. The question is never really "which algorithm should I use?" It's "how do I want to weight the experience my agent has collected, and what happens when that weighting is imperfect?"
Bridging the gap between two foundational reinforcement learning approaches turns out to be more than a theoretical exercise — TD(λ) learning offers a practical middle ground that frequently outperforms either method on its own.
How TD(λ) Combines the Best of Both Worlds
Monte Carlo and TD learning each carry distinct advantages. Monte Carlo methods wait until the end of an episode to update value estimates, so their targets are actual returns that do not depend on other, possibly wrong, value estimates; the price is higher variance and slower learning. TD learning updates incrementally, bootstrapping from current estimates, which makes it more statistically efficient but potentially less stable early in training, when those estimates are still poor. TD(λ) threads between them using a single coefficient, λ, that controls how much weight each approach carries. Set λ to 0 and you get pure TD; push it to 1 and you recover Monte Carlo. Anywhere in between, you get a blended estimator that can be tuned to the problem at hand.
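The two endpoints can be checked with a short computation of the λ-return, the target that TD(λ) updates toward in its forward view. The sketch below uses the standard backward recursion for an episodic task; the function name, the sample rewards, and the value estimates are assumptions.

```python
def lambda_returns(rewards, next_values, lam, gamma):
    """Backward recursion: G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    next_values[t] is the current estimate V(s_{t+1}); use 0.0 for the terminal state.
    """
    returns = [0.0] * len(rewards)
    next_return = 0.0
    for t in reversed(range(len(rewards))):
        # Blend the bootstrapped estimate with the longer return that follows it.
        blended = (1.0 - lam) * next_values[t] + lam * next_return
        returns[t] = rewards[t] + gamma * blended
        next_return = returns[t]
    return returns


rewards = [1.0, 0.0, 2.0]
next_values = [0.5, 1.0, 0.0]  # hypothetical estimates; last entry is terminal

td_targets = lambda_returns(rewards, next_values, lam=0.0, gamma=1.0)
mc_targets = lambda_returns(rewards, next_values, lam=1.0, gamma=1.0)
```

With λ = 0 each target is the reward plus the next state's estimate, which is the one-step TD target. With λ = 1 the value estimates drop out entirely and each target is the full Monte Carlo return.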
What makes this particularly interesting is the question of whether λ should stay fixed throughout training. There's a reasonable argument that Monte Carlo behavior is more useful early on — before the agent has built a reliable internal representation of the state space — while TD's path-merging efficiency becomes more valuable once that foundation is in place. That suggests annealing λ over the course of training could be a stronger strategy than locking it to a constant value, though this remains an open area of exploration rather than settled practice.
Why This Framing Matters for Understanding TD Learning
The article's broader contribution is a new conceptual lens for thinking about TD learning itself — one that clarifies three things that often trip up practitioners: why TD learning tends to be beneficial, why it handles off-policy learning well, and why combining it with function approximators introduces complications. These aren't just academic points. Off-policy learning is central to many modern deep RL systems, and instability when using neural networks as function approximators has been a persistent challenge in the field. Having a cleaner mental model of the underlying mechanics helps researchers diagnose problems and design better algorithms.
Collaboration and Open Access
The work was developed collaboratively, with Chris Olah originating the central concept and structure, and Sam working alongside him to develop the details. Sam drafted the initial text and figures, which were then refined with Chris's input. The interactive visualizations — including the hero and playground components — were built by Chris and Cassandra Xia, then adapted by Sam for the Distill publication format. Feedback from Ludwig Schubert, Justin Gilmer, Shan Carter, and John Schulman shaped earlier drafts, with Shan also contributing design guidance on the diagrams. Sam's work was supported by the Google AI Residency Program.
The text and diagrams are published under Creative Commons Attribution CC-BY 4.0, with the source available on GitHub. Figures reused from external sources are noted individually in their captions and fall outside that license. Corrections and suggestions can be submitted by opening an issue on GitHub.
The playground the authors built to accompany the article is worth spending time with — interactive tools that let you test these intuitions directly tend to surface edge cases and build understanding faster than reading alone ever could.