Batch TD(0) as Dynamic Programming on the Empirical MRP

This post grew out of Chapter 6 of Sutton and Barto's Reinforcement Learning: An Introduction. In their discussion of batch updating, they note that batch TD(0) converges to the certainty-equivalence estimate: the value function obtained by first fitting the maximum-likelihood Markov Reward Process to the observed data, then applying the Bellman operator from dynamic programming to that empirical model until convergence. The book does not explain this relationship in detail, and this post is my attempt to make that equivalence explicit.

Problem Statement & Approach

Let us consider a prediction problem rather than a control problem. We want to evaluate the state-value function of a fixed policy $\pi(a \mid s)$ from a finite collection of observed trajectories. The same general logic can later be extended to state-action value functions, but here I focus only on state values.

In this section I will show that if batch TD(0) is run until convergence, then its fixed point coincides with the fixed point obtained by the following two-stage procedure:

estimate the policy-induced Markov Reward Process from the observed trajectories by maximum likelihood;
run dynamic programming policy evaluation on that estimated model.

Let us now formalize the assumptions and the claim.

Claim

We assume a finite episodic Markov Reward Process induced by a fixed policy $\pi$, and a collection of trajectories sampled from that same process. A single trajectory is

$$\tau_i = (S_0^{(i)}, R_1^{(i)}, S_1^{(i)}, R_2^{(i)}, \dots, S_{T_i}^{(i)}).$$

We make the following assumptions.

Markov reward process. The policy-induced process satisfies the Markov property: $$P(S_{t+1}, R_{t+1} \mid S_t, S_{t-1}, R_t, \dots) = P(S_{t+1}, R_{t+1} \mid S_t).$$
Time-homogeneous dynamics. The policy-induced transition-reward law does not depend on time: $$P(S_{t+k+1}, R_{t+k+1} \mid S_{t+k}) = P(S_{t+1}, R_{t+1} \mid S_t).$$
Discounted episodic prediction. The discount factor satisfies $0 \le \gamma < 1$.
Reward-model restriction. For each observed transition pair $(s, s')$, we restrict attention to conditional reward models $p_\pi(r \mid s, s')$ whose maximum-likelihood estimator has conditional expectation equal to the empirical mean reward over all observed transitions from $s$ to $s'$.

Under these assumptions, the fixed point of the Bellman operator

$$(T_\pi v)(s) = \sum_r \sum_{s'} p_\pi(r, s' \mid s)\,[r + \gamma v(s')]$$

applied to the maximum-likelihood estimate of the policy-induced Markov Reward Process is equal to the fixed point of batch TD(0), defined by

$$v(s) \leftarrow v(s) + \frac{\alpha}{N(s)} \sum_i \sum_{t : S_t^{(i)} = s} \left[R_{t+1}^{(i)} + \gamma v(S_{t+1}^{(i)}) - v(s)\right],$$

where $N(s)$ is the number of times state $s$ was visited across all trajectories.

Before going further, let us define the following notation:

$S$: the set of all observed nonterminal states,
$S^+$: the set of all observed states, including terminal states.

Proof Plan

When I was deriving these results, I intuitively followed steps:

Derive the fixed-point equation of batch TD(0)
Rewrite the Bellman fixed-point equation for the policy-induced process into a form that makes its connection to the batch TD(0) fixed point explicit
Compute the maximum-likelihood estimators of the relevant transition and reward probability distributions from the observed trajectories, substitute them into the reformulated Bellman equation, and verify that the resulting fixed point is exactly the same as the fixed point of batch TD(0).

In practice I iterated between the first two steps several times, but this is the cleanest way to present the argument.

Finding Fixed Point of Batch TD(0)

Simplify Batch TD(0) Update

Let us consider the batch TD(0) update:

$$ [Eq. 2] \ \ \ v(s) \leftarrow v(s) + \frac{\alpha}{N(s)} \sum_{i} \sum_{t : S_t^{(i)} = s} \Bigl[ R_{t+1}^{(i)} + \gamma v(S_{t+1}^{(i)}) - v(s) \Bigr] $$

Here we average over all trajectories and all time steps within each trajectory for which we observed state $s$.

However, there is an additional simplification, notice that the term $S_{t+1}$ represents the state visited immediately after $s$. It is easy to notice that the update to a state value function $v(s)$ is bootstrapped only from states that followed state $s$ across all observed trajectories. This can be expressed using a simple indicator function:

$$ \mathbf{1}\{S_t = s,\; S_{t+1} = s'\}. $$

Hence the batch TD(0) update can be rewritten as

$$ [Eq. 3] \ \ \ v(s) \leftarrow v(s) + \frac{\alpha}{N(s)} \sum_i \sum_{t} \sum_{s' \in S^+} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} \Bigl[ R_{t+1}^{(i)} + \gamma v(s') - v(s) \Bigr] $$

Notice that when we sum over all timestamps, we don't enforce sum over set $[t: S_t = s]$, and the reason for that is the intersection of states where $S_t=s$ and $S_t=s \land S_{t+1}=s'$ is $S_t=s \land S_{t+1}=s'$, so with the indicator function we added, we can sum over all $t$.

Derive Fixed Point of Batch TD(0) Update

Let's show that the fixed point of batch TD(0) is

$$ v(s) = \sum_{s' \in S^+} \frac{N(s,s')}{N(s)} \Bigl[ \bar r(s,s') + \gamma v(s') \Bigr]. $$

where

$$ N(s,s') = \sum_i \sum_t \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} $$ $$ N(s) = \sum_{s'} N(s,s'), $$ $$ \bar r(s,s')= \frac{1}{N(s,s')} \sum_i \sum_t \mathbf{1}\{S_t^{(i)}=s,\;S_{t+1}^{(i)}=s'\}R_{t+1}^{(i)}, \qquad N(s,s')>0. $$

Derivation

So at fixed point $\forall s \in S$ update for $v(s)$ is 0, which means based on equation $[Eq. 3]$

$$ \sum_{s' \in S^+} \sum_i \sum_{t} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} \Bigl[ R_{t+1}^{(i)} + \gamma v(s') - v(s) \Bigr] = 0. $$

This can be split into 3 different components

Part 1: reward term

Therefore,

$$ \sum_{s' \in S^+} \sum_i \sum_t \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} R_{t+1}^{(i)} = \sum_{s' \in S^+} N(s, s')/N(s, s') \sum_i \sum_t \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} R_{t+1}^{(i)} = \sum_{s' \in S^+} N(s, s') \bar r(s, s') $$

Part 2: bootstrapped next-state value term

Now consider the term involving $\gamma v(s')$. Since $\gamma v(s')$ does not depend on $i$ or $t$, it can be pulled outside the inner sums:

$$ \sum_{s' \in S^+} \sum_i \sum_{t} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} \gamma v(s') = \sum_{s' \in S^+} \gamma v(s') \sum_i \sum_{t} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} $$

If you look at the inner sum it is just number of times across all trajectories where state $s$ was followed by state $s'$, which is exactly $N(s, s')$

Now all we are left with is outer sum

Summing over all successor states $s' \in S^+$, the bootstrapped value part becomes

$$ \gamma \sum_{s' \in S^+} N(s,s') v(s'). $$

Part 3: current-state value term

Finally, consider the term involving $v(s)$. Since $v(s)$ does not depend on $s', i, or \ t$, it can be factored out of all sums

$$ \sum_i \sum_{t} \sum_{s' \in S^+} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} v(s) = v(s) \sum_{s' \in S^+} \sum_i \sum_{t} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} $$

Notice, that we already saw two inner sums, in Part 2 of the derivation and they are just equal to $N(s, s')$, now $\sum_{s' \in S^+} N(s, s') = N(s)$, which is just number of times we saw state $s$ in all of the trajectories.

Part 4: Putting the three parts together

Substituting the three simplified expressions back into the fixed-point equation, we obtain

$$ \sum_{s' \in S^+} N(s,s') \bar r(s,s') + \gamma \sum_{s' \in S^+} N(s,s') v(s') - N(s)\, v(s) = 0. $$

Rearranging,

$$ N(s)\, v(s) = \sum_{s' \in S^+} N(s,s') \bar r(s,s') + \gamma \sum_{s' \in S^+} N(s,s') v(s'). $$

Dividing both sides by $N(s)$, we get

$$ v(s) = \sum_{s' \in S^+} \frac{N(s,s')}{N(s)} \Bigl[ \bar r(s,s') + \gamma v(s') \Bigr]. $$

We only evaluate this equation for states $s \in S$, so $N(s)>0$. Also, $\bar r(s,s')$ is only defined for observed transition pairs with $N(s,s')>0$, pairs with $N(s,s')=0$ contribute zero in the indicator-based form.

Fixed Point of MLE + DP

Let us split this part into two subsections. In the first, we simplify the Bellman operator; in the second, we compute the necessary maximum-likelihood estimates from existing trajectories and express the Bellman operator through those estimates.

Simplify Bellman Operator

Given the Bellman operator of the policy-induced Markov Reward Process

$$(T_{\pi} v)(s) = \sum_r \sum_{s'} p_\pi(R_{t+1}=r, S_{t+1}=s' \mid S_t = s)[r + \gamma v(s')]$$

show that it can be rewritten

$$ (T_{\pi} v)(s)= \sum_{s'} p_\pi(s' \mid s) \Bigl[E_\pi[r \mid s,s'] + \gamma v(s')\Bigr]. $$

Starting from the MRP form, factor the joint conditional law as

$$ p_\pi(r,s' \mid s) = p_\pi(r \mid s,s')\,p_\pi(s' \mid s). $$

Substituting this factorization into the Bellman operator immediately gives

$$(T_\pi v)(s)= \sum_{s'} p_{\pi}(s'|s) \Bigl[ \sum_r p_{\pi}(r| s', s)r + \gamma v(s') \Bigr]$$ $$ = \sum_{s'} p_\pi(s' \mid s) \Bigl[E_\pi[r \mid s,s'] + \gamma v(s')\Bigr]. $$

Now after we rewrote Bellman operator we see the probabilities that we need to find maximum likelihood for $p_{\pi}(s'|s)$ and $p_{\pi}(r| s', s)$. And it is intuitively clear at this point if you look at fixed point of TD(0) what MLE of these distributions should be.

Maximum Likelihood Estimation

Maximum likelihood of trajectory

Even though from this point onward we work directly with the policy-induced MRP, it is useful to make explicit how the MRP trajectory law is derived from the underlying MDP under a fixed policy.

Consider first the MDP view of a single trajectory:

$$ \tau_i^{\mathrm{MDP}} = \bigl( S_0^{(i)}, A_0^{(i)}, R_1^{(i)}, S_1^{(i)}, A_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \bigr). $$

Conditioning on the initial state and using the Markov property together with a fixed policy $\pi$, the trajectory probability is

$$ p\!\left( A_0^{(i)}, R_1^{(i)}, S_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \mid S_0^{(i)} \right) = \prod_{t=0}^{T-1} p\!\left( R_{t+1}^{(i)}, S_{t+1}^{(i)} \mid S_t^{(i)}, A_t^{(i)} \right) \pi\!\left( A_t^{(i)} \mid S_t^{(i)} \right). $$

For policy evaluation, actions are not decision variables anymore; they are integrated out by the fixed policy. Therefore the induced MRP kernel is

$$ p_\pi(r,s' \mid s) = \sum_{a \in A} p(r,s' \mid s,a)\,\pi(a \mid s). $$

Marginalizing over the full action sequence gives the trajectory law of the induced MRP:

$$ p_\pi\!\left( R_1^{(i)}, S_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \mid S_0^{(i)} \right) = \sum_{a_0 \in A}\cdots\sum_{a_{T-1} \in A} \prod_{t=0}^{T-1} p\!\left( R_{t+1}^{(i)}, S_{t+1}^{(i)} \mid S_t^{(i)}, a_t \right) \pi(a_t \mid S_t^{(i)}) = \prod_{t=0}^{T-1} p_\pi\!\left( R_{t+1}^{(i)}, S_{t+1}^{(i)} \mid S_t^{(i)} \right). $$

Hopefully this adds clarity and from here on, we will be working in MRP notation. A single trajectory is

$$ \tau_i = \bigl( S_0^{(i)}, R_1^{(i)}, S_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \bigr). $$

The conditional probability of the full trajectory, given the initial state, is

$$ p(\tau_i \mid S_0^{(i)}) = p\bigl( R_1^{(i)}, S_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \mid S_0^{(i)} \bigr). $$

Because the policy-induced process is an MRP, this can be rewritten as

$$ p(\tau_i \mid S_0^{(i)}) = \prod_{t=0}^{T-1} p\!\left( R_{t+1}^{(i)}, S_{t+1}^{(i)} \mid S_t^{(i)} \right). $$

From here we can factor the joint conditional distribution as

$$ p_\pi(r,s' \mid s) = p_\pi(r \mid s,s')\,p_\pi(s' \mid s), $$

which gives

$$ p_\pi\!\left( R_1^{(i)}, S_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \mid S_0^{(i)} \right) = \prod_{t=0}^{T-1} p_\pi\!\left( R_{t+1}^{(i)} \mid S_t^{(i)}, S_{t+1}^{(i)} \right) p_\pi\!\left( S_{t+1}^{(i)} \mid S_t^{(i)} \right). $$

From here one can derive MLEs for

$$ p_\pi(S_{t+1} \mid S_t) \quad\text{and}\quad p_\pi(R_{t+1} \mid S_t, S_{t+1}). $$

such that the probability of observed trajectories is maximized, i.e. we parameterize both distributions and find parameters such that likelihood of observing trajectories we see is maximized.

MLE for Transition Kernel $p_{\pi}(s' | s)$

Claim

Assuming a direct categorical parameterization of the transition kernel, with no additional structural assumptions, i.e. $p_{\pi}(s'|s) = \theta_{\pi,s,s'}$ and $\sum_{s'}\theta_{\pi,s,s'} = 1$

$$ \hat{p}_\pi^{\mathrm{MLE}}(s' \mid s) = \frac{N(s,s')}{N(s)}. $$

Derivation

Note, if a single transition probability is categorical, because we can have multiple it translates into a multinomial, and we get that in order to find $\theta$ that maximizes likelihood of trajectory, we need

$$ \max L(\theta_{\pi, s, s'}) = \max \prod_{s' \in S^+} \theta_{\pi, s,s'}^{\,N(s,s')}, $$

which is equivalent to finding maximum of

$$ \ell(\theta_s) = \sum_{s' \in S^+} N(s,s') \log \theta_{\pi, s,s'}. $$

Since, must also satisfy the normalization constraint

$$ \sum_{s' \in S^+} \theta_{\pi, s,s'} = 1. $$

This translates into constrained optimization problem, which can be expressed through Lagrangian

$$ L(\theta_s,\lambda) = \sum_{s' \in S^+} N(s,s') \log \theta_{\pi, s,s'} + \lambda \left( \sum_{s' \in S^+} \theta_{s,s'} - 1 \right). $$

F.O.C:

$$ [1] \ \frac{\partial L}{\partial \theta_{\pi, s,s'}} = \frac{N(s,s')}{\theta_{\pi, s,s'}} + \lambda = 0, $$ $$ \theta_{\pi, s,s'} = -\frac{N(s,s')}{\lambda}. $$ $$ [2] \ \frac{\partial L}{\partial \lambda} = 0 $$ $$ \sum_{s' \in S^+} \theta_{\pi, s,s'} = 1 $$ $$ \sum_{s' \in S^+} -{N(s,s')} = \lambda $$ $$ - N(s) = \lambda $$

Substituting back to $[1]$ F.O.C we obtain the MLE

$$ \hat{\theta}_{s,s'}^{\mathrm{MLE}} = \hat{p}_\pi^{\mathrm{MLE}}(s' \mid s) = \frac{N(s,s')}{N(s)}. $$

Note if you take S.O.C it is easy to see this is indeed a maximum.

Bringing it all together

We now have all the pieces needed to prove the main claim.

From the fixed-point equation of batch TD(0), we derived that for every nonterminal state $s$ visited in the trajectories,

$$ v(s) = \sum_{s' \in S^+} \frac{N(s,s')}{N(s)} \Bigl[ \bar r(s,s') + \gamma v(s') \Bigr]. $$

On the other hand, after rewriting the Bellman operator under the policy-induced probability measure, we showed that policy evaluation in the induced MRP can be written as

$$ (T_\pi v)(s) = \sum_{s'} p_\pi(s' \mid s) \Bigl[ E_\pi[r \mid s,s'] + \gamma v(s') \Bigr]. $$

So if we now replace the unknown quantities in the Bellman operator by their maximum-likelihood estimates, we obtain

$$ (\hat T_\pi v)(s) = \sum_{s' \in S^+} \hat p_\pi^{\mathrm{MLE}}(s' \mid s) \Bigl[ \hat E_\pi^{\mathrm{MLE}}[r \mid s,s'] + \gamma v(s') \Bigr]. $$

From the transition-likelihood calculation, we already derived

$$ \hat p_\pi^{\mathrm{MLE}}(s' \mid s) = \frac{N(s,s')}{N(s)}. $$

For the reward term, there is a subtle but important point, batch TD(0) uses the empirical conditional mean

$$ \bar r(s,s') = \frac{1}{N(s,s')} \sum_i \sum_t \mathbf{1}\{S_t^{(i)}=s,\;S_{t+1}^{(i)}=s'\}R_{t+1}^{(i)}. $$

Therefore, the equivalence with MLE + DP holds only when the $p^{MLE}_{\pi}(r | s, s')$ is such that the conditional expectation of reward coincides with this empirical conditional mean. This is true for many common models (Gaussian), but not for all (e.g. Laplace).

Substituting these two estimators into the Bellman operator gives

$$ (\hat T_\pi v)(s) = \sum_{s' \in S^+} \frac{N(s,s')}{N(s)} \Bigl[ \bar r(s,s') + \gamma v(s') \Bigr]. $$

But this is exactly the same equation as the fixed-point equation of batch TD(0). Therefore, the fixed point of batch TD(0) coincides with the fixed point obtained by

estimating the policy-induced transition and reward model by maximum likelihood from the observed trajectories, and then
running dynamic programming policy evaluation with that estimated model.

In other words, batch TD(0) converges to the state-value function that is the fixed point of the Bellman equation for the empirical MRP induced by the maximum-likelihood model estimated from the data. If you view it this way, TD(0) is not just an update rule - it is implicitly solving that Bellman fixed-point equation by leveraging the Markov structure of the problem.

Experiment: What approach is better and when?

In the previous part of the blog, we derived that the fixed point of empirical model + DP is the same as the fixed point of batch TD(0). In this part, I want to understand which regimes are better suited for each approach. For a second, let's think about the time complexity. A single batch TD sweep over all trajectories is linear in the total number of sampled transitions; if we denote that number by $m$, then one batch TD sweep costs $O(m)$.

For certainty-equivalent evaluation, we first compile the sampled transitions into an empirical MRP. This construction scans the $m$ sampled transitions once, but the resulting model stores only the $E$ unique observed transition edges, where an edge is a distinct state-transition tuple $(s,s')$ with $N(s,s') > 0$. After that, one application of the empirical Bellman operator costs $O(E)$. Instead of repeatedly applying the Bellman operator, one could also solve the linear DP problem directly, for example by solving $(I - \gamma \hat P)v = \hat r$. A naive matrix inverse would be unstable and would ignore sparsity, so in the experiments below I use repeated applications of the empirical Bellman operator until convergence as the baseline DP solver.

Radius Knob: Increasing Empirical Branching

Intuitively, it is clear that the larger the number of edges is, the more advantageous it becomes to use TD(0). So the main idea of the experiment is to have an environment where we can experiment with different levels of connectivity between states. We start from a very sparse version of the environment and gradually move to a denser one.

The environment is very simple: you start from the upper-left point on a $256 \times 256$ grid and need to get to the goal point in the lower-right corner. The fixed policy is deterministic: it moves right until it reaches the last column, and then moves down toward the goal. With probability $0.75$, the environment respects this choice. With probability $0.25$, it teleports you. The knob that regulates teleportation is what moves the environment from sparse connections between states to dense connections between states. The visualization below is a miniature version of the same geometry.

Current state Goal Policy move Teleport candidates

Teleport radius

1 5 teleport candidates

Teleport probability 0.25

Experiment states 65536

Local probability 0.75

For different values of teleportation radius, I collect exactly 6K sampled transitions. Then I run two approaches (batch TD(0) and Empirical MRP + DP) to the same convergence tolerance.

One interesting aspect here is that in the experiment I set $\alpha = 1$, which algebraically makes both updates equivalent. For a state $s$, the batch TD update becomes

$$ v_{\text{new}}(s) = v(s) + \frac{1}{N(s)} \sum_{t:S_t=s} \left[ R_{t+1} + \gamma v(S_{t+1}) - v(s) \right] = \frac{1}{N(s)} \sum_{t:S_t=s} \left[ R_{t+1} + \gamma v(S_{t+1}) \right]. $$

Grouping the same samples by their successor state gives the empirical Bellman update

$$ v_{\text{new}}(s) = \sum_{s'} \frac{N(s,s')}{N(s)} \left[ \bar r(s,s') + \gamma v(s') \right]. $$

So in this experiment the two methods perform the same mathematical update. They just use different representations: batch TD operates on raw sampled transitions, while the certainty-equivalent version operates on the compressed set of observed edges, and as such will almost always be faster at the sweep stage. The problem is that building that compressed representation, the empirical MRP, is costly, despite being done only once.

The efficiency of this compression depends on how many times the same edge was visited. If many raw samples repeat the same transition $(s,s')$, then the empirical MRP can replace all of those repeated samples with one edge count and one average reward. But the smaller the number of repeated visits per edge, the less compression the empirical MRP provides. In the limiting case where every observed edge is visited only once, Empirical MRP + DP gives no compression gain during the update stage, but still pays the upfront cost of constructing the edge representation.

To compare empirical behavior, the left plot reports wall-clock time for this Python implementation. For each radius, I generate 10 independent datasets and plot the median runtime. The Empirical MRP + DP timing includes building the empirical model from the 6K transitions and then applying the empirical Bellman operator until convergence. The batch TD timing includes preparing the raw transition representation and then sweeping over those transitions until convergence.

These timings are implementation-dependent, but they answer the practical question directly: in this implementation, which approach finishes faster as the environment becomes more connected? The graph-density plot shows the reason for the trend by tracking $E/m$, the fraction of raw samples that remain as unique observed edges. The memory plot uses stored records as a simple proxy: batch TD stores roughly $S + m$ values/transitions, while Empirical MRP + DP stores roughly $S + E$ values/edges.

Results are generated from 10 independent simulated datasets per radius; wall-clock values are medians from the local Python implementation.

As we mentioned, batch TD touches $m$ raw transitions each sweep, while Empirical MRP + DP touches $E$ empirical edges after the model has been built. As the radius grows, $E/m$ rises, and the compressed empirical model becomes less cheap relative to replaying the batch.