The main goal of this post is to examine more closely the connection between Temporal Difference methods, specifically batch TD(0), and Dynamic Programming. Sutton discusses several aspects of the TD <-> DP relationship, but here I try to focus on one particular question: how the fixed point of batch TD(0) relates to the fixed point of the Bellman operator when the model dynamics are estimated from observed trajectories by maximum likelihood.
Batch TD(0) as Dynamic Programming on the Empirical MRP
Problem Statement & Approach
Let us consider a prediction problem rather than a control problem. We want to evaluate the state-value function of a fixed policy $\pi(a \mid s)$ from a finite collection of observed trajectories. The same general logic can later be extended to state-action value functions, but here I focus only on state values.
In this section I will show that if batch TD(0) is run until convergence, then its fixed point coincides with the fixed point obtained by the following two-stage procedure:
- estimate the policy-induced Markov Reward Process from the observed trajectories by maximum likelihood;
- run dynamic programming policy evaluation on that estimated model.
Let us now formalize the assumptions and the claim.
Claim
We assume a finite episodic Markov Reward Process induced by a fixed policy $\pi$, and a collection of trajectories sampled from that same process. A single trajectory is
$$\tau_i = (S_0^{(i)}, R_1^{(i)}, S_1^{(i)}, R_2^{(i)}, \dots, S_{T_i}^{(i)}).$$We make the following assumptions.
- Markov reward process. The policy-induced process satisfies the Markov property: $$P(S_{t+1}, R_{t+1} \mid S_t, S_{t-1}, R_t, \dots) = P(S_{t+1}, R_{t+1} \mid S_t).$$
- Time-homogeneous dynamics. The policy-induced transition-reward law does not depend on time: $$P(S_{t+k+1}, R_{t+k+1} \mid S_{t+k}) = P(S_{t+1}, R_{t+1} \mid S_t).$$
- Reward-model restriction. For each observed transition pair $(s, s')$, we restrict attention to conditional reward models $p_\pi(r \mid s, s')$ whose maximum-likelihood estimator has conditional expectation equal to the empirical mean reward over all observed transitions from $s$ to $s'$.
Under these assumptions, the fixed point of the Bellman operator
$$(T_\pi v)(s) = \sum_r \sum_{s'} p_\pi(r, s' \mid s)\,[r + \gamma v(s')]$$applied to the maximum-likelihood estimate of the policy-induced Markov Reward Process is equal to the fixed point of batch TD(0), defined by
$$v(s) \leftarrow v(s) + \frac{\alpha}{N(s)} \sum_i \sum_{t : S_t^{(i)} = s} \left[R_{t+1}^{(i)} + \gamma v(S_{t+1}^{(i)}) - v(s)\right],$$where $N(s)$ is the number of times state $s$ was visited across all trajectories.
Before going further, let us define the following notation:
- $S$: the set of all observed nonterminal states,
- $S^+$: the set of all observed states, including terminal states.
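To make the update in the claim concrete, here is a minimal Python sketch of batch TD(0). The trajectory format (lists of `(s, r, s')` transitions), function name, and hyperparameter defaults are my own illustrative choices, not from the claim itself:

```python
def batch_td0(trajectories, gamma=0.9, alpha=0.5, n_sweeps=1000):
    """Batch TD(0): repeatedly apply the averaged update of the claim.

    Each trajectory is a list of observed transitions (s, r, s');
    states that are never updated (terminals) keep the value 0.
    """
    nonterminal = {s for traj in trajectories for (s, _, _) in traj}
    all_states = nonterminal | {s2 for traj in trajectories for (_, _, s2) in traj}
    v = {s: 0.0 for s in all_states}
    for _ in range(n_sweeps):
        new_v = dict(v)  # synchronous sweep: every state updates from the old values
        for s in nonterminal:
            # TD errors from every visit to s across all trajectories.
            errors = [r + gamma * v[s2] - v[s]
                      for traj in trajectories
                      for (s0, r, s2) in traj if s0 == s]
            # alpha / N(s) times the summed error = alpha times the mean error.
            new_v[s] = v[s] + alpha * sum(errors) / len(errors)
        v = new_v
    return v
```

On a single deterministic episode such as `[('A', 1.0, 'B'), ('B', 0.0, 'T')]` with `gamma=1.0`, the iteration settles at `v('B') = 0` and `v('A') = 1`, as expected.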
Proof Plan
When I was deriving these results, I intuitively followed these steps:
- Derive the fixed-point equation of batch TD(0)
- Rewrite the Bellman fixed-point equation for the policy-induced process into a form that makes its connection to the batch TD(0) fixed point explicit
- Compute the maximum-likelihood estimators of the relevant transition and reward probability distributions from the observed trajectories, substitute them into the reformulated Bellman equation, and verify that the resulting fixed point is exactly the same as the fixed point of batch TD(0).
In practice I iterated between the first two steps several times, but this is the cleanest way to present the argument.
Finding Fixed Point of Batch TD(0)
Simplify Batch TD(0) Update
Let us consider the batch TD(0) update:
$$ [Eq. 2] \ \ \ v(s) \leftarrow v(s) + \frac{\alpha}{N(s)} \sum_{i} \sum_{t : S_t^{(i)} = s} \Bigl[ R_{t+1}^{(i)} + \gamma v(S_{t+1}^{(i)}) - v(s) \Bigr] $$Here we average over all trajectories and all time steps within each trajectory for which we observed state $s$.
There is an additional simplification. Notice that $S_{t+1}$ is the state visited immediately after $s$: the update to the value $v(s)$ bootstraps only from states that followed $s$ across all observed trajectories. This can be expressed using a simple indicator function:
$$ \mathbf{1}\{S_t = s,\; S_{t+1} = s'\}. $$Hence the batch TD(0) update can be rewritten as
$$ [Eq. 3] \ \ \ v(s) \leftarrow v(s) + \frac{\alpha}{N(s)} \sum_i \sum_{t} \sum_{s' \in S^+} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} \Bigl[ R_{t+1}^{(i)} + \gamma v(s') - v(s) \Bigr] $$Notice that the sum now runs over all time steps $t$ rather than over the set $\{t : S_t^{(i)} = s\}$. This is valid because the event $\{S_t = s \wedge S_{t+1} = s'\}$ is contained in $\{S_t = s\}$, so the indicator function we added already zeroes out every time step with $S_t \neq s$.
Derive Fixed Point of Batch TD(0) Update
Let's show that the fixed point of batch TD(0) is
$$ v(s) = \sum_{s' \in S^+} \frac{N(s,s')}{N(s)} \Bigl[ \bar r(s,s') + \gamma v(s') \Bigr]. $$where
$$ N(s,s') = \sum_i \sum_t \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} $$ $$ N(s) = \sum_{s'} N(s,s'), $$ $$ \bar r(s,s')= \frac{1}{N(s,s')} \sum_i \sum_t \mathbf{1}\{S_t^{(i)}=s,\;S_{t+1}^{(i)}=s'\}R_{t+1}^{(i)}, \qquad N(s,s')>0. $$Derivation
At the fixed point, the update to $v(s)$ is zero for every $s \in S$, which by equation $[Eq. 3]$ means
$$ \sum_{s' \in S^+} \sum_i \sum_{t} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} \Bigl[ R_{t+1}^{(i)} + \gamma v(s') - v(s) \Bigr] = 0. $$This sum can be split into three components.
Part 1: reward term
Multiplying and dividing the reward term by $N(s,s')$,
$$ \sum_{s' \in S^+} \sum_i \sum_t \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} R_{t+1}^{(i)} = \sum_{s' \in S^+} N(s, s') \cdot \frac{1}{N(s, s')} \sum_i \sum_t \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} R_{t+1}^{(i)} = \sum_{s' \in S^+} N(s, s')\, \bar r(s, s') $$Part 2: bootstrapped next-state value term
Now consider the term involving $\gamma v(s')$. Since $\gamma v(s')$ does not depend on $i$ or $t$, it can be pulled outside the inner sums:
$$ \sum_{s' \in S^+} \sum_i \sum_{t} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} \gamma v(s') = \sum_{s' \in S^+} \gamma v(s') \sum_i \sum_{t} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} $$The inner sum is just the number of times, across all trajectories, that state $s$ was followed by state $s'$, which is exactly $N(s, s')$.
Summing over all successor states $s' \in S^+$, the bootstrapped value term becomes
$$ \gamma \sum_{s' \in S^+} N(s,s') v(s'). $$Part 3: current-state value term
Finally, consider the term involving $v(s)$. Since $v(s)$ does not depend on $s'$, $i$, or $t$, it can be factored out of all the sums:
$$ \sum_i \sum_{t} \sum_{s' \in S^+} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} v(s) = v(s) \sum_{s' \in S^+} \sum_i \sum_{t} \mathbf{1}\{S_t^{(i)} = s,\; S_{t+1}^{(i)} = s'\} $$The two inner sums already appeared in Part 2 and equal $N(s, s')$; moreover, $\sum_{s' \in S^+} N(s, s') = N(s)$, the number of times state $s$ was visited across all trajectories.
Part 4: Putting the three parts together
Substituting the three simplified expressions back into the fixed-point equation, we obtain
$$ \sum_{s' \in S^+} N(s,s') \bar r(s,s') + \gamma \sum_{s' \in S^+} N(s,s') v(s') - N(s)\, v(s) = 0. $$Rearranging,
$$ N(s)\, v(s) = \sum_{s' \in S^+} N(s,s') \bar r(s,s') + \gamma \sum_{s' \in S^+} N(s,s') v(s'). $$Dividing both sides by $N(s)$, we get
$$ v(s) = \sum_{s' \in S^+} \frac{N(s,s')}{N(s)} \Bigl[ \bar r(s,s') + \gamma v(s') \Bigr]. $$We only evaluate this equation for states $s \in S$, so $N(s)>0$. Also, $\bar r(s,s')$ is defined only for observed transition pairs with $N(s,s')>0$; pairs with $N(s,s')=0$ contribute zero in the indicator-based form.
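This fixed point can be computed directly from the counts, without running TD at all, by iterating the equation until it settles. Below is a sketch of that computation; the data format and names are my own, and the iteration converges for $\gamma < 1$ (the map is a contraction) and also for episodic data with $\gamma = 1$:

```python
from collections import defaultdict

def counts_fixed_point(trajectories, gamma=0.9, n_iters=2000):
    """Iterate v(s) = sum_{s'} N(s,s')/N(s) [ rbar(s,s') + gamma v(s') ].

    Trajectories are lists of (s, r, s') transitions; terminal states,
    never appearing as a source state, keep the value 0.
    """
    N = defaultdict(int)      # N(s, s'): transition counts
    R = defaultdict(float)    # total reward observed on s -> s'
    for traj in trajectories:
        for (s, r, s2) in traj:
            N[(s, s2)] += 1
            R[(s, s2)] += r
    Ns = defaultdict(int)     # N(s) = sum over s' of N(s, s')
    for (s, s2), n in N.items():
        Ns[s] += n
    v = defaultdict(float)
    for _ in range(n_iters):
        new_v = defaultdict(float, v)
        for s in Ns:          # only observed nonterminal states get updated
            new_v[s] = sum(n / Ns[s] * (R[(s, s2)] / n + gamma * v[s2])
                           for (s0, s2), n in N.items() if s0 == s)
        v = new_v
    return v
```

Note that `R[(s, s2)] / n` is exactly $\bar r(s,s')$, the empirical mean reward on observed $(s, s')$ transitions.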
Fixed Point of MLE + DP
Let us split this part into two subsections. In the first, we simplify the Bellman operator; in the second, we compute the necessary maximum-likelihood estimates from existing trajectories and express the Bellman operator through those estimates.
Simplify Bellman Operator
Given the Bellman operator of the policy-induced Markov Reward Process
$$(T_{\pi} v)(s) = \sum_r \sum_{s'} p_\pi(R_{t+1}=r, S_{t+1}=s' \mid S_t = s)[r + \gamma v(s')]$$we now show that it can be rewritten as
$$ (T_{\pi} v)(s)= \sum_{s'} p_\pi(s' \mid s) \Bigl[E_\pi[r \mid s,s'] + \gamma v(s')\Bigr]. $$Starting from the MRP form, factor the joint conditional law as
$$ p_\pi(r,s' \mid s) = p_\pi(r \mid s,s')\,p_\pi(s' \mid s). $$Substituting this factorization into the Bellman operator immediately gives
$$(T_\pi v)(s)= \sum_{s'} p_{\pi}(s'|s) \Bigl[ \sum_r p_{\pi}(r| s', s)r + \gamma v(s') \Bigr]$$ $$ = \sum_{s'} p_\pi(s' \mid s) \Bigl[E_\pi[r \mid s,s'] + \gamma v(s')\Bigr]. $$Having rewritten the Bellman operator, we can see which distributions we need maximum-likelihood estimates for: $p_{\pi}(s' \mid s)$ and $p_{\pi}(r \mid s, s')$. Comparing with the fixed point of batch TD(0), it is already intuitively clear what these MLEs should be.
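The rewritten operator maps directly onto code. A sketch, assuming hypothetical model containers `P[s][s']` for $p_\pi(s' \mid s)$ and `ER[(s, s')]` for $E_\pi[r \mid s, s']$ (both names are mine, for illustration):

```python
def bellman_backup(v, P, ER, gamma=0.9):
    """One application of (T_pi v)(s) = sum_{s'} p(s'|s) [E[r|s,s'] + gamma v(s')].

    P maps each nonterminal state s to a dict of successor probabilities;
    ER maps a pair (s, s') to the conditional mean reward.
    Terminal states, absent from v, default to value 0.
    """
    return {s: sum(p * (ER[(s, s2)] + gamma * v.get(s2, 0.0))
                   for s2, p in P[s].items())
            for s in P}
```

Iterating this backup to convergence is exactly dynamic programming policy evaluation on the given model.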
Maximum Likelihood Estimation
Maximum likelihood of trajectory
Even though from this point onward we work directly with the policy-induced MRP, it is useful to make explicit how the MRP trajectory law is derived from the underlying MDP under a fixed policy.
Consider first the MDP view of a single trajectory:
$$ \tau_i^{\mathrm{MDP}} = \bigl( S_0^{(i)}, A_0^{(i)}, R_1^{(i)}, S_1^{(i)}, A_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \bigr). $$Conditioning on the initial state and using the Markov property together with a fixed policy $\pi$, the trajectory probability is
$$ p\!\left( A_0^{(i)}, R_1^{(i)}, S_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \mid S_0^{(i)} \right) = \prod_{t=0}^{T-1} p\!\left( R_{t+1}^{(i)}, S_{t+1}^{(i)} \mid S_t^{(i)}, A_t^{(i)} \right) \pi\!\left( A_t^{(i)} \mid S_t^{(i)} \right). $$For policy evaluation, actions are not decision variables anymore; they are integrated out by the fixed policy. Therefore the induced MRP kernel is
$$ p_\pi(r,s' \mid s) = \sum_{a \in A} p(r,s' \mid s,a)\,\pi(a \mid s). $$Marginalizing over the full action sequence gives the trajectory law of the induced MRP:
$$ p_\pi\!\left( R_1^{(i)}, S_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \mid S_0^{(i)} \right) = \sum_{a_0 \in A}\cdots\sum_{a_{T-1} \in A} \prod_{t=0}^{T-1} p\!\left( R_{t+1}^{(i)}, S_{t+1}^{(i)} \mid S_t^{(i)}, a_t \right) \pi(a_t \mid S_t^{(i)}) = \prod_{t=0}^{T-1} p_\pi\!\left( R_{t+1}^{(i)}, S_{t+1}^{(i)} \mid S_t^{(i)} \right). $$Hopefully this adds clarity; from here on we work in MRP notation. A single trajectory is
$$ \tau_i = \bigl( S_0^{(i)}, R_1^{(i)}, S_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \bigr). $$The conditional probability of the full trajectory, given the initial state, is
$$ p_\pi(\tau_i \mid S_0^{(i)}) = p_\pi\bigl( R_1^{(i)}, S_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \mid S_0^{(i)} \bigr). $$Because the policy-induced process is an MRP, this can be rewritten as
$$ p_\pi(\tau_i \mid S_0^{(i)}) = \prod_{t=0}^{T-1} p_\pi\!\left( R_{t+1}^{(i)}, S_{t+1}^{(i)} \mid S_t^{(i)} \right). $$From here we can factor the joint conditional distribution as
$$ p_\pi(r,s' \mid s) = p_\pi(r \mid s,s')\,p_\pi(s' \mid s), $$which gives
$$ p_\pi\!\left( R_1^{(i)}, S_1^{(i)}, \ldots, R_T^{(i)}, S_T^{(i)} \mid S_0^{(i)} \right) = \prod_{t=0}^{T-1} p_\pi\!\left( R_{t+1}^{(i)} \mid S_t^{(i)}, S_{t+1}^{(i)} \right) p_\pi\!\left( S_{t+1}^{(i)} \mid S_t^{(i)} \right). $$From here one can derive MLEs for
$$ p_\pi(S_{t+1} \mid S_t) \quad\text{and}\quad p_\pi(R_{t+1} \mid S_t, S_{t+1}) $$such that the probability of the observed trajectories is maximized: we parameterize both distributions and choose the parameters under which the trajectories we actually saw are most likely.
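Because the trajectory law factorizes per transition, the total log-likelihood splits into a transition part and a reward part, and the two can be maximized independently. A sketch of that objective (data format, names, and the `reward_logpdf` callback are my own illustrative choices):

```python
import math

def trajectory_log_likelihood(trajectories, P, reward_logpdf):
    """log p(data) = sum over transitions of log p(s'|s) + log p(r|s,s').

    P[s][s'] is a candidate transition kernel; reward_logpdf(r, s, s2) is the
    log-density of a candidate conditional reward model p(r | s, s').
    """
    return sum(math.log(P[s][s2]) + reward_logpdf(r, s, s2)
               for traj in trajectories
               for (s, r, s2) in traj)
```

Maximizing the first summand over the transition parameters and the second over the reward parameters are separate problems, which is exactly why the next subsections treat the two MLEs one at a time.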
MLE for Transition Kernel $p_{\pi}(s' | s)$
Claim
Assuming a direct categorical parameterization of the transition kernel, with no additional structural assumptions, i.e. $p_{\pi}(s'|s) = \theta_{\pi,s,s'}$ with $\sum_{s'}\theta_{\pi,s,s'} = 1$, the maximum-likelihood estimate is
$$ \hat{p}_\pi^{\mathrm{MLE}}(s' \mid s) = \frac{N(s,s')}{N(s)}. $$Derivation
Each individual transition out of $s$ is a categorical draw, so the full set of observed transitions out of $s$ follows a multinomial law. Hence the $\theta$ that maximizes the likelihood of the trajectories solves
$$ \max_{\theta_s} L(\theta_s) = \max_{\theta_s} \prod_{s' \in S^+} \theta_{\pi, s,s'}^{\,N(s,s')}, $$which is equivalent to maximizing the log-likelihood
$$ \ell(\theta_s) = \sum_{s' \in S^+} N(s,s') \log \theta_{\pi, s,s'}. $$The parameters must also satisfy the normalization constraint
$$ \sum_{s' \in S^+} \theta_{\pi, s,s'} = 1. $$This is a constrained optimization problem, which can be expressed through a Lagrangian:
$$ L(\theta_s,\lambda) = \sum_{s' \in S^+} N(s,s') \log \theta_{\pi, s,s'} + \lambda \left( \sum_{s' \in S^+} \theta_{\pi, s,s'} - 1 \right). $$F.O.C:
$$ [1] \ \frac{\partial L}{\partial \theta_{\pi, s,s'}} = \frac{N(s,s')}{\theta_{\pi, s,s'}} + \lambda = 0 \quad\Longrightarrow\quad \theta_{\pi, s,s'} = -\frac{N(s,s')}{\lambda}. $$ $$ [2] \ \frac{\partial L}{\partial \lambda} = 0 \quad\Longrightarrow\quad \sum_{s' \in S^+} \theta_{\pi, s,s'} = 1 \quad\Longrightarrow\quad \sum_{s' \in S^+} -\frac{N(s,s')}{\lambda} = 1 \quad\Longrightarrow\quad \lambda = -N(s). $$Substituting $\lambda = -N(s)$ back into F.O.C $[1]$, we obtain the MLE
$$ \hat{\theta}_{s,s'}^{\mathrm{MLE}} = \hat{p}_\pi^{\mathrm{MLE}}(s' \mid s) = \frac{N(s,s')}{N(s)}. $$Checking the S.O.C, it is easy to see this is indeed a maximum.
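In code, this closed-form MLE is nothing more than normalized transition counts. A small sketch (data format and names are my own illustrative choices):

```python
from collections import Counter

def mle_transition_kernel(trajectories):
    """Count-based MLE of the transition kernel: p_hat(s' | s) = N(s, s') / N(s).

    Trajectories are lists of (s, r, s') transitions; rewards are ignored here.
    """
    N_ss = Counter((s, s2) for traj in trajectories for (s, _, s2) in traj)  # N(s, s')
    N_s = Counter(s for traj in trajectories for (s, _, _) in traj)          # N(s)
    return {(s, s2): n / N_s[s] for (s, s2), n in N_ss.items()}
```

Only observed $(s, s')$ pairs appear in the result; unobserved pairs implicitly get probability zero, consistent with the count-based formula.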
Bringing it all together
We now have all the pieces needed to prove the main claim.
From the fixed-point equation of batch TD(0), we derived that for every nonterminal state $s$ visited in the trajectories,
$$ v(s) = \sum_{s' \in S^+} \frac{N(s,s')}{N(s)} \Bigl[ \bar r(s,s') + \gamma v(s') \Bigr]. $$On the other hand, after rewriting the Bellman operator under the policy-induced probability measure, we showed that policy evaluation in the induced MRP can be written as
$$ (T_\pi v)(s) = \sum_{s'} p_\pi(s' \mid s) \Bigl[ E_\pi[r \mid s,s'] + \gamma v(s') \Bigr]. $$So if we now replace the unknown quantities in the Bellman operator by their maximum-likelihood estimates, we obtain
$$ (\hat T_\pi v)(s) = \sum_{s' \in S^+} \hat p_\pi^{\mathrm{MLE}}(s' \mid s) \Bigl[ \hat E_\pi^{\mathrm{MLE}}[r \mid s,s'] + \gamma v(s') \Bigr]. $$From the transition-likelihood calculation, we already derived
$$ \hat p_\pi^{\mathrm{MLE}}(s' \mid s) = \frac{N(s,s')}{N(s)}. $$For the reward term there is a subtle but important point: batch TD(0) uses the empirical conditional mean
$$ \bar r(s,s') = \frac{1}{N(s,s')} \sum_i \sum_t \mathbf{1}\{S_t^{(i)}=s,\;S_{t+1}^{(i)}=s'\}R_{t+1}^{(i)}. $$Therefore, the equivalence with MLE + DP holds only when $\hat p^{\mathrm{MLE}}_{\pi}(r \mid s, s')$ has conditional expectation equal to this empirical conditional mean. This holds for many common reward models (e.g. a Gaussian, whose MLE of the mean is the sample average), but not for all (e.g. a Laplace model, whose MLE of the location parameter is the sample median).
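To make the caveat concrete: for a Gaussian reward model the MLE of the mean is the sample average, matching $\bar r(s,s')$, whereas for a Laplace model the MLE of the location parameter is the sample median, which in general differs. The rewards below are hypothetical data for a single fixed $(s, s')$ pair:

```python
import statistics

# Rewards observed on one fixed (s, s') pair -- illustrative data.
rewards = [0.0, 0.0, 10.0]

# Gaussian MLE of the mean = sample average = rbar(s, s').
gaussian_mle = statistics.mean(rewards)    # 10/3

# Laplace MLE of the location = sample median, which differs from rbar here.
laplace_mle = statistics.median(rewards)   # 0.0
```

With a Laplace reward model, the estimated Bellman operator would back up the median (0.0) instead of $\bar r(s,s') = 10/3$, and its fixed point would no longer match batch TD(0).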
Substituting these two estimators into the Bellman operator gives
$$ (\hat T_\pi v)(s) = \sum_{s' \in S^+} \frac{N(s,s')}{N(s)} \Bigl[ \bar r(s,s') + \gamma v(s') \Bigr]. $$But this is exactly the same equation as the fixed-point equation of batch TD(0). Therefore, the fixed point of batch TD(0) coincides with the fixed point obtained by
- estimating the policy-induced transition and reward model by maximum likelihood from the observed trajectories, and then
- running dynamic programming policy evaluation with that estimated model.
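As a final sanity check, the whole argument can be verified numerically on a toy batch of trajectories: run batch TD(0) to its fixed point, build the count-based MLE model, run DP policy evaluation on it, and compare. Everything below, including the data and the choice $\alpha = 0.5$, is an illustrative sketch:

```python
from collections import defaultdict

# A toy batch of episodic trajectories: lists of (s, r, s') transitions.
trajs = [[('A', 1.0, 'B'), ('B', 0.0, 'T')],
         [('A', 1.0, 'B'), ('B', 2.0, 'T')],
         [('A', 0.0, 'T')]]
gamma = 0.9
nonterminal = {s for tr in trajs for (s, _, _) in tr}

# Route 1: batch TD(0), iterated to its fixed point (alpha = 0.5).
v_td = defaultdict(float)
for _ in range(2000):
    new = dict(v_td)
    for s in nonterminal:
        errors = [r + gamma * v_td[s2] - v_td[s]
                  for tr in trajs for (s0, r, s2) in tr if s0 == s]
        new[s] = v_td[s] + 0.5 * sum(errors) / len(errors)
    v_td = defaultdict(float, new)

# Route 2: MLE model (transition counts + empirical mean rewards), then DP.
N, Rsum = defaultdict(int), defaultdict(float)
for tr in trajs:
    for (s, r, s2) in tr:
        N[(s, s2)] += 1
        Rsum[(s, s2)] += r
Ns = defaultdict(int)
for (s, s2), n in N.items():
    Ns[s] += n
v_dp = defaultdict(float)
for _ in range(2000):
    new = dict(v_dp)
    for s in Ns:
        new[s] = sum(n / Ns[s] * (Rsum[(s, s2)] / n + gamma * v_dp[s2])
                     for (s0, s2), n in N.items() if s0 == s)
    v_dp = defaultdict(float, new)

# The two fixed points coincide on every observed nonterminal state.
assert all(abs(v_td[s] - v_dp[s]) < 1e-8 for s in nonterminal)
```

On this data both routes give $v(B) = \bar r(B, T) = 1$ and $v(A) = \tfrac{2}{3}(1 + 0.9 \cdot 1) \approx 1.267$, exactly as the claim predicts.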