Batch TD(0) as Dynamic Programming on the Empirical MRP
A step-by-step derivation showing that the fixed point of batch TD(0) coincides with the Bellman solution of the empirical Markov reward process defined by transition-count MLEs and empirical conditional mean rewards.
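The claim can be checked numerically with a small sketch. The batch of transitions, the step size, and all variable names below are hypothetical, chosen only for illustration. Rewards are averaged per state, which equals the transition-weighted average of the conditional mean rewards. Termination is modeled as the probability mass missing from each row of the estimated transition matrix.

```python
import numpy as np

# Hypothetical batch of transitions (s, r, s_next); s_next = None marks termination.
batch = [
    (0, 0.0, 1), (1, 1.0, None),
    (0, 0.0, 1), (1, 0.0, 0),
    (0, 2.0, None),
    (1, 1.0, None),
]
n_states, gamma = 2, 0.9

# Empirical MRP: transition-count MLE P_hat and mean one-step reward r_hat per state.
counts = np.zeros((n_states, n_states))
reward_sum = np.zeros(n_states)
visits = np.zeros(n_states)
for s, r, s_next in batch:
    visits[s] += 1
    reward_sum[s] += r
    if s_next is not None:
        counts[s, s_next] += 1
P_hat = counts / visits[:, None]  # rows may sum to < 1; the remainder is termination
r_hat = reward_sum / visits

# Dynamic programming: solve the Bellman equation (I - gamma * P_hat) v = r_hat.
v_dp = np.linalg.solve(np.eye(n_states) - gamma * P_hat, r_hat)

# Batch TD(0): hold V fixed over a sweep of the batch, sum the TD increments,
# apply them all at once, and repeat until the values stop changing.
alpha = 0.05  # assumed step size, small enough for convergence on this batch
v_td = np.zeros(n_states)
for _ in range(5000):
    increment = np.zeros(n_states)
    for s, r, s_next in batch:
        bootstrap = gamma * v_td[s_next] if s_next is not None else 0.0
        increment[s] += alpha * (r + bootstrap - v_td[s])
    v_td += increment

print(np.allclose(v_td, v_dp))
```

On this batch the two value vectors agree to numerical precision, which is the coincidence the post derives in general.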