Temporal Difference Learning
- $Q$ learning acts to reduce the difference between
successive $Q$ estimates using one-step time differences:
\[ Q^{(1)}(s_t,a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1},a) \]
So, why not use two steps?
\[ Q^{(2)}(s_t,a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a}
\hat{Q}(s_{t+2},a) \]
Or $n$ steps?
\[ Q^{(n)}(s_t,a_t) \equiv r_t + \gamma r_{t+1} + \cdots
+ \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a}\hat{Q}(s_{t+n},a) \]
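A minimal sketch of the $n$-step estimate above, assuming a tabular $\hat{Q}$ stored as a NumPy array and an already-recorded trajectory; the names `n_step_target`, `Q_hat`, and `s_next` are illustrative and not from the slide.

```python
import numpy as np

def n_step_target(rewards, Q_hat, s_next, gamma, n):
    """n-step Q estimate: n discounted rewards plus a bootstrapped
    max-Q term at the state reached after those n steps.

    rewards : the rewards r_t, ..., r_{t+n-1} (at least n of them)
    Q_hat   : current tabular estimate, shape (num_states, num_actions)
    s_next  : the state s_{t+n}
    """
    target = sum(gamma**k * r for k, r in enumerate(rewards[:n]))
    return target + gamma**n * np.max(Q_hat[s_next])
```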
- We can blend all of these estimates with a constant
$0 \le \lambda \le 1$ that weights the $n$-step estimates geometrically:
\[ Q^{\lambda}(s_t,a_t) \equiv (1-\lambda)\left[
Q^{(1)}(s_t,a_t) + \lambda Q^{(2)}(s_t,a_t) + \lambda^2 Q^{(3)}(s_t,a_t) +
\cdots \right] \]
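A sketch of the blended estimate under the same assumptions, reusing the hypothetical `n_step_target` above; unlike the infinite sum in the definition, it truncates the series at the end of the recorded episode.

```python
def lambda_target(rewards, next_states, Q_hat, gamma, lam):
    """Blend the 1-step, 2-step, ... estimates for (s_t, a_t) with
    geometric weights (1 - lam) * lam**(n - 1).

    rewards     : r_t, r_{t+1}, ... observed after taking a_t in s_t
    next_states : s_{t+1}, s_{t+2}, ... so next_states[n-1] is s_{t+n}
    """
    total = 0.0
    for n in range(1, len(rewards) + 1):
        q_n = n_step_target(rewards, Q_hat, next_states[n - 1], gamma, n)
        total += (1 - lam) * lam**(n - 1) * q_n
    return total
```

Setting $\lambda = 0$ recovers the one-step estimate $Q^{(1)}$, while values of $\lambda$ closer to 1 put more weight on longer lookaheads.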