Temporal Difference Learning
- $Q$ learning acts to reduce the difference between
successive $Q$ estimates using one-step time differences:
\[ Q^{(1)}(s_t,a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1},a) \]
So, why not use two steps?
\[ Q^{(2)}(s_t,a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a}
\hat{Q}(s_{t+2},a) \]
Or $n$ steps?
\[ Q^{(n)}(s_t,a_t) \equiv r_t + \gamma r_{t+1} + \cdots
+ \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a}\hat{Q}(s_{t+n},a) \]
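A minimal sketch of the $n$-step estimate above, assuming a tabular $\hat{Q}$ stored as a NumPy array and an already-recorded trajectory; the names `n_step_target`, `Q_hat`, and `s_next` are illustrative and not from the slide.

```python
import numpy as np

def n_step_target(rewards, Q_hat, s_next, gamma, n):
    """n-step Q estimate: n discounted rewards plus a bootstrapped
    max-Q term at the state reached after those n steps.

    rewards : the rewards r_t, ..., r_{t+n-1} (at least n of them)
    Q_hat   : current tabular estimate, shape (num_states, num_actions)
    s_next  : the state s_{t+n}
    """
    target = sum(gamma**k * r for k, r in enumerate(rewards[:n]))
    return target + gamma**n * np.max(Q_hat[s_next])
```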
- We can blend all of these estimates with a constant
$0 \le \lambda \le 1$ that weights the $n$-step estimates geometrically:
\[ Q^{\lambda}(s_t,a_t) \equiv (1-\lambda)\left[
Q^{(1)}(s_t,a_t) + \lambda Q^{(2)}(s_t,a_t) + \lambda^2 Q^{(3)}(s_t,a_t) +
\cdots \right] \]
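A sketch of the blended estimate under the same assumptions, reusing the hypothetical `n_step_target` above; unlike the infinite sum in the definition, it truncates the series at the end of the recorded episode.

```python
def lambda_target(rewards, next_states, Q_hat, gamma, lam):
    """Blend the 1-step, 2-step, ... estimates for (s_t, a_t) with
    geometric weights (1 - lam) * lam**(n - 1).

    rewards     : r_t, r_{t+1}, ... observed after taking a_t in s_t
    next_states : s_{t+1}, s_{t+2}, ... so next_states[n-1] is s_{t+n}
    """
    total = 0.0
    for n in range(1, len(rewards) + 1):
        q_n = n_step_target(rewards, Q_hat, next_states[n - 1], gamma, n)
        total += (1 - lam) * lam**(n - 1) * q_n
    return total
```

Setting $\lambda = 0$ recovers the one-step estimate $Q^{(1)}$, while values of $\lambda$ closer to 1 put more weight on longer lookaheads.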