Learning Q
- We notice that $Q$ and $V^*$ are closely related, specifically:
\[ V^{*}(s) = \max_{a'}Q(s,a') \]
- This allows us to write $Q$ recursively as
\[
Q(s_t,a_t) = r(s_t,a_t) + \gamma V^{*}(\delta(s_t,a_t)) \]
\[
Q(s_t,a_t) = r(s_t,a_t) + \gamma \max_{a'}Q(s_{t+1},a') \]
where $s_{t+1} = \delta(s_t,a_t)$ is the state reached by taking action $a_t$ in state $s_t$.
- Now, we let $\hat{Q}$ denote the learner's current approximation
to $Q$ and use the training rule
\[ \hat{Q}(s,a) \leftarrow r + \gamma \max_{a'}\hat{Q}(s',a') \]
where $s'$ is the state resulting from applying action $a$ in state $s$
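The training rule above can be sketched in code. This is a minimal example, assuming a hypothetical 3-state chain environment (states 0 to 2, actions left/right, reward 1 for stepping into the goal state 2, which is absorbing); the environment is illustrative, not from the slides.

```python
from collections import defaultdict

GAMMA = 0.9
ACTIONS = ["left", "right"]
TERMINAL = 2                      # absorbing goal state (hypothetical example)

def delta(s, a):
    """Deterministic transition function delta(s, a)."""
    return min(s + 1, 2) if a == "right" else max(s - 1, 0)

def reward(s, a):
    """r(s, a): reward 1 for the transition into the goal state."""
    return 1.0 if s != TERMINAL and delta(s, a) == TERMINAL else 0.0

Q = defaultdict(float)            # Q-hat, initialized to zero

def update(s, a):
    """Training rule: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    s_next = delta(s, a)
    # Terminal states have no future value to back up.
    future = 0.0 if s_next == TERMINAL else max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] = reward(s, a) + GAMMA * future

# Sweep all non-terminal state-action pairs; with a deterministic
# environment and bounded rewards, the values converge quickly.
for _ in range(10):
    for s in range(TERMINAL):
        for a in ACTIONS:
            update(s, a)

print(round(Q[(1, "right")], 2))  # → 1.0 (immediate reward, goal is next)
print(round(Q[(0, "right")], 2))  # → 0.9 (discounted one step: gamma * 1)
```

Note that this deterministic update simply overwrites $\hat{Q}(s,a)$; in a nondeterministic world the rule is softened with a learning rate.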
José M. Vidal