Learning Q
- We notice that $Q$ and $V^*$ are closely related, specifically:
\[ V^{*}(s) = \max_{a'}Q(s,a') \]
- This allows us to write $Q$ recursively as
\[
Q(s_t,a_t) = r(s_t,a_t) + \gamma V^{*}(\delta(s_t,a_t)) \]
\[
Q(s_t,a_t) = r(s_t,a_t) + \gamma \max_{a'}Q(s_{t+1},a') \]
where $s_{t+1} = \delta(s_t,a_t)$ is the state reached by taking action $a_t$ in state $s_t$.
- Now, we let $\hat{Q}$ denote the learner's current approximation
to $Q$ and use the training rule
\[ \hat{Q}(s,a) \leftarrow r + \gamma \max_{a'}\hat{Q}(s',a') \]
where $s'$ is the state resulting from applying action $a$ in state $s$
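The training rule above can be sketched in code. This is a minimal example, assuming a hypothetical 3-state chain environment (states 0 to 2, actions left/right, reward 1 for stepping into the goal state 2, which is absorbing); the environment is illustrative, not from the slides.

```python
from collections import defaultdict

GAMMA = 0.9
ACTIONS = ["left", "right"]
TERMINAL = 2                      # absorbing goal state (hypothetical example)

def delta(s, a):
    """Deterministic transition function delta(s, a)."""
    return min(s + 1, 2) if a == "right" else max(s - 1, 0)

def reward(s, a):
    """r(s, a): reward 1 for the transition into the goal state."""
    return 1.0 if s != TERMINAL and delta(s, a) == TERMINAL else 0.0

Q = defaultdict(float)            # Q-hat, initialized to zero

def update(s, a):
    """Training rule: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    s_next = delta(s, a)
    # Terminal states have no future value to back up.
    future = 0.0 if s_next == TERMINAL else max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] = reward(s, a) + GAMMA * future

# Sweep all non-terminal state-action pairs; with a deterministic
# environment and bounded rewards, the values converge quickly.
for _ in range(10):
    for s in range(TERMINAL):
        for a in ACTIONS:
            update(s, a)

print(round(Q[(1, "right")], 2))  # → 1.0 (immediate reward, goal is next)
print(round(Q[(0, "right")], 2))  # → 0.9 (discounted one step: gamma * 1)
```

Note that this deterministic update simply overwrites $\hat{Q}(s,a)$; in a nondeterministic world the rule is softened with a learning rate.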
José M. Vidal