Nondeterministic Rewards and Actions
- The Q-learning algorithm presented so far works only for
deterministic $r$ and $\delta$. Let's fix that.
- First we redefine $V, Q$ by taking expected values
\[
V^{\pi}(s) \equiv E[ r_{t} + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \ldots ]
\]
\[
V^{\pi}(s) \equiv E \left[ \sum_{i=0}^{\infty} \gamma^{i} r_{t+i} \right]
\]
so that
\[ Q(s,a) \equiv E[r(s,a) + \gamma V^{*}(\delta(s,a)) ] \]
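- As a quick illustration (not from the slides), the sketch below computes $Q(s,a) = E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a) V^{*}(s')$ by value iteration on a tiny tabular MDP. The transition probabilities P, expected rewards R, and the helper value_iteration are hypothetical example names, not part of the lecture.

import numpy as np

gamma = 0.9  # discount factor

# Hypothetical 2-state, 2-action MDP used only for illustration.
# P[s, a, s'] = probability that delta(s, a) lands in state s'
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
# R[s, a] = expected immediate reward E[r(s, a)]
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

def value_iteration(P, R, gamma, iters=1000):
    """Iterate V(s) <- max_a ( E[r(s,a)] + gamma * sum_s' P(s'|s,a) V(s') )."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # Q[s, a] = E[r(s,a) + gamma V(delta(s,a))]
        V = Q.max(axis=1)         # greedy backup gives the new V estimate
    return V, Q

V_star, Q_star = value_iteration(P, R, gamma)
print("V* =", V_star)
print("Q  =", Q_star)

- Here the transition model $P$ is assumed known; Q-learning itself must instead estimate these expectations from observed transitions.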