Value function
- For each possible policy $\pi$ the agent might adopt, we can define an evaluation
function over states
\[
V^{\pi}(s) \equiv r_{t} + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}
\]
where $r_{t}, r_{t+1}, \ldots$ are the rewards generated by following policy
$\pi$ starting at state $s$, and $\gamma \in [0,1)$ is the discount factor.
- $V^{\pi}(s)$ is the discounted cumulative reward obtained by following policy $\pi$ from state $s$ (a small numeric sketch follows this list).
- We can now say that the learning task is to learn the optimal policy $\pi^{*}$
\[ \pi^{*} \equiv \arg\max_{\pi} V^{\pi}(s) \quad (\forall s) \]
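
To make the sum concrete, here is a minimal sketch in Python of the discounted return for a finite reward trace, truncating the infinite sum after the observed steps (with $\gamma < 1$ the tail's contribution shrinks geometrically). The `discounted_return` helper and the reward values are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """Finite-horizon approximation of sum_i gamma^i * r_{t+i}."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Hypothetical reward trace generated by following some policy pi from state s.
rewards = [0, 0, 10, 0, 5]
print(discounted_return(rewards))  # 0.81*10 + 0.6561*5 = 11.3805
```

Comparing this quantity across policies (for every state) is exactly the criterion the $\arg\max$ above uses to single out $\pi^{*}$.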