Value function
- For each possible policy $\pi$ the agent might adopt, we can define an evaluation
function over states
\[
V^{\pi}(s) \equiv r_{t} + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}
\]
where $r_{t}, r_{t+1}, \ldots$ are the rewards generated by following policy
$\pi$ starting at state $s$, and $\gamma \in [0,1)$ is the discount factor.
- $V^{\pi}(s)$ is the discounted cumulative reward obtained by following policy $\pi$ from state $s$ (a small numeric sketch follows this list).
- We can now say that the learning task is to learn the optimal policy $\pi^{*}$
\[ \pi^{*} \equiv \arg\max_{\pi} V^{\pi}(s) \quad (\forall s) \]
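
To make the sum concrete, here is a minimal sketch in Python of the discounted return for a finite reward trace, truncating the infinite sum after the observed steps (with $\gamma < 1$ the tail's contribution shrinks geometrically). The `discounted_return` helper and the reward values are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """Finite-horizon approximation of sum_i gamma^i * r_{t+i}."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Hypothetical reward trace generated by following some policy pi from state s.
rewards = [0, 0, 10, 0, 5]
print(discounted_return(rewards))  # 0.81*10 + 0.6561*5 = 11.3805
```

Comparing this quantity across policies (for every state) is exactly the criterion the $\arg\max$ above uses to single out $\pi^{*}$.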