The Learning Task

Given that the agent inhabits a Markov process, it must now learn an action policy $\pi : S \rightarrow A$ that maximizes the expected discounted sum of rewards \[ E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots ] \] from any starting state in $S$.
Here $\gamma$, with $0 \leq \gamma < 1$, is the discount factor for future rewards: a reward received $k$ steps in the future is weighted by $\gamma^k$.
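To make the discounted sum concrete, here is a minimal Python sketch that evaluates $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$ for one finite, already observed sequence of rewards. The function name `discounted_return` and the truncation to a finite list are assumptions made for illustration; the quantity the agent maximizes is the expectation of this sum, which the sketch does not compute.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list.

    Illustrative helper (not from the source text): it scores a single
    observed reward sequence, not the expectation over sequences.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Work backwards through the sequence using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With $\gamma = 0$ only the immediate reward counts; as $\gamma$ approaches 1, later rewards contribute almost as much as immediate ones.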
The target function is $\pi: S \rightarrow A$.
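The training examples are of the form $\langle \langle s, a \rangle , r \rangle$, where $r$ is the reward received for taking action $a$ in state $s$. They are not direct examples $\langle s, a \rangle$ of the target function.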
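The shape of these training examples can be sketched in Python as follows. Everything here is a hypothetical illustration rather than part of the text: the four-state corridor `toy_step`, the action names `"left"` and `"right"`, and the helper `collect_examples` are all assumed. The point is only that the agent observes state-action pairs together with a reward, never the correct action for a state.

```python
from typing import Callable, List, Tuple

State = int
Action = str
Example = Tuple[Tuple[State, Action], float]  # one example <<s, a>, r>


def toy_step(s: State, a: Action) -> Tuple[State, float]:
    """Hypothetical deterministic corridor with states 0..3.

    Moving into state 3 from another state pays reward 1; all else pays 0.
    """
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    r = 1.0 if (s_next == 3 and s != 3) else 0.0
    return s_next, r


def collect_examples(
    policy: Callable[[State], Action],
    step: Callable[[State, Action], Tuple[State, float]],
    s0: State,
    n_steps: int,
) -> List[Example]:
    """Follow `policy` for n_steps and record each <<s, a>, r> triple."""
    examples: List[Example] = []
    s = s0
    for _ in range(n_steps):
        a = policy(s)
        s_next, r = step(s, a)
        examples.append(((s, a), r))
        s = s_next
    return examples


def always_right(s: State) -> Action:
    """A fixed policy pi: S -> A that ignores the state."""
    return "right"


examples = collect_examples(always_right, toy_step, 0, 4)
for example in examples:
    print(example)
```

Note that each recorded triple says how good an action turned out to be, not which action was best, which is what separates this setting from ordinary supervised learning of $\pi$.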