The Learning Task
- Given that the agent inhabits a Markov decision process, it must learn an action policy $\pi : S \rightarrow A$ that maximizes
\[ E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots ] \]
from any starting state in $S$.
- $0 \leq \gamma < 1$ is the discount factor for future rewards.
- The target function is $\pi: S \rightarrow A$.
- The training examples are of the form $\langle \langle s, a \rangle , r \rangle$.
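As a rough illustration of the quantity being maximized, the sketch below computes the discounted sum for one finite sequence of rewards. The function name `discounted_return`, the state and action labels, and the reward values are all hypothetical, chosen only to mirror the $\langle \langle s, a \rangle , r \rangle$ form above. A finite list also only approximates the infinite sum, and a single sequence stands in for the expectation.

```python
from typing import List, Tuple

def discounted_return(rewards: List[float], gamma: float) -> float:
    """Return r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward list."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Work backwards through the rewards using G = r + gamma * G.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical training examples of the form <<s, a>, r> (made-up values).
examples: List[Tuple[Tuple[str, str], float]] = [
    (("s0", "right"), 0.0),
    (("s1", "right"), 0.0),
    (("s2", "up"), 100.0),
]

rewards = [r for (_, r) in examples]
ret = discounted_return(rewards, gamma=0.9)
print(ret)
```

With these made-up rewards and $\gamma = 0.9$, the sum is $0 + 0.9 \cdot 0 + 0.9^2 \cdot 100 = 81$. A smaller $\gamma$ would shrink the contribution of the delayed reward further.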
José M. Vidal