Reinforcement Learning

Q-Learning Motivation

So we want to learn $V^{\pi^{*}} \equiv V^{*}$.
The agent could just do a lookahead search to choose the best action for each state:
\[ \pi^{*}(s) = \arg\max_{a} \left[ r(s,a) + \gamma V^{*}(\delta(s,a)) \right] \]
Easy, right?
Yes, but only if we know the transition function $\delta$ and the reward function $r$.
Most often, the agent does not know them.
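
To make the lookahead concrete, here is a minimal sketch of the rule above for the case where $\delta$ and $r$ are known. The three-state chain world, the names `greedy_action`, `ACTIONS`, and `V_STAR`, and all numeric values are hypothetical and chosen only for illustration; they are not from the slides.

```python
# One-step lookahead with a KNOWN deterministic model (illustrative toy world).
# States 0, 1, 2 form a chain; state 2 is an absorbing goal (assumption).

GAMMA = 0.9
ACTIONS = ["left", "right"]


def delta(s, a):
    """Known transition function: next state for taking action a in state s."""
    if s == 2:  # the goal is absorbing
        return 2
    return min(s + 1, 2) if a == "right" else max(s - 1, 0)


def r(s, a):
    """Known reward function: 100 for stepping into the goal, 0 otherwise."""
    return 100.0 if (s == 1 and a == "right") else 0.0


# Optimal values for this toy world, worked out by hand:
# V*(2) = 0, V*(1) = 100 + 0.9 * 0 = 100, V*(0) = 0 + 0.9 * 100 = 90.
V_STAR = {0: 90.0, 1: 100.0, 2: 0.0}


def greedy_action(s, actions, delta, r, v_star, gamma=GAMMA):
    """Return argmax over a of r(s, a) + gamma * V*(delta(s, a))."""
    return max(actions, key=lambda a: r(s, a) + gamma * v_star[delta(s, a)])


if __name__ == "__main__":
    for s in (0, 1):
        print(s, greedy_action(s, ACTIONS, delta, r, V_STAR))
```

Note that `greedy_action` calls `delta` and `r` directly. Without access to those two functions, the agent cannot evaluate the expression inside the $\arg\max$, which is exactly the obstacle the slide points out.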

José M. Vidal
