Reinforcement Learning

Q-Learning Motivation

We want to learn $V^{\pi^{*}} \equiv V^{*}$.
Given $V^{*}$, the agent could do a one-step lookahead to choose the best action in each state:
\[ \pi^{*}(s) = \arg\max_{a} \left[ r(s,a) + \gamma V^{*}(\delta(s,a)) \right] \]
Easy, right?
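The lookahead can be sketched in Python on a toy problem. Everything here is an assumption made for illustration and is not from the slides: a deterministic four-state chain, the names `delta`, `r`, `V_STAR`, and `greedy_action`, a reward of 100 for entering the goal, and $\gamma = 0.9$. Note that the sketch only works because the agent is handed both $r$ and $\delta$.

```python
# Hypothetical toy world: states 0..3 in a row, state 3 is an absorbing goal.
GAMMA = 0.9
GOAL = 3
ACTIONS = ("left", "right")


def delta(s, a):
    """Deterministic transition function delta(s, a) (assumed known)."""
    if s == GOAL:
        return GOAL  # absorbing state
    return max(s - 1, 0) if a == "left" else s + 1


def r(s, a):
    """Immediate reward r(s, a): 100 for stepping into the goal, else 0."""
    return 100.0 if s != GOAL and delta(s, a) == GOAL else 0.0


# V* for this toy world, worked out by hand from V*(s) = max_a [r + gamma * V*(delta)].
V_STAR = {0: 81.0, 1: 90.0, 2: 100.0, 3: 0.0}


def greedy_action(s, V=V_STAR):
    """One-step lookahead: argmax_a [ r(s, a) + gamma * V(delta(s, a)) ]."""
    return max(ACTIONS, key=lambda a: r(s, a) + GAMMA * V[delta(s, a)])


if __name__ == "__main__":
    for s in range(GOAL):
        print(s, greedy_action(s))
```

In this toy world the lookahead picks `right` in every non-goal state, since each step right moves the agent closer to the reward.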

José M. Vidal .
