Nondeterministic Q-Learning
- We can then alter the training rule to be
\[ \hat{Q}_{n}(s,a) \leftarrow (1-\alpha_{n})\hat{Q}_{n-1}(s,a) + \alpha_{n}[r+ \gamma\max_{a'}\hat{Q}_{n-1}(s',a')] \]
where
\[ \alpha_{n} = \frac{1}{1 + \mbox{visits}_n(s,a)} \]
- Note that when $\alpha_{n} = 1$ we recover the deterministic
training rule. All we are doing, then, is making revisions to
$\hat{Q}$ more gradual as $(s,a)$ is visited more often.
- We can still prove convergence of $\hat{Q}$ to $Q$.
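The update rule above can be sketched in code. This is a minimal illustration, not the lecture's own implementation: it assumes a tabular $\hat{Q}$ stored in a dictionary, a per-pair visit counter implementing $\alpha_n = 1/(1 + \mathrm{visits}_n(s,a))$, and hypothetical state and action names.

```python
from collections import defaultdict

def q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """One nondeterministic Q-learning update with a decaying learning rate.

    Uses alpha_n = 1 / (1 + visits_n(s, a)), so revisions to Q become
    more gradual as (s, a) is visited more often.
    """
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]

# Hypothetical two-state, two-action example: the first visit to
# (s0, right) gives alpha = 1/2, so Q moves halfway toward the target.
Q = defaultdict(float)
visits = defaultdict(int)
actions = ["left", "right"]
q_update(Q, visits, "s0", "right", 10.0, "s1", actions, gamma=0.9)
```

Note how a second update of the same pair would use $\alpha = 1/3$, pulling $\hat{Q}$ toward the new sample more weakly than the first; this damping of noisy rewards is exactly what the nondeterministic rule buys.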
José M. Vidal