Nondeterministic Q-Learning
- We can then alter the training rule to be
\[ \hat{Q}_{n}(s,a) \leftarrow (1-\alpha_{n})\hat{Q}_{n-1}(s,a) + \alpha_{n}[r+ \gamma\max_{a'}\hat{Q}_{n-1}(s',a')] \]
where
\[ \alpha_{n} = \frac{1}{1 + \mbox{visits}_n(s,a)} \]
- Note that when $\alpha_{n} = 1$ we recover the deterministic
training rule. All we are doing, then, is making revisions to
$\hat{Q}$ more gradual as $(s,a)$ is visited more often.
- We can still prove convergence of $\hat{Q}$ to $Q$.
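The update rule above can be sketched in code. This is a minimal illustration, not the lecture's own implementation: it assumes a tabular $\hat{Q}$ stored in a dictionary, a per-pair visit counter implementing $\alpha_n = 1/(1 + \mathrm{visits}_n(s,a))$, and hypothetical state and action names.

```python
from collections import defaultdict

def q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """One nondeterministic Q-learning update with a decaying learning rate.

    Uses alpha_n = 1 / (1 + visits_n(s, a)), so revisions to Q become
    more gradual as (s, a) is visited more often.
    """
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]

# Hypothetical two-state, two-action example: the first visit to
# (s0, right) gives alpha = 1/2, so Q moves halfway toward the target.
Q = defaultdict(float)
visits = defaultdict(int)
actions = ["left", "right"]
q_update(Q, visits, "s0", "right", 10.0, "s1", actions, gamma=0.9)
```

Note how a second update of the same pair would use $\alpha = 1/3$, pulling $\hat{Q}$ toward the new sample more weakly than the first; this damping of noisy rewards is exactly what the nondeterministic rule buys.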
José M. Vidal