How to Choose an Action

Reinforcement Learning

Notice that the algorithm did not specify how to choose an action.
However, convergence requires that every state-action pair be visited infinitely often.
This is the classic explore vs. exploit problem (aka the n-armed bandit problem)!
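A common strategy is to choose actions stochastically, with probabilities \[ P(a_i|s) = \frac{k^{\hat{Q}(s,a_i)}}{\sum_j k^{\hat{Q}(s,a_j)}} \] where \(k > 0\) is a constant. Larger values of \(k\) favor actions with high \(\hat{Q}\) (exploit), while \(k\) close to 1 spreads the probability nearly uniformly over the actions (explore).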
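
As a rough illustration, the stochastic rule above could be sketched in Python as follows. The function names `action_probabilities` and `choose_action`, the representation of \(\hat{Q}(s,\cdot)\) as a plain list of floats, and the default `k=2.0` are all assumptions made for this example, not part of the original algorithm statement.

```python
import math
import random


def action_probabilities(q_values, k=2.0):
    """Return [P(a_i | s)] with P(a_i | s) = k**Q(s,a_i) / sum_j k**Q(s,a_j).

    q_values: estimated Q-hat values for each action in the current state.
    k: positive constant (assumed default 2.0, chosen arbitrarily here).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # k**q == exp(q * ln k); subtracting the largest exponent leaves the
    # ratios unchanged and keeps exp() from overflowing.
    log_k = math.log(k)
    exponents = [q * log_k for q in q_values]
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def choose_action(q_values, k=2.0, rng=random):
    """Sample an action index according to action_probabilities."""
    probs = action_probabilities(q_values, k)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Because every action keeps a nonzero probability, repeated sampling with `choose_action` keeps trying each action now and then, which is the property the exploration requirement asks for. One plausible design choice, not prescribed by the source, is to start with `k` near 1 and increase it over time so the agent shifts from exploring to exploiting.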