How to Choose an Action
- Notice that the algorithm did not specify how to choose an
action.
- However, convergence requires that every state-action
pair be visited infinitely often.
- This is the classic explore vs. exploit problem (a.k.a. the
n-armed bandit problem).
- A common strategy is to choose the action stochastically, with probability
\[
P(a_i|s) = \frac{k^{\hat{Q}(s,a_i)}}{\sum_j k^{\hat{Q}(s,a_j)}}
\]
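The rule above can be sketched in Python as follows. This is a minimal illustration, not the slide author's implementation: the names `action_probabilities` and `choose_action`, the dictionary representation of the estimates \(\hat{Q}(s,\cdot)\) for one state, and the default `k=2.0` are all assumptions made for the example. The constant \(k > 0\) controls the trade-off: values well above 1 favor actions with high \(\hat{Q}\) (exploit), while values near 1 make the choice close to uniform (explore).

```python
import random


def action_probabilities(q_values, k=2.0):
    """Return P(a_i | s) = k^Q(s, a_i) / sum_j k^Q(s, a_j).

    q_values: dict mapping each action to its current estimate Q(s, a)
              for one fixed state s (hypothetical representation).
    k:        positive constant; larger k weights high-Q actions more.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Subtracting the maximum Q leaves the ratios unchanged but keeps
    # the powers from overflowing when the estimates are large.
    q_max = max(q_values.values())
    weights = {a: k ** (q - q_max) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def choose_action(q_values, k=2.0, rng=random):
    """Sample one action according to the probabilities above."""
    probs = action_probabilities(q_values, k)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]
```

For example, with estimates `{"left": 0.0, "right": 1.0}` and `k=2.0`, "right" is chosen with probability 2/3 and "left" with probability 1/3, so the lower-valued action is still tried from time to time.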
José M. Vidal