Variations
- Updating after each move can take very long to converge if the only reward comes at the end of the episode.
- Instead, we can save the whole episode's transitions and rewards and update at the end, in reverse order (see the first sketch below).
- Another technique is to store past state-action transitions and their rewards and re-train on them periodically (see the second sketch below).
- Re-training on old transitions helps when the Q values of the neighboring states have changed since the transition was first seen.
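A minimal sketch of the first idea in Python, assuming a tabular Q function stored in a dictionary, a hypothetical environment with `reset()` and `step(action)` methods, and illustrative values for the learning rate, discount factor, and epsilon-greedy exploration (none of these specifics come from the slide):

```python
import random
from collections import defaultdict

def reverse_update_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Play one episode, saving every transition, then apply the Q-learning
    update in reverse order so a reward received only at the end propagates
    back through the whole episode in a single pass."""
    trajectory = []                      # (state, action, reward, next_state)
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)   # hypothetical interface
        trajectory.append((state, action, reward, next_state))
        state = next_state

    # Walk the saved episode backwards, applying the standard update
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    for state, action, reward, next_state in reversed(trajectory):
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Example usage (assuming some `env` and action list exist):
# Q = defaultdict(float)
# for _ in range(1000):
#     reverse_update_episode(env, Q, actions=[0, 1, 2, 3])
```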
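And a minimal sketch of the second idea: storing past transitions and periodically re-training on them. The buffer size, batch size, and random sampling are illustrative assumptions, not prescribed by the slide:

```python
import random
from collections import deque

replay_buffer = deque(maxlen=10000)   # stores (state, action, reward, next_state)

def remember(state, action, reward, next_state):
    """Record a state-action transition and its reward as it happens."""
    replay_buffer.append((state, action, reward, next_state))

def replay(Q, actions, alpha=0.1, gamma=0.9, batch_size=32):
    """Re-apply the Q-learning update to a random batch of stored
    transitions. This is useful when the Q values of the states these
    transitions lead to have changed since they were first experienced."""
    if len(replay_buffer) < batch_size:
        return
    for state, action, reward, next_state in random.sample(list(replay_buffer), batch_size):
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Example usage: call remember(...) after every real move, and call
# replay(Q, actions) every N moves or at the end of every episode.
```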