Introduction

So far, our machine learning problems have assumed:
1. that we can present the learner with correct examples,
2. and that learning goes first, then we test.
Imagine a robot learning to dock with the batter charger.
It gets a reward (charge) when successful, but must determine how to encourage actions that led there (temporal credit assignment problem).
- The reward might be probabilistic.
It can explore the domain, finding a better route? (exploration vs. exploitation problem)
The state might only be partially observable.
It might have to learn several related task with the same environment and using the same sensors.