We continue to develop temporal-difference (TD) learning, which is a central and novel idea in RL.
Like Monte Carlo (MC) methods, TD learns directly from experience, without a model of the environment.
Like Dynamic Programming (DP), it updates estimates based in part on other learned estimates, a property known as bootstrapping.
An important difference from MC is that TD makes useful updates after every time step, rather than waiting until the end of an episode.
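To make the per-step update concrete, here is a minimal sketch of tabular TD(0) prediction. The environment interface (`env.reset()`, `env.step()`), the step size `alpha`, and the discount `gamma` are illustrative assumptions for this sketch, not the module's actual code.

```python
# Minimal sketch of tabular TD(0) prediction (assumed env interface).
# After every step, V[state] moves toward the bootstrapped target
# reward + gamma * V[next_state], rather than waiting for the episode return.
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=500, alpha=0.1, gamma=0.99):
    V = defaultdict(float)  # state-value estimates, default 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: bootstraps from the current estimate V[next_state]
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```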
Q-Learning, an early breakthrough in RL, is an off-policy control algorithm built on TD learning.
In this module, we will cover the details of TD learning and Q-Learning, and implement and study the ideas in code.
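As a preview, the sketch below shows the tabular Q-Learning update with an epsilon-greedy behavior policy. The environment interface (`env.reset()`, `env.step()`, `env.num_actions`) and the hyperparameter values are assumptions made for illustration; the full implementation is developed later in the module.

```python
# Sketch of tabular Q-Learning (assumed env interface; details covered in this module).
# Off-policy: the target bootstraps from the greedy (max) action value,
# regardless of which action the behavior policy actually took.
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: [0.0] * env.num_actions)  # action-value estimates
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                action = random.randrange(env.num_actions)
            else:
                action = max(range(env.num_actions), key=lambda a: Q[state][a])
            next_state, reward, done = env.step(action)
            # Off-policy target: reward plus discounted max over next-state action values
            target = reward + gamma * max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```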
At the conclusion of this module, you should be able to:
- Explain how Q-Learning works and why it learns off policy
- Use Q-Learning to estimate action-value functions
- Perform sensitivity analysis on the hyperparameters of a Q-Learning algorithm