
INTRODUCTION TO THE MODULE

We continue to develop temporal-difference (TD) learning, a central and novel idea in RL.
Like Monte Carlo (MC) methods, TD learns directly from experience without a model of the environment.
Like dynamic programming, it updates estimates based on other learned estimates (bootstrapping).
An important difference from MC is that TD makes useful updates after each time step, rather than waiting until the end of an episode.
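To make the per-step update concrete, here is a minimal sketch of the tabular TD(0) update for state values. The Gymnasium-style environment interface (reset returning `(obs, info)`, step returning a 5-tuple), the fixed `policy` callable, and the hyperparameter values are assumptions for illustration, not part of this module's material.

```python
import numpy as np

def td0_value_estimate(env, policy, num_episodes=500, alpha=0.1, gamma=0.99):
    """Estimate V(s) for a fixed policy with tabular TD(0) (illustrative sketch)."""
    V = np.zeros(env.observation_space.n)  # one value estimate per discrete state
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
            target = reward + gamma * V[next_state] * (not terminated)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```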

Q-Learning, an off-policy algorithm built on TD learning, was an early breakthrough in RL.
In this module, we cover the details of TD learning and Q-Learning, and implement and study these ideas in code.
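As a preview of what "off-policy" means in code, the sketch below shows tabular Q-Learning: actions are chosen epsilon-greedily (the behaviour policy), while the update target uses the greedy max over next-state action values (the target policy). The Gymnasium-style environment interface and the hyperparameter values are assumptions for illustration only.

```python
import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy with respect to the current Q
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Off-policy target: bootstrap from the greedy action in the next state
            target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```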

LEARNING OUTCOMES

At the conclusion of this module, you should be able to:

  • Explain how Q-Learning works and why it is an off-policy method
  • Use Q-Learning to compute value functions
  • Perform sensitivity analysis on a Q-Learning algorithm