High-level idea: for the true return on the actual environment T, define a lower bound of this return that can be evaluated on the learned model T̂. This lower bound can be computed by maximizing the modified one-step reward r̃(s, a) = r(s, a) − λ u(s, a) on the learned model T̂, where u(s, a) is an estimate of the model error at (s, a).
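A minimal sketch of that modified reward, assuming the uncertainty estimate u(s, a) is already available from the learned model (how it is obtained is left open here):

```python
import numpy as np

def penalized_reward(r, u, lam=1.0):
    """MOPO-style modified reward: r_tilde(s, a) = r(s, a) - lambda * u(s, a)."""
    return r - lam * u

# Toy example: the penalty shrinks the reward where the error estimate u(s, a) is large.
r = np.array([1.0, 1.0, 1.0])
u = np.array([0.0, 0.5, 2.0])           # hypothetical uncertainty values
print(penalized_reward(r, u, lam=0.5))  # ~ [1.0, 0.75, 0.0]
```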
Pros of MOPO:
A strong theoretical guarantee: the maximization objective is a lower bound on the true objective.
The gap between this lower bound and the true objective is smaller where the transition model T̂ is more accurate, and larger where T̂ is less accurate. Hence the agent pays a price for exploiting a state/action pair where the transition model is inaccurate, so as to balance the return it can extract from the model against the risk of model error.
Cons of MOPO:
The lower bound can be very loose.
Computing this bound requires the total variation distance between T(s, a) and T̂(s, a), which we cannot evaluate because we do not have access to the true transition model T.
As a comparison, our proposed regularization term −λ log p(s, a) shares the second advantage above (penalizing the total cost when the policy exploits unseen state/action pairs), but it does not have to be computed through a loose bound, and its gradient is easy to approximate with a diffusion model, as sketched below.
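A rough sketch of that gradient approximation, assuming a score network score_net(x, t) has been trained on the offline (s, a) pairs so that it approximates ∇ₓ log pₜ(x); the network interface and the small noise level are assumptions for illustration, not part of the proposal itself:

```python
import torch

def log_density_grad(score_net, s, a, t_small=1e-3):
    """Approximate grad of log p(s, a) via a learned score / diffusion model.

    score_net(x, t) ~= grad_x log p_t(x), which approaches grad_x log p(x)
    as the noise level t goes to 0.
    """
    x = torch.cat([s, a], dim=-1)                             # [batch, dim_s + dim_a]
    t = torch.full((x.shape[0],), t_small, device=x.device)   # near-zero noise level
    return score_net(x, t)                                    # [batch, dim_s + dim_a]

# Scaled by lambda, this estimate of grad log p(s, a) supplies the gradient of the
# regularization term without ever computing a total variation bound.
```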
Eventually we will compare our approach against MOPO, and the comparison metric is the true objective ∑ₜ γᵗ r(sₜ, aₜ) on the real environment, with both MOPO and our system trained on the same offline data.
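For concreteness, that metric would be computed roughly as below, assuming a Gym-style interface for the real environment and a policy with an act method (both are placeholder interfaces):

```python
def discounted_return(env, policy, gamma=0.99, max_steps=1000):
    """True objective on the real environment: sum over t of gamma^t * r(s_t, a_t)."""
    s, _ = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy.act(s)
        s, r, terminated, truncated, _ = env.step(a)
        total += discount * r
        discount *= gamma
        if terminated or truncated:
            break
    return total
```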
But MOPO uses an ensemble of models to estimate the error, and I don't immediately see why we have to use an ensemble. If we compare our approach with a single dynamics model against MOPO with an ensemble of models, is that still a fair comparison?
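To make the question concrete: in MOPO the ensemble exists mainly to produce the error estimate u(s, a) from model disagreement, roughly as in this sketch (a simplified stand-in for whatever estimator the MOPO code actually uses):

```python
import numpy as np

def ensemble_uncertainty(mean_preds):
    """Hypothetical u(s, a): disagreement of next-state predictions across an ensemble.

    mean_preds: [n_models, batch, state_dim], one predicted next-state mean per model.
    Returns a per-sample scalar: the largest per-dimension std across ensemble members.
    """
    std_across_models = mean_preds.std(axis=0)   # [batch, state_dim]
    return std_across_models.max(axis=-1)        # [batch]
```

A single dynamics model offers no such disagreement signal, whereas the −λ log p(s, a) regularizer draws its penalty from the density model instead.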