Baselines & Selling Points #2

Open
hjsuh94 opened this issue Mar 25, 2023 · 1 comment

Comments

hjsuh94 (Owner) commented Mar 25, 2023

  1. Why Model-Based?
  • It's possible to be more data-efficient, although model-free methods might have better asymptotic performance.
  • Models make it easy to inject inductive biases.
  2. What about other generative models?
  • VAE: taking gradients is not straightforward.
  • Denoising AE
  • Normalizing Flows
  3. What about other planners / policy optimizers that use diffusion?
  • Janner's approach of directly doing diffusion at the trajectory level
  • MPPI
  4. What if we don't include distribution risk / uncertainty?
  • Policy gradient with vs. without distribution risk
  • MPPI / SGD for planning problems
  5. What about other approaches that tackle similar distribution risk problems?
  • Compare against MOPO, which includes ensembles and variance.
hongkai-dai (Collaborator) commented Mar 25, 2023

A quick summary of my understanding of MOPO:

High-level idea: for the true reward on the actual environment T, define a lower bound of this true reward on the learned model T̂. This lower bound can be computed by maximizing the modified one-step reward r̃(s, a) = r(s, a) − λ u(s, a) on the learned model T̂.
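
To make the penalty concrete, here is a minimal sketch of how that modified reward could be computed, assuming (as in MOPO's practical implementation) that u(s, a) is estimated from the disagreement of an ensemble of learned Gaussian dynamics models. `reward_fn` and `ensemble_std_fns` are hypothetical callables, not part of this repo:

```python
import numpy as np

def mopo_penalized_reward(reward_fn, ensemble_std_fns, s, a, lam=1.0):
    """Sketch of MOPO's modified one-step reward r~(s, a) = r(s, a) - lam * u(s, a).

    `ensemble_std_fns` is a list of hypothetical callables, one per ensemble
    member, each mapping (s, a) to that member's predicted next-state standard
    deviation.  The uncertainty u(s, a) is taken as the largest std norm across
    the ensemble, following MOPO's practical heuristic for the model error.
    """
    u = max(np.linalg.norm(std_fn(s, a)) for std_fn in ensemble_std_fns)
    return reward_fn(s, a) - lam * u
```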

Pros of MOPO:

  1. A strong theoretical guarantee: the maximization objective is a lower bound of the true objective.
  2. The gap between this lower bound and the true objective is smaller where the transition model is more accurate, and larger where it is less accurate. Hence the agent pays a price to exploit a state/action pair where the transition model is less accurate, so as to balance exploration and exploitation.

Cons of MOPO:

  1. The lower bound can be very loose.
  2. Computing this loose bound requires the total variation distance between T(s, a) and T̂(s, a), which we don't have.

As a comparison, our proposed regularization term −λ log p(s, a) shares the second advantage (it penalizes the total cost when the policy exploits unseen state/action pairs), but it doesn't require computing a loose bound, and the gradient of this regularization term is easy to approximate through diffusion.
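
For contrast, here is a minimal sketch of how that gradient could be approximated, assuming a hypothetical `score_net` trained on the offline data so that `score_net(x, sigma)` ≈ ∇ₓ log p_σ(x) for x = (s, a); the noise level `sigma` and the interface are assumptions, not the actual implementation:

```python
import torch

def density_regularizer_grad(score_net, s, a, lam=1.0, sigma=0.05):
    """Sketch: approximate the gradient of -lam * log p(s, a) w.r.t. (s, a).

    Assumes `score_net(x, sigma)` returns an estimate of the score
    grad_x log p_sigma(x) of the (noise-smoothed) data density at x = (s, a),
    as a denoising diffusion / score model would.  The regularizer gradient is
    then just -lam times that score, with no error bound required.
    """
    x = torch.cat([s, a], dim=-1)
    with torch.no_grad():
        score = score_net(x, sigma)        # ~ grad_x log p_sigma(s, a)
    grad = -lam * score                    # gradient of -lam * log p(s, a)
    # Split back into the state and action blocks of the gradient.
    return grad[..., : s.shape[-1]], grad[..., s.shape[-1]:]
```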

Eventually we will compare our approach against MOPO, and the comparison metric is the true objective ∑ₜ γᵗ r(sₜ, aₜ) on the real environment, when both MOPO and our system use the same training data.
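
As a sketch of that metric, assuming a classic Gym-style `reset()`/`step()` environment interface and a deterministic policy (the function name and arguments are hypothetical):

```python
def discounted_return(env, policy, gamma=0.99, horizon=1000):
    """Monte-Carlo estimate of the true objective sum_t gamma^t r(s_t, a_t)
    from a single rollout of `policy` on the real environment `env`."""
    s = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r, done, _ = env.step(a)        # classic Gym 4-tuple step API
        ret += discount * r
        discount *= gamma
        if done:
            break
    return ret
```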

But MOPO uses an ensemble of models to estimate the error, and I don't immediately see why we have to use an ensemble of models. If we compare our approach with one dynamics model against MOPO with an ensemble of models, is that still a fair comparison?
