High-level idea: for the true return on the actual environment T, define a lower bound of this return that can be evaluated on the learned model T̂. This lower bound can be computed by maximizing the modified one-step reward r̃(s, a) = r(s, a) − λ u(s, a) on the learned model T̂, where u(s, a) is an estimate of the model error at (s, a).
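A minimal sketch of that modified reward, assuming the uncertainty estimate u(s, a) is already available from the learned model (how it is obtained is left open here):

```python
import numpy as np

def penalized_reward(r, u, lam=1.0):
    """MOPO-style modified reward: r_tilde(s, a) = r(s, a) - lambda * u(s, a)."""
    return r - lam * u

# Toy example: the penalty shrinks the reward where the error estimate u(s, a) is large.
r = np.array([1.0, 1.0, 1.0])
u = np.array([0.0, 0.5, 2.0])           # hypothetical uncertainty values
print(penalized_reward(r, u, lam=0.5))  # ~ [1.0, 0.75, 0.0]
```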
Pros of MOPO:
A strong theoretical guarantee: the maximization objective is a lower bound on the true objective.
The gap between this lower bound and the true objective is smaller where the transition model T̂ is more accurate, and larger where T̂ is less accurate. Hence the agent pays a price for exploiting a state/action pair where the transition model is inaccurate, so as to balance the return it can extract from the model against the risk of model error.
Cons of MOPO:
The lower bound can be very loose.
Computing this bound requires the total variation distance between T(s, a) and T̂(s, a), which we cannot evaluate because we do not have access to the true transition model T.
As a comparison, our proposed regularization term −λ log p(s, a) shares the second advantage above (penalizing the total cost when the policy exploits unseen state/action pairs), but it does not have to be computed through a loose bound, and its gradient is easy to approximate with a diffusion model, as sketched below.
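A rough sketch of that gradient approximation, assuming a score network score_net(x, t) has been trained on the offline (s, a) pairs so that it approximates ∇ₓ log pₜ(x); the network interface and the small noise level are assumptions for illustration, not part of the proposal itself:

```python
import torch

def log_density_grad(score_net, s, a, t_small=1e-3):
    """Approximate grad of log p(s, a) via a learned score / diffusion model.

    score_net(x, t) ~= grad_x log p_t(x), which approaches grad_x log p(x)
    as the noise level t goes to 0.
    """
    x = torch.cat([s, a], dim=-1)                             # [batch, dim_s + dim_a]
    t = torch.full((x.shape[0],), t_small, device=x.device)   # near-zero noise level
    return score_net(x, t)                                    # [batch, dim_s + dim_a]

# Scaled by lambda, this estimate of grad log p(s, a) supplies the gradient of the
# regularization term without ever computing a total variation bound.
```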
Eventually we will compare our approach against MOPO, and the comparison metric is the true objective ∑ₜ γᵗ r(sₜ, aₜ) on the real environment, with both MOPO and our system trained on the same offline data.
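For concreteness, that metric would be computed roughly as below, assuming a Gym-style interface for the real environment and a policy with an act method (both are placeholder interfaces):

```python
def discounted_return(env, policy, gamma=0.99, max_steps=1000):
    """True objective on the real environment: sum over t of gamma^t * r(s_t, a_t)."""
    s, _ = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy.act(s)
        s, r, terminated, truncated, _ = env.step(a)
        total += discount * r
        discount *= gamma
        if terminated or truncated:
            break
    return total
```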
But MOPO uses an ensemble of models to estimate the error, and I don't immediately see why we have to use an ensemble. If we compare our approach with a single dynamics model against MOPO with an ensemble of models, is that still a fair comparison?
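To make the question concrete: in MOPO the ensemble exists mainly to produce the error estimate u(s, a) from model disagreement, roughly as in this sketch (a simplified stand-in for whatever estimator the MOPO code actually uses):

```python
import numpy as np

def ensemble_uncertainty(mean_preds):
    """Hypothetical u(s, a): disagreement of next-state predictions across an ensemble.

    mean_preds: [n_models, batch, state_dim], one predicted next-state mean per model.
    Returns a per-sample scalar: the largest per-dimension std across ensemble members.
    """
    std_across_models = mean_preds.std(axis=0)   # [batch, state_dim]
    return std_across_models.max(axis=-1)        # [batch]
```

A single dynamics model offers no such disagreement signal, whereas the −λ log p(s, a) regularizer draws its penalty from the density model instead.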