Fix/terminated truncated (#252)
* Decoupled RSSM for DV3 agent

* Initialize posterior with prior if is_first is True

* Fix PlayerDV3 creation in evaluation

* Fix representation_model

* Fix compute first prior state with a zero posterior

* DV3 replay ratio conversion

* Removed expl parameters dependent on old per_rank_gradient_steps

* feat: update repeats computation

* feat: update learning starts in config

* fix: remove files

* feat: update repeats

* Let DV3 compute bootstrap correctly

* feat: added replay ratio and update exploration

* Fix exploration actions computation on DV1

* Fix naming

* Add replay-ratio to SAC

* feat: added replay ratio to p2e algos

* feat: update configs and utils of p2e algos

* Add replay-ratio to SAC-AE

* Add DrOQ replay ratio

* Fix tests

* Fix misspelling

* Fix wrong attribute access

* Fix naming and configs

* feat: add terminated and truncated to dreamer, p2e and ppo algos

* fix: dmc wrapper

* feat: update algos to split terminated from truncated

* fix: crafter and diambra wrappers

* feat: replace done with truncated key when the buffer is added to the checkpoint

* feat: added truncated/terminated to minedojo environment

* feat: added terminated/truncated to minerl and super mario bros envs

* docs: update howto

* fix: minedojo wrapper

* docs: update

* fix: minedojo

* update dependencies

* fix: minedojo

* fix: dv3 small configs

* fix: episode buffer and tests

---------

Co-authored-by: belerico <belo.fede@outlook.com>
michele-milesi and belerico authored Apr 2, 2024
1 parent 875166a commit fdd3a84
Showing 39 changed files with 460 additions and 285 deletions.
5 changes: 5 additions & 0 deletions howto/learn_in_minedojo.md
@@ -62,9 +62,14 @@ Moreover, we restrict the look-up/down actions between `min_pitch` and `max_pitch`
In addition, we added the forward action when the agent selects one of the following actions: `jump`, `sprint`, and `sneak`.
Finally, we added sticky actions for the `jump` and `attack` actions. You can set the values of the `sticky_jump` and `sticky_attack` parameters through the `env.sticky_jump` and `env.sticky_attack` CLI arguments, respectively. The sticky actions, if set, force the agent to repeat the selected actions for a certain number of steps.

> [!NOTE]
>
> The `env.sticky_attack` parameter is set to `0` if `env.break_speed_multiplier > 1`.
For more information about the MineDojo action space, check [here](https://docs.minedojo.org/sections/core_api/action_space.html).

> [!NOTE]
>
> Since the MineDojo environments have a multi-discrete action space, the sticky actions can be easily implemented. The agent will perform the selected action and the sticky actions simultaneously.
>
> The action repeat in the Minecraft environments is set to 1; indeed, it makes no sense to force the agent to repeat an action such as crafting (it may not have enough material for the second action).
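
To make the sticky-action behaviour described above (and in the MineRL how-to below) concrete, here is a minimal, illustrative sketch of how a wrapper could keep `jump` and `attack` active for a fixed number of steps in a multi-discrete action space. The indices, counter names, and action values are hypothetical and are not the actual SheepRL wrapper code.

```python
import numpy as np

# Hypothetical positions of the jump/attack entries in the multi-discrete
# action vector; the real MineDojo/MineRL layouts may differ.
JUMP_IDX, ATTACK_IDX = 2, 5


class StickyActions:
    """Illustrative sticky-action logic: once jump/attack is selected,
    keep it pressed for `sticky_jump`/`sticky_attack` further steps."""

    def __init__(self, sticky_jump: int = 10, sticky_attack: int = 30):
        self.sticky_jump = sticky_jump
        self.sticky_attack = sticky_attack
        self._jump_left = 0
        self._attack_left = 0

    def __call__(self, action: np.ndarray) -> np.ndarray:
        action = action.copy()
        if action[JUMP_IDX] != 0:
            self._jump_left = self.sticky_jump
        elif self._jump_left > 0:
            action[JUMP_IDX] = 1  # keep jumping
            self._jump_left -= 1
        if action[ATTACK_IDX] != 0:
            self._attack_left = self.sticky_attack
        elif self._attack_left > 0:
            action[ATTACK_IDX] = 1  # keep attacking (value is illustrative)
            self._attack_left -= 1
        return action


# Usage example: the agent presses jump once, the wrapper keeps it active.
sticky = StickyActions(sticky_jump=2, sticky_attack=0)
print(sticky(np.array([0, 0, 1, 0, 0, 0])))  # jump pressed by the agent
print(sticky(np.array([0, 0, 0, 0, 0, 0])))  # jump kept active by the wrapper
```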
5 changes: 5 additions & 0 deletions howto/learn_in_minerl.md
@@ -47,10 +47,15 @@ In addition, we added the forward action when the agent selects one of the following actions: `jump`, `sprint`, and `sneak`.
Finally, we added sticky actions for the `jump` and `attack` actions. You can set the values of the `sticky_jump` and `sticky_attack` parameters through the `env.sticky_jump` and `env.sticky_attack` arguments, respectively. The sticky actions, if set, force the agent to repeat the selected actions for a certain number of steps.

> [!NOTE]
>
> Since the MineRL environments have a multi-discrete action space, the sticky actions can be easily implemented. The agent will perform the selected action and the sticky actions simultaneously.
>
> The action repeat in the Minecraft environments is set to 1; indeed, it makes no sense to force the agent to repeat an action such as crafting (it may not have enough material for the second action).
> [!NOTE]
>
> The `env.sticky_attack` parameter is set to `0` if `env.break_speed_multiplier > 1`.
## Headless machines

If you work on a headless machine, you need a software renderer. We recommend adopting one of the following solutions:
1 change: 0 additions & 1 deletion howto/logs_and_checkpoints.md
@@ -122,7 +122,6 @@ AGGREGATOR_KEYS = {
"State/post_entropy",
"State/prior_entropy",
"State/kl",
"Params/exploration_amount",
"Grads/world_model",
"Grads/actor",
"Grads/critic",
1 change: 1 addition & 0 deletions howto/select_observations.md
@@ -8,6 +8,7 @@ In the first case, the observations are returned in the form of Python dictionaries

### Both observations
The algorithms that can work with both image and vector observations are specified in [Table 1](../README.md) in the README, and are reported here:
* A2C
* PPO
* PPO Recurrent
* SAC-AE
10 changes: 6 additions & 4 deletions howto/work_with_steps.md
@@ -22,11 +22,13 @@ The hyper-parameters that refer to the *policy steps* are:
* `exploration_steps`: the number of policy steps in which the agent explores the environment in the P2E algorithms.
* `max_episode_steps`: the maximum number of policy steps an episode can last (`max_steps`); when this number is reached, a `terminated=True` is returned by the environment. This means that if you decide to have an action repeat greater than one (`action_repeat > 1`), then the environment performs a maximum number of steps equal to `env_steps = max_steps * action_repeat` (see the short example after this list).
* `learning_starts`: how many policy steps the agent has to perform before starting the training.
* `train_every`: how many policy steps the agent has to perform between one training and the next.
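
A quick numeric illustration of the `env_steps` relation above (the numbers are arbitrary):

```python
# Worked example with arbitrary numbers.
max_episode_steps = 1_000  # policy steps before the episode is cut off
action_repeat = 4          # each policy step repeats the action 4 times

env_steps = max_episode_steps * action_repeat
print(env_steps)  # 4000 low-level environment steps at most
```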

## Gradient steps
A *gradient step* consists of an update of the parameters of the agent, i.e., a call to the *train* function. The number of gradient steps is proportional to the number of parallel processes: if there are $n$ parallel processes, `n * gradient_steps` calls to the *train* method will be executed.
A *gradient step* consists of an update of the parameters of the agent, i.e., a call to the *train* function. The number of gradient steps is proportional to the number of parallel processes: if there are $n$ parallel processes, `n * per_rank_gradient_steps` calls to the *train* method will be executed.

The hyper-parameters which refer to the *gradient steps* are:
* `algo.per_rank_gradient_steps`: the number of gradient steps per rank to perform in a single iteration.
* `algo.per_rank_pretrain_steps`: the number of gradient steps per rank to perform in the first iteration.
* `algo.per_rank_pretrain_steps`: the number of gradient steps per rank to perform in the first iteration.

> [!NOTE]
>
> The `replay_ratio` is the ratio between the gradient steps and the policy steps played by the agent.
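
As a rough, back-of-the-envelope sketch of that definition (the exact accounting of parallel environments and ranks in SheepRL may differ, so treat the numbers and the bookkeeping as illustrative):

```python
# Illustrative replay-ratio arithmetic: gradient steps vs. policy steps.
per_rank_gradient_steps = 1
num_envs = 4      # environments per rank
world_size = 2    # number of parallel processes (ranks)

# Assume each rank plays `num_envs` policy steps per iteration and then
# performs `per_rank_gradient_steps` updates.
policy_steps_per_iteration = num_envs * world_size
gradient_steps_per_iteration = per_rank_gradient_steps * world_size

replay_ratio = gradient_steps_per_iteration / policy_steps_per_iteration
print(replay_ratio)  # 0.25
```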
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -81,8 +81,8 @@ atari = [
"gymnasium[accept-rom-license]==0.29.*",
"gymnasium[other]==0.29.*",
]
minedojo = ["minedojo==0.1", "importlib_resources==5.12.0"]
minerl = ["setuptools==66.0.0", "minerl==0.4.4"]
minedojo = ["minedojo==0.1", "importlib_resources==5.12.0", "gym==0.21.0"]
minerl = ["setuptools==66.0.0", "minerl==0.4.4", "gym==0.19.0"]
diambra = ["diambra==0.0.17", "diambra-arena==2.2.6"]
crafter = ["crafter==1.8.3"]
mlflow = ["mlflow==2.11.1"]
8 changes: 3 additions & 5 deletions sheeprl/algos/a2c/a2c.py
@@ -243,7 +243,7 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
actions = torch.cat(actions, -1).cpu().numpy()

# Single environment step
obs, rewards, done, truncated, info = envs.step(real_actions.reshape(envs.action_space.shape))
obs, rewards, terminated, truncated, info = envs.step(real_actions.reshape(envs.action_space.shape))
truncated_envs = np.nonzero(truncated)[0]
if len(truncated_envs) > 0:
real_next_obs = {
@@ -266,10 +266,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
rewards[truncated_envs] += cfg.algo.gamma * vals.cpu().numpy().reshape(
rewards[truncated_envs].shape
)

dones = np.logical_or(done, truncated)
dones = dones.reshape(cfg.env.num_envs, -1)
rewards = rewards.reshape(cfg.env.num_envs, -1)
dones = np.logical_or(terminated, truncated).reshape(cfg.env.num_envs, -1).astype(np.uint8)
rewards = rewards.reshape(cfg.env.num_envs, -1)

# Update the step data
step_data["dones"] = dones[np.newaxis]
25 changes: 15 additions & 10 deletions sheeprl/algos/dreamer_v1/dreamer_v1.py
@@ -176,9 +176,9 @@ def train(
# compute predictions for terminal steps, if required
if cfg.algo.world_model.use_continues and world_model.continue_model:
qc = Independent(Bernoulli(logits=world_model.continue_model(latent_states)), 1)
continue_targets = (1 - data["dones"]) * cfg.algo.gamma
continues_targets = (1 - data["terminated"]) * cfg.algo.gamma
else:
qc = continue_targets = None
qc = continues_targets = None

# compute the distributions of the states (posteriors and priors)
# an Independent distribution is necessary because
@@ -200,7 +200,7 @@
cfg.algo.world_model.kl_free_nats,
cfg.algo.world_model.kl_regularizer,
qc,
continue_targets,
continues_targets,
cfg.algo.world_model.continue_scale_factor,
)
fabric.backward(rec_loss)
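
The switch from `data["dones"]` to `data["terminated"]` matters here because a truncated episode would have continued if the time limit were removed, so it should not teach the continue model that the episode ends. A minimal sketch of the target computation, with made-up values:

```python
import torch

gamma = 0.99  # illustrative discount factor
# One small batch of stored transitions: 1 means the episode really ended there.
terminated = torch.tensor([[0.0], [0.0], [1.0], [0.0]])

# Target for the continue predictor: only true terminations stop the episode;
# time-limit truncations still count as "continue".
continues_targets = (1 - terminated) * gamma
print(continues_targets.squeeze(-1))  # tensor([0.9900, 0.9900, 0.0000, 0.9900])
```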
@@ -554,7 +554,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
if k in cfg.algo.cnn_keys.encoder:
obs[k] = obs[k].reshape(cfg.env.num_envs, -1, *obs[k].shape[-2:])
step_data[k] = obs[k][np.newaxis]
step_data["dones"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["terminated"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["truncated"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["actions"] = np.zeros((1, cfg.env.num_envs, sum(actions_dim)))
step_data["rewards"] = np.zeros((1, cfg.env.num_envs, 1))
rb.add(step_data, validate_args=cfg.buffer.validate_args)
@@ -601,8 +602,10 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
real_actions = (
torch.cat([real_act.argmax(dim=-1) for real_act in real_actions], dim=-1).cpu().numpy()
)
next_obs, rewards, dones, truncated, infos = envs.step(real_actions.reshape(envs.action_space.shape))
dones = np.logical_or(dones, truncated).astype(np.uint8)
next_obs, rewards, terminated, truncated, infos = envs.step(
real_actions.reshape(envs.action_space.shape)
)
dones = np.logical_or(terminated, truncated).astype(np.uint8)

if cfg.metric.log_level > 0 and "final_info" in infos:
for i, agent_ep_info in enumerate(infos["final_info"]):
@@ -631,7 +634,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
# next_obs becomes the new obs
obs = next_obs

step_data["dones"] = dones[np.newaxis]
step_data["terminated"] = terminated[np.newaxis]
step_data["truncated"] = truncated[np.newaxis]
step_data["actions"] = actions[np.newaxis]
step_data["rewards"] = clip_rewards_fn(rewards)[np.newaxis]
rb.add(step_data, validate_args=cfg.buffer.validate_args)
@@ -643,13 +647,14 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
reset_data = {}
for k in obs_keys:
reset_data[k] = (next_obs[k][dones_idxes])[np.newaxis]
reset_data["dones"] = np.zeros((1, reset_envs, 1))
reset_data["terminated"] = np.zeros((1, reset_envs, 1))
reset_data["truncated"] = np.zeros((1, reset_envs, 1))
reset_data["actions"] = np.zeros((1, reset_envs, np.sum(actions_dim)))
reset_data["rewards"] = np.zeros((1, reset_envs, 1))
rb.add(reset_data, dones_idxes, validate_args=cfg.buffer.validate_args)
# Reset dones so that `is_first` is updated
for d in dones_idxes:
step_data["dones"][0, d] = np.zeros_like(step_data["dones"][0, d])
step_data["terminated"][0, d] = np.zeros_like(step_data["terminated"][0, d])
step_data["truncated"][0, d] = np.zeros_like(step_data["truncated"][0, d])
# Reset internal agent states
player.init_states(reset_envs=dones_idxes)

37 changes: 22 additions & 15 deletions sheeprl/algos/dreamer_v2/dreamer_v2.py
@@ -168,9 +168,9 @@ def train(
# Compute the distribution over the terminal steps, if required
if cfg.algo.world_model.use_continues and world_model.continue_model:
pc = Independent(Bernoulli(logits=world_model.continue_model(latent_states)), 1)
continue_targets = (1 - data["dones"]) * cfg.algo.gamma
continues_targets = (1 - data["terminated"]) * cfg.algo.gamma
else:
pc = continue_targets = None
pc = continues_targets = None

# Reshape posterior and prior logits to shape [T, B, 32, 32]
priors_logits = priors_logits.view(*priors_logits.shape[:-1], stochastic_size, discrete_size)
@@ -190,7 +190,7 @@
cfg.algo.world_model.kl_free_avg,
cfg.algo.world_model.kl_regularizer,
pc,
continue_targets,
continues_targets,
cfg.algo.world_model.discount_scale_factor,
)
fabric.backward(rec_loss)
@@ -264,8 +264,8 @@ def train(
predicted_rewards = world_model.reward_model(imagined_trajectories)
if cfg.algo.world_model.use_continues and world_model.continue_model:
continues = logits_to_probs(world_model.continue_model(imagined_trajectories), is_binary=True)
true_done = (1 - data["dones"]).reshape(1, -1, 1) * cfg.algo.gamma
continues = torch.cat((true_done, continues[1:]))
true_continue = (1 - data["terminated"]).reshape(1, -1, 1) * cfg.algo.gamma
continues = torch.cat((true_continue, continues[1:]))
else:
continues = torch.ones_like(predicted_rewards.detach()) * cfg.algo.gamma
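
In the imagination rollout above, only the first step has a known outcome, so its real `terminated` flag replaces the continue model's prediction while later steps keep the predicted probabilities. A small sketch of that concatenation with made-up tensors (shapes and values are illustrative, not the actual model outputs):

```python
import torch

gamma = 0.99
horizon, batch = 4, 2

# Continuation probabilities predicted by the continue model for each imagined step.
predicted_continues = torch.full((horizon, batch, 1), 0.9)

# The starting states come from real data, so their `terminated` flags are known
# and override the first predicted step.
terminated = torch.tensor([[0.0], [1.0]])
true_continue = (1 - terminated).reshape(1, -1, 1) * gamma
continues = torch.cat((true_continue, predicted_continues[1:]))
print(continues[:, 1, 0])  # tensor([0.0000, 0.9000, 0.9000, 0.9000])
```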

@@ -576,12 +576,14 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
obs = envs.reset(seed=cfg.seed)[0]
for k in obs_keys:
step_data[k] = obs[k][np.newaxis]
step_data["dones"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["terminated"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["truncated"] = np.zeros((1, cfg.env.num_envs, 1))
if cfg.dry_run:
step_data["dones"] = step_data["dones"] + 1
step_data["truncated"] = step_data["truncated"] + 1
step_data["terminated"] = step_data["terminated"] + 1
step_data["actions"] = np.zeros((1, cfg.env.num_envs, sum(actions_dim)))
step_data["rewards"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["is_first"] = np.ones_like(step_data["dones"])
step_data["is_first"] = np.ones_like(step_data["terminated"])
rb.add(step_data, validate_args=cfg.buffer.validate_args)
player.init_states()

@@ -627,9 +629,11 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
torch.cat([real_act.argmax(dim=-1) for real_act in real_actions], dim=-1).cpu().numpy()
)

step_data["is_first"] = copy.deepcopy(step_data["dones"])
next_obs, rewards, dones, truncated, infos = envs.step(real_actions.reshape(envs.action_space.shape))
dones = np.logical_or(dones, truncated).astype(np.uint8)
step_data["is_first"] = copy.deepcopy(np.logical_or(step_data["terminated"], step_data["truncated"]))
next_obs, rewards, terminated, truncated, infos = envs.step(
real_actions.reshape(envs.action_space.shape)
)
dones = np.logical_or(terminated, truncated).astype(np.uint8)
if cfg.dry_run and buffer_type == "episode":
dones = np.ones_like(dones)

@@ -657,7 +661,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
# Next_obs becomes the new obs
obs = next_obs

step_data["dones"] = dones.reshape((1, cfg.env.num_envs, -1))
step_data["terminated"] = terminated.reshape((1, cfg.env.num_envs, -1))
step_data["truncated"] = truncated.reshape((1, cfg.env.num_envs, -1))
step_data["actions"] = actions.reshape((1, cfg.env.num_envs, -1))
step_data["rewards"] = clip_rewards_fn(rewards).reshape((1, cfg.env.num_envs, -1))
rb.add(step_data, validate_args=cfg.buffer.validate_args)
@@ -669,14 +674,16 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
reset_data = {}
for k in obs_keys:
reset_data[k] = (next_obs[k][dones_idxes])[np.newaxis]
reset_data["dones"] = np.zeros((1, reset_envs, 1))
reset_data["terminated"] = np.zeros((1, reset_envs, 1))
reset_data["truncated"] = np.zeros((1, reset_envs, 1))
reset_data["actions"] = np.zeros((1, reset_envs, np.sum(actions_dim)))
reset_data["rewards"] = np.zeros((1, reset_envs, 1))
reset_data["is_first"] = np.ones_like(reset_data["dones"])
reset_data["is_first"] = np.ones_like(reset_data["terminated"])
rb.add(reset_data, dones_idxes, validate_args=cfg.buffer.validate_args)
# Reset dones so that `is_first` is updated
for d in dones_idxes:
step_data["dones"][0, d] = np.zeros_like(step_data["dones"][0, d])
step_data["terminated"][0, d] = np.zeros_like(step_data["terminated"][0, d])
step_data["truncated"][0, d] = np.zeros_like(step_data["truncated"][0, d])
# Reset internal agent states
player.init_states(dones_idxes)
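
Across the Dreamer rollout loops the episode-boundary bookkeeping follows the same pattern: `is_first` is raised for the step right after a boundary, whether the previous episode terminated or was truncated. A compressed, illustrative NumPy sketch (not the actual buffer API):

```python
import numpy as np

num_envs = 3
# Flags stored at the previous step: env 1 really ended, env 2 hit a time limit.
prev_terminated = np.array([[[0.0], [1.0], [0.0]]])
prev_truncated = np.array([[[0.0], [0.0], [1.0]]])

# The first observation after any episode boundary is flagged `is_first`,
# regardless of whether the boundary was a termination or a truncation.
is_first = np.logical_or(prev_terminated, prev_truncated).astype(np.float32)
print(is_first.squeeze(-1))  # [[0. 1. 1.]]
```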


