Fix/terminated truncated (#252)
* Decoupled RSSM for DV3 agent

* Initialize posterior with prior if is_first is True

* Fix PlayerDV3 creation in evaluation

* Fix representation_model

* Fix compute first prior state with a zero posterior

* DV3 replay ratio conversion

* Removed expl parameters dependent on old per_rank_gradient_steps

* feat: update repeats computation

* feat: update learning starts in config

* fix: remove files

* feat: update repeats

* Let DV3 compute bootstrap correctly

* feat: added replay ratio and update exploration

* Fix exploration actions computation on DV1

* Fix naming

* Add replay-ratio to SAC

* feat: added replay ratio to p2e algos

* feat: update configs and utils of p2e algos

* Add replay-ratio to SAC-AE

* Add DrOQ replay ratio

* Fix tests

* Fix misspelling

* Fix wrong attribute access

* Fix naming and configs

* feat: add terminated and truncated to dreamer, p2e and ppo algos

* fix: dmc wrapper

* feat: update algos to split terminated from truncated

* fix: crafter and diambra wrappers

* feat: replace done with truncated key when the buffer is added to the checkpoint

* feat: added truncated/terminated to minedojo environment

* feat: added terminated/truncated to minerl and super mario bros envs

* docs: update howto

* fix: minedojo wrapper

* docs: update

* fix: minedojo

* update dependencies

* fix: minedojo

* fix: dv3 small configs

* fix: episode buffer and tests

---------

Co-authored-by: belerico <belo.fede@outlook.com>
michele-milesi and belerico authored Apr 2, 2024
1 parent 875166a commit fdd3a84
Showing 39 changed files with 460 additions and 285 deletions.
5 changes: 5 additions & 0 deletions howto/learn_in_minedojo.md
@@ -62,9 +62,14 @@ Moreover, we restrict the look-up/down actions between `min_pitch` and `max_pitch`
In addition, we added the forward action when the agent selects one of the following actions: `jump`, `sprint`, and `sneak`.
Finally, we added sticky actions for the `jump` and `attack` actions. You can set the values of the `sticky_jump` and `sticky_attack` parameters through the `env.sticky_jump` and `env.sticky_attack` CLI arguments, respectively. The sticky actions, if set, force the agent to repeat the selected actions for a certain number of steps.

> [!NOTE]
>
> The `env.sticky_attack` parameter is set to `0` if `env.break_speed_multiplier > 1`.
For more information about the MineDojo action space, check [here](https://docs.minedojo.org/sections/core_api/action_space.html).

> [!NOTE]
>
> Since the MineDojo environments have a multi-discrete action space, the sticky actions can be easily implemented. The agent will perform the selected action and the sticky actions simultaneously.
>
> The action repeat in the Minecraft environments is set to 1; indeed, it makes no sense to force the agent to repeat an action such as crafting (it may not have enough material for the second action).
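
To make the sticky-action behaviour described above (and in the MineRL how-to below) concrete, here is a minimal, illustrative sketch of how a wrapper could keep `jump` and `attack` active for a fixed number of steps in a multi-discrete action space. The indices, counter names, and action values are hypothetical and are not the actual SheepRL wrapper code.

```python
import numpy as np

# Hypothetical positions of the jump/attack entries in the multi-discrete
# action vector; the real MineDojo/MineRL layouts may differ.
JUMP_IDX, ATTACK_IDX = 2, 5


class StickyActions:
    """Illustrative sticky-action logic: once jump/attack is selected,
    keep it pressed for `sticky_jump`/`sticky_attack` further steps."""

    def __init__(self, sticky_jump: int = 10, sticky_attack: int = 30):
        self.sticky_jump = sticky_jump
        self.sticky_attack = sticky_attack
        self._jump_left = 0
        self._attack_left = 0

    def __call__(self, action: np.ndarray) -> np.ndarray:
        action = action.copy()
        if action[JUMP_IDX] != 0:
            self._jump_left = self.sticky_jump
        elif self._jump_left > 0:
            action[JUMP_IDX] = 1  # keep jumping
            self._jump_left -= 1
        if action[ATTACK_IDX] != 0:
            self._attack_left = self.sticky_attack
        elif self._attack_left > 0:
            action[ATTACK_IDX] = 1  # keep attacking (value is illustrative)
            self._attack_left -= 1
        return action


# Usage example: the agent presses jump once, the wrapper keeps it active.
sticky = StickyActions(sticky_jump=2, sticky_attack=0)
print(sticky(np.array([0, 0, 1, 0, 0, 0])))  # jump pressed by the agent
print(sticky(np.array([0, 0, 0, 0, 0, 0])))  # jump kept active by the wrapper
```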
5 changes: 5 additions & 0 deletions howto/learn_in_minerl.md
@@ -47,10 +47,15 @@ In addition, we added the forward action when the agent selects one of the following actions: `jump`, `sprint`, and `sneak`.
Finally, we added sticky actions for the `jump` and `attack` actions. You can set the values of the `sticky_jump` and `sticky_attack` parameters through the `env.sticky_jump` and `env.sticky_attack` arguments, respectively. The sticky actions, if set, force the agent to repeat the selected actions for a certain number of steps.

> [!NOTE]
>
> Since the MineRL environments have a multi-discrete action space, the sticky actions can be easily implemented. The agent will perform the selected action and the sticky actions simultaneously.
>
> The action repeat in the Minecraft environments is set to 1; indeed, it makes no sense to force the agent to repeat an action such as crafting (it may not have enough material for the second action).
> [!NOTE]
>
> The `env.sticky_attack` parameter is set to `0` if `env.break_speed_multiplier > 1`.
## Headless machines

If you work on a headless machine, you need a software renderer. We recommend adopting one of the following solutions:
1 change: 0 additions & 1 deletion howto/logs_and_checkpoints.md
@@ -122,7 +122,6 @@ AGGREGATOR_KEYS = {
"State/post_entropy",
"State/prior_entropy",
"State/kl",
"Params/exploration_amount",
"Grads/world_model",
"Grads/actor",
"Grads/critic",
1 change: 1 addition & 0 deletions howto/select_observations.md
@@ -8,6 +8,7 @@ In the first case, the observations are returned in the form of Python dictionaries

### Both observations
The algorithms that can work with both image and vector observations are specified in [Table 1](../README.md) in the README, and are reported here:
* A2C
* PPO
* PPO Recurrent
* SAC-AE
10 changes: 6 additions & 4 deletions howto/work_with_steps.md
@@ -22,11 +22,13 @@ The hyper-parameters that refer to the *policy steps* are:
* `exploration_steps`: the number of policy steps in which the agent explores the environment in the P2E algorithms.
* `max_episode_steps`: the maximum number of policy steps an episode can last (`max_steps`); when this number is reached, a `terminated=True` is returned by the environment. This means that if you decide to have an action repeat greater than one (`action_repeat > 1`), then the environment performs a maximum number of steps equal to `env_steps = max_steps * action_repeat` (see the short example after this list).
* `learning_starts`: how many policy steps the agent has to perform before starting the training.
* `train_every`: how many policy steps the agent has to perform between one training and the next.
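
A quick numeric illustration of the `env_steps` relation above (the numbers are arbitrary):

```python
# Worked example with arbitrary numbers.
max_episode_steps = 1_000  # policy steps before the episode is cut off
action_repeat = 4          # each policy step repeats the action 4 times

env_steps = max_episode_steps * action_repeat
print(env_steps)  # 4000 low-level environment steps at most
```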

## Gradient steps
A *gradient step* consists of an update of the parameters of the agent, i.e., a call to the *train* function. The number of gradient steps is proportional to the number of parallel processes: if there are $n$ parallel processes, `n * gradient_steps` calls to the *train* method will be executed.
A *gradient step* consists of an update of the parameters of the agent, i.e., a call to the *train* function. The number of gradient steps is proportional to the number of parallel processes: if there are $n$ parallel processes, `n * per_rank_gradient_steps` calls to the *train* method will be executed.

The hyper-parameters which refer to the *gradient steps* are:
* `algo.per_rank_gradient_steps`: the number of gradient steps per rank to perform in a single iteration.
* `algo.per_rank_pretrain_steps`: the number of gradient steps per rank to perform in the first iteration.
* `algo.per_rank_pretrain_steps`: the number of gradient steps per rank to perform in the first iteration.

> [!NOTE]
>
> The `replay_ratio` is the ratio between the gradient steps and the policy steps played by the agent.
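
As a rough, back-of-the-envelope sketch of that definition (the exact accounting of parallel environments and ranks in SheepRL may differ, so treat the numbers and the bookkeeping as illustrative):

```python
# Illustrative replay-ratio arithmetic: gradient steps vs. policy steps.
per_rank_gradient_steps = 1
num_envs = 4      # environments per rank
world_size = 2    # number of parallel processes (ranks)

# Assume each rank plays `num_envs` policy steps per iteration and then
# performs `per_rank_gradient_steps` updates.
policy_steps_per_iteration = num_envs * world_size
gradient_steps_per_iteration = per_rank_gradient_steps * world_size

replay_ratio = gradient_steps_per_iteration / policy_steps_per_iteration
print(replay_ratio)  # 0.25
```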
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -81,8 +81,8 @@ atari = [
"gymnasium[accept-rom-license]==0.29.*",
"gymnasium[other]==0.29.*",
]
minedojo = ["minedojo==0.1", "importlib_resources==5.12.0"]
minerl = ["setuptools==66.0.0", "minerl==0.4.4"]
minedojo = ["minedojo==0.1", "importlib_resources==5.12.0", "gym==0.21.0"]
minerl = ["setuptools==66.0.0", "minerl==0.4.4", "gym==0.19.0"]
diambra = ["diambra==0.0.17", "diambra-arena==2.2.6"]
crafter = ["crafter==1.8.3"]
mlflow = ["mlflow==2.11.1"]
8 changes: 3 additions & 5 deletions sheeprl/algos/a2c/a2c.py
@@ -243,7 +243,7 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
actions = torch.cat(actions, -1).cpu().numpy()

# Single environment step
obs, rewards, done, truncated, info = envs.step(real_actions.reshape(envs.action_space.shape))
obs, rewards, terminated, truncated, info = envs.step(real_actions.reshape(envs.action_space.shape))
truncated_envs = np.nonzero(truncated)[0]
if len(truncated_envs) > 0:
real_next_obs = {
@@ -266,10 +266,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
rewards[truncated_envs] += cfg.algo.gamma * vals.cpu().numpy().reshape(
rewards[truncated_envs].shape
)

dones = np.logical_or(done, truncated)
dones = dones.reshape(cfg.env.num_envs, -1)
rewards = rewards.reshape(cfg.env.num_envs, -1)
dones = np.logical_or(terminated, truncated).reshape(cfg.env.num_envs, -1).astype(np.uint8)
rewards = rewards.reshape(cfg.env.num_envs, -1)

# Update the step data
step_data["dones"] = dones[np.newaxis]
25 changes: 15 additions & 10 deletions sheeprl/algos/dreamer_v1/dreamer_v1.py
@@ -176,9 +176,9 @@ def train(
# compute predictions for terminal steps, if required
if cfg.algo.world_model.use_continues and world_model.continue_model:
qc = Independent(Bernoulli(logits=world_model.continue_model(latent_states)), 1)
continue_targets = (1 - data["dones"]) * cfg.algo.gamma
continues_targets = (1 - data["terminated"]) * cfg.algo.gamma
else:
qc = continue_targets = None
qc = continues_targets = None

# compute the distributions of the states (posteriors and priors)
# an Independent distribution is necessary because
@@ -200,7 +200,7 @@
cfg.algo.world_model.kl_free_nats,
cfg.algo.world_model.kl_regularizer,
qc,
continue_targets,
continues_targets,
cfg.algo.world_model.continue_scale_factor,
)
fabric.backward(rec_loss)
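
The switch from `data["dones"]` to `data["terminated"]` matters here because a truncated episode would have continued if the time limit were removed, so it should not teach the continue model that the episode ends. A minimal sketch of the target computation, with made-up values:

```python
import torch

gamma = 0.99  # illustrative discount factor
# One small batch of stored transitions: 1 means the episode really ended there.
terminated = torch.tensor([[0.0], [0.0], [1.0], [0.0]])

# Target for the continue predictor: only true terminations stop the episode;
# time-limit truncations still count as "continue".
continues_targets = (1 - terminated) * gamma
print(continues_targets.squeeze(-1))  # tensor([0.9900, 0.9900, 0.0000, 0.9900])
```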
@@ -554,7 +554,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
if k in cfg.algo.cnn_keys.encoder:
obs[k] = obs[k].reshape(cfg.env.num_envs, -1, *obs[k].shape[-2:])
step_data[k] = obs[k][np.newaxis]
step_data["dones"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["terminated"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["truncated"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["actions"] = np.zeros((1, cfg.env.num_envs, sum(actions_dim)))
step_data["rewards"] = np.zeros((1, cfg.env.num_envs, 1))
rb.add(step_data, validate_args=cfg.buffer.validate_args)
@@ -601,8 +602,10 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
real_actions = (
torch.cat([real_act.argmax(dim=-1) for real_act in real_actions], dim=-1).cpu().numpy()
)
next_obs, rewards, dones, truncated, infos = envs.step(real_actions.reshape(envs.action_space.shape))
dones = np.logical_or(dones, truncated).astype(np.uint8)
next_obs, rewards, terminated, truncated, infos = envs.step(
real_actions.reshape(envs.action_space.shape)
)
dones = np.logical_or(terminated, truncated).astype(np.uint8)

if cfg.metric.log_level > 0 and "final_info" in infos:
for i, agent_ep_info in enumerate(infos["final_info"]):
@@ -631,7 +634,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
# next_obs becomes the new obs
obs = next_obs

step_data["dones"] = dones[np.newaxis]
step_data["terminated"] = terminated[np.newaxis]
step_data["truncated"] = truncated[np.newaxis]
step_data["actions"] = actions[np.newaxis]
step_data["rewards"] = clip_rewards_fn(rewards)[np.newaxis]
rb.add(step_data, validate_args=cfg.buffer.validate_args)
@@ -643,13 +647,14 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
reset_data = {}
for k in obs_keys:
reset_data[k] = (next_obs[k][dones_idxes])[np.newaxis]
reset_data["dones"] = np.zeros((1, reset_envs, 1))
reset_data["terminated"] = np.zeros((1, reset_envs, 1))
reset_data["truncated"] = np.zeros((1, reset_envs, 1))
reset_data["actions"] = np.zeros((1, reset_envs, np.sum(actions_dim)))
reset_data["rewards"] = np.zeros((1, reset_envs, 1))
rb.add(reset_data, dones_idxes, validate_args=cfg.buffer.validate_args)
# Reset dones so that `is_first` is updated
for d in dones_idxes:
step_data["dones"][0, d] = np.zeros_like(step_data["dones"][0, d])
step_data["terminated"][0, d] = np.zeros_like(step_data["terminated"][0, d])
step_data["truncated"][0, d] = np.zeros_like(step_data["truncated"][0, d])
# Reset internal agent states
player.init_states(reset_envs=dones_idxes)

37 changes: 22 additions & 15 deletions sheeprl/algos/dreamer_v2/dreamer_v2.py
@@ -168,9 +168,9 @@ def train(
# Compute the distribution over the terminal steps, if required
if cfg.algo.world_model.use_continues and world_model.continue_model:
pc = Independent(Bernoulli(logits=world_model.continue_model(latent_states)), 1)
continue_targets = (1 - data["dones"]) * cfg.algo.gamma
continues_targets = (1 - data["terminated"]) * cfg.algo.gamma
else:
pc = continue_targets = None
pc = continues_targets = None

# Reshape posterior and prior logits to shape [T, B, 32, 32]
priors_logits = priors_logits.view(*priors_logits.shape[:-1], stochastic_size, discrete_size)
@@ -190,7 +190,7 @@
cfg.algo.world_model.kl_free_avg,
cfg.algo.world_model.kl_regularizer,
pc,
continue_targets,
continues_targets,
cfg.algo.world_model.discount_scale_factor,
)
fabric.backward(rec_loss)
@@ -264,8 +264,8 @@ def train(
predicted_rewards = world_model.reward_model(imagined_trajectories)
if cfg.algo.world_model.use_continues and world_model.continue_model:
continues = logits_to_probs(world_model.continue_model(imagined_trajectories), is_binary=True)
true_done = (1 - data["dones"]).reshape(1, -1, 1) * cfg.algo.gamma
continues = torch.cat((true_done, continues[1:]))
true_continue = (1 - data["terminated"]).reshape(1, -1, 1) * cfg.algo.gamma
continues = torch.cat((true_continue, continues[1:]))
else:
continues = torch.ones_like(predicted_rewards.detach()) * cfg.algo.gamma
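
In the imagination rollout above, only the first step has a known outcome, so its real `terminated` flag replaces the continue model's prediction while later steps keep the predicted probabilities. A small sketch of that concatenation with made-up tensors (shapes and values are illustrative, not the actual model outputs):

```python
import torch

gamma = 0.99
horizon, batch = 4, 2

# Continuation probabilities predicted by the continue model for each imagined step.
predicted_continues = torch.full((horizon, batch, 1), 0.9)

# The starting states come from real data, so their `terminated` flags are known
# and override the first predicted step.
terminated = torch.tensor([[0.0], [1.0]])
true_continue = (1 - terminated).reshape(1, -1, 1) * gamma
continues = torch.cat((true_continue, predicted_continues[1:]))
print(continues[:, 1, 0])  # tensor([0.0000, 0.9000, 0.9000, 0.9000])
```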

@@ -576,12 +576,14 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
obs = envs.reset(seed=cfg.seed)[0]
for k in obs_keys:
step_data[k] = obs[k][np.newaxis]
step_data["dones"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["terminated"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["truncated"] = np.zeros((1, cfg.env.num_envs, 1))
if cfg.dry_run:
step_data["dones"] = step_data["dones"] + 1
step_data["truncated"] = step_data["truncated"] + 1
step_data["terminated"] = step_data["terminated"] + 1
step_data["actions"] = np.zeros((1, cfg.env.num_envs, sum(actions_dim)))
step_data["rewards"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["is_first"] = np.ones_like(step_data["dones"])
step_data["is_first"] = np.ones_like(step_data["terminated"])
rb.add(step_data, validate_args=cfg.buffer.validate_args)
player.init_states()

@@ -627,9 +629,11 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
torch.cat([real_act.argmax(dim=-1) for real_act in real_actions], dim=-1).cpu().numpy()
)

step_data["is_first"] = copy.deepcopy(step_data["dones"])
next_obs, rewards, dones, truncated, infos = envs.step(real_actions.reshape(envs.action_space.shape))
dones = np.logical_or(dones, truncated).astype(np.uint8)
step_data["is_first"] = copy.deepcopy(np.logical_or(step_data["terminated"], step_data["truncated"]))
next_obs, rewards, terminated, truncated, infos = envs.step(
real_actions.reshape(envs.action_space.shape)
)
dones = np.logical_or(terminated, truncated).astype(np.uint8)
if cfg.dry_run and buffer_type == "episode":
dones = np.ones_like(dones)

@@ -657,7 +661,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
# Next_obs becomes the new obs
obs = next_obs

step_data["dones"] = dones.reshape((1, cfg.env.num_envs, -1))
step_data["terminated"] = terminated.reshape((1, cfg.env.num_envs, -1))
step_data["truncated"] = truncated.reshape((1, cfg.env.num_envs, -1))
step_data["actions"] = actions.reshape((1, cfg.env.num_envs, -1))
step_data["rewards"] = clip_rewards_fn(rewards).reshape((1, cfg.env.num_envs, -1))
rb.add(step_data, validate_args=cfg.buffer.validate_args)
@@ -669,14 +674,16 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
reset_data = {}
for k in obs_keys:
reset_data[k] = (next_obs[k][dones_idxes])[np.newaxis]
reset_data["dones"] = np.zeros((1, reset_envs, 1))
reset_data["terminated"] = np.zeros((1, reset_envs, 1))
reset_data["truncated"] = np.zeros((1, reset_envs, 1))
reset_data["actions"] = np.zeros((1, reset_envs, np.sum(actions_dim)))
reset_data["rewards"] = np.zeros((1, reset_envs, 1))
reset_data["is_first"] = np.ones_like(reset_data["dones"])
reset_data["is_first"] = np.ones_like(reset_data["terminated"])
rb.add(reset_data, dones_idxes, validate_args=cfg.buffer.validate_args)
# Reset dones so that `is_first` is updated
for d in dones_idxes:
step_data["dones"][0, d] = np.zeros_like(step_data["dones"][0, d])
step_data["terminated"][0, d] = np.zeros_like(step_data["terminated"][0, d])
step_data["truncated"][0, d] = np.zeros_like(step_data["truncated"][0, d])
# Reset internal agent states
player.init_states(dones_idxes)
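
Across the Dreamer rollout loops the episode-boundary bookkeeping follows the same pattern: `is_first` is raised for the step right after a boundary, whether the previous episode terminated or was truncated. A compressed, illustrative NumPy sketch (not the actual buffer API):

```python
import numpy as np

num_envs = 3
# Flags stored at the previous step: env 1 really ended, env 2 hit a time limit.
prev_terminated = np.array([[[0.0], [1.0], [0.0]]])
prev_truncated = np.array([[[0.0], [0.0], [1.0]]])

# The first observation after any episode boundary is flagged `is_first`,
# regardless of whether the boundary was a termination or a truncation.
is_first = np.logical_or(prev_terminated, prev_truncated).astype(np.float32)
print(is_first.squeeze(-1))  # [[0. 1. 1.]]
```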


