Fix/terminated truncated #252

Merged
merged 49 commits on Apr 2, 2024

Commits (49)
3711d05
Decoupled RSSM for DV3 agent
belerico Feb 8, 2024
e80e9d5
Initialize posterior with prior if is_first is True
belerico Feb 8, 2024
b23112a
Merge branch 'main' of https://github.com/Eclectic-Sheep/sheeprl into…
belerico Feb 12, 2024
f47b8f9
Fix PlayerDV3 creation in evaluation
belerico Feb 12, 2024
e42c83d
Merge branch 'main' of https://github.com/Eclectic-Sheep/sheeprl into…
belerico Feb 26, 2024
2ec4fbb
Fix representation_model
belerico Feb 26, 2024
3a5380b
Fix compute first prior state with a zero posterior
belerico Feb 27, 2024
42d9433
DV3 replay ratio conversion
belerico Feb 29, 2024
750f671
Merge branch 'main' of https://github.com/Eclectic-Sheep/sheeprl into…
belerico Feb 29, 2024
b06433b
Removed expl parameters dependent on old per_Rank_gradient_steps
belerico Feb 29, 2024
20cc43e
Merge branch 'main' of https://github.com/Eclectic-Sheep/sheeprl into…
belerico Mar 4, 2024
37d0e86
Merge branch 'main' of github.com:Eclectic-Sheep/sheeprl into feature…
michele-milesi Mar 18, 2024
704b0ce
feat: update repeats computation
michele-milesi Mar 18, 2024
20905f0
Merge branch 'main' of github.com:Eclectic-Sheep/sheeprl into feature…
michele-milesi Mar 28, 2024
e1290ee
feat: update learning starts in config
michele-milesi Mar 28, 2024
1f0c0ef
fix: remove files
michele-milesi Mar 28, 2024
cd4a4c4
feat: update repeats
michele-milesi Mar 28, 2024
b17d451
Let Dv3 compute bootstrap correctly
belerico Mar 28, 2024
e8c9049
feat: added replay ratio and update exploration
michele-milesi Mar 28, 2024
88c6968
Fix exploration actions computation on DV1
belerico Mar 28, 2024
a5c957c
Fix naming
belerico Mar 28, 2024
c36577d
Add replay-ratio to SAC
belerico Mar 28, 2024
0bc9f07
feat: added replay ratio to p2e algos
michele-milesi Mar 28, 2024
b5fbe5d
feat: update configs and utils of p2e algos
michele-milesi Mar 28, 2024
24c9352
Add replay-ratio to SAC-AE
belerico Mar 28, 2024
a11b558
Merge branch 'feature/replay-ratio' of https://github.com/Eclectic-Sh…
belerico Mar 28, 2024
32b89b4
Add DrOQ replay ratio
belerico Mar 29, 2024
d057886
Fix tests
belerico Mar 29, 2024
b9044a3
Fix mispelled
belerico Mar 29, 2024
5bd7d75
Fix wrong attribute accesing
belerico Mar 29, 2024
8d94f68
FIx naming and configs
belerico Mar 29, 2024
cae85a3
Merge branch 'fix/dv3-continue-on-terminated' of github.com:Eclectic-…
michele-milesi Mar 29, 2024
e5dd8fd
feat: add terminated and truncated to dreamer, p2e and ppo algos
michele-milesi Mar 29, 2024
fdd4579
fix: dmc wrapper
michele-milesi Mar 29, 2024
a2a2690
feat: update algos to split terminated from truncated
michele-milesi Mar 29, 2024
74bfb6b
fix: crafter and diambra wrappers
michele-milesi Mar 29, 2024
3d1f2c9
Merge branch 'main' of github.com:Eclectic-Sheep/sheeprl into fix/ter…
michele-milesi Mar 30, 2024
05e4370
feat: replace done with truncated key in when the buffer is added to …
michele-milesi Mar 30, 2024
87c9098
feat: added truncated/terminated to minedojo environment
michele-milesi Mar 30, 2024
e137a38
feat: added terminated/truncated to minerl and super mario bros envs
michele-milesi Apr 2, 2024
b557835
Merge branch 'main' of github.com:Eclectic-Sheep/sheeprl into fix/ter…
michele-milesi Apr 2, 2024
64d3c81
docs: update howto
michele-milesi Apr 2, 2024
2e156f3
fix: minedojo wrapper
michele-milesi Apr 2, 2024
0167fd5
docs: update
michele-milesi Apr 2, 2024
09e051e
fix: minedojo
michele-milesi Apr 2, 2024
dacd425
update dependencies
michele-milesi Apr 2, 2024
f2557a3
fix: minedojo
michele-milesi Apr 2, 2024
5bf50dd
fix: dv3 small configs
michele-milesi Apr 2, 2024
f58a3c2
fix: episode buffer and tests
michele-milesi Apr 2, 2024
5 changes: 5 additions & 0 deletions howto/learn_in_minedojo.md
@@ -62,9 +62,14 @@ Moreover, we restrict the look-up/down actions between `min_pitch` and `max_pitch` …
In addition, we added the forward action when the agent selects one of the following actions: `jump`, `sprint`, and `sneak`.
Finally, we added sticky actions for the `jump` and `attack` actions. You can set the values of the `sticky_jump` and `sticky_attack` parameters through the `env.sticky_jump` and `env.sticky_attack` cli arguments, respectively. The sticky actions, if set, force the agent to repeat the selected actions for a certain number of steps.

> [!NOTE]
>
> The `env.sticky_attack` parameter is set to `0` if the `env.break_speed_multiplier > 1`.

For more information about the MineDojo action space, check [here](https://docs.minedojo.org/sections/core_api/action_space.html).

> [!NOTE]
>
> Since the MineDojo environments have a multi-discrete action space, the sticky actions can be easily implemented. The agent will perform the selected action and the sticky actions simultaneously.
>
> The action repeat in the Minecraft environments is set to 1: it makes no sense to force the agent to repeat an action such as crafting (it may not have enough material for the second action).
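
As an illustration of the sticky-action mechanism described above, here is a minimal sketch of how a wrapper could repeat `jump` and `attack` for a fixed number of steps in a multi-discrete action space. The class name, the action indices (2 for jump, 5 for attack), and the attack value `3` are assumptions made for the example, not SheepRL's actual implementation.

```python
import numpy as np

class StickyActionsSketch:
    """Illustrative sticky-action logic for a multi-discrete action vector."""

    def __init__(self, sticky_jump: int = 10, sticky_attack: int = 30):
        self.sticky_jump = sticky_jump      # how many steps to keep jumping
        self.sticky_attack = sticky_attack  # how many steps to keep attacking
        self._jump_left = 0
        self._attack_left = 0

    def apply(self, action: np.ndarray) -> np.ndarray:
        action = action.copy()
        # Restart the counters whenever the agent itself selects the action
        if self.sticky_jump > 0 and action[2] == 1:
            self._jump_left = self.sticky_jump
        if self.sticky_attack > 0 and action[5] == 3:
            self._attack_left = self.sticky_attack
        # While a counter is active, force the action even if it was not selected;
        # the multi-discrete space lets it compose with the rest of the chosen action.
        if self._jump_left > 0:
            action[2] = 1
            self._jump_left -= 1
        if self._attack_left > 0:
            action[5] = 3
            self._attack_left -= 1
        return action
```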
5 changes: 5 additions & 0 deletions howto/learn_in_minerl.md
@@ -47,10 +47,15 @@ In addition, we added the forward action when the agent selects one of the following …
Finally, we added sticky actions for the `jump` and `attack` actions. You can set the values of the `sticky_jump` and `sticky_attack` parameters through the `env.sticky_jump` and `env.sticky_attack` arguments, respectively. The sticky actions, if set, force the agent to repeat the selected actions for a certain number of steps.

> [!NOTE]
>
> Since the MineRL environments have a multi-discrete action space, the sticky actions can be easily implemented. The agent will perform the selected action and the sticky actions simultaneously.
>
> The action repeat in the Minecraft environments is set to 1: it makes no sense to force the agent to repeat an action such as crafting (it may not have enough material for the second action).

> [!NOTE]
>
> The `env.sticky_attack` parameter is set to `0` if the `env.break_speed_multiplier > 1`.

## Headless machines

If you work on a headless machine, you need a software renderer. We recommend adopting one of the following solutions:
1 change: 0 additions & 1 deletion howto/logs_and_checkpoints.md
@@ -122,7 +122,6 @@ AGGREGATOR_KEYS = {
"State/post_entropy",
"State/prior_entropy",
"State/kl",
"Params/exploration_amount",
"Grads/world_model",
"Grads/actor",
"Grads/critic",
1 change: 1 addition & 0 deletions howto/select_observations.md
@@ -8,6 +8,7 @@ In the first case, the observations are returned in the form of python dictionaries …

### Both observations
The algorithms that can work with both image and vector observations are specified in [Table 1](../README.md) in the README, and are reported here:
* A2C
* PPO
* PPO Recurrent
* SAC-AE
10 changes: 6 additions & 4 deletions howto/work_with_steps.md
@@ -22,11 +22,13 @@ The hyper-parameters that refer to the *policy steps* are:
* `exploration_steps`: the number of policy steps in which the agent explores the environment in the P2E algorithms.
* `max_episode_steps`: the maximum number of policy steps an episode can last (`max_steps`); when this number is reached, a `terminated=True` is returned by the environment. This means that if you decide to have an action repeat greater than one (`action_repeat > 1`), then the environment performs a maximum number of steps equal to `env_steps = max_steps * action_repeat` (see the worked example after this list).
* `learning_starts`: how many policy steps the agent has to perform before starting the training.
* `train_every`: how many policy steps the agent has to perform between one training and the next.
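
A quick worked example of the `env_steps = max_steps * action_repeat` relation above, with purely hypothetical numbers:

```python
# Hypothetical values, only to make the formula above concrete.
max_episode_steps = 1000  # policy steps an episode can last
action_repeat = 4         # each policy step is repeated 4 times in the environment

env_steps = max_episode_steps * action_repeat
print(env_steps)  # 4000 low-level environment steps before the episode is cut
```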

## Gradient steps
A *gradient step* consists of an update of the parameters of the agent, i.e., a call of the *train* function. The number of gradient steps is proportional to the number of parallel processes: if there are $n$ parallel processes, `n * gradient_steps` calls to the *train* method will be executed.
A *gradient step* consists of an update of the parameters of the agent, i.e., a call of the *train* function. The number of gradient steps is proportional to the number of parallel processes: if there are $n$ parallel processes, `n * per_rank_gradient_steps` calls to the *train* method will be executed.

The hyper-parameters which refer to the *gradient steps* are:
* `algo.per_rank_gradient_steps`: the number of gradient steps per rank to perform in a single iteration.
* `algo.per_rank_pretrain_steps`: the number of gradient steps per rank to perform in the first iteration.

> [!NOTE]
>
> The `replay_ratio` is the ratio between the gradient steps and the policy steps played by the agent.
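
To make the note above concrete, here is a rough sketch, under assumed names (`replay_ratio`, `policy_steps`, and `world_size` are illustrative, not necessarily the actual configuration keys), of how a replay ratio could be turned into per-rank gradient steps:

```python
def per_rank_gradient_steps(replay_ratio: float, policy_steps: int, world_size: int) -> int:
    """Illustrative only: gradient steps each rank should run so that, summed over
    ranks, gradient_steps / policy_steps is roughly equal to the replay ratio."""
    total_gradient_steps = replay_ratio * policy_steps
    return max(1, round(total_gradient_steps / world_size))

# Example: replay_ratio=0.5, 128 policy steps per iteration, 2 processes
# -> about 32 train() calls per rank per iteration (64 in total).
print(per_rank_gradient_steps(0.5, 128, 2))
```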
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -81,8 +81,8 @@ atari = [
"gymnasium[accept-rom-license]==0.29.*",
"gymnasium[other]==0.29.*",
]
minedojo = ["minedojo==0.1", "importlib_resources==5.12.0"]
minerl = ["setuptools==66.0.0", "minerl==0.4.4"]
minedojo = ["minedojo==0.1", "importlib_resources==5.12.0", "gym==0.21.0"]
minerl = ["setuptools==66.0.0", "minerl==0.4.4", "gym==0.19.0"]
diambra = ["diambra==0.0.17", "diambra-arena==2.2.6"]
crafter = ["crafter==1.8.3"]
mlflow = ["mlflow==2.11.1"]
8 changes: 3 additions & 5 deletions sheeprl/algos/a2c/a2c.py
@@ -243,7 +243,7 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
actions = torch.cat(actions, -1).cpu().numpy()

# Single environment step
obs, rewards, done, truncated, info = envs.step(real_actions.reshape(envs.action_space.shape))
obs, rewards, terminated, truncated, info = envs.step(real_actions.reshape(envs.action_space.shape))
truncated_envs = np.nonzero(truncated)[0]
if len(truncated_envs) > 0:
real_next_obs = {
@@ -266,10 +266,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
rewards[truncated_envs] += cfg.algo.gamma * vals.cpu().numpy().reshape(
rewards[truncated_envs].shape
)

dones = np.logical_or(done, truncated)
dones = dones.reshape(cfg.env.num_envs, -1)
rewards = rewards.reshape(cfg.env.num_envs, -1)
dones = np.logical_or(terminated, truncated).reshape(cfg.env.num_envs, -1).astype(np.uint8)
rewards = rewards.reshape(cfg.env.num_envs, -1)

# Update the step data
step_data["dones"] = dones[np.newaxis]
25 changes: 15 additions & 10 deletions sheeprl/algos/dreamer_v1/dreamer_v1.py
@@ -176,9 +176,9 @@ def train(
# compute predictions for terminal steps, if required
if cfg.algo.world_model.use_continues and world_model.continue_model:
qc = Independent(Bernoulli(logits=world_model.continue_model(latent_states)), 1)
continue_targets = (1 - data["dones"]) * cfg.algo.gamma
continues_targets = (1 - data["terminated"]) * cfg.algo.gamma
else:
qc = continue_targets = None
qc = continues_targets = None

# compute the distributions of the states (posteriors and priors)
# it is necessary an Independent distribution because
@@ -200,7 +200,7 @@
cfg.algo.world_model.kl_free_nats,
cfg.algo.world_model.kl_regularizer,
qc,
continue_targets,
continues_targets,
cfg.algo.world_model.continue_scale_factor,
)
fabric.backward(rec_loss)
@@ -554,7 +554,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
if k in cfg.algo.cnn_keys.encoder:
obs[k] = obs[k].reshape(cfg.env.num_envs, -1, *obs[k].shape[-2:])
step_data[k] = obs[k][np.newaxis]
step_data["dones"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["terminated"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["truncated"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["actions"] = np.zeros((1, cfg.env.num_envs, sum(actions_dim)))
step_data["rewards"] = np.zeros((1, cfg.env.num_envs, 1))
rb.add(step_data, validate_args=cfg.buffer.validate_args)
@@ -601,8 +602,10 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
real_actions = (
torch.cat([real_act.argmax(dim=-1) for real_act in real_actions], dim=-1).cpu().numpy()
)
next_obs, rewards, dones, truncated, infos = envs.step(real_actions.reshape(envs.action_space.shape))
dones = np.logical_or(dones, truncated).astype(np.uint8)
next_obs, rewards, terminated, truncated, infos = envs.step(
real_actions.reshape(envs.action_space.shape)
)
dones = np.logical_or(terminated, truncated).astype(np.uint8)

if cfg.metric.log_level > 0 and "final_info" in infos:
for i, agent_ep_info in enumerate(infos["final_info"]):
@@ -631,7 +634,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
# next_obs becomes the new obs
obs = next_obs

step_data["dones"] = dones[np.newaxis]
step_data["terminated"] = terminated[np.newaxis]
step_data["truncated"] = truncated[np.newaxis]
step_data["actions"] = actions[np.newaxis]
step_data["rewards"] = clip_rewards_fn(rewards)[np.newaxis]
rb.add(step_data, validate_args=cfg.buffer.validate_args)
@@ -643,13 +647,14 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
reset_data = {}
for k in obs_keys:
reset_data[k] = (next_obs[k][dones_idxes])[np.newaxis]
reset_data["dones"] = np.zeros((1, reset_envs, 1))
reset_data["terminated"] = np.zeros((1, reset_envs, 1))
reset_data["truncated"] = np.zeros((1, reset_envs, 1))
reset_data["actions"] = np.zeros((1, reset_envs, np.sum(actions_dim)))
reset_data["rewards"] = np.zeros((1, reset_envs, 1))
rb.add(reset_data, dones_idxes, validate_args=cfg.buffer.validate_args)
# Reset dones so that `is_first` is updated
for d in dones_idxes:
step_data["dones"][0, d] = np.zeros_like(step_data["dones"][0, d])
step_data["terminated"][0, d] = np.zeros_like(step_data["terminated"][0, d])
step_data["truncated"][0, d] = np.zeros_like(step_data["truncated"][0, d])
# Reset internal agent states
player.init_states(reset_envs=dones_idxes)
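
Across the Dreamer hunks the recurring pattern is that the continue/discount targets are computed from `terminated` only, while `terminated` and `truncated` are stored separately in the buffer and their logical OR is used just to decide which environments to reset. A tiny, self-contained sketch of that distinction (hypothetical values, not the training loop itself):

```python
import numpy as np

# Hypothetical flags for 4 parallel environments after one step.
terminated = np.array([0, 1, 0, 0], dtype=np.uint8)  # real environment termination
truncated = np.array([0, 0, 1, 0], dtype=np.uint8)   # time-limit cut, episode could go on
gamma = 0.997

# Resets (and `is_first`) care about either flag ...
dones = np.logical_or(terminated, truncated).astype(np.uint8)

# ... but the continue model's targets only care about real termination:
# a truncated episode should still be treated as continuing for discounting.
continues_targets = (1 - terminated) * gamma

print(dones)             # [0 1 1 0]
print(continues_targets) # [0.997 0.    0.997 0.997]
```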

37 changes: 22 additions & 15 deletions sheeprl/algos/dreamer_v2/dreamer_v2.py
@@ -168,9 +168,9 @@
# Compute the distribution over the terminal steps, if required
if cfg.algo.world_model.use_continues and world_model.continue_model:
pc = Independent(Bernoulli(logits=world_model.continue_model(latent_states)), 1)
continue_targets = (1 - data["dones"]) * cfg.algo.gamma
continues_targets = (1 - data["terminated"]) * cfg.algo.gamma
else:
pc = continue_targets = None
pc = continues_targets = None

# Reshape posterior and prior logits to shape [T, B, 32, 32]
priors_logits = priors_logits.view(*priors_logits.shape[:-1], stochastic_size, discrete_size)
@@ -190,7 +190,7 @@
cfg.algo.world_model.kl_free_avg,
cfg.algo.world_model.kl_regularizer,
pc,
continue_targets,
continues_targets,
cfg.algo.world_model.discount_scale_factor,
)
fabric.backward(rec_loss)
@@ -264,8 +264,8 @@ def train(
predicted_rewards = world_model.reward_model(imagined_trajectories)
if cfg.algo.world_model.use_continues and world_model.continue_model:
continues = logits_to_probs(world_model.continue_model(imagined_trajectories), is_binary=True)
true_done = (1 - data["dones"]).reshape(1, -1, 1) * cfg.algo.gamma
continues = torch.cat((true_done, continues[1:]))
true_continue = (1 - data["terminated"]).reshape(1, -1, 1) * cfg.algo.gamma
continues = torch.cat((true_continue, continues[1:]))
else:
continues = torch.ones_like(predicted_rewards.detach()) * cfg.algo.gamma

@@ -576,12 +576,14 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
obs = envs.reset(seed=cfg.seed)[0]
for k in obs_keys:
step_data[k] = obs[k][np.newaxis]
step_data["dones"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["terminated"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["truncated"] = np.zeros((1, cfg.env.num_envs, 1))
if cfg.dry_run:
step_data["dones"] = step_data["dones"] + 1
step_data["truncated"] = step_data["truncated"] + 1
step_data["terminated"] = step_data["terminated"] + 1
step_data["actions"] = np.zeros((1, cfg.env.num_envs, sum(actions_dim)))
step_data["rewards"] = np.zeros((1, cfg.env.num_envs, 1))
step_data["is_first"] = np.ones_like(step_data["dones"])
step_data["is_first"] = np.ones_like(step_data["terminated"])
rb.add(step_data, validate_args=cfg.buffer.validate_args)
player.init_states()

@@ -627,9 +629,11 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
torch.cat([real_act.argmax(dim=-1) for real_act in real_actions], dim=-1).cpu().numpy()
)

step_data["is_first"] = copy.deepcopy(step_data["dones"])
next_obs, rewards, dones, truncated, infos = envs.step(real_actions.reshape(envs.action_space.shape))
dones = np.logical_or(dones, truncated).astype(np.uint8)
step_data["is_first"] = copy.deepcopy(np.logical_or(step_data["terminated"], step_data["truncated"]))
next_obs, rewards, terminated, truncated, infos = envs.step(
real_actions.reshape(envs.action_space.shape)
)
dones = np.logical_or(terminated, truncated).astype(np.uint8)
if cfg.dry_run and buffer_type == "episode":
dones = np.ones_like(dones)

@@ -657,7 +661,8 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
# Next_obs becomes the new obs
obs = next_obs

step_data["dones"] = dones.reshape((1, cfg.env.num_envs, -1))
step_data["terminated"] = terminated.reshape((1, cfg.env.num_envs, -1))
step_data["truncated"] = truncated.reshape((1, cfg.env.num_envs, -1))
step_data["actions"] = actions.reshape((1, cfg.env.num_envs, -1))
step_data["rewards"] = clip_rewards_fn(rewards).reshape((1, cfg.env.num_envs, -1))
rb.add(step_data, validate_args=cfg.buffer.validate_args)
@@ -669,14 +674,16 @@ def main(fabric: Fabric, cfg: Dict[str, Any]):
reset_data = {}
for k in obs_keys:
reset_data[k] = (next_obs[k][dones_idxes])[np.newaxis]
reset_data["dones"] = np.zeros((1, reset_envs, 1))
reset_data["terminated"] = np.zeros((1, reset_envs, 1))
reset_data["truncated"] = np.zeros((1, reset_envs, 1))
reset_data["actions"] = np.zeros((1, reset_envs, np.sum(actions_dim)))
reset_data["rewards"] = np.zeros((1, reset_envs, 1))
reset_data["is_first"] = np.ones_like(reset_data["dones"])
reset_data["is_first"] = np.ones_like(reset_data["terminated"])
rb.add(reset_data, dones_idxes, validate_args=cfg.buffer.validate_args)
# Reset dones so that `is_first` is updated
for d in dones_idxes:
step_data["dones"][0, d] = np.zeros_like(step_data["dones"][0, d])
step_data["terminated"][0, d] = np.zeros_like(step_data["terminated"][0, d])
step_data["truncated"][0, d] = np.zeros_like(step_data["truncated"][0, d])
# Reset internal agent states
player.init_states(dones_idxes)
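
The reset bookkeeping in the hunk above follows a simple rule: the record appended for a freshly reset environment has `is_first = 1` and zeroed flags, and the flags of that environment in `step_data` are cleared so that the next `is_first = terminated OR truncated` evaluates to `False` again. A condensed sketch of just that bookkeeping (illustrative shapes and indices, not the actual loop):

```python
import numpy as np

num_envs = 4
dones_idxes = [1, 3]  # hypothetical: environments 1 and 3 just finished
step_data = {
    "terminated": np.array([[[0], [1], [0], [0]]], dtype=np.uint8),
    "truncated": np.array([[[0], [0], [0], [1]]], dtype=np.uint8),
}

# Record appended for the reset environments: a new episode starts here.
reset_data = {
    "terminated": np.zeros((1, len(dones_idxes), 1)),
    "truncated": np.zeros((1, len(dones_idxes), 1)),
}
reset_data["is_first"] = np.ones_like(reset_data["terminated"])

# Clear the flags of the reset environments so that, on the next step,
# `is_first = terminated OR truncated` is False again for them.
for d in dones_idxes:
    step_data["terminated"][0, d] = 0
    step_data["truncated"][0, d] = 0
```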
