[RLlib; Off-policy] Add sequence sampling to 'EpisodeReplayBuffer'. #48116
Conversation
…construction to slicing. This is still incomplete b/c it needs the correct discounting of rewards in case a sequence is returned. In addition, the steps between the end of the sequence and the n-step need to be removed. Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
@@ -518,78 +525,82 @@ def _sample_episodes(
    if random_n_step:
        actual_n_step = int(self.rng.integers(n_step[0], n_step[1]))

    lookback = int(episode_ts != 0)
Based on our slack discussion, just some thoughts:
- Maybe the user should provide the lookback value?
- Even if the user requires a lookback of 10000, it shouldn't matter in filtering out episodes here, b/c if the episode does not reach back that many timesteps, that's also ok and the lookback will result in fill values, not actual episode values.
- Thus, in the if-block below, we should simply do:
if episode_ts + batch_length_T + (actual_n_step - 1) > len(episode):
    continue
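A minimal sketch of that check, pulled out as a standalone helper (names follow the diff above; the helper itself is hypothetical, not part of the buffer code):

```python
def fits_into_episode(episode, episode_ts, batch_length_T, actual_n_step):
    # The sampled window only has to fit *forward* from `episode_ts`.
    # The lookback reaches backward and can be padded with fill values,
    # so even a very large lookback never disqualifies an episode.
    return episode_ts + batch_length_T + (actual_n_step - 1) <= len(episode)
```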
If no lookback is provided I guess we run into this line and error out: https://github.com/ray-project/ray/blob/5669b479e13fe65308071f4de8b5b0763afa6aa7/rllib/connectors/common/add_states_from_episodes_to_batch.py#L374
We could set `fill=True`. The thing is that the starting point is 0 but the length is 1.
        episode_ts - lookback,
        episode_ts + actual_n_step + batch_length_T - 1,
    ),
    len_lookback_buffer=lookback,
This arg should already take care of the slice-adjustment, so you don't need to do this math above anymore:
sampled_episode = episode.slice(
    slice(episode_ts, episode_ts + batch_length_T + (actual_n_step - 1)),
    len_lookback_buffer=lookback,
)
@@ -473,6 +478,8 @@ def _sample_episodes(
        the extra model outputs at the `"obs"` in the batch is included (the
        timestep at which the action is computed).
    finalize: If episodes should be finalized.
    states: States of stateful `RLModule` that can be added to the
I'm not sure how this would work. We sample (randomly) some episode chunk(s) from the buffer, no? So how would the user know which states (at which timesteps and for which episodes) to provide?
The thing is that we actually have cases where the buffer does not know what a state looks like, as it does not know the module. For example: in behavior cloning the expert policy contains state only in rare cases, but if we want to train a stateful model we need to somehow provide states. My goal is to make the offline API as powerful as possible to apply it to real industry cases that will necessarily come with complex modules.
A different approach would be to add a connector that does so. This is probably the "nicer" solution and more aligned with our design. What do you think @sven1977?
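A rough sketch of what such a connector (or helper) could do, assuming the module exposes `get_initial_state()` (as added to `DQNRainbowRLModule` in this PR); the function name and the `"state_in"` key are illustrative, not RLlib's actual connector API:

```python
import numpy as np
import tree  # dm-tree, used throughout RLlib for nested structures


def add_missing_states(batch, rl_module, batch_size_B):
    # Illustrative only: if the (stateless) expert data carries no states,
    # seed every sampled sequence (batch row) with the module's initial state.
    if "state_in" in batch:
        return batch
    init_state = rl_module.get_initial_state()
    batch["state_in"] = tree.map_structure(
        # One copy of the initial state per batch row.
        lambda s: np.repeat(np.asarray(s)[None], repeats=batch_size_B, axis=0),
        init_state,
    )
    return batch
```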
I have 2 high level questions, which I think we should answer first:
- Do we even need n_step in combination with sequence sampling? It's overkill, no? If I have an RNN-based loss/model and I sample sequence chunks from the buffer, then I don't really care about n-step, b/c I already have something better: whole and longer sequences + a memory-capable model.
- Do we need lookback? I think we do, b/c the first state-out of each batch row is the state-out of the previous(!) timestep (the one in a lookback buffer of size 1). I do NOT think, however, that we need it for the (discounted) rewards. Unless, however, :) you have an LSTM that requires prev-action/reward inputs as well. As a first solution, I think users should have to provide the `lookback` as an argument to the `sample` method (see the call sketch right after this list).
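A sketch of that first solution from the caller's perspective; the `lookback` keyword is the argument proposed here, not (yet) part of the actual `sample()` signature:

```python
# Hypothetical call -- assuming `buffer` is an EpisodeReplayBuffer instance
# and `lookback` is the proposed new argument.
batch = buffer.sample(
    batch_size_B=32,     # number of sampled sequences
    batch_length_T=20,   # length of each sequence
    lookback=1,          # so the first state-in of each row can come from the
                         # state-out of the previous(!) timestep
)
```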
…d lookback. Furthermore, added 'get_initial_state' to 'DQNRainbowRLModule' and adapted module for stateful training. Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
…adeletion for sampled episodes. Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
…s necessary for plain DQN to learn. Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
…g architecture. Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
    }
    if include_extra_model_outputs
    else {}
if batch_length_T == 0:
if batch_length_T is None:
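For context, a tiny illustrative sketch of the distinction the suggestion draws (not the actual buffer code):

```python
# Illustrative only: `None` signals "no sequence sampling requested", which is
# not the same thing as an integer length that merely happens to be falsy.
if batch_length_T is None:
    # Sample single (n-step) transitions as before.
    ...
else:
    # Sample sequences of length `batch_length_T`.
    ...
```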
sampled_episodes.append(sampled_episode)

# Increment counter.
B += 1
B += batch_length_T or 1
👍
if Columns.NEXT_OBS in batch:
    self.add_n_batch_items(
        batch=batch,
        column="new_state_in",
nit: create a new Column constant?
I had a similar idea, but didn't do it before it was running. I have created a new column now.
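A sketch of what the nit suggests; the constant name follows the later rename to `next_state_in`, and where it would live (e.g. next to the other names on `Columns`) is an assumption:

```python
# Hypothetical constant instead of the raw "new_state_in" string literal.
NEXT_STATE_IN = "next_state_in"

# The buffer code above would then read roughly:
#     self.add_n_batch_items(batch=batch, column=NEXT_STATE_IN, ...)
```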
    key=Columns.STATE_OUT
)
else:
    state_outs = tree.map_structure(
Can we explain here why we would repeat the lookback state `len(episode)` times? What's the logic behind doing this?
Good point. The main case where this happens is in offline learning when the expert was non-stateful. I have added some comments to make it more clear.
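A small, self-contained sketch of the repetition being discussed, with dummy stand-ins for the lookback state and the episode length:

```python
import numpy as np
import tree  # dm-tree

# Dummy stand-ins: a nested initial/lookback state and an episode length.
init_state = {"h": np.zeros(16), "c": np.zeros(16)}
episode_len = 20  # would be `len(episode)` in the buffer code

# With a non-stateful (offline) expert the episode has no per-timestep
# state-outs; repeating the single lookback state once per timestep gives
# every step of the sampled sequence a state to feed the stateful module.
state_outs = tree.map_structure(
    lambda s: np.repeat(s[None], repeats=episode_len, axis=0),
    init_state,
)
assert state_outs["h"].shape == (20, 16)
```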
python/ray/data/exceptions.py (Outdated)
# from ray.util.rpdb import _is_ray_debugger_post_mortem_enabled |
# from ray.util.rpdb import _is_ray_debugger_post_mortem_enabled
from ray.util.rpdb import _is_ray_debugger_post_mortem_enabled |
Thanks for the fix @sven1977! I was already wondering where this linter message came from. I did not change it myself, however, and hoped that merging master would fix it.
@@ -132,6 +132,13 @@ def compute_q_values(self, batch: Dict[str, TensorType]) -> Dict[str, TensorType
    {"af": self.af, "vf": self.vf} if self.uses_dueling else self.af,
)

@override(RLModule)
haha, nice :)
del episode

# Add the actually chosen n-step in this episode.
sampled_episode.extra_model_outputs["n_step"] = InfiniteLookbackBuffer(
Do we need to add this information to the episode, even if we don't do n-step?
Yes, because in the loss we use the `n_step` as an exponent.
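For context, a sketch of how a stored per-sample `n_step` typically enters the TD target (standard n-step formulation, not the exact loss code in this PR):

```python
def n_step_td_target(n_step_rewards, q_next, terminateds, gamma, n_step):
    # Standard n-step target: R_{t:t+n} + gamma**n * (1 - done) * Q(s_{t+n}).
    # `n_step_rewards` already holds the discounted n-step reward sum and
    # `n_step` is the per-sample n the buffer stored in `extra_model_outputs`;
    # all arguments broadcast elementwise (floats, numpy arrays, or tensors).
    return n_step_rewards + gamma ** n_step * (1.0 - terminateds) * q_next
```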
LGTM! Great hustle through making this PR work, @simonsays1980 !!
Just a few cleanup nits and 2-3 remaining smaller questions.
Co-authored-by: Sven Mika <sven@anyscale.io> Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
…ate_in' to 'next_state_in'. Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
…ode-replay-buffer
…ode-replay-buffer
…ePreLearner' in regard to 'n_step' and step counting. Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
…ode-replay-buffer
Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
…ode-replay-buffer
… in the CI tests. Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
…ay-project#48116) Signed-off-by: Roshan Kathawate <roshankathawate@gmail.com>
Why are these changes needed?
At the moment replay buffers do not allow sampling sequences (which could become helpful in case of stateful policies or bias reduction in value functions). This PR offers a solution that
- adds sequence sampling to the `EpisodeReplayBuffer`.
- uses `SingleAgentEpisode.slice` instead of constructing episodes from basic data structures.
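A rough usage sketch of the new behavior; argument names follow the buffer's existing `sample()` signature, and the exact values and defaults are assumptions:

```python
from ray.rllib.utils.replay_buffers.episode_replay_buffer import EpisodeReplayBuffer

buffer = EpisodeReplayBuffer(capacity=10_000)
# ... SingleAgentEpisode objects are added via `buffer.add(...)` ...

# Instead of B single transitions, request B sequences of T consecutive
# timesteps each; internally these are produced via `SingleAgentEpisode.slice`.
batch = buffer.sample(
    batch_size_B=16,
    batch_length_T=20,
)
```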
Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.