[RLlib] Remove vtrace_drop_last_ts option and add proper vf bootstrapping to IMPALA and APPO. #36013

Conversation
I have questions about why we need to concatenate values_time_major and bootstrap_values_time_major together, if we end up passing them separately to the vtrace function.
Also, I noticed that you changed the APPO learner but didn't make any modifications to the IMPALA learner. Was this intended? It seems like it may not have been, since you also modified the IMPALA policies.
        train_batch[SampleBatch.VALUES_BOOTSTRAPPED]
    )
    # Add values to bootstrap values to yield correct t=1 to T+1 trajectories,
    # with T being the rollout length (max trajectory len).
I'm finding this comment confusing. Can you annotate it with what the shapes of values_time_major and bootstrap_values_time_major are supposed to be?
I expanded the docstring of ray.rllib.evaluation.postprocessing.compute_bootstrap_value() and added the computation example there (how to add the two columns together to get the final vtrace-usable value estimates).
Then I linked from each of the 4 loss functions (IMPALA and APPO, each for torch and tf) to this new docstring.
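For reference, the combination of the two columns can be sketched in numpy roughly as follows. This is only a sketch: it assumes VF_PREDS holds V(s_1)..V(s_T) and VALUES_BOOTSTRAPPED is all zeros except for the bootstrapped V(s_{T+1}) in its last slot; the exact layout is what the new docstring spells out.

import numpy as np

vf_preds = np.array([0.1, 0.2, 0.3, 0.4])             # V(s_1) .. V(s_T), T=4
values_bootstrapped = np.array([0.0, 0.0, 0.0, 0.5])  # zeros except V(s_{T+1}) at the end

# Shift VF_PREDS one step to the left and add the bootstrap column to obtain
# the V(s_2) .. V(s_{T+1}) estimates that a vtrace-style loss consumes.
values_t2_to_tp1 = np.concatenate([vf_preds[1:], [0.0]]) + values_bootstrapped
print(values_t2_to_tp1)  # [0.2 0.3 0.4 0.5]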
rllib/evaluation/postprocessing.py (Outdated)
@DeveloperAPI
def compute_bootstrap_value(sample_batch, policy):
    """TODO (sven): docstr"""
Hmm, we need this docstring before we merge.
we should make it clear that we're modifying sample_batch in place.
sorry, forgot, will fix.
done
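For context, one possible shape for that docstring (a sketch, not necessarily the exact wording that was merged), making the in-place modification explicit:

def compute_bootstrap_value(sample_batch, policy):
    """Computes a bootstrapped value estimate at the end of a trajectory.

    If the trajectory ended at a true terminal, the bootstrap value is 0.0;
    otherwise, the policy's value function is run on the trajectory's final
    observation.

    Note: `sample_batch` is modified in place (a VALUES_BOOTSTRAPPED column
    is added); it is also returned for convenience.

    Args:
        sample_batch: The (single-trajectory) SampleBatch to process.
        policy: The Policy whose value function to use.

    Returns:
        The altered SampleBatch, now including the VALUES_BOOTSTRAPPED column.
    """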
@@ -312,7 +307,7 @@ def reduce_mean_valid(t):
        value_targets = make_time_major(
            train_batch[Postprocessing.VALUE_TARGETS]
        )
-       delta = values_time_major - value_targets
+       delta = values_time_major[:-1] - value_targets
Can we rename values_time_major to something like values_time_major_w_bootstrap_value?
    # Adding values and bootstrap_values yields the correct values+bootstrap
    # configuration, from which we can then take t=-1 (last timestep) to get
    # the bootstrap_value arg for the vtrace function below.
    values_time_major += bootstrap_values_time_major
I don't think we need to do this. Below, we end up passing in values_time_major[:-1] and values_time_major[-1] separately, which means that we never needed to expand values_time_major and could have instead directly passed in the bootstrap_value.
I do think we still need this, for the following reason:
In APPO and IMPALA, we build/preprocess the train batch along the rollout_fragment_length (or max_seq_len, if LSTM) boundaries. This means that in case a rollout ends within an episode, this rollout's last trajectory will end up in the train batch with a zero-padded right side; thus, the bootstrapped value for this fragment is in the middle of the train batch, NOT at time-axis index -1!
So adding these two together here covers that particular case as well. It leads to most bootstrapped values sitting at the end of the train batch rows, but some (in those cases where the rollout ended within an episode) will be located in the middle of the train batch's rows. Here, "row" means a trajectory (along the T-axis) within the (B, T, ...) train batch. See the sketch below.
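To make the mid-row case concrete, here is a toy numpy sketch. It assumes a (B=2, T=5) batch (shown batch-major for readability) in which each trajectory's bootstrap estimate occupies the slot right after its last real value prediction, with zero-padding elsewhere; the actual column/padding layout in RLlib may differ.

import numpy as np

vf_preds = np.array([
    [0.1, 0.2, 0.3, 0.4, 0.0],  # trajectory fills the row (bootstrap slot at the end)
    [0.1, 0.2, 0.0, 0.0, 0.0],  # rollout ended after 2 steps -> zero-padded right side
])
values_bootstrapped = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.6],  # bootstrap value lands at time index -1
    [0.0, 0.0, 0.7, 0.0, 0.0],  # bootstrap value lands in the MIDDLE of the row
])

combined = vf_preds + values_bootstrapped
print(combined)
# [[0.1 0.2 0.3 0.4 0.6]
#  [0.1 0.2 0.7 0.  0. ]]
# A slice like combined[:, -1] would only pick up row 0's bootstrap value;
# the elementwise add over the whole batch handles both rows at once.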
I'm struggling to understand how compute_bootstrap_value handles this situation where the rollout ends within an episode. I'm reading the code for it, and on the surface I don't see anything that searches in the middle of an episode for the reward at the terminated timestep. It looks like we're only checking the last timestep in the sample batch.
Note that compute_bootstrap_value is only ever called by a Policy's postprocess_trajectory() method, which, by design, is only called when a rollout has ended (either within or at the terminal of an episode).
If at a terminal: assume the value to be 0.0 (no value computation necessary).
If NOT at a terminal: use the Policy's vf to compute the value at the last timestep of the trajectory. This is the "bootstrap" value to be used in the losses (instead of dropping the last ts and using that ts as a "bootstrapped" value).
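A schematic numpy sketch of that branching (hypothetical dict keys and helper names, not RLlib's actual implementation; value_fn stands in for the Policy's value head):

import numpy as np

def add_bootstrap_value(batch: dict, value_fn) -> dict:
    # Column of zeros with the bootstrap estimate in the last slot only.
    bootstrapped = np.zeros_like(batch["vf_preds"])
    if batch["terminateds"][-1]:
        # Trajectory ended at a true terminal: nothing left to bootstrap.
        bootstrapped[-1] = 0.0
    else:
        # Rollout was cut mid-episode: bootstrap with the value of the
        # observation right after the last action.
        bootstrapped[-1] = value_fn(batch["new_obs"][-1])
    batch["values_bootstrapped"] = bootstrapped  # modifies the batch in place
    return batch

# Toy usage with a dummy value function:
batch = {
    "vf_preds": np.array([0.1, 0.2, 0.3]),
    "terminateds": np.array([False, False, False]),
    "new_obs": np.array([[0.0], [1.0], [2.0]]),
}
print(add_bootstrap_value(batch, value_fn=lambda obs: 0.5)["values_bootstrapped"])
# -> [0.  0.  0.5]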
Same comments as in appo_tf_learner apply here as well.
Same comments as in appo_tf_policy apply here as well.
rllib/algorithms/impala/impala.py (Outdated)
@@ -293,6 +283,9 @@ def training(
        Returns:
            This updated AlgorithmConfig object.
        """
+       if vtrace_drop_last_ts is not None:
+           deprecation_warning(old="vtrace_drop_last_ts", error=False)
I think we should error; otherwise, people will pass this kwarg and think that it's still having an effect. Even with a deprecation warning, it will be ambiguous.
Agreed, forced explicitness is better here!
fixed
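For reference, a minimal sketch of the hard-error variant settled on above. It assumes RLlib's deprecation_warning helper (which raises when called with error=True) and shows only the relevant fragment of the training() method's signature:

from ray.rllib.utils.deprecation import deprecation_warning

def training(self, *, vtrace_drop_last_ts=None, **kwargs):
    if vtrace_drop_last_ts is not None:
        # Raise right away so users cannot assume the removed setting
        # still has any effect.
        deprecation_warning(old="vtrace_drop_last_ts", error=True)
    ...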
-   values=make_time_major(values, drop_last=drop_last),
-   bootstrap_value=make_time_major(values)[-1],
    rewards=make_time_major(rewards),
+   values=values_time_major[:-1],
Why essentially append the bootstrap value to values_time_major if we end up passing in only the original values_time_major anyway? We don't use values_time_major anywhere else, AFAICT.
True, but see my explanation above for those cases where a rollout ends in the middle of an episode and the bootstrapped value will then "show up" in the middle of the train batch row, NOT at -1.
same comments as in impala_tf_policy.py
@@ -129,7 +129,7 @@ def test_traj_view_lstm_prev_actions_and_rewards(self):
        view_req_policy = policy.view_requirements
        # 7=obs, prev-a + r, 2x state-in, 2x state-out.
        assert len(view_req_model) == 7, view_req_model
-       assert len(view_req_policy) == 22, (len(view_req_policy), view_req_policy)
+       assert len(view_req_policy) == 23, (len(view_req_policy), view_req_policy)
Did this change by 1 because we added a new sample_batch key?
Yup, correct.
Small questions, mainly for my understanding. Otherwise LGTM.
    ),
-   actions=tf.unstack(
-       make_time_major(loss_actions, drop_last=drop_last), axis=2
+   unpacked_old_policy_behaviour_logits
Do you not need to keep the tf.unstack logic? I guess if the vtrace tests are passing, then no...
@@ -222,6 +222,8 @@ def __init__(self, observation_space, action_space, config):
            max_seq_len=config["model"]["max_seq_len"],
        )

+       ValueNetworkMixin.__init__(self, config)
Why did we add the ValueNetworkMixin here? What were we doing before?
For IMPALA, nothing. We never needed to compute any VF_PREDS, because we were doing this in the loss function, using a model.value() call.
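As a hypothetical illustration (not RLlib's actual ValueNetworkMixin) of the capability being added: the policy now needs a way to query V(s) outside the loss, because compute_bootstrap_value() runs during postprocess_trajectory():

import numpy as np
import torch
import torch.nn as nn

class ValueMixinSketch:
    """Gives a policy-like object a small V(s) helper usable in postprocessing."""

    def __init__(self, critic: nn.Module):
        self._critic = critic

    @torch.no_grad()
    def _value(self, obs: np.ndarray) -> float:
        # Inference-only value estimate for a single observation.
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        return float(self._critic(obs_t).squeeze(-1).item())

# Toy usage with a linear critic over a 4-dim observation:
policy_like = ValueMixinSketch(critic=nn.Linear(4, 1))
print(policy_like._value(np.zeros(4)))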
IMPALA and APPO both use the vtrace value bootstrapping method to better correct for the slight off-policyness of their async samplers.

So far, all of RLlib's implementations (tf/torch; Learner API stack/old API stack) performed a "trick" to get around having to compute an extra bootstrapped value at the very end of each sampled trajectory. PPO, on the other hand, performs this step correctly in its Policies' postprocess_trajectory() methods. For APPO and IMPALA, RLlib instead cut the last timestep in each row of the BxT-shaped train batch (on all data columns used in the loss, such as rewards, dones, etc.) and used that last timestep as the bootstrapped value. This is problematic whenever an important reward is located exactly on the dropped timestep: such a reward signal is simply ignored by the algo.

An option to NOT cut this last timestep was recently introduced (config.vtrace_drop_last_ts=False); however, the inaccuracy of not having a properly computed value function estimate V(t+1) at the end of a trajectory remained, because the core problem, this computation not happening in postprocess_trajectory(), was not solved.

This PR therefore changes the following:
- APPO and IMPALA now compute a proper bootstrap value at the end of each sampled trajectory (via compute_bootstrap_value() in postprocess_trajectory()) and use it in their vtrace losses.
- The vtrace_drop_last_ts option is deprecated for all frameworks and API stacks.

All of the above fixes the underlying mathematical inaccuracy and should yield better learning performance.
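As a toy illustration of the dropped-reward problem described above (using simple one-step TD targets rather than the full vtrace math; gamma, rewards, and value estimates are made up):

import numpy as np

gamma = 0.99
rewards = np.array([0.0, 0.0, 0.0, 10.0])   # the only reward sits on the last timestep
vf_preds = np.array([0.1, 0.2, 0.3, 0.4])   # V(s_0) .. V(s_3)
v_bootstrap = 0.5                            # V(s_4): value of the obs after the last action

# Old "trick": drop the last timestep and reuse its value prediction as the
# bootstrap. The 10.0 reward never enters the targets.
old_targets = rewards[:-1] + gamma * vf_preds[1:]
print(old_targets)   # [0.198 0.297 0.396]

# New approach: keep all T timesteps and bootstrap with a real V(s_4) estimate.
new_targets = rewards + gamma * np.append(vf_preds[1:], v_bootstrap)
print(new_targets)   # [ 0.198  0.297  0.396 10.495]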