
[RLlib] Remove vtrace_drop_last_ts option and add proper vf bootstrapping to IMPALA and APPO. #36013

Merged

Conversation

sven1977 (Contributor) commented Jun 2, 2023

IMPALA and APPO both use the vtrace value-bootstrapping method to better correct for the slight off-policyness introduced by their asynchronous samplers.

All of RLlib's implementations (tf/torch; Learner API stack/old API stack) perform a "trick" to get around computing an extra bootstrapped value at the very end of each sampled trajectory. PPO, on the other hand, performs this step correctly in its Policies' postprocess_trajectory() methods.
For APPO and IMPALA, RLlib instead cuts the last timestep in each row of the BxT-shaped train batch (on all data columns used in the loss, such as rewards, dones, etc.) and uses that last timestep as the bootstrapped value. This is problematic in cases where an important reward is located exactly on the dropped timestep: such a reward signal is simply ignored by the algorithm.
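(Illustration only, not RLlib's actual code: a minimal NumPy sketch of what this "drop the last timestep" trick amounts to on a toy [B, T]-shaped batch; all names and shapes here are made up for the example.)

import numpy as np

B, T = 2, 4  # 2 trajectories, rollout length 4 (toy sizes)
rewards = np.arange(B * T, dtype=np.float32).reshape(B, T)
vf_preds = np.ones((B, T), dtype=np.float32)  # value predictions V(s_t)

# Old trick: use the value prediction of the last timestep as the
# "bootstrap" value and drop that timestep from all loss inputs.
bootstrap_value = vf_preds[:, -1]   # stands in for V(s_{T+1})
rewards_used = rewards[:, :-1]      # the reward at t=T never enters the loss
vf_preds_used = vf_preds[:, :-1]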

An option to NOT cut this last timestep was recently introduced (config.vtrace_drop_last_ts=False). However, the inaccuracy of not having a properly computed value function V(t+1) at the end of a trajectory remains, because the core problem, namely that this computation does not happen in the postprocess_trajectory method, has not been solved.

This PR therefore changes:

  • The vtrace_drop_last_ts option is deprecated for all frameworks and API stacks.
  • Proper value function bootstrapping is performed at the end of each trajectory, using the next_obs as the obs input and the last internal state output as the internal state input (in the case of RNNs), and storing the results in a new SampleBatch column called VALUES_BOOTSTRAPPED. The bootstrapped values are only stored at the last t of each trajectory (all other t's are set to 0.0), indicating that these values are to be used at t+1 (bootstrapping).
  • In the APPO/IMPALA loss functions, we now properly utilize these bootstrapped values at the ends of the sampled (and possibly zero-padded) value tensors.

All the above steps fix the underlying mathematical inaccuracy and should yield better learning performance.
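As a rough sketch of the new flow (schematic only; the helper value_function() and the toy shapes are assumptions made for this example, not RLlib's exact API):

import numpy as np

T = 4                                    # rollout length (toy)
vf_preds = np.array([0.5, 0.6, 0.7, 0.8], dtype=np.float32)  # V(s_t) for t=1..T
terminated = False                       # trajectory was cut, not terminated

def value_function(obs, state=None):     # stand-in for the policy's value head
    return np.float32(0.9)

# New VALUES_BOOTSTRAPPED-style column: 0.0 everywhere, except at the last
# collected timestep, which holds the bootstrapped V(s_{T+1}).
values_bootstrapped = np.zeros(T, dtype=np.float32)
if not terminated:
    values_bootstrapped[-1] = value_function(obs=None)  # in practice: next_obs + last RNN state

# The losses can now form value estimates for t=1..T+1 without dropping data:
values_t1_to_Tp1 = np.concatenate([vf_preds, values_bootstrapped[-1:]])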

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>
sven1977 added 3 commits June 2, 2023 10:03
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 changed the title [WIP; RLlib] Remove vtrace_drop_last_ts option and add proper vf bootstrapping to IMPALA and APPO. [RLlib] Remove vtrace_drop_last_ts option and add proper vf bootstrapping to IMPALA and APPO. Jun 2, 2023
sven1977 added 6 commits June 2, 2023 15:27
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
avnishn (Member) left a comment:

I have questions about why we need to concatenate values_time_major and bootstrap_values_time_major if we end up passing them separately to the vtrace function.

Also, I noticed that you changed the APPO learner but didn't make any modifications to the IMPALA learner. Was this intended? It seems like it may not have been, since you also modified the IMPALA policies.

train_batch[SampleBatch.VALUES_BOOTSTRAPPED]
)
# Add values to bootstrap values to yield correct t=1 to T+1 trajectories,
# with T being the rollout length (max trajectory len).
avnishn (Member):

I'm finding this comment confusing. Can you annotate what the shapes of values_time_major and bootstrap_values_time_major are supposed to be?

sven1977 (Contributor, Author):

I expanded the docstring of ray.rllib.evaluation.postprocessing.compute_bootstrap_value() and added the computation example there (how to add the two columns together to get the final vtrace-usable value estimates).
Then I linked from each of the 4 loss functions (IMPALA+APPO vs torch+tf) to this new docstring.
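For reference, a hedged illustration of what that computation example boils down to in the simple, non-padded case, using time-major [T, B] toy arrays (a schematic only, not the literal docstring or RLlib's loss code):

import numpy as np

# Time-major toy data: T=3 timesteps, B=2 trajectories, neither terminated.
vf_preds_tm = np.array([[0.5, 1.5],
                        [0.6, 1.6],
                        [0.7, 1.7]], dtype=np.float32)             # V(s_t)
values_bootstrapped_tm = np.array([[0.0, 0.0],
                                   [0.0, 0.0],
                                   [0.9, 1.9]], dtype=np.float32)  # V(s_{T+1}) at last t only

# Combine the two columns into per-timestep values plus a bootstrap row,
# i.e. the value estimates a vtrace-style function needs for t=1..T+1:
values_w_bootstrap = np.concatenate([vf_preds_tm, values_bootstrapped_tm[-1:]], axis=0)
values_arg = values_w_bootstrap[:-1]          # shape [T, B]
bootstrap_value_arg = values_w_bootstrap[-1]  # shape [B]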


@DeveloperAPI
def compute_bootstrap_value(sample_batch, policy):
"""TODO (sven): docstr"""
avnishn (Member):

Hmm, we need this docstring before we merge.

avnishn (Member):

we should make it clear that we're modifying sample_batch in place.

sven1977 (Contributor, Author):

sorry, forgot, will fix.

sven1977 (Contributor, Author):

done

@@ -312,7 +307,7 @@ def reduce_mean_valid(t):
value_targets = make_time_major(
train_batch[Postprocessing.VALUE_TARGETS]
)
- delta = values_time_major - value_targets
+ delta = values_time_major[:-1] - value_targets
avnishn (Member):

can we rename values_time_major to something like values_time_major_w_bootstrap_value?

# Adding values and bootstrap_values yields the correct values+bootstrap
# configuration, from which we can then take t=-1 (last timestep) to get
# the bootstrap_value arg for the vtrace function below.
values_time_major += bootstrap_values_time_major
avnishn (Member):

I don't think we need to do this. Below, we end up passing in values_time_major[:-1] and values_time_major[-1] separately, which means that we never needed to expand values_time_major and could have instead directly passed in the bootstrap_value.

sven1977 (Contributor, Author):

I do think we still need this. For the following reason:
In APPO and IMPALA, we build/preprocess the train batch along the rollout_fragment_len (or max_seq_len if LSTM) boundaries. This means that in case a rollout ends within an episode, this rollout's last trajectory will end up in the train batch with a zero-padded right side, thus, the bootstrapped value for this fragment is in the middle of the train batch, NOT at time-axis index -1!
So adding these two together here covers that particular case as well. It'll lead to most bootstrapped values being at the end of the train batch rows, but some (in those cases where the rollout ended within an episode) will be located in the middle of the train batch's rows. Here, "row" means a trajectory (along the T-axis) within the (B, T, ...) train batch.
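(A small, hedged illustration of the layout being described, with made-up numbers; note how the non-zero bootstrap entry of the short, zero-padded row sits in the middle of the row rather than at index -1.)

import numpy as np

# Two rows of a (B=2, T=4) train batch. Row 0 spans a full rollout fragment;
# row 1 comes from a trajectory that was cut mid-episode after 2 steps and is
# zero-padded on the right.
vf_preds = np.array([[0.5, 0.6, 0.7, 0.8],
                     [0.3, 0.4, 0.0, 0.0]], dtype=np.float32)
values_bootstrapped = np.array([[0.0, 0.0, 0.0, 0.9],   # bootstrap at the last t
                                [0.0, 1.1, 0.0, 0.0]],  # bootstrap mid-row!
                               dtype=np.float32)

# Simply taking values_bootstrapped[:, -1] would miss row 1's bootstrap value;
# combining the two columns (as discussed above) keeps it at the right t.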

avnishn (Member):

I'm struggling to understand how compute_bootstrap_value handles this situation where the rollout ends within an episode. I'm reading the code for it, and on the surface I don't see anything that searches in the middle of an episode for the reward at the terminated timestep. It looks like we're only checking the last timestep in the sample batch.

sven1977 (Contributor, Author):

Note that compute_bootstrap_value is only ever called by a Policy's postprocess_trajectory method, which - by design - is only called when a rollout has ended (either within or at the terminal of an episode).
If at a terminal: Assume the value to be 0.0 (no value computation necessary)
If NOT at a terminal: Use the Policy's vf to compute the value at the last timestep of the trajectory. This is the "bootstrap" value to be used in the losses (instead of dropping the last ts and using that ts as a "bootstrapped" value).
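(A hedged sketch of that logic, assuming dict-like trajectory batches with "terminateds", "new_obs", and "vf_preds" columns and a stand-in value_fn; an illustration of the idea, not RLlib's actual implementation.)

import numpy as np

def compute_bootstrap_value_sketch(sample_batch, value_fn):
    # Only ever called once the rollout of this trajectory has ended.
    # Modifies `sample_batch` in place.
    if sample_batch["terminateds"][-1]:
        # Ended at a terminal state -> no future return to bootstrap.
        last_r = 0.0
    else:
        # Cut mid-episode -> estimate V(s_{T+1}) from the last next-observation
        # (and the last RNN state-out, if the model is recurrent).
        last_r = value_fn(sample_batch["new_obs"][-1])

    # Store the bootstrap value only at the trajectory's last timestep;
    # all other timesteps get 0.0.
    vb = np.zeros(len(sample_batch["vf_preds"]), dtype=np.float32)
    vb[-1] = last_r
    sample_batch["values_bootstrapped"] = vb
    return sample_batch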

avnishn (Member):

Same comments as in appo_tf_learner apply here.

avnishn (Member):

Same comments as in appo_tf_policy apply here.

@@ -293,6 +283,9 @@ def training(
Returns:
This updated AlgorithmConfig object.
"""
if vtrace_drop_last_ts is not None:
    deprecation_warning(old="vtrace_drop_last_ts", error=False)
avnishn (Member):

I think we should error out; otherwise people will pass this kwarg and think that it's still having an effect. Even with a deprecation warning, it will be ambiguous.

sven1977 (Contributor, Author):

Agreed, forced explicitness is better here!

sven1977 (Contributor, Author):

fixed

values=make_time_major(values, drop_last=drop_last),
bootstrap_value=make_time_major(values)[-1],
rewards=make_time_major(rewards),
values=values_time_major[:-1],
avnishn (Member):

Why essentially append the bootstrap value to values_time_major if we're just going to end up only passing in the original values_time_major? We don't use values_time_major anywhere else, afaict.

sven1977 (Contributor, Author):

True, but see my explanation above for those cases where a rollout ends in the middle of an episode and the bootstrapped value will then "show up" in the middle of the train batch row, NOT at -1.

avnishn (Member):

same comments as in impala_tf_policy.py

@@ -129,7 +129,7 @@ def test_traj_view_lstm_prev_actions_and_rewards(self):
view_req_policy = policy.view_requirements
# 7=obs, prev-a + r, 2x state-in, 2x state-out.
assert len(view_req_model) == 7, view_req_model
- assert len(view_req_policy) == 22, (len(view_req_policy), view_req_policy)
+ assert len(view_req_policy) == 23, (len(view_req_policy), view_req_policy)
avnishn (Member):

Did this change by 1 because we added a new sample_batch key?

sven1977 (Contributor, Author):

Yup, correct.

sven1977 added 10 commits June 9, 2023 12:54
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
sven1977 added 5 commits June 21, 2023 14:46
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
avnishn (Member) left a comment:

Small questions, mainly for my understanding. Otherwise LGTM.

),
actions=tf.unstack(
make_time_major(loss_actions, drop_last=drop_last), axis=2
unpacked_old_policy_behaviour_logits
avnishn (Member):

do you not need to keep the tf unstack logic? I guess if vtrace tests are passing, then no...

@@ -222,6 +222,8 @@ def __init__(self, observation_space, action_space, config):
max_seq_len=config["model"]["max_seq_len"],
)

ValueNetworkMixin.__init__(self, config)
avnishn (Member):

why did we add the value network mixin here? what were we doing before?

sven1977 (Contributor, Author):

For IMPALA, nothing: we never needed to compute any VF_PREDS during sampling, because we computed the values inside the loss function via a model.value() call.

rllib/evaluation/postprocessing.py (resolved)
sven1977 added 4 commits June 21, 2023 16:58
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 merged commit e14c9b1 into ray-project:master Jun 22, 2023
kouroshHakha pushed a commit to kouroshHakha/ray that referenced this pull request Jun 30, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…apping to IMPALA and APPO. (ray-project#36013)

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>