[RLlib] Remove vtrace_drop_last_ts option and add proper vf bootstrapping to IMPALA and APPO. #36013

Conversation
I have questions about why we need to concatenate values_time_major and bootstrap_values_time_major together, if we end up passing them separately to the vtrace function.
Also, I noticed that you changed the APPO learner but didn't make any modifications to the IMPALA learner. Was this intended? It seems like it may not have been, since you also modified the IMPALA policies.
        train_batch[SampleBatch.VALUES_BOOTSTRAPPED]
    )
    # Add values to bootstrap values to yield correct t=1 to T+1 trajectories,
    # with T being the rollout length (max trajectory len).
I'm finding this comment confusing. Can you annotate it with what the shapes of values_time_major and bootstrap_values_time_major are supposed to be?
I expanded the docstring of ray.rllib.evaluation.postprocessing.compute_bootstrap_value() and added the computation example there (how to add the two columns together to get the final vtrace-usable value estimates).
Then I linked from each of the 4 loss functions (IMPALA and APPO, each for torch and tf) to this new docstring.
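For reference, the combination of the two columns can be sketched in numpy roughly as follows. This is only a sketch: it assumes VF_PREDS holds V(s_1)..V(s_T) and VALUES_BOOTSTRAPPED is all zeros except for the bootstrapped V(s_{T+1}) in its last slot; the exact layout is what the new docstring spells out.

import numpy as np

vf_preds = np.array([0.1, 0.2, 0.3, 0.4])             # V(s_1) .. V(s_T), T=4
values_bootstrapped = np.array([0.0, 0.0, 0.0, 0.5])  # zeros except V(s_{T+1}) at the end

# Shift VF_PREDS one step to the left and add the bootstrap column to obtain
# the V(s_2) .. V(s_{T+1}) estimates that a vtrace-style loss consumes.
values_t2_to_tp1 = np.concatenate([vf_preds[1:], [0.0]]) + values_bootstrapped
print(values_t2_to_tp1)  # [0.2 0.3 0.4 0.5]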
rllib/evaluation/postprocessing.py (Outdated)
@DeveloperAPI
def compute_bootstrap_value(sample_batch, policy):
    """TODO (sven): docstr"""
Hmm, we need this docstring before we merge.
we should make it clear that we're modifying sample_batch in place.
sorry, forgot, will fix.
done
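For context, one possible shape for that docstring (a sketch, not necessarily the exact wording that was merged), making the in-place modification explicit:

def compute_bootstrap_value(sample_batch, policy):
    """Computes a bootstrapped value estimate at the end of a trajectory.

    If the trajectory ended at a true terminal, the bootstrap value is 0.0;
    otherwise, the policy's value function is run on the trajectory's final
    observation.

    Note: `sample_batch` is modified in place (a VALUES_BOOTSTRAPPED column
    is added); it is also returned for convenience.

    Args:
        sample_batch: The (single-trajectory) SampleBatch to process.
        policy: The Policy whose value function to use.

    Returns:
        The altered SampleBatch, now including the VALUES_BOOTSTRAPPED column.
    """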
@@ -312,7 +307,7 @@ def reduce_mean_valid(t):
        value_targets = make_time_major(
            train_batch[Postprocessing.VALUE_TARGETS]
        )
-       delta = values_time_major - value_targets
+       delta = values_time_major[:-1] - value_targets
Can we rename values_time_major to something like values_time_major_w_bootstrap_value?
    # Adding values and bootstrap_values yields the correct values+bootstrap
    # configuration, from which we can then take t=-1 (last timestep) to get
    # the bootstrap_value arg for the vtrace function below.
    values_time_major += bootstrap_values_time_major
I don't think we need to do this. Below, we end up passing in values_time_major[:-1] and values_time_major[-1] separately, which means that we never needed to expand values_time_major and could have instead directly passed in the bootstrap_value.
I do think we still need this, for the following reason:
In APPO and IMPALA, we build/preprocess the train batch along the rollout_fragment_length (or max_seq_len, if LSTM) boundaries. This means that in case a rollout ends within an episode, this rollout's last trajectory will end up in the train batch with a zero-padded right side; thus, the bootstrapped value for this fragment is in the middle of the train batch, NOT at time-axis index -1!
So adding these two together here covers that particular case as well. It leads to most bootstrapped values sitting at the end of the train batch rows, but some (in those cases where the rollout ended within an episode) will be located in the middle of the train batch's rows. Here, "row" means a trajectory (along the T-axis) within the (B, T, ...) train batch. See the sketch below.
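To make the mid-row case concrete, here is a toy numpy sketch. It assumes a (B=2, T=5) batch (shown batch-major for readability) in which each trajectory's bootstrap estimate occupies the slot right after its last real value prediction, with zero-padding elsewhere; the actual column/padding layout in RLlib may differ.

import numpy as np

vf_preds = np.array([
    [0.1, 0.2, 0.3, 0.4, 0.0],  # trajectory fills the row (bootstrap slot at the end)
    [0.1, 0.2, 0.0, 0.0, 0.0],  # rollout ended after 2 steps -> zero-padded right side
])
values_bootstrapped = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.6],  # bootstrap value lands at time index -1
    [0.0, 0.0, 0.7, 0.0, 0.0],  # bootstrap value lands in the MIDDLE of the row
])

combined = vf_preds + values_bootstrapped
print(combined)
# [[0.1 0.2 0.3 0.4 0.6]
#  [0.1 0.2 0.7 0.  0. ]]
# A slice like combined[:, -1] would only pick up row 0's bootstrap value;
# the elementwise add over the whole batch handles both rows at once.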
I'm struggling to understand how compute_bootstrap_value handles this situation where the rollout ends within an episode. I'm reading the code for it, and on the surface I don't see anything that searches in the middle of an episode for the reward at the terminated timestep. It looks like we're only checking the last timestep in the sample batch.
Note that compute_bootstrap_value is only ever called by a Policy's postprocess_trajectory() method, which, by design, is only called when a rollout has ended (either within or at the terminal of an episode).
If at a terminal: assume the value to be 0.0 (no value computation necessary).
If NOT at a terminal: use the Policy's vf to compute the value at the last timestep of the trajectory. This is the "bootstrap" value to be used in the losses (instead of dropping the last ts and using that ts as a "bootstrapped" value).
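A schematic numpy sketch of that branching (hypothetical dict keys and helper names, not RLlib's actual implementation; value_fn stands in for the Policy's value head):

import numpy as np

def add_bootstrap_value(batch: dict, value_fn) -> dict:
    # Column of zeros with the bootstrap estimate in the last slot only.
    bootstrapped = np.zeros_like(batch["vf_preds"])
    if batch["terminateds"][-1]:
        # Trajectory ended at a true terminal: nothing left to bootstrap.
        bootstrapped[-1] = 0.0
    else:
        # Rollout was cut mid-episode: bootstrap with the value of the
        # observation right after the last action.
        bootstrapped[-1] = value_fn(batch["new_obs"][-1])
    batch["values_bootstrapped"] = bootstrapped  # modifies the batch in place
    return batch

# Toy usage with a dummy value function:
batch = {
    "vf_preds": np.array([0.1, 0.2, 0.3]),
    "terminateds": np.array([False, False, False]),
    "new_obs": np.array([[0.0], [1.0], [2.0]]),
}
print(add_bootstrap_value(batch, value_fn=lambda obs: 0.5)["values_bootstrapped"])
# -> [0.  0.  0.5]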
Same comments as in appo_tf_learner apply here as well.
Same comments as in appo_tf_policy apply here as well.
rllib/algorithms/impala/impala.py (Outdated)
@@ -293,6 +283,9 @@ def training(
        Returns:
            This updated AlgorithmConfig object.
        """
+       if vtrace_drop_last_ts is not None:
+           deprecation_warning(old="vtrace_drop_last_ts", error=False)
I think we should error; otherwise, people will pass this kwarg and think that it's still having an effect. Even with a deprecation warning, it will be ambiguous.
Agreed, forced explicitness is better here!
fixed
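For reference, a minimal sketch of the hard-error variant settled on above. It assumes RLlib's deprecation_warning helper (which raises when called with error=True) and shows only the relevant fragment of the training() method's signature:

from ray.rllib.utils.deprecation import deprecation_warning

def training(self, *, vtrace_drop_last_ts=None, **kwargs):
    if vtrace_drop_last_ts is not None:
        # Raise right away so users cannot assume the removed setting
        # still has any effect.
        deprecation_warning(old="vtrace_drop_last_ts", error=True)
    ...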
-   values=make_time_major(values, drop_last=drop_last),
-   bootstrap_value=make_time_major(values)[-1],
    rewards=make_time_major(rewards),
+   values=values_time_major[:-1],
Why essentially append the bootstrap value to values_time_major if we end up passing in only the original values_time_major anyway? We don't use values_time_major anywhere else, AFAICT.
True, but see my explanation above for those cases where a rollout ends in the middle of an episode and the bootstrapped value will then "show up" in the middle of the train batch row, NOT at -1.
same comments as in impala_tf_policy.py
@@ -129,7 +129,7 @@ def test_traj_view_lstm_prev_actions_and_rewards(self):
        view_req_policy = policy.view_requirements
        # 7=obs, prev-a + r, 2x state-in, 2x state-out.
        assert len(view_req_model) == 7, view_req_model
-       assert len(view_req_policy) == 22, (len(view_req_policy), view_req_policy)
+       assert len(view_req_policy) == 23, (len(view_req_policy), view_req_policy)
Did this change by 1 because we added a new sample_batch key?
Yup, correct.
Small questions, mainly for my understanding. Otherwise LGTM.
    ),
-   actions=tf.unstack(
-       make_time_major(loss_actions, drop_last=drop_last), axis=2
+   unpacked_old_policy_behaviour_logits
Do you not need to keep the tf.unstack logic? I guess if the vtrace tests are passing, then no...
@@ -222,6 +222,8 @@ def __init__(self, observation_space, action_space, config):
            max_seq_len=config["model"]["max_seq_len"],
        )

+       ValueNetworkMixin.__init__(self, config)
Why did we add the ValueNetworkMixin here? What were we doing before?
For IMPALA, nothing. We never needed to compute any VF_PREDS, because we were doing this in the loss function, using a model.value() call.
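As a hypothetical illustration (not RLlib's actual ValueNetworkMixin) of the capability being added: the policy now needs a way to query V(s) outside the loss, because compute_bootstrap_value() runs during postprocess_trajectory():

import numpy as np
import torch
import torch.nn as nn

class ValueMixinSketch:
    """Gives a policy-like object a small V(s) helper usable in postprocessing."""

    def __init__(self, critic: nn.Module):
        self._critic = critic

    @torch.no_grad()
    def _value(self, obs: np.ndarray) -> float:
        # Inference-only value estimate for a single observation.
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        return float(self._critic(obs_t).squeeze(-1).item())

# Toy usage with a linear critic over a 4-dim observation:
policy_like = ValueMixinSketch(critic=nn.Linear(4, 1))
print(policy_like._value(np.zeros(4)))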
IMPALA and APPO both use the vtrace value bootstrapping method to better correct for the slight off-policyness of their async samplers.

So far, all of RLlib's implementations (tf/torch; Learner API stack/old API stack) performed a "trick" to get around having to compute an extra bootstrapped value at the very end of each sampled trajectory. PPO, on the other hand, performs this step correctly in its Policies' postprocess_trajectory() methods. For APPO and IMPALA, RLlib instead cut the last timestep in each row of the BxT-shaped train batch (on all data columns used in the loss, such as rewards, dones, etc.) and used that last timestep as the bootstrapped value. This is problematic whenever an important reward is located exactly on the dropped timestep: such a reward signal is simply ignored by the algo.

An option to NOT cut this last timestep was recently introduced (config.vtrace_drop_last_ts=False); however, the inaccuracy of not having a properly computed value function estimate V(t+1) at the end of a trajectory remained, because the core problem, this computation not happening in postprocess_trajectory(), was not solved.

This PR therefore changes the following:
- APPO and IMPALA now compute a proper bootstrap value at the end of each sampled trajectory (via compute_bootstrap_value() in postprocess_trajectory()) and use it in their vtrace losses.
- The vtrace_drop_last_ts option is deprecated for all frameworks and API stacks.

All of the above fixes the underlying mathematical inaccuracy and should yield better learning performance.
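As a toy illustration of the dropped-reward problem described above (using simple one-step TD targets rather than the full vtrace math; gamma, rewards, and value estimates are made up):

import numpy as np

gamma = 0.99
rewards = np.array([0.0, 0.0, 0.0, 10.0])   # the only reward sits on the last timestep
vf_preds = np.array([0.1, 0.2, 0.3, 0.4])   # V(s_0) .. V(s_3)
v_bootstrap = 0.5                            # V(s_4): value of the obs after the last action

# Old "trick": drop the last timestep and reuse its value prediction as the
# bootstrap. The 10.0 reward never enters the targets.
old_targets = rewards[:-1] + gamma * vf_preds[1:]
print(old_targets)   # [0.198 0.297 0.396]

# New approach: keep all T timesteps and bootstrap with a real V(s_4) estimate.
new_targets = rewards + gamma * np.append(vf_preds[1:], v_bootstrap)
print(new_targets)   # [ 0.198  0.297  0.396 10.495]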