[Bug] Monitor not compatible with gym.wrappers.TimeLimit #477
Comments
Hello,
why would you do that and not the other way around? (time limit first and monitor afterward)
I did it this way around because […]. Turns out, wrapping it first into `TimeLimit` and then into `Monitor` works.
Yes, in fact, the time limit is normally specified with the env definition; see how it is done in OpenAI Gym with the […]
You can pass a callable to […]
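The two truncated references above are, at a guess, Gym's `register(..., max_episode_steps=...)` and SB3's `make_vec_env`; a minimal sketch under that assumption (the env id and step counts are made up):

```python
import gym
from gym.envs.registration import register
from gym.wrappers import TimeLimit
from stable_baselines3.common.env_util import make_vec_env

# Guess 1: bake the limit into the env definition at registration time,
# the way gym itself does for its built-in envs.
register(
    id="MyPendulum-v0",  # hypothetical id
    entry_point="gym.envs.classic_control:PendulumEnv",
    max_episode_steps=100,
)

# Guess 2: pass a callable that builds the env with the desired limit;
# make_vec_env then adds its own Monitor *outside* the TimeLimit.
def make_env():
    return TimeLimit(gym.make("Pendulum-v0").unwrapped, max_episode_steps=100)

vec_env = make_vec_env(make_env, n_envs=1)
```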
Thank you, but that way I cannot easily change the number of steps per episode.
I see, I think I'll do that for now.
Important Note: We do not do technical support, nor consulting and don't answer personal questions per email.
Please post your question on the RL Discord, Reddit or Stack Overflow in that case.
If your issue is related to a custom gym environment, please use the custom gym env template.
🐛 Bug
When using the `Monitor` class from `stable_baselines3.common.monitor` and wrapping the environment again into `gym.wrappers.TimeLimit`, the `done` signal is not respected. What I mean by that is: when I create an environment, wrap it into a `Monitor`, and afterwards into a `gym.wrappers.TimeLimit` to limit the time steps, `evaluate_policy` never returns. Instead, it runs the environment for the limited number of steps, then resets it, and finally starts all over again.
As far as I can tell, it happens as follows:
Once the maximum number of steps for `TimeLimit` is surpassed, it writes `not done` into the `info` dict: `info['TimeLimit.truncated'] = not done`. Note, `done` should always be `False` in L19 of the linked `time_limit.py`, otherwise the environment would have exited before, making the value in the dict always `True`. However, it doesn't really matter for us what's inside the dict. Afterwards it sets `done` to `True`. Then `evaluate_policy` checks if the environment is `done`, which it is. Next, it checks if the environment is wrapped in a `Monitor`; again, that's true for us. Now the problem is, due to a workaround for Atari, the key `episode` has to be present in `info`. However, it is not; instead, the key `TimeLimit.truncated` is. As the key is not present, `evaluate_policy` skips this `done` signal. Thus, we finish the loop and, due to the `TimeLimit`, we reset the environment and start over. See:
https://github.com/openai/gym/blob/0.18.3/gym/wrappers/time_limit.py#L18-L20
and
`stable_baselines3/common/evaluation.py`, lines 100 to 105 at commit b52c6fc
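To make the interaction concrete, here is a small runnable sketch (the env choice is an assumption) that surfaces the truncation flag without the `episode` key:

```python
import gym
from gym.wrappers import TimeLimit
from stable_baselines3.common.monitor import Monitor

# Monitor first, TimeLimit outside: the wrapping order from the report above.
env = TimeLimit(Monitor(gym.make("Pendulum-v0")), max_episode_steps=5)

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())

# The outer TimeLimit set done=True and wrote the truncation flag, but the
# Monitor (inside it) never saw done=True, so no "episode" entry was recorded.
print(info)  # contains 'TimeLimit.truncated': True, but no 'episode' key
```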
To Reproduce
To reproduce the issue, run code like the snippet below. In this case it will not exit but instead stays in a loop, see above. To successfully run the code, remove the `env = Monitor(env)` line.
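A minimal reconstruction of such a repro (the concrete env and algorithm are my guesses, not the original snippet):

```python
import gym
from gym.wrappers import TimeLimit
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

env = gym.make("Pendulum-v0")
env = Monitor(env)                           # remove this line and it terminates
env = TimeLimit(env, max_episode_steps=100)  # TimeLimit wrapped around Monitor

model = PPO("MlpPolicy", env)
# Never returns: every done from the outer TimeLimit is skipped because
# info carries 'TimeLimit.truncated' instead of the expected 'episode' key.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
```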
Expected behavior
Wrapping the monitored environment into a `TimeLimit` should exit after the maximum number of steps defined in said `TimeLimit`. One solution would be for the dictionary check to also allow `TimeLimit.truncated` as a valid key (sketched below).
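A sketch of that idea as a helper predicate (hypothetical name, not SB3 API; the real change would live inside `evaluate_policy`'s loop):

```python
from typing import Any, Dict

def is_true_episode_end(info: Dict[str, Any]) -> bool:
    """Hypothetical helper for the check in evaluate_policy: accept either a
    Monitor-recorded episode end or an outer TimeLimit truncation."""
    return "episode" in info or info.get("TimeLimit.truncated", False)
```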
System Info
Describe the characteristics of your environment:
- Installed via `pip3 install stable-baselines3[extra]==1.1.0a11`
- Python: 3.8.5
- PyTorch: 1.8.1+cu102
- Gym: 0.18.0
Additional context
Add any other context about the problem here.