
[Bug] Monitor not compatible with gym.wrappers.TimeLimit #477

Closed
3 tasks done
juhannc opened this issue Jun 15, 2021 · 4 comments
Labels
bug (Something isn't working), openai gym (related to OpenAI Gym interface)

Comments


juhannc commented Jun 15, 2021

Important Note: We do not do technical support, nor consulting and don't answer personal questions per email.
Please post your question on the RL Discord, Reddit or Stack Overflow in that case.

If your issue is related to a custom gym environment, please use the custom gym env template.

🐛 Bug

When using the Monitor class from stable_baselines3.common.monitor and wrapping the environment again into gym.wrappers.TimeLimit, the done signal is not respected.

What I mean by that is: when I create an environment, wrap it into a Monitor, and afterwards into a gym.wrappers.TimeLimit to limit the number of time steps, evaluate_policy never returns.
Instead, it runs the environment for the limited number of steps, resets it, and starts all over again.

As far as I can tell, it happens as follows:

1. Once the maximum number of steps for TimeLimit is exceeded, the wrapper writes not done into the info dict: info['TimeLimit.truncated'] = not done. Note that done should always be False at that point (line 19 of the linked time_limit.py), otherwise the environment would already have exited, so the value in the dict is always True. What exactly is inside the dict does not really matter for us, though.
2. Afterwards the wrapper sets done to True. evaluate_policy then checks whether the environment is done, which it is, and next whether the environment is Monitor-wrapped, which is also true in our case.
3. Now the problem: because of a workaround for Atari (where a "done" can be sent when the agent loses a life), the key episode has to be present in info. It is not; instead the key TimeLimit.truncated is. Since episode is missing, evaluate_policy skips this done signal. The loop therefore continues, the TimeLimit resets the environment, and everything starts over.

See:

https://github.com/openai/gym/blob/0.18.3/gym/wrappers/time_limit.py#L18-L20
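For context, the step method at that link behaves roughly like the sketch below (paraphrased from gym 0.18.x, not a verbatim copy of the source):

# Rough sketch of gym.wrappers.TimeLimit.step (gym 0.18.x, paraphrased)
def step(self, action):
    observation, reward, done, info = self.env.step(action)
    self._elapsed_steps += 1
    if self._elapsed_steps >= self._max_episode_steps:
        # done is normally still False here, so the flag ends up True
        info["TimeLimit.truncated"] = not done
        done = True
    return observation, reward, done, info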

and the corresponding check in evaluate_policy:

if dones[i]:
    if is_monitor_wrapped:
        # Atari wrapper can send a "done" signal when
        # the agent loses a life, but it does not correspond
        # to the true end of episode
        if "episode" in info.keys():

To Reproduce

To reproduce the issue, run the code below.
It will not exit but instead stay in a loop, as described above.
To run the code successfully, remove the line env = Monitor(env).

#!/usr/bin/env python3

import gym

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

env = gym.make('MountainCar-v0')
env = Monitor(env)
env = gym.wrappers.TimeLimit(env, max_episode_steps=10)

model = PPO(policy='MlpPolicy', env=env, verbose=1)

print("Start evaluating...")
mean_reward, std_reward = evaluate_policy(model=model,
                                          env=env,
                                          n_eval_episodes=1)
print("Evaluation ended")

model.learn(total_timesteps=100)

Expected behavior

Wrapping the monitored environment into a TimeLimit should end each episode after the maximum number of steps defined in said TimeLimit, so evaluate_policy should return.

One solution would be for the dictionary check to also accept TimeLimit.truncated as a valid key.
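A minimal sketch of the idea, based on the evaluate_policy check quoted above (illustrative only; a real patch would also need to handle the reward bookkeeping when the "episode" entry is absent):

if dones[i]:
    if is_monitor_wrapped:
        # Also treat a TimeLimit truncation as a true episode end
        if "episode" in info.keys() or "TimeLimit.truncated" in info.keys():
            ...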

System Info

Describe the characteristic of your environment:

  • Describe how the library was installed (pip, docker, source, ...): pip3 install stable-baselines3[extra]==1.1.0a11
  • GPU models and configuration: NVIDIA RTX 8000
  • Python version: 3.8.5
  • PyTorch version: 1.8.1+cu102
  • Gym version: 0.18.0
  • Versions of any other relevant libraries: N/A


Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)
@juhannc juhannc added the bug Something isn't working label Jun 15, 2021

araffin commented Jun 15, 2021

Hello,

When using the Monitor class from stable_baselines3.common.monitor and wrapping the environment again into gym.wrappers.TimeLimit

why would you do that and not the other way around? (time limit first and monitor afterward)

@araffin araffin added the openai gym related to OpenAI Gym interface label Jun 15, 2021

juhannc commented Jun 15, 2021

Hello,

When using the Monitor class from stable_baselines3.common.monitor and wrapping the environment again into gym.wrappers.TimeLimit

why would you do that and not the other way around? (time limit first and monitor afterward)

I did it this way around because make_vec_env does it as well.
I first noticed this bug when working with vectorized environments, but tried to simplify the example as much as possible.

Turns out, wrapping it first into TimeLimit and then into Monitor works (see the sketch below).
Maybe this ordering should be adopted for make_vec_env as well?
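For reference, this is essentially the reproduction script from above with the order of the two wrappers swapped:

env = gym.make('MountainCar-v0')
env = gym.wrappers.TimeLimit(env, max_episode_steps=10)
env = Monitor(env)  # Monitor is now the outer wrapper and sees the truncation as an episode end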


araffin commented Jun 16, 2021

Turns out, wrapping it first into TimeLimit and then into Monitor works.

Yes, in fact, the time limit is normally specified with the env definition; see how it is done in OpenAI Gym with the max_episode_steps parameter: https://github.com/openai/gym/blob/master/gym/envs/__init__.py#L56
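For illustration, the registration entries in that file look roughly like the following (MountainCar-v0 uses a 200-step limit and a -110 reward threshold, as far as I know):

from gym.envs.registration import register

register(
    id='MountainCar-v0',
    entry_point='gym.envs.classic_control:MountainCarEnv',
    max_episode_steps=200,  # gym.make wraps the env in TimeLimit with this value
    reward_threshold=-110.0,
)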

Maybe this strategy should be adapted for make_vec_env then as well?

You can pass a callable to make_vec_env (cf. the doc), so it should already be possible to wrap your env first if needed:

env = env_id(**env_kwargs)
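A minimal sketch of that pattern (the name make_limited_env is just a placeholder for this example):

import gym

from stable_baselines3.common.env_util import make_vec_env

def make_limited_env():
    env = gym.make('MountainCar-v0')
    return gym.wrappers.TimeLimit(env, max_episode_steps=10)

# make_vec_env calls the callable and applies Monitor on top of the returned env
vec_env = make_vec_env(make_limited_env, n_envs=4)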


juhannc commented Jun 17, 2021

Yes, in fact, the time limit is normally specified with the env definition; see how it is done in OpenAI Gym with the max_episode_steps parameter: https://github.com/openai/gym/blob/master/gym/envs/__init__.py#L56

Thank you, but that way I cannot easily change the number of steps per episode.
That's why I wanted to use the TimeLimit wrapper manually instead.

You can pass a callable to make_vec_env (cf. the doc), so it should already be possible to wrap your env first if needed:

env = env_id(**env_kwargs)

I see, I think I'll do that for now.
Thanks again!
