[Bug] Monitor not compatible with gym.wrappers.TimeLimit #477
Comments
Hello,
why would you do that and not the other way around? (time limit first and monitor afterward)
I did it this way around because […]. Turns out, wrapping it first into `TimeLimit` and then into `Monitor` works.
Yes, in fact, the time limit is normally specified with the env definition; see how it is done in OpenAI Gym with the […]
You can pass a callable to […]
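The two truncated references above are, at a guess, Gym's `register(..., max_episode_steps=...)` and SB3's `make_vec_env`; a minimal sketch under that assumption (the env id and step counts are made up):

```python
import gym
from gym.envs.registration import register
from gym.wrappers import TimeLimit
from stable_baselines3.common.env_util import make_vec_env

# Guess 1: bake the limit into the env definition at registration time,
# the way gym itself does for its built-in envs.
register(
    id="MyPendulum-v0",  # hypothetical id
    entry_point="gym.envs.classic_control:PendulumEnv",
    max_episode_steps=100,
)

# Guess 2: pass a callable that builds the env with the desired limit;
# make_vec_env then adds its own Monitor *outside* the TimeLimit.
def make_env():
    return TimeLimit(gym.make("Pendulum-v0").unwrapped, max_episode_steps=100)

vec_env = make_vec_env(make_env, n_envs=1)
```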
Thank you, but that way I cannot easily change the number of steps per episode.
I see, I think I'll do that for now.
Important Note: We do not do technical support, nor consulting and don't answer personal questions per email.
Please post your question on the RL Discord, Reddit or Stack Overflow in that case.
If your issue is related to a custom gym environment, please use the custom gym env template.
🐛 Bug
When using the `Monitor` class from `stable_baselines3.common.monitor` and wrapping the environment again into `gym.wrappers.TimeLimit`, the `done` signal is not respected. What I mean by that is: when I create an environment, wrap it into a `Monitor`, and afterwards into a `gym.wrappers.TimeLimit` to limit the time steps, `evaluate_policy` never returns. Instead, it runs the environment for the limited number of steps, then resets it, and finally starts all over again.
As far as I can tell, it happens as follows:
Once the maximum number of steps for `TimeLimit` is surpassed, it writes `not done` into the `info` dict: `info['TimeLimit.truncated'] = not done`. Note, `done` should always be `False` in L19 of the linked `time_limit.py`, otherwise the environment would have exited before, making the value in the dict always `True`. However, it doesn't really matter for us what's inside the dict. Afterwards it sets `done` to `True`. Then `evaluate_policy` checks if the environment is `done`, which it is. Next, it checks if the environment is wrapped in a `Monitor`; again, that's true for us. Now the problem is, due to a workaround for Atari, the key `episode` has to be present in `info`. However, it is not; instead, the key `TimeLimit.truncated` is. As the key is not present, `evaluate_policy` skips this `done` signal. Thus, we finish the loop and, due to the `TimeLimit`, we reset the environment and start over. See:
https://github.com/openai/gym/blob/0.18.3/gym/wrappers/time_limit.py#L18-L20
and
`stable_baselines3/common/evaluation.py`, lines 100 to 105 at commit b52c6fc
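To make the interaction concrete, here is a small runnable sketch (the env choice is an assumption) that surfaces the truncation flag without the `episode` key:

```python
import gym
from gym.wrappers import TimeLimit
from stable_baselines3.common.monitor import Monitor

# Monitor first, TimeLimit outside: the wrapping order from the report above.
env = TimeLimit(Monitor(gym.make("Pendulum-v0")), max_episode_steps=5)

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())

# The outer TimeLimit set done=True and wrote the truncation flag, but the
# Monitor (inside it) never saw done=True, so no "episode" entry was recorded.
print(info)  # contains 'TimeLimit.truncated': True, but no 'episode' key
```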
To Reproduce
To reproduce the issue, run code like the snippet below. In this case it will not exit but instead stays in a loop, see above. To successfully run the code, remove the `env = Monitor(env)` line.
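A minimal reconstruction of such a repro (the concrete env and algorithm are my guesses, not the original snippet):

```python
import gym
from gym.wrappers import TimeLimit
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

env = gym.make("Pendulum-v0")
env = Monitor(env)                           # remove this line and it terminates
env = TimeLimit(env, max_episode_steps=100)  # TimeLimit wrapped around Monitor

model = PPO("MlpPolicy", env)
# Never returns: every done from the outer TimeLimit is skipped because
# info carries 'TimeLimit.truncated' instead of the expected 'episode' key.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
```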
Expected behavior
Wrapping the monitored environment into a `TimeLimit` should exit after the maximum number of steps defined in said `TimeLimit`. One solution would be for the dictionary check to also allow `TimeLimit.truncated` as a valid key (sketched below).
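A sketch of that idea as a helper predicate (hypothetical name, not SB3 API; the real change would live inside `evaluate_policy`'s loop):

```python
from typing import Any, Dict

def is_true_episode_end(info: Dict[str, Any]) -> bool:
    """Hypothetical helper for the check in evaluate_policy: accept either a
    Monitor-recorded episode end or an outer TimeLimit truncation."""
    return "episode" in info or info.get("TimeLimit.truncated", False)
```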
System Info
Describe the characteristics of your environment:
- Installed via `pip3 install stable-baselines3[extra]==1.1.0a11`
- Python: 3.8.5
- PyTorch: 1.8.1+cu102
- Gym: 0.18.0
Additional context
Add any other context about the problem here.