Add Normalize env #2387
Conversation
I realized one thing after looking into this: SB3's VecNormalize ends up with a different observation running mean than the original baselines implementation after reset. Here is a minimal script comparing the two:
```python
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize, VecEnvWrapper
import gym
import numpy as np


class DummyRewardEnv(gym.Env):
    """Toy env whose observation and reward are simply a step counter."""

    metadata = {}

    def __init__(self, return_reward_idx=0):
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Box(
            low=np.array([-1.0]), high=np.array([1.0])
        )
        self.returned_rewards = [0, 1, 2, 3, 4]
        self.return_reward_idx = return_reward_idx
        self.t = self.return_reward_idx

    def step(self, action):
        self.t += 1
        return np.array([self.t]), self.t, self.t == len(self.returned_rewards), {}

    def reset(self):
        self.t = self.return_reward_idx
        return np.array([self.t])


def make_env(return_reward_idx):
    def thunk():
        env = DummyRewardEnv(return_reward_idx)
        return env
    return thunk


# The original baselines implementation, inlined here for comparison.
class OriginalBaselinesRunningMeanStd(object):
    # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, 'float64')
        self.var = np.ones(shape, 'float64')
        self.count = epsilon

    def update(self, x):
        batch_mean = np.mean(x, axis=0)
        batch_var = np.var(x, axis=0)
        batch_count = x.shape[0]
        self.update_from_moments(batch_mean, batch_var, batch_count)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        self.mean, self.var, self.count = update_mean_var_count_from_moments(
            self.mean, self.var, self.count, batch_mean, batch_var, batch_count)


def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
    delta = batch_mean - mean
    tot_count = count + batch_count

    new_mean = mean + delta * batch_count / tot_count
    m_a = var * count
    m_b = batch_var * batch_count
    M2 = m_a + m_b + np.square(delta) * count * batch_count / tot_count
    new_var = M2 / tot_count
    new_count = tot_count

    return new_mean, new_var, new_count


class OriginalBaselinesVecNormalize(VecEnvWrapper):
    """
    A vectorized wrapper that normalizes the observations
    and returns from an environment.
    """

    def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8, use_tf=False):
        VecEnvWrapper.__init__(self, venv)
        self.ob_rms = OriginalBaselinesRunningMeanStd(shape=self.observation_space.shape) if ob else None
        self.ret_rms = OriginalBaselinesRunningMeanStd(shape=()) if ret else None
        self.clipob = clipob
        self.cliprew = cliprew
        self.ret = np.zeros(self.num_envs)
        self.gamma = gamma
        self.epsilon = epsilon

    def step_wait(self):
        obs, rews, news, infos = self.venv.step_wait()
        self.ret = self.ret * self.gamma + rews
        obs = self._obfilt(obs)
        if self.ret_rms:
            self.ret_rms.update(self.ret)
            rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
        self.ret[news] = 0.
        return obs, rews, news, infos

    def _obfilt(self, obs):
        if self.ob_rms:
            self.ob_rms.update(obs)
            obs = np.clip((obs - self.ob_rms.mean) / np.sqrt(self.ob_rms.var + self.epsilon), -self.clipob, self.clipob)
            return obs
        else:
            return obs

    def reset(self):
        self.ret = np.zeros(self.num_envs)
        obs = self.venv.reset()
        return self._obfilt(obs)


# Compare the observation running mean after reset + one step.
env_fns = [make_env(0), make_env(1)]

print("SB3's VecNormalize")
envs = DummyVecEnv(env_fns)
envs = VecNormalize(envs)
envs.reset()
print(envs.obs_rms.mean)
obs, reward, done, _ = envs.step([envs.action_space.sample(), envs.action_space.sample()])
print(envs.obs_rms.mean)

print("OriginalBaselinesVecNormalize")
envs = DummyVecEnv(env_fns)
envs = OriginalBaselinesVecNormalize(envs)
envs.reset()
print(envs.ob_rms.mean)
obs, reward, done, _ = envs.step([envs.action_space.sample(), envs.action_space.sample()])
print(envs.ob_rms.mean)
```
Thanks for spotting the bug (luckily it is only in the reset; I know what's happening, the bug was introduced in hill-a/stable-baselines#609) and I will push a fix soon. Note: results should not be invalidated, as it would still converge to the right mean in the end: one update for the observations was skipped (at reset) and the update for the return was called instead (but the return was zero).
Hotfix is on its way ;) (after some tests, that should be okay as it converges to the same mean)
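For readers following along, here is my reading of the reset bug described above, written as a pair of illustrative functions. This is a sketch, not SB3's actual code: `obs_rms`, `ret_rms`, and `normalize_obs` are the attribute/method names SB3's `VecNormalize` exposes, everything else is hypothetical.

```python
import numpy as np

# Illustrative only: a sketch of the reset bug described above, not SB3's actual code.
# `vec_normalize` is assumed to be an SB3-style VecNormalize wrapper exposing
# `venv`, `num_envs`, `ret`, `ret_rms`, `obs_rms`, and `normalize_obs`.

def buggy_reset(vec_normalize):
    obs = vec_normalize.venv.reset()
    vec_normalize.ret = np.zeros(vec_normalize.num_envs)
    vec_normalize.ret_rms.update(vec_normalize.ret)  # return stats updated with zeros
    return vec_normalize.normalize_obs(obs)          # obs_rms never sees the reset observation

def fixed_reset(vec_normalize):
    obs = vec_normalize.venv.reset()
    vec_normalize.ret = np.zeros(vec_normalize.num_envs)
    vec_normalize.obs_rms.update(obs)                # the observation update that was being skipped
    return vec_normalize.normalize_obs(obs)
```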
@jkterry1 this is ready for review
@vwxyzjn it seems to me like this wrapper is doing way too much; e.g. normalizing observations, normalizing returns/rewards, and clipping rewards/returns should be 3 separate wrappers.
@jkterry1 I agree that we should break the features into separate wrappers. One tricky thing is the clipping: originally I was thinking that I could use the existing transform wrappers for it, but that did not seem straightforward. So because of this, I propose we create two wrappers, NormalizeObservation and NormalizeReturn.
All the transform wrappers are planned to be redone anyway.
On second thought, the existing wrappers would allow me to do clipping. @jkterry1 what's your plan for redoing the transform wrappers? Also, I just want to make sure you are happy with the two wrappers NormalizeObservation and NormalizeReturn before I refactor.
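For concreteness, here is a rough sketch of what the two proposed wrappers could look like on a single (non-vectorized) env, reusing the running mean/std update from the script above. The class names follow the proposal in this thread; the exact signatures, defaults, and the use of the old 4-tuple `step` API are assumptions for illustration, not the final gym interface.

```python
import gym
import numpy as np


class RunningMeanStd:
    """Running mean/std with the same parallel-variance update as the script above."""

    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, "float64")
        self.var = np.ones(shape, "float64")
        self.count = epsilon

    def update(self, x):
        batch_mean, batch_var, batch_count = np.mean(x, axis=0), np.var(x, axis=0), x.shape[0]
        delta = batch_mean - self.mean
        tot_count = self.count + batch_count
        new_mean = self.mean + delta * batch_count / tot_count
        m2 = (self.var * self.count + batch_var * batch_count
              + np.square(delta) * self.count * batch_count / tot_count)
        self.mean, self.var, self.count = new_mean, m2 / tot_count, tot_count


class NormalizeObservation(gym.Wrapper):
    """Normalize observations with a running mean/std estimate."""

    def __init__(self, env, epsilon=1e-8):
        super().__init__(env)
        self.obs_rms = RunningMeanStd(shape=self.observation_space.shape)
        self.epsilon = epsilon

    def _normalize(self, obs):
        self.obs_rms.update(np.array([obs]))
        return (obs - self.obs_rms.mean) / np.sqrt(self.obs_rms.var + self.epsilon)

    def reset(self, **kwargs):
        return self._normalize(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._normalize(obs), reward, done, info


class NormalizeReturn(gym.Wrapper):
    """Scale rewards by the running std of the discounted return (no mean subtraction)."""

    def __init__(self, env, gamma=0.99, epsilon=1e-8):
        super().__init__(env)
        self.return_rms = RunningMeanStd(shape=())
        self.ret = 0.0
        self.gamma = gamma
        self.epsilon = epsilon

    def reset(self, **kwargs):
        self.ret = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.ret = self.ret * self.gamma + reward
        self.return_rms.update(np.array([self.ret]))
        scaled_reward = reward / np.sqrt(self.return_rms.var + self.epsilon)
        if done:
            self.ret = 0.0
        return obs, scaled_reward, done, info
```

With this split, clipping could stay a separate concern, e.g. handled by an existing transform/clip wrapper as discussed above.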
The performance is matched to the old implementation.
EDIT: this was a confusion on my part. I was printing out the variance of the empirical returns instead of the variance of the processed returns that are actually used for training.

I am getting confused about what the wrapper is actually supposed to guarantee. The name suggests the normalized returns should have roughly unit variance; however, at least according to an anecdotal experiment shown here or the screenshot below, this is not true. In particular, the variance is several orders of magnitude higher than 1. A more accurate description of what the wrapper does is scaling the rewards by the standard deviation of a running estimate of the discounted returns. Should I rename the wrapper to something like SmartScaleReward?
I think renaming it is wise. SmartScaleReward is too vague though. Perhaps RewardRunningWindowNormalization? I mean, it's a bit verbose; do you have a different idea?
I have currently renamed it to ScaleRewardByReturnVariance.
Ah, maybe RunningWindowRewardNormalization
I am a little unsure about using the term "Normalization", since it sort of indicates a zero mean and unit variance, which is not the case here. I think this is also the reason why [(Engstrom, Ilyas et al. 2020)](https://openreview.net/forum?id=r1etN1rtPB) named the technique "Reward scaling".
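To spell out the distinction in formulas (this is just my reading of the `step_wait` code earlier in this thread, not an official definition): the wrapper keeps a discounted-return accumulator and divides each raw reward by the standard deviation of that accumulator's running statistics, so the rewards are rescaled but nothing forces the resulting returns to have unit variance.

```latex
% My reading of the reward scaling in step_wait above (illustrative, not a spec):
% R_t is the running discounted return (reset to 0 when an episode ends),
% \sigma_R^2 its running variance estimate, c the clipping constant, \epsilon a small constant.
\begin{aligned}
R_t        &= \gamma R_{t-1} + r_t,\\
\tilde r_t &= \operatorname{clip}\!\left(\frac{r_t}{\sqrt{\sigma_R^2 + \epsilon}},\ -c,\ c\right).
\end{aligned}
```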
Making a draft over here to add a NormalizeEnv wrapper that normalizes returns and observations. It has been critical to the success of PPO with the robotics envs.
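For context, a hypothetical usage sketch with the wrapper names proposed earlier in this thread (the env id and the API shown are assumptions for illustration, not the final merged interface):

```python
import gym

# Hypothetical usage of the wrappers sketched earlier in this thread.
# "Pendulum-v0" is just an example env id; any continuous-control env would do.
env = gym.make("Pendulum-v0")
env = NormalizeObservation(env)  # running mean/std normalization of observations
env = NormalizeReturn(env)       # scale rewards by the running std of the discounted return

obs = env.reset()
for _ in range(10):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
```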