
Add Normalize env #2387

Merged (15 commits) on Sep 9, 2021
Conversation

@vwxyzjn (Contributor) commented Sep 3, 2021

Making a draft here to add a NormalizeEnv wrapper that normalizes returns and observations. This kind of normalization has been critical to the success of PPO with the robotics envs.

@vwxyzjn (Contributor, Author) commented Sep 3, 2021

I realized one thing after looking into this: SB3's RunningMeanStd actually behaves differently from baselines' RunningMeanStd, as shown in the demo script below. Was wondering if @araffin and @Miffyli could shed some light here. Would really appreciate it. If anything, the original implementation looks more correct to my eye? (The first envs.reset() for the vector env returns obs=[0, 1], and therefore obs_rms.mean should be 0.5 instead of the 0 calculated by SB3's VecNormalize.)

from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize, VecEnvWrapper
import gym
import numpy as np

class DummyRewardEnv(gym.Env):
    metadata = {}
    def __init__(self, return_reward_idx=0):
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Box(
            low=np.array([-1.0]), high=np.array([1.0])
        )
        self.returned_rewards = [0, 1, 2, 3, 4]
        self.return_reward_idx = return_reward_idx
        self.t = self.return_reward_idx

    def step(self, action):
        self.t += 1
        return np.array([self.t]), self.t, self.t == len(self.returned_rewards), {}

    def reset(self):
        self.t = self.return_reward_idx
        return np.array([self.t])

def make_env(return_reward_idx):
    def thunk():
        env = DummyRewardEnv(return_reward_idx)
        return env
    return thunk


class OriginalBaselinesRunningMeanStd(object):
    # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, 'float64')
        self.var = np.ones(shape, 'float64')
        self.count = epsilon

    def update(self, x):
        batch_mean = np.mean(x, axis=0)
        batch_var = np.var(x, axis=0)
        batch_count = x.shape[0]
        self.update_from_moments(batch_mean, batch_var, batch_count)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        self.mean, self.var, self.count = update_mean_var_count_from_moments(
            self.mean, self.var, self.count, batch_mean, batch_var, batch_count)

def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
    delta = batch_mean - mean
    tot_count = count + batch_count

    new_mean = mean + delta * batch_count / tot_count
    m_a = var * count
    m_b = batch_var * batch_count
    M2 = m_a + m_b + np.square(delta) * count * batch_count / tot_count
    new_var = M2 / tot_count
    new_count = tot_count

    return new_mean, new_var, new_count


class OriginalBaselinesVecNormalize(VecEnvWrapper):
    """
    A vectorized wrapper that normalizes the observations
    and returns from an environment.
    """

    def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8, use_tf=False):
        VecEnvWrapper.__init__(self, venv)
        self.ob_rms = OriginalBaselinesRunningMeanStd(shape=self.observation_space.shape) if ob else None
        self.ret_rms = OriginalBaselinesRunningMeanStd(shape=()) if ret else None
        self.clipob = clipob
        self.cliprew = cliprew
        self.ret = np.zeros(self.num_envs)
        self.gamma = gamma
        self.epsilon = epsilon

    def step_wait(self):
        obs, rews, news, infos = self.venv.step_wait()
        self.ret = self.ret * self.gamma + rews
        obs = self._obfilt(obs)
        if self.ret_rms:
            self.ret_rms.update(self.ret)
            rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
        self.ret[news] = 0.
        return obs, rews, news, infos

    def _obfilt(self, obs):
        if self.ob_rms:
            self.ob_rms.update(obs)
            obs = np.clip((obs - self.ob_rms.mean) / np.sqrt(self.ob_rms.var + self.epsilon), -self.clipob, self.clipob)
            return obs
        else:
            return obs

    def reset(self):
        self.ret = np.zeros(self.num_envs)
        obs = self.venv.reset()
        return self._obfilt(obs)

env_fns = [make_env(0), make_env(1)]

print("SB3's VecNormalize")
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
envs = DummyVecEnv(env_fns)
envs = VecNormalize(envs)
envs.reset()
print(envs.obs_rms.mean)
obs, reward, done, _ = envs.step([envs.action_space.sample(), envs.action_space.sample()])
print(envs.obs_rms.mean)

print("OriginalBaselinesVecNormalize")
envs = DummyVecEnv(env_fns)
envs = OriginalBaselinesVecNormalize(envs)
envs.reset()
print(envs.ob_rms.mean)
obs, reward, done, _ = envs.step([envs.action_space.sample(), envs.action_space.sample()])
print(envs.ob_rms.mean)
$ python test_sb3_vecnormalize.py
SB3's VecNormalize
[0.]
[1.499925]
OriginalBaselinesVecNormalize
[0.499975]
[0.999975]
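
For what it's worth, both sets of printed numbers are consistent with the parallel-variance update above once the epsilon=1e-4 initial count is taken into account. A quick check (illustrative only, reusing update_mean_var_count_from_moments from the script above):

# Quick check, reusing update_mean_var_count_from_moments from the script above.
# Original baselines: both the reset batch [0, 1] and the post-step batch [1, 2] are counted.
mean, var, count = np.zeros(1), np.ones(1), 1e-4
mean, var, count = update_mean_var_count_from_moments(mean, var, count, np.array([0.5]), np.array([0.25]), 2)
print(mean)  # [0.499975]  (= 1 / 2.0001, just under 0.5 because of the epsilon count)
mean, var, count = update_mean_var_count_from_moments(mean, var, count, np.array([1.5]), np.array([0.25]), 2)
print(mean)  # [0.999975]

# SB3's printed value is what you get if the reset batch is never counted:
mean, var, count = np.zeros(1), np.ones(1), 1e-4
mean, var, count = update_mean_var_count_from_moments(mean, var, count, np.array([1.5]), np.array([0.25]), 2)
print(mean)  # [1.499925]  (= 3 / 2.0001)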

@araffin (Contributor) commented Sep 3, 2021

> Was wondering if @araffin and @Miffyli could shed some light here. Would really appreciate it. If anything, the original implementation looks more correct to my eye?

Thanks for spotting the bug (luckily it is only in the reset; I know what's happening, the bug was introduced in hill-a/stable-baselines#609). I will push a fix soon.

Note: results should not be invalidated, since the statistics still converge to the right mean in the end: one observation update was skipped (at reset) and the return update was called instead (but the return was zero).

@araffin (Contributor) commented Sep 3, 2021

Hotfix is on its way ;) (after some tests, that should be okay as it converges to the same mean)
DLR-RM/stable-baselines3#558

@vwxyzjn marked this pull request as ready for review on September 3, 2021 at 14:09
@vwxyzjn (Contributor, Author) commented Sep 3, 2021

@jkterry1 this is ready for review

@jkterry1 (Collaborator) commented Sep 3, 2021

@vwxyzjn it seems to me that this wrapper is doing way too much: normalizing obs, normalizing returns/rewards, and clipping rewards/returns should be 3 separate wrappers.

@vwxyzjn (Contributor, Author) commented Sep 3, 2021

@jkterry1 I agree that we should break the features into separate wrappers. One tricky thing is the clipping. Originally I was thinking that I could use the TransformObservation and TransformReward wrappers to do the clipping, but note that the implementation actually clips the post-normalized reward and obs:

            rews = np.clip(
                rews / np.sqrt(self.return_rms.var + self.epsilon),
                -self.clip_reward,
                self.clip_reward,
            )
            obs = np.clip(
                (obs - self.obs_rms.mean) / np.sqrt(self.obs_rms.var + self.epsilon),
                -self.clip_obs,
                self.clip_obs,
            )

Because of this, I propose we create two wrappers, NormalizeObservation and NormalizeReturn. How does this sound?

@jkterry1 (Collaborator) commented Sep 4, 2021

- All the transform wrappers are planned to be redone anyway.
- I didn't understand the comments about the problem with clipping.

@vwxyzjn (Contributor, Author) commented Sep 4, 2021

On second thought, the existing wrappers would allow me to do the clipping. @jkterry1 what’s your plan for redoing the transform wrappers?
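
For example, wrapping the proposed normalization wrappers with the existing TransformObservation / TransformReward would clip the already-normalized values. An untested sketch (NormalizeObservation and NormalizeReturn are the wrappers proposed in this PR, not an existing API):

import gym
import numpy as np
from gym.wrappers import TransformObservation, TransformReward

env = gym.make("CartPole-v1")
env = NormalizeObservation(env)  # proposed in this PR: normalizes observations with running stats
env = TransformObservation(env, lambda obs: np.clip(obs, -10, 10))  # clips the post-normalized obs
env = NormalizeReturn(env)  # proposed in this PR: scales rewards by the std of the discounted return
env = TransformReward(env, lambda r: np.clip(r, -10, 10))  # clips the post-normalized reward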

Also just want to make sure you are happy with two wrappers NormalizeObservation and NormalizeReturn before I refactor.

@vwxyzjn (Contributor, Author) commented Sep 8, 2021

The performance matches the old Normalize wrapper: https://wandb.ai/costa-huang/brax/reports/NormalizeReturn-and-NormalizeObservation--VmlldzoxMDA0Mjg3. Also pinging @benblack769.

@vwxyzjn (Contributor, Author) commented Sep 8, 2021

EDIT: this was confusion on my part. I was printing the variance of the empirical returns instead of the variance of the processed returns that are actually used for training.

I am getting confused about what NormalizeReturn actually does. In the Phasic Policy Gradient paper, the authors suggest

> we avoid this concern by normalizing rewards so that discounted returns have approximately unit variance

However, at least according to the anecdotal experiment shown here and in the screenshot below, this is not true. In particular, the variance is several orders of magnitude higher than 1.

[screenshot: logged variance of the discounted returns]

Another description of what NormalizeReturn does is that "the rewards are divided through by the standard deviation of a rolling discounted sum of the reward" (Engstrom, Ilyas et al. 2020), which I have found to be far more accurate.
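
In code terms, the mechanism being described is roughly the following (an illustrative sketch, not the exact PR code; return_rms is a RunningMeanStd like the one in the script above, and returns holds the per-env rolling discounted sum):

import numpy as np

def scale_reward(reward, returns, return_rms, gamma=0.99, epsilon=1e-8):
    # Keep a rolling discounted sum of rewards and track its running variance;
    # the reward itself is only divided by the std of that sum (no mean subtraction).
    returns = returns * gamma + reward
    return_rms.update(returns)
    scaled = reward / np.sqrt(return_rms.var + epsilon)
    return scaled, returns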

Should I rename the wrapper to something like SmartScaleReward??

@benblack769 commented

I think renaming it is wise. SmartScaleReward is too vague though. Perhaps RewardRunningWindowNormalization? I admit it's a bit verbose; do you have a different idea?

@vwxyzjn (Contributor, Author) commented Sep 8, 2021 via email

@benblack769 commented

Ah, maybe RunningWindowRewardNormalization

@vwxyzjn (Contributor, Author) commented Sep 9, 2021 via email
