
Add Normalize env #2387

Merged (15 commits) on Sep 9, 2021
Conversation

@vwxyzjn (Contributor) commented Sep 3, 2021

Making a draft here to add a NormalizeEnv wrapper that normalizes returns and observations. This kind of normalization has been critical to the success of PPO with the robotics envs.

@vwxyzjn (Contributor, Author) commented Sep 3, 2021

I realized one thing after looking into this: SB3's RunningMeanStd actually behaves differently from baselines' RunningMeanStd, as shown in the demo script below. Was wondering if @araffin and @Miffyli could shed some light here. Would really appreciate it. If anything, the original implementation looks more correct to my eye? (The first envs.reset() for the vector env returns obs=[0, 1], and therefore obs_rms.mean should be 0.5 instead of the 0 calculated by SB3's VecNormalize.)

from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize, VecEnvWrapper
import gym
import numpy as np

class DummyRewardEnv(gym.Env):
    metadata = {}
    def __init__(self, return_reward_idx=0):
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Box(
            low=np.array([-1.0]), high=np.array([1.0])
        )
        self.returned_rewards = [0, 1, 2, 3, 4]
        self.return_reward_idx = return_reward_idx
        self.t = self.return_reward_idx

    def step(self, action):
        self.t += 1
        return np.array([self.t]), self.t, self.t == len(self.returned_rewards), {}

    def reset(self):
        self.t = self.return_reward_idx
        return np.array([self.t])

def make_env(return_reward_idx):
    def thunk():
        env = DummyRewardEnv(return_reward_idx)
        return env
    return thunk


class OriginalBaselinesRunningMeanStd(object):
    # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, 'float64')
        self.var = np.ones(shape, 'float64')
        self.count = epsilon

    def update(self, x):
        batch_mean = np.mean(x, axis=0)
        batch_var = np.var(x, axis=0)
        batch_count = x.shape[0]
        self.update_from_moments(batch_mean, batch_var, batch_count)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        self.mean, self.var, self.count = update_mean_var_count_from_moments(
            self.mean, self.var, self.count, batch_mean, batch_var, batch_count)

def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
    delta = batch_mean - mean
    tot_count = count + batch_count

    new_mean = mean + delta * batch_count / tot_count
    m_a = var * count
    m_b = batch_var * batch_count
    M2 = m_a + m_b + np.square(delta) * count * batch_count / tot_count
    new_var = M2 / tot_count
    new_count = tot_count

    return new_mean, new_var, new_count


class OriginalBaselinesVecNormalize(VecEnvWrapper):
    """
    A vectorized wrapper that normalizes the observations
    and returns from an environment.
    """

    def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8, use_tf=False):
        VecEnvWrapper.__init__(self, venv)
        self.ob_rms = OriginalBaselinesRunningMeanStd(shape=self.observation_space.shape) if ob else None
        self.ret_rms = OriginalBaselinesRunningMeanStd(shape=()) if ret else None
        self.clipob = clipob
        self.cliprew = cliprew
        self.ret = np.zeros(self.num_envs)
        self.gamma = gamma
        self.epsilon = epsilon

    def step_wait(self):
        obs, rews, news, infos = self.venv.step_wait()
        self.ret = self.ret * self.gamma + rews
        obs = self._obfilt(obs)
        if self.ret_rms:
            self.ret_rms.update(self.ret)
            rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
        self.ret[news] = 0.
        return obs, rews, news, infos

    def _obfilt(self, obs):
        if self.ob_rms:
            self.ob_rms.update(obs)
            obs = np.clip((obs - self.ob_rms.mean) / np.sqrt(self.ob_rms.var + self.epsilon), -self.clipob, self.clipob)
            return obs
        else:
            return obs

    def reset(self):
        self.ret = np.zeros(self.num_envs)
        obs = self.venv.reset()
        return self._obfilt(obs)

env_fns = [make_env(0), make_env(1)]

print("SB3's VecNormalize")
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
envs = DummyVecEnv(env_fns)
envs = VecNormalize(envs)
envs.reset()
print(envs.obs_rms.mean)
obs, reward, done, _ = envs.step([envs.action_space.sample(), envs.action_space.sample()])
print(envs.obs_rms.mean)

print("OriginalBaselinesVecNormalize")
envs = DummyVecEnv(env_fns)
envs = OriginalBaselinesVecNormalize(envs)
envs.reset()
print(envs.ob_rms.mean)
obs, reward, done, _ = envs.step([envs.action_space.sample(), envs.action_space.sample()])
print(envs.ob_rms.mean)
$ python test_sb3_vecnormalize.py
SB3's VecNormalize
[0.]
[1.499925]
OriginalBaselinesVecNormalize
[0.499975]
[0.999975]
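
For what it's worth, both sets of printed numbers are consistent with the parallel-variance update above once the epsilon=1e-4 initial count is taken into account. A quick check (illustrative only, reusing update_mean_var_count_from_moments from the script above):

# Quick check, reusing update_mean_var_count_from_moments from the script above.
# Original baselines: both the reset batch [0, 1] and the post-step batch [1, 2] are counted.
mean, var, count = np.zeros(1), np.ones(1), 1e-4
mean, var, count = update_mean_var_count_from_moments(mean, var, count, np.array([0.5]), np.array([0.25]), 2)
print(mean)  # [0.499975]  (= 1 / 2.0001, just under 0.5 because of the epsilon count)
mean, var, count = update_mean_var_count_from_moments(mean, var, count, np.array([1.5]), np.array([0.25]), 2)
print(mean)  # [0.999975]

# SB3's printed value is what you get if the reset batch is never counted:
mean, var, count = np.zeros(1), np.ones(1), 1e-4
mean, var, count = update_mean_var_count_from_moments(mean, var, count, np.array([1.5]), np.array([0.25]), 2)
print(mean)  # [1.499925]  (= 3 / 2.0001)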

@araffin (Contributor) commented Sep 3, 2021

> Was wondering if @araffin and @Miffyli could shed some light here. Would really appreciate it. If anything, the original implementation looks more correct to my eye?

Thanks for spotting the bug (luckily it is only in the reset; I know what's happening, the bug was introduced in hill-a/stable-baselines#609). I will push a fix soon.

Note: results should not be invalidated, since the statistics still converge to the right mean in the end: one observation update was skipped (at reset) and the return update was called instead (but the return was zero).

@araffin (Contributor) commented Sep 3, 2021

Hotfix is on its way ;) (after some tests, that should be okay as it converges to the same mean)
DLR-RM/stable-baselines3#558

@vwxyzjn marked this pull request as ready for review on September 3, 2021 at 14:09
@vwxyzjn (Contributor, Author) commented Sep 3, 2021

@jkterry1 this is ready for review

@jkterry1 (Collaborator) commented Sep 3, 2021

@vwxyzjn it seems to me that this wrapper is doing way too much: normalizing obs, normalizing returns/rewards, and clipping rewards/returns should be 3 separate wrappers.

@vwxyzjn (Contributor, Author) commented Sep 3, 2021

@jkterry1 I agree that we should break the features into separate wrappers. One tricky thing is the clipping. Originally I was thinking that I could use the TransformObservation and TransformReward wrappers to do the clipping, but note that the implementation actually clips the post-normalized reward and obs:

            rews = np.clip(
                rews / np.sqrt(self.return_rms.var + self.epsilon),
                -self.clip_reward,
                self.clip_reward,
            )
            obs = np.clip(
                (obs - self.obs_rms.mean) / np.sqrt(self.obs_rms.var + self.epsilon),
                -self.clip_obs,
                self.clip_obs,
            )

Because of this, I propose we create two wrappers, NormalizeObservation and NormalizeReturn. How does this sound?

@jkterry1 (Collaborator) commented Sep 4, 2021

- All the transform wrappers are planned to be redone anyway.
- I didn't understand the comments about the problem with clipping.

@vwxyzjn (Contributor, Author) commented Sep 4, 2021

On second thought, the existing wrappers would allow me to do the clipping. @jkterry1 what’s your plan for redoing the transform wrappers?
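
For example, wrapping the proposed normalization wrappers with the existing TransformObservation / TransformReward would clip the already-normalized values. An untested sketch (NormalizeObservation and NormalizeReturn are the wrappers proposed in this PR, not an existing API):

import gym
import numpy as np
from gym.wrappers import TransformObservation, TransformReward

env = gym.make("CartPole-v1")
env = NormalizeObservation(env)  # proposed in this PR: normalizes observations with running stats
env = TransformObservation(env, lambda obs: np.clip(obs, -10, 10))  # clips the post-normalized obs
env = NormalizeReturn(env)  # proposed in this PR: scales rewards by the std of the discounted return
env = TransformReward(env, lambda r: np.clip(r, -10, 10))  # clips the post-normalized reward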

Also just want to make sure you are happy with two wrappers NormalizeObservation and NormalizeReturn before I refactor.

@vwxyzjn (Contributor, Author) commented Sep 8, 2021

The performance matches the old Normalize wrapper: https://wandb.ai/costa-huang/brax/reports/NormalizeReturn-and-NormalizeObservation--VmlldzoxMDA0Mjg3. Also pinging @benblack769.

@vwxyzjn (Contributor, Author) commented Sep 8, 2021

EDIT: this was confusion on my part. I was printing the variance of the empirical returns instead of the variance of the processed returns that are actually used for training.

I am getting confused about what NormalizeReturn actually does. In the Phasic Policy Gradient paper, the authors suggest

> we avoid this concern by normalizing rewards so that discounted returns have approximately unit variance

However, at least according to the anecdotal experiment shown here and in the screenshot below, this is not true. In particular, the variance is several orders of magnitude higher than 1.

[screenshot: logged variance of the discounted returns]

Another description of what NormalizeReturn does is that "the rewards are divided through by the standard deviation of a rolling discounted sum of the reward" (Engstrom, Ilyas et al. 2020), which I have found to be far more accurate.
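
In code terms, the mechanism being described is roughly the following (an illustrative sketch, not the exact PR code; return_rms is a RunningMeanStd like the one in the script above, and returns holds the per-env rolling discounted sum):

import numpy as np

def scale_reward(reward, returns, return_rms, gamma=0.99, epsilon=1e-8):
    # Keep a rolling discounted sum of rewards and track its running variance;
    # the reward itself is only divided by the std of that sum (no mean subtraction).
    returns = returns * gamma + reward
    return_rms.update(returns)
    scaled = reward / np.sqrt(return_rms.var + epsilon)
    return scaled, returns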

Should I rename the wrapper to something like SmartScaleReward??

@benblack769 commented

I think renaming it is wise. SmartScaleReward is too vague though. Perhaps RewardRunningWindowNormalization? I admit it's a bit verbose; do you have a different idea?

@vwxyzjn (Contributor, Author) commented Sep 8, 2021 via email

@benblack769 commented

Ah, maybe RunningWindowRewardNormalization

@vwxyzjn (Contributor, Author) commented Sep 9, 2021 via email
