Add QR-DQN #13
Conversation
Overall, looks good already =)
Obviously missing tests, documentation and benchmark, but it is a good start!
Did you have any issues so far? Or things that slowed you down because of how SB3 works while implementing QR-DQN?
(asking so we know what we can improve in SB3 ;))
@araffin (I only have 16 GB of RAM and I usually store images as a list of LazyFrames, which uses 4 times less memory. Maybe it's worth adding such a feature to reduce memory usage?) Thanks ;)
What is tau exactly? (It would be nice to have a better name; I'm quite new to quantile regression, so not everything is clear to me yet.)
that's fine ;)
yes ;) but please test it first on a simpler env and with a smaller replay buffer size.
Look at the zoo and the replay buffer, we have a memory-efficient option for that. EDIT: I'm not sure about the lazy frames, but if it is simple enough to implement, it would be a good addition ;) (I think it was there in SB2 but not used)
Sure, thank you for your advice.
We model the quantile function, which is the mapping from the cumulative probability to the quantile value.
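For illustration, a minimal sketch of what tau refers to here (the midpoint formula follows the QR-DQN paper; this is not the PR's exact code):

```python
import torch as th

n_quantiles = 4  # illustrative value

# tau are the cumulative probabilities (quantile fractions) at which the
# return distribution is evaluated; QR-DQN uses the midpoints of N equally
# spaced probability bins: tau_i = (2i + 1) / (2N).
tau = (th.arange(n_quantiles, dtype=th.float32) + 0.5) / n_quantiles
print(tau)  # tensor([0.1250, 0.3750, 0.6250, 0.8750])

# The network predicts one return value per tau (the quantile values),
# and tau weights the asymmetric Huber loss during regression.
```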
I didn't notice it. It seems it stores the next observation and the observation in the same array, which is efficient 👍
LazyFrames is a list of arrays that is only converted into a (frame-stacked) array when accessed. Because it stores a frame-stacked observation as a list of references to the underlying arrays, it never stores the same frame twice. Thank you for your kind response.
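A minimal LazyFrames-style sketch (an assumption about the idea, not the exact gym/SB2 implementation): keep references to the individual frames and only stack them on access, so a frame shared by several stacked observations lives in memory once.

```python
import numpy as np

class LazyFrames:
    def __init__(self, frames):
        # Keep references to the per-step frames instead of a copied stack
        self._frames = list(frames)

    def __array__(self, dtype=None):
        # Materialize the frame-stacked observation only when it is needed
        out = np.stack(self._frames, axis=0)
        return out.astype(dtype) if dtype is not None else out

# Usage: consecutive observations share 3 of their 4 frames by reference
frames = [np.zeros((84, 84), dtype=np.uint8) for _ in range(5)]
obs_t = LazyFrames(frames[0:4])
obs_tp1 = LazyFrames(frames[1:5])
stacked = np.asarray(obs_t)  # shape (4, 84, 84), built on demand
```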
thanks for the refresher =).
yep + a comment ;) (in case of doubt, a more verbose name is always good ;))
yes, it is not as memory efficient as LazyFrames, but it is as fast as a normal buffer and also works without frame stacking, at the cost of some complexity (that's why it is False by default).
how much slower? (1.2x or ~2x slower)
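For context, a rough sketch of the memory-saving trick discussed above (an assumption about the idea, not the actual SB3 ReplayBuffer code): the next observation is read from the slot of the following transition instead of being stored twice.

```python
import numpy as np

class TinyImageBuffer:
    """Minimal circular buffer that avoids duplicating next_obs."""

    def __init__(self, size, obs_shape):
        self.size = size
        self.observations = np.zeros((size, *obs_shape), dtype=np.uint8)
        self.pos = 0

    def add(self, obs, next_obs):
        self.observations[self.pos] = obs
        # next_obs is written into the following slot, where it will also
        # serve as the observation of the next transition
        self.observations[(self.pos + 1) % self.size] = next_obs
        self.pos = (self.pos + 1) % self.size

    def next_obs(self, idx):
        # Recover next_obs from the neighbouring slot instead of a second array
        return self.observations[(idx + 1) % self.size]
```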
Yes 👍
I will test it again in a couple of days.
sb3_contrib/qrdqn/qrdqn.py
current_quantiles = th.gather(current_quantiles, dim=2, index=actions).squeeze(2)

# Compute Quantile Huber loss
loss = quantile_huber_loss(current_quantiles, target_quantiles) * self.n_quantiles
why do you multiply by self.n_quantiles? (that was not done in the original TQC code if I recall... I did not check for QR-DQN yet)
We sum over the quantile dimension, which is common in QR-DQN, IQN, and FQF.
I'm not sure why they take the mean in TQC.
It seems that Dopamine is using the mean loss by default: https://github.com/google/dopamine/blob/master/dopamine/jax/agents/quantile/quantile_agent.py#L91
The same goes for the IQN implementation in PFRL: https://github.com/pfnet/pfrl/blob/master/pfrl/agents/iqn.py#L211
or the Facebook ReAgent implementation: https://github.com/facebookresearch/ReAgent/blob/master/reagent/training/qrdqn_trainer.py#L157
maybe that could be a parameter? (and we should check how it affects learning)
> It seems that Dopamine is using the mean loss by default: https://github.com/google/dopamine/blob/master/dopamine/jax/agents/quantile/quantile_agent.py#L91
> The same goes for the IQN implementation in PFRL: https://github.com/pfnet/pfrl/blob/master/pfrl/agents/iqn.py#L211
Their implementations sum over the (current) quantile dimension, so the same as mine, aren't they?
Multiplying by n_quantiles is the same as summing over the (current) quantile dimension instead of averaging.
EDIT: They are the same as our QR-DQN loss, not as the TQC loss.
Maybe we should add an argument like sum_over_quantiles: bool to quantile_huber_loss?
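A hedged sketch of what such an option could look like (the argument name follows the suggestion above; the exact SB3 signature and defaults are assumptions):

```python
import torch as th

def quantile_huber_loss_sketch(current_quantiles, target_quantiles, sum_over_quantiles=True):
    # current_quantiles: (batch, n_quantiles), target_quantiles: (batch, n_target_quantiles)
    n_quantiles = current_quantiles.shape[1]
    # Pairwise TD errors, shape (batch, n_target_quantiles, n_quantiles)
    pairwise_delta = target_quantiles.unsqueeze(-1) - current_quantiles.unsqueeze(-2)
    abs_delta = pairwise_delta.abs()
    # Huber loss with kappa = 1
    huber = th.where(abs_delta > 1, abs_delta - 0.5, 0.5 * pairwise_delta ** 2)
    # Quantile midpoints tau for the current quantiles
    tau = (th.arange(n_quantiles, device=current_quantiles.device, dtype=th.float32) + 0.5) / n_quantiles
    loss = th.abs(tau - (pairwise_delta.detach() < 0).float()) * huber
    if sum_over_quantiles:
        # QR-DQN/IQN convention: sum over the current-quantile dimension
        return loss.sum(dim=-1).mean()
    # Plain mean over all dimensions; equals the summed version divided by n_quantiles
    return loss.mean()
```

With such an option, the call site above could drop the explicit * self.n_quantiles and pass sum_over_quantiles=True instead.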
> Their implementations sum over the (current) quantile dimension, so the same as mine, aren't they?
I will try to check later when I'm rested ;)
> Maybe we should add an argument like sum_over_quantiles: bool to quantile_huber_loss?
Probably, yes
I could finally do the review, LGTM, thanks =)
(the only thing missing is adding the benchmark to the doc)
Thank you so much ;)
Could you do this for me, or should I update the doc?
done ;) @Miffyli I'll let you decide if we need more experiments (even though we match the Intel Coach results), or whether we merge ;)
Visually comparing the results here vs. the original paper, the Breakout and Pong results seem to match up to the 10e6 training steps, so I would trust this implementation enough to use it as QR-DQN myself and believe we can merge it :). Just two things: the docs should mention what we compared against (Intel Coach and, very roughly, the original paper), and we should also update the zoo instructions once we merge the branch.
Hate to post on this very late, but is there a specific reason we cannot use action_noise in QRDQN? I was looking to use it for exploration rather than an exploration scheduler, and would prefer action noise over noisy nets for now.
QRDQN only supports Discrete action spaces, and there is no action_noise for Discrete spaces (at least none implemented in SB3 as of writing).
@araffin I tried to reproduce those results, but it did not work out. Which versions of the games were used in your benchmark? https://www.gymlibrary.ml/environments/atari/ documents 3 versions (v0, v4, v5), and there are a couple of different options for the v5 setting as well as different variants of the environment for v4. I tried "Breakout-v5" with default options as well as "BreakoutNoFrameskip-v4", however my learning curves do not look anything like yours.
My results looked like the following (learning curves omitted). I would really appreciate it if you could give me hints on reproducing your baseline.
Hello @hh0rva1h ,
we always use the NoFrameskip-v4 versions (and gym 0.21).
Please use the RL Zoo for that; we have instructions in the documentation. Instructions are also on the Hugging Face Hub: https://huggingface.co/sb3/qrdqn-BreakoutNoFrameskip-v4 and the detailed hyperparameters are here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/qrdqn.yml#L1
I think your issue may come from the monitoring, please read DLR-RM/stable-baselines3#181. You should at least use …
@araffin any chance you have the reasoning in mind? I thought v4 NoFrameskip was to be avoided due to the "memorizing sequences" problem (even with the up to 30 no-ops from the Atari wrapper, I suspect it just memorizes 31 sequences).
You can read more about why in the PR on updating gym: …
TL;DR: the new Atari v5 environments were not yet benchmarked, and as we have everything using …
Do you have a reference for that?
@araffin: I would have concluded that from … Brute-like refers to models that rely purely on memorization. Of course it is not memorizing a sequence (as the action always depends only on the previous state), but I imagine it as being able to properly recognize only states that come from that exact playbook. And to verify: train a model using NoFrameskip-v4 and evaluate on v4 (stochasticity via frameskip) or v5 (stochasticity via sticky actions) -> it will transfer poorly. I imagine a more sophisticated approach would be to use NoFrameskip-v4, randomly take a set of integers A from B = {0, ..., 30} for the no-ops, train the model only on B \ A, and evaluate it on A, i.e. use the no-ops in A only in evaluation but not in training. If the model performs poorly on those no-op settings it was not trained on -> we know it just memorized. When trying to reproduce good results on Breakout I ran into exactly that issue: it is simple and straightforward to produce high scores with NoFrameskip-v4 (even with no-ops), but with v4 or v5 it becomes much harder. A rough sketch of this check is below.
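A rough sketch of the verification idea above (the wrapper name and structure are hypothetical, written against the old gym 0.21 step/reset API used elsewhere in this thread):

```python
import random
import gym

class FixedNoopStart(gym.Wrapper):
    """Hypothetical wrapper: start each episode with a no-op count drawn from a given set."""

    def __init__(self, env, allowed_noops, noop_action=0):
        super().__init__(env)
        self.allowed_noops = list(allowed_noops)
        self.noop_action = noop_action

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        # Apply a randomly chosen number of initial no-ops from the allowed set
        for _ in range(random.choice(self.allowed_noops)):
            obs, _, done, _ = self.env.step(self.noop_action)
            if done:
                obs = self.env.reset(**kwargs)
        return obs

B = list(range(31))              # all possible initial no-op counts {0, ..., 30}
A = random.sample(B, 8)          # held out, used only at evaluation time
train_noops = [n for n in B if n not in A]

train_env = FixedNoopStart(gym.make("BreakoutNoFrameskip-v4"), train_noops)
eval_env = FixedNoopStart(gym.make("BreakoutNoFrameskip-v4"), A)
```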
@mreitschuster I read about the memorizing issue as well. Do you by any chance have baseline curves at hand for the v4 or v5 scenarios? It would be really great to have more baselines apart from NoFrameskip-v4.
Not with QR-DQN (yet), working on it. With PPO I have quite a few, but mostly with a non-standard wrapper configuration. But that would go off-topic for this PR; you can have a look at my work on tuning & env selection. Short answer: without an aimbot I get to a score of 180 and with an aimbot to 220 (on 1e7 training steps). I haven't found a PM functionality on GitHub, so if you want a discussion on that, feel free to open an issue/discussion and add me there.
Sorry, no & don't know. I haven't dived into QR-DQN too deeply yet. I was just testing models on Breakout and wanted to check my results against others'. QR-DQN looked promising, showing scores of 400, but after realizing it was in the deterministic environment, the excitement level dropped.
Thanks @mreitschuster for the links!
The RL Zoo models are using frameskip (via the Atari wrapper).
You need to be careful when you do that because, as mentioned before, if you use the Atari wrapper, there is frame skipping.
By sticky actions, you mean randomly repeating actions?
It is my understanding that we have two different frameskips available: the stochastic frameskip (randomly skip frames), which is built into *-v4 (but deactivated when using *NoFrameskip-v4) and also not active in v5 (which injects stochasticity differently), and the deterministic frameskip, found in the SB3 AtariWrapper as well as in the *Deterministic-v4 environments.
yes, you are right, but the wrapper provides the deterministic frameskip (for speeding up the game), not the stochastic one (to train more transferable skills)
yes. I think if we want to deepen this (and I would be very happy to), it should be a separate discussion.
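To make the distinction above concrete, a hedged illustration (environment IDs and keyword arguments are assumptions based on the gym/ALE versions discussed in this thread; double-check against your installed versions):

```python
import gym
from stable_baselines3.common.atari_wrappers import AtariWrapper

# Deterministic frame skip: NoFrameskip-v4 does no skipping itself,
# and the SB3 AtariWrapper repeats each action for a fixed number of frames.
env_deterministic_skip = AtariWrapper(gym.make("BreakoutNoFrameskip-v4"), frame_skip=4)

# Stochastic frame skip: plain v4 samples how many frames each action is repeated.
env_stochastic_skip = gym.make("Breakout-v4")

# v5 uses a fixed frame skip but injects stochasticity via sticky actions
# (the previous action is randomly repeated with the given probability).
env_sticky_actions = gym.make("ALE/Breakout-v5", repeat_action_probability=0.25)
```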
Implement QR-DQN
Description
Paper: https://arxiv.org/abs/1710.10044
Added quantile_huber_loss to sb3_contrib.common.utils
Context
closes #12
Types of changes
Checklist:
- make format (required)
- make check-codestyle and make lint (required)
- make pytest and make type both pass (required)

Note: we are using a maximum length of 127 characters per line
Benchmark