Implementation of Double DQN #52

Open: corywalker wants to merge 5 commits into master

corywalker

I was interested in implementing Double DQN in this source code, so here are my changes. Feel free to pull these into the main codebase. I didn't change much, since the Double DQN algorithm is not much different from that described in the Nature paper. I couldn't get the original tests to pass, so I was not able to add a test for Double DQN. I did test everything though, by running experiments with Breakout. Here is the performance over time:

image

Of course, the differences here are negligible, and the Double DQN paper names Breakout as a game that shows no real change under Double DQN. If I had more computing resources, I could test on the games where Double DQN makes a significant difference. Here is perhaps a more useful plot that shows how Double DQN seems to reduce value overestimates:

image
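
For readers without the plot, here is a rough NumPy illustration of the overestimation effect it refers to. This is not code from the PR: the setup (all true action values equal to zero, independent unit-variance noise, the names `est_a` and `est_b`) is an assumption chosen only to make the bias visible. In an actual Double DQN the online and target networks are correlated rather than independent, so the real reduction in bias is smaller than in this idealized sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions = 10_000, 4

# Assumption for illustration: every action has true value 0, and each
# estimate is the true value plus independent unit-variance noise.
est_a = rng.normal(size=(n_trials, n_actions))
est_b = rng.normal(size=(n_trials, n_actions))

# Single estimator: select and evaluate with the same noisy estimates.
# Taking the max over noisy values is biased upward.
single = est_a.max(axis=1).mean()

# Double estimator: select with one set of estimates, evaluate with the other.
chosen = est_a.argmax(axis=1)
double = est_b[np.arange(n_trials), chosen].mean()

print(f"single estimator mean: {single:.3f}")
print(f"double estimator mean: {double:.3f}")
```

The true value of the best action is 0 in this setup, so any positive mean for the single estimator is pure overestimation, while the double estimator should stay close to 0.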

And here is the change required for Double DQN:

image
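
Since the image is not reproduced here, the following is a minimal NumPy sketch of the difference between the two targets as described in the Double DQN paper. It is not the Theano code from this PR; the function and argument names are hypothetical, and the Q-value arrays stand in for network outputs on the next states.

```python
import numpy as np

def dqn_target(rewards, terminals, next_q_target, discount):
    """Standard DQN: the target network both selects and evaluates the action."""
    best = next_q_target.max(axis=1)
    return rewards + (1.0 - terminals) * discount * best

def double_dqn_target(rewards, terminals, next_q_online, next_q_target, discount):
    """Double DQN: the online network selects, the target network evaluates."""
    actions = next_q_online.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(actions)), actions]
    return rewards + (1.0 - terminals) * discount * evaluated

# Toy batch of two transitions; the second one is terminal.
rewards = np.array([1.0, 0.0])
terminals = np.array([0.0, 1.0])
next_q_online = np.array([[1.0, 2.0], [0.5, 0.1]])  # Q_online(s_{t+1}, .)
next_q_target = np.array([[3.0, 1.5], [0.2, 0.9]])  # Q_target(s_{t+1}, .)

y_dqn = dqn_target(rewards, terminals, next_q_target, 0.9)
y_double = double_dqn_target(rewards, terminals, next_q_online, next_q_target, 0.9)
print(y_dqn, y_double)
```

Both Q-value arrays are computed on the next states; the only difference is which network's values pick the action.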

If you don't have the time to look over the changes or to test them yourself, I understand. At least this PR will allow others to use it easily if need be.

References:

van Hasselt, H., Guez, A., & Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning. arXiv preprint arXiv:1509.06461.

@moscow25

Awesome! I've been hoping someone would implement Double DQN since the paper came out. Thanks!

Commit Summary

Double DQN support.
Bug fix, some testing code.
Checkpoint before instance shutdown.
Prepare for pull request.
File Changes

M deep_q_rl/launcher.py (5)
M deep_q_rl/q_network.py (22)
A deep_q_rl/run_double.py (66)
M deep_q_rl/run_nature.py (1)
M deep_q_rl/run_nips.py (1)
M deep_q_rl/test/test_q_network.py (12)
Patch Links:

https://github.com/spragunr/deep_q_rl/pull/52.patch
https://github.com/spragunr/deep_q_rl/pull/52.diff


@spragunr
Owner

Thanks for the PR. I'm behind on reviewing, but I'm hoping to get caught up in late December / early January. It looks like the changes aren't very disruptive so there shouldn't be an issue merging.

@alito

alito commented Nov 25, 2015

Excellent. I'm starting a test run on Space Invaders, since it's one where they saw a big increase. I'll let you know how it goes in a couple of days.

@alito

alito commented Nov 30, 2015

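Plot from Space Invaders:
[image: spaceinvadersdoubleq]
https://cloud.githubusercontent.com/assets/775207/11471057/9dd9c402-97b6-11e5-9b5a-ff571ea1e4a1.png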

This, while not being up to scratch with DeepMind's results, is, I think, much better than any result I've seen with the deep_q_rl implementation.

It's very slow to learn but seems very stable. I might try again with a higher learning rate.

@moscow25

Very nice! This is with the double Q-RL? It's just switching the network
every X steps, right?

I'm also impressed. My performance with deep_Q_RL never came close to their
reports, either...

Best,
N

@corywalker
Author

@alito Thanks for the examination. Do you mind sharing the results.csv and perhaps the results.csv from any other Space Invaders models that you have trained?

Also, here is a newer paper from DeepMind that claims better performance than Double DQN: http://arxiv.org/abs/1511.06581

Could be interesting to implement.

@alito

alito commented Dec 1, 2015

Here is results.csv for this run (note the extra column in there):
http://organicrobot.com/deepqrl/results-doubleq.csv

I don't seem to have, or at least haven't kept, a recent results.csv. I've got a few from June that didn't learn at all, and a few from the NIPS era. I've put one up from May which seems to be the best I've got, but I don't think there's a good comparison.

http://organicrobot.com/deepqrl/results-20150527.csv

I'm running a plain version now, but it will take a while to see what's going on.

Also, there's this from last week: http://arxiv.org/abs/1511.05952. Aside from doing better, it has plots of epoch vs. reward for all 57 games. From those, it seems like even their non-double Q implementation is very stable, or at least more stable than deep_q_rl seems to be at the moment.

Minor change to update the citation for Double DQN.
@moscow25

moscow25 commented Dec 2, 2015

Thanks Alejandro. I, for one, am curious to see how this comparison shakes
out for you. When I ran deep-Q-RL the first time with Theano, it didn't
really learn for me, either.

The Prioritized Replay paper that you mentioned has been sitting on my
desk, as it may also apply to my poker AI problems. Choosing the best
replay batch set is a pain, once you have a lot of so-so data... and I
think others who got better learning results from deep-Q-RL talked a lot
about it starting to forget parts of the game, as it got better at others...

I have always suspected that they sample the games data in a more clever
way than the original paper gets into. Sometimes, it's just easier to say
you did the simple thing. So curious to see if they have now come clean :-)

Best,
Nikolai

@alito

alito commented Dec 4, 2015

The run without double-q hasn't finished, but it's not going to go anywhere from its current state. I've put the results up:
http://organicrobot.com/deepqrl/results-20151201.csv

Here's the plot:
[image: spaceinvadersstandardnature]

It does better than I expected. Looks stable if nothing else. Double-Q looks like a substantial improvement in this case.

@moscow25 they've released their code, so I suspect they are not cheating in any way they haven't mentioned. I haven't tested their code, but it wouldn't be hard to find out if they aren't doing as well as they claimed in their papers.

@moscow25

moscow25 commented Dec 4, 2015

Awesome!

I meant that tongue in cheek. And yes, they released code, so it happened :-)

Just saying that it's always hard to specify a tech system precisely, especially in 7 pages. And this presumes that people who wrote the system remember every decision explored and taken.

Glad to see the double Q RL working so well. I kept starting OK but then diverging into NaN territory when I ran the (Lasagne version) on this when it came out. Seeing it converge more steadily now is great. The idea from that paper is simple, and I'm glad it just works.

Over-optimism is a huge problem for my high variance poker AI problems. So optimistic to try this version now. Thanks again for running the baseline.

Best,
Nikolai

@stokasto

There seems to be a bug in your implementation: as far as I can see you are calculating maxaction based on q_vals (which contains the Q values for s_t and NOT s_{t+1}).
To fix this you have to do a second forward pass through the current q network, using the next state.
That would look like this:
```python
    # Q-values of the online network for the current states s_t
    q_vals = lasagne.layers.get_output(self.l_out, states / input_scale)

    # Q-values for the next states s_{t+1}, used to evaluate the chosen action
    if self.freeze_interval > 0:
        next_q_vals = lasagne.layers.get_output(self.next_l_out,
                                                next_states / input_scale)
    else:
        next_q_vals = lasagne.layers.get_output(self.l_out,
                                                next_states / input_scale)
        next_q_vals = theano.gradient.disconnected_grad(next_q_vals)

    if self.use_double:
        # also get q values for next_states (second forward pass through
        # the current network)
        q_vals_next_current = lasagne.layers.get_output(
            self.l_out, next_states / input_scale)
        # select the action from Q(s_{t+1}, .), not from q_vals
        maxaction = T.argmax(q_vals_next_current, axis=1, keepdims=False)
        # evaluate the selected action with next_q_vals
        temptargets = next_q_vals[T.arange(batch_size),
                                  maxaction].reshape((-1, 1))
        target = (rewards +
                  (T.ones_like(terminals) - terminals) *
                  self.discount * temptargets)
```
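
To make the effect of the bug concrete, here is a small NumPy sketch (not part of the PR; the helper `select_and_evaluate` and the toy Q-values are made up for illustration). It compares selecting the action from the Q-values of the current state with selecting it from the Q-values of the next state, which is what the fix above does.

```python
import numpy as np

def select_and_evaluate(selector_q, next_q_target):
    """Pick one action per row from selector_q, then read its value from next_q_target."""
    actions = selector_q.argmax(axis=1)
    return next_q_target[np.arange(len(actions)), actions]

# Toy values for a single transition with two actions (assumed, for illustration).
q_online_s = np.array([[5.0, 0.0]])     # Q_online(s_t, .)
q_online_next = np.array([[0.0, 5.0]])  # Q_online(s_{t+1}, .)
q_target_next = np.array([[1.0, 4.0]])  # Q_target(s_{t+1}, .)

# Bug: the action is selected from the Q-values of s_t.
buggy = select_and_evaluate(q_online_s, q_target_next)
# Fix: the action is selected from the Q-values of s_{t+1}.
fixed = select_and_evaluate(q_online_next, q_target_next)

print(buggy, fixed)
```

In the buggy variant the chosen action reflects what looked best in the previous state, so the bootstrapped value can be arbitrarily different from the one Double DQN intends to use.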

@alito

alito commented Dec 30, 2016

The note by @stokasto sounds right. I'll do some testing.
