MDP Version #9
Thanks for your interest in AlphaZero.jl. It is indeed my plan to add support for MDPs. This requires adding three features:

- support for single-player games (the current MCTS approximates a minimax tree for two-player, zero-sum games);
- support for intermediate (non-terminal) rewards;
- support for stochastic transitions.
Regarding the last point (stochasticity), there are two possible approaches and I would appreciate some opinions / feedback.

Approach 1: with explicit chance nodes. In this approach, we add explicit chance nodes to MCTS and have the environment expose the full distribution over successor states. The advantage of this approach is that it leverages full knowledge of the transition probabilities. On the other hand, it is restricted to finite distributions with relatively small supports (although optimizations such as node grouping can be used to handle larger chance branching factors).

Approach 2: without explicit chance nodes. The alternative is to have the environment simply sample a single successor state at each transition. This approach can accommodate arbitrary transition distributions. It also has the advantage of treating deterministic and stochastic games identically, reducing complexity.

Asymptotically, I am pretty sure both approaches are equivalent. However, I have been looking for an empirical comparison and haven't found any. Does anyone have an opinion on this?
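For illustration, here is a minimal sketch of what Approach 1 could look like; the node types and field names are hypothetical, not AlphaZero.jl's actual MCTS implementation:

```julia
abstract type Node end

struct ChanceNode <: Node
    probs::Vector{Float64}     # finite transition distribution P(s′ | s, a)
    children::Vector{Node}     # one decision node per possible outcome
end

struct DecisionNode <: Node
    children::Dict{Int,Node}   # action index => chance node
    N::Dict{Int,Int}           # visit counts per action
    W::Dict{Int,Float64}       # accumulated value per action
end

# Sample an outcome index from a finite distribution (no extra packages needed).
function sample_index(probs)
    r, acc = rand(), 0.0
    for (i, p) in enumerate(probs)
        acc += p
        r <= acc && return i
    end
    return length(probs)
end

# Descending through a chance node samples one outcome, but the stored
# distribution allows backups to take an exact expectation over the children.
descend(node::ChanceNode) = node.children[sample_index(node.probs)]

chance_value(node::ChanceNode, V) =
    sum(p * V(c) for (p, c) in zip(node.probs, node.children))
```

The key point is that the full distribution is stored at each chance node, so backups can use exact expectations even though each descent only samples one outcome.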
The most common way is more similar to approach 2 - you simply sample a single new state and reward from the transition probability distribution. If you have access to IEEE publications, there is a very clear pseudocode description in algorithm 4.9 of this book chapter: https://ieeexplore-ieee-org.colorado.idm.oclc.org/document/7288642 You can also see one of our implementations here, https://github.com/JuliaPOMDP/MCTS.jl, though this code was written long ago, so I would say it is not the best source of insight, and there are some things I would do differently now (and plan to in the future). Happy to review code or help in another way!
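As a rough sketch of that sampling-based recursion (in the spirit of generative-model MCTS, not a transcription of the book's pseudocode; `step`, `actions`, and `isterminal` are placeholder arguments for whatever environment interface is used, and leaf rollouts are omitted for brevity):

```julia
# N and Q are Dicts keyed by (state, action) pairs; `step(s, a)` returns a
# sampled `(s′, r)`, `actions(s)` the available actions, `isterminal(s)` a Bool.
function simulate!(N, Q, s, depth; step, actions, isterminal, γ = 0.95, c = 1.0)
    (depth == 0 || isterminal(s)) && return 0.0
    as = actions(s)
    for a in as                       # lazily initialize statistics
        haskey(N, (s, a)) || (N[(s, a)] = 0; Q[(s, a)] = 0.0)
    end
    ns = sum(N[(s, a)] for a in as)
    # UCB selection; unvisited actions get priority.
    score(a) = N[(s, a)] == 0 ? Inf : Q[(s, a)] + c * sqrt(log(ns) / N[(s, a)])
    a = as[argmax(map(score, as))]
    s′, r = step(s, a)                # sample ONE successor state and reward
    q = r + γ * simulate!(N, Q, s′, depth - 1;
                          step = step, actions = actions, isterminal = isterminal,
                          γ = γ, c = c)
    N[(s, a)] += 1
    Q[(s, a)] += (q - Q[(s, a)]) / N[(s, a)]   # running-mean backup
    return q
end
```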
It will also be important to be able to handle non-terminal rewards for MDPs (I noticed this was mentioned in #4).
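In other words, with non-terminal rewards the value backed up along a simulated trajectory becomes a discounted sum of per-step rewards rather than a single game outcome (standard notation, nothing specific to this package):

```latex
G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1}
\qquad \text{rather than a single terminal outcome } z \in \{-1, 0, +1\}
```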
Thanks!
The link you gave me points to a "University of Colorado" portal. Can you give me the book title?
Oops, haha. It's called "Decision Making under Uncertainty: Theory and Application". This link should work: https://ieeexplore.ieee.org/document/7288642
Thanks for the reference.
I’ve been looking for something that can model games with a dice roll to start each turn. In my impression this would require Approach 1 above, as every possible roll should be considered. How could this be accomplished using approach 2? It seems to me like it would only be able to sample a single possible roll.
@JacksonKearl Approach 2 would be able to accommodate games with a dice roll. Yes, during each move of each MCTS simulation, only the result of a single dice roll is explored. But this is OK, as each simulation is going to explore different stochastic choices, and the frequency of exploring each choice will be proportional to its probability. Even in a deterministic game, MCTS is never able to explore every possible branch (it differs from minimax in this way). Even when using Approach 1, a single dice roll would be sampled at each chance node. The only difference is that some information from the exact probability distribution could potentially be used when backpropagating rewards.
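To make this concrete, here is a purely illustrative toy (not an AlphaZero.jl API): when a dice game is handled with Approach 2, the roll is simply part of the sampled transition, so each simulation sees one concrete roll, but across many simulations every face gets explored with frequency close to 1/6:

```julia
struct DiceState
    score::Int
    roll::Int                  # the roll the player must act on this turn
end

# Toy dynamics: action 1 "banks" the current roll, anything else scores 0;
# a fresh roll is sampled for the next turn either way.
function step(s::DiceState, action::Int)
    r = action == 1 ? s.roll : 0
    return DiceState(s.score + r, rand(1:6)), r
end

s0 = DiceState(0, rand(1:6))
s1, r1 = step(s0, 1)           # one sampled successor, as in Approach 2
```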
An update: the last release (v0.3.0) generalized the game interface. In particular, it adds support for stochastic games and intermediate rewards. It shouldn't take much work to add support for CommonRLInterface.jl now.
@zsunberg Do you have a nice MDP environment to suggest that I could use as a test example for MDP support in AlphaZero.jl?
@jonathan-laurent are these updates in the docs? I don't see anything here: https://jonathan-laurent.github.io/AlphaZero.jl/stable/reference/game_interface/
@JacksonKearl The doc seems up-to-date to me (you can compare it with the doc for v0.2.0). In particular, at the top of the page you link to, one can read that "stochastic games and intermediate rewards are supported". Note, however, that:
@jonathan-laurent Awesome. I would say start simple with a Grid World problem (you can google it to see what it is). That way you can compare to the optimal solution from value iteration easily. We have one implemented in the POMDPs.jl interface here: https://github.com/JuliaPOMDP/POMDPModels.jl/blob/master/src/gridworld.jl I have been a bit busy the last few days, but I hope to do some more work on CommonRLInterface this evening and tomorrow, so I will try to implement a grid world problem in that interface (or just provide POMDPs <-> CommonRLInterface compatibility) soon.
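For reference, a value-iteration baseline on a small deterministic grid world takes only a few lines (this is the standard algorithm, not tied to POMDPs.jl or AlphaZero.jl; the dynamics below are a toy version with a single goal cell and no obstacles):

```julia
# Value iteration on an nx × ny grid with a single absorbing goal cell.
function value_iteration(nx, ny; γ = 0.95, goal = (nx, ny), tol = 1e-6)
    V = zeros(nx, ny)
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    while true
        Δ = 0.0
        for x in 1:nx, y in 1:ny
            (x, y) == goal && continue                 # terminal state
            best = -Inf
            for (dx, dy) in moves
                nxt = (clamp(x + dx, 1, nx), clamp(y + dy, 1, ny))
                r = nxt == goal ? 1.0 : 0.0            # reward on reaching the goal
                best = max(best, r + γ * V[nxt...])
            end
            Δ = max(Δ, abs(best - V[x, y]))
            V[x, y] = best
        end
        Δ < tol && return V
    end
end

V = value_iteration(5, 5)   # optimal values to compare a learned agent against
```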
@zsunberg I think grid world is a good suggestion. However, there should not be a unique, fixed world: in that case, there is no opportunity for generalization and AlphaZero is useless. I think a nice setting would be to generate a random world at the beginning of each episode and to allow the agent to observe the whole grid. Thanks for your work on CommonRLInterface.jl. I think it is a really valuable project.
Yes, I agree that a fixed grid world is of course not a good showcase for AlphaZero :) but I think it is an important first step so that we can verify all of our semantics are the same and it will solve the simplest problems correctly.
@jonathan-laurent I added a grid world example in the CommonRLInterface package: https://github.com/JuliaReinforcementLearning/CommonRLInterface.jl/tree/master/examples Do you think that will work?
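For anyone reading along, the core required surface of CommonRLInterface.jl is small: five functions. A minimal sketch (the grid-world dynamics below are a toy stand-in, not the code in the linked examples folder):

```julia
using CommonRLInterface
const RL = CommonRLInterface

mutable struct GridEnv <: AbstractEnv
    pos::Tuple{Int,Int}
    size::Tuple{Int,Int}
    goal::Tuple{Int,Int}
end

RL.reset!(env::GridEnv) = (env.pos = (1, 1); nothing)
RL.actions(env::GridEnv) = [(1, 0), (-1, 0), (0, 1), (0, -1)]
RL.observe(env::GridEnv) = env.pos
RL.terminated(env::GridEnv) = env.pos == env.goal

# act! applies the move and returns the reward for that step.
function RL.act!(env::GridEnv, a)
    env.pos = clamp.(env.pos .+ a, (1, 1), env.size)
    return env.pos == env.goal ? 1.0 : 0.0
end

env = GridEnv((1, 1), (5, 5), (5, 5))
reset!(env)
r = act!(env, (1, 0))
```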
Thanks @zsunberg. I won't try it straight away as my priority right now is to add support for distributed computing, but this is in second position on my TODO list.
@jonathan-laurent, cool. By the way, one thing I have heard a bunch about implementing parallel MCTS is that using locks usually doesn't work that well. Instead, it is better to use atomic operations that can just fail if there is a conflict between two threads. Luckily Julia has these: https://docs.julialang.org/en/v1/base/multi-threading/#Base.Threads.Atomic
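A tiny illustration of that lock-free style with Base.Threads.Atomic (generic Julia, not AlphaZero.jl code; the node-statistics struct is made up for the example):

```julia
using Base.Threads

struct NodeStats
    N::Atomic{Int}          # visit count
    W::Atomic{Float64}      # accumulated value
end
NodeStats() = NodeStats(Atomic{Int}(0), Atomic{Float64}(0.0))

# Several threads can back up into the same node without taking a lock.
function backup!(stats::NodeStats, q::Float64)
    atomic_add!(stats.N, 1)
    atomic_add!(stats.W, q)
    return nothing
end

stats = NodeStats()
@threads for i in 1:1000
    backup!(stats, rand())
end
mean_value = stats.W[] / stats.N[]
```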
Thanks for the advice @zsunberg! For a start, I decided not to implement parallel MCTS but rather to parallelize across game simulations: all MCTS instances are independent and they send requests to a common inference server via a channel. But I'll definitely look into this when I add parallel MCTS into the mix.
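That pattern (many independent MCTS workers funneling network evaluations to one inference server through a Channel) can be sketched in a few lines of plain Julia; everything below, including the fake evaluation function, is illustrative rather than AlphaZero.jl internals:

```julia
using Base.Threads

struct Request
    state::Vector{Float32}
    reply::Channel{Float32}     # each worker receives its answer here
end

requests = Channel{Request}(1024)

# Inference server: drains the request channel and answers each request.
# (A real server would batch several requests before calling the network.)
server = @spawn for req in requests
    put!(req.reply, sum(req.state))    # stand-in for a network evaluation
end

# A worker running an MCTS simulation just blocks until its reply arrives.
function evaluate(state::Vector{Float32})
    reply = Channel{Float32}(1)
    put!(requests, Request(state, reply))
    return take!(reply)
end

tasks = [Threads.@spawn(evaluate(rand(Float32, 8))) for _ in 1:4]
values = fetch.(tasks)
close(requests)    # lets the server loop terminate once drained
```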
Ah, yes, that is probably more immediately useful from an RL perspective.
For anyone reading this thread, this is useful: #4 (comment)
@zsunberg Support for CommonRLInterface.jl is finally merged on master. You can try it with: julia --project -e 'using AlphaZero; Scripts.train("grid-world")'
Nice, that's great to hear! Looking forward to trying it out, and I will recommend it to the students in my class for their projects.
First of all, it's very cool to see AlphaZero implemented in Julia! I have always thought that Julia is a really good tool for this type of thing.
This comment is related to #4, but maybe offers a slightly different perspective. It would be very cool to also have a version of this that works with MDPs, either using RLBase as @findmyway suggested in #4, or POMDPs.jl (which I and some colleagues work on), or RLInterface.jl. This would need to be a considerably different implementation because the MCTS would be approximating an expectimax tree instead of a minimax tree.
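For readers less familiar with the distinction, it can be stated in one line each (standard definitions, nothing specific to this issue):

```latex
% Minimax backup (two-player, zero-sum game tree):
V(s) = \max_{a} V(s_a) \ \text{at the agent's nodes}, \qquad
V(s) = \min_{a} V(s_a) \ \text{at the opponent's nodes}

% Expectimax backup (single-agent MDP with stochastic transitions):
V(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V(s') \,\bigr]
```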
Just thought I should start this issue as a stub in case anyone wants to pick it up, and to point to some MDP definition interfaces in Julia, and to clarify that the game and MDP versions would need to be different. I and my students would definitely use the package if it had support for MDPs.