MCTS default policy

Peter Shih edited this page Aug 7, 2017 · 11 revisions
  • Survey
  • Methods
    • Predicate-Average Sampling Technique
    • Features-to-Action Sampling Technique
  • Idea
    • A global table to store Q(a)
      • a is one of the actions
      • Q(a) is the number of times the action a is chosen
      • Example: in Othello, a good action a is to place a disc in a corner
    • The global table can be updated in back-propagation step
    • Use the table to bias the MCTS default policy
      • Choose actions using Boltzmann distribution over Q(a)
    • Can use predicates to constrain the table
      • Example: a chess piece is at a certain location
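The idea above can be sketched in a few lines. This is an illustrative Python sketch, not the project's code: `q_table` is the assumed global table mapping an action to the number of times it was chosen (updated during back-propagation), and actions are sampled with a Boltzmann (softmax) distribution over Q(a).

```python
import math
import random

def boltzmann_sample(q_table, legal_actions, temperature=1.0, rng=random):
    """Pick one of `legal_actions`, biased toward actions with high Q(a).

    `q_table` maps action -> count of times the action was chosen;
    actions never seen before default to 0.
    """
    qs = [q_table.get(a, 0) for a in legal_actions]
    # Subtract the max before exponentiating for numerical stability.
    m = max(qs)
    weights = [math.exp((q - m) / temperature) for q in qs]
    total = sum(weights)
    r = rng.random() * total
    for action, w in zip(legal_actions, weights):
        r -= w
        if r <= 0:
            return action
    return legal_actions[-1]
```

Constraining the table with predicates would simply mean keying `q_table` by a (predicate, action) pair instead of the action alone.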
  • Hand-crafted predicates
    • Force-select actions:
      • If a character is attackable, choose to attack
        • Rather than play a card, etc.
        • Choose the attacker with the highest HP
      • If attacking the hero would kill him/her, attack the hero
        • Otherwise, attack a minion
      • Choose a weaker defender
        • Defender dies, attacker survives
        • Choose the defender with the highest attack
      • Choose a defender of the same strength
        • Both attacker and defender die
        • Choose the defender with the highest attack
      • Choose a stronger defender
        • Attacker dies, defender survives
        • Choose the defender with the highest attack
      • If a weapon is equipped, do not play a weapon card
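The defender-selection rules above can be sketched as a single preference key. This is an illustrative Python sketch under simplifying assumptions: characters are plain `(hp, attack)` tuples, effects like divine shield are ignored, and the "neither dies" case is lumped into the worst tier.

```python
def choose_defender(attacker, defenders):
    """Pick a defender following the hand-crafted rules:
    (1) weaker defender (defender dies, attacker survives),
    (2) same strength (both die),
    (3) stronger defender (attacker dies).
    Ties within a tier are broken by the highest attack.
    """
    a_hp, a_atk = attacker

    def tier(defender):
        d_hp, d_atk = defender
        defender_dies = a_atk >= d_hp
        attacker_dies = d_atk >= a_hp
        if defender_dies and not attacker_dies:
            return 0  # weaker defender: best trade
        if defender_dies and attacker_dies:
            return 1  # same strength: both die
        return 2      # stronger defender (or neither dies)

    # Lowest tier first; within a tier, highest attack first.
    return min(defenders, key=lambda d: (tier(d), -d[1]))
```

A force-select wrapper would call this only when an attack is chosen at all, after the lethal-on-hero check above.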
    • Predicate: In main-action:
      • Actions:
        • Play a minion when owning <= 3 minions
        • Play a minion when owning >= 4 minions
        • Attack minion
        • Attack hero
        • Hero power
        • End turn
      • Rationale: match the heuristic
        • End-turn should be chosen less often than play-card, attack, and hero-power
        • Hero-power should be chosen less often than play-card and attack
        • Prefer to play a minion when owning few minions
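One way to encode the rationale above is a static weight table for the main-action predicate. The weight values below are assumptions chosen only to respect the stated ordering (end-turn least likely, hero-power less likely than play-card and attack); the action names are illustrative.

```python
import random

# Hypothetical per-action weights for the "main-action" predicate.
MAIN_ACTION_WEIGHTS = {
    "play_minion_few_on_board": 5.0,   # owning <= 3 minions: prefer developing
    "play_minion_many_on_board": 2.0,  # owning >= 4 minions
    "attack_minion": 4.0,
    "attack_hero": 4.0,
    "hero_power": 2.0,                 # less likely than play-card and attack
    "end_turn": 1.0,                   # least likely
}

def sample_main_action(legal_actions, weights=MAIN_ACTION_WEIGHTS, rng=random):
    """Weighted random choice among the legal main actions."""
    ws = [weights[a] for a in legal_actions]
    return rng.choices(legal_actions, weights=ws, k=1)[0]
```

In the global-table scheme these weights would instead be learned counts keyed by the predicate, but fixed weights give the default policy the same qualitative bias from the first playout.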
    • Predicate: choose the target of a card with a particular card ID
      • Actions:
        • Target an ally character
        • Target an enemy character
  • TODO
    • Refine definition of actions
      • If we carefully define actions, we can use RAVE (rapid action value estimation) to enhance tree policy
      • And use MAST (Move-Average Sampling Technique) to enhance default policy
    • Use (Predicate-) Rapid Action Value Estimation
    • Ideas to redefine actions:
      • A minion is stronger/same/weaker than n minions
        • Or, just sum up the 'HP' and 'attack' to get a rough idea of how strong it is
        • This is a total ordering. A rank can be used
      • Hero health + armor
      • Note: a game state can be seen/valued by the predicates which hold in that state
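The HP-plus-attack idea above can be made concrete. This is an illustrative Python sketch, assuming minions are `(hp, attack)` tuples: the summed score induces a total order, so each minion gets a well-defined rank on the board (0 = weakest, equal scores share a rank).

```python
def strength_score(minion):
    """Rough strength of a minion: HP + attack (minion = (hp, attack))."""
    return minion[0] + minion[1]

def strength_rank(minion, board):
    """Rank of `minion` among `board` by the rough strength score.

    The score gives a total order, so the rank is well defined;
    rank 0 means no minion on the board is strictly weaker.
    """
    s = strength_score(minion)
    return sum(1 for m in board if strength_score(m) < s)
```

Actions redefined over such ranks ("attack the k-th strongest minion") stay comparable across states, which is what RAVE and MAST need.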
  • Backup
    • Self-Character 1 can attack Opponent-Character 1
      • and so on for self-characters 1-8 and opponent characters 1-8
      • total 8*8 = 64 features
    • Self-Character 1 can attack Opponent-Character 1, attacker survives, defender dies
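The 8*8 = 64 binary attack features above map naturally onto a flat vector. This is an illustrative Python sketch (0-based indices here, while the page numbers characters from 1; `can_attack` is an assumed game-state predicate):

```python
def attack_feature_index(self_idx, opp_idx, n=8):
    """Flat index of the feature "self-character self_idx can attack
    opponent-character opp_idx" in the n*n feature vector."""
    return self_idx * n + opp_idx

def attack_features(can_attack, n=8):
    """Build the n*n binary feature vector from a predicate
    can_attack(self_idx, opp_idx) -> bool."""
    return [1 if can_attack(i, j) else 0
            for i in range(n) for j in range(n)]
```

The refined "attacker survives, defender dies" variant would be a second 64-entry block built the same way from a stricter predicate.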