
MCTS default policy


Goal

Play like an amateur player

  • End turn only when nothing can be done
    • Tend to use up crystals
  • Prefer to attack a weaker minion with a stronger minion (a minimal sketch of this heuristic follows this list)
    • Rules to choose the defender
      • Attack the strongest minion (as long as the attacker survives)
      • Otherwise, attack the strongest minion anyway
      • Otherwise (no enemy minions), attack the hero
  • Play a spell only when it's effective
    • Need to categorize cards
      • AOE cards, cards that buff a minion, cards that deal damage
      • TODO: sometimes the effect depends on the target
        • E.g., +2/+2 on a friendly demon, otherwise deal 2 damage
    • Need to categorize game boards
      • Many opponent minions -> prefer to use an AOE
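
A minimal sketch of the attack heuristic above, assuming a simple Minion structure with attack and hp fields (names are illustrative, not the engine's actual API):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Minion:
    attack: int
    hp: int

def choose_defender(attacker: Minion, enemy_minions: List[Minion]) -> Optional[Minion]:
    """Pick a defender for `attacker` following the heuristic above.

    Prefer the strongest enemy minion the attacker survives against;
    otherwise the strongest enemy minion anyway; otherwise (no minions)
    attack the hero, represented here by returning None.
    """
    def strength(m: Minion) -> int:
        # Rough strength measure: sum of attack and HP.
        return m.attack + m.hp

    survivable = [m for m in enemy_minions if m.attack < attacker.hp]
    if survivable:
        return max(survivable, key=strength)
    if enemy_minions:
        return max(enemy_minions, key=strength)
    return None  # no enemy minions: go face
```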

Offline Learning

State-value function

State-action function

  • Survey
    • Actor-critic framework
  • Approximate the state-action function Q(s,a) by a (deep) neural network
  • This is the Q-function in Q-learning.
  • Actions
    • The card-id of the playing card
    • Choose which character to attack
    • Choose which character to defend
  • Difficulties
    • Only a small subset of actions is viable at any time
      • The card must be held in hand
      • Must have enough resources (crystals) to play that card
    • Need to categorize the playing card
      • A minion
      • A minion with aura
      • An AOE spell card
      • A healing spell card
      • A damaging spell card
      • Also consider the target when deciding the category:
        • A card can be in different categories depending on its target
        • E.g., +2/+2 on a friendly demon, otherwise deal 2 damage
    • Maybe a different neural network for each sub-action (see the sketch at the end of this section)
      • One neural network to choose main-action: attack, play card, hero power, or end turn
      • One neural network to choose attacker
        • The optimal case is to coordinate several attacks to take down a big minion
        • How to make this neural network that smart?
      • One neural network to choose card to play
        • A subnet N1 to extract a few features (say, 5) from the board
          • Input: hero / minion stats
        • A subnet N2 to extract features of the playing card
          • Input: card id / its target / choose-card
            • Can this be trained effectively?
          • Output: the category of the playing card
            • An AOE, an enchantment, or a damage spell
        • A subnet N3 to give final decision
          • Input: output of N1 and N2
        • Need to loop over all possible targets?
          • Since a card can be in different categories depending on its target
        • How about choose-one callbacks?
          • Loop over choose-one callbacks?
          • Maybe another neural network to give the decision:
            • Input: card id
            • Output: decide which card to choose
            • Still need a few subnets in the deep neural network
              • Subnet N1: extract features from board
              • Subnet N2: categorize the card id
              • Subnet N3: combine N1 and N2; produce final output
  • Analysis
    • [GOOD] One forward-pass can get a good action
    • [BAD] Hard to train
  • Keywords
    • Q learning
    • SARSA
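
A minimal PyTorch sketch of the N1/N2/N3 decomposition for the play-card sub-action described above. All layer sizes and input dimensions are placeholder assumptions, not the project's actual model:

```python
import torch
import torch.nn as nn

class PlayCardQNet(nn.Module):
    """Sketch: N1 extracts board features, N2 extracts features of the playing
    card (and its target), N3 combines them into a Q-value for that action."""

    def __init__(self, board_dim=30, card_dim=16, n_board_feats=5, n_card_feats=5):
        super().__init__()
        # N1: hero / minion stats -> a few board features.
        self.n1 = nn.Sequential(nn.Linear(board_dim, 32), nn.ReLU(),
                                nn.Linear(32, n_board_feats))
        # N2: card id / target encoding -> card-category-like features.
        self.n2 = nn.Sequential(nn.Linear(card_dim, 32), nn.ReLU(),
                                nn.Linear(32, n_card_feats))
        # N3: combine the outputs of N1 and N2 into a single Q-value.
        self.n3 = nn.Sequential(nn.Linear(n_board_feats + n_card_feats, 16),
                                nn.ReLU(), nn.Linear(16, 1))

    def forward(self, board, card_and_target):
        feats = torch.cat([self.n1(board), self.n2(card_and_target)], dim=-1)
        return self.n3(feats)

# Because a card can fall into different categories depending on its target,
# loop over all (card, target) candidates and pick the one with the best Q.
net = PlayCardQNet()
board = torch.randn(1, 30)
candidates = [torch.randn(1, 16) for _ in range(3)]  # one encoding per (card, target)
best = max(candidates, key=lambda c: net(board, c).item())
```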

Online learning

Some conclusions first. This method is probably more suitable for tasks where there is no suitable static policy (or where there should not be one). For example, if the opponent is playing an aggro deck, the optimal default policy is P1; if the opponent is playing a control deck, the optimal default policy is P2. If policies P1 and P2 differ a lot, this kind of online learning might be useful. However, in Hearthstone, this does not seem to be the case. We usually aim to achieve a good board state (at least at the amateur level), which should be approximable by a state-value function.

  • Survey
  • Methods
    • Predicate-Average Sampling Technique
    • Features-to-Action Sampling Technique
  • Idea
    • A global table to store Q(a)
      • a is one of the actions
      • Q(a) is the average result of the playouts in which action a was chosen
      • Example: in Othello, a good action a is to place a piece in a corner
    • The global table can be updated in the back-propagation step
    • Use the table to bias the MCTS default policy (see the first sketch at the end of this page)
      • Choose actions using a Boltzmann distribution over Q(a)
    • Can use predicates to constrain the table
      • Example: chess piece is at a certain location
  • Hand-crafted predicates
    • Force-select actions:
      • If a character is attackable, choose to attack
        • Rather than play a card, etc.
        • Choose the attacker with the highest HP
      • If attacking the hero would kill him/her, attack the hero
        • If not, attack a minion
      • Choose a weaker defender
        • Defender dies, attacker survives
        • Choose the defender with the highest attack
      • Choose a defender of the same strength
        • Both attacker and defender die
        • Choose the defender with the highest attack
      • Choose a stronger defender
        • Attacker dies, defender survives
        • Choose the defender with the highest attack
      • If a weapon is equipped, do not play a weapon card
    • Predicate: In main-action:
      • Actions:
        • Play a minion when owning <= 3 minions
        • Play a minion when owning >= 4 minions
        • Attack minion
        • Attack hero
        • Hero power
        • End turn
      • Rationale: match the heuristic
        • End-turn should be chosen less often than play-card, attack, and hero-power
        • Hero-power should be chosen less often than play-card and attack
        • Prefer to play a minion when owning few minions
    • Predicate: choose target of a card with a particular card id
      • Actions:
        • Target an ally character
        • Target an enemy character
  • TODO
    • Refine definition of actions
      • If we carefully define actions, we can use RAVE (Rapid Action Value Estimation) to enhance the tree policy
      • And use MAST (Move-Average Sampling Technique) to enhance the default policy
    • Use (Predicate-) Rapid Action Value Estimation
    • Ideas to redefine actions (see the second sketch at the end of this page):
      • A minion is stronger/same/weaker than n minions
        • Or, just sum up the HP and attack to get a rough idea of how strong it is
        • This is a total ordering, so a rank can be used
      • Hero health + armor
      • Note: a game state can be seen/valued by the predicates which hold in that state
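
A minimal sketch of the global Q(a) table and Boltzmann sampling described under "Idea" above, in the spirit of MAST. The class name, the reward convention, and the temperature value are assumptions for illustration, not part of the project:

```python
import math
import random
from collections import defaultdict

class MastTable:
    """Global action-value table Q(a), updated during the MCTS back-propagation
    step and used to bias the default (playout) policy."""

    def __init__(self, temperature=1.0):
        self.total_reward = defaultdict(float)
        self.count = defaultdict(int)
        self.temperature = temperature

    def update(self, action, reward):
        # Called once per action on the playout path during back-propagation.
        self.total_reward[action] += reward
        self.count[action] += 1

    def q(self, action):
        # Average playout result of the action; neutral prior for unseen actions.
        if self.count[action] == 0:
            return 0.0
        return self.total_reward[action] / self.count[action]

    def sample(self, legal_actions):
        # Boltzmann (softmax) distribution over Q(a) of the currently legal actions.
        weights = [math.exp(self.q(a) / self.temperature) for a in legal_actions]
        return random.choices(legal_actions, weights=weights, k=1)[0]
```

Keying the table on (predicate, action) pairs instead of raw actions gives the predicate-constrained variant mentioned above.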
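
A minimal sketch of the action-redefinition ideas in the TODO above: rank a minion by attack + HP, and describe a game state by the predicates that hold in it. All field names (my_minions, opponent, hero, armor) are illustrative assumptions, not the engine's API:

```python
def minion_strength(minion):
    # Rough strength: sum of attack and HP; this gives a total ordering (a rank).
    return minion.attack + minion.hp

def state_predicates(state):
    """Describe a game state by the set of predicates that hold in it.
    Field names are illustrative only."""
    preds = set()
    for mine in state.my_minions:
        # "My minion is stronger than n enemy minions" for each friendly minion.
        stronger_than = sum(1 for enemy in state.opponent.minions
                            if minion_strength(mine) > minion_strength(enemy))
        preds.add(("my_minion_stronger_than_n_enemy_minions", stronger_than))
    # Hero health + armor as an effective-health predicate.
    preds.add(("enemy_hero_effective_health",
               state.opponent.hero.hp + state.opponent.hero.armor))
    return preds
```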