
MCTS default policy


Goal

Play like an amateur player

  • End turn only when nothing can be done
    • Tend to use up crystals
  • Prefer to attack a weaker minion with a stronger minion (a minimal sketch of this heuristic follows this list)
    • Rules to choose the defender
      • Attack the strongest minion (as long as the attacker survives)
      • Otherwise, attack the strongest minion anyway
      • Otherwise (no enemy minions), attack the hero
  • Play a spell only when it's effective
    • Need to categorize cards
      • AOE cards, cards that buff a minion, cards that deal damage
      • TODO: sometimes the effect depends on the target
        • E.g., +2/+2 on a friendly demon, otherwise deal 2 damage
    • Need to categorize game boards
      • Many opponent minions -> prefer to use an AOE
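
A minimal sketch of the attack heuristic above, assuming a simple Minion structure with attack and hp fields (names are illustrative, not the engine's actual API):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Minion:
    attack: int
    hp: int

def choose_defender(attacker: Minion, enemy_minions: List[Minion]) -> Optional[Minion]:
    """Pick a defender for `attacker` following the heuristic above.

    Prefer the strongest enemy minion the attacker survives against;
    otherwise the strongest enemy minion anyway; otherwise (no minions)
    attack the hero, represented here by returning None.
    """
    def strength(m: Minion) -> int:
        # Rough strength measure: sum of attack and HP.
        return m.attack + m.hp

    survivable = [m for m in enemy_minions if m.attack < attacker.hp]
    if survivable:
        return max(survivable, key=strength)
    if enemy_minions:
        return max(enemy_minions, key=strength)
    return None  # no enemy minions: go face
```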

Offline Learning

State-value function

State-action function

  • Survey
    • Actor-critic framework
  • Approximate the state-action function Q(s,a) by a (deep) neural network
  • This is the Q-function in Q-learning.
  • Actions
    • The card-id of the playing card
    • Choose which character to attack
    • Choose which character to defend
  • Difficulties
    • Only a small subset of actions is viable at any time
      • The card must be held in hand
      • Must have enough resources (crystals) to play that card
    • Need to categorize the playing card
      • A minion
      • A minion with aura
      • An AOE spell card
      • A healing spell card
      • A damaging spell card
      • Also consider the target when deciding the category:
        • A card can be in different categories depending on its target
        • E.g., +2/+2 on a friendly demon, otherwise deal 2 damage
    • Maybe a different neural network for each sub-action (see the sketch at the end of this section)
      • One neural network to choose main-action: attack, play card, hero power, or end turn
      • One neural network to choose attacker
        • The optimal case is to coordinate several attacks to take down a big minion
        • How to make this neural network that smart?
      • One neural network to choose card to play
        • A subnet N1 to extract a few features (say, 5) from the board
          • Input: hero / minion stats
        • A subnet N2 to extract features of the playing card
          • Input: card id / its target / choose-card
            • Can this be trained effectively?
          • Output: the category of the playing card
            • An AOE, an enchantment, or a damage spell
        • A subnet N3 to give final decision
          • Input: output of N1 and N2
        • Need to loop over all possible targets?
          • Since a card can be in different categories depending on its target
        • How about choose-one callbacks?
          • Loop over choose-one callbacks?
          • Maybe another neural network to give the decision:
            • Input: card id
            • Output: decide which card to choose
            • Still need a few subnets in the deep neural network
              • Subnet N1: extract features from board
              • Subnet N2: categorize the card id
              • Subnet N3: combine N1 and N2; produce final output
  • Analysis
    • [GOOD] One forward-pass can get a good action
    • [BAD] Hard to train
  • Keywords
    • Q learning
    • SARSA
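
A minimal PyTorch sketch of the N1/N2/N3 decomposition for the play-card sub-action described above. All layer sizes and input dimensions are placeholder assumptions, not the project's actual model:

```python
import torch
import torch.nn as nn

class PlayCardQNet(nn.Module):
    """Sketch: N1 extracts board features, N2 extracts features of the playing
    card (and its target), N3 combines them into a Q-value for that action."""

    def __init__(self, board_dim=30, card_dim=16, n_board_feats=5, n_card_feats=5):
        super().__init__()
        # N1: hero / minion stats -> a few board features.
        self.n1 = nn.Sequential(nn.Linear(board_dim, 32), nn.ReLU(),
                                nn.Linear(32, n_board_feats))
        # N2: card id / target encoding -> card-category-like features.
        self.n2 = nn.Sequential(nn.Linear(card_dim, 32), nn.ReLU(),
                                nn.Linear(32, n_card_feats))
        # N3: combine the outputs of N1 and N2 into a single Q-value.
        self.n3 = nn.Sequential(nn.Linear(n_board_feats + n_card_feats, 16),
                                nn.ReLU(), nn.Linear(16, 1))

    def forward(self, board, card_and_target):
        feats = torch.cat([self.n1(board), self.n2(card_and_target)], dim=-1)
        return self.n3(feats)

# Because a card can fall into different categories depending on its target,
# loop over all (card, target) candidates and pick the one with the best Q.
net = PlayCardQNet()
board = torch.randn(1, 30)
candidates = [torch.randn(1, 16) for _ in range(3)]  # one encoding per (card, target)
best = max(candidates, key=lambda c: net(board, c).item())
```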

Online learning

Some conclusions first. This method is probably more suitable for tasks where there is no suitable static policy (or where there should not be one). For example, if the opponent is playing an aggro deck, the optimal default policy is P1; if the opponent is playing a control deck, the optimal default policy is P2. If policies P1 and P2 differ a lot, this kind of online learning might be useful. However, in Hearthstone, this does not seem to be the case. We usually aim to achieve a good board state (at least at the amateur level), which should be approximable by a state-value function.

  • Survey
  • Methods
    • Predicate-Average Sampling Technique
    • Features-to-Action Sampling Technique
  • Idea
    • A global table to store Q(a)
      • a is one of the actions
      • Q(a) is the average result of the playouts in which action a was chosen
      • Example: in Othello, a good action a is to place a piece in a corner
    • The global table can be updated in the back-propagation step
    • Use the table to bias the MCTS default policy (see the first sketch at the end of this page)
      • Choose actions using a Boltzmann distribution over Q(a)
    • Can use predicates to constrain the table
      • Example: chess piece is at a certain location
  • Hand-crafted predicates
    • Force-select actions:
      • If a character is attackable, choose to attack
        • Rather than play a card, etc.
        • Choose the attacker with the highest HP
      • If attacking the hero would kill him/her, attack the hero
        • If not, attack a minion
      • Choose a weaker defender
        • Defender dies, attacker survives
        • Choose the defender with the highest attack
      • Choose a defender of the same strength
        • Both attacker and defender die
        • Choose the defender with the highest attack
      • Choose a stronger defender
        • Attacker dies, defender survives
        • Choose the defender with the highest attack
      • If a weapon is equipped, do not play a weapon card
    • Predicate: In main-action:
      • Actions:
        • Play a minion when owning <= 3 minions
        • Play a minion when owning >= 4 minions
        • Attack minion
        • Attack hero
        • Hero power
        • End turn
      • Rationale: match the heuristic
        • End-turn should be chosen less often than play-card, attack, and hero-power
        • Hero-power should be chosen less often than play-card and attack
        • Prefer to play a minion when owning few minions
    • Predicate: choose target of a card with a particular card id
      • Actions:
        • Target an ally character
        • Target an enemy character
  • TODO
    • Refine definition of actions
      • If we carefully define actions, we can use RAVE (Rapid Action Value Estimation) to enhance the tree policy
      • And use MAST (Move-Average Sampling Technique) to enhance the default policy
    • Use (Predicate-) Rapid Action Value Estimation
    • Ideas to redefine actions (see the second sketch at the end of this page):
      • A minion is stronger/same/weaker than n minions
        • Or, just sum up the HP and attack to get a rough idea of how strong it is
        • This is a total ordering, so a rank can be used
      • Hero health + armor
      • Note: a game state can be seen/valued by the predicates which hold in that state
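
A minimal sketch of the global Q(a) table and Boltzmann sampling described under "Idea" above, in the spirit of MAST. The class name, the reward convention, and the temperature value are assumptions for illustration, not part of the project:

```python
import math
import random
from collections import defaultdict

class MastTable:
    """Global action-value table Q(a), updated during the MCTS back-propagation
    step and used to bias the default (playout) policy."""

    def __init__(self, temperature=1.0):
        self.total_reward = defaultdict(float)
        self.count = defaultdict(int)
        self.temperature = temperature

    def update(self, action, reward):
        # Called once per action on the playout path during back-propagation.
        self.total_reward[action] += reward
        self.count[action] += 1

    def q(self, action):
        # Average playout result of the action; neutral prior for unseen actions.
        if self.count[action] == 0:
            return 0.0
        return self.total_reward[action] / self.count[action]

    def sample(self, legal_actions):
        # Boltzmann (softmax) distribution over Q(a) of the currently legal actions.
        weights = [math.exp(self.q(a) / self.temperature) for a in legal_actions]
        return random.choices(legal_actions, weights=weights, k=1)[0]
```

Keying the table on (predicate, action) pairs instead of raw actions gives the predicate-constrained variant mentioned above.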
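
A minimal sketch of the action-redefinition ideas in the TODO above: rank a minion by attack + HP, and describe a game state by the predicates that hold in it. All field names (my_minions, opponent, hero, armor) are illustrative assumptions, not the engine's API:

```python
def minion_strength(minion):
    # Rough strength: sum of attack and HP; this gives a total ordering (a rank).
    return minion.attack + minion.hp

def state_predicates(state):
    """Describe a game state by the set of predicates that hold in it.
    Field names are illustrative only."""
    preds = set()
    for mine in state.my_minions:
        # "My minion is stronger than n enemy minions" for each friendly minion.
        stronger_than = sum(1 for enemy in state.opponent.minions
                            if minion_strength(mine) > minion_strength(enemy))
        preds.add(("my_minion_stronger_than_n_enemy_minions", stronger_than))
    # Hero health + armor as an effective-health predicate.
    preds.add(("enemy_hero_effective_health",
               state.opponent.hero.hp + state.opponent.hero.armor))
    return preds
```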