MCTS default policy

Goal

Goal: think like an amateur player

End turn only when nothing can be done
- Tend to use up crystals
Prefer attack a weaker minion by a stronger minion
- Rules to choose the defender
  - Attack the strongest minion (as long as attacker survives)
  - Otherwise, attack the strongest minion
  - Otherwise, attack hero
Play a spell only when it's effective
- need to categorize cards
  - AOE cards, apply enhance on minion, deal damage
  - TODO: sometimes effect depends on the target
    - E.g., +2/+2 on friendly demon, otherwise deal 2 damage
- need to categorize game boars
  - many opponent minions -> prefer to use an aoe

Online learning

Some conclusions first. This method is probably more suitable for tasks if there are no static policy. For example, if opponent is with an aggro deck, the optimal policy for default playout is P1; if opponent is with a control deck, the optimal policy for default playout is P2. If policy P1 and P2 differs much, then this kind of online learning might be useful. However, in Hearthstone, this seems not to be the case. We're usually aims to achieve a good board status, which should be able to be approximated by a state-value function.

Survey
- Recent Advances in General Game Playing
- Generalized Monte-Carlo Tree Search Extensions for General Game Playing
Methods
- Predicate-Average Sampling Technique
- Features-to-Action Sampling Technique
Idea
- A global table to store Q(a)
  - a is one of the actions
  - Q(a) is the number of times the action a is chosen
  - Example: in Othello, a good action a is to put a chess in the corner
- The global table can be updated in back-propagation step
- Use the table to bias the MCTS default policy
  - Choose actions using Boltzmann distribution over Q(a)
- Can use predicates to constraint the table
  - Example: chess piece is at a certain location
Hand-crafted predicates
- Force-select actions:
  - If a character is attackable, choose to attack
    - Rather than play a card, etc.
    - Choose the attacker with highest HP
  - If attack hero will make him/her died, attack hero
    - If not, attack minion
  - Choose a weaker defender
    - Defender dies, attacker survives
    - Choose the defender with highest attack
  - Choose a defender with same strength
    - Both attacker and defender died
    - Choose the defender with highest attack
  - Choose a stronger defender
    - Attacker dies, defender survives
    - Choose a defender with highest attack
  - If a weapon is equipped, do not play a weapon card
- Predicate: In main-action:
  - Actions:
    - Play a minion when owning <= 3 minions
    - Play a minion when owning >= 4 minions
    - Attack minion
    - Attack hero
    - Hero power
    - End turn
  - Rationale: match the heuristic
    - End-turn should be chosen less likely then play-card, attack, and hero-power
    - Hero-power should be chosen less likely then play-card, and attack
    - Prefer to play a minion when not much minions
- Predicate: choose target of a card with a particular card id
  - Actions:
    - Target an ally character
    - Target an enemy character
TODO
- Refine definition of actions
  - If we carefully define actions, we can use RAVE (rapid action value estimation) to enhance tree policy
  - And use MAST (move-averaging sampling technique) to enhance default policy
- Use (Predicate-) Rapid Action Value Estimation
  - http://ai.cs.unibas.ch/papers/theses/neidinger-bachelor-17-slides.pdf
- Ideas to redefine actions:
  - A minion is stronger/same/weaker than n minions
    - Or, just sum up the 'HP' and 'attack' to gain a rough idea how strong it is
    - This is a total ordering. A rank can be used
  - Hero health + armor
  - Note: a game state can be seen/valued by the predicates which holds in that state
Backup
- Self-Character 1 can attack Opponent-Character 1
  - and so on for self-character 1-8 and opponent character 1-8
  - total 8*8=64 features
- Self-Character 1 can attack Opponent-Character 1, attacker alive, defender die

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MCTS default policy

Goal

Online learning

Clone this wiki locally