Progressive Strategies For Monte-Carlo Tree Search

  1. Progressive Strategies For Monte-Carlo Tree Search Authors: G.M.J.B. Chaslot, M.H.M. Winands, J.W.H.M. Uiterwijk, H.J. van den Herik and B. Bouzy Presenter: Ling Zhao University of Alberta November 5, 2007

  2. Outline • Monte-Carlo Tree Search (MCTS) and its implementation in MANGO. • Progressive strategies: progressive bias and progressive unpruning. • Experiments. • Conclusions and future work.

  3. MCTS

  4. Selection • Process: select moves in the UCT tree for the best balance between exploitation and exploration. • This is a multi-armed bandit problem. • UCB formula: $k \in \arg\max_{i \in I} \left( v_i + C \sqrt{\frac{\ln n_p}{n_i}} \right)$, where k indexes the selected child of node p, $v_i$ is the value of node i, $n_i$ is the visit count of node i, $n_p$ is the visit count of node p, and C is a constant. • Selection precondition: $n_p \ge T$ (= 30).
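
To make the selection step concrete, here is a minimal sketch of the UCB rule above in Python. The `Node` class and its field names are hypothetical, chosen only to hold the quantities the slide names ($v_i$, $n_i$); MANGO itself is not written this way.

```python
import math
from dataclasses import dataclass, field

C = 0.7  # exploration constant; the slide only says C is a constant

@dataclass
class Node:
    visits: int = 0                 # n_i: visit count of this node
    value: float = 0.0              # v_i: average result of games through it
    children: list = field(default_factory=list)

def ucb_select(parent: Node) -> Node:
    # k = argmax_i ( v_i + C * sqrt(ln n_p / n_i) ).
    # Assumes every child has n_i > 0; unvisited children are handled
    # by the progressive-bias variant on slide 10.
    return max(
        parent.children,
        key=lambda ch: ch.value + C * math.sqrt(math.log(parent.visits) / ch.visits),
    )
```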

  5. Expansion • Process: for a given leaf node, decide whether it will be expanded by storing one or more of its children in the UCT tree. • Simple rule: expand one node per simulated game (the first node encountered that is not yet in the UCT tree). • In MANGO, when np = T (= 30), all of the node's children are expanded, as sketched below.
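
A sketch of the MANGO expansion rule as stated above. `legal_moves` and the caller-supplied `make_child` factory are illustrative assumptions, not MANGO's actual interface.

```python
T = 30  # expansion threshold from the slide

def maybe_expand(node, legal_moves, make_child):
    # Once T games have passed through the node, create UCT-tree
    # entries for all of its children at once.
    if node.visits == T and not node.children:
        node.children = [make_child(move) for move in legal_moves]
```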

  6. Simulation • Process: self-play until the end of the game. • Rules: 1. Disallow playing in one's own eyes. 2. Stop the game after a certain number of moves. • In MANGO, the probability of a move being selected in simulation is proportional to its urgency, the sum of its capture value, 3x3 pattern value, and proximity modification.
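
A sketch of urgency-proportional move sampling for the simulation step; `urgency` is a caller-supplied function standing in for MANGO's capture + 3x3 pattern + proximity sum.

```python
import random

def sample_move(moves, urgency):
    # Each candidate move is chosen with probability proportional
    # to its urgency value.
    weights = [urgency(m) for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]
```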

  7. Backpropagation • Process: use the result of a simulated game to update the nodes it traversed. • Result: +1 for a win, -1 for a loss, 0 for a draw. • $v_i$ of node i is computed by averaging the results of all simulated games played through it.
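
A sketch of the backpropagation step: each node on the traversed path gets the result folded into its running average, matching the definition of $v_i$ above. Per-player sign handling (a win for Black is a loss for White) is omitted for brevity.

```python
def backpropagate(path, result):
    # result: +1 win, -1 loss, 0 draw, from one player's perspective.
    for node in path:
        node.visits += 1
        # Incremental running average of all results through this node.
        node.value += (result - node.value) / node.visits
```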

  8. Progressive Strategies • Provide a soft transition between the selection strategy and the simulation strategy. • Intuition: the selection strategy becomes more accurate than the simulation strategy only when the number of simulated games is large. • A progressive strategy uses the information available to the selection strategy, plus some (possibly expensive) domain knowledge. • A progressive strategy behaves like the simulation strategy when few games have been played, and converges to the selection strategy when numerous games have been played.

  9. Progressive Bias • Direct the search using possibly expensive heuristic knowledge. • Modify the selection strategy, making sure the knowledge's influence decreases quickly as more games are played.

  10. Progressive Bias Formula • The selection value of child i becomes $v_i + C \sqrt{\frac{\ln n_p}{n_i}} + f(n_i)$ with $f(n_i) = \frac{H_i}{n_i + 1}$, where $H_i$ is a coefficient representing heuristic knowledge. • For children with $n_i = 0$, the UCB part is replaced by a constant M with M >> any $v_i$; thus among unvisited children the one with the highest $f(n_i)$ is selected. • If $n_p \in [30, 100]$, $f(n_i)$ is dominant. • If $n_p \in (100, 500]$, $f(n_i)$ has partial impact. • When $n_p > 500$, $f(n_i)$ is dominated, but can serve as a tie-breaker.
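
A sketch of selection with progressive bias, extending the UCB sketch from slide 4. The `H` field and the values of C and M are illustrative; the slide only requires M to exceed any $v_i$.

```python
import math

C = 0.7    # exploration constant (illustrative)
M = 1e9    # "M >> any v_i" for unvisited children

def progressive_bias_select(parent):
    def score(child):
        f = child.H / (child.visits + 1)   # f(n_i) = H_i / (n_i + 1)
        if child.visits == 0:
            return M + f                   # unvisited: highest f(n_i) wins
        ucb = child.value + C * math.sqrt(math.log(parent.visits) / child.visits)
        return ucb + f                     # bias fades as n_i grows
    return max(parent.children, key=score)
```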

  11. Alternative Approach • Using prior knowledge (Gelly and Silver): new nodes are initialized with a heuristic value and a virtual visit count before any real simulations pass through them. • "Scalability of this approach to larger board sizes is an open question".
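
For contrast, a sketch of the Gelly-and-Silver-style prior: instead of adding a bias term, a new node starts with virtual visits and a heuristic value, which real simulations then wash out. The names `q_prior` and `n_prior` and the default of 50 are illustrative, not taken from the slide.

```python
def init_with_prior(node, q_prior, n_prior=50):
    # Pretend n_prior games with average result q_prior have already
    # been played through this node; the running average then starts
    # from the heuristic and converges to the true value as real
    # visits accumulate.
    node.visits = n_prior
    node.value = q_prior
```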

  12. Progressive Unpruning • Artificially reduce the branching factor when the selection strategy is used. • Progressively increase the branching factor as more games are simulated. • Pruning and unpruning are done according to the heuristic values of the children.

  13. Progressive Unpruning (Details) • If $n_p = T$, only the $k_0$ (= 5) children with the highest heuristic values are left unpruned. • If $n_p > T$, $k = \lfloor \log_2(n_p / 40) \cdot 2.67 \rfloor + k_0$ children are left unpruned, as sketched below. • k = 5 ($n_p$ = 40), 7 ($n_p$ = 80), 10 ($n_p$ = 160). • A similar idea is used by Coulom (progressive widening).
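
A sketch of the unpruning schedule above, assuming `lg` is log base 2 and the result is floored (both consistent with the k values quoted on the slide).

```python
import math

K0, T = 5, 30

def unpruned_children(np_visits):
    # Number of children left unpruned after np visits of the parent.
    if np_visits <= T:
        return K0
    return int(math.log2(np_visits / 40) * 2.67) + K0

# unpruned_children(40) == 5, unpruned_children(80) == 7,
# unpruned_children(160) == 10, matching the slide.
```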

  14. Heuristic Values • Pattern value: learned offline using pattern matching (89,119 patterns extracted from 2,000 professional games). • Capture value: the number of stones the move captures, or saves from capture. • Proximity value: based on the Euclidean distance to the last move.

  15. Heuristic Value Formula • $H_i$ combines: • $C_i$: capture value • $P_i$: pattern value • $D_{k,i}$: distance to the k-th last move, weighted by an exponent $\alpha_k = 1.25 + k/2$ • Computing $P_i$ is the time-consuming part.
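
A sketch of how these ingredients might combine. The slide lists the components but not the exact combination, so the additive capture-plus-pattern term divided by distance-weighted proximity is an assumption; `capture_value` and `pattern_value` are caller-supplied stand-ins.

```python
import math

def heuristic_value(move, last_moves, capture_value, pattern_value):
    # H_i from C_i (capture), P_i (pattern), and proximity; only the
    # exponent alpha_k comes from the slide, the rest is assumed.
    h = capture_value(move) + pattern_value(move)
    for k, prev in enumerate(last_moves, start=1):
        alpha_k = 1.25 + k / 2                  # exponent from the slide
        h /= math.dist(move, prev) ** alpha_k   # assumed proximity form
    return h
```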

  16. Time For Computing Heuristics • Computing H is around 1,000 times slower than playing a move in a simulated game. • So H is computed only once per node, when T (= 30) games have been played through it (see the sketch below). • The speed reduction is only 4%, since the number of nodes with visit count >= 30 is low compared to the total number of moves in the simulated games.
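
A sketch of the compute-once caching described above; `compute_H` stands in for the expensive knowledge call, and `node.H` is assumed to start out as None.

```python
T = 30

def heuristic_of(node, compute_H):
    # H is ~1000x more expensive than a simulated move, so it is
    # evaluated a single time, once T games have gone through the
    # node, and cached on the node afterwards.
    if node.H is None and node.visits >= T:
        node.H = compute_H(node)
    return node.H
```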

  17. Domain Knowledge Calls Vs. T

  18. Visit Count Vs. Number of Nodes

  19. Experiments • Self-play games on the 13x13 board (10 sec per move): MANGO with progressive strategies won 91% of 500 games against MANGO without progressive strategies. • MANGO: 20,000 simulated games per move, about 1 sec on 9x9, 2 sec on 13x13, 5 sec on 19x19. • GNU Go: level 10 on 9x9 and 13x13, level 0 on 19x19.

  20. MANGO Vs. GNU Go

  21. MANGO Vs. GNU Go • Plain MCTS does not scale well to the 13x13 or 19x19 board. • Progressive strategies are useful on every board size. • The two progressive strategies combined are the most powerful, especially on 19x19.

  22. Tournament Results • Always in the top half. • But were negative results removed?

  23. Conclusions and Future Work • The two progressive strategies are useful, providing a soft transition between selection and simulation. • Their overhead is negligible. • Combine with RAVE and UCT with prior knowledge. • Combine with the advanced knowledge developed by Coulom. • Use life-and-death information. • Better progressive bias: P.-A. Coquelin and R. Munos. Bandit Algorithms for Tree Search. Technical Report 6141, INRIA, 2007.
