Imai Laboratory Introduction
Monte-Carlo Tree Search: A Revolutionary Search Algorithm Developed in Computer Game Research

The Multi-Armed Bandit Problem
A multi-armed bandit is a slot machine with K arms. Each arm pays out according to an unknown random sequence with an unknown expected value. The problem: get the maximum reward from a fixed number of coins. This leads to the exploration-exploitation dilemma. Explore: find out which arm is the best. Exploit: pull the arm that is the best so far.
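To make the setup concrete, here is a minimal Python model of such a bandit. It is only a sketch under one added assumption: the slides say just "unknown random sequences", so the Bernoulli (win/lose) payouts and the helper name make_bandit are invented for illustration.

```python
import random

def make_bandit(payout_probs):
    """Build a K-armed bandit: arm j pays 1.0 with hidden probability
    payout_probs[j], else 0.0. The player sees only the rewards."""
    return [lambda p=p: 1.0 if random.random() < p else 0.0
            for p in payout_probs]

arms = make_bandit([0.2, 0.5, 0.7])  # true means are hidden from the player
reward = arms[1]()                   # pulling an arm returns 0.0 or 1.0
```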

Policies for the Multi-Armed Bandit
The optimal policy is known, but it is slow and memory-consuming, so a fast, near-optimal policy is wanted. UCB1 (Upper Confidence Bound) was proposed in 2002 [1]. The idea: give a higher bias to the arm with fewer plays, because an arm that looks bad after only a few pulls may simply have been unlucky. The UCB1 policy is to pull the arm with the highest UCB1 value, the sum of the arm's expected value (its empirical mean) and the bias. UCB1 is easy to calculate, and its performance gap from the optimal policy is bounded.
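A minimal sketch of this policy in Python, reusing the hypothetical make_bandit helper above. The only formula assumed beyond the slides is the standard UCB1 index from [1]: empirical mean plus sqrt(2 ln n / n_j), where n is the total number of pulls and n_j the pulls of arm j.

```python
import math

def ucb1_play(arms, budget):
    """Pull a K-armed bandit `budget` times using the UCB1 policy [1]."""
    k = len(arms)
    plays = [0] * k       # n_j: how many times arm j was pulled
    totals = [0.0] * k    # summed reward of arm j

    # Initialisation: pull every arm once so each has an estimate.
    for j in range(k):
        totals[j] += arms[j]()
        plays[j] += 1

    for n in range(k, budget):
        # UCB1 value of arm j: empirical mean + sqrt(2 ln n / n_j).
        # The bias term grows for rarely pulled arms, forcing exploration.
        best = max(range(k),
                   key=lambda j: totals[j] / plays[j]
                   + math.sqrt(2.0 * math.log(n) / plays[j]))
        totals[best] += arms[best]()
        plays[best] += 1

    return [totals[j] / plays[j] for j in range(k)]  # empirical means
```

Run against make_bandit([0.2, 0.5, 0.7]) with a budget of a few thousand pulls, the plays concentrate on the 0.7 arm while the bias term keeps the weaker arms occasionally sampled.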

Tree Search Based on UCB1
The UCT algorithm (UCB applied to Trees) uses UCB1 for tree search. It is a brand-new tree search algorithm, proposed in 2006, that needs only a probability distribution: no evaluation function is needed! UCT is proved to converge to the optimal answer [2].
[Figure: a game tree searched by UCT, alternating max nodes and min nodes; each node is labelled with its UCB value, the sum of an expected value and a bias.]
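The loop behind UCT, sketched in Python. The game-state interface used here (legal_moves, play, is_terminal, winner, to_move) is an assumed, hypothetical API rather than anything specified in the slides, and the playout policy is uniform random.

```python
import math
import random

class Node:
    """One node of the UCT tree."""
    def __init__(self, state, parent=None):
        self.state = state                        # hypothetical game state
        self.parent = parent
        self.children = []
        self.untried = list(state.legal_moves())  # moves not yet expanded
        self.visits = 0
        self.wins = 0.0  # counted for the player who moved into this node

def uct_search(root_state, n_playouts):
    root = Node(root_state)
    for _ in range(n_playouts):
        node = root
        # 1. Selection: descend with UCB1 while fully expanded.
        while not node.untried and node.children:
            node = max(node.children,
                       key=lambda c: c.wins / c.visits
                       + math.sqrt(2.0 * math.log(node.visits) / c.visits))
        # 2. Expansion: add one child for a random untried move.
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.state.play(move), parent=node)
            node.children.append(child)
            node = child
        # 3. Playout: play random moves to the end of the game.
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(list(state.legal_moves())))
        winner = state.winner()
        # 4. Backup: credit each node from the viewpoint of the player
        #    who made the move into it, so max and min levels alternate.
        while node.parent is not None:
            node.visits += 1
            if winner == node.parent.state.to_move:
                node.wins += 1.0
            node = node.parent
        root.visits += 1
    # Recommend the most visited move from the root.
    return max(root.children, key=lambda c: c.visits)
```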

The Game of Go
Go originated in China and is popular in East Asian countries. The rules:
1. Black and White place stones on intersections in turns.
2. Stones connect in 4 directions and form blocks.
3. Win by occupying the greater area.
4. Blocks get captured if completely surrounded.
[Figure: example of a finished game on a 9x9 board; official games use a 19x19 board.]


Computer Go Difficulty
It is very difficult to make a "fast and exact" evaluation function for Go, and alpha-beta search does not work without an evaluation function. Because of the large number of legal moves and the difficulty of making evaluation functions, Go is widely regarded as the most difficult board game for computers.

Monte-Carlo Go
Monte-Carlo Go evaluates board positions by the average result of random play: with a few constraints, a random player can finish the game, so it is possible to evaluate a position by the resulting probability distribution. Computer Go players are improving rapidly with the combination of UCT and Monte-Carlo Go; CrazyStone [3] and MoGo [4] are the two strongest programs.
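A sketch of that evaluation, under the same assumed game-state interface as the UCT sketch above: the value of a position is simply the empirical win rate of uniformly random playouts.

```python
import random

def monte_carlo_value(state, n_playouts=1000):
    """Estimate a position's value for the side to move as the average
    result of random play (assumed interface: to_move, is_terminal,
    legal_moves, play, winner)."""
    player = state.to_move
    wins = 0
    for _ in range(n_playouts):
        s = state
        while not s.is_terminal():
            s = s.play(random.choice(list(s.legal_moves())))
        if s.winner() == player:
            wins += 1
    return wins / n_playouts  # empirical win probability
```

UCT then spends more playouts on the moves whose estimates look best, which is the bandit trade-off above applied at every node of the search tree.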

References
[1] P. Auer, N. Cesa-Bianchi and P. Fischer, "Finite-time analysis of the multi-armed bandit problem". Machine Learning, vol. 47, pp. 235-256, 2002.
[2] L. Kocsis and C. Szepesvári, "Bandit based Monte-Carlo planning". 15th European Conference on Machine Learning, pp. 282-293, 2006.
[3] R. Coulom, "Computing Elo ratings of move patterns in the game of Go". Computer Games Workshop, Amsterdam, The Netherlands, 2007.
[4] S. Gelly, Y. Wang, R. Munos and O. Teytaud, "Modification of UCT with patterns in Monte-Carlo Go". Technical Report 6062, INRIA, 2006.