Monte-Carlo Tree Search



  1. Monte-Carlo Tree Search Alan Fern

  2. Introduction • Rollout does not guarantee optimality or near-optimality • It only guarantees policy improvement (under certain conditions) • Theoretical Question: Can we develop Monte-Carlo methods that give us near-optimal policies? • With computation that does NOT depend on the number of states! • This was an open theoretical question until the late 90s. • Practical Question: Can we develop Monte-Carlo methods that improve smoothly and quickly with more computation time?

  3. Look-Ahead Trees • Rollout can be viewed as performing one level of search on top of the base policy • In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action • Can we generalize this to general stochastic MDPs? • Sparse Sampling is one such algorithm • Strong theoretical guarantees of near optimality [Figure: a look-ahead tree over actions a1 … ak; maybe we should search for multiple levels.]

  4. Online Planning with Look-Ahead Trees • At each state we encounter in the environment we build a look-ahead tree of depth h and use it to estimate the optimal Q-value of each action • Select the action with the highest Q-value estimate • s = current state • Repeat • T = BuildLookAheadTree(s) ;; sparse sampling or UCT ;; tree provides Q-value estimates for root actions • a = BestRootAction(T) ;; action with best Q-value • Execute action a in environment • s = the resulting state
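
As a rough illustration, the loop above might look as follows in Python; the environment interface and the function names (build_lookahead_tree, best_root_action) are placeholders for this sketch, not part of the slides.

def online_planning(env, build_lookahead_tree, best_root_action, num_steps):
    # s = current state
    s = env.current_state()
    for _ in range(num_steps):
        # Build the depth-h look-ahead tree (sparse sampling or UCT);
        # the tree provides Q-value estimates for the root actions.
        tree = build_lookahead_tree(s)
        # Pick the root action with the best Q-value estimate.
        a = best_root_action(tree)
        # Execute the action in the environment; s is the resulting state.
        s = env.execute(a)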

  5. Planning with Look-Ahead Trees [Figure: along the real-world state/action sequence, a look-ahead tree is built at each encountered state s; the tree branches on actions a1, a2, samples next states s11 … s1w and s21 … s2w, and records rewards R(sij, ak).]

  6. Expectimax Tree (depth 1) [Figure: alternate max & expectation; a root max node s (max over actions a, b) with child expectation nodes (weighted average over next states s1 … sn, weighted by the probability of occurring after taking a in s).] • After taking each action there is a distribution over next states (nature's moves) • The value of an action depends on the immediate reward and the weighted value of those states
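
In symbols (a sketch in the slides' notation, writing β for the discount factor and P(s'|s,a) for nature's transition distribution):

$$V^*(s,H) = \max_a Q^*(s,a,H), \qquad Q^*(s,a,H) = \mathbb{E}\big[R(s,a)\big] + \beta \sum_{s'} P(s' \mid s,a)\, V^*(s',H-1), \qquad V^*(s,0)=0.$$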

  7. Expectimax Tree (depth 2) [Figure: alternate max & expectation; a root max node s over actions a, b, child expectation nodes (weighted average over states), then another layer of max and expectation nodes.] Max Nodes: value equals the max of the values of the child expectation nodes. Expectation Nodes: value is the weighted average of the values of its children. Compute values from the bottom up (leaf values = 0). Select the root action with the best value.
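
The bottom-up computation can also be written as a short recursion. This is only a sketch: the MDP interface (actions, next_state_probs, reward, beta) is a hypothetical one, not something defined in the slides.

def expectimax_value(mdp, s, h):
    # Leaf values are 0.
    if h == 0:
        return 0.0
    best = float('-inf')
    # Max node: maximize over actions.
    for a in mdp.actions(s):
        # Expectation node: weighted average over next states.
        q = 0.0
        for s_next, p in mdp.next_state_probs(s, a):
            r = mdp.reward(s, a, s_next)
            q += p * (r + mdp.beta * expectimax_value(mdp, s_next, h - 1))
        best = max(best, q)
    return best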

  8. Exact Expectimax Tree [Figure: alternating max and expectation layers computing V*(s,H) and Q*(s,a,H).] In general we can grow the tree to any horizon H. Its size grows roughly like (k·n)^H for k actions and n possible next states, so it depends on the size of the state space. Bad!

  9. Sparse Sampling Tree [Figure: the same alternating structure estimating V*(s,H) and Q*(s,a,H), but each expectation is replaced by an average over w sampled next states.] With sampling width w the tree has (kw)^H leaves. Replace each expectation with an average over w samples; w will typically be much smaller than n.

  10. Sparse Sampling [Kearns et al. 2002] The Sparse Sampling algorithm computes the root value via depth-first expansion. It returns a value estimate V*(s,h) of state s and an estimated optimal action a*.
  SparseSample(s, h, w):
    If h == 0 Then Return [0, null]
    For each action a in s:
      Q*(s,a,h) = 0
      For i = 1 to w:
        Simulate taking a in s, resulting in state si and reward ri
        [V*(si,h-1), a*] = SparseSample(si, h-1, w)
        Q*(s,a,h) = Q*(s,a,h) + ri + β V*(si,h-1)
      Q*(s,a,h) = Q*(s,a,h) / w    ;; estimate of Q*(s,a,h)
    V*(s,h) = maxa Q*(s,a,h)       ;; estimate of V*(s,h)
    a* = argmaxa Q*(s,a,h)
    Return [V*(s,h), a*]
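
For concreteness, here is the same algorithm as a small Python function. The generative-model interface (mdp.actions(s) and mdp.simulate(s, a) returning a sampled next state and reward) and the attribute name mdp.beta are assumptions made for this sketch.

def sparse_sample(mdp, s, h, w):
    """Return (value_estimate, best_action) for state s at horizon h."""
    if h == 0:
        return 0.0, None
    best_q, best_a = float('-inf'), None
    for a in mdp.actions(s):
        q = 0.0
        for _ in range(w):                     # average over w sampled next states
            s_next, r = mdp.simulate(s, a)     # one call to the generative model
            v_next, _ = sparse_sample(mdp, s_next, h - 1, w)
            q += r + mdp.beta * v_next
        q /= w                                 # estimate of Q*(s, a, h)
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a                      # estimates of V*(s, h) and a*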

  11. SparseSample(h=2, w=2) Alternate max & averaging. [Figure: the tree is expanded from root state s under action a.] Max Nodes: value equals the max of the values of the child averaging nodes. Average Nodes: value is the average of the values of the w sampled children. Compute values from the bottom up (leaf values = 0). Select the root action with the best value.

  12. SparseSample(h=2, w=2) Alternate max & averaging. [Figure: the first recursive call SparseSample(h=1, w=2) under action a returns value 10.] Max Nodes: value equals the max of the values of the child averaging nodes. Average Nodes: value is the average of the values of the w sampled children. Compute values from the bottom up (leaf values = 0). Select the root action with the best value.

  13. SparseSample(h=2, w=2) Alternate max & averaging. [Figure: the second recursive call SparseSample(h=1, w=2) under action a returns value 0, alongside the earlier 10.] Max Nodes: value equals the max of the values of the child averaging nodes. Average Nodes: value is the average of the values of the w sampled children. Compute values from the bottom up (leaf values = 0). Select the root action with the best value.

  14. SparseSample(h=2, w=2) Alternate max & averaging. [Figure: averaging the two sampled values (10 and 0) gives the estimate 5 for action a.] Max Nodes: value equals the max of the values of the child averaging nodes. Average Nodes: value is the average of the values of the w sampled children. Compute values from the bottom up (leaf values = 0). Select the root action with the best value.

  15. SparseSample(h=2, w=2) Alternate max & averaging. [Figure: the completed tree with both actions a and b expanded; the recursive SparseSample(h=1, w=2) calls yield the values 3, 5, 2, 4 shown in the diagram.] Select action 'a' since its value is larger than b's. Max Nodes: value equals the max of the values of the child averaging nodes. Average Nodes: value is the average of the values of the w sampled children. Compute values from the bottom up (leaf values = 0). Select the root action with the best value.

  16. Sparse Sampling (Cont'd) • For a given desired accuracy, how large should the sampling width and depth be? • Answered: Kearns, Mansour, and Ng (1999) • Good news: gives values for w and H that achieve a policy arbitrarily close to optimal • These values are independent of the state-space size! • First near-optimal general MDP planning algorithm whose runtime didn't depend on the size of the state space • Bad news: the theoretical values are typically still intractably large (and exponential in H) • Exponential in H is the best we can do in general • In practice: use a small H and use a heuristic at the leaves

  17. Sparse Sampling (Cont'd) • In practice: use a small H and evaluate leaves with a heuristic • For example, if we have a base policy, leaves can be evaluated by estimating that policy's value (i.e. the average reward across simulated runs of the policy) [Figure: a shallow tree over actions a1, a2 whose leaves are evaluated by policy simulations.]
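
A hedged sketch of such a leaf heuristic: estimate the base policy's value at a leaf state by averaging discounted returns over a few simulated runs. The simulator interface and the parameter names are illustrative assumptions, not from the slides.

def policy_value_estimate(mdp, policy, s, depth, num_runs):
    total = 0.0
    for _ in range(num_runs):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(depth):
            a = policy(state)
            state, r = mdp.simulate(state, a)
            ret += discount * r
            discount *= mdp.beta
        total += ret
    return total / num_runs        # average reward across runs of the policy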

  18. Uniform vs. Adaptive Tree Search • Sparse sampling wastes time on bad parts of the tree • It devotes equal resources to each state encountered in the tree • We would like to focus on the most promising parts of the tree • But how do we balance exploring new parts of the tree vs. exploiting the promising parts? • We need an adaptive bandit algorithm that explores more effectively

  19. What now? • Adaptive Monte-Carlo Tree Search • UCB Bandit Algorithm • UCT Monte-Carlo Tree Search

  20. Bandits: Cumulative Regret Objective • Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step) • The optimal strategy (in expectation) is to pull the optimal arm n times • UniformBandit is a poor choice: it wastes time on bad arms • Must balance exploring machines to find good payoffs and exploiting current knowledge [Figure: a single state s with arms a1 … ak.]

  21. Cumulative Regret Objective • Theoretical results are often about the "expected cumulative regret" of an arm-pulling strategy. • Protocol: at time step n the algorithm picks an arm a_n based on what it has seen so far and receives reward r_n (a_n and r_n are random variables). • Expected Cumulative Regret E[Reg_n]: the difference between the optimal expected cumulative reward and the expected cumulative reward of our strategy after n pulls.
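
Written out (a reconstruction consistent with the slide's description, with $\mu^*$ the optimal arm's expected reward and $r_t$ the reward received at step $t$):

$$E[\mathrm{Reg}_n] \;=\; n\,\mu^* \;-\; E\!\left[\sum_{t=1}^{n} r_t\right].$$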

  22. UCB Algorithm for Minimizing Cumulative Regret Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2), 235-256. • Q(a): average reward for trying action a (in our single state s) so far • n(a): number of pulls of arm a so far • Action choice by UCB after n pulls: pick the arm maximizing Q(a) + sqrt(2 ln n / n(a)) • Assumes rewards are in [0,1]. We can always normalize if we know the max value.
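
As a small sketch (not the authors' code), UCB1 selection might look like this in Python, where q[a] is the average reward of arm a so far, counts[a] its number of pulls, and n the total number of pulls; rewards are assumed to lie in [0, 1].

import math

def ucb_select(q, counts, n):
    # Pull any arm that has not been tried yet.
    for a in q:
        if counts[a] == 0:
            return a
    # Otherwise maximize the value estimate plus the exploration bonus.
    return max(q, key=lambda a: q[a] + math.sqrt(2.0 * math.log(n) / counts[a]))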

  23. UCB: Bounded Sub-Optimality Value term: favors actions that looked good historically. Exploration term: actions get an exploration bonus that grows with ln(n). The expected number of pulls of a sub-optimal arm a is bounded by O((ln n) / Δa²), where Δa is the sub-optimality (gap) of arm a. So UCB doesn't waste much time on sub-optimal arms, unlike uniform sampling!

  24. UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002] Theorem: The expected cumulative regret of UCB after n arm pulls is bounded by O(log n). • Is this good? Yes. The average per-step regret is O((log n) / n), which goes to zero. Theorem: No algorithm can achieve a better expected regret (up to constant factors).

  25. What now? • Adaptive Monte-Carlo Tree Search • UCB Bandit Algorithm • UCT Monte-Carlo Tree Search

  26. UCT Algorithm [Kocsis & Szepesvari, 2006] • UCT is an instance of Monte-Carlo Tree Search • Applies principle of UCB • Similar theoretical properties to sparse sampling • Much better anytime behavior than sparse sampling • Famous for yielding a major advance in computer Go • A growing number of success stories

  27. Monte-Carlo Tree Search • Builds a sparse look-ahead tree rooted at the current state by repeated Monte-Carlo simulation of a "rollout policy" (what is the rollout policy? see the next slide) • During construction each tree node s stores: • state-visitation count n(s) • action counts n(s,a) • action values Q(s,a) • Repeat until time is up • Execute the rollout policy starting from the root until the horizon (this generates a state-action-reward trajectory) • Add the first node not in the current tree to the tree • Update the statistics of each tree node on the trajectory • Increment n(s) and n(s,a) for the selected action a • Update Q(s,a) by the total reward observed after the node
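
A compact sketch of this loop in Python. The rollout function (tree policy inside the tree, default policy outside it) and the state representation are left abstract; all names here are illustrative assumptions.

import time
from collections import defaultdict

def mcts(root, rollout, time_budget):
    tree = {root}                               # nodes already in the tree
    n_s = defaultdict(int)                      # state-visitation counts n(s)
    n_sa = defaultdict(int)                     # action counts n(s, a)
    q = defaultdict(float)                      # action values Q(s, a)
    deadline = time.time() + time_budget
    while time.time() < deadline:               # repeat until time is up
        # rollout returns a state-action-reward trajectory [(s, a, r), ...]
        trajectory = rollout(root, tree, n_s, n_sa, q)
        # Add the first node on the trajectory that is not yet in the tree.
        for s, _, _ in trajectory:
            if s not in tree:
                tree.add(s)
                break
        # Update the statistics of the tree nodes on the trajectory.
        total = 0.0
        for s, a, r in reversed(trajectory):
            total += r                          # total reward observed after the node
            if s in tree:
                n_s[s] += 1
                n_sa[(s, a)] += 1
                q[(s, a)] += (total - q[(s, a)]) / n_sa[(s, a)]   # running average of returns
    return q, n_s, n_sa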

  28. Rollout Policies • Monte-Carlo Tree Search algorithms mainly differ in their choice of rollout policy • Rollout policies have two distinct phases • Tree policy: selects actions at nodes already in the tree (each action must be selected at least once) • Default policy: selects actions after leaving the tree • Key Idea: the tree policy can use statistics collected from previous trajectories to intelligently expand the tree in the most promising direction • Rather than uniformly exploring actions at each node

  29. Example MCTS • For illustration purposes we will assume the MDP is: • Deterministic • The only non-zero rewards are at terminal/leaf nodes • The algorithm is well-defined without these assumptions

  30. Iteration 1 • At a leaf node the tree policy selects a random action, then executes the default policy • Initially the tree is a single leaf (the current world state) [Figure: a new tree node is added; the default policy runs to a terminal state with reward 1] • Assume all non-zero reward occurs at terminal nodes

  31. Iteration 2 • Must select each action at a node at least once [Figure: a second action is tried from the current world state; a new tree node is added and the default policy reaches a terminal state with reward 0]

  32. Iteration 3 • Must select each action at a node at least once [Figure: the root statistics are now 1/2, with child values 1 and 0]

  33. Iteration 3 (continued) • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: root 1/2; the tree policy descends into the child with value 1 rather than the child with value 0]

  34. Iteration 3 (continued) • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: the selected branch is expanded with a new tree node; the default policy rollout ends with reward 0]

  35. Iteration 4 • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: statistics updated along the trajectory: root 1/3, children 0 and 1/2, new node 0]

  36. Iteration 4 (continued) • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: node statistics 1/3, 0, 1/2, 0; the new rollout returns reward 1]

  37. • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: updated statistics 2/3 at the root, then 0, 2/3, 0, 1 below] • What is an appropriate tree policy? Default policy?

  38. UCT Algorithm [Kocsis & Szepesvari, 2006] • Basic UCT uses a random default policy • In practice one often uses a hand-coded or learned default policy • The tree policy is based on UCB: select the action maximizing Q(s,a) + K sqrt( ln n(s) / n(s,a) ) • Q(s,a): average reward received in trajectories so far after taking action a in state s • n(s,a): number of times action a has been taken in s • n(s): number of times state s has been encountered • K is a theoretically motivated constant that must be selected empirically in practice
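
As a sketch in the same notation (reusing the statistics dictionaries from the MCTS skeleton above), the tree-policy choice might look like this; the exploration constant K and the dictionary layout are assumptions made for illustration.

import math

def uct_tree_policy(s, actions, q, n_sa, n_s, K):
    # Each action must be selected at least once before the UCB rule applies.
    untried = [a for a in actions if n_sa[(s, a)] == 0]
    if untried:
        return untried[0]
    # Otherwise maximize the Q-estimate plus the UCB exploration bonus.
    return max(actions,
               key=lambda a: q[(s, a)] + K * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)]))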

  39. • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: at the root (1/3) the tree policy chooses between actions a1 and a2; node statistics 0, 1/2, 0, 1 below]

  40. • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: the tree policy descends from the root (1/3); node statistics 0, 1/2, 0, 1 below]

  41. • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: node statistics 1/3, 0, 1/2, 0, 1; the new rollout returns reward 0]

  42. • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: statistics updated: 1/4 at the root, then 0, 1/3, 1, 0/1, 0 below]

  43. • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: node statistics 1/4, 0, 1/3, 1, 0/1, 0; the new rollout returns reward 1]

  44. • When all of a node's actions have been tried once, select an action according to the tree policy [Figure: statistics updated: 2/5 at the root, then 1/2, 1/3, 1, 1, 0/1, 0 below]

  45. UCT Recap • To select an action at a state s: • Build a tree using N iterations of Monte-Carlo tree search • The default policy is uniform random • The tree policy is based on the UCB rule • Select the action that maximizes Q(s,a) (note that this final action selection does not use the exploration term, just the Q-value estimate) • The more simulations, the more accurate the estimates

  46. Some Successes • Computer Go • Klondike Solitaire (wins 40% of games) • General Game Playing Competition • Real-Time Strategy Games • Combinatorial Optimization • Crowd Sourcing • The list is growing • These applications usually extend UCT in some way

  47. Some Improvements • Use domain knowledge to handcraft a more intelligent default policy than random • E.g. don't choose obviously stupid actions • In Go a hand-coded default policy is used • Learn a heuristic function to evaluate positions • Use the heuristic function to initialize leaf nodes (otherwise initialized to zero)

  48. Practical Issues • Selecting K • There is no fixed rule • Experiment with different values in your domain • Rule of thumb – try values of K that are of the same order of magnitude as the reward signal you expect • The best value of K may depend on the number of iterations N • Usually a single value of K will work well for values of N that are large enough

  49. Practical Issues • UCT can have trouble building deep trees when actions can result in a large number of possible next states • Each time we try action 'a' we get a new state (new leaf) and very rarely resample an existing leaf [Figure: state s with actions a and b, each leading to many distinct sampled leaves.]

