
Life: play and win in 20 trillion moves Stuart Russell


Presentation Transcript


  1. Life: play and win in 20 trillion moves Stuart Russell

  2. White to play and win in 2 moves

  3. Life • 100 years x 365 days x 24 hrs x 3600 seconds x 640 muscles x 10/second = 20 trillion actions • (Not to mention choosing brain activities!) • The world has a very large, partially observable state space and an unknown model • So, being intelligent is “provably” hard • How on earth do we manage?

  4. Hierarchical structure!! (among other things) • Deeply nested behaviors give modularity • E.g., tongue controls for [t] independent of everything else given decision to say “tongue” • High-level decisions give scale • E.g., “go to IJCAI 13” is roughly 3,000,000,000 actions • Look further ahead, reduce computation exponentially [NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10, IJCAI 11]

  5. Reminder: MDPs • Markov decision process (MDP) defined by • Set of states S • Set of actions A • Transition model P(s’ | s,a) • Markovian, i.e., depends only on s_t and not on history • Reward function R(s,a,s’) • Discount factor γ • Optimal policy π* maximizes the value function V^π = Σ_t γ^t R(s_t, π(s_t), s_{t+1}) • Q(s,a) = Σ_{s’} P(s’ | s,a) [R(s,a,s’) + γ V(s’)]
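
As a concrete illustration (a sketch, not part of the talk), the definitions above map directly onto a small tabular MDP in Python; the class and attribute names here are invented for the example.

    class TabularMDP:
        """Minimal tabular MDP: states, actions, transition model P, reward R, discount gamma."""
        def __init__(self, states, actions, P, R, gamma=0.95):
            self.states = states      # set of states S
            self.actions = actions    # set of actions A
            self.P = P                # P[s][a] = list of (next_state, probability) pairs
            self.R = R                # R(s, a, s2) -> float
            self.gamma = gamma        # discount factor

        def q_value(self, s, a, V):
            """Q(s,a) = sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V(s')]."""
            return sum(prob * (self.R(s, a, s2) + self.gamma * V[s2])
                       for s2, prob in self.P[s][a])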

  6. Reminder: policy invariance • Take an MDP M with reward function R(s,a,s’) and any potential function Φ(s) • Then any policy that is optimal for M is optimal for any version M’ with reward function R’(s,a,s’) = R(s,a,s’) + γΦ(s’) - Φ(s) • So we can provide hints to speed up RL by shaping the reward function
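
A minimal sketch of potential-based shaping in Python, assuming the R(s,a,s’) convention above; phi is any user-chosen potential function.

    def shape_reward(R, phi, gamma):
        """Build R'(s,a,s') = R(s,a,s') + gamma*phi(s') - phi(s).
        By the policy-invariance result, optimal policies of the shaped MDP are unchanged."""
        def R_shaped(s, a, s2):
            return R(s, a, s2) + gamma * phi(s2) - phi(s)
        return R_shaped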

  7. Reminder: Bellman equation • Bellman equation for a fixed policy π: V^π(s) = Σ_{s’} P(s’|s,π(s)) [R(s,π(s),s’) + γ V^π(s’)] • Bellman equation for the optimal value function: V(s) = max_a Σ_{s’} P(s’|s,a) [R(s,a,s’) + γ V(s’)]
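
Iterating the optimal Bellman equation gives value iteration; the sketch below reuses the hypothetical TabularMDP class from the earlier snippet.

    def value_iteration(mdp, tol=1e-6):
        """Repeat V(s) <- max_a Q(s,a) until the largest change falls below tol."""
        V = {s: 0.0 for s in mdp.states}
        while True:
            delta = 0.0
            for s in mdp.states:
                best = max(mdp.q_value(s, a, V) for a in mdp.actions)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V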

  8. Reminder: Reinforcement learning • Agent observes transitions and rewards: doing a in s leads to s’ with reward r • Temporal-difference (TD) learning: V(s) <- (1-α)V(s) + α[r + γV(s’)] • Q-learning: Q(s,a) <- (1-α)Q(s,a) + α[r + γ max_{a’} Q(s’,a’)] • SARSA (after observing the next action a’): Q(s,a) <- (1-α)Q(s,a) + α[r + γ Q(s’,a’)]
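
The three update rules written as plain Python functions over tabular value estimates (a sketch with assumed dictionary representations, not code from the talk):

    def td_update(V, s, r, s2, alpha, gamma):
        """TD(0): V(s) <- (1-alpha) V(s) + alpha [r + gamma V(s')]."""
        V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s2])

    def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
        """Q-learning: Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]."""
        target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

    def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
        """SARSA: Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma Q(s',a')]."""
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s2, a2)])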

  9. Reminder: RL issues • Credit-assignment problem: which earlier actions deserve credit for a delayed reward? • Generalization problem: the state space is far too large to visit and learn about every state

  10. Abstract states • First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc.

  11. Abstract states • First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc.

  12. Abstract states • First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc. Oops – lost the Markov property! Action outcome depends on the low-level state, which depends on history

  13. Abstract actions • Forestier & Varaiya, 1978: supervisory control • An abstract action is a fixed policy for a region of the state space • “Choice states” are where the policy exits its region • The decision process on choice states is semi-Markov

  14. Semi-Markov decision processes • Add a random duration N for action executions: P(s’,N | s,a) • Modified Bellman equation: V^π(s) = Σ_{s’,N} P(s’,N | s,π(s)) [R(s,π(s),s’,N) + γ^N V^π(s’)] • Modified Q-learning with abstract actions: • For transitions within an abstract action π_a: r_c <- r_c + γ_c R(s,π_a(s),s’) ; γ_c <- γ_c γ • For transitions that terminate an abstract action begun at s_0: Q(s_0,π_a) <- (1-α) Q(s_0,π_a) + α [r_c + γ_c max_{a’} Q(s’,π_{a’})]
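
One way to realize the modified rule is to accumulate the discounted reward r_c and the compound discount γ_c while the abstract action's fixed policy runs, then apply a single update at termination. A sketch with assumed env and policy interfaces (not the talk's implementation):

    def execute_abstract_action(env, Q, s0, pi_a, abstract_actions, alpha, gamma):
        """Run abstract action pi_a from s0, then apply the SMDP Q-update."""
        s, r_c, gamma_c = s0, 0.0, 1.0
        while not pi_a.terminated(s):
            a = pi_a.act(s)            # low-level action from the fixed policy
            s2, r = env.step(a)
            r_c += gamma_c * r         # r_c <- r_c + gamma_c * R(s, pi_a(s), s')
            gamma_c *= gamma           # gamma_c <- gamma_c * gamma
            s = s2
        target = r_c + gamma_c * max(Q[(s, b)] for b in abstract_actions)
        Q[(s0, pi_a)] = (1 - alpha) * Q[(s0, pi_a)] + alpha * target
        return s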

  15. Hierarchical abstract machines • Hierarchy of NDFAs; states come in 4 types: • Action states execute a physical action • Call states invoke a lower-level NDFA • Choice points nondeterministically select next state • Stop states return control to invoking NDFA • NDFA internal transitions dictated by percept

  16. Running example • Peasants can move, pick up, and drop off resources • Penalty for collision • Cost of living each step • Reward for dropping off resources • Goal: gather 10 gold + 10 wood • Roughly (3L)^n states s • 7^n joint primitive actions a

  17. Part of 1-peasant HAM

  18. Joint SMDP • Physical states S, machine states Θ, joint state space Ω = S x Θ • Joint MDP: • Transition model for physical state (from physical MDP) and machine state (as defined by HAM) • A(ω) is fixed except when θ is at a choice point • Reduction to SMDP: • States are just those where θ is at a choice point

  19. Hierarchical optimality • Theorem: the optimal solution to the SMDP is hierarchically optimal for the original MDP, i.e., maximizes value among all behaviors consistent with the HAM

  20. Partial programs • HAMs (and Dietterich’s MAXQ hierarchies and Sutton’s Options) are all partially constrained descriptions of behaviors with internal state • A partial program is like an ordinary program in any programming language, with a nondeterministic choice operator added to allow for ignorance about what to do • HAMs, MAXQ hierarchies, and Options are trivially expressible as partial programs • The standard MDP is the trivial partial program “loop forever: choose an action”
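
A minimal Python sketch of the idea: a partial program is ordinary code, except that unresolved decisions are delegated to a choose operator whose behavior is supplied by the learned completion. All names here are illustrative, not the ALisp implementation.

    def choose(label, options, completion, omega):
        """Nondeterministic choice point: the completion decides among options in joint state omega."""
        return completion.pick(label, options, omega)

    def standard_mdp_program(env, completion):
        """The 'standard MDP' partial program: loop forever, choose an action."""
        omega = env.reset()
        while True:
            a = choose('act', env.actions, completion, omega)
            omega, reward = env.step(a)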

  21. RL and partial programs (diagram: the learning algorithm fills in the partial program’s choice points, exchanging actions a and observations s, r with the environment to produce a completion)

  22. RL and partial programs (same diagram) • The learned completion is hierarchically optimal for all terminating partial programs

  23. Single-threaded ALisp program • Program state θ includes: program counter, call stack, global variables

    (defun top ()
      (loop do
        (choose 'top-choice
          (gather-wood)
          (gather-gold))))

    (defun gather-wood ()
      (with-choice 'forest-choice (dest *forest-list*)
        (nav dest)
        (action 'get-wood)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun gather-gold ()
      (with-choice 'mine-choice (dest *goldmine-list*)
        (nav dest)
        (action 'get-gold)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun nav (dest)
      (until (= (pos (get-state)) dest)
        (with-choice 'nav-choice (move '(N S E W NOOP))
          (action move))))

  24. Q-functions • Represent completions using a Q-function • Joint state ω = [s,θ]: environment state + program state • MDP + partial program = SMDP over {ω} • Q^π(ω,u) = “expected total reward if I make choice u in ω and follow completion π thereafter” • Modified Q-learning [AR 02] finds the optimal completion

  25. Internal state • Availability of internal state (e.g., the goal stack) can greatly simplify value functions and policies • E.g., while navigating to location (x,y), moving towards (x,y) is a good idea • The natural local shaping potential (distance from the destination) is impossible to express in terms of the external state alone
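
For instance, a shaping potential that uses internal state might be the negative distance to the destination of the currently executing nav call; the accessors below are hypothetical, used only to illustrate the point.

    def nav_shaping_potential(omega):
        """Phi(omega) = -(Manhattan distance to the current nav destination).
        The destination lives on the program's call stack (internal state),
        so this potential cannot be expressed from the environment state alone."""
        x, y = omega.env_state.agent_pos               # hypothetical accessor
        gx, gy = omega.program_state.current_nav_dest  # hypothetical accessor
        return -(abs(x - gx) + abs(y - gy))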

  26. Temporal Q-decomposition (diagram: call stacks Top / GatherGold / Nav(Mine2) and Top / GatherWood / Nav(Forest1), with the trajectory split into Q_r, Q_c, Q_e segments) • Temporal decomposition Q = Q_r + Q_c + Q_e, where • Q_r(ω,u) = reward while doing u (may be many steps) • Q_c(ω,u) = reward in current subroutine after doing u • Q_e(ω,u) = reward after current subroutine

  27. State abstraction • Temporal decomposition => state abstraction: e.g., while navigating, Q_c is independent of gold reserves • In general, local Q-components can depend on few variables => fast learning

  28. Handling multiple effectors: multithreaded agent programs (diagram: threads Get-gold, Get-wood, (Defend-Base), Get-wood, each controlling a subset of effectors) • Threads = tasks • Each effector assigned to a thread • Threads can be created/destroyed • Effectors can be reassigned • Effectors can be created/destroyed

  29. An example concurrent ALisp program, compared with the single-threaded version

    ;; Concurrent ALisp: top spawns a gathering thread for each peasant effector
    (defun top ()
      (loop do
        (until (my-effectors) (choose 'dummy))
        (setf peas (first (my-effectors)))
        (choose 'top-choice
          (spawn gather-wood peas)
          (spawn gather-gold peas))))

    (defun gather-wood ()
      (with-choice 'forest-choice (dest *forest-list*)
        (nav dest)
        (action 'get-wood)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun gather-gold ()
      (with-choice 'mine-choice (dest *goldmine-list*)
        (nav dest)
        (action 'get-gold)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun nav (dest)
      (until (= (my-pos) dest)
        (with-choice 'nav-choice (move '(N S E W NOOP))
          (action move))))

    ;; Single-threaded version of top, for comparison
    (defun top ()
      (loop do
        (choose 'top-choice
          (gather-gold)
          (gather-wood))))

  30. Concurrent ALisp semantics (animation: at environment timesteps 26 and 27, the two peasant threads alternate between Running, Paused, Making joint choice, and Waiting for joint action)

    ;; Thread 1 (Peasant 1), currently inside nav toward Wood1
    (defun gather-wood ()
      (setf dest (choose *forests*))
      (nav dest)
      (action 'get-wood)
      (nav *home-base-loc*)
      (action 'put-wood))

    ;; Thread 2 (Peasant 2), currently inside nav toward Gold2
    (defun gather-gold ()
      (setf dest (choose *goldmines*))
      (nav dest)
      (action 'get-gold)
      (nav *home-base-loc*)
      (action 'put-gold))

    (defun nav (dest)
      (loop until (at-pos dest)
        do (setf d (choose '(N S E W R)))
           (action d)))

  31. Concurrent Alisp semantics • Threads execute independently until they hit a choice or action • Wait until all threads are at a choice or action • If all effectors have been assigned an action, do that joint action in environment • Otherwise, make joint choice
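
A sketch of one synchronization round of these semantics in Python; the thread and environment interfaces are assumptions made for the example, not from the talk.

    def concurrent_step(threads, env, completion):
        """Run all threads to their next choice/action, then act or choose jointly."""
        # 1. Each thread executes independently until it hits a choice or an action.
        for t in threads:
            t.run_until_choice_or_action()
        # 2. If every effector has been assigned an action, do that joint action.
        if all(t.pending_action is not None for t in threads):
            joint_action = {t.effector: t.pending_action for t in threads}
            env.do(joint_action)
            for t in threads:
                t.pending_action = None
        # 3. Otherwise, the threads sitting at choice points make a joint choice.
        else:
            choosers = [t for t in threads if t.at_choice]
            joint_choice = completion.pick_joint(choosers)
            for t, c in zip(choosers, joint_choice):
                t.resume_with(c)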

  32. Q-functions • To complete the partial program, at each choice state ω we need to specify choices for all choosing threads • So Q(ω,u) as before, except u is a joint choice • Suitable SMDP Q-learning gives the optimal completion (example Q-function shown in figure)

  33. Problems with concurrent activities • Temporal decomposition of Q-function lost • No credit assignment among threads • Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly • Peasant 2 thinks he’s done very well!! • Significantly slows learning as number of peasants increases

  34. Threadwise decomposition • Idea: decompose reward among threads (Russell & Zimdars, 2003) • E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants • Q_j^π(ω,u) = “expected total reward received by thread j if we make joint choice u and then do π” • Threadwise Q-decomposition Q = Q_1 + … + Q_n • Recursively distributed SARSA => global optimality
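
In code, the decomposition amounts to each thread keeping its own Q_j and learning from its own reward stream, while the joint choice maximizes the sum; a sketch with tabular dictionaries and SARSA-style updates (illustrative, not the talk's implementation):

    def joint_choice(Q_components, omega, joint_choices):
        """Pick u maximizing Q(omega,u) = sum_j Q_j(omega,u)."""
        return max(joint_choices,
                   key=lambda u: sum(Qj[(omega, u)] for Qj in Q_components))

    def threadwise_sarsa_update(Q_components, omega, u, rewards, omega2, u2, alpha, gamma):
        """Each thread j updates its own Q_j from its own decomposed reward r_j."""
        for Qj, rj in zip(Q_components, rewards):
            Qj[(omega, u)] = (1 - alpha) * Qj[(omega, u)] + alpha * (rj + gamma * Qj[(omega2, u2)])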

  35. Learning threadwise decomposition (diagram: at an action state, the top thread issues action a and the environment reward r is decomposed into r_1, r_2, r_3, routed to the Peasant 1, 2, and 3 threads)

  36. Learning threadwise decomposition (diagram: at a choice state ω, each peasant thread performs its own Q-update on its component Q_1(ω,·), Q_2(ω,·), Q_3(ω,·); the top thread sums the components and picks u = argmax of the total Q(ω,·))

  37. Threadwise and temporal decomposition (diagram: top thread and the three peasant threads) • Q_j = Q_{j,r} + Q_{j,c} + Q_{j,e}, where • Q_{j,r}(ω,u) = expected reward gained by thread j while doing u • Q_{j,c}(ω,u) = expected reward gained by thread j after u but before leaving the current subroutine • Q_{j,e}(ω,u) = expected reward gained by thread j after the current subroutine

  38. Resource gathering with 15 peasants (learning curves: reward of the learnt policy vs. number of learning steps ×1000, for Threadwise + Temporal, Threadwise, Undecomposed, and Flat learners)

  39. Summary of Part I • Structure in behavior seems essential for scaling up • Partial programs • Provide natural structural constraints on policies • Decompose value functions into simple components • Include internal state (e.g., “goals”) that further simplifies value functions, shaping rewards • Concurrency • Simplifies description of multieffector behavior • Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)

  40. Outline • Hierarchical RL with abstract actions • Hierarchical RL with partial programs • Efficient hierarchical planning

  41. Abstract lookahead (diagram: a chess lookahead tree with moves such as Kc3, Rxa1, Kxa1, Qa4, c2, alongside an abstract lookahead tree over high-level choices ‘a’, ‘b’ expanding to aa, ab, ba, bb) • k-step lookahead >> 1-step lookahead, e.g., chess • k-step lookahead is no use if the steps are too small, e.g., the first k characters of a NIPS paper • Abstract plans (high-level executions of partial programs) are shorter • Much shorter plans for given goals => exponential savings • Can look ahead much further into the future

  42. High-level actions (diagram: [Act] refines to [GoSFO, Act] or [GoOAK, Act]; [GoSFO, Act] refines further to [WalkToBART, BARTtoSFO, Act]) • Start with a restricted, classical, HTN-style form of partial program • A high-level action (HLA) has a set of possible immediate refinements into sequences of actions, primitive or high-level • Each refinement may have a precondition on its use • Executable plan space = all primitive refinements of Act
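
A sketch of the data structures in Python (all names are illustrative): an HLA carries its refinements, and the executable plan space is the set of primitive refinements of [Act]. The enumeration is depth-bounded because refinement hierarchies can be recursive, and it ignores preconditions and state progression for brevity.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass(frozen=True)
    class Refinement:
        precondition: Callable[[object], bool]   # state -> bool (may be trivially True)
        steps: tuple                             # sequence of primitive names or HLAs

    @dataclass(frozen=True)
    class HLA:
        name: str
        refinements: tuple = ()

    def primitive_refinements(plan, max_depth=10):
        """Enumerate primitive refinements of a plan (mix of primitive names and HLAs)."""
        if not plan:
            yield []
            return
        head, rest = plan[0], list(plan[1:])
        if isinstance(head, HLA):
            if max_depth == 0:
                return
            for ref in head.refinements:
                yield from primitive_refinements(list(ref.steps) + rest, max_depth - 1)
        else:
            for tail in primitive_refinements(rest, max_depth):
                yield [head] + tail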

  43. Lookahead with HLAs • To do offline or online planning with HLAs, need a model • See classical HTN planning literature

  44. Lookahead with HLAs • To do offline or online planning with HLAs, need a model • See classical HTN planning literature – not! • Drew McDermott (AI Magazine, 2000): • The semantics of hierarchical planning have never been clarified … no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions

  45. Downward refinement • Downward refinement property (DRP) • Every apparently successful high-level plan has a successful primitive refinement • “Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking

  46. Downward refinement • Downward refinement property (DRP) • Every apparently successful high-level plan has a successful primitive refinement • “Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking • Bacchus and Yang, 1991: It is naïve to expect DRP to hold in general

  47. Downward refinement • Downward refinement property (DRP) • Every apparently successful high-level plan has a successful primitive refinement • “Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking • Bacchus and Yang, 1991: It is naïve to expect DRP to hold in general • MRW, 2007: Theorem: If assertions about HLAs are true, then DRP always holds

  48. Downward refinement • Downward refinement property (DRP) • Every apparently successful high-level plan has a successful primitive refinement • “Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking • Bacchus and Yang, 1991: It is naïve to expect DRP to hold in general • MRW, 2007: Theorem: If assertions about HLAs are true, then DRP always holds • Problem: how to say true things about effects of HLAs?

  49. Angelic semantics for HLAs (diagram: a state space S with start state s0, goal g, primitive actions a1, a2, a3, a4, and the reachable sets of HLAs h1, h2 and of the sequence h1h2; since g lies in the reachable set of h1h2, the plan h1h2 is a solution) • Begin with the atomic state-space view: start in s0, goal state g • The central idea is the reachable set of an HLA from each state • When extended to sequences of actions, this allows proving that a plan can or cannot possibly reach the goal • May seem related to nondeterminism • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not an adversary or nature
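
A sketch of the reachable-set calculation in Python; the reachable function, mapping an HLA and a state to its reachable set, is an assumed input, and exact reachable sets are used here so the test is both sound and complete.

    def plan_reachable_set(s0, plan, reachable):
        """Propagate reachable sets through a sequence of HLAs, starting from {s0}."""
        current = {s0}
        for h in plan:
            next_states = set()
            for s in current:
                next_states |= reachable(h, s)  # states the agent can choose to reach via h
            current = next_states
        return current

    def can_possibly_reach(s0, plan, goal, reachable):
        """Angelic test: the plan is a solution iff the goal lies in its reachable set.
        The nondeterminism is angelic, so 'reachable' means the agent can make it happen."""
        return goal in plan_reachable_set(s0, plan, reachable)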
