Life: play and win in 20 trillion moves

dionne

Presentation Transcript


  1. Life: play and win in 20 trillion moves

  2. White to play and win in 2 moves

  3. Life • 100 years x 365 days x 24 hrs x 3600 seconds x 640 muscles x 10/second = 20 trillion actions • (Not to mention choosing brain activities!) • And the world has a very large state space • So, being intelligent is “provably” hard • How on earth do we manage?
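
The estimate on slide 3 can be checked directly. The sketch below multiplies out the factors exactly as the slide lists them; the variable names are chosen here for readability and are not from the talk.

```python
# Factors taken directly from slide 3.
YEARS = 100
DAYS_PER_YEAR = 365
HOURS_PER_DAY = 24
SECONDS_PER_HOUR = 3600
MUSCLES = 640
ACTIONS_PER_MUSCLE_PER_SECOND = 10

# Seconds in a 100-year life, then one action per muscle at 10 Hz.
seconds_in_life = YEARS * DAYS_PER_YEAR * HOURS_PER_DAY * SECONDS_PER_HOUR
total_actions = seconds_in_life * MUSCLES * ACTIONS_PER_MUSCLE_PER_SECOND

print(f"{seconds_in_life:,} seconds")  # 3,153,600,000 seconds
print(f"{total_actions:,} actions")    # 20,183,040,000,000 actions
```

The product is about 2.02 x 10^13, which the slide rounds to 20 trillion.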

  4. Outline • Rationality and Intelligence • Efficient hierarchical planning • Hierarchical RL with partial programs • Next steps towards AI

  6. Intelligence • Turing’s definition of intelligence as human mimicry is nonconstructive, informal • AI needs a formal, constructive, broad definition of intelligence – call it Int – that supports analysis and synthesis • “Look, my system is Int !!!” • Is this claim interesting? • Is this claim ever true?

  8. Candidate definitions • Int1 : perfect rationality • Logical AI: always act to achieve the goal • Economics, modern AI, stochastic optimal control, OR: act to maximize expected utility • Very interesting claim • Almost never true – requires instant computation!!

  10. Candidate definitions • Int2 : calculative rationality • Calculate what would have been the perfectly rational action had it been done immediately • Algorithms known for many settings – covers much of classical and modern AI • Not useful for the real world! We always end up cutting corners, often in ad hoc ways

  12. Candidate definitions • Int3 : metalevel rationality • View computations as actions; agent chooses the ones providing maximal utility • ≈ expected decision improvement minus cost of time • Potentially very interesting claim – superefficient anytime approximation algorithms!! • Almost never possible – metalevel problem harder than object-level problem. • Myopic metalevel control effective for search and games
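
A minimal sketch of myopic metalevel control, assuming each candidate computation comes with an estimate of its expected decision improvement and its time cost. The `Computation` class, its fields, and the numbers are hypothetical illustrations, not part of the talk; only the rule "net value ≈ expected decision improvement minus cost of time" comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Computation:
    """A candidate computation, treated as an action (hypothetical structure)."""
    name: str
    expected_improvement: float  # estimated gain in decision quality
    time_cost: float             # utility lost by delaying the real action

    def net_value(self) -> float:
        # Slide 12: value ≈ expected decision improvement minus cost of time.
        return self.expected_improvement - self.time_cost


def myopic_choice(candidates):
    """Pick the single computation with the highest net value.

    Returns None when no computation has positive net value, meaning the
    agent should stop deliberating and act. "Myopic" here means each
    computation is judged on its own, as if it were the last one.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.net_value())
    return best if best.net_value() > 0 else None


candidates = [
    Computation("expand_a", expected_improvement=0.30, time_cost=0.10),
    Computation("expand_b", expected_improvement=0.05, time_cost=0.10),
]
chosen = myopic_choice(candidates)
print(chosen.name if chosen else "act now")
```

With these made-up estimates the agent runs `expand_a`; once every remaining computation costs more than it is expected to gain, `myopic_choice` returns `None` and the agent acts.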

  14. Candidate definitions • Int4 : bounded optimality • The agent program is the best that can run on its fixed architecture • Very interesting if true • Always exists (by definition), but may be hard to find • Some early results (~1995) on (A)BO construction: • Optimal sequence of classifiers exhibiting time/accuracy tradeoff for real-time decisions • Optimal time allocation within arbitrary compositions of anytime algorithms • Can we solve for more interesting architectures?

  15. Outline • Rationality and Intelligence • Efficient hierarchical planning • Hierarchical RL with partial programs • Next steps towards AI

  16. Life: Play and win in 20,000,000,000,000* moves • How do we do it? Hierarchical structure!! • Deeply nested behaviors give modularity • E.g., tongue controls for [t] independent of everything else given decision to say “tongue” • High-level decisions give scale • E.g., “go to IJCAI” → 3,000,000,000 actions • Look further ahead, reduce computation exponentially [NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10]

  17. High-level actions • Classical HTN (Hierarchical Task Network) • A high-level action (HLA) has a set of possible refinements into sequences of actions, primitive or high-level • Hierarchical optimality = best primitive refinement of Act • (A restricted subset of simple linear agent programs) • [Figure: refinement tree in which [Act] refines to [GoCDG, Act] or [GoOrly, Act], and [GoCDG, Act] refines to [WalkToRER, RERtoCDG, Act]]
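
The notion of hierarchical optimality on slide 17 can be illustrated with a small sketch: enumerate every primitive refinement of `Act` and keep the cheapest. The names `GoCDG`, `GoOrly`, `WalkToRER`, and `RERtoCDG` come from the slide; `TaxiToCDG`, `TaxiToOrly`, `Board`, and all costs are hypothetical. The slide's recursive `[GoCDG, Act]` is replaced here by a finite hierarchy so that the enumeration terminates.

```python
# Hypothetical refinement table: each HLA maps to its possible refinements.
REFINEMENTS = {
    "Act": [["GoCDG", "Board"], ["GoOrly", "Board"]],
    "GoCDG": [["WalkToRER", "RERtoCDG"], ["TaxiToCDG"]],
    "GoOrly": [["TaxiToOrly"]],
}

# Hypothetical costs of primitive actions.
COST = {"WalkToRER": 1, "RERtoCDG": 10, "TaxiToCDG": 50, "TaxiToOrly": 35, "Board": 0}


def primitive_refinements(plan):
    """Yield every fully primitive sequence obtainable by refining `plan`."""
    for i, step in enumerate(plan):
        if step in REFINEMENTS:
            # Refine the first HLA in every possible way, then recurse.
            for option in REFINEMENTS[step]:
                yield from primitive_refinements(plan[:i] + option + plan[i + 1:])
            return
    yield plan  # no HLA left: the plan is primitive


def plan_cost(plan):
    return sum(COST[action] for action in plan)


# Hierarchically optimal plan = best primitive refinement of Act.
best = min(primitive_refinements(["Act"]), key=plan_cost)
print(best, plan_cost(best))
```

Exhaustive enumeration is used only to make the definition concrete; the point of the later slides is to avoid it by reasoning about HLAs before refining them.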

  19. Semantics of HLAs • To do offline or online planning with HLAs, need a model describing outcome of each HLA • Drew McDermott (AI Magazine, 2000): • The semantics of hierarchical planning have never been clarified … no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions

  23. Downward refinement • Downward refinement property (DRP) • Every apparently successful high-level plan has a successful primitive refinement • “Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking • Bacchus and Yang, 1991: It is naïve to expect DRP to hold in general • MRW, 2007: Theorem: If assertions about HLAs are true, then DRP always holds • Problem: how to say true things about effects of HLAs?

  24. Angelic semantics for HLAs • Start with atomic state-space view, start in s0, goal state g • Central idea is the reachable set of an HLA from each state • When extended to sequences of actions, allows proving that a plan can or cannot possibly reach the goal • May seem related to nondeterminism • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not an adversary or nature • [Figure: state space S with start state s0, goal g, and primitive actions a1 to a4, showing the reachable sets of h1, h2, and the sequence h1h2; h1h2 is a solution]
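
A minimal sketch of the reachable-set idea in an atomic state space. The table `REACH` and the states `s1` to `s4` are hypothetical; only `s0`, `g`, `h1`, and `h2` appear on the slide. Because the nondeterminism is angelic, a high-level plan counts as a solution as soon as the goal is somewhere in its reachable set: the agent gets to pick the refinement that lands there.

```python
# REACH[hla][state] = states that some refinement of `hla` can reach from `state`.
# All entries are made up for illustration.
REACH = {
    "h1": {"s0": {"s1", "s2"}},
    "h2": {"s1": {"s3"}, "s2": {"g", "s4"}},
}


def reachable(plan, start):
    """Reachable set of a sequence of HLAs: compose the per-HLA sets in order."""
    states = {start}
    for hla in plan:
        states = set().union(*(REACH[hla].get(s, set()) for s in states))
    return states


def is_solution(plan, start, goal):
    """Angelic test: the plan works if the agent can choose to end in the goal."""
    return goal in reachable(plan, start)


print(is_solution(["h1", "h2"], "s0", "g"))  # True
print(is_solution(["h1"], "s0", "g"))        # False
```

An adversarial (demonic) reading would instead require every state in the reachable set to be a goal state, which is why the two kinds of nondeterminism lead to different planners.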

  25. Technical development • NCSTRIPS to describe reachable sets • STRIPS add/delete plus “possibly add”, “possibly delete”, etc • E.g., GoCDG adds AtCDG, possibly adds CarAtCDG • Reachable sets may be too hard to describe exactly • Upper and lower descriptions bound the exact reachable set (and its cost) above and below • Still support proofs of plan success/failure • Possibly-successful plans must be refined • Sound, complete, optimal offline planning algorithms • Complete, eventually-optimal online (real-time) search
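
The two ideas on slide 25 can be sketched together. The first function expands an NCSTRIPS-style description into a set of states, assuming each "possibly" effect may independently happen or not; that reading, the function names, and the `HaveTicket` proposition are assumptions made here. The second function shows how lower and upper bounds on the reachable set still support proofs of success and failure.

```python
from enum import Enum
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a tuple."""
    items = list(items)
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def reachable_set(state, add=(), possibly_add=(), delete=(), possibly_delete=()):
    """States reachable from `state`; each state is a frozenset of propositions."""
    base = (frozenset(state) - frozenset(delete)) | frozenset(add)
    return {
        (base | frozenset(extra)) - frozenset(dropped)
        for extra in subsets(possibly_add)
        for dropped in subsets(possibly_delete)
    }


class Verdict(Enum):
    SUCCEEDS = "provably succeeds"
    FAILS = "provably fails"
    REFINE = "possibly successful, must be refined"


def classify(lower, upper, goal):
    """`lower` is a subset and `upper` a superset of the exact reachable set."""
    if any(goal <= s for s in lower):
        return Verdict.SUCCEEDS  # even the lower bound reaches the goal
    if not any(goal <= s for s in upper):
        return Verdict.FAILS     # even the upper bound misses the goal
    return Verdict.REFINE


# Slide example: GoCDG adds AtCDG and possibly adds CarAtCDG.
start = {"HaveTicket"}  # hypothetical starting state
upper = reachable_set(start, add={"AtCDG"}, possibly_add={"CarAtCDG"})
lower = {frozenset({"HaveTicket", "AtCDG"})}  # a deliberately cautious lower bound

print(classify(lower, upper, frozenset({"AtCDG"})))
print(classify(lower, upper, frozenset({"AtCDG", "CarAtCDG"})))
print(classify(lower, upper, frozenset({"AtOrly"})))
```

Only the middle case forces the planner to look inside the HLA; the other two are settled at the abstract level.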

  26. Online Search: Preliminary Results • [Figure: steps to reach goal]

  27. Outline • Rationality and Intelligence • Efficient hierarchical planning • Hierarchical RL with partial programs • Next steps towards AI

  28. Temporal abstraction in RL • Basic theorem (Forestier & Varaiya 78; Parr & Russell 98): • Given an underlying Markov decision process with primitive actions • Define temporally extended choice-free actions • Agent is in a choice state whenever an extended action terminates • Choice states + extended actions form a semi-Markov decision process • Hierarchical structures with unspecified choices = know-how • Hierarchical Abstract Machines [Parr & Russell 98] (= recursive NDFAs) • Options [Sutton & Precup 98] (= extended choice-free actions) • MAXQ [Dietterich 98] (= HTN hierarchy) • General partial programs: agent program falls in a designated restricted subset of arbitrary programs • ALisp [Andre & Russell 02] • Concurrent ALisp [Marthi et al 05]
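
Since choice states plus extended actions form a semi-Markov decision process, learning can use the standard SMDP form of Q-learning, in which the discount is raised to the number of primitive steps the extended action took. The sketch below is a generic tabular version, not code from the talk; the state and action labels and the parameter values are hypothetical.

```python
from collections import defaultdict


def smdp_q_update(Q, state, action, reward, duration, next_state, next_actions,
                  alpha=0.1, gamma=0.9):
    """One SMDP Q-learning backup at a choice state.

    `reward` is the (discounted) reward accumulated while the extended action
    ran, and `duration` is the number of primitive steps it took.
    """
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + (gamma ** duration) * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]


Q = defaultdict(float)
Q[("at-mine", "nav-to-base")] = 8.0  # hypothetical existing estimate

# The extended action "nav-to-mine" ran for 2 primitive steps and earned 1.0.
new_value = smdp_q_update(
    Q, "at-base", "nav-to-mine", reward=1.0, duration=2,
    next_state="at-mine", next_actions=["nav-to-base"],
    alpha=0.5, gamma=0.5,
)
print(new_value)  # 1.5
```

With `duration=1` this reduces to ordinary one-step Q-learning, which is the sense in which the SMDP view generalizes the primitive-action case.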

  29. Running example • Peasants can move, pickup and dropoff • Penalty for collision • Cost-of-living each step • Reward for dropping off resources • Goal: gather 10 gold + 10 wood • (3L)^n++ states s • 7^n primitive actions a

  31. RL and partial programs • [Diagram: a partial program is given to the learning algorithm, which emits actions a, receives states and rewards s,r, and produces a completion] • Completion is hierarchically optimal for all terminating programs

  32. An example single-threaded ALisp program and an example Concurrent ALisp program

        ;; Subroutines shared by both programs
        (defun gather-gold ()
          (with-choice 'mine-choice (dest *goldmine-list*)
            (nav dest)
            (action 'get-gold)
            (nav *base-loc*)
            (action 'dropoff)))

        (defun gather-wood ()
          (with-choice 'forest-choice (dest *forest-list*)
            (nav dest)
            (action 'get-wood)
            (nav *base-loc*)
            (action 'dropoff)))

        (defun nav (dest)
          (until (= (my-pos) dest)
            (with-choice 'nav-choice (move '(N S E W NOOP))
              (action move))))

        ;; Single-threaded top level: repeatedly choose which resource to gather
        (defun top ()
          (loop do
            (choose 'top-choice
              (gather-gold)
              (gather-wood))))

        ;; Concurrent top level: wait until an effector (peasant) is available,
        ;; then choose which gathering thread to spawn for it
        (defun top ()
          (loop do
            (until (my-effectors)
              (choose 'dummy))
            (setf peas (first (my-effectors)))
            (choose 'top-choice
              (spawn gather-wood peas)
              (spawn gather-gold peas))))

  38. Technical development • Decisions based on internal state • Joint state ω = [s,θ]: environment state + program state (cf. [Russell & Wefald 1989]) • MDP + partial program = SMDP over {ω}, learn Q^π(ω,u) • Additive decomposition of value functions • by subroutine structure [Dietterich 00, Andre & Russell 02]: Q is a sum of sub-Q functions per subroutine • across concurrent threads [Russell & Zimdars 03]: Q is a sum of sub-Q functions per thread, with decomposed reward signal

  39. Internal state • Availability of internal state (e.g., goal stack) can greatly simplify value functions and policies • E.g., while navigating to location (x,y), moving towards (x,y) is a good idea • Natural heuristic (distance from destination) impossible to express in external terms

  40. Temporal decomposition and state abstraction • Temporal decomposition of Q-function: local components capture sum-of-rewards per subroutine [Dietterich 00, Andre & Russell 02] • => State abstraction, e.g., when navigating, local Q independent of gold reserves • Small local Q-components => fast learning
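
The temporal decomposition can be written out as follows. This is a sketch of the three-part split used in the cited ALisp work [Andre & Russell 02]; the subscripts r, c, e are notation assumed here, since the slide does not spell them out.

```latex
Q^{\pi}(\omega, u) \;=\; Q_r^{\pi}(\omega, u) \;+\; Q_c^{\pi}(\omega, u) \;+\; Q_e^{\pi}(\omega, u)
```

Here Q_r is the expected reward while the chosen action or subroutine u executes, Q_c the expected reward from then until the current subroutine completes, and Q_e the expected reward after the current subroutine exits. State abstraction applies per component: for the navigation subroutine, for instance, Q_r and Q_c need not depend on gold reserves.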

  42. Q-decomposition w/ concurrency? • Temporal decomposition of Q-function lost!! • No credit assignment among threads • Peasant 1 brings back gold, Peasant 2 twiddles thumbs • Peasant 2 thinks he’s done very well!! • => learning is hopelessly slow with many peasants

  43. Threadwise decomposition • Idea: decompose reward among threads [Russell & Zimdars 03] • E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants • Q_j^π(ω,u) = “Expected total reward received by thread j if we make joint choice u and then do π” • Threadwise Q-decomposition: Q = Q_1 + … + Q_n • Recursively distributed SARSA => global optimality
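
A minimal flat sketch of threadwise Q-decomposition, assuming a tabular representation. Each thread j keeps its own Q_j and runs a SARSA update on its own share of the reward, while the joint choice is picked by maximizing the sum of the Q_j. The class and method names are hypothetical, and the recursive, per-subroutine distribution mentioned on the slide is not modeled.

```python
from collections import defaultdict


class ThreadwiseSarsa:
    """Tabular SARSA with one Q-table per thread (illustrative sketch)."""

    def __init__(self, n_threads, alpha=0.1, gamma=1.0):
        self.q = [defaultdict(float) for _ in range(n_threads)]
        self.alpha = alpha
        self.gamma = gamma

    def total_q(self, omega, u):
        # Threadwise decomposition: Q = Q_1 + ... + Q_n.
        return sum(qj[(omega, u)] for qj in self.q)

    def best_choice(self, omega, choices):
        # The joint choice maximizes the summed Q-values.
        return max(choices, key=lambda u: self.total_q(omega, u))

    def update(self, omega, u, rewards, next_omega, next_u):
        """SARSA backup per thread; rewards[j] is thread j's share of the reward.

        Every thread bootstraps from the same jointly chosen next_u.
        """
        for qj, rj in zip(self.q, rewards):
            target = rj + self.gamma * qj[(next_omega, next_u)]
            qj[(omega, u)] += self.alpha * (target - qj[(omega, u)])


learner = ThreadwiseSarsa(n_threads=2, alpha=0.5, gamma=1.0)
# Peasant 1 drops off gold (reward 1.0); peasant 2 earns nothing this step.
learner.update("w0", "u0", rewards=[1.0, 0.0], next_omega="w1", next_u="u1")
print(learner.q[0][("w0", "u0")], learner.q[1][("w0", "u0")])  # 0.5 0.0
```

Only thread 1's table moves, so peasant 2 gets no credit for peasant 1's delivery, which is the credit-assignment problem from slide 42 that the decomposed reward addresses.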

  44. Outline • Rationality and Intelligence • Efficient hierarchical planning • Hierarchical RL with partial programs • Next steps towards AI

  45. Bounded optimality (again) • ALisp partial program defines a space of agent programs (the “fixed architecture”); RL converges to the bounded-optimal solution • This gets interesting when the agent program is allowed to do some deliberating:

        ...
        (defun select-by-tree-search (nodes n)
          (for i from 1 to n do
            ...
            (setq leaf-to-expand (choose nodes))
            ...

  46. Bounded optimality contd. • Previous work on metareasoning has considered only homogeneous computation; RL with partial programming offers the ability to control complex, heterogeneous computations • E.g., computations in a hierarchical online-lookahead planning agent • Hybrid reactive/deliberative decision making • Detailed planning/replanning for the near future • Increasingly high-level, abstract, long-duration planning further into the future
