
Structure in Reinforcement Learning




Presentation Transcript


  1. Structure in Reinforcement Learning

  2. Outline • Structure in behavior • Temporal abstraction and ALisp • Temporal decomposition of value functions • Concurrent behavior and concurrent ALisp • Threadwise decomposition of value functions • Putting it all together

  3. Structured behavior • The world is very complicated, but we manage somehow • Behavior is usually very structured and deeply nested, from innermost to outermost: • Moving my tongue • Shaping this syllable • Saying this word • Saying this sentence • Making a point about nesting • Explaining structured behavior • Giving this talk

  4. Structured behavior • The world is very complicated, but we manage somehow • Behavior is usually very structured and deeply nested, from innermost to outermost: • Moving my tongue • Shaping this syllable • Saying this word • Saying this sentence • Making a point about nesting • Explaining structured behavior • Giving this talk • Modularity/Independence: Selection of tongue motions proceeds independently of most of the variables relevant to giving this talk, given the choice of word.

  5. Example problem: Stratagus • Challenges • Large state space • Large action space • Complex policies

  6. Running example • Peasants can move, pickup and dropoff • Penalty for collision • Cost-of-living each step • Reward for dropping off resources • Goal: gather 10 gold + 10 wood

  7. Outline • Structure in behavior • Temporal abstraction and ALisp • Temporal decomposition of value functions • Concurrent behavior and concurrent ALisp • Threadwise decomposition of value functions • Putting it all together

  8. Reinforcement Learning • [Diagram: the environment sends state s and reward r to a learning algorithm, which updates a policy; the policy sends action a back to the environment]

  9. Q-functions • Represent policies implicitly using a Q-function • Qπ(s,a) = "Expected total reward if I do action a in environment state s and follow policy π thereafter" • [Figure: fragment of an example Q-function]
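To make the slide's definition concrete, here is a minimal tabular sketch in Python; the states, actions, and values are invented for illustration and are not from the talk:

```python
# Minimal tabular Q-function sketch (hypothetical states, actions, and values).
# Q[(s, a)] estimates expected total reward for doing a in s, then following pi.
Q = {
    ("at-mine", "get-gold"): 9.0,
    ("at-mine", "move-N"): 4.5,
    ("at-base", "dropoff"): 10.0,
    ("at-base", "move-S"): 3.0,
}

def greedy_action(Q, s, actions):
    """The policy implicitly represented by Q: pick the argmax action in s."""
    return max(actions, key=lambda a: Q.get((s, a), float("-inf")))

print(greedy_action(Q, "at-mine", ["get-gold", "move-N"]))  # get-gold
```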

  10. Hierarchy in RL • It is natural to think about abstract states such as "at home," "in Soda Hall," etc., arranged in an abstraction hierarchy • Idea 1: set up a decision process among the abstract states

  11. Hierarchy in RL • It is natural to think about abstract states such as "at home," "in Soda Hall," etc., arranged in an abstraction hierarchy • Idea 1: set up a decision process among the abstract states • Fails! Transitions between abstract states are usually non-Markovian

  12. Hierarchy in RL • Idea 2: define abstract actions such as "drive to work", "park", etc. • Set up a decision process with choice states and abstract actions • Resulting decision problem is semi-Markov (Forestier & Varaiya, 1978) • Temporal abstraction in RL (HAMs (Parr & Russell); Options (Sutton & Precup); MAXQ (Dietterich))

  13. Partial programs • Idea 3: choices only in some states => prior constraints on behavior => partially specified programs • Partial programming language = programming language + free choices • ALisp (Andre & Russell, 2002); Concurrent ALisp (Marthi et al., 2005); cf. Golog and ConGolog • Analogy to Bayes nets: domain experts supply structure (partial programs) => factored representation of Q-function (Dietterich, 2000)

  14. Hierarchical RL and ALisp • [Diagram: as in slide 8, but the environment sends s, r to a learning algorithm that is constrained by a partial program; the algorithm outputs a completion of the partial program, which sends action a back to the environment]

  15. Single-threaded ALisp program • Program state includes: program counter, call stack, global variables

    (defun top ()
      (loop do
        (choose 'top-choice
          (gather-wood)
          (gather-gold))))

    (defun gather-wood ()
      (with-choice 'forest-choice (dest *forest-list*)
        (nav dest)
        (action 'get-wood)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun gather-gold ()
      (with-choice 'mine-choice (dest *goldmine-list*)
        (nav dest)
        (action 'get-gold)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun nav (dest)
      (until (= (pos (get-state)) dest)
        (with-choice 'nav-choice (move '(N S E W NOOP))
          (action move))))

  16. Q-functions • Represent completions using a Q-function • Joint state ω = environment state + program state • MDP + partial program = SMDP over {ω} • Qπ(ω,u) = "Expected total reward if I make choice u in ω and follow completion π thereafter" • [Figure: example Q-function]
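The joint state ω described above can be sketched as a small data structure; field names and values are illustrative, not the actual ALisp machinery:

```python
# Sketch of the joint state omega = environment state + program state
# (program counter, call stack; globals omitted). Illustrative names only.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ProgramState:
    pc: str                  # label of the current choice point, e.g. "nav-choice"
    stack: Tuple[str, ...]   # call stack of active subroutines

@dataclass(frozen=True)
class JointState:
    env: Tuple               # environment state, e.g. a peasant's position
    prog: ProgramState

omega = JointState(env=(3, 4),
                   prog=ProgramState("nav-choice", ("top", "gather-gold", "nav")))

# A Q-function for the resulting SMDP is keyed on (omega, u):
Q = {(omega, "N"): 1.2, (omega, "E"): 2.7}
best = max(["N", "E"], key=lambda u: Q[(omega, u)])
```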

  17. Outline • Structure in behavior • Temporal abstraction and ALisp • Temporal decomposition of value functions • Concurrent behavior and concurrent ALisp • Threadwise decomposition of value functions • Putting it all together

  18. Temporal Q-decomposition • [Diagram: task hierarchies rooted at Top, with subroutines GatherGold, GatherWood, Nav(Mine2), Nav(Forest1), illustrating the components Qr, Qc, Qe of Q] • Temporal decomposition Q = Qr + Qc + Qe, where • Qr(ω,u) = reward while doing u • Qc(ω,u) = reward in current subroutine after doing u • Qe(ω,u) = reward after current subroutine
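A toy numerical sketch of this decomposition at a single choice point; the values are hypothetical, and in practice each component is learned separately:

```python
# Toy sketch of the three-part temporal decomposition Q = Qr + Qc + Qe
# (hypothetical values at one (omega, u) pair).
Qr = {("nav-choice", "N"): -1.0}   # reward while executing the choice itself
Qc = {("nav-choice", "N"): -3.0}   # reward until the current subroutine (nav) exits
Qe = {("nav-choice", "N"): 40.0}   # reward after the current subroutine exits

def Q(omega, u):
    return Qr[(omega, u)] + Qc[(omega, u)] + Qe[(omega, u)]

print(Q("nav-choice", "N"))  # 36.0
```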

  19. Q-learning for ALisp • Temporal decomposition => state abstraction • E.g., while navigating, Qr doesn't depend on gold reserves • State abstraction => fewer parameters, faster learning, better generalization • Algorithm for learning temporally decomposed Q-function (Andre & Russell, 2002) • Yields optimal completion of partial program

  20. Outline • Structure in behavior • Temporal abstraction and ALisp • Temporal decomposition of value functions • Concurrent behavior and concurrent ALisp • Threadwise decomposition of value functions • Putting it all together

  21. Multieffector domains • Domains where the action is a vector • Each component of the action vector corresponds to an effector • Stratagus: each unit/building is an effector • Number of joint actions is exponential in the number of effectors

  22. HRL in multieffector domains • Option 1: single ALisp program • Partial program must essentially implement multiple control stacks • Independencies between tasks not used • Temporal decomposition is lost • Option 2: separate ALisp program for each effector • Hard to achieve coordination among effectors • Our approach: single multithreaded partial program to control all effectors

  23. Threads and effectors • [Diagram: threads labeled Get-gold, Get-wood, Defend-Base, Get-wood, each owning a set of effectors] • Threads = tasks • Each effector assigned to a thread • Threads can be created/destroyed • Effectors can be moved between threads • Set of effectors can change over time

  24. Concurrent ALisp syntax • Extension of ALisp • New/modified operations: • (action (e1 a1) ... (en an)): each effector ei does ai • (spawn fn effectors): start new thread that calls function fn, and reassign effectors to it • (reassign effectors dest-thread): reassign effectors to thread dest-thread • (my-effectors): return list of effectors assigned to calling thread • Program should also specify how to assign new effectors on each step to a thread


  29. An example single-threaded ALisp program:

    (defun top ()
      (loop do
        (choose 'top-choice
          (gather-gold)
          (gather-wood))))

    (defun gather-gold ()
      (with-choice 'mine-choice (dest *goldmine-list*)
        (nav dest)
        (action 'get-gold)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun gather-wood ()
      (with-choice 'forest-choice (dest *forest-list*)
        (nav dest)
        (action 'get-wood)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun nav (dest)
      (until (= (my-pos) dest)
        (with-choice 'nav-choice (move '(N S E W NOOP))
          (action move))))

  An example Concurrent ALisp program (gather-gold, gather-wood, and nav as above):

    (defun top ()
      (loop do
        (until (my-effectors)
          (choose 'dummy))
        (setf peas (first (my-effectors)))
        (choose 'top-choice
          (spawn gather-wood peas)
          (spawn gather-gold peas))))

  30. Concurrent ALisp semantics • [Two-thread execution trace: at environment timestep 26, Thread 1 (Peasant 1, running gather-wood) is making a joint choice while Thread 2 (Peasant 2, running gather-gold) is paused; at timestep 27, Thread 1 is paused waiting for the joint action while Thread 2 is running and making its joint choice]

    (defun gather-wood ()
      (setf dest (choose *forests*))
      (nav dest)
      (action 'get-wood)
      (nav *home-base-loc*)
      (action 'put-wood))

    (defun gather-gold ()
      (setf dest (choose *goldmines*))
      (nav dest)
      (action 'get-gold)
      (nav *home-base-loc*)
      (action 'put-gold))

    (defun nav (dest)
      (loop until (at-pos dest)
        (setf d (choose '(N S E W R)))
        do (action d)))

  31. Concurrent ALisp semantics • Threads execute independently until they hit a choice or action • Wait until all threads are at a choice or action • If all effectors have been assigned an action, do that joint action in the environment • Otherwise, make a joint choice • Can be converted to a semi-Markov decision process, assuming the partial program is deadlock-free and has no race conditions
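The synchronization rule above can be sketched as a single-step function; this is a simplified model for illustration, not the actual Concurrent ALisp interpreter:

```python
# Sketch of the joint-execution rule: every thread runs until it blocks at an
# action or a choice; if every thread has an action, execute the joint action,
# otherwise the choosing threads make a joint choice first. (Simplified model.)
def step(threads):
    """threads: list of dicts like
       {'status': 'action' | 'choice', 'effector_actions': {effector: action}}"""
    # 1. All threads must have reached an action or choice point.
    assert all(t["status"] in ("action", "choice") for t in threads)
    if all(t["status"] == "action" for t in threads):
        # 2. Merge per-thread effector actions into one joint action.
        joint = {}
        for t in threads:
            joint.update(t["effector_actions"])
        return ("act", joint)
    # 3. Otherwise, return the indices of the threads that must choose.
    return ("choose", [i for i, t in enumerate(threads) if t["status"] == "choice"])

print(step([{"status": "action", "effector_actions": {"peasant1": "N"}},
            {"status": "action", "effector_actions": {"peasant2": "get-gold"}}]))
```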

  32. Q-functions • To complete the partial program, at each choice state ω we need to specify choices for all choosing threads • So Q(ω,u) as before, except u is a joint choice • [Figure: example Q-function]

  33. Making joint choices at runtime • Large state space • Large choice space • At state ω, want to choose argmax_u Q(ω,u) • Number of joint choices is exponential in the number of choosing threads • Use relational linear value function approximation with coordination graphs (Guestrin et al., 2003) to represent and maximize Q efficiently
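A minimal sketch of coordination-graph maximization for two choosing threads, assuming Q decomposes into local terms; the payoffs are toy values, with a pairwise penalty that discourages both threads picking the same mine:

```python
# Coordination-graph sketch: Q(u1, u2) = f1(u1) + f2(u2) + f12(u1, u2).
# Eliminating one variable at a time avoids enumerating all joint choices.
import itertools

choices = ["mine1", "mine2"]
f1 = {"mine1": 2.0, "mine2": 3.0}                 # thread 1's local term
f2 = {"mine1": 1.0, "mine2": 1.5}                 # thread 2's local term
f12 = {(u1, u2): -4.0 if u1 == u2 else 0.0        # penalty for clashing choices
       for u1, u2 in itertools.product(choices, repeat=2)}

# Eliminate u2: for each u1, the best response of thread 2.
g = {u1: max(f2[u2] + f12[(u1, u2)] for u2 in choices) for u1 in choices}
u1 = max(choices, key=lambda u: f1[u] + g[u])
u2 = max(choices, key=lambda u: f2[u] + f12[(u1, u)])
print(u1, u2)  # mine2 mine1
```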

  34. Basic learning procedure • Apply Q-learning to the equivalent SMDP • Boltzmann exploration; maximization on each step done efficiently using the coordination graph • Theorem: in the tabular case, converges to the optimal completion of the partial program
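A sketch of this basic procedure with a tabular Q-function over (ω, u) pairs; it is simplified in that the discounting over the variable-length SMDP step is folded into a single gamma:

```python
# Tabular SMDP Q-learning sketch with Boltzmann exploration (toy setting;
# all names are illustrative, not the talk's implementation).
import math
import random

def boltzmann(Q, omega, choices, temp=1.0):
    """Sample a choice with probability proportional to exp(Q / temp)."""
    ws = [math.exp(Q.get((omega, u), 0.0) / temp) for u in choices]
    return random.choices(choices, weights=ws)[0]

def smdp_update(Q, omega, u, reward, omega2, choices2, alpha=0.1, gamma=1.0):
    """reward = total reward accumulated during the (possibly long) SMDP step."""
    target = reward + gamma * max(Q.get((omega2, u2), 0.0) for u2 in choices2)
    Q[(omega, u)] = Q.get((omega, u), 0.0) + alpha * (target - Q.get((omega, u), 0.0))

Q = {}
smdp_update(Q, "s0", "gather-gold", 5.0, "s1", ["gather-gold", "gather-wood"])
print(Q[("s0", "gather-gold")])  # 0.5
```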

  35. Outline • Structure in behavior • Temporal abstraction and ALisp • Temporal decomposition of value functions • Concurrent behavior and concurrent ALisp • Threadwise decomposition of value functions • Putting it all together

  36. Problems with basic learning method • Temporal decomposition of the Q-function is lost • No credit assignment among threads • Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly • Peasant 2 thinks he's done very well!! • Significantly slows learning as the number of peasants increases

  37. Threadwise decomposition • Idea: decompose reward among threads (Russell & Zimdars, 2003) • E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants • Qj(ω,u) = "Expected total reward received by thread j if we make joint choice u and make globally optimal choices thereafter" • Threadwise Q-decomposition Q = Q1 + … + Qn
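A toy sketch of the threadwise sum, with invented values for two peasant threads:

```python
# Threadwise-decomposed Q-function sketch: the global Q is the sum of
# per-thread components, each trained from that thread's own reward stream.
Qj = {
    1: {("omega0", ("mine1", "forest1")): 6.0},   # peasant-1 thread's component
    2: {("omega0", ("mine1", "forest1")): 4.0},   # peasant-2 thread's component
}

def Q(omega, u):
    """Global Q = Q1 + ... + Qn over all thread components."""
    return sum(q.get((omega, u), 0.0) for q in Qj.values())

print(Q("omega0", ("mine1", "forest1")))  # 10.0
```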

  38. Learning threadwise decomposition • [Diagram, action state: the environment's reward r is decomposed into per-thread rewards r1, r2, r3, routed to the Peasant 1, 2, and 3 threads under the Top thread; the joint action a is sent to the environment]

  39. Learning threadwise decomposition • [Diagram, choice state ω: each peasant thread maintains its own component Qj(ω,·) and runs its own Q-update; the components are summed to form Q(ω,·), and the joint choice is u = argmax of the sum]
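The update scheme on this slide can be sketched as follows: each thread's component is updated from its own reward stream, while the successor choice maximizes the sum of all components (simplified tabular version, undiscounted, with illustrative names):

```python
# Threadwise Q-update sketch: per-thread targets, globally greedy joint choice.
def joint_choice(Qs, omega, joint_choices):
    """Pick u maximizing the SUM of the thread components."""
    return max(joint_choices, key=lambda u: sum(q.get((omega, u), 0.0) for q in Qs))

def threadwise_update(Qs, rs, omega, u, omega2, joint_choices, alpha=0.5):
    u2 = joint_choice(Qs, omega2, joint_choices)   # successor choice is global
    for q, r in zip(Qs, rs):                       # but each thread uses its own r_j
        target = r + q.get((omega2, u2), 0.0)
        q[(omega, u)] = q.get((omega, u), 0.0) + alpha * (target - q.get((omega, u), 0.0))

Qs = [{}, {}]                                      # components for two threads
threadwise_update(Qs, [1.0, 0.0], "w0", "u0", "w1", ["u0"])
print(Qs[0][("w0", "u0")], Qs[1][("w0", "u0")])  # 0.5 0.0
```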

  40. Outline • Structure in behavior • Temporal abstraction and ALisp • Temporal decomposition of value functions • Concurrent behavior and concurrent ALisp • Threadwise decomposition of value functions • Putting it all together

  41. Threadwise and temporal decomposition • [Diagram: Top thread and Peasant 1-3 threads, with each thread's component split into three temporal parts] • Qj = Qj,r + Qj,c + Qj,e, where • Qj,r(ω,u) = expected reward gained by thread j while doing u • Qj,c(ω,u) = expected reward gained by thread j after u but before leaving current subroutine • Qj,e(ω,u) = expected reward gained by thread j after current subroutine
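A toy arithmetic sketch of the combined decomposition, summing the three temporal components of each thread's Q (invented values):

```python
# Combined decomposition sketch: Q = sum over threads j of (Qj,r + Qj,c + Qj,e)
# at a single (omega, u) pair (toy values).
components = {
    1: {"r": -1.0, "c": 2.0, "e": 5.0},   # peasant-1 thread: 6.0 total
    2: {"r": 0.0, "c": 1.0, "e": 3.0},    # peasant-2 thread: 4.0 total
}
Q = sum(c["r"] + c["c"] + c["e"] for c in components.values())
print(Q)  # 10.0
```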

  42-45. Resource gathering with 15 peasants • [Figure, built up over four slides: learning curves plotting reward of learnt policy against number of learning steps (x 1000), comparing Flat, Undecomposed, Threadwise, and Threadwise + Temporal methods]

  46. Conclusion • Temporal abstraction + temporal value decomposition + concurrency + threadwise value decomposition + relational function approximation + coordination graphs + reward shaping => somewhat effective model-free learning • Current directions • Model-based learning and lookahead • Temporal logic as a behavioral constraint language • Partial observability (belief state = program state) • Complex motor control tasks
