
A Unifying Framework for Computational Reinforcement Learning Theory

Lihong Li, Rutgers Laboratory for Real-Life Reinforcement Learning (RL3), Department of Computer Science, Rutgers University. PhD Defense Committee: Michael Littman, Michael Pazzani, Robert Schapire, Mario Szegedy.


Presentation Transcript


  1. A Unifying Framework for Computational Reinforcement Learning Theory. Lihong Li, Rutgers Laboratory for Real-Life Reinforcement Learning (RL3), Department of Computer Science, Rutgers University. PhD Defense Committee: Michael Littman, Michael Pazzani, Robert Schapire, Mario Szegedy. Joint work with Michael Littman, Alex Strehl, Tom Walsh, …

  2. $ponsored $earch (sponsored search) • Are these better alternatives? Need to EXPLORE!

  3. Thesis: The KWIK (Knows What It Knows) learning model provides a flexible, modularized, and unifying way for creating and analyzing RL algorithms with provably efficient exploration.

  4. Outline • Reinforcement Learning (RL) • The KWIK Framework • Provably Efficient RL • Model-based Approaches • Model-free Approaches • Conclusions

  5. Reinforcement Learning Example: AT&T Dialer [Li & Williams & Balakrishnan 09] • A user wants to call someone at AT&T: "May I speak to John Smith?" • States: a 100K-dimensional dialog state (features from speech recognition, NLP, belief tracking, etc.) • Actions: responses to the user, e.g. Confirm("John Smith"), rendered as "So you want to call John Smith, is that right?" by language generation and text-to-speech • Rewards: -1 per response, +20 if the dialog succeeds, -20 if it fails • Dialog design objective: succeed in the conversation with the fewest responses • The dialog policy is optimized by RL

  6. RL Summary • Define reward and let the agent chase it! (European Workshop on Reinforcement Learning 2008)

  7. Regularity Assumption: Markov Decision Process • The environment is often modeled as an MDP M = (S, A, T, R, γ): a set of states S, a set of actions A, transition probabilities T(s'|s,a), a reward function R(s,a), and a discount factor γ in (0,1) • Over time the agent passes through states s1, s2, …, st, st+1, …
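
To make these objects concrete, here is a minimal, illustrative Python container for a finite MDP matching the tuple above; it is a sketch for exposition, not code from the thesis.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteMDP:
    """Minimal container for a finite MDP M = (S, A, T, R, gamma)."""
    n_states: int      # |S|
    n_actions: int     # |A|
    T: np.ndarray      # T[s, a, s'] = Pr(s' | s, a), shape (|S|, |A|, |S|)
    R: np.ndarray      # R[s, a] = expected immediate reward, shape (|S|, |A|)
    gamma: float       # discount factor in (0, 1)

    def step(self, s: int, a: int, rng: np.random.Generator):
        """Sample one transition: return (next_state, reward)."""
        s_next = int(rng.choice(self.n_states, p=self.T[s, a]))
        return s_next, self.R[s, a]
```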

  8. Policies and Value Functions • Policy: π : S → A • Value function: Vπ(s) = E[ Σt γ^t rt | s0 = s, π ] • Optimal value function: V*(s) = maxπ Vπ(s) • Optimal policy: π* = argmaxπ Vπ • Solving an MDP: find π* (equivalently, V*)

  9. Solving an MDP • Planning (when T and R are known) • Dynamic programming, linear programming, … • Relatively easy to analyze • Learning (when T or R are unknown) • Q-learning [Watkins 89], … • Fundamentally harder • Exploration/exploitation dilemma
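
As an illustration of the planning case, here is a short value-iteration sketch for a known finite MDP (a minimal sketch in the notation above, not code from the thesis):

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-6):
    """Plan in a known finite MDP by dynamic programming (value iteration).
    T has shape (S, A, S) with T[s, a, s'] = Pr(s'|s, a); R has shape (S, A).
    Returns an estimate of V* and the corresponding greedy policy."""
    n_states = T.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```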

  10. Exploration/Exploitation Dilemma • Exploitation: take actions that look optimal (reward maximization) • Exploration: try suboptimal actions (knowledge acquisition) • The agent needs both estimation and "dual control" • Similar to active learning (like selective sampling) and bandit problems (ad ranking) • But different/harder: many heuristics may fail

  11. Combination Lock • [Figure: total reward vs. time in a 100-state combination-lock MDP with a small 0.001 per-step distractor reward; active (efficient) exploration finds the optimal policy, while poor (insufficient) exploration does not]
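
A minimal sketch of a combination-lock style chain MDP in the spirit of this slide; the chain length (100) and small reward (0.001) come from the figure, but the exact construction details here are illustrative assumptions:

```python
import numpy as np

def combination_lock_mdp(n=100, small_reward=0.001, big_reward=1.0):
    """Chain of n states: action 0 advances one state toward the goal (reward 0),
    action 1 resets to the start for a tiny immediate reward. Undirected
    exploration takes exponentially long to reach the goal; directed
    exploration reaches it quickly."""
    T = np.zeros((n + 1, 2, n + 1))
    R = np.zeros((n + 1, 2))
    for s in range(n):
        T[s, 0, s + 1] = 1.0       # advance toward the goal
        T[s, 1, 0] = 1.0           # reset to the start
        R[s, 1] = small_reward     # tempting immediate reward
    T[n, :, n] = 1.0               # goal state is absorbing
    R[n, :] = big_reward
    return T, R
```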

  12. PAC-MDP RL • An RL algorithm A is viewed as a non-stationary policy At • Sample complexity [Kakade 03] (given ε): the number of steps t at which the algorithm is not ε-optimal, i.e., V^At(st) < V*(st) - ε • A is PAC-MDP (Probably Approximately Correct in MDPs) [Strehl, Li, Wiewiora, Langford & Littman 06] if, with probability at least 1-δ, the sample complexity is polynomial in the relevant quantities • In words: we want the algorithm to act near-optimally except in a small number of steps
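
For reference, a standard way to write the definition sketched on this slide (notation follows [Kakade 03] and [Strehl et al. 06]; |M| denotes the problem-size measure used on the next slide):

```latex
% Sample complexity of exploration (written zeta) and the PAC-MDP condition.
\[
  \zeta(\epsilon,\delta)
  \;=\;
  \bigl|\{\, t \,:\, V^{\mathcal{A}_t}(s_t) < V^*(s_t) - \epsilon \,\}\bigr|
\]
\[
  \mathcal{A}\text{ is PAC-MDP if, with probability at least } 1-\delta,\quad
  \zeta(\epsilon,\delta)
  \;=\;
  \mathrm{poly}\!\Bigl(|M|,\ \tfrac{1}{\epsilon},\ \tfrac{1}{\delta},\ \tfrac{1}{1-\gamma}\Bigr).
\]
```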

  13. Why PAC-MDP? • Sample complexity • number of steps where learning/exploration happens • related to "learning speed" or "exploration efficiency" • Roles of parameters • ε: allow small sub-optimality • δ: allow failure due to unlucky data • |M|: measures problem complexity • 1/(1-γ): a larger γ makes the problem harder • Generality • No assumption on ergodicity • No assumption on mixing • No need for a reset or generative model

  14. Rmax [Brafman & Tenenholtz 02] • Rmax is for finite-state, finite-action MDPs • Learns T and R by counting/averaging • In state st, takes the optimal action in the optimistic model built from the "known" state-actions • "Optimism in the face of uncertainty": either explore the "unknown" region, or exploit the "known" region • Thm: Rmax is PAC-MDP [Kakade 03] • [Diagram: S x A partitioned into known and unknown state-actions]
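
The counting/averaging bookkeeping behind Rmax fits in a few lines; the following is a simplified, illustrative sketch (parameter names such as m_known are ours, not from the original algorithm's code):

```python
import numpy as np

class RmaxModel:
    """Rmax-style model estimation for a finite MDP (illustrative sketch)."""

    def __init__(self, n_states, n_actions, r_max, m_known):
        self.m = m_known          # visits required before (s, a) is "known"
        self.r_max = r_max
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.trans_counts = np.zeros((n_states, n_actions, n_states), dtype=int)
        self.reward_sums = np.zeros((n_states, n_actions))

    def update(self, s, a, r, s_next):
        """Count transitions and accumulate rewards."""
        self.counts[s, a] += 1
        self.trans_counts[s, a, s_next] += 1
        self.reward_sums[s, a] += r

    def known(self, s, a):
        return self.counts[s, a] >= self.m

    def optimistic_model(self):
        """Empirical T, R on known (s, a); optimistic self-loop with reward
        r_max on unknown (s, a) ("optimism in the face of uncertainty")."""
        n_states, n_actions = self.counts.shape
        T = np.zeros((n_states, n_actions, n_states))
        R = np.full((n_states, n_actions), self.r_max)
        for s in range(n_states):
            for a in range(n_actions):
                if self.known(s, a):
                    T[s, a] = self.trans_counts[s, a] / self.counts[s, a]
                    R[s, a] = self.reward_sums[s, a] / self.counts[s, a]
                else:
                    T[s, a, s] = 1.0   # unknown: optimistic self-loop
        return T, R
```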

  15. Outline • Reinforcement Learning (RL) • The KWIK Framework • Provably Efficient RL • Model-based Approaches • Model-free Approaches • Conclusions

  16. KWIK Notation • KWIK: Knows What It Knows [Li & Littman & Walsh 08] • A self-aware, supervised-learning model • Input set: X • Output set: Y • Observation set: Z • Hypothesis class: H ⊆ (X → Y) • Target function: h* ∈ H (the "realizability assumption") • Special symbol: ? ("I don't know")

  17. KWIK Definition • Given: ε, δ, and H • The environment picks h* ∈ H secretly and adversarially, then repeatedly picks inputs x adversarially • On each x, the learner either says "I know" and outputs a prediction ŷ, or says "I don't know" (?) and then observes y = h*(x) [deterministic] or a measurement z with E[z] = h*(x) [stochastic] • Learning succeeds if, with probability at least 1-δ, all predictions are correct (|ŷ - h*(x)| ≤ ε) and the total number of ?'s is small (at most poly(1/ε, 1/δ, dim(H)))
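
The protocol can be captured by a tiny interface; the following is an illustrative Python sketch (names such as DONT_KNOW are ours, standing in for the ? symbol):

```python
from abc import ABC, abstractmethod

DONT_KNOW = None   # stands in for the special "I don't know" symbol


class KWIKLearner(ABC):
    """Sketch of the KWIK protocol: on each adversarial input x the learner
    must either return an epsilon-accurate prediction or admit "I don't know",
    in which case it gets to observe a (possibly noisy) label."""

    @abstractmethod
    def predict(self, x):
        """Return a prediction y_hat with |y_hat - h*(x)| <= epsilon,
        or DONT_KNOW if the learner cannot yet commit."""

    @abstractmethod
    def observe(self, x, z):
        """Called only after DONT_KNOW: z is the observation, with E[z] = h*(x)
        in the stochastic case (z = h*(x) in the deterministic case)."""
```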

  18. Related Frameworks • PAC: Probably Approximately Correct [Valiant 84] • MB: Mistake Bound [Littlestone 87] • KWIK: Knows What It Knows [Li & Littman & Walsh 08] • [Diagram: KWIK-learnable classes are MB-learnable, and MB-learnable classes are PAC-learnable; the reverse directions fail: PAC-to-MB separation if one-way functions exist [Blum 94], and KWIK may be exponentially harder than MB [Li & Littman & Walsh 08]]

  19. Deterministic / Finite Case (X or H is finite) • Thought experiment: you own a bar frequented by n patrons. One is an instigator: when he shows up, there is a fight, unless another patron, the peacemaker, is also there. We want to predict, for a subset of patrons, {fight or no-fight} • Alg. 1: Memorization. Memorize the outcome for each subgroup of patrons; predict ? if the subgroup is unseen before; #? ≤ |X|, i.e., #? ≤ 2^n for the bar-fight problem • Alg. 2: Enumeration. Enumerate all consistent (instigator, peacemaker) pairs; say ? when they disagree; #? ≤ |H| - 1, i.e., #? ≤ n(n-1) for the bar-fight problem • Note: accurate predictions can be made before h* is completely identified
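
For concreteness, a sketch of the Enumeration algorithm on the bar-fight example; every ? eliminates at least one (instigator, peacemaker) hypothesis, which is what gives the #? ≤ n(n-1) bound:

```python
from itertools import permutations

class BarFightEnumeration:
    """Enumeration KWIK learner for the bar-fight thought experiment.
    Hypotheses are ordered (instigator, peacemaker) pairs of distinct patrons."""

    def __init__(self, patrons):
        self.hypotheses = set(permutations(patrons, 2))   # |H| = n(n-1)

    @staticmethod
    def fight(hyp, present):
        instigator, peacemaker = hyp
        return instigator in present and peacemaker not in present

    def predict(self, present):
        """Predict fight/no-fight if all remaining hypotheses agree, else None ("?")."""
        answers = {self.fight(h, present) for h in self.hypotheses}
        return answers.pop() if len(answers) == 1 else None

    def observe(self, present, fought):
        """After a "?", discard hypotheses inconsistent with the observed outcome."""
        self.hypotheses = {h for h in self.hypotheses
                           if self.fight(h, present) == fought}
```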

  20. Stochastic / Finite Case: Dice-Learning • Problem: learn a multinomial distribution over N outcomes; the input is the same at all times; we observe sampled outcomes, not the actual probabilities • Algorithm: predict ? for the first m observations, with m chosen via Chernoff's bound, and use the empirical estimate afterwards • Correctness follows from Chernoff's bound • A building block for many other stochastic cases
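
A sketch of dice-learning as a KWIK learner; the sample threshold m is left as a parameter, to be set from Chernoff's bound as described on the slide:

```python
import numpy as np

class DiceLearning:
    """Learn a multinomial over n_outcomes, saying "I don't know" (None)
    until m samples have accumulated, then predicting the empirical estimate."""

    def __init__(self, n_outcomes, m):
        self.m = m
        self.counts = np.zeros(n_outcomes)

    def predict(self):
        if self.counts.sum() < self.m:
            return None                          # "I don't know"
        return self.counts / self.counts.sum()   # empirical distribution

    def observe(self, outcome):
        self.counts[outcome] += 1
```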

  21. More Examples • Distance to an unknown point in ℝ^n [Li & Littman & Walsh 08] • Linear functions with white noise [Strehl & Littman 08] [Walsh & Szita & Diuk & Littman 09] • Gaussian distributions [Brunskill & Leffler & Li & Littman & Roy 08]

  22. Outline • Reinforcement Learning (RL) • The KWIK Framework • Provably Efficient RL • Model-based Approaches • Model-free Approaches • Conclusions

  23. Model-based RL • First learn estimates of T and R, then use the learned model to compute a policy • Simulation lemma [Kearns & Singh 02]: if the learned model is sufficiently accurate, a near-optimal policy for the model is near-optimal in the true MDP • Building a model often makes more efficient use of training data in practice

  24. Rmax [Brafman & Tenenholtz 02] vs. KWIK-Rmax [Li et al. 09] • Rmax is for finite-state, finite-action MDPs and learns T and R by counting/averaging • KWIK-Rmax generalizes Rmax to general MDPs and KWIK-learns T and R simultaneously • In state st, both take the optimal action in the optimistic model built from the "known" state-actions • "Optimism in the face of uncertainty": either explore the "unknown" region, or exploit the "known" region • [Diagram: S x A partitioned into known and unknown state-actions]

  25. KWIK-Rmax Analysis • Explore-or-Exploit Lemma [Li et al. 09]: KWIK-Rmax either follows an ε-optimal policy, or explores an unknown state, allowing the KWIK-learners to learn T and R • Theorem [Li et al. 09]: KWIK-Rmax is PAC-MDP, with sample complexity polynomial in the KWIK bounds for learning T and R (and in 1/ε, 1/δ, and 1/(1-γ))

  26. KWIK-Learning Finite MDPs by Input-Partition • T(·|s,a) is a multinomial distribution • There are |S||A| of them, each indexed by (s,a) • The environment presents inputs such as x=(s1,a2); Input-Partition routes each input to a dedicated dice-learning sub-learner for T(·|s1,a1), T(·|s1,a2), …, T(·|sn,am) • [Brafman & Tenenholtz 02] [Kakade 03] [Strehl & Li & Littman 06]
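
A sketch of the Input-Partition meta-learner, wrapping one sub-learner (e.g., a dice-learner) per (s,a) cell; its total number of ?'s is the sum of the sub-learners' bounds:

```python
class InputPartition:
    """Route each input to the sub-learner for its cell of a known partition
    (here, one sub-learner per (s, a) key)."""

    def __init__(self, learner_factory):
        self.factory = learner_factory  # e.g. lambda: DiceLearning(n_states, m)
        self.learners = {}              # key -> sub-learner

    def predict(self, key):
        if key not in self.learners:
            self.learners[key] = self.factory()
        return self.learners[key].predict()   # None propagates "I don't know"

    def observe(self, key, outcome):
        # As in the KWIK protocol, observe() follows a "?" on the same key,
        # so the sub-learner already exists.
        self.learners[key].observe(outcome)
```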

  27. Factored-State MDPs • DBN representation [Dean & Kanazawa 89] • [Figures: network topologies from [Guestrin & Koller & Parr & Venkataraman 03]: Star, Bidirectional Ring, Ring and Star, 3 Legs, Ring of Rings]

  28. Factored-State MDPs • DBN representation [Dean & Kanazawa 89], assuming the number of parents of each variable is bounded by a constant D • Challenges: How to estimate Ti(si' | parents(si'), a)? How to discover the parents of each si'? How to combine the learners L(si') and L(sj')?

  29. KWIK-Learning DBNs with Unknown Structure • From [Kearns & Koller 99]: "This paper leaves many interesting problems unaddressed. Of these, the most intriguing one is to allow the algorithm to learn the model structure as well as the parameters. The recent body of work on learning Bayesian networks from data [Heckerman, 1995] lays much of the foundation, but the integration of these ideas with the problems of exploration/exploitation is far from trivial." • First solved by [Strehl & Diuk & Littman 07] • KWIK decomposition [Li & Littman & Walsh 08] [Diuk & Li & Leffler 09]: learning a DBN = Cross-Product over the per-variable learners; discovery of the parents of si' = Noisy-Union; the CPT for T(si' | parents(si'), a) = Input-Partition over parent values; entries in each CPT = Dice-Learning

  30. Experiment: "System Administrator" • Ring network, 8 machines, 9 actions • [Figure: learning curves for Met-Rmax [Diuk & Li & Leffler 09], SLF-Rmax [Strehl & Diuk & Littman 07], and Factored Rmax [Guestrin & Patrascu & Schuurmans 02]]

  31. MDPs with Gaussian Dynamics • Examples: robot navigation, transportation planning • The state offset follows a multivariate normal distribution • Algorithms: CORL [Brunskill & Leffler & Li & Littman & Roy 08], RAM-Rmax [Leffler & Littman & Edmunds 07] • (video by Leffler)

  32. Outline • Reinforcement Learning (RL) • The KWIK Framework • Provably Efficient RL • Model-based Approaches • Model-free Approaches • Conclusions

  33. Model-free RL • Estimate the optimal value function Q* directly; the greedy policy with respect to Q* is then (near-)optimal • No need to estimate T or R • Benefits: tractable computational complexity, tractable space complexity • Drawback: seems to make inefficient use of data • Are there PAC-MDP model-free algorithms?

  34. PAC-MDP Model-free RL • [Diagram: maintain optimistic Q-functions whose uncertainty term E(s,a) can be KWIK-learned; a large E(s,a) triggers exploration, while a small E(s,a) implies near-optimal (exploiting) behavior]

  35. Delayed Q-learning • Delayed Q-learning (for finite MDPs) is the first known PAC-MDP model-free algorithm [Strehl & Li & Wiewiora & Langford & Littman 06] • Similar to Q-learning [Watkins 89] • Minimal computational complexity • Minimal space complexity
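
A simplified sketch of the Delayed Q-learning update: Q is initialized optimistically, and each Q(s,a) is updated only after a fresh batch of m samples, and only if the batched target lowers it by enough. The full algorithm also maintains LEARN flags controlling when batches may restart; that bookkeeping is omitted here.

```python
import numpy as np
from collections import defaultdict

class DelayedQLearning:
    """Simplified Delayed Q-learning sketch (LEARN-flag logic omitted)."""

    def __init__(self, n_states, n_actions, gamma, m, eps1, r_max=1.0):
        self.gamma, self.m, self.eps1 = gamma, m, eps1
        self.Q = np.full((n_states, n_actions), r_max / (1 - gamma))  # optimistic init
        self.batch = defaultdict(list)   # (s, a) -> list of sampled targets

    def act(self, s):
        return int(np.argmax(self.Q[s]))          # greedy w.r.t. optimistic Q

    def update(self, s, a, r, s_next):
        self.batch[(s, a)].append(r + self.gamma * self.Q[s_next].max())
        if len(self.batch[(s, a)]) == self.m:
            target = float(np.mean(self.batch[(s, a)]))
            # Attempted update succeeds only if it lowers Q(s, a) by >= 2*eps1.
            if self.Q[s, a] - target >= 2 * self.eps1:
                self.Q[s, a] = target + self.eps1
            self.batch[(s, a)] = []                # start a fresh batch
```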

  36. Comparison • [Table comparing the algorithms' guarantees; contents not preserved in the transcript]

  37. Improved Lower Bound for Finite MDPs • Lower bound for N = 1 [Mannor & Tsitsiklis 04] • Theorem: a new lower bound for finite MDPs • Delayed Q-learning's upper bound essentially matches it

  38. KWIK with Linear Function Approximation • Linear FA: Q(s,a) ≈ θᵀφ(s,a) for a given feature map φ • LSPI-Rmax [Li & Littman & Mansley 09]: LSPI [Lagoudakis & Parr 03] with online exploration; (s,a) is treated as unknown if it is under-represented in the training set; includes Rmax as a special case • REKWIRE [Li & Littman 08]: for finite-horizon MDPs; learns Q in a bottom-up manner
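
One common way to make "under-represented in the training set" concrete for linear FA is a confidence-width test on the accumulated feature matrix; the following sketch illustrates that idea and is not necessarily the exact criterion used by LSPI-Rmax.

```python
import numpy as np

class LinearKnownness:
    """Decide whether a feature vector phi(s, a) is "known": its predictive
    width under the regularized Gram matrix of observed features is small."""

    def __init__(self, dim, ridge=1.0, threshold=0.1):
        self.A = ridge * np.eye(dim)   # regularized Gram matrix of seen features
        self.threshold = threshold

    def add(self, phi):
        """Record a feature vector from the training set."""
        self.A += np.outer(phi, phi)

    def known(self, phi):
        """Known iff phi^T A^{-1} phi is below the threshold."""
        width = float(phi @ np.linalg.solve(self.A, phi))
        return width <= self.threshold
```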

  39. Outline • Reinforcement Learning (RL) • The KWIK Framework • Provably Efficient RL • Model-based Approaches • Model-free Approaches • Conclusions

  40. Open Problems • Agnostic learning [Kearns & Schapire & Sellie 94] in KWIK: the hypothesis class H may not include h*; "unrealizable" KWIK [Li & Littman 08] • Prior information in RL: Bayesian priors [Asmuth & Li & Littman & Nouri & Wingate 09]; heuristics/shaping [Asmuth & Littman & Zinkov 08] [Strehl & Li & Littman 09] • Approximate RL with KWIK: least-squares policy iteration [Li & Littman & Mansley 09]; fitted value iteration [Brunskill & Leffler & Li & Littman & Roy 08]; linear function approximation [Li & Littman 08]

  41. Conclusions: A Unification • The KWIK (Knows What It Knows) learning model [Li & Littman & Walsh 08] provides a flexible, modularized, and unifying way for creating and analyzing RL algorithms with provably efficient exploration • Model-based results: finite MDPs [Kearns & Singh 02] [Brafman & Tenenholtz 02] [Kakade 03] [Strehl & Li & Littman 06], factored MDPs [Kearns & Koller 99] [Strehl & Diuk & Littman 07] [Li & Littman & Walsh 08] [Diuk & Li & Leffler 09], linear MDPs [Strehl & Littman 08], Gaussian-offset MDPs [Brunskill & Leffler & Li & Littman & Roy 08], RAM-MDPs [Leffler & Littman & Edmunds 07], delayed-observation MDPs [Walsh & Nouri & Li & Littman 07] • Model-free results: finite MDPs [Strehl & Li & Wiewiora & Langford & Littman 06] with a matching lower bound, and KWIK-based value-function approximation [Li & Littman 08] [Li & Mansley & Littman 09]

  42. References • Li, Littman, & Walsh: "Knows what it knows: A framework for self-aware learning". In ICML 2008. • Diuk, Li, & Leffler: "The adaptive k-meteorologist problem and its applications to structure discovery and feature selection in reinforcement learning". In ICML 2009. • Brunskill, Leffler, Li, Littman, & Roy: "CORL: A continuous-state offset-dynamics reinforcement learner". In UAI 2008. • Walsh, Nouri, Li, & Littman: "Planning and learning in environments with delayed feedback". In ECML 2007. • Strehl, Li, & Littman: "Incremental model-based learners with formal learning-time guarantees". In UAI 2006. • Li, Littman, & Mansley: "Online exploration in least-squares policy iteration". In AAMAS 2009. • Li & Littman: "Efficient value-function approximation via online linear regression". In AI&Math 2008. • Strehl, Li, Wiewiora, Langford, & Littman: "PAC model-free reinforcement learning". In ICML 2006.
