Fitted/batch/model-based RL: A (sketchy, biased) overview(?) Csaba Szepesvári, University of Alberta
Contents • What, why? • Constraints • How? • Model-based learning • Model learning • Planning • Model-free learning • Averagers • Fitted RL
Motto “Nothing is more practical than a good theory” [Lewin] “He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.” [Leonardo da Vinci]
What? Why? • What is batch RL? • Input: Samples (the algorithm cannot influence the samples) • Output: A good policy • Why? • Common problem • Sample efficiency – data is expensive • Building block • Why not? • Too much work (for nothing?) – “Don’t worry, be lazy!” • Old samples are irrelevant • Missed opportunities (evaluate a policy!?)
Constraints • Large (infinite) state/action space • Limits on • Computation • Memory use
How? • Model learning + planning • Model free • Policy search • DP • Policy iteration • Value iteration
Model-based methods • Model learning: How? • Model: What happens if …? • Features vs. observations vs. states • System identification? ( Satinder! Carlos! Eric! …) • Planning: How? • Sample + learn! (batch RL? ..but here you can influence the samples) • What else? (Discretize? Nay..) • Pro: A model is good for multiple things • Contra: The problem is doubled: we need high-fidelity models and good planning Problem 1: Should planning take into account the uncertainties in the model? (“robustification”) Problem 2: How to learn relevant, compact models? For example: how to reject the irrelevant features and keep the relevant ones? Need: Tight integration of planning and learning!
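To make the model-based route concrete, here is a minimal sketch of “learn a model from a batch, then plan with it” for a finite MDP. It assumes integer-coded states and actions and a batch of (s, a, r, s') tuples; all names are illustrative, and unvisited state-action pairs simply keep zero transition mass.

```python
import numpy as np

def learn_tabular_model(transitions, n_states, n_actions):
    """Estimate transition probabilities P and mean rewards R from (s, a, r, s2) samples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        reward_sum[s, a] += r
        visits[s, a] += 1
    P = counts / np.maximum(visits[..., None], 1)   # unvisited (s, a): all-zero row
    R = reward_sum / np.maximum(visits, 1)
    return P, R

def plan_value_iteration(P, R, gamma=0.95, n_iters=200):
    """Plan in the learned model: standard value iteration on (P, R)."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        V = (R + gamma * P @ V).max(axis=1)         # Bellman optimality backup
    return V
```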
Bad news.. • Theorem (Chow, Tsitsiklis ’89) • Markovian decision problems • d-dimensional state space • Bounded transition probabilities, rewards • Lipschitz-continuous transition probabilities and rewards Any algorithm computing an ε-approximation of the optimal value function needs Ω(ε^{-d}) values of p and r. • What’s next then?? • Open: Policy approximation?
The joy of laziness • Don’t worry, be lazy: • “If something is too hard to do, then it's not worth doing” • Luckiness factor: • “If you really want something in this life, you have to work for it - Now quiet, they're about to announce the lottery numbers!”
Sparse lookahead trees [Kearns et al., ’02] • Idea: Computing a good action ≡ planning → build a lookahead tree • Size of the tree: S = c·|A|^{H(ε)} (unavoidable), where H(ε) = K_r/(ε(1-γ)) • Good news: S is independent of d! • Bad news: S is exponential in H(ε) • Still attractive: Generic, easy to implement • Problem: Not really practical
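A minimal sketch of the sparse-sampling planner behind these bounds, assuming access to a generative model `sim(state, action)` that returns `(reward, next_state)`; the function and parameter names are illustrative. The recursion makes the |A|·n branching per level, and hence the exponential-in-depth tree size, explicit.

```python
def sparse_sampling_q(sim, state, actions, depth, n_samples, gamma):
    """Estimate Q(state, a) by recursively building a sparse lookahead tree."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(n_samples):                      # n_samples children per action
            r, next_state = sim(state, a)
            child_q = sparse_sampling_q(sim, next_state, actions,
                                        depth - 1, n_samples, gamma)
            total += r + gamma * max(child_q.values())  # back up the best child value
        q[a] = total / n_samples
    return q

def sparse_sampling_action(sim, state, actions, depth=3, n_samples=4, gamma=0.95):
    """Return the greedy action at the root of the sparse tree."""
    q = sparse_sampling_q(sim, state, actions, depth, n_samples, gamma)
    return max(q, key=q.get)
```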
Idea.. ( Remi!) • Be more lazy • Need to propagate values from good leaves as early as possible • Why sample suboptimal actions at all? • Breadth-first → depth-first! • Bandit algorithms → Upper Confidence Bounds → UCT • Similar ideas: • [Peret and Garcia, ’04] • [Chang et al., ’05] • [KoSze ’06]
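The bandit view can be sketched as follows: each tree node keeps per-action counts and mean returns and selects actions with a UCB1-style bonus, which is the core of UCT. This is only a sketch; the class and constant names are assumptions, and the tree-building, rollout, and backup logic around it are omitted.

```python
import math
import random

class UCTNode:
    """Per-node action statistics with UCB1-style selection (the heart of UCT)."""

    def __init__(self, actions):
        self.counts = {a: 0 for a in actions}
        self.means = {a: 0.0 for a in actions}   # running mean of sampled returns

    def select_action(self, c=1.4):
        untried = [a for a, n in self.counts.items() if n == 0]
        if untried:
            return random.choice(untried)        # try every action once first
        total = sum(self.counts.values())
        return max(self.counts, key=lambda a: self.means[a]
                   + c * math.sqrt(math.log(total) / self.counts[a]))

    def update(self, action, ret):
        """Incrementally update the mean return of the chosen action."""
        self.counts[action] += 1
        self.means[action] += (ret - self.means[action]) / self.counts[action]
```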
Results: Sailing • ‘Sailing’: Stochastic shortest path • State-space size = 24 × problem size • Extension to two-player, full-information games • Good results in Go! ( Remi, David!) Open: Why (when) does UCT work so well? Conjecture: When being (very) optimistic does not abuse the search How to improve UCT?
Random Discretization Method [Rust ’97] • Method: • Random base points • Value function computed at these points (weighted importance sampling) • Compute values at other points at run-time (“half-lazy method”) • Why Monte Carlo? Avoid grids! • Result: • State space: [0,1]^d • Action space: finite • p(y|x,a), r(x,a) Lipschitz continuous, bounded • Theorem [Rust ’97]: The randomized method computes an ε-approximation of the optimal value function with complexity polynomial in d (randomization breaks the curse of dimensionality for this class). • Theorem [Sze ’01]: Polynomially many samples are enough to come up with ε-optimal actions (poly dependence on H). Smoothness of the value function is not required. Open: Can we improve the result by changing the distribution of samples? Idea: Presample + follow the obtained policy Open: Can we get poly dependence on both d and H without representing a value function? (e.g. lookahead trees)
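A minimal sketch of the random-grid idea, under the stated assumptions (state space [0,1]^d, finite actions, known density p and reward r): draw random base points, then run value iteration on them with self-normalized importance weights in place of the exact expectation. Names and the normalization choice are illustrative, not Rust's exact algorithm.

```python
import numpy as np

def random_discretization_vi(p, r, actions, d, n_points=200,
                             gamma=0.95, n_iters=100, rng=None):
    """Value iteration on random base points with weighted (importance) sampling."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = rng.uniform(size=(n_points, d))                 # random base points in [0,1]^d
    V = np.zeros(n_points)
    for _ in range(n_iters):
        V_new = np.empty(n_points)
        for i, x in enumerate(X):
            q_values = []
            for a in actions:
                w = np.array([p(y, x, a) for y in X])   # density of each base point
                w = w / max(w.sum(), 1e-12)             # self-normalized weights
                q_values.append(r(x, a) + gamma * w @ V)
            V_new[i] = max(q_values)
        V = V_new
    return X, V   # values at other states can be computed lazily the same way
```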
Pegasus [Ng & Jordan ’00] • Idea: Policy search + method of common random numbers (“scenarios”) • Results: • Condition: Deterministic simulative model • Thm: Finite action space, finite-complexity policy class ⇒ polynomial sample complexity • Thm: Infinite action spaces, Lipschitz continuity of transition probabilities + rewards ⇒ polynomial sample complexity • Thm: Finitely computable models + policies ⇒ polynomial sample complexity • Pro: Nice results • Contra: Global search? What policy space? Problem 1: How to avoid global search? Problem 2: When can we find a good policy efficiently? How? Problem 3: How to choose the policy class?
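The “scenarios” trick can be sketched like this: pre-draw all randomness once, then every candidate policy is scored on exactly the same noise, so the estimated value is a deterministic function of the policy and ordinary optimizers can search over it. `step(x, a, u)` is an assumed deterministic simulative model taking the pre-drawn random number u; all names and sizes are illustrative.

```python
import numpy as np

def pegasus_value(policy, step, x0, scenarios, gamma=0.95):
    """Average discounted return of `policy` on a fixed set of noise scenarios."""
    total = 0.0
    for scenario in scenarios:          # scenario: the pre-drawn noise for one rollout
        x, ret, disc = x0, 0.0, 1.0
        for u in scenario:
            a = policy(x)
            r, x = step(x, a, u)        # deterministic given the noise u
            ret += disc * r
            disc *= gamma
        total += ret
    return total / len(scenarios)

# Draw the scenarios once; every policy candidate is evaluated on this same noise,
# e.g. 32 rollouts of horizon 50 with 4-dimensional noise per step.
scenarios = np.random.default_rng(0).uniform(size=(32, 50, 4))
```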
Other planning methods • Your favorite RL method! + Planning is easier than learning: you can reset the state! • Dyna-style planning with prioritized sweeping ( Rich!) • Conservative policy iteration • Problem: Policy search, guaranteed improvement in every iteration • [K&L ’00]: Bound for finite MDPs, policy class ≡ all policies • [K ’03]: Arbitrary policies, reduction-style result • Policy search by DP [Bagnell, Kakade, Ng & Schneider ’03] • Similar to [K ’03], finite-horizon problems • Fitted value iteration • ..
Model-free: Policy Search • ???? Open: How to do it?? (I am serious) Open: How to evaluate a policy/policy gradient given some samples? (partial result: In the limit, under some conditions, policies can be evaluated [AnSzeMu’08])
Model-free: Dynamic Programming • Policy iteration: How to evaluate policies? Do good value functions give rise to good policies? • Value iteration: Use action-value functions. How to represent value functions? How to do the updates?
Value-function based methods • Questions: • What representation to use? • How are errors propagated? • Averagers [Gordon ’95] ~ kernel methods • V_{t+1} = Π_F T V_t • L∞ theory • Can we have an L2 (Lp) theory? • Counterexamples [Boyan & Moore ’95, Baird ’95, BeTsi ’96] • L2 error propagation [Munos ’03, ’05]
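As an illustration of the averager idea, here is a sketch of fitted value iteration V_{t+1} = Π_F T V_t where Π_F is a k-nearest-neighbour averager over a fixed set of representative states (averagers are non-expansions in the sup norm, which is what the theory exploits). `sim(x, a)` is an assumed generative model returning `(reward, next_state)`; for brevity each backup uses a single sampled transition.

```python
import numpy as np

def knn_averager(X, k=5):
    """Projection onto a k-NN averager over the base points X (a sup-norm non-expansion)."""
    def project(values, query):
        dists = np.linalg.norm(X - query, axis=1)
        idx = np.argsort(dists)[:k]
        return values[idx].mean()
    return project

def fitted_vi_averager(X, sim, actions, k=5, gamma=0.95, n_iters=50):
    """Iterate V <- Pi_F T V, storing V only at the representative states X."""
    project = knn_averager(X, k)
    V = np.zeros(len(X))
    for _ in range(n_iters):
        V_new = np.empty(len(X))
        for i, x in enumerate(X):
            backups = []
            for a in actions:
                r, x_next = sim(x, a)                    # one sampled transition
                backups.append(r + gamma * project(V, x_next))
            V_new[i] = max(backups)
        V = V_new
    return V
```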
Fitted methods • Idea: Use regression/classification inside value/policy iteration • Notable examples: • Fitted Q-iteration • Use trees ( averagers; Damien!) • Use neural nets ( L2, Martin!) • Policy iteration: • LSTD [Bradtke & Barto ’96, Boyan ’99], BRM [AnSzeMu ’06, ’08] • LSPI: Use action-value functions + iterate [Lagoudakis & Parr ’01, ’03] • RL as classification [La & Pa ’03]
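A hedged sketch of fitted Q-iteration on a fixed batch, in the tree-based flavour of the “use trees” bullet: refit a regressor to the one-step Bellman targets at every iteration. The extra-trees regressor, the encoding of the action as an extra input feature, and all names are illustrative choices, not the cited papers' exact setup.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, actions, n_iters=50, gamma=0.95):
    """Fitted Q-iteration on a batch of (x, a, r, x_next) tuples (vector x, discrete a)."""
    X = np.array([np.append(x, a) for x, a, _, _ in transitions])   # features: [x, a]
    r = np.array([t[2] for t in transitions])
    x_next = np.array([t[3] for t in transitions])
    q = None
    for _ in range(n_iters):
        if q is None:
            targets = r                                  # first iteration: Q_1 = r
        else:
            q_next = np.column_stack([
                q.predict(np.column_stack([x_next, np.full(len(x_next), a)]))
                for a in actions])
            targets = r + gamma * q_next.max(axis=1)     # one-step Bellman targets
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q   # greedy policy: pick the action a maximizing q.predict on [x, a]
```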
Results for fitted algorithms • Results for LSPI/BRM-PI, FQI: • Finite action-, continuous state-space • Smoothness conditions on the MDP • Representative training set • Function class (F) large (Bellman error of F is small), but of controlled complexity ⇒ polynomial rates (similar to supervised learning) • FQI, continuous action spaces: • Similar conditions + restricted policy class ⇒ polynomial rates, but bad scaling with the dimension of the action space Open: How to choose the function space in an adaptive way? ~ model selection in supervised learning Supervised learning does not work without model selection. Why would RL work? NO, IT DOES NOT. Idea: Regularize! Problem: How to evaluate policies? [AnSzeMu ’06-’08]
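One concrete way to act on the “Regularize!” idea for the policy-evaluation step is ridge-regularized LSTD: penalize the weight vector when solving the LSTD fixed-point equations. This is only a sketch under assumed inputs (feature matrices of the visited states and of their successors under the evaluated policy); it is not the specific estimator of [AnSzeMu ’06-’08].

```python
import numpy as np

def regularized_lstd(phi, phi_next, rewards, gamma=0.95, lam=1e-2):
    """L2-regularized LSTD: solve (A + lam*I) theta = b for the value-function weights."""
    A = phi.T @ (phi - gamma * phi_next)   # LSTD matrix
    b = phi.T @ rewards
    theta = np.linalg.solve(A + lam * np.eye(phi.shape[1]), b)
    return theta                           # V(x) is approximated by phi(x) @ theta
```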
Final thoughts • Batch RL: Flourishing area • Many open questions • More should come soon! • Some good results in practice • Take computation cost seriously? • Connect to on-line RL?
Batch RL Let’s switch to that policy – after all the paper says that learning converges at an optimal rate!