Fitted/batch/model-based RL: A (sketchy, biased) overview(?) Csaba Szepesvári, University of Alberta
Contents • What, why? • Constraints • How? • Model-based learning • Model learning • Planning • Model-free learning • Averagers • Fitted RL
Motto “Nothing is more practical than a good theory” [Lewin] “He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.” [Leonardo da Vinci]
What? Why? • What is batch RL? • Input: Samples (the algorithm cannot influence the samples) • Output: A good policy • Why? • Common problem • Sample efficiency – data is expensive • Building block • Why not? • Too much work (for nothing?) – “Don’t worry, be lazy!” • Old samples are irrelevant • Missed opportunities (evaluate a policy!?)
Constraints • Large (infinite) state/action space • Limits on • Computation • Memory use
How? • Model learning + planning • Model free • Policy search • DP • Policy iteration • Value iteration
Model-based methods • Model learning: How? • Model: What happens if …? • Features vs. observations vs. states • System identification? ( Satinder! Carlos! Eric! …) • Planning: How? • Sample + learn! (batch RL? ..but here you can influence the samples) • What else? (Discretize? Nay..) • Pro: A model is good for multiple things • Contra: The problem is doubled: we need high-fidelity models and good planning Problem 1: Should planning take into account the uncertainties in the model? (“robustification”) Problem 2: How to learn relevant, compact models? For example: how to reject the irrelevant features and keep the relevant ones? Need: Tight integration of planning and learning!
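To make the model-based route concrete, here is a minimal sketch of “learn a model from a batch, then plan with it” for a finite MDP. It assumes integer-coded states and actions and a batch of (s, a, r, s') tuples; all names are illustrative, and unvisited state-action pairs simply keep zero transition mass.

```python
import numpy as np

def learn_tabular_model(transitions, n_states, n_actions):
    """Estimate transition probabilities P and mean rewards R from (s, a, r, s2) samples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        reward_sum[s, a] += r
        visits[s, a] += 1
    P = counts / np.maximum(visits[..., None], 1)   # unvisited (s, a): all-zero row
    R = reward_sum / np.maximum(visits, 1)
    return P, R

def plan_value_iteration(P, R, gamma=0.95, n_iters=200):
    """Plan in the learned model: standard value iteration on (P, R)."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        V = (R + gamma * P @ V).max(axis=1)         # Bellman optimality backup
    return V
```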
Bad news.. • Theorem (Chow, Tsitsiklis ’89) • Markovian decision problems • d-dimensional state space • Bounded transition probabilities, rewards • Lipschitz-continuous transition probabilities and rewards Any algorithm computing an ε-approximation of the optimal value function needs Ω(ε^{-d}) values of p and r. • What’s next then?? • Open: Policy approximation?
The joy of laziness • Don’t worry, be lazy: • “If something is too hard to do, then it's not worth doing” • Luckiness factor: • “If you really want something in this life, you have to work for it - Now quiet, they're about to announce the lottery numbers!”
Sparse lookahead trees [Kearns et al., ’02] • Idea: Computing a good action ≡ planning → build a lookahead tree • Size of the tree: S = c·|A|^{H(ε)} (unavoidable), where H(ε) = K_r/(ε(1-γ)) • Good news: S is independent of d! • Bad news: S is exponential in H(ε) • Still attractive: Generic, easy to implement • Problem: Not really practical
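A minimal sketch of the sparse-sampling planner behind these bounds, assuming access to a generative model `sim(state, action)` that returns `(reward, next_state)`; the function and parameter names are illustrative. The recursion makes the |A|·n branching per level, and hence the exponential-in-depth tree size, explicit.

```python
def sparse_sampling_q(sim, state, actions, depth, n_samples, gamma):
    """Estimate Q(state, a) by recursively building a sparse lookahead tree."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(n_samples):                      # n_samples children per action
            r, next_state = sim(state, a)
            child_q = sparse_sampling_q(sim, next_state, actions,
                                        depth - 1, n_samples, gamma)
            total += r + gamma * max(child_q.values())  # back up the best child value
        q[a] = total / n_samples
    return q

def sparse_sampling_action(sim, state, actions, depth=3, n_samples=4, gamma=0.95):
    """Return the greedy action at the root of the sparse tree."""
    q = sparse_sampling_q(sim, state, actions, depth, n_samples, gamma)
    return max(q, key=q.get)
```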
Idea.. ( Remi!) • Be more lazy • Need to propagate values from good leaves as early as possible • Why sample suboptimal actions at all? • Breadth-first → depth-first! • Bandit algorithms → Upper Confidence Bounds → UCT • Similar ideas: • [Peret and Garcia, ’04] • [Chang et al., ’05] • [KoSze ’06]
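The bandit view can be sketched as follows: each tree node keeps per-action counts and mean returns and selects actions with a UCB1-style bonus, which is the core of UCT. This is only a sketch; the class and constant names are assumptions, and the tree-building, rollout, and backup logic around it are omitted.

```python
import math
import random

class UCTNode:
    """Per-node action statistics with UCB1-style selection (the heart of UCT)."""

    def __init__(self, actions):
        self.counts = {a: 0 for a in actions}
        self.means = {a: 0.0 for a in actions}   # running mean of sampled returns

    def select_action(self, c=1.4):
        untried = [a for a, n in self.counts.items() if n == 0]
        if untried:
            return random.choice(untried)        # try every action once first
        total = sum(self.counts.values())
        return max(self.counts, key=lambda a: self.means[a]
                   + c * math.sqrt(math.log(total) / self.counts[a]))

    def update(self, action, ret):
        """Incrementally update the mean return of the chosen action."""
        self.counts[action] += 1
        self.means[action] += (ret - self.means[action]) / self.counts[action]
```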
Results: Sailing • ‘Sailing’: Stochastic shortest path • State-space size = 24 × problem size • Extension to two-player, full-information games • Good results in Go! ( Remi, David!) Open: Why (when) does UCT work so well? Conjecture: When being (very) optimistic does not abuse the search How to improve UCT?
Random Discretization Method [Rust ’97] • Method: • Random base points • Value function computed at these points (weighted importance sampling) • Compute values at other points at run-time (“half-lazy method”) • Why Monte Carlo? Avoid grids! • Result: • State space: [0,1]^d • Action space: finite • p(y|x,a), r(x,a) Lipschitz continuous, bounded • Theorem [Rust ’97]: The randomized method computes an ε-approximation of the optimal value function with complexity polynomial in d (randomization breaks the curse of dimensionality for this class). • Theorem [Sze ’01]: Polynomially many samples are enough to come up with ε-optimal actions (poly dependence on H). Smoothness of the value function is not required. Open: Can we improve the result by changing the distribution of samples? Idea: Presample + follow the obtained policy Open: Can we get poly dependence on both d and H without representing a value function? (e.g. lookahead trees)
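A minimal sketch of the random-grid idea, under the stated assumptions (state space [0,1]^d, finite actions, known density p and reward r): draw random base points, then run value iteration on them with self-normalized importance weights in place of the exact expectation. Names and the normalization choice are illustrative, not Rust's exact algorithm.

```python
import numpy as np

def random_discretization_vi(p, r, actions, d, n_points=200,
                             gamma=0.95, n_iters=100, rng=None):
    """Value iteration on random base points with weighted (importance) sampling."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = rng.uniform(size=(n_points, d))                 # random base points in [0,1]^d
    V = np.zeros(n_points)
    for _ in range(n_iters):
        V_new = np.empty(n_points)
        for i, x in enumerate(X):
            q_values = []
            for a in actions:
                w = np.array([p(y, x, a) for y in X])   # density of each base point
                w = w / max(w.sum(), 1e-12)             # self-normalized weights
                q_values.append(r(x, a) + gamma * w @ V)
            V_new[i] = max(q_values)
        V = V_new
    return X, V   # values at other states can be computed lazily the same way
```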
Pegasus [Ng & Jordan ’00] • Idea: Policy search + method of common random numbers (“scenarios”) • Results: • Condition: Deterministic simulative model • Thm: Finite action space, finite-complexity policy class ⇒ polynomial sample complexity • Thm: Infinite action spaces, Lipschitz continuity of transition probabilities + rewards ⇒ polynomial sample complexity • Thm: Finitely computable models + policies ⇒ polynomial sample complexity • Pro: Nice results • Contra: Global search? What policy space? Problem 1: How to avoid global search? Problem 2: When can we find a good policy efficiently? How? Problem 3: How to choose the policy class?
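The “scenarios” trick can be sketched like this: pre-draw all randomness once, then every candidate policy is scored on exactly the same noise, so the estimated value is a deterministic function of the policy and ordinary optimizers can search over it. `step(x, a, u)` is an assumed deterministic simulative model taking the pre-drawn random number u; all names and sizes are illustrative.

```python
import numpy as np

def pegasus_value(policy, step, x0, scenarios, gamma=0.95):
    """Average discounted return of `policy` on a fixed set of noise scenarios."""
    total = 0.0
    for scenario in scenarios:          # scenario: the pre-drawn noise for one rollout
        x, ret, disc = x0, 0.0, 1.0
        for u in scenario:
            a = policy(x)
            r, x = step(x, a, u)        # deterministic given the noise u
            ret += disc * r
            disc *= gamma
        total += ret
    return total / len(scenarios)

# Draw the scenarios once; every policy candidate is evaluated on this same noise,
# e.g. 32 rollouts of horizon 50 with 4-dimensional noise per step.
scenarios = np.random.default_rng(0).uniform(size=(32, 50, 4))
```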
Other planning methods • Your favorite RL method! + Planning is easier than learning: you can reset the state! • Dyna-style planning with prioritized sweeping ( Rich!) • Conservative policy iteration • Problem: Policy search, guaranteed improvement in every iteration • [K&L ’00]: Bound for finite MDPs, policy class ≡ all policies • [K ’03]: Arbitrary policies, reduction-style result • Policy search by DP [Bagnell, Kakade, Ng & Schneider ’03] • Similar to [K ’03], finite-horizon problems • Fitted value iteration • ..
Model-free: Policy Search • ???? Open: How to do it?? (I am serious) Open: How to evaluate a policy/policy gradient given some samples? (partial result: In the limit, under some conditions, policies can be evaluated [AnSzeMu’08])
Model-free: Dynamic Programming • Policy iteration: How to evaluate policies? Do good value functions give rise to good policies? • Value iteration: Use action-value functions. How to represent value functions? How to do the updates?
Value-function based methods • Questions: • What representation to use? • How are errors propagated? • Averagers [Gordon ’95] ~ kernel methods • V_{t+1} = Π_F T V_t • L∞ theory • Can we have an L2 (Lp) theory? • Counterexamples [Boyan & Moore ’95, Baird ’95, BeTsi ’96] • L2 error propagation [Munos ’03, ’05]
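As an illustration of the averager idea, here is a sketch of fitted value iteration V_{t+1} = Π_F T V_t where Π_F is a k-nearest-neighbour averager over a fixed set of representative states (averagers are non-expansions in the sup norm, which is what the theory exploits). `sim(x, a)` is an assumed generative model returning `(reward, next_state)`; for brevity each backup uses a single sampled transition.

```python
import numpy as np

def knn_averager(X, k=5):
    """Projection onto a k-NN averager over the base points X (a sup-norm non-expansion)."""
    def project(values, query):
        dists = np.linalg.norm(X - query, axis=1)
        idx = np.argsort(dists)[:k]
        return values[idx].mean()
    return project

def fitted_vi_averager(X, sim, actions, k=5, gamma=0.95, n_iters=50):
    """Iterate V <- Pi_F T V, storing V only at the representative states X."""
    project = knn_averager(X, k)
    V = np.zeros(len(X))
    for _ in range(n_iters):
        V_new = np.empty(len(X))
        for i, x in enumerate(X):
            backups = []
            for a in actions:
                r, x_next = sim(x, a)                    # one sampled transition
                backups.append(r + gamma * project(V, x_next))
            V_new[i] = max(backups)
        V = V_new
    return V
```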
Fitted methods • Idea: Use regression/classification inside value/policy iteration • Notable examples: • Fitted Q-iteration • Use trees ( averagers; Damien!) • Use neural nets ( L2, Martin!) • Policy iteration: • LSTD [Bradtke & Barto ’96, Boyan ’99], BRM [AnSzeMu ’06, ’08] • LSPI: Use action-value functions + iterate [Lagoudakis & Parr ’01, ’03] • RL as classification [La & Pa ’03]
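A hedged sketch of fitted Q-iteration on a fixed batch, in the tree-based flavour of the “use trees” bullet: refit a regressor to the one-step Bellman targets at every iteration. The extra-trees regressor, the encoding of the action as an extra input feature, and all names are illustrative choices, not the cited papers' exact setup.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, actions, n_iters=50, gamma=0.95):
    """Fitted Q-iteration on a batch of (x, a, r, x_next) tuples (vector x, discrete a)."""
    X = np.array([np.append(x, a) for x, a, _, _ in transitions])   # features: [x, a]
    r = np.array([t[2] for t in transitions])
    x_next = np.array([t[3] for t in transitions])
    q = None
    for _ in range(n_iters):
        if q is None:
            targets = r                                  # first iteration: Q_1 = r
        else:
            q_next = np.column_stack([
                q.predict(np.column_stack([x_next, np.full(len(x_next), a)]))
                for a in actions])
            targets = r + gamma * q_next.max(axis=1)     # one-step Bellman targets
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q   # greedy policy: pick the action a maximizing q.predict on [x, a]
```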
Results for fitted algorithms • Results for LSPI/BRM-PI, FQI: • Finite action-, continuous state-space • Smoothness conditions on the MDP • Representative training set • Function class (F) large (Bellman error of F is small), but of controlled complexity ⇒ polynomial rates (similar to supervised learning) • FQI, continuous action spaces: • Similar conditions + restricted policy class ⇒ polynomial rates, but bad scaling with the dimension of the action space Open: How to choose the function space in an adaptive way? ~ model selection in supervised learning Supervised learning does not work without model selection. Why would RL work? NO, IT DOES NOT. Idea: Regularize! Problem: How to evaluate policies? [AnSzeMu ’06-’08]
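One concrete way to act on the “Regularize!” idea for the policy-evaluation step is ridge-regularized LSTD: penalize the weight vector when solving the LSTD fixed-point equations. This is only a sketch under assumed inputs (feature matrices of the visited states and of their successors under the evaluated policy); it is not the specific estimator of [AnSzeMu ’06-’08].

```python
import numpy as np

def regularized_lstd(phi, phi_next, rewards, gamma=0.95, lam=1e-2):
    """L2-regularized LSTD: solve (A + lam*I) theta = b for the value-function weights."""
    A = phi.T @ (phi - gamma * phi_next)   # LSTD matrix
    b = phi.T @ rewards
    theta = np.linalg.solve(A + lam * np.eye(phi.shape[1]), b)
    return theta                           # V(x) is approximated by phi(x) @ theta
```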
Final thoughts • Batch RL: Flourishing area • Many open questions • More should come soon! • Some good results in practice • Take computation cost seriously? • Connect to on-line RL?
Batch RL Let’s switch to that policy – after all the paper says that learning converges at an optimal rate!