Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

Presentation Transcript

  1. Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs. István Szita & András Lőrincz, University of Alberta, Canada / Eötvös Loránd University, Hungary

  2. Outline • Factored MDPs • motivation • definitions • planning in FMDPs • Optimism • Optimism & FMDPs & Model-based learning

  3. Reinforcement learning • the agent makes decisions • … in an unknown world • makes some observations (including rewards) • tries to maximize collected reward

  4. What kind of observation? • structured observations • structure is unclear

  5. How to “solve an RL task”? • a model is useful • can reuse experience from previous trials • can learn offline • observations are structured • structure is unknown • structured + model + RL = FMDP! • (or linear dynamical systems, neural networks, etc.)

  6. Factored MDPs • ordinary MDPs, except that everything is factored: • states • rewards • transition probabilities • (value functions)

  7. Factored state space • all functions depend on a few variables only
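In the notation used on the following slides, the state is a vector of m variables, and "depends on a few variables only" means every quantity of interest has a small scope:

$$X = X_1 \times X_2 \times \cdots \times X_m, \qquad f(x) = f\bigl(x[\Gamma_f]\bigr), \quad |\Gamma_f| = O(1),$$

where $x[\Gamma_f]$ denotes the restriction of the state $x$ to the variable subset $\Gamma_f$.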

  8. Factored dynamics
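The transition model decomposes into per-variable factors, each conditioned only on its parent set (the standard dynamic Bayesian network form):

$$P(y \mid x, a) = \prod_{i=1}^{m} P_i\bigl(y_i \,\big|\, x[\Gamma_i], a\bigr),$$

so each factor $P_i$ is a small table over the parents $\Gamma_i$ of variable $i$.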

  9. Factored rewards
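Likewise, the reward is a sum of local terms with small scopes:

$$R(x, a) = \sum_{j} R_j\bigl(x[Z_j], a\bigr).$$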

  10. (Factored value functions) • V* is not factored in general • we will make an approximation error
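Since $V^*$ itself has no small-scope structure in general, the algorithms below search inside the span of a fixed factored basis:

$$\tilde V(x) = (Hw)(x) = \sum_{k} w_k\, h_k\bigl(x[\Gamma_{h_k}]\bigr),$$

and the gap between $V^*$ and this span is exactly the approximation error mentioned above.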

  11. Solving a known FMDP
  • NP-hard, so any method is either exponential-time or non-optimal…
  • exponential time in the worst case:
    • flattening the FMDP
    • approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
  • non-optimal solutions (approximating the value function in a factored form):
    • approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
    • ALP + policy iteration [Guestrin et al., 2002]
    • factored value iteration [Szita & Lőrincz, 2008]

  12. Factored value iteration • H := matrix of basis functions; N(H^T) := row-normalization of H^T • the iteration w_{t+1} := N(H^T) · T(H w_t), with T the Bellman update, converges to a fixed point w† • each iteration can be computed quickly for FMDPs • let V† = H w†; then V† has bounded error relative to V*
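A minimal flattened sketch of this iteration (names like P, R, gamma are illustrative assumptions; real FVI computes both $Hw$ and the Bellman backup factor-wise, without ever enumerating the state space):

```python
import numpy as np

def row_normalize(M):
    """N(.): scale each row to unit L1 norm; this makes the projection a
    max-norm non-expansion, which is what guarantees convergence."""
    norms = np.abs(M).sum(axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0
    return M / norms

def factored_value_iteration(H, P, R, gamma=0.95, iters=200):
    """Iterate w <- N(H^T) T(H w), with T the Bellman optimality operator.
    H: |S| x k basis matrix, P[a]: |S| x |S| transition matrix of action a,
    R[a]: per-state reward vector of action a."""
    G = row_normalize(H.T)                       # the projection N(H^T)
    w = np.zeros(H.shape[1])
    for _ in range(iters):
        V = H @ w                                # current value estimate
        Q = np.stack([R[a] + gamma * (P[a] @ V) for a in range(len(P))])
        w = G @ Q.max(axis=0)                    # project the Bellman backup
    return w

# toy 2-state, 2-action MDP with the trivial (exact) basis
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.5, 0.5]])]
R = [np.array([1.0, 0.0]), np.array([0.0, 0.5])]
H = np.eye(2)
print("V_dagger =", H @ factored_value_iteration(H, P, R))
```

The row-normalization is the design point: it makes the projection a max-norm non-expansion, so the composed operator stays a contraction even though H is not orthogonal.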

  13. Learning in unknown FMDPs • unknown factor decompositions (structure) • unknown rewards • unknown transitions (dynamics)

  14. Learning in unknown FMDPs • unknown factor decompositions (structure) • unknown rewards • unknown transitions (dynamics)

  15. Outline • Factored MDPs • motivation • definitions • planning in FMDPs • Optimism • Optimism & FMDPs & Model-based learning

  16. Learning in an unknown FMDP, a.k.a. “Explore or exploit?” • after trying a few action sequences… • …try to discover better ones? • …or do the best thing according to current knowledge?

  17. Be Optimistic! (when facing uncertainty)

  18. either you get experience…

  19. …or you get reward!

  20. Outline • Factored MDPs • motivation • definitions • planning in FMDPs • Optimism • Optimism & FMDPs & Model-based learning

  21. Factored Initial Model (DBN diagram: component x1 has parents (x1, x3); component x2 has parent (x2); …)

  22. Factored Optimistic Initial Model • add a “Garden of Eden” state with a +$10000 reward (or something very high) (same DBN diagram: component x1 has parents (x1, x3); component x2 has parent (x2); …)

  23. Later on… • according to the initial model, all states have a very high value • in frequently visited states, the model becomes more realistic → reward expectations get lower → the agent explores other areas (same DBN diagram as before)

  24. Factored optimistic initial model • initialize the model (optimistically) • for each time step t: • solve the approximate model using factored value iteration • take the greedy action, observe the next state • update the model • the number of non-near-optimal steps (w.r.t. V†) is polynomial, with probability ≈ 1 (see the sketch below)
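A runnable flat (non-factored) sketch of the mechanism driving this loop; the class and constant names are illustrative assumptions, not the authors' API. Every state-action pair starts with one phantom transition into the Garden of Eden state, and real experience gradually dilutes that phantom mass, which is why purely greedy action selection still explores:

```python
import numpy as np

EDEN = 0            # reserved index for the "Garden of Eden" state
V_EDEN = 10_000.0   # the very high reward promised there

class OptimisticModel:
    """Flat illustration of optimistic initialization: every (state, action)
    pair starts with one phantom transition into the Eden state."""

    def __init__(self, n_states, n_actions):
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.counts[:, :, EDEN] = 1.0            # the phantom visit
        self.reward_sum = np.zeros((n_states, n_actions))

    def update(self, x, a, y, r):
        """Real experience dilutes the phantom Eden mass."""
        self.counts[x, a, y] += 1.0
        self.reward_sum[x, a] += r

    def transition_probs(self, x, a):
        return self.counts[x, a] / self.counts[x, a].sum()

    def expected_reward(self, x, a):
        # the Eden share of the probability mass carries the huge bonus
        total = self.counts[x, a].sum()
        return (self.reward_sum[x, a]
                + self.counts[x, a, EDEN] * V_EDEN) / total
```

Early on, transition_probs puts nearly all mass on EDEN, so a greedy planner is drawn toward poorly known pairs; after many real visits the phantom count becomes negligible and the model approaches the empirical one.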

  25. elements of proof: some standard stuff • if the approximate model is close to the true one, then the corresponding value functions are close (simulation lemma) • if each factor satisfies $\|\hat P_i - P_i\|_1 \le \epsilon/m$ for all i, then the compound model satisfies $\|\hat P - P\|_1 \le \epsilon$ • let $m_i$ be the number of visits to a given parent configuration of factor i; if $m_i$ is large, then $\hat P_i(y_i \mid \cdot)$ is close to $P_i(y_i \mid \cdot)$ for all $y_i$ • more precisely: $|\hat P_i(y_i \mid \cdot) - P_i(y_i \mid \cdot)| \le \sqrt{\ln(2/\delta)/(2 m_i)}$ with prob. $\ge 1-\delta$ (Hoeffding/Azuma inequality)

  26. elements of proof: main lemma • for any t, the approximate Bellman updates are more optimistic than the real ones: the bonus promised by the Garden of Eden state outweighs the sampling deviation, which is lower-bounded via Azuma’s inequality • if $V_E$ is large enough, the bonus term dominates for a long time • if all elements of H are nonnegative, the projection preserves optimism
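Read as an inequality, the lemma says roughly the following (a hedged reconstruction; the constants and exact notation here are assumptions, not the slide's):

$$\hat T V(x) \;\ge\; T V(x) \quad \text{whenever} \quad \underbrace{\gamma\, \hat P\bigl(x^E \mid x, a\bigr)\, V_E}_{\text{Garden of Eden bonus}} \;\ge\; \underbrace{O\!\Bigl(\sqrt{\ln(1/\delta)\,/\,m(x,a)}\Bigr)}_{\text{deviation, by Azuma}},$$

where $m(x,a)$ counts the visits to $(x,a)$ and $x^E$ is the all-Eden state.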

  27. elements of proof: wrap-up • for a long time, Vt is optimistic enough to boost exploration • at most polynomially many exploration steps can be made • apart from those, the agent must be near-V†-optimal

  28. Previous approaches • extensions of E³, Rmax, and MBIE to FMDPs • using the current model, make a smart plan (explore or exploit) • explore: make the model more accurate • exploit: collect near-optimal reward • unspecified planners • requirement: the output plan is close-to-optimal • …e.g., solve the flat MDP • polynomial sample complexity • but exponential amounts of computation!

  29. Unknown rewards? • “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” This is false! • problem: the reward components cannot be observed, only their sum • → UAI poster [Walsh, Szita, Diuk & Littman, 2009]

  30. Unknown structure? • can be learnt in polynomial time • SLF-Rmax [Strehl, Diuk & Littman, 2007] • Met-Rmax [Diuk, Li & Leffler, 2009]

  31. Take-home message: if your model starts out optimistically enough, you get efficient exploration for free! (even if your planner is non-optimal, as long as it is monotonic)

  32. Thank you for your attention!

  33. Optimistic initial model for FMDPs • add a “Garden of Eden” value to each state variable • add reward factors for each state variable • init the transition model (a sketch follows below)
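One concrete way to realize these three steps, written as a sketch (the per-variable split of $V_E$ and the deterministic initial transition are illustrative assumptions, not necessarily the paper's exact choice):

$$X_i \leftarrow X_i \cup \{e_i\}, \qquad \hat P_i\bigl(y_i = e_i \,\big|\, x[\Gamma_i], a\bigr) = 1 \;\text{ before any visit}, \qquad \hat R_i(x_i) = \begin{cases} V_E / m & \text{if } x_i = e_i,\\ 0 & \text{otherwise,}\end{cases}$$

so that until real data accumulates, every action seems to lead to the all-Eden state, whose factored reward sums to $V_E$.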
