Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs


**Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs**

István Szita (University of Alberta, Canada) and András Lőrincz (Eötvös Loránd University, Hungary)

**Outline**

- Factored MDPs
  - motivation
  - definitions
  - planning in FMDPs
- Optimism
- Optimism & FMDPs & model-based learning

**Reinforcement learning**

- the agent makes decisions…
- …in an unknown world
- it makes observations (including rewards)
- it tries to maximize the collected reward

**What kind of observation?**

- observations are structured
- but the structure is unclear

**How to "solve an RL task"?**

- a model is useful: it lets the agent reuse experience from previous trials and learn offline
- observations are structured, but the structure is unknown
- structured observations + a model + RL = FMDPs!
- (alternatives: linear dynamical systems, neural networks, etc.)

**Factored MDPs**

- ordinary MDPs in which everything is factored:
  - states
  - rewards
  - transition probabilities
  - (value functions)

**Factored state space**

- the state is a vector of variables, and all functions depend on only a few of the variables

**Factored dynamics**

- each next-state variable depends only on a small set of parent variables:
  P(x′ | x, a) = ∏ᵢ Pᵢ(x′ᵢ | x[Γᵢ], a)

**Factored rewards**

- the reward is a sum of local terms, each depending on a few variables:
  R(x, a) = ∑ⱼ Rⱼ(x[Zⱼ], a)

**(Factored value functions)**

- V* is not factored in general
- so approximating it in factored form incurs an approximation error

**Solving a known FMDP**

- NP-hard: any method is either exponential-time or non-optimal
- exponential-time worst case:
  - flattening the FMDP
  - approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
- non-optimal solutions (approximating the value function in a factored form):
  - approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
  - ALP + policy iteration [Guestrin et al., 2002]
  - factored value iteration [Szita & Lőrincz, 2008]

**Factored value iteration**

- H := matrix of basis functions; N(Hᵀ) := row-normalization of Hᵀ
- the iteration w ← N(Hᵀ) · B(Hw), where B is the Bellman backup, converges to a fixed point w̄
- each step can be computed quickly for FMDPs
- let V̄ = Hw̄; then V̄ has a bounded error relative to V*

**Learning in unknown FMDPs**

- unknown factor decompositions (structure)
- unknown rewards
- unknown transitions (dynamics)

**Learning in an unknown FMDP, a.k.a. "Explore or exploit?"**

- after trying a few action sequences…
- …should the agent try to discover better ones?
- …or do the best thing according to its current knowledge?
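The dilemma above is usually resolved with explicit exploration bonuses or epsilon-greedy noise; the talk's point is that optimistic initialization alone already makes a purely greedy agent explore. A minimal tabular sketch (a toy illustration, not the paper's factored algorithm; the chain MDP and all constants are invented for the example):

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.5  # toy chain: reward only at the right end

def step(s, a):
    # action 0 moves left, action 1 moves right; reward 1 for being at the end
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    return s2, float(s2 == n_states - 1)

def run(q_init, episodes=200, horizon=10):
    Q = np.full((n_states, 2), q_init)
    total = 0.0
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            a = int(np.argmax(Q[s]))  # purely greedy: no epsilon, no bonus
            s2, reward = step(s, a)
            Q[s, a] += alpha * (reward + gamma * Q[s2].max() - Q[s, a])
            total += reward
            s = s2
    return total

# run(0.0): with pessimistic zeros the greedy agent never leaves the start
# state and earns nothing; run(10.0): inflated initial values are talked
# down only at visited state-action pairs, pulling the agent through
# untried actions until it finds the rewarding right end.
```

Every zero-reward update drags a visited Q-value below the untried ones, so optimism converts greediness into systematic exploration: "either you get experience, or you get reward."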
Szita & Lőrincz: Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

**Be Optimistic! (when facing uncertainty)**

- either you get experience…
- …or you get reward!

**Factored Initial Model**

- component x1: parents (x1, x3); component x2: parent (x2); …

**Factored Optimistic Initial Model**

- add a "Garden of Eden" state with a +$10,000 reward (or some other very high value)
- component x1: parents (x1, x3); component x2: parent (x2); …

**Later on…**

- according to the initial model, all states have a high value
- in frequently visited states the model becomes more realistic → reward expectations drop → the agent explores other areas

**Factored optimistic initial model: the algorithm**

- initialize the model (optimistically)
- for each time step t:
  - solve the approximate model using factored value iteration
  - take the greedy action and observe the next state
  - update the model
- the number of non-near-optimal steps (w.r.t. V̄) is polynomial, with probability close to 1

**Elements of proof: some standard stuff**

- if each estimated transition factor is close to the true one, then the composed transition model is close as well (the errors accumulate at most additively across factors)
- let mᵢ be the number of visits to the parent configuration of factor i; if mᵢ is large, the estimated probabilities are accurate for all values yᵢ
- more precisely, the estimates are within ε of the truth with high probability (Hoeffding/Azuma inequality)

**Elements of proof: main lemma**

- the approximate Bellman updates remain more optimistic than the real ones: the estimation error is lower-bounded via Azuma's inequality, and the bonus promised by the Garden of Eden state covers it
- if the Garden of Eden value V_E is large enough, the bonus term dominates for a long time
- if all elements of H are nonnegative, the projection preserves optimism

**Elements of proof: wrap-up**

- for a long time, Vₜ is optimistic enough to boost exploration
- at most polynomially many exploration steps can be made
- in all remaining steps, the agent must be near-V̄-optimal

**Previous approaches**

- extensions of E3, Rmax, and MBIE to FMDPs
- using the current model, make a smart plan (explore or exploit):
  - explore: make the model more accurate
  - exploit: collect near-optimal reward
- the planner is left unspecified; the requirement is that the output plan is close to optimal, e.g., by solving the flat MDP
- polynomial sample complexity, but exponential amounts of computation!

**Unknown rewards?**

- "To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function." This claim is false.
- the problem: the agent cannot observe the reward components, only their sum
- → UAI poster [Walsh, Szita, Diuk & Littman, 2009]

**Unknown structure?**

- the structure can be learnt in polynomial time:
  - SLF-Rmax [Strehl, Diuk & Littman, 2007]
  - Met-Rmax [Diuk, Li & Littman, 2009]

**Take-home message**

- if your model starts out optimistically enough, you get efficient exploration for free!
- (even if your planner is non-optimal, as long as it is monotonic)

**Optimistic initial model for FMDPs**

- add a "Garden of Eden" value to each state variable
- add reward factors for each state variable
- initialize the transition model
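The planning step used inside the algorithm, factored value iteration, projects the Bellman backup through the row-normalized basis: w ← N(Hᵀ) · B(Hw). A sketch on a small flat MDP (the random MDP, the basis size, and the extra row-normalization of H are illustrative assumptions; H's rows are normalized here so that the composite update is a max-norm contraction and the loop provably converges):

```python
import numpy as np

# Small flat MDP standing in for the (exponentially large) factored state space.
rng = np.random.default_rng(0)
n_states, n_basis, n_actions, gamma = 20, 4, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a] row-stochastic
r = rng.random((n_actions, n_states))                             # r[a, s]

# Nonnegative basis matrix H, rows normalized to sum to 1 so that both H and
# the projection G below are max-norm non-expansions.
H = rng.random((n_states, n_basis))
H /= H.sum(axis=1, keepdims=True)
G = H.T / np.abs(H.T).sum(axis=1, keepdims=True)  # N(H^T): L1 row-normalization

w = np.zeros(n_basis)
for it in range(1000):
    V = H @ w                          # current value estimate V = Hw
    Q = r + gamma * (P @ V)            # Bellman backup, per action
    w_new = G @ Q.max(axis=0)          # project the greedy backup: N(H^T) B(Hw)
    if np.max(np.abs(w_new - w)) < 1e-10:
        break                          # reached the fixed point w̄
    w = w_new
```

Since the rows of G (and of H) are nonnegative and sum to 1, the projection maps optimistic backups to optimistic weights, the property the main lemma relies on; in a real FMDP the product G · B(Hw) is computed factor-by-factor rather than over the flat state space.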