Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs. István Szita & András Lőrincz. University of Alberta, Canada. Eötvös Loránd University, Hungary.

## Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs


**Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs**
István Szita & András Lőrincz
University of Alberta, Canada; Eötvös Loránd University, Hungary

**Outline**
• Factored MDPs
  • motivation
  • definitions
  • planning in FMDPs
• Optimism
• Optimism & FMDPs & Model-based learning

Szita & Lőrincz: Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

**Reinforcement learning**
• the agent makes decisions
• … in an unknown world
• makes some observations (including rewards)
• tries to maximize collected reward

**What kind of observation?**
• structured observations
• structure is unclear ???

**How to “solve an RL task”?**
• a model is useful
  • can reuse experience from previous trials
  • can learn offline
• observations are structured
  • structure is unknown
• structured + model + RL = FMDP!
  • (or linear dynamical systems, neural networks, etc.)

**Factored MDPs**
• ordinary MDPs
• everything is factored
  • states
  • rewards
  • transition probabilities
  • (value functions)

**Factored state space**
• all functions depend on a few variables only

**Factored dynamics**

**Factored rewards**

**(Factored value functions)**
• V* is not factored in general
• we will make an approximation error

**Solving a known FMDP**
• NP-hard
• either exponential-time or non-optimal…
• exponential-time worst case
  • flattening the FMDP
  • approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden, Goldszmidt, 2000]
• non-optimal solution (approximating value function in a factored form)
  • approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
  • ALP + policy iteration [Guestrin et al., 2002]
  • factored value iteration [Szita & Lőrincz, 2008]

**Factored value iteration**
$H$ := matrix of basis functions
$N(H^T)$ := row-normalization of $H^T$
• the iteration […] converges to a fixed point $w^\times$
• can be computed quickly for FMDPs
• Let $V^\times = Hw^\times$. Then $V^\times$ has bounded error: […]

**Learning in unknown FMDPs**
• unknown factor decompositions (structure)
• unknown rewards
• unknown transitions (dynamics)

**Learning in an unknown FMDP, a.k.a. “Explore or exploit?”**
• after trying a few action sequences…
  • … try to discover better ones?
  • … do the best thing according to current knowledge?
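The earlier slides say that in a factored MDP every function depends on only a few variables. The following is a minimal Python sketch of what that means for the dynamics: the probability of the next joint state is a product of per-component conditional probabilities, each conditioned only on that component's parents. The binary variables, the single action, the `parents` table and the local probabilities in `p_one` are all hypothetical choices made for illustration, not values from the talk.

```python
from itertools import product

# Hypothetical 3-variable binary FMDP with a single action (illustration only).
# parents[i] lists the indices of the variables that component i depends on.
parents = {0: (0, 2), 1: (1,), 2: (2,)}


def p_one(i, parent_values):
    """Hypothetical local model: P(y_i = 1 | values of component i's parents)."""
    # Made-up rule: the component is on with prob. 0.9 if all parents are on, else 0.2.
    return 0.9 if all(parent_values) else 0.2


def transition_prob(x, y):
    """P(y | x) as the product of per-component conditional probabilities."""
    prob = 1.0
    for i, pa in parents.items():
        p1 = p_one(i, tuple(x[j] for j in pa))
        prob *= p1 if y[i] == 1 else 1.0 - p1
    return prob


x = (1, 0, 1)
# The factored probabilities still define a proper distribution over next states.
total = sum(transition_prob(x, y) for y in product((0, 1), repeat=3))
```

The point of the representation is size: each local table is indexed by a handful of parent values, while the flat transition table would need one entry per pair of joint states.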
**Be Optimistic!**
(when facing uncertainty)

**either you get experience…**

**or you get reward!**

**Factored Initial Model**
component x1 parents: (x1, x3)
component x2 parent: (x2)
…

**Factored Optimistic Initial Model**
“Garden of Eden”: +$10000 reward (or something very high)
component x1 parents: (x1, x3)
component x2 parent: (x2)
…

**Later on…**
• according to initial model, all states have value […]
• in frequently visited states, model becomes more realistic → reward expectations get lower → agent explores other areas

**Factored optimistic initial model**
• initialize model (optimistically)
• for each time step t,
  • solve approximate model using factored value iteration
  • take greedy action, observe next state
  • update model
• number of non-near-optimal steps (w.r.t. $V^\times$) is polynomial with probability ≈ 1

**elements of proof: some standard stuff**
• if […], then […]
• if […] for all i, then […]
• let $m_i$ be the number of visits to […]; if $m_i$ is large, then […] for all $y_i$.
• more precisely: […] with prob. […] (Hoeffding/Azuma inequality)

**elements of proof: main lemma**
• for any […], approximate Bellman-updates will be more optimistic than the real ones: […]
• if $V_E$ is large enough, the bonus term dominates for a long time
• if all elements of $H$ are nonnegative, projection preserves optimism
(labels on the inequality: lower bound by Azuma’s inequality; bonus promised by Garden of Eden state)

**elements of proof: wrap up**
• for a long time, $V_t$ is optimistic enough to boost exploration
• at most polynomially many exploration steps can be made
• except those, the agent must be near-$V^\times$-optimal

**Previous approaches**
• extensions of E³, Rmax, MBIE to FMDPs
• using current model, make smart plan (explore or exploit)
  • explore: make model more accurate
  • exploit: collect near-optimal reward
• unspecified planners
  • requirement: output plan is close-to-optimal
  • …e.g., solve the flat MDP
• polynomial sample complexity
• exponential amounts of computation!

**Unknown rewards?**
• “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” False.
• problem: cannot observe reward components, only their sum
• → UAI poster [Walsh, Szita, Diuk, Littman, 2009]

**Unknown structure?**
• can be learnt in polynomial time
• SLF-Rmax [Strehl, Diuk, Littman, 2007]
• Met-Rmax [Diuk, Li, Littman, 2009]

**Take-home message**
if your model starts out optimistically enough, you get efficient exploration for free!
(even if your planner is non-optimal (as long as it is monotonic))

**Optimistic initial model for FMDPs**
• add “garden of Eden” value to each state variable
• add reward factors for each state variable
• init transition model
