
Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs


### Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

Szita & Lőrincz: Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs



István Szita & András Lőrincz

University of Alberta (Canada) & Eötvös Loránd University (Hungary)

### Outline

- Factored MDPs
- motivation
- definitions
- planning in FMDPs

- Optimism
- Optimism & FMDPs & Model-based learning


### Reinforcement learning

- the agent makes decisions
- … in an unknown world
- makes some observations (including rewards)
- tries to maximize collected reward


### What kind of observation?

- structured observations
- structure is unclear



### How to “solve an RL task”?

- a model is useful
- can reuse experience from previous trials
- can learn offline

- observations are structured
- structure is unknown
- structured + model + RL = FMDP!
- (or linear dynamical systems, neural networks, etc…)

### Factored MDPs

- ordinary MDPs
- everything is factored
- states
- rewards
- transition probabilities
- (value functions)

### Factored state space

- all functions depend on a few variables only

### Factored dynamics
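The factored-dynamics idea can be sketched in code: the joint transition probability is a product of small local tables, each indexed only by that component's parent variables (a dynamic Bayesian network). The variable names and toy probabilities below are illustrative assumptions, not taken from the talk.

```python
import itertools

# parents[i] lists which current-state variables the next value of x_i depends on
parents = {0: (0, 2), 1: (1,), 2: (2,)}

# local_prob[i] maps a tuple of parent values -> probability that x_i' = 1
# (binary variables; all numbers are made up for illustration)
local_prob = {
    0: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9},
    1: {(0,): 0.2, (1,): 0.8},
    2: {(0,): 0.3, (1,): 0.7},
}

def transition_prob(x, x_next):
    """P(x' | x) = prod_i P_i(x_i' | parents_i(x))."""
    p = 1.0
    for i, par in parents.items():
        key = tuple(x[j] for j in par)
        p1 = local_prob[i][key]
        p *= p1 if x_next[i] == 1 else 1.0 - p1
    return p

# sanity check: probabilities over all next states sum to 1
total = sum(transition_prob((1, 0, 1), y)
            for y in itertools.product((0, 1), repeat=3))
```

Storing only the small local tables is what makes the representation compact: the full table would need one entry per state pair.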

### Factored rewards

### (Factored value functions)

- V* is not factored in general
- we will make an approximation error

### Solving a known FMDP

- NP-hard
- either exponential-time or non-optimal…

- exponential-time worst case
- flattening the FMDP
- approximate policy iteration [Koller & Parr, 2000, Boutilier, Dearden, Goldszmidt, 2000]

- non-optimal solution (approximating value function in a factored form)
- approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
- ALP + policy iteration [Guestrin et al., 2002]
- factored value iteration [Szita & Lőrincz, 2008]

### Factored value iteration

H := matrix of basis functions

N(Hᵀ) := row-normalization of Hᵀ

- the iteration w ← N(Hᵀ) T(Hw), where T is the Bellman update, converges to a fixed point w†
- each iteration can be computed quickly for FMDPs
- let V† = Hw†; then V† has bounded error relative to the optimal value function V*
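A minimal sketch of the projection at the heart of factored value iteration, run on a made-up flat 3-state MDP (so the fast factored computation itself is not shown): values are kept in the form V = Hw, and each Bellman update is projected back through the row-normalized transpose of the basis matrix. The basis H, the MDP, and all constants are assumptions for illustration.

```python
import numpy as np

np.random.seed(0)
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
R = np.random.rand(n_actions, n_states)                                 # R[a, s]

# basis functions as columns; rows are convex combinations, entries nonnegative
H = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
G = H.T / np.abs(H.T).sum(axis=1, keepdims=True)  # N(H^T): each row sums to 1

w = np.zeros(2)
for _ in range(500):
    V = H @ w
    TV = np.max(R + gamma * (P @ V), axis=0)  # Bellman update applied to V = Hw
    w_new = G @ TV                            # project back to weight space
    if np.max(np.abs(w_new - w)) < 1e-12:
        break
    w = w_new
```

With nonnegative H and row-normalized G, the composite update is a max-norm contraction, which is why the loop settles on a fixed point.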

### Learning in unknown FMDPs

- unknown factor decompositions (structure)
- unknown rewards
- unknown transitions (dynamics)



### Learning in an unknown FMDP, a.k.a. “Explore or exploit?”

- after trying a few action sequences…
- … try to discover better ones?
- … do the best thing according to current knowledge?

### Be Optimistic!

(when facing uncertainty)


### Factored Initial Model

component x1 — parents: (x1, x3); component x2 — parent: (x2); …

### Factored Optimistic Initial Model

“Garden of Eden” state: +$10000 reward (or something very high)

component x1 — parents: (x1, x3); component x2 — parent: (x2); …

### Later on…

- according to the initial model, all states have very high value
- in frequently visited states, the model becomes more realistic → reward expectations get lower → the agent explores other areas


### Factored optimistic initial model

- initialize the model (optimistically)
- for each time step t:
- solve the approximate model using factored value iteration
- take the greedy action, observe the next state
- update the model

- the number of non-near-optimal steps (w.r.t. V†) is polynomial, with probability ≈ 1
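The loop above can be sketched as runnable code on a tiny flat MDP, with plain value iteration standing in for factored value iteration. The 2-state environment, the Eden value V_E, and the phantom-sample reward update are illustrative assumptions; the transition structure is taken as known here so only rewards are learned.

```python
N_S, N_A, GAMMA, V_E = 2, 2, 0.9, 100.0
true_next = {(s, a): s ^ a for s in range(N_S) for a in range(N_A)}    # toy dynamics
true_reward = {(s, a): float(s == 1) for s in range(N_S) for a in range(N_A)}

# optimistic initialization: each (s, a) starts with one phantom sample worth V_E,
# so reward estimates begin huge and decay only as real observations come in
reward_sum = {sa: V_E for sa in true_reward}
count = {sa: 1 for sa in true_reward}

def model_R(sa):
    return reward_sum[sa] / count[sa]

def plan(iters=100):
    """Value iteration on the current (optimistic) model."""
    V = [0.0] * N_S
    for _ in range(iters):
        V = [max(model_R((s, a)) + GAMMA * V[true_next[(s, a)]]
                 for a in range(N_A)) for s in range(N_S)]
    return V

s = 0
for _ in range(200):
    V = plan()                                             # solve approximate model
    a = max(range(N_A),
            key=lambda a: model_R((s, a)) + GAMMA * V[true_next[(s, a)]])
    r, s2 = true_reward[(s, a)], true_next[(s, a)]         # greedy action, observe
    reward_sum[(s, a)] += r                                # update model
    count[(s, a)] += 1
    s = s2
```

Note there is no explicit exploration step: the decaying optimistic estimates alone pull the greedy policy through every state-action pair.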

### Elements of proof: some standard stuff

- if each local transition factor of the model is accurate, then the full transition model is accurate
- if this holds for all i, then the value estimates are accurate as well
- let m_i be the number of visits to a parent configuration of component i; if m_i is large, then the estimated probabilities are accurate for all values y_i
- more precisely: the estimation error is O(1/√m_i) with high probability (Hoeffding/Azuma inequality)
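The concentration step can be checked numerically: with m samples of a Bernoulli(p) variable, Hoeffding's inequality says the empirical frequency stays within sqrt(log(2/δ)/(2m)) of p, except with probability at most δ. The toy p, δ, and sample sizes below are assumptions.

```python
import math
import random

random.seed(0)
p, delta = 0.3, 0.05
trials, m = 1000, 400

# Hoeffding radius for m samples at confidence 1 - delta
eps = math.sqrt(math.log(2 / delta) / (2 * m))

failures = 0
for _ in range(trials):
    p_hat = sum(random.random() < p for _ in range(m)) / m
    if abs(p_hat - p) > eps:
        failures += 1
# the observed failure rate should be well below delta
```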

### Elements of proof: main lemma

- for any time step, the approximate Bellman updates will be more optimistic than the real ones
- if V_E is large enough, the bonus term dominates for a long time
- if all elements of H are nonnegative, the projection preserves optimism

(the lower bound is given by Azuma’s inequality; the bonus is promised by the Garden of Eden state)

### Elements of proof: wrap up

- for a long time, V_t is optimistic enough to boost exploration
- at most polynomially many exploration steps can be made
- in all other steps, the agent must be near-V†-optimal

### Previous approaches

- extensions of E3, Rmax, MBIE to FMDPs
- using current model, make smart plan (explore or exploit)
- explore: make model more accurate
- exploit: collect near-optimal reward

- unspecified planners
- requirement: output plan is close-to-optimal
- …e.g., solve the flat MDP

- polynomial sample complexity
- exponential amounts of computation!

### Unknown rewards?

- “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” — false!
- problem: the agent cannot observe the reward components, only their sum
- → UAI poster [Walsh, Szita, Diuk, Littman, 2009]

### Unknown structure?

- can be learnt in polynomial time
- SLF-Rmax [Strehl, Diuk, Littman, 2007]
- Met-Rmax [Diuk, Li, Littman, 2009]

### Take-home message

if your model starts out optimistically enough,

you get efficient exploration for free!

(even if your planner is non-optimal, as long as it is monotonic)

### Optimistic initial model for FMDPs

- add “garden of Eden” value to each state variable
- add reward factors for each state variable
- init transition model
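The three steps above might look like this in code; the variable domain sizes, the Eden value, and all names are hypothetical.

```python
V_E = 10000.0
sizes = [2, 3, 2]      # original domain sizes of the state variables
EDEN = list(sizes)     # the added "Garden of Eden" value: index n for a
                       # variable whose original values are 0..n-1

# one reward factor per state variable: pays V_E at the Eden value, 0 elsewhere
reward_factor = [
    {v: (V_E if v == eden else 0.0) for v in range(n + 1)}
    for n, eden in zip(sizes, EDEN)
]

def initial_local_model(i, parent_values):
    """Initial P(x_i' | parents) as a dict over values of x_i:
    every parent configuration leads to the Eden value with certainty."""
    probs = {v: 0.0 for v in range(sizes[i] + 1)}
    probs[EDEN[i]] = 1.0
    return probs

p = initial_local_model(0, (0, 1))
```

Starting every transition at Eden is what makes the model uniformly optimistic: until real data displaces that belief, every state promises the Eden reward.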

