Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs


### Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

Szita & Lőrincz: Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

István Szita & András Lőrincz

University of Alberta

Canada

Eötvös Loránd University

Hungary

Outline

- Factored MDPs
- motivation
- definitions
- planning in FMDPs
- Optimism
- Optimism & FMDPs & Model-based learning


Reinforcement learning

- the agent makes decisions
- … in an unknown world
- makes some observations (including rewards)
- tries to maximize collected reward


What kind of observation?

- structured observations
- structure is unclear

???


How to “solve an RL task”?

- a model is useful
- can reuse experience from previous trials
- can learn offline
- observations are structured
- structure is unknown
- structured + model + RL = FMDP!
- (or linear dynamical systems, neural networks, etc…)

Factored MDPs

- ordinary MDPs
- everything is factored
- states
- rewards
- transition probabilities
- (value functions)

Factored state space

- all functions depend on a few variables only
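As a toy illustration (the variables, parent sets, and probabilities below are invented for this sketch, not taken from the paper), factored dynamics store one small conditional table per state variable instead of one table over the whole state:

```python
import random

# Toy factored dynamics over three binary variables (hypothetical example):
# x1' depends only on parents (x1, x3); x2' only on (x2); x3' only on the action.
def step(state, action, rng=random):
    x1, x2, x3 = state
    p1 = 0.9 if x1 == x3 else 0.2     # P(x1' = 1 | x1, x3)
    p2 = 0.7 if x2 == 1 else 0.1      # P(x2' = 1 | x2)
    p3 = 0.8 if action == 1 else 0.3  # P(x3' = 1 | a)
    return tuple(int(rng.random() < p) for p in (p1, p2, p3))

# A flat MDP would need a 2^3-by-2^3 table per action; the factored form
# keeps three tables with at most 2^2 entries each.
```

Each component's table grows only with the number of its parents, which is what makes the representation compact when every function depends on a few variables.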

Factored dynamics

Factored rewards

(Factored value functions)

- V* is not factored in general
- we will make an approximation error

Solving a known FMDP

- NP-hard: either exponential time or a non-optimal solution…
- exponential-time worst case:
  - flattening the FMDP
  - approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
- non-optimal solutions (approximating the value function in a factored form):
  - approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
  - ALP + policy iteration [Guestrin et al., 2002]
  - factored value iteration [Szita & Lőrincz, 2008]

Factored value iteration

H := matrix of basis functions

N(Hᵀ) := row-normalization of Hᵀ

- the iteration converges to a fixed point w*
- it can be computed quickly for FMDPs
- let V* = Hw*; then V* has bounded error
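A minimal numpy sketch of the iteration on a flat toy problem (a single action, dense matrices, and the function name are my simplifications; in the real algorithm the factored structure makes the same matrix products cheap):

```python
import numpy as np

def factored_value_iteration(H, P, r, gamma, iters=200):
    # G = N(H^T): row-normalize H^T so the projection cannot inflate values,
    # which keeps the projected Bellman iteration a max-norm contraction.
    G = H.T / np.abs(H.T).sum(axis=1, keepdims=True)
    w = np.zeros(H.shape[1])
    for _ in range(iters):
        v = H @ w                      # current value estimate V = Hw
        w = G @ (r + gamma * (P @ v))  # Bellman backup, then project onto span(H)
    return w

# Example: 2-state chain with the identity basis (recovers exact value iteration).
P = np.array([[0.5, 0.5], [0.0, 1.0]])
r = np.array([1.0, 0.0])
w_star = factored_value_iteration(np.eye(2), P, r, 0.9)
```

With the identity basis the projection is the identity, so `w_star` solves V = r + γPV exactly; with fewer basis functions than states, the fixed point carries the bounded approximation error mentioned above.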

Learning in unknown FMDPs

- unknown factor decompositions (structure)
- unknown rewards
- unknown transitions (dynamics)


Outline

- Factored MDPs
- motivation
- definitions
- planning in FMDPs
- Optimism
- Optimism & FMDPs & Model-based learning

Learning in an unknown FMDP, a.k.a. “Explore or exploit?”

- after trying a few action sequences…
- … try to discover better ones?
- … do the best thing according to current knowledge?

Be Optimistic!

(when facing uncertainty)

either you get experience…

Outline

- Factored MDPs
- motivation
- definitions
- planning in FMDPs
- Optimism
- Optimism & FMDPs & Model-based learning

Factored Initial Model

[DBN figure: component x1 with parents (x1, x3); component x2 with parent (x2); …]

Factored Optimistic Initial Model

“Garden of Eden”

+$10000 reward

(or something very high)

[DBN figure: component x1 with parents (x1, x3); component x2 with parent (x2); …]

Later on…

- according to the initial model, all states have very high value
- in frequently visited states, the model becomes more realistic → reward expectations drop → the agent explores other areas

[DBN figure: component x1 with parents (x1, x3); component x2 with parent (x2); …]

Factored optimistic initial model

- initialize the model (optimistically)
- for each time step t:
  - solve the approximate model using factored value iteration
  - take the greedy action, observe the next state
  - update the model
- the number of non-near-optimal steps (w.r.t. V*) is polynomial, with probability ≈ 1
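The loop above can be sketched on a tiny flat MDP. Everything here is a stand-in: the environment, the function name, exact value iteration in place of factored value iteration, and "Garden of Eden" optimism implemented as fictitious prior counts pointing every (state, action) at a high-reward fictitious state.

```python
import numpy as np

def optimistic_learn(P_true, r_true, gamma=0.9, VE=100.0, steps=200, seed=0):
    A, n = P_true.shape[0], P_true.shape[1]
    rng = np.random.default_rng(seed)
    eden = n                               # extra fictitious high-reward state
    counts = np.zeros((A, n + 1, n + 1))
    counts[:, :n, eden] = 1.0              # one imagined visit to Eden per (s, a)
    counts[:, eden, eden] = 1.0            # Eden is absorbing
    r = np.append(r_true, VE)              # Eden pays the huge reward
    s, visits = 0, np.zeros(n)
    for _ in range(steps):
        P_hat = counts / counts.sum(axis=2, keepdims=True)
        V = np.zeros(n + 1)
        for _ in range(100):               # solve the optimistic model
            V = np.max(r + gamma * (P_hat @ V), axis=0)
        a = int(np.argmax(r[s] + gamma * (P_hat[:, s] @ V)))
        s2 = rng.choice(n, p=P_true[a, s])  # greedy action, observe next state
        counts[a, s, s2] += 1.0             # update model: optimism fades with data
        visits[s2] += 1
        s = s2
    return visits
```

As the counts for a (state, action) pair grow, the imagined Eden transition is washed out by real data, so its optimistic value shrinks and the greedy policy moves on to less-visited alternatives, which is exactly the exploration mechanism of the slides.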

elements of proof: some standard stuff

- if …, then …
- if … for all i, then …
- let mᵢ be the number of visits to …; if mᵢ is large, then … for all yᵢ
- more precisely: … holds with probability … (Hoeffding/Azuma inequality)
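The "mᵢ is large" condition can be made concrete with the standard two-sided Hoeffding bound (this is the textbook inequality, not the paper's exact lemma; the function name is mine):

```python
import math

# Hoeffding: the empirical estimate of a probability from m i.i.d. visits is
# within eps of the truth with probability >= 1 - delta once
#   m >= ln(2 / delta) / (2 * eps**2).
def visits_needed(eps, delta):
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# e.g. eps = 0.1, delta = 0.05 needs 185 visits per parent setting;
# the 1/eps^2 dependence is what makes the sample complexity polynomial.
```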

elements of proof: main lemma

- for any …, the approximate Bellman updates will be more optimistic than the real ones:
- if VE is large enough, the bonus term dominates for a long time
- if all elements of H are nonnegative, projection preserves optimism

(lower bound given by Azuma’s inequality; bonus promised by the Garden of Eden state)

elements of proof: wrap up

- for a long time, Vt is optimistic enough to boost exploration
- at most polynomially many exploration steps can be made
- apart from those steps, the agent must be near-V*-optimal

Previous approaches

- extensions of E3, Rmax, and MBIE to FMDPs
- using the current model, make a smart plan (explore or exploit)
  - explore: make the model more accurate
  - exploit: collect near-optimal reward
- unspecified planners
  - requirement: the output plan is close to optimal
  - …e.g., solve the flat MDP
- polynomial sample complexity
- but exponential amounts of computation!

Unknown rewards?

- “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” (false!)
- problem: the reward components cannot be observed, only their sum
- → UAI poster [Walsh, Szita, Diuk & Littman, 2009]

Unknown structure?

- can be learnt in polynomial time
- SLF-Rmax [Strehl, Diuk, Littman, 2007]
- Met-Rmax [Diuk, Li, Littman, 2009]

Take-home message

if your model starts out optimistically enough,

you get efficient exploration for free!

(even if your planner is non-optimal, as long as it is monotonic)

Optimistic initial model for FMDPs

- add a “Garden of Eden” value to each state variable
- add a reward factor for each state variable
- initialize the transition model (optimistically)
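The three steps above can be sketched per component (the data layout, names, and the constant VE are illustrative assumptions, not the paper's notation): each variable's domain gets an extra Eden value, a new reward factor pays VE whenever the variable equals it, and the initial model sends every parent setting to Eden with certainty.

```python
EDEN = "E"
VE = 10_000.0  # "very high" reward, as on the slides

def init_component(domain, parents):
    """Optimistic initial model for one state variable (hypothetical layout)."""
    return {
        "domain": list(domain) + [EDEN],          # extended variable domain
        "parents": parents,                        # factor scope, e.g. (x1, x3)
        "trans": lambda parent_values: {EDEN: 1.0},  # all mass on Eden at first
        "reward": lambda value: VE if value == EDEN else 0.0,  # new reward factor
    }

x1_model = init_component([0, 1], ("x1", "x3"))  # component x1, parents (x1, x3)
```

Because every component initially predicts Eden, every state's value starts near VE/(1 − γ), which is the optimism the greedy policy then burns off with real experience.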

