
Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

István Szita & András Lőrincz

University of Alberta, Canada

Eötvös Loránd University, Hungary


Outline

  • Factored MDPs

    • motivation

    • definitions

    • planning in FMDPs

  • Optimism

  • Optimism & FMDPs & Model-based learning

Szita & Lőrincz: Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs


Reinforcement learning

  • the agent makes decisions

  • … in an unknown world

  • makes some observations (including rewards)

  • tries to maximize collected reward

What kind of observation?

  • structured observations

  • structure is unclear

???

How to “solve an RL task”?

  • a model is useful

    • can reuse experience from previous trials

    • can learn offline

  • observations are structured

  • structure is unknown

  • structured + model + RL = FMDP!

    • (or linear dynamical systems, neural networks, etc…)

Factored MDPs

  • ordinary MDPs

  • everything is factored

    • states

    • rewards

    • transition probabilities

    • (value functions)

Factored state space

  • all functions depend on a few variables only

Factored dynamics
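The formula on this slide was an image that did not survive transcription. In an FMDP the transition model is a dynamic Bayesian network: each next-state component depends only on a small parent set, so P(x′ | x, a) = ∏ᵢ Pᵢ(x′ᵢ | x[parents(i)], a). A minimal sketch (the parent sets and probability tables below are toy values of ours, not the paper's):

```python
from itertools import product

# Toy factored transition model over 3 binary state variables.
# parents[i] lists the components x'_i depends on; cond[i] maps
# (parent values, action) -> P(x'_i = 1 | parents, a).
parents = {0: (0, 2), 1: (1,), 2: (2,)}
cond = {
    i: {(pv, a): 0.8 if a == 1 else 0.2
        for pv in product((0, 1), repeat=len(par)) for a in (0, 1)}
    for i, par in parents.items()
}

def transition_prob(x, a, x_next):
    """Full transition probability as a product of local factors."""
    p = 1.0
    for i, par in parents.items():
        pv = tuple(x[j] for j in par)
        p_one = cond[i][(pv, a)]
        p *= p_one if x_next[i] == 1 else 1.0 - p_one
    return p

# the local factors multiply into a proper distribution over next states
total = sum(transition_prob((0, 1, 0), 1, xn)
            for xn in product((0, 1), repeat=3))
```

Note that each conditional table has only 2^|parents| · |A| entries, which is what makes the factored representation exponentially smaller than the flat one.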

Factored rewards
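The reward formula on this slide was also an image. A factored reward is a sum of local terms, each depending on a few variables: r(x, a) = Σⱼ rⱼ(x[Zⱼ], a). A sketch with illustrative scopes and values (not the paper's):

```python
# Two local reward factors over 3 binary state variables (0-indexed):
# r_1 depends on x1 alone, r_2 on (x2, x3); values are toy choices.
reward_scopes = [(0,), (1, 2)]
local_rewards = [
    lambda vals, a: 1.0 if vals == (1,) else 0.0,        # r_1(x1)
    lambda vals, a: 0.5 * a if vals == (1, 1) else 0.0,  # r_2(x2, x3, a)
]

def reward(x, a):
    """Global reward as the sum of local factor rewards."""
    return sum(r(tuple(x[j] for j in scope), a)
               for scope, r in zip(reward_scopes, local_rewards))
```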

(Factored value functions)

  • V* is not factored in general

  • we will make an approximation error

Solving a known FMDP

  • NP-hard

    • either exponential-time or non-optimal…

  • exponential-time worst case

    • flattening the FMDP

    • approximate policy iteration [Koller & Parr, 2000, Boutilier, Dearden, Goldszmidt, 2000]

  • non-optimal solution (approximating value function in a factored form)

    • approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]

    • ALP + policy iteration [Guestrin et al., 2002]

    • factored value iteration [Szita & Lőrincz, 2008]

Factored value iteration

H := matrix of basis functions

N(Hᵀ) := row-normalization of Hᵀ

  • the iteration converges to a fixed point w⋆

  • can be computed quickly for FMDPs

  • Let V⋆ = Hw⋆. Then V⋆ has bounded error.
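The iteration on this slide can be sketched on a tiny *flat* MDP so the projection step is visible; the matrices below are random stand-ins, and the factored algorithm computes the same updates without enumerating states:

```python
import numpy as np

# Toy flat MDP: S states, A actions, k basis functions.
rng = np.random.default_rng(0)
S, A, k, gamma = 6, 2, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(A, S))      # P[a, s] = dist over s'
R = rng.random((A, S))                          # R[a, s]
H = rng.random((S, k))
H /= H.sum(axis=1, keepdims=True)               # nonnegative, rows sum to 1

G = H.T / H.T.sum(axis=1, keepdims=True)        # N(H^T): row-normalized H^T

w = np.zeros(k)
for _ in range(300):
    TV = (R + gamma * P @ (H @ w)).max(axis=0)  # Bellman optimality update
    w = G @ TV                                  # project back onto span(H)
# With both row normalizations, H @ G is a stochastic matrix, so the
# projected update remains a gamma-contraction and w converges.
V_approx = H @ w
```

The row normalization is the design point: it keeps the projected Bellman operator a max-norm contraction, which is what ordinary least-squares projection does not guarantee.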

Learning in unknown FMDPs

  • unknown factor decompositions (structure)

  • unknown rewards

  • unknown transitions (dynamics)



Outline

  • Factored MDPs

    • motivation

    • definitions

    • planning in FMDPs

  • Optimism

  • Optimism & FMDPs & Model-based learning

Learning in an unknown FMDP, a.k.a. “Explore or exploit?”

  • after trying a few action sequences…

  • … try to discover better ones?

  • … do the best thing according to current knowledge?

Be Optimistic!

(when facing uncertainty)



either you get experience…



or you get reward!

Outline

  • Factored MDPs

    • motivation

    • definitions

    • planning in FMDPs

  • Optimism

  • Optimism & FMDPs & Model-based learning

Factored Initial Model

(figure: DBN transition model; component x1 has parents (x1, x3), component x2 has parent (x2))

Factored Optimistic Initial Model

“Garden of Eden”

+$10000 reward

(or something very high)
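The trick above can be sketched for a single component model: the model starts with fictitious mass pointing at the “Garden of Eden” state (which carries the huge reward), and real visits gradually wash the fiction out. Class and field names here are ours, not the paper's:

```python
# V_E is the optimistic reward attached to the fictitious Eden state.
V_E = 10000.0

class OptimisticComponentModel:
    """Empirical model of one state component with an Eden pseudo-outcome."""
    def __init__(self, pseudo=1.0):
        self.pseudo = pseudo   # fictitious Eden observations per entry
        self.visits = {}       # (parent values, action) -> real sample count

    def update(self, parent_vals, action):
        key = (parent_vals, action)
        self.visits[key] = self.visits.get(key, 0) + 1

    def p_eden(self, parent_vals, action):
        """Probability mass still assigned to the Eden outcome."""
        n = self.visits.get((parent_vals, action), 0)
        return self.pseudo / (self.pseudo + n)

m = OptimisticComponentModel()
p0 = m.p_eden((0, 1), 0)   # fresh entry: all mass on Eden
for _ in range(3):
    m.update((0, 1), 0)
p3 = m.p_eden((0, 1), 0)   # after 3 real visits, Eden mass has shrunk
```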

Later on…

  • according to the initial model, all states have very high value

  • in frequently visited states, the model becomes more realistic → reward expectations drop → the agent explores other areas

Factored optimistic initial model

  • initialize model (optimistically)

  • for each time step t,

    • solve approximate model using factored value iteration

    • take greedy action, observe next state

    • update model

  • number of non-near-optimal steps (w.r.t. V⋆) is polynomial, with probability ≈ 1
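The loop above can be run end-to-end on a flat one-state toy, with a stub model in place of the factored one and a trivial greedy choice in place of factored value iteration; it only illustrates the pattern "optimistic init + greediness + model update", not the paper's algorithm:

```python
class ToyModel:
    def __init__(self, n_actions, optimistic=1e4):
        self.q = [optimistic] * n_actions   # optimistic initial estimates
        self.n = [0] * n_actions
        self.total = [0.0] * n_actions

    def update(self, a, reward):
        self.n[a] += 1
        self.total[a] += reward
        self.q[a] = self.total[a] / self.n[a]   # empirical estimate

    def greedy_action(self):
        return self.q.index(max(self.q))        # ties -> lowest index

def true_reward(a):        # hidden environment: action 1 pays 1.0
    return float(a)

model = ToyModel(n_actions=2)
for t in range(20):
    a = model.greedy_action()          # greedy w.r.t. the optimistic model
    model.update(a, true_reward(a))    # estimate sinks toward the truth
# Each action is tried while its estimate is still optimistic; afterwards
# the greedy agent settles on the truly better action.
```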

elements of proof: some standard stuff

  • if …, then …

  • if … for all i, then …

  • let mᵢ be the number of visits to …; if mᵢ is large, then … for all yᵢ

  • more precisely: … with probability … (Hoeffding/Azuma inequality)
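The inequalities on this slide were images, but the generic Hoeffding bound they rely on is standard: for a [0,1]-valued sample mean, P(|p̂ − p| ≥ ε) ≤ 2·exp(−2mε²), so solving for m gives the number of visits that suffices for ε-accuracy with confidence 1 − δ. (These are the textbook constants, not the paper's tuned ones.)

```python
import math

def visits_needed(eps, delta):
    """Smallest m with 2*exp(-2*m*eps^2) <= delta (Hoeffding)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

m_05 = visits_needed(eps=0.05, delta=0.01)   # visits for 0.05-accuracy
```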

elements of proof: main lemma

  • for any …, approximate Bellman updates will be more optimistic than the real ones:

  • if V_E is large enough, the bonus term dominates for a long time

  • if all elements of H are nonnegative, projection preserves optimism

(figure: lower bound by Azuma's inequality; bonus promised by the Garden of Eden state)

elements of proof: wrap up

  • for a long time, Vt is optimistic enough to boost exploration

  • at most polynomially many exploration steps can be made

  • apart from those, the agent must be near-V⋆-optimal

Previous approaches

  • extensions of E3, Rmax, MBIE to FMDPs

    • using current model, make smart plan (explore or exploit)

    • explore: make model more accurate

    • exploit: collect near-optimal reward

  • unspecified planners

    • requirement: output plan is close-to-optimal

    • …e.g., solve the flat MDP

  • polynomial sample complexity

  • exponential amounts of computation!

Unknown rewards?

  • “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” False.

  • problem: cannot observe reward components, only their sum

    • → UAI poster [Walsh, Szita, Diuk, Littman, 2009]

Unknown structure?

  • can be learnt in polynomial time

    • SLF-Rmax [Strehl, Diuk, Littman, 2007]

    • Met-Rmax [Diuk, Li, Littman, 2009]

Take-home message

if your model starts out optimistically enough,

you get efficient exploration for free!

(even if your planner is non-optimal, as long as it is monotonic)


Thank you for your attention!


Optimistic initial model for FMDPs

  • add “garden of Eden” value to each state variable

  • add reward factors for each state variable

  • initialize the transition model
