Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

István Szita & András Lőrincz

University of Alberta, Canada

Eötvös Loránd University, Hungary

Outline
  • Factored MDPs
    • motivation
    • definitions
    • planning in FMDPs
  • Optimism
  • Optimism & FMDPs & Model-based learning

Reinforcement learning
  • the agent makes decisions
  • … in an unknown world
  • makes some observations (including rewards)
  • tries to maximize collected reward

What kind of observation?
  • structured observations
  • structure is unclear

???

How to “solve an RL task”?
  • a model is useful
    • can reuse experience from previous trials
    • can learn offline
  • observations are structured
  • structure is unknown
  • structured + model + RL = FMDP!
    • (or linear dynamical systems, neural networks, etc…)

Factored MDPs
  • ordinary MDPs
  • everything is factored
    • states
    • rewards
    • transition probabilities
    • (value functions)

Factored state space
  • all functions depend on a few variables only

Factored dynamics
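The factored dynamics can be written in the standard DBN form (Γᵢ below denotes the parent set of variable xᵢ; this notation is assumed here, not taken from the slide):

    P(x' \mid x, a) \;=\; \prod_{i=1}^{n} P_i\bigl(x'_i \,\big|\, x[\Gamma_i],\, a\bigr)

Each parent set Γᵢ contains only a few variables, so every local factor Pᵢ is a small conditional probability table.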

Factored rewards
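Likewise, the reward is assumed to decompose into a sum of local terms, each depending only on a small scope Zⱼ (notation assumed here):

    R(x, a) \;=\; \sum_{j=1}^{m} R_j\bigl(x[Z_j],\, a\bigr)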

(Factored value functions)
  • V* is not factored in general
  • we will make an approximation error
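Concretely, the factored approximation keeps V in the span of basis functions hₖ with small scopes Cₖ (hₖ and Cₖ are notation assumed for this sketch), i.e. V ≈ Hw for a weight vector w:

    V(x) \;\approx\; (Hw)(x) \;=\; \sum_{k} w_k\, h_k\bigl(x[C_k]\bigr)

Since V* need not lie in this span, some approximation error is unavoidable.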

Solving a known FMDP
  • NP-hard
    • either exponential-time or non-optimal…
  • exponential-time worst case
    • flattening the FMDP
    • approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
  • non-optimal solution (approximating value function in a factored form)
    • approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
    • ALP + policy iteration [Guestrin et al., 2002]
    • factored value iteration [Szita & Lőrincz, 2008]

Factored value iteration

H := matrix of basis functions

N(Hᵀ) := row-normalization of Hᵀ

  • the iteration converges to a fixed point w̃
  • each iteration can be computed quickly for FMDPs
  • Let Ṽ = H w̃. Then Ṽ has bounded approximation error (see the sketch below)
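A minimal sketch of factored value iteration as described above, written with flat state enumeration for readability (in the factored setting these quantities are computed by exploiting the structure of H and the model rather than by enumerating states); H, P, R and gamma are assumed inputs:

    import numpy as np

    def factored_value_iteration(H, P, R, gamma, n_iters=200):
        """Approximate value iteration restricted to the span of the basis H.

        H : (S, K) nonnegative basis matrix, P : (S*A, S) transitions
        (rows ordered by (s, a) with the action index fastest),
        R : (S*A,) rewards, gamma : discount factor.
        Returns weights w such that H @ w approximates the value function.
        """
        S, K = H.shape
        A = P.shape[0] // S
        # N(H^T): row-normalized H^T, used as the projection back onto the basis
        G = H.T / np.maximum(H.T.sum(axis=1, keepdims=True), 1e-12)
        w = np.zeros(K)
        for _ in range(n_iters):
            V = H @ w                                # current value estimate
            Q = (R + gamma * (P @ V)).reshape(S, A)  # one Bellman backup per (s, a)
            w = G @ Q.max(axis=1)                    # greedy backup, then project onto the basis
        return w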

Learning in unknown FMDPs
  • unknown factor decompositions (structure)
  • unknown rewards
  • unknown transitions (dynamics)

Outline
  • Factored MDPs
    • motivation
    • definitions
    • planning in FMDPs
  • Optimism
  • Optimism & FMDPs & Model-based learning

Learning in an unknown FMDP, a.k.a. “Explore or exploit?”
  • after trying a few action sequences…
  • … try to discover better ones?
  • … do the best thing according to current knowledge?

Be Optimistic!

(when facing uncertainty)

either you get experience…

or you get reward!

Outline
  • Factored MDPs
    • motivation
    • definitions
    • planning in FMDPs
  • Optimism
  • Optimism & FMDPs & Model-based learning

Factored Initial Model

[DBN diagram: component x1 with parents (x1, x3); component x2 with parent (x2)]

Factored Optimistic Initial Model

“Garden of Eden” state: +$10,000 reward (or something very high)

[DBN diagram, extended with the Garden-of-Eden value: component x1 with parents (x1, x3); component x2 with parent (x2)]

Later on…
  • according to the initial model, all states still have very high value
  • in frequently visited states, the model becomes more realistic → reward expectations get lower → the agent explores other areas

[DBN diagram: component x1 with parents (x1, x3); component x2 with parent (x2)]

Factored optimistic initial model
  • initialize the model (optimistically)
  • for each time step t (sketched below):
    • solve the approximate model using factored value iteration
    • take the greedy action, observe the next state
    • update the model
  • the number of non-near-optimal steps (w.r.t. Ṽ) is polynomial, with probability ≈ 1
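A sketch of the loop above; plan_with_fvi, optimistic_init and update_counts are hypothetical helpers standing in for the steps on this slide, not functions from the paper:

    def foim_run(env, n_steps, optimistic_init, plan_with_fvi, update_counts):
        """Greedy model-based loop: plan on the current (initially optimistic)
        factored model, act greedily, and fold every observed transition back in."""
        model = optimistic_init()                  # Garden-of-Eden style initialization
        x = env.reset()
        for t in range(n_steps):
            policy = plan_with_fvi(model)          # solve the approximate model with FVI
            a = policy(x)                          # greedy action, no explicit exploration
            x_next, reward = env.step(a)
            update_counts(model, x, a, x_next, reward)  # real data gradually washes out the optimism
            x = x_next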

Elements of proof: some standard stuff
  • if each local transition factor is estimated accurately, then the full transition model is accurate
  • if this holds for all components i, then values computed from the approximate model are accurate too
  • let mᵢ be the number of visits to a given parent configuration of component i; if mᵢ is large, then the estimated probabilities are accurate for all values yᵢ
  • more precisely, the estimation error is small with high probability (Hoeffding/Azuma inequality; see below)
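The "more precisely" bullet is a Hoeffding-style concentration bound; one standard form (δ and the exact constants are assumptions of this sketch, not read off the slide) is

    \bigl|\hat{P}_i(y_i \mid x[\Gamma_i], a) - P_i(y_i \mid x[\Gamma_i], a)\bigr| \;\le\; \sqrt{\tfrac{\ln(2/\delta)}{2\, m_i}} \quad \text{with probability at least } 1-\delta .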

Elements of proof: main lemma
  • for any time step t, the approximate Bellman update is more optimistic than the true one:
  • if the Garden-of-Eden value V_E is large enough, the bonus term dominates for a long time
  • if all elements of H are nonnegative, the projection preserves optimism

[annotated inequality: the true update is lower-bounded via Azuma’s inequality; the optimism bonus is promised by the Garden-of-Eden state]
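The monotonicity behind “projection preserves optimism”, stated in one line: N(Hᵀ) has only nonnegative entries, and multiplying by a nonnegative matrix preserves componentwise ordering:

    H \ge 0 \;\Longrightarrow\; \bigl(V \ge V' \;\Rightarrow\; \mathcal{N}(H^{\mathsf{T}})\, V \ge \mathcal{N}(H^{\mathsf{T}})\, V'\bigr)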

Elements of proof: wrap-up
  • for a long time, Vₜ is optimistic enough to boost exploration
  • at most polynomially many exploration steps can be made
  • outside of those, the agent must be near-Ṽ-optimal

Previous approaches
  • extensions of E3, Rmax, MBIE to FMDPs
    • using current model, make smart plan (explore or exploit)
    • explore: make model more accurate
    • exploit: collect near-optimal reward
  • unspecified planners
    • requirement: output plan is close-to-optimal
    • …e.g., solve the flat MDP
  • polynomial sample complexity
  • exponential amounts of computation!

Unknown rewards?
  • “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” False!
  • problem: we cannot observe the reward components, only their sum
    • → UAI poster [Walsh, Szita, Diuk & Littman, 2009]

Unknown structure?
  • can be learnt in polynomial time
    • SLF-Rmax [Strehl, Diuk, Littman, 2007]
    • Met-Rmax [Diuk, Li & Leffler, 2009]

Take-home message

if your model starts out optimistically enough,

you get efficient exploration for free!

(even if your planner is non-optimal, as long as it is monotonic)

Optimistic initial model for FMDPs
  • add a “Garden of Eden” value to each state variable
  • add a reward factor for each state variable
  • initialize the transition model optimistically, so that the Garden-of-Eden values appear reachable (sketched below)
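A minimal sketch of this construction, assuming a tabular per-component representation; the Garden-of-Eden reward R_E plays the role of the “+$10,000” bonus from the earlier slide, and the dictionary layout, helper names and the deterministic initial transition belief are assumptions of the sketch:

    def optimistic_initial_model(domains, parents, R_E=10_000.0):
        """Optimistic initial model for an FMDP.

        domains : list of value lists, one per state variable
        parents : list of parent-index tuples, one per state variable
        Each variable gets an extra Garden-of-Eden value carrying a huge reward,
        and the initial transition belief sends every parent configuration there.
        """
        model = {"domains": [], "parents": parents, "reward": [], "trans": []}
        for i, dom in enumerate(domains):
            x_E = ("GARDEN_OF_EDEN", i)
            model["domains"].append(list(dom) + [x_E])                           # extend the domain
            model["reward"].append(lambda v, x_E=x_E: R_E if v == x_E else 0.0)  # huge bonus at x_E
            model["trans"].append(lambda parent_values, x_E=x_E: {x_E: 1.0})     # believe x_E is reachable
        return model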
