
Learning and Planning for POMDPs

Eyal Even-Dar, Tel-Aviv University

Sham Kakade, University of Pennsylvania

Yishay Mansour, Tel-Aviv University

Talk Outline
  • Bounded Rationality and Partially Observable MDPs

  • Mathematical Model of POMDPs
  • Learning in POMDPs
  • Planning in POMDPs
  • Tracking in POMDPs
Bounded Rationality
  • Rationality:
    • Players with unlimited computational power
  • Bounded Rationality
    • Computational limitation
    • Finite Automata
  • Challenge: play optimally against a Finite Automaton
    • Size of the automaton unknown
Bounded Rationality and RL
  • Model:
    • Perform an action
    • See an observation
    • Either immediate rewards or delayed rewards
  • This is a POMDP
    • Unknown size is a serious challenge
Classical Reinforcement Learning: Agent – Environment Interaction

[Diagram: the agent-environment loop. The agent performs an action; the environment returns a reward and the next state.]

Reinforcement Learning - Goal
  • Maximize the return.
  • Discounted return: ∑_{t=1}^{∞} γ^t r_t, where 0 < γ < 1
  • Undiscounted return: (1/T) ∑_{t=1}^{T} r_t
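As a concrete illustration of the two criteria (a minimal sketch, not from the talk; the reward sequence is made up), the returns can be computed as:

```python
def discounted_return(rewards, gamma):
    """Discounted return: sum over t of gamma^t * r_t, with 0 < gamma < 1."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

def average_return(rewards):
    """Undiscounted return: (1/T) * sum of r_t over a length-T trajectory."""
    return sum(rewards) / len(rewards)

rewards = [1.0, 0.0, 2.0, 1.0]                    # hypothetical reward sequence
print(discounted_return(rewards, gamma=0.9))      # 0.9*1 + 0.81*0 + 0.729*2 + 0.6561*1
print(average_return(rewards))                    # 4.0 / 4 = 1.0
```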
Reinforcement Learning Model: Policy
  • Policy Π:
    • Mapping states to distributions over actions
  • Optimal policy Π*:
    • Attains optimal return from any start state.
  • Theorem:

There exists a stationary deterministic optimal policy

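To make the policy definition concrete (an illustrative sketch, not part of the talk; the states and actions are hypothetical), a stochastic policy maps states to action distributions, while a stationary deterministic policy maps each state to a single action:

```python
import random

# Stochastic policy: each state maps to a distribution over actions (hypothetical values).
stochastic_policy = {
    "s1": {"a1": 0.7, "a2": 0.3},
    "s2": {"a1": 0.1, "a2": 0.9},
}

# Stationary deterministic policy: each state maps to one action.
deterministic_policy = {"s1": "a1", "s2": "a2"}

def sample_action(policy, state):
    """Draw an action from the policy's distribution at `state`."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(stochastic_policy, "s1"))
print(deterministic_policy["s2"])
```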
Planning and Learning in MDPs
  • Planning:
    • Input: a complete model
    • Output: an optimal policy Π*
  • Learning:
    • Interaction with the environment
    • Achieve near optimal return.
  • For MDPs both planning and learning can be done efficiently
    • Polynomial in the number of states
    • Assuming the model is represented in tabular form
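As an illustration of efficient MDP planning (a minimal value-iteration sketch over a hypothetical tabular model; it is not the POMDP algorithm discussed later in the talk):

```python
def value_iteration(states, actions, P, R, gamma=0.9, iters=1000):
    """Tabular value iteration.
    P[s][a] is a dict {next_state: probability}; R[s][a] is the expected reward.
    Returns a greedy (stationary, deterministic) policy and its value function."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                   for a in actions)
            for s in states
        }
    policy = {
        s: max(actions,
               key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
        for s in states
    }
    return policy, V
```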
Partially Observable: Agent – Environment Interaction

[Diagram: the partially observable agent-environment loop. The agent performs an action; the environment returns a reward and an observation, a signal correlated with the state, instead of the next state.]

Partially Observable Markov Decision Process
  • S the states
  • A actions
  • Psa(·) next-state distribution
  • R(s,a) Reward distribution
  • O Observations
  • O(s,a) Observation distribution

[Diagram: a three-state example. State s1 emits O1 = .8, O2 = .1, O3 = .1; state s2 emits O1 = .1, O2 = .8, O3 = .1; state s3 emits O1 = .1, O2 = .1, O3 = .8. Transition probabilities 0.7 and 0.3 label the edges out of s1, and E[R(s3,a)] = 10.]

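The components above can be written down as a tabular model; the sketch below mirrors the slide's three-state example, with the return transitions to s1 filled in as an assumption purely for illustration:

```python
# A tabular POMDP representation (illustrative sketch; the exact transition structure
# of the slide's example is assumed, not spelled out in the talk).
pomdp = {
    "states": ["s1", "s2", "s3"],
    "actions": ["a"],
    # P[s][a]: distribution over next states
    "P": {"s1": {"a": {"s2": 0.7, "s3": 0.3}},
          "s2": {"a": {"s1": 1.0}},
          "s3": {"a": {"s1": 1.0}}},
    # R[s][a]: expected reward
    "R": {"s1": {"a": 0.0}, "s2": {"a": 0.0}, "s3": {"a": 10.0}},
    # O[s][a]: distribution over observations (state si mostly emits oi)
    "O": {"s1": {"a": {"o1": 0.8, "o2": 0.1, "o3": 0.1}},
          "s2": {"a": {"o1": 0.1, "o2": 0.8, "o3": 0.1}},
          "s3": {"a": {"o1": 0.1, "o2": 0.1, "o3": 0.8}}},
}
```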
Partial Observables – Problems in Planning
  • The optimal policy is not stationary; furthermore, it is history-dependent
  • Example:
Learning in POMDPs – Difficulties
  • Suppose an agent knows its state initially; can it keep track of its state?
    • Easy given a completely accurate model.
    • Inaccurate model: Our new tracking result.
  • How can the agent return to the same state?
  • What is the meaning of very long histories?
    • Do we really need to keep all the history?!
Planning in POMDPs – Belief State Algorithm
  • A Bayesian setting
  • Prior over initial state
  • Each action and observation pair defines a posterior
    • belief state: distribution over states
  • View the possible belief states as “states”
    • Infinite number of states
  • Also assumes a “perfect model”
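The posterior update itself is a Bayes filter; a minimal sketch, assuming the tabular P and O layout used above and the convention that the observation is generated at the resulting state:

```python
def update_belief(belief, action, obs, P, O):
    """Bayes-filter update of a belief state (dict state -> probability)
    after taking `action` and seeing `obs`, given tabular P[s][a] and O[s][a]."""
    new_belief = {}
    for s2 in belief:
        # Predicted probability of reaching s2, then weighted by the observation likelihood.
        pred = sum(b * P[s][action].get(s2, 0.0) for s, b in belief.items())
        new_belief[s2] = O[s2][action].get(obs, 0.0) * pred
    z = sum(new_belief.values())
    return {s: p / z for s, p in new_belief.items()} if z > 0 else belief
```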
Learning in POMDPs – Popular methods
  • Policy gradient methods :
    • Find a locally optimal policy in a restricted class of policies (parameterized policies)
    • Need to assume a reset to the start state!
    • Cannot guarantee asymptotic results
    • [Peshkin et al, Baxter & Bartlett,…]
Learning in POMDPs
  • Trajectory trees [KMN]:
    • Assume a generative model
      • A strong RESET procedure
    • Find a “near best” policy in a restricted class of policies
      • finite horizon policies
      • parameterized policies
Trajectory tree [KMN]

[Diagram: a trajectory tree rooted at s0. Each node branches on the actions (a1, a2) and on the observations sampled from the generative model (o1, o2, o3, o4).]

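A rough sketch of how such a tree can be grown from a generative model (the `generative_model(state, action)` sampler and the node layout are assumptions for illustration, not the paper's exact construction):

```python
def build_trajectory_tree(state, actions, generative_model, depth):
    """Recursively sample one trajectory tree of the given depth.
    `generative_model(state, action)` is assumed to return (next_state, observation, reward).
    Every action is tried at every node; each action gets a single sampled outcome."""
    node = {"state": state, "children": {}}
    if depth == 0:
        return node
    for a in actions:
        next_state, obs, reward = generative_model(state, a)
        subtree = build_trajectory_tree(next_state, actions, generative_model, depth - 1)
        node["children"][a] = {"obs": obs, "reward": reward, "subtree": subtree}
    return node
```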
Our setting
  • Return: the average reward criterion
  • One long trajectory
    • No RESET
    • Connected environment (unichain POMDP)
  • Goal: Achieve the optimal return (average reward) with probability 1
Homing strategies - POMDPs
  • A homing strategy is a strategy that identifies the state.
    • Knows how to return “home”
  • Enables an “approximate reset” during a long trajectory.
Homing strategies
  • Learning finite automata [Rivest Schapire]
    • Use homing sequence to identify the state
      • The homing sequence is exact
      • It can lead to many states
    • Use finite automata learning of [Angluin 87]
  • Diversity-based learning [Rivest Schapire]
    • Similar to our setting
  • Major difference: deterministic transitions
Homing strategies - POMDPs

Definition:

H is an (ε, K)-homing strategy if, for every two belief states x1 and x2, after K steps of following H the expected belief states b1 and b2 are within distance ε.

Homing strategies – Random Walk
  • If the POMDP is strongly connected, then the Markov chain induced by the random walk is irreducible
  • Following the random walk therefore assures that we converge to the steady-state distribution
Homing strategies – Random Walk
  • What if the Markov chain is periodic?
    • For example, a cycle
  • Use a “stay” action to overcome periodicity problems
Homing strategies – Amplifying

Claim:

If H is an (ε, K)-homing sequence, then repeating H T times gives an (ε^T, KT)-homing sequence

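A brief sketch of why repetition amplifies accuracy (my reconstruction, not the talk's argument): the expected belief after following H for K steps is a linear stochastic map of the initial belief, so the (ε, K) property bounds its Dobrushin contraction coefficient, and such coefficients multiply under composition.

```latex
% Let M_H(i,j) = Pr[state j after the K steps of H | state i]; the expected belief
% after following H from belief b is then the linear map b M_H.
\[
\max_{x_1,x_2} \lVert x_1 M_H - x_2 M_H \rVert_1 \le \varepsilon
\;\Longrightarrow\;
\delta(M_H) := \tfrac{1}{2}\max_{i,j}\lVert e_i M_H - e_j M_H \rVert_1 \le \tfrac{\varepsilon}{2},
\]
% and since Dobrushin coefficients are submultiplicative, after T repetitions
\[
\lVert x_1 M_H^{T} - x_2 M_H^{T} \rVert_1
\le \lVert x_1 - x_2 \rVert_1 \, \delta(M_H)^{T}
\le 2 \left(\tfrac{\varepsilon}{2}\right)^{T}
\le \varepsilon^{T}.
\]
```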
Reinforcement learning with homing
  • Usually, algorithms must balance exploration and exploitation
  • Here they must balance exploration, exploitation, and homing
  • Homing is performed in both exploration and exploitation
Policy testing algorithm

Theorem:

For any connected POMDP the policy testing algorithm obtains the optimal average reward with probability 1

After T time steps, it competes with policies of horizon log log T

Policy testing
  • Enumerate the policies
    • Gradually increase horizon
  • Run in phases:
    • Test policy πk
      • Average runs, resetting between runs
    • Run the best policy so far
      • Ensures good average return
      • Again, reset between runs.
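A schematic rendering of these phases (function names such as `run_policy` and `home`, and the fixed `runs_per_phase` schedule, are placeholders, not the talk's actual parameters):

```python
def policy_testing(policies, run_policy, home, runs_per_phase=100):
    """Schematic phase loop: test each candidate policy, then exploit the best so far.
    `run_policy(pi)` returns the return of one run; `home()` performs an approximate reset."""
    best_policy, best_estimate = None, float("-inf")
    for pi in policies:                       # policies enumerated with gradually growing horizon
        # Test phase: estimate the average return of policy pi.
        total = 0.0
        for _ in range(runs_per_phase):
            total += run_policy(pi)
            home()                            # approximate reset between runs
        if total / runs_per_phase > best_estimate:
            best_policy, best_estimate = pi, total / runs_per_phase
        # Exploitation phase: run the best policy found so far.
        for _ in range(runs_per_phase):
            run_policy(best_policy)
            home()
    return best_policy
```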
Model based algorithm

Theorem:

For any connected POMDP the model based algorithm obtains the optimal average reward with probability 1

After T time steps, it competes with policies of horizon log T

Model based algorithm

  • For t = 1 to ∞
    • Exploration: for K1(t) times do
      • Run random actions for t steps and build an empirical model
      • Use the homing sequence to approximate a reset
    • Compute the optimal policy on the empirical model
    • Exploitation: for K2(t) times do
      • Run the empirical optimal policy for t steps
      • Use the homing sequence to approximate a reset

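A schematic rendering of this loop (the helpers `home`, `random_step`, `fit_model`, `plan`, and `run_policy` are placeholders for illustration):

```python
from itertools import count

def model_based_learning(home, random_step, fit_model, plan, run_policy, K1, K2):
    """Schematic exploration/exploitation loop with homing (illustrative sketch only)."""
    data = []
    for t in count(1):
        # Exploration: gather K1(t) trajectories of length t from approximate resets.
        for _ in range(K1(t)):
            data.append([random_step() for _ in range(t)])   # (action, obs, reward) triples
            home()                                           # approximate reset
        empirical_model = fit_model(data)
        policy = plan(empirical_model, horizon=t)            # optimal policy on the model
        # Exploitation: run the planned policy, homing between runs.
        for _ in range(K2(t)):
            run_policy(policy, steps=t)
            home()
```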
Model based algorithm

[Diagram: the empirical model is built as a tree rooted at the approximate-reset start state s̃0, branching on actions (a1, a2) and observations (o1, o2), repeated along one long trajectory.]

Model based algorithm – Computing the optimal policy
  • Bounding the error in the model
    • Significant Nodes
      • Sampling
      • Approximate reset
    • Insignificant Nodes
  • Compute an ε-optimal t-horizon policy at each step
Model based algorithm – Convergence w.p. 1 proof
  • Proof idea:
  • At any stage, K1(t) is large enough that we compute an ε_t-optimal t-horizon policy
  • K2(t) is large enough that the influence of all earlier phases is bounded by ε_t
  • For a large enough horizon, the homing sequence influence is also bounded
Model based algorithm – Convergence rate
  • The model based algorithm produces an ε-optimal policy, with probability 1 − δ, in time polynomial in 1/ε, |A|, |O|, log(1/δ), and the homing sequence length, and exponential in the horizon time of the optimal policy
  • Note the algorithm does not depend on |S|
Planning in POMDP
  • Unfortunately, not today …
  • Basic results:
    • Tight connections with Multiplicity Automata
      • Well-established theory starting in the 1960s
    • Rank of the Hankel matrix
      • Similar to PSR
      • Always at most the number of states
    • Planning algorithm:
      • Exponential in the rank of the Hankel matrix
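For context, the underlying notion (a standard definition, not spelled out in the slides) is the Hankel matrix of a real-valued function over strings, here over action-observation sequences:

```latex
% Hankel matrix of f : \Sigma^* \to \mathbb{R}, rows indexed by prefixes x,
% columns by suffixes y:
\[
H_f(x, y) = f(xy).
\]
% Classical result (Carlyle--Paz, Fliess): rank(H_f) equals the size of the smallest
% multiplicity automaton computing f, which is why the planning algorithm can be
% parameterized by this rank rather than by the number of POMDP states.
```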
Tracking in POMDPs
  • Belief states algorithm
    • Assumes perfect tracking
      • Perfect model.
  • With an imperfect model, tracking can be impossible
    • For example: no observations at all
  • New results:
    • “Informative observables” implies efficient tracking.
  • Towards a spectrum of “partially” …