
Learning and Planning for POMDPs

Eyal Even-Dar, Tel-Aviv University

Sham Kakade, University of Pennsylvania

Yishay Mansour, Tel-Aviv University


Talk Outline

  • Bounded Rationality and Partially Observable MDPs

  • Mathematical Model of POMDPs

  • Learning in POMDPs

  • Planning in POMDPs

  • Tracking in POMDPs


Bounded Rationality

  • Rationality:

    • Players with unlimited computational power

  • Bounded Rationality:

    • Computational limitations

    • Finite automata

  • Challenge: play optimally against a finite automaton

    • The size of the automaton is unknown


Bounded Rationality and RL

  • Model:

    • Perform an action

    • See an observation

    • Either immediate rewards or delayed reward

  • This is a POMDP

    • Unknown size is a serious challenge


Classical Reinforcement Learning: Agent – Environment Interaction

[Slide figure: the agent-environment loop. The agent sends an action to the environment, and the environment returns a reward and the next state.]
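Below is a minimal sketch of this interaction loop, assuming a hypothetical environment object with `reset()` and `step(action)` methods and a table-lookup policy; the names are illustrative, not part of the talk.

```python
# Minimal agent-environment loop (sketch). `env` and `policy` are hypothetical:
# env.reset() returns a start state, env.step(a) returns (next_state, reward).
def run_episode(env, policy, horizon=100):
    state = env.reset()                    # agent starts in some state
    total_reward = 0.0
    for _ in range(horizon):
        action = policy[state]             # agent chooses an action for its state
        state, reward = env.step(action)   # environment returns next state and reward
        total_reward += reward
    return total_reward
```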


Reinforcement Learning - Goal

  • Maximize the return.

  • Discounted return: ∑_{t=1}^{∞} γ^t r_t, with 0 < γ < 1

  • Undiscounted return: (1/T) ∑_{t=1}^{T} r_t
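As a small illustration, assuming the rewards r_1, …, r_T of a trajectory are collected in a list, the two criteria can be computed as follows (a sketch, not the authors' code):

```python
# Return criteria over a finite list of rewards [r_1, ..., r_T].
def discounted_return(rewards, gamma=0.95):
    # sum over t of gamma^t * r_t, with 0 < gamma < 1
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

def average_return(rewards):
    # (1/T) * sum over t of r_t
    return sum(rewards) / len(rewards)
```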


Reinforcement Learning Model: Policy

  • Policy Π:

    • Mapping states to distributions over actions

  • Optimal policy Π*:

    • Attains optimal return from any start state.

  • Theorem:

    There exists a stationary deterministic optimal policy


Planning and Learning in MDPs

  • Planning:

    • Input: a complete model

    • Output: an optimal policy Π*

  • Learning:

    • Interaction with the environment

    • Achieve near optimal return.

  • For MDPs, both planning and learning can be done efficiently (see the value-iteration sketch below)

    • Polynomial in the number of states

    • Assumes a tabular representation
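As a sketch of the tabular planning case, here is standard value iteration, assuming the model is given as dictionaries `P[s][a]` (next-state distribution) and `R[s][a]` (expected reward); these names are illustrative:

```python
# Value iteration for a tabular MDP (sketch).
def value_iteration(states, actions, P, R, gamma=0.95, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                     # stop once values have converged
            break
    # Greedy policy with respect to the converged values.
    policy = {
        s: max(actions,
               key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
        for s in states
    }
    return V, policy
```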


Partially Observable Agent – Environment Interaction

[Slide figure: the agent-environment loop under partial observability. The agent sends an action; the environment returns a reward and only a signal correlated with the state.]


Partially Observable Markov Decision Process

  • S – the states

  • A – the actions

  • P_sa(·) – next-state distribution

  • R(s,a) – reward distribution

  • O – the observations

  • O(s,a) – observation distribution

[Slide figure: a three-state POMDP example with states s1, s2, s3 and observations O1, O2, O3. Each state has its own observation distribution (one state emits O1 with probability .8, another emits O2 with probability .8, the third emits O3 with probability .8, each giving the remaining observations probability .1); transitions are shown with probabilities 0.7 and 0.3, and E[R(s3,a)] = 10.]

Partial Observables – Problems in Planning

  • The optimal policy is not stationary; furthermore, it is history dependent

  • Example:


Partial Observables – Complexity/Hardness Results

[LGM01, L95]


Learning in POMDPs – Difficulties

  • Suppose an agent knows its state initially; can it keep track of its state?

    • Easy given a completely accurate model.

    • Inaccurate model: Our new tracking result.

  • How can the agent return to the same state?

  • What is the meaning of very long histories?

    • Do we really need to keep all the history?!


Planning in POMDPs – Belief State Algorithm

  • A Bayesian setting

  • Prior over initial state

  • Given an action and an observation, define a posterior

    • Belief state: a distribution over states

  • View the possible belief states as “states”

    • Infinite number of states

  • Also assumes a “perfect model” (belief-update sketch below)
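A minimal sketch of the Bayesian belief-state update, reusing the POMDP container sketched earlier and assuming the observation distribution is attached to the state reached and the action taken:

```python
# Belief-state (posterior) update after taking `action` and seeing `obs` (sketch).
def belief_update(pomdp, belief, action, obs):
    new_belief = {}
    for s2 in pomdp.states:
        # Predict: probability of landing in s2 under the current belief.
        pred = sum(belief[s] * pomdp.P[(s, action)].get(s2, 0.0) for s in pomdp.states)
        # Correct: weight by the likelihood of the observation at s2.
        new_belief[s2] = pred * pomdp.O[(s2, action)].get(obs, 0.0)
    norm = sum(new_belief.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / norm for s, p in new_belief.items()}
```

The update is exact only under a perfect model, which is exactly the assumption flagged above.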


Learning in POMDPs – Popular methods

  • Policy gradient methods:

    • Find a locally optimal policy in a restricted class of policies (parameterized policies)

    • Need to assume a reset to the start state!

    • Cannot guarantee asymptotic results

    • [Peshkin et al., Baxter & Bartlett, …]


Learning in POMDPs

  • Trajectory trees [KMN]:

    • Assume a generative model

      • A strong RESET procedure

    • Find a “near best” policy in a restricted class of policies

      • finite horizon policies

      • parameterized policies


Trajectory tree [KMN]

[Slide figure: a trajectory tree rooted at s0, branching on actions a1, a2 and on observations o1–o4 at each level.]

Our setting

  • Return: average reward criterion

  • One long trajectory

    • No RESET

    • Connected environment (unichain POMDP)

  • Goal: Achieve the optimal return (average reward) with probability 1


Homing strategies - POMDPs

  • A homing strategy is a strategy that identifies the state.

    • Knows how to return “home”

  • Enables an “approximate reset” during a long trajectory.


Homing strategies

  • Learning finite automata [Rivest Schapire]

    • Use homing sequence to identify the state

      • The homing sequence is exact

      • It can lead to many states

    • Use the finite automata learning algorithm of [Angluin 87]

  • Diversity based learning [Rivest Schapire]

    • Similar to our setting

  • Major difference: deterministic transitions


Homing strategies - POMDPs

Definition:

H is an (ε,K)-homing strategy if, for every two belief states x1 and x2, after K steps of following H the expected belief states b1 and b2 are within ε distance.
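Under the simplifying assumption that H is an open-loop action sequence, this can be checked by propagating two belief states through K steps of H and comparing the results; since the Bayesian posterior averaged over observations equals the predicted distribution, observations can be marginalized out. A rough sketch reusing the earlier POMDP container (all names illustrative):

```python
# Expected belief after following the homing strategy H for K steps (sketch).
# H is taken to be a function from the step index to an action.
def expected_belief_after_homing(pomdp, belief, H, K):
    for k in range(K):
        a = H(k)
        # The expectation over observations of the posterior is the predicted
        # distribution, so observations drop out of the computation.
        new_belief = {s2: 0.0 for s2 in pomdp.states}
        for s in pomdp.states:
            for s2, p in pomdp.P[(s, a)].items():
                new_belief[s2] += belief[s] * p
        belief = new_belief
    return belief

def homing_distance(pomdp, b1, b2, H, K):
    e1 = expected_belief_after_homing(pomdp, b1, H, K)
    e2 = expected_belief_after_homing(pomdp, b2, H, K)
    return sum(abs(e1[s] - e2[s]) for s in pomdp.states)   # should be at most eps
```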


Homing strategies – Random Walk

  • If the POMDP is strongly connected, then the random-walk Markov chain is irreducible

  • Following the random walk ensures that we converge to the steady state


Homing strategies – Random Walk

  • What if the Markov chain is periodic?

    • a cycle

  • Use a “stay” action to overcome periodicity problems (see the sketch below)
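A toy illustration of the “stay” action, assuming numpy: mixing the random-walk transition matrix with the identity makes the chain lazy and hence aperiodic, so it converges to its steady state even on a cycle.

```python
import numpy as np

def lazy_walk(P, stay_prob=0.5):
    # Blend the walk with the identity: the "stay" action adds a self-loop.
    n = P.shape[0]
    return stay_prob * np.eye(n) + (1.0 - stay_prob) * P

# A deterministic 3-cycle is periodic; its lazy version is aperiodic.
cycle = np.array([[0., 1., 0.],
                  [0., 0., 1.],
                  [1., 0., 0.]])
print(np.linalg.matrix_power(lazy_walk(cycle), 50)[0])   # rows approach the uniform steady state
```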


Homing strategies – Amplifying

Claim:

If H is an (,K)-homing sequence then repeating H for T times is an (T,KT)-homing sequence


Reinforcement learning with homing

  • Usually algorithms should balance between exploration and exploitation

  • Now they should balance between exploration, exploitation and homing

  • Homing is performed in both exploration and exploitation


Policy testing algorithm

Theorem:

For any connected POMDP the policy testing algorithm obtains the optimal average reward with probability 1

After T time steps it competes with policies of horizon log log T


Policy testing

  • Enumerate the policies

    • Gradually increase horizon

  • Run in phases (see the sketch below):

    • Test policy πk

      • Average the return over several runs, resetting between runs

    • Run the best policy found so far

      • Ensures a good average return

      • Again, reset between runs.
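A rough sketch of this phase structure, under stated assumptions: `enumerate_policies` yields policies of gradually increasing horizon, `run_policy` executes one run and returns its observed return, and `home` follows the homing strategy to approximately reset. None of these names come from the talk, and in the actual algorithm the number of runs per phase grows over time.

```python
# Policy-testing loop (sketch): alternate between testing the next policy
# and exploiting the best policy found so far, homing between runs.
def policy_testing(env, enumerate_policies, run_policy, home, n_test=100, n_exploit=1000):
    best_policy, best_return = None, float("-inf")
    for policy in enumerate_policies():            # gradually increasing horizon
        # Test phase: estimate the candidate policy's average return.
        total = 0.0
        for _ in range(n_test):
            total += run_policy(env, policy)
            home(env)                              # approximate reset
        if total / n_test > best_return:
            best_policy, best_return = policy, total / n_test
        # Exploitation phase: run the best policy so far to keep the average return high.
        for _ in range(n_exploit):
            run_policy(env, best_policy)
            home(env)                              # approximate reset
```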


Model based algorithm

Theorem:

For any connected POMDP the model based algorithm obtains the optimal average reward with probability 1

After T time steps it competes with policies of horizon log T


Model based algorithm

  • For t = 1 to ∞ (a code sketch of this loop follows)

    • Exploration: for K1(t) times do

      • Run a random policy for t steps and build an empirical model

      • Use the homing sequence to approximately reset

    • Compute the optimal policy on the empirical model

    • Exploitation: for K2(t) times do

      • Run the empirical optimal policy for t steps

      • Use the homing sequence to approximately reset
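The same loop as a rough sketch, assuming hypothetical helpers `update_model`, `plan_t_horizon`, `run_policy`, and `home` (the homing-sequence reset); none of these names come from the talk.

```python
import itertools

# Model-based loop (sketch): explore with random runs, plan on the empirical
# model, then exploit the planned policy, homing after every run.
def model_based(env, home, update_model, plan_t_horizon, run_policy, K1, K2):
    model = None
    for t in itertools.count(1):
        # Exploration: grow the empirical model with K1(t) random t-step runs.
        for _ in range(K1(t)):
            trace = [env.random_step() for _ in range(t)]
            model = update_model(model, trace)
            home(env)                              # approximate reset
        # Plan a t-horizon policy on the empirical model.
        policy = plan_t_horizon(model, t)
        # Exploitation: run the empirical optimal policy K2(t) times.
        for _ in range(K2(t)):
            run_policy(env, policy, t)
            home(env)                              # approximate reset
```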


Model based algorithm

[Slide figure: the empirical model built from the approximate reset state s̃0, branching on actions a1, a2 and observations o1, o2, level by level.]


Model based algorithm – Computing the optimal policy

  • Bounding the error in the model

    • Significant Nodes

      • Sampling

      • Approximate reset

    • Insignificant Nodes

  • Compute an ε-optimal t-horizon policy at each step


Model Based algorithm – Convergence w.p. 1 proof

  • Proof idea:

  • At any stage, K1(t) is large enough that we compute an εt-optimal t-horizon policy

  • K2(t) is large enough that the influence of all previous phases is bounded by εt

  • For a large enough horizon, the influence of the homing sequence is also bounded


Model Based algorithm – Convergence rate

  • Model based algorithm produces an -optimal policy with probability 1 -  in time polynomial in , |A|,|O|, log(1/ ), Homing sequence length, and exponential in the horizon time of the optimal policy

  • Note the algorithm does not depend on |S|


Planning in POMDP

  • Unfortunately, not today …

  • Basic results:

    • Tight connections with Multiplicity Automata

      • Well-established theory starting in the 1960s

    • Rank of the Hankel matrix

      • Similar to PSR

      • Always at most the number of states

    • Planning algorithm:

      • Exponential in the rank of the Hankel matrix
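As a toy illustration of the Hankel-matrix idea: index rows by prefixes and columns by suffixes, put the probability of the concatenated string in each entry, and the rank of this matrix is at most the number of hidden states. The sketch below uses an uncontrolled two-state (HMM-like) process for simplicity; in the POMDP setting the strings would be action-observation sequences. numpy is assumed and all names are illustrative.

```python
import numpy as np
from itertools import product

def hankel_rank(f, alphabet, max_len=3):
    # Truncated Hankel matrix: H[u, v] = f(u + v) for strings up to max_len.
    strings = [""] + ["".join(w) for L in range(1, max_len + 1)
                      for w in product(alphabet, repeat=L)]
    H = np.array([[f(u + v) for v in strings] for u in strings])
    return np.linalg.matrix_rank(H, tol=1e-9)

# A two-hidden-state process over observations 'a' and 'b'.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])                        # hidden-state transitions
E = {'a': np.array([0.7, 0.1]),
     'b': np.array([0.3, 0.9])}                   # P(observation | hidden state)
init = np.array([0.5, 0.5])

def prefix_prob(word):
    # Forward recursion: probability that `word` is observed as a prefix.
    alpha = init.copy()
    for o in word:
        alpha = (alpha * E[o]) @ T
    return alpha.sum()

print(hankel_rank(prefix_prob, "ab"))             # 2: bounded by the number of hidden states
```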


Tracking in POMDPs

  • Belief states algorithm

    • Assumes perfect tracking

      • Perfect model.

  • With an imperfect model, tracking can be impossible

    • For example: no observations

  • New results:

    • “Informative observables” implies efficient tracking.

  • Towards a spectrum of “partially” …

