### Learning and Planning for POMDPs

Eyal Even-Dar, Tel-Aviv University

Sham Kakade, University of Pennsylvania

Yishay Mansour, Tel-Aviv University

Talk Outline
• Bounded Rationality and Partially Observable MDPs

• Mathematical Model of POMDPs
• Learning in POMDPs
• Planning in POMDPs
• Tracking in POMDPs
Bounded Rationality
• Rationality:
  • Players with unlimited computational power
• Bounded rationality:
  • Computational limitations
  • Modeled by finite automata
• Challenge: play optimally against a finite automaton
  • The size of the automaton is unknown
Bounded Rationality and RL
• Model:
  • Perform an action
  • See an observation
  • Receive either immediate or delayed rewards
• This is a POMDP
• The unknown size is a serious challenge

[Diagram: the agent–environment loop — the agent performs an action; the environment returns a reward and moves to the next state.]

Reinforcement Learning - Goal
• Maximize the return.
• Discounted return: ∑_{t=1}^∞ γ^t r_t, where 0 < γ < 1
• Undiscounted (average) return: (1/T) ∑_{t=1}^T r_t
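For concreteness, a minimal sketch in Python of the two return criteria (the reward sequence here is illustrative, not from the talk):

```python
def discounted_return(rewards, gamma):
    """Discounted return: sum over t of gamma^t * r_t, with 0 < gamma < 1."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

def average_return(rewards):
    """Undiscounted return: (1/T) * sum over t of r_t."""
    return sum(rewards) / len(rewards)

rewards = [1.0] * 100                     # illustrative: reward 1 at every step
print(discounted_return(rewards, 0.9))    # ~9.0, approaching gamma/(1-gamma)
print(average_return(rewards))            # 1.0
```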

Reinforcement Learning Model – Policy
• Policy Π: a mapping from states to distributions over actions
• Optimal policy Π*: attains the optimal return from any start state
• Theorem: there exists a stationary deterministic optimal policy

Planning and Learning in MDPs
• Planning:
  • Input: a complete model
  • Output: an optimal policy Π*
• Learning:
  • Interaction with the environment
  • Achieve a near-optimal return
• For MDPs, both planning and learning can be done efficiently
  • Polynomial in the number of states
  • Assumes a tabular representation
Partially Observable Agent – Environment Interaction

[Diagram: the agent–environment loop under partial observability — the agent performs an action; the environment returns a reward and a signal correlated with the state, rather than the state itself.]

Partially Observable Markov Decision Process
• S – the states
• A – the actions
• P_sa(·) – next-state distribution
• R(s,a) – reward distribution
• O – the observations
• O(s,a) – observation distribution

[Figure: a three-state example (s1, s2, s3) with transition probabilities 0.3 and 0.7, a different most-likely observation in each state (observation distributions of .8/.1/.1 over O1, O2, O3), and E[R(s3,a)] = 10.]
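A minimal sketch of a tabular POMDP representation (the class and array layout are illustrative, not from the talk; conventions differ on whether the observation depends on the current or the arrived-at state):

```python
import numpy as np

class TabularPOMDP:
    """P[s, a, s']: transition probabilities; R[s, a]: expected rewards;
    Obs[s, a, o]: observation probabilities."""
    def __init__(self, P, R, Obs):
        self.P, self.R, self.Obs = P, R, Obs
        self.n_states, self.n_actions, _ = P.shape
        self.n_obs = Obs.shape[-1]

    def step(self, s, a, rng):
        """Sample the next state, an observation, and the expected reward."""
        s_next = rng.choice(self.n_states, p=self.P[s, a])
        o = rng.choice(self.n_obs, p=self.Obs[s_next, a])  # observation emitted at the new state
        return s_next, o, self.R[s, a]
```

The agent only ever sees the action, observation, and reward, never s_next itself, which is what makes both learning and planning hard.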

Partial Observability – Problems in Planning
• The optimal policy is not stationary; furthermore, it is history dependent
• Example:
Learning in POMDPs – Difficulties
• Suppose an agent knows its state initially; can it keep track of its state?
• Easy given a completely accurate model.
• Inaccurate model: Our new tracking result.
• What is the meaning of very long histories?
• Do we really need to keep all the history?!
Planning in POMDPs – Belief State Algorithm
• A Bayesian setting:
  • A prior over the initial state
  • Each action and observation defines a posterior
  • Belief state: a distribution over states
• View the possible belief states as “states”
  • Infinite number of states
• Also assumes a “perfect model”
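A minimal sketch of the belief-state update this algorithm iterates, assuming the tabular arrays sketched above: b′(s′) ∝ Obs[s′, a, o] · ∑_s b(s) P[s, a, s′].

```python
import numpy as np

def belief_update(b, a, o, P, Obs):
    """One Bayes step: predict through the transition kernel for action a,
    then condition on observation o and renormalize.
    b: belief over states, shape (S,); P: (S, A, S); Obs: (S, A, O)."""
    predicted = b @ P[:, a, :]            # sum_s b(s) * P[s, a, s']
    posterior = predicted * Obs[:, a, o]  # weight by observation likelihood
    total = posterior.sum()
    if total == 0.0:
        raise ValueError("observation impossible under the model")
    return posterior / total
```

With a perfect model this tracks the belief exactly; the tracking results later in the talk concern what happens when the model is only approximate.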
Learning in POMDPs – Popular methods
• Find a locally optimal policy in a restricted class of policies (parameterized policies)
• Need to assume a reset to the start state!
• Cannot guarantee asymptotic results
• [Peshkin et al, Baxter & Bartlett,…]
Learning in POMDPs
• Trajectory trees [KMN]:
• Assume a generative model
• A strong RESET procedure
• Find a “near best” policy in a restricted class of policies
• finite horizon policies
• parameterized policies
Trajectory tree [KMN]

[Figure: a trajectory tree rooted at s0 — from the root, branches for actions a1 and a2; each action node branches on the possible observations (o1, o2, …), and the pattern repeats to the horizon.]
Our setting
• Return: Average reward criteria
• One long trajectory
• No RESET
• Connected environment (unichain POMDP)
• Goal: Achieve the optimal return (average reward) with probability 1
Homing strategies - POMDPs
• A homing strategy is a strategy that identifies the state
  • It knows how to return “home”
• It enables an “approximate reset” during a long trajectory
Homing strategies
• Learning finite automata [Rivest Schapire]
• Use homing sequence to identify the state
• The homing sequence is exact
• It can lead to many states
• Use finite automata learning of [Angluin 87]
• Diversity-based learning [Rivest Schapire]
• Similar to our setting
• Major difference: deterministic transitions
Homing strategies - POMDPs

Definition:

H is an (ε,K)-homing strategy if, for every two belief states x1 and x2, after K steps of following H the expected belief states b1 and b2 are within distance ε of each other.

Homing strategies – Random Walk
• If the POMDP is strongly connected, then the Markov chain induced by the random walk is irreducible
• Following the random walk therefore ensures convergence to the steady state
Homing strategies – Random Walk
• What if the Markov chain is periodic? (e.g., a cycle)
• Use a “stay” action to overcome periodicity problems
Homing strategies – Amplifying

Claim:

If H is an (ε,K)-homing sequence, then repeating H T times is an (ε^T, KT)-homing sequence: each repetition contracts the distance between the expected belief states by a factor of ε.

Reinforcement learning with homing
• Usually, algorithms must balance exploration and exploitation
• Here they must balance exploration, exploitation, and homing
• Homing is performed in both exploration and exploitation
Policy testing algorithm

Theorem:

For any connected POMDP the policy testing algorithm obtains the optimal average reward with probability 1

After T time steps, it competes with policies of horizon log log T

Policy testing
• Enumerate the policies
• Run in phases:
  • Test policy πk
    • Average over several runs, resetting (homing) between runs
  • Run the best policy so far
    • Ensures a good average return
    • Again, reset between runs
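A schematic sketch of these phases (the helpers `evaluate_run` and `home` are placeholders for measuring a policy's return and executing the homing strategy; this is not the authors' exact procedure):

```python
def policy_testing(policies, evaluate_run, home, n_test, n_exploit):
    """Phased policy testing: test each candidate, then exploit the best so far,
    using homing as an approximate reset between runs."""
    best_policy, best_value = None, float("-inf")
    for policy in policies:                 # enumeration of candidate policies
        # Test phase: estimate the return of the current candidate.
        estimate = 0.0
        for _ in range(n_test):
            estimate += evaluate_run(policy)
            home()                          # approximate reset
        if estimate / n_test > best_value:
            best_value, best_policy = estimate / n_test, policy
        # Exploitation phase: keep the average return high.
        for _ in range(n_exploit):
            evaluate_run(best_policy)
            home()
    return best_policy
```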
Model based algorithm

Theorem:

For any connected POMDP the model based algorithm obtains the optimal average reward with probability 1

After T time steps, it competes with policies of horizon log T

Model based algorithm

• For t = 1 to ∞:
  • Exploration: for K1(t) times do
    • Run a random policy for t steps and build an empirical model
    • Use the homing sequence to approximately reset
  • Compute the optimal policy on the empirical model
  • Exploitation: for K2(t) times do
    • Run the empirical optimal policy for t steps
    • Use the homing sequence to approximately reset
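A schematic sketch of this exploration/exploitation loop (the helpers `run_random`, `home`, `solve_empirical_model`, and `run_policy` and the schedules `K1`, `K2` are placeholders, not the authors' exact procedure):

```python
from itertools import count

def model_based_learning(K1, K2, run_random, home, solve_empirical_model, run_policy):
    """Grow the horizon t; at each t, explore with random runs that refine the
    empirical model, then exploit the policy computed on that model,
    homing after every run as an approximate reset. Runs forever, as in the slide."""
    model = {}                                # empirical transition/observation/reward estimates
    for t in count(1):
        for _ in range(K1(t)):                # exploration phase
            run_random(t, model)              # t random steps, update the empirical model
            home()
        policy = solve_empirical_model(model, horizon=t)
        for _ in range(K2(t)):                # exploitation phase
            run_policy(policy, t)
            home()
```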

Model based algorithm

[Figure: the empirical model — an estimated tree rooted at ŝ0, branching on actions a1, a2 and the observations o1, o2 that follow, extended step by step during exploration.]

Model based algorithm – Computing the optimal policy
• Bounding the error in the model
  • Significant nodes
    • Sampling
    • Approximate reset
  • Insignificant nodes
• Compute an ε-optimal t-horizon policy in each step
Model Based algorithm – Convergence w.p. 1 proof
• Proof idea:
  • At any stage, K1(t) is large enough that we compute an ε_t-optimal t-horizon policy
  • K2(t) is large enough that the influence of all earlier phases is bounded by ε_t
  • For a large enough horizon, the influence of the homing sequence is also bounded
Model Based algorithm – Convergence rate
• The model based algorithm produces an ε-optimal policy with probability 1 − δ in time polynomial in 1/ε, |A|, |O|, log(1/δ), and the homing sequence length, and exponential in the horizon time of the optimal policy
• Note that the algorithm does not depend on |S|
Planning in POMDP
• Unfortunately, not today …
• Basic results:
• Tight connections with Multiplicity Automata
• Well-established theory starting in the 1960s
• Rank of the Hankel matrix
• Similar to PSR
• Always less than the number of states
• Planning algorithm:
• Exponential in the rank of the Hankel matrix
Tracking in POMDPs
• The belief states algorithm
  • Assumes perfect tracking
  • Requires a perfect model
• With an imperfect model, tracking can be impossible
  • For example: no observations
• New results:
  • “Informative observables” imply efficient tracking
• Towards a spectrum of “partially” …