
# Learning and Planning for POMDPs


### Learning and Planning for POMDPs

Eyal Even-Dar, Tel-Aviv University

Sham Kakade, University of Pennsylvania

Yishay Mansour, Tel-Aviv University

### Talk Outline

• Bounded Rationality and Partially Observable MDPs
• Mathematical Model of POMDPs
• Learning in POMDPs
• Planning in POMDPs
• Tracking in POMDPs

### Bounded Rationality and Partially Observable MDPs

• Rationality:
  • Players with unlimited computational power
• Bounded rationality:
  • Computational limitations
  • Finite automata
• Challenge: play optimally against a finite automaton
  • The size of the automaton is unknown
• Model:
  • Perform an action
  • See an observation
  • Receive either immediate or delayed rewards
• This is a POMDP
  • The unknown size is a serious challenge

### Classical Reinforcement Learning: Agent – Environment Interaction

[Diagram: the agent performs an action; the environment returns a reward and the next state.]

• Maximize the return:
  • Discounted return: $\sum_{t} \gamma^{t} r_{t}$, with $0 < \gamma < 1$
  • Undiscounted return: $\frac{1}{T}\sum_{t=1}^{T} r_{t}$

• Policy Π:
  • A mapping from states to distributions over actions
• Optimal policy Π*:
  • Attains the optimal return from any start state
• Theorem: there exists a stationary deterministic optimal policy.
• Planning:
  • Input: a complete model
  • Output: an optimal policy Π*
• Learning:
  • Interact with the environment
  • Achieve a near-optimal return
• For MDPs, both planning and learning can be done efficiently:
  • Polynomial in the number of states
  • The model is represented in tabular form
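The efficiency claim above can be made concrete with tabular value iteration. This is a minimal sketch, not code from the talk; the array layout (`P[a, s, s']` for transitions, `R[s, a]` for expected rewards) and the function name are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration for a fully observable MDP.

    P: array of shape (|A|, |S|, |S|), P[a, s, s'] = Pr(s' | s, a)
    R: array of shape (|S|, |A|), expected immediate rewards
    Returns a stationary deterministic policy (argmax action per state) and V.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * V[s']
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new
        V = V_new
```

Each iteration costs O(|A| |S|²) with a dense tabular model, which is the polynomial-in-the-number-of-states behavior the slide refers to.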

### Partially Observable: Agent – Environment Interaction

[Diagram: the agent performs an action; the environment returns a reward and a signal (observation) correlated with the state.]

### Mathematical Model of POMDPs

• S: the set of states
• A: the set of actions
• P_sa(·): next-state distribution
• R(s,a): reward distribution
• O: the set of observations
• O(s,a): observation distribution
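A small container for these components might look as follows. This is a sketch with illustrative names (`POMDP`, `step`) and an assumed tabular layout, not code from the talk; the observation array follows the slide's O(s,a) convention (indexed by the current state and action).

```python
import numpy as np

class POMDP:
    """Tabular POMDP with states, actions and observations indexed 0..n-1."""

    def __init__(self, P, R, Z, rng=None):
        self.P = P            # P[a, s, s']: next-state distribution P_sa(.)
        self.R = R            # R[s, a]: expected reward
        self.Z = Z            # Z[s, a, o]: observation distribution O(s, a)
        self.rng = rng or np.random.default_rng()
        self.n_actions, self.n_states, _ = P.shape
        self.n_obs = Z.shape[-1]

    def step(self, s, a):
        """Generative model: sample an observation, a next state and a reward."""
        o = self.rng.choice(self.n_obs, p=self.Z[s, a])
        s_next = self.rng.choice(self.n_states, p=self.P[a, s])
        return s_next, o, self.R[s, a]
```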

• Example (see the figure below):

[Figure: a three-state POMDP with states s1, s2, s3, transition probabilities 0.7 and 0.3, and E[R(s3,a)] = 10. Observation distributions: s1: O1 = .8, O2 = .1, O3 = .1; s2: O1 = .1, O2 = .8, O3 = .1; s3: O1 = .1, O2 = .1, O3 = .8.]

• In this example the optimal policy is not stationary; furthermore, it is history dependent.

### Partial Observability – Complexity and Hardness Results [LGM01, L95]

• Suppose an agent knows its state initially; can it keep track of its state?
  • Easy given a completely accurate model.
  • Inaccurate model: our new tracking result.
• What is the meaning of very long histories?
  • Do we really need to keep all of the history?!

### Planning in POMDPs – The Belief-State Algorithm

• A Bayesian setting:
  • A prior over the initial state
  • Each action and observation defines a posterior
  • Belief state: a distribution over states
• View the possible belief states as "states"
  • Infinite number of states
• Also assumes a "perfect model"
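A minimal sketch of the posterior computation described above, reusing the tabular arrays from the earlier sketch (`P[a, s, s']`, `Z[s, a, o]`). Exactly when the observation is emitted is a modelling assumption here, not something fixed by the slides.

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    """Posterior belief over the next state after taking a from belief b and seeing o.

    Convention (an assumption): the observation o is emitted by the current state
    and action, Z[s, a, o], before the transition P[a, s, s'] is applied.
    """
    weighted = b * Z[:, a, o]        # likelihood of o in each possible current state
    posterior = weighted @ P[a]      # propagate the reweighted belief forward
    total = posterior.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return posterior / total
```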

• Find a locally optimal policy in a restricted class of policies (parameterized policies)
  • Need to assume a reset to the start state!
  • Cannot guarantee asymptotic results
  • [Peshkin et al., Baxter & Bartlett, …]

• Trajectory trees [KMN]:
  • Assume a generative model
  • A strong RESET procedure
  • Find a "near best" policy in a restricted class of policies:
    • finite-horizon policies
    • parameterized policies

[Figure: a trajectory tree rooted at s0, branching on actions a1, a2 and observations o1–o4 at each level.]
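A sketch of the trajectory-tree idea under the stated assumptions: a generative model `gen_model(state, action) -> (next_state, observation, reward)` (an assumed interface, not the talk's code), one sampled outcome per action per node, and a finite-horizon policy mapping observation histories to actions. In [KMN] many such trees are built and their estimates averaged; the same trees can be reused to compare every policy in the restricted class.

```python
def build_tree(gen_model, state, actions, depth):
    """Build one trajectory tree: at each node, sample one outcome per action."""
    if depth == 0:
        return None
    node = {}
    for a in actions:
        next_state, obs, reward = gen_model(state, a)
        node[a] = (obs, reward, build_tree(gen_model, next_state, actions, depth - 1))
    return node

def evaluate(policy, node, history=(), gamma=0.95):
    """Estimate the discounted return of a history-to-action policy on one tree."""
    if node is None:
        return 0.0
    a = policy(history)
    obs, reward, child = node[a]
    return reward + gamma * evaluate(policy, child, history + (obs,), gamma)
```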

• Return: the average-reward criterion
• One long trajectory
  • No RESET
• Connected environment (unichain POMDP)
• Goal: achieve the optimal return (average reward) with probability 1

• A homing strategy is a strategy that identifies the state.
  • It knows how to return "home".
  • It enables an "approximate reset" during a long trajectory.

• Learning finite automata [Rivest & Schapire]:
  • Uses a homing sequence to identify the state
  • The homing sequence is exact
  • It can lead to many states
  • Uses the finite-automata learning algorithm of [Angluin 87]
• Diversity-based learning [Rivest & Schapire]:
  • Similar to our setting
  • Major difference: deterministic transitions

Definition: H is an (ε, K)-homing strategy if, for every two belief states x1 and x2, after K steps of following H the expected belief states b1 and b2 are within distance ε.

• If the POMDP is strongly connected, then the Markov chain induced by a random walk is irreducible.
• Following the random walk therefore ensures convergence to the steady state.
• What if the Markov chain is periodic (e.g., a cycle)?
  • Use a "stay" action to overcome periodicity problems.

Claim: If H is an (ε, K)-homing sequence, then repeating H for T times is an (ε^T, KT)-homing sequence.
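For a homing strategy that ignores observations (such as the random walk above), the expected belief state after K steps is just the prior pushed through the transition matrices, so the definition and the claim can be illustrated numerically. A toy sketch with an invented aperiodic 3-state chain, not the talk's example:

```python
import numpy as np

# Illustrative 3-state random-walk chain; the "stay" probability on the diagonal
# avoids the periodicity issue mentioned above.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

def expected_belief(b, P, steps):
    """Expected belief after following the random walk for the given number of steps."""
    for _ in range(steps):
        b = b @ P
    return b

b1 = np.array([1.0, 0.0, 0.0])   # two very different starting beliefs
b2 = np.array([0.0, 0.0, 1.0])
for k in (1, 2, 4, 8, 16):
    dist = np.abs(expected_belief(b1, P, k) - expected_belief(b2, P, k)).sum()
    print(f"K = {k:2d}: L1 distance between expected beliefs = {dist:.4f}")
```

The distance shrinks geometrically with the number of repetitions, which is the behavior the claim captures.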

• Usually algorithms should balance between exploration and exploitation

• Now they should balance between exploration, exploitation and homing

• Homing is performed in both exploration and exploitation

Theorem: For any connected POMDP, the policy-testing algorithm obtains the optimal average reward with probability 1. After T time steps it competes with policies of horizon log log T.

• Enumerate the policies.
• Run in phases; in phase k:
  • Test policy πk:
    • Average several runs, resetting (homing) between runs.
  • Run the best policy so far:
    • Ensures a good average return.
    • Again, reset between runs.
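A skeleton of the phase structure just described; `run_policy`, `approx_reset` and the schedule functions are placeholders for the environment interaction and the homing strategy (illustrative names, not the talk's implementation).

```python
def policy_testing(policies, run_policy, approx_reset, n_test, n_exploit):
    """Alternate between testing candidate policies and exploiting the best so far.

    policies: enumeration of candidate finite-horizon policies
    run_policy(pi): run pi for one trajectory segment and return its reward
    approx_reset(): follow the homing strategy to approximately reset the belief
    n_test(k), n_exploit(k): phase lengths, chosen so exploitation dominates
    """
    best_pi, best_value = None, float("-inf")
    for k, pi in enumerate(policies):
        # Test phase: estimate the return of policy pi, homing between runs.
        estimate = 0.0
        for _ in range(n_test(k)):
            estimate += run_policy(pi)
            approx_reset()
        estimate /= n_test(k)
        if estimate > best_value:
            best_pi, best_value = pi, estimate
        # Exploitation phase: run the best policy found so far to keep the
        # long-run average reward close to optimal.
        for _ in range(n_exploit(k)):
            run_policy(best_pi)
            approx_reset()
    return best_pi
```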

Theorem: For any connected POMDP, the model-based algorithm obtains the optimal average reward with probability 1. After T time steps it competes with policies of horizon log T.

• For t = 1 to ∞:
  • Exploration: for K1(t) times do
    • Run a random walk for t steps and build an empirical model.
    • Use the homing sequence to approximately reset.
  • Compute the optimal policy on the empirical model.
  • Exploitation: for K2(t) times do
    • Run the empirical optimal policy for t steps.
    • Use the homing sequence to approximately reset.

[Figure: the empirical model, a tree rooted at an approximate start state s̃0 and branching on actions a1, a2 and observations o1, o2.]
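A schematic of the explore / compute / exploit loop above. `act`, `approx_reset`, `build_empirical_model` and `solve_finite_horizon` are placeholders for the environment interaction, the homing strategy and the planner; all names are illustrative assumptions, not the talk's implementation.

```python
import itertools
import random

def model_based_learner(actions, act, approx_reset,
                        build_empirical_model, solve_finite_horizon, K1, K2):
    """Random-walk exploration, empirical-model planning, then exploitation, in phases."""
    for t in itertools.count(1):
        # Exploration: K1(t) random-walk trajectories of length t, homing after each.
        trajectories = []
        for _ in range(K1(t)):
            traj = []
            for _ in range(t):
                a = random.choice(actions)
                obs, reward = act(a)
                traj.append((a, obs, reward))
            trajectories.append(traj)
            approx_reset()
        # Fit an empirical model of t-step observable histories and plan on it.
        model = build_empirical_model(trajectories)
        policy = solve_finite_horizon(model, horizon=t)
        # Exploitation: run the empirical optimal policy K2(t) times, homing in between.
        for _ in range(K2(t)):
            history = []
            for _ in range(t):
                obs, reward = act(policy(tuple(history)))
                history.append(obs)
            approx_reset()
```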

• Bounding the error in the model:
  • Significant nodes: sampling and the approximate reset.
  • Insignificant nodes.
• Compute an ε-optimal t-horizon policy in each step.

### Model-Based Algorithm – Convergence w.p. 1 (Proof)

• Proof idea:
  • At any stage, K1(t) is large enough that we compute an ε_t-optimal t-horizon policy.
  • K2(t) is large enough that the influence of all earlier phases is bounded by ε_t.
  • For a large enough horizon, the influence of the homing sequence is also bounded.

### Model-Based Algorithm – Convergence Rate

• The model-based algorithm produces an ε-optimal policy, with probability 1 − δ, in time polynomial in 1/ε, |A|, |O|, log(1/δ) and the homing-sequence length, and exponential in the horizon time of the optimal policy.
• Note that the algorithm does not depend on |S|.

• Unfortunately, not today …

• Basic results:
  • Tight connections with Multiplicity Automata.
  • A well-established theory dating back to the 1960s.
  • Rank of the Hankel matrix:
    • Similar to PSRs.
    • Always at most the number of states.
• Planning algorithm:
  • Exponential in the rank of the Hankel matrix.

• The belief-state algorithm:
  • Assumes perfect tracking.
  • Assumes a perfect model.
• With an imperfect model, tracking can be impossible.
  • For example: no observations.
• New results:
  • "Informative observables" imply efficient tracking.
  • Towards a spectrum of "partially" …