
### Learning and Planning for POMDPs

Eyal Even-Dar, Tel-Aviv University

Sham Kakade, University of Pennsylvania

Yishay Mansour, Tel-Aviv University

Talk Outline

- Bounded Rationality and Partially Observable MDPs

- Mathematical Model of POMDPs
- Learning in POMDPs
- Planning in POMDPs
- Tracking in POMDPs

Bounded Rationality

- Rationality:
- Players with unlimited computational power
- Bounded Rationality
- Computational limitation
- Finite Automata
- Challenge: play optimally against a finite automaton
- Size of automata unknown

Bounded Rationality and RL

- Model:
- Perform an action
- See an observation
- Either immediate or delayed rewards
- This is a POMDP
- Unknown size is a serious challenge

Classical Reinforcement Learning: Agent – Environment Interaction

[Diagram: the agent sends an action to the environment, which returns a reward and the next state.]

Reinforcement Learning - Goal

- Maximize the return.
- Discounted return: $\sum_{t=1}^{\infty} \gamma^t r_t$, with $0 < \gamma < 1$
- Undiscounted return: $\frac{1}{T} \sum_{t=1}^{T} r_t$
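The two return criteria can be computed directly from a recorded reward sequence. This is a minimal sketch in Python; the function names are hypothetical, and the infinite discounted sum is truncated at the length of the reward list that is passed in.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t for t = 1..len(rewards), with 0 < gamma < 1."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


def average_return(rewards):
    """Undiscounted return: the mean reward over T = len(rewards) steps."""
    if not rewards:
        raise ValueError("need at least one reward")
    return sum(rewards) / len(rewards)
```

For example, with rewards `[1, 1]` and `gamma = 0.5` the discounted return is 0.5 + 0.25 = 0.75, while the average return of `[1, 2, 3]` is 2.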

Reinforcement Learning Model: Policy

- Policy Π:
- Mapping from states to distributions over actions
- Optimal policy Π*:
- Attains optimal return from any start state.
- Theorem:

There exists a stationary deterministic optimal policy

Planning and Learning in MDPs

- Planning:
- Input: a complete model
- Output: an optimal policy Π*:
- Learning:
- Interaction with the environment
- Achieve near optimal return.
- For MDPs both planning and learning can be done efficiently
- Polynomial in the number of states
- representation in tabular form

Partially Observable: Agent – Environment Interaction

[Diagram: the agent sends an action to the environment, which returns a reward and a signal correlated with the state.]

Partially Observable Markov Decision Process

- S: the states
- A: the actions
- P_sa(·): next-state distribution
- R(s,a): reward distribution
- O: the observations
- O(s,a): observation distribution

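[Figure: a three-state POMDP with states s1, s2 and s3, transition probabilities 0.3 and 0.7, E[R(s3,a)] = 10, and per-state observation distributions (O1, O2, O3) = (.1, .8, .1), (.8, .1, .1) and (.1, .1, .8).]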
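The components of the model map naturally onto a small data structure. The following is a minimal sketch in Python, not the authors' code. The class `POMDP`, the helper `sample`, and the toy model `TOY` are hypothetical names, and the transition structure of `TOY` is made up for illustration. It assumes that the observation is emitted by the state the agent arrives in and that each reward is replaced by its expectation. Only E[R(s3,a)] = 10 and the shape of the observation distributions echo the figure; which distribution belongs to which state is a guess.

```python
import random
from dataclasses import dataclass


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


@dataclass
class POMDP:
    states: list        # S
    actions: list       # A
    observations: list  # O
    trans: dict         # trans[(s, a)] -> {next_state: prob}, i.e. P_sa(.)
    reward: dict        # reward[(s, a)] -> expected reward (assumption: no noise)
    obs: dict           # obs[(s, a)] -> {observation: prob}

    def step(self, s, a, rng):
        """Take action a in hidden state s; return (next_state, observation, reward)."""
        s_next = sample(self.trans[(s, a)], rng)
        # Assumption: the observation is emitted by the state we arrive in.
        o = sample(self.obs[(s_next, a)], rng)
        return s_next, o, self.reward[(s, a)]


# Hypothetical three-state, one-action model.
TOY = POMDP(
    states=["s1", "s2", "s3"],
    actions=["a"],
    observations=["O1", "O2", "O3"],
    trans={
        ("s1", "a"): {"s2": 0.7, "s3": 0.3},
        ("s2", "a"): {"s1": 1.0},
        ("s3", "a"): {"s1": 1.0},
    },
    reward={("s1", "a"): 0.0, ("s2", "a"): 0.0, ("s3", "a"): 10.0},
    obs={
        ("s1", "a"): {"O1": 0.8, "O2": 0.1, "O3": 0.1},
        ("s2", "a"): {"O1": 0.1, "O2": 0.8, "O3": 0.1},
        ("s3", "a"): {"O1": 0.1, "O2": 0.1, "O3": 0.8},
    },
)
```

Calling `TOY.step("s1", "a", random.Random(0))` returns a hidden next state, an observation and a reward; the agent itself would only get to see the last two.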

Partial Observables – problems in Planning

- The optimal policy is not stationary; furthermore, it is history dependent
- Example:

Learning in POMDPs – Difficulties

- Suppose the agent knows its initial state. Can it keep track of its state?
- Easy given a completely accurate model.
- Inaccurate model: Our new tracking result.
- How can the agent return to the same state?
- What is the meaning of very long histories?
- Do we really need to keep all the history?!

Planning in POMDPs – Belief State Algorithm

- A Bayesian setting
- Prior over initial state
- An action and an observation together define a posterior
- belief state: distribution over states
- View the possible belief states as “states”
- Infinite number of states
- Also assumes a “perfect model”
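The posterior step is a standard Bayes filter. Below is a minimal sketch in Python under stated assumptions: the function name `belief_update` and the dictionary layout are hypothetical, the model is known exactly, and the observation is emitted by the state the agent arrives in.

```python
def belief_update(belief, action, observation, trans, obs):
    """Return the posterior belief after taking `action` and seeing `observation`.

    belief:           {state: probability}
    trans[(s, a)]:    {next_state: probability}
    obs[(s_next, a)]: {observation: probability}
    """
    unnormalized = {}
    for s, p_s in belief.items():
        for s_next, p_t in trans[(s, action)].items():
            likelihood = obs[(s_next, action)].get(observation, 0.0)
            unnormalized[s_next] = unnormalized.get(s_next, 0.0) + p_s * p_t * likelihood
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical two-state example: the action leaves the state unchanged,
# and each state emits "its own" observation with probability 0.8.
trans = {("s1", "stay"): {"s1": 1.0}, ("s2", "stay"): {"s2": 1.0}}
obs = {
    ("s1", "stay"): {"o1": 0.8, "o2": 0.2},
    ("s2", "stay"): {"o1": 0.2, "o2": 0.8},
}
prior = {"s1": 0.5, "s2": 0.5}
posterior = belief_update(prior, "stay", "o1", trans, obs)
```

Starting from a uniform prior, seeing `o1` moves the belief in `s1` to 0.4 / (0.4 + 0.1) = 0.8. Because the belief is a vector of real numbers, the set of reachable belief states is in general infinite, as the slide notes.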

Learning in POMDPs – Popular methods

- Policy gradient methods:
- Find a locally optimal policy in a restricted class of policies (parameterized policies)
- Need to assume a reset to the start state!
- Cannot guarantee asymptotic results
- [Peshkin et al, Baxter & Bartlett,…]

Learning in POMDPs

- Trajectory trees [KMN]:
- Assume a generative model
- A strong RESET procedure
- Find a “near best” policy in a restricted class of policies
- finite horizon policies
- parameterized policies

Our setting

- Return: average reward criterion
- One long trajectory
- No RESET
- Connected environment (unichain POMDP)
- Goal: Achieve the optimal return (average reward) with probability 1

Homing strategies - POMDPs

- A homing strategy is a strategy that identifies the state.
- Knows how to return “home”
- Enables an “approximate reset” during a long trajectory.

Homing strategies

- Learning finite automata [Rivest Schapire]
- Use homing sequence to identify the state
- The homing sequence is exact
- It can lead to many states
- Use finite automata learning of [Angluin 87]
- Diversity based learning [Rivest Schapire]
- Similar to our setting
- Major difference: deterministic transitions

Homing strategies - POMDPs

Definition:

H is an (ε,K)-homing strategy if, for every two belief states x1 and x2, after K steps of following H, the expected belief states b1 and b2 are within distance ε.

Homing strategies – Random Walk

- If the POMDP is strongly connected, then the random walk Markov chain is irreducible
- Following the random walk assures that we converge to the steady state

Homing strategies – Random Walk

- What if the Markov chain is periodic?
- a cycle
- Use “stay action” to overcome periodicity problems
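The periodicity problem and its fix can be seen on a tiny chain. This is a sketch under stated assumptions, not the construction from the talk: it ignores observations and tracks only how two starting beliefs evolve under a fixed Markov chain, the three-state cycle is a hypothetical example, and the helper names are made up.

```python
def push_forward(belief, chain):
    """One step of a Markov chain; chain[i][j] is the probability of i -> j."""
    n = len(chain)
    return [sum(belief[i] * chain[i][j] for i in range(n)) for j in range(n)]


def l1_distance(b1, b2):
    """L1 distance between two belief vectors."""
    return sum(abs(x - y) for x, y in zip(b1, b2))


def lazy(chain, stay_prob):
    """Mix in a 'stay' action: remain in place with probability stay_prob."""
    n = len(chain)
    mixed = []
    for i in range(n):
        row = [(1.0 - stay_prob) * chain[i][j] for j in range(n)]
        row[i] += stay_prob
        mixed.append(row)
    return mixed


def distance_after(chain, b1, b2, steps):
    """L1 distance between two beliefs after both follow the chain for `steps` steps."""
    for _ in range(steps):
        b1 = push_forward(b1, chain)
        b2 = push_forward(b2, chain)
    return l1_distance(b1, b2)


# Deterministic cycle s0 -> s1 -> s2 -> s0: irreducible but periodic.
cycle = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
start_a = [1.0, 0.0, 0.0]
start_b = [0.0, 1.0, 0.0]

periodic_gap = distance_after(cycle, start_a, start_b, 20)
lazy_gap = distance_after(lazy(cycle, 0.5), start_a, start_b, 20)
```

On the pure cycle the two beliefs chase each other forever, so `periodic_gap` stays at 2. Once the walk stays put half of the time, the chain becomes aperiodic and `lazy_gap` shrinks towards 0.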

Homing strategies – Amplifying

Claim:

If H is an (ε,K)-homing sequence, then repeating H T times is an (ε^T,KT)-homing sequence
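Reading the claim as "each repetition of H shrinks the distance by another factor of ε", the number of repetitions needed for a target accuracy follows by simple arithmetic. The helper below is a hypothetical illustration of that reading, not part of the talk.

```python
def repetitions_needed(eps, target):
    """Smallest T >= 1 with eps**T <= target, for 0 < eps < 1 and target > 0."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if target <= 0.0:
        raise ValueError("target must be positive")
    repetitions, distance = 1, eps
    while distance > target:
        distance *= eps
        repetitions += 1
    return repetitions


def homing_steps(eps, k, target):
    """Total number of steps K*T spent homing to reach the target distance."""
    return k * repetitions_needed(eps, target)
```

With ε = 0.5 and K = 10, reaching distance 0.125 takes T = 3 repetitions, that is 30 steps: the accuracy improves exponentially while the cost grows only linearly in T.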

Reinforcement learning with homing

- Usually algorithms should balance between exploration and exploitation
- Now they should balance between exploration, exploitation and homing
- Homing is performed in both exploration and exploitation

Policy testing algorithm

Theorem:

For any connected POMDP the policy testing algorithm obtains the optimal average reward with probability 1

After T time steps it competes with policies of horizon log log T

Policy testing

- Enumerate the policies
- Gradually increase horizon
- Run in phases:
- Test policy πk
- Average runs, resetting between runs
- Run the best policy so far
- Ensures good average return
- Again, reset between runs.
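The phase structure above can be sketched as a short loop. This is an illustrative skeleton only: `run_episode` and `home` are hypothetical callables supplied by the caller (one runs a policy for a given horizon and returns its reward, the other performs the approximate reset), the horizon is held fixed rather than gradually increased, and the numbers of runs per phase are free parameters rather than the schedule from the talk.

```python
def policy_testing(policies, horizon, test_runs, exploit_runs, run_episode, home):
    """Test the policies one per phase, running the best policy so far in between.

    Returns (best_policy, best_average, rewards_collected_while_exploiting).
    """
    best_policy, best_average = None, float("-inf")
    collected = []
    for policy in policies:                      # phase k tests policy pi_k
        total = 0.0
        for _ in range(test_runs):
            total += run_episode(policy, horizon)
            home()                               # approximate reset between runs
        average = total / test_runs
        if average > best_average:
            best_policy, best_average = policy, average
        for _ in range(exploit_runs):            # keeps the average return good
            collected.append(run_episode(best_policy, horizon))
            home()                               # again, reset between runs
    return best_policy, best_average, collected
```

Making `exploit_runs` large relative to `test_runs` keeps the time spent on poor candidate policies a small fraction of the trajectory.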

Model based algorithm

Theorem:

For any connected POMDP the model based algorithm obtains the optimal average reward with probability 1

After T time steps it competes with policies of horizon log T

Model based algorithm

- For t = 1 to ∞
- Exploration: for K1(t) times do
- Take random actions for t steps and build an empirical model
- Use the homing sequence to approximate a reset
- Compute the optimal policy on the empirical model
- Exploitation: for K2(t) times do
- Run the empirical optimal policy for t steps
- Use the homing sequence to approximate a reset
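The explore / plan / exploit schedule can be written as a loop skeleton. This is a sketch, not the authors' implementation: the infinite loop over t is cut off at `max_t`, and `k1`, `k2`, `explore`, `plan`, `exploit` and `home` are hypothetical callables that the caller must supply.

```python
def model_based_loop(max_t, k1, k2, explore, plan, exploit, home):
    """Run phases t = 1..max_t and return the rewards collected while exploiting.

    k1(t), k2(t):      number of exploration / exploitation runs in phase t
    explore(model, t): act randomly for t steps and update `model` in place
    plan(model, t):    return a t-horizon policy that is optimal for `model`
    exploit(policy, t): run `policy` for t steps and return the reward obtained
    home():            follow the homing sequence to approximate a reset
    """
    model = {}
    collected = []
    for t in range(1, max_t + 1):
        for _ in range(k1(t)):          # exploration
            explore(model, t)
            home()
        policy = plan(model, t)
        for _ in range(k2(t)):          # exploitation
            collected.append(exploit(policy, t))
            home()
    return collected
```

Every run, whether exploring or exploiting, ends with a homing step, which is why homing has to be balanced against the other two activities.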

Model based algorithm – Computing the optimal policy

- Bounding the error in the model
- Significant Nodes
- Sampling
- Approximate reset
- Insignificant Nodes
- Compute an ε-optimal t horizon policy in each step
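One way to picture the empirical model is as a table of counts indexed by action-observation histories, where a node counts as "significant" once it has been sampled often enough. The class below is a hypothetical sketch of that idea, not the data structure from the talk: the name `EmpiricalModel`, the `min_visits` threshold, and the omission of rewards are all assumptions made for brevity.

```python
from collections import defaultdict


class EmpiricalModel:
    """Observation counts indexed by (history, action).

    A history is a tuple of (action, observation) pairs since the last
    approximate reset.
    """

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, trajectory):
        """Add one run, given as a list of (action, observation) pairs."""
        history = ()
        for action, observation in trajectory:
            self.counts[(history, action)][observation] += 1
            history += ((action, observation),)

    def visits(self, history, action):
        """Number of times `action` was taken after `history`."""
        seen = self.counts.get((history, action))
        return sum(seen.values()) if seen else 0

    def prob(self, history, action, observation):
        """Empirical probability of `observation`, or None if the node is unvisited."""
        total = self.visits(history, action)
        if total == 0:
            return None
        return self.counts[(history, action)].get(observation, 0) / total

    def significant(self, history, action, min_visits):
        """A node is treated as significant once it has enough samples."""
        return self.visits(history, action) >= min_visits
```

The table is indexed by histories rather than hidden states, so its size depends on the actions, the observations and the horizon, not on the number of states.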

Model based algorithm – Proof of convergence w.p. 1

- Proof idea:
- At every stage, K1(t) is large enough that we compute an ε_t-optimal t-horizon policy
- K2(t) is large enough that the influence of all earlier phases is bounded by ε_t
- For a large enough horizon, the influence of the homing sequence is also bounded

Model based algorithm – Convergence rate

- The model based algorithm produces an ε-optimal policy with probability 1 − δ, in time polynomial in 1/ε, |A|, |O|, log(1/δ), and the homing sequence length, and exponential in the horizon of the optimal policy
- Note the algorithm does not depend on |S|

Planning in POMDP

- Unfortunately, not today …
- Basic results:
- Tight connections with Multiplicity Automata
- Well-established theory dating from the 1960s
- Rank of the Hankel matrix
- Similar to PSR
- Always at most the number of states
- Planning algorithm:
- Exponential in the rank of the Hankel matrix

Tracking in POMDPs

- Belief states algorithm
- Assumes perfect tracking
- Perfect model.
- Imperfect model: tracking can be impossible
- For example: no observables
- New results:
- “Informative observables” implies efficient tracking.
- Towards a spectrum of “partially” …
