
### Learning and Planning for POMDPs

Eyal Even-Dar, Tel-Aviv University

Sham Kakade, University of Pennsylvania

Yishay Mansour, Tel-Aviv University

Talk Outline

- Bounded Rationality and Partially Observable MDPs

- Mathematical Model of POMDPs
- Learning in POMDPs
- Planning in POMDPs
- Tracking in POMDPs

Bounded Rationality

- Rationality:
- Players with unlimited computational power

- Bounded rationality:
- Computational limitations
- Finite automata

- Challenge: play optimally against a finite automaton
- Size of the automaton is unknown

Bounded Rationality and RL

- Model:
- Perform an action
- See an observation
- Either immediate reward or delayed reward

- This is a POMDP
- Unknown size is a serious challenge

Classical Reinforcement Learning: Agent – Environment Interaction

[Slide diagram: the agent sends an action to the environment; the environment returns a reward and the next state.]

Reinforcement Learning - Goal

- Maximize the return.
- Discounted return: ∑_{t=1}^∞ γ^t r_t, with 0 < γ < 1
- Undiscounted (average) return: (1/T) ∑_{t=1}^T r_t
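The two criteria above can be sketched directly; a minimal illustration, where the reward sequence and discount factor are made up for the example:

```python
# A minimal sketch of the two return criteria for a finite reward
# sequence r_1, ..., r_T (the numbers below are illustrative only).

def discounted_return(rewards, gamma):
    """Sum_{t} gamma^t * r_t with 0 < gamma < 1."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

def average_return(rewards):
    """Undiscounted criterion: (1/T) * Sum_{t} r_t."""
    return sum(rewards) / len(rewards)

rewards = [1.0, 0.0, 2.0, 1.0]
print(round(discounted_return(rewards, gamma=0.9), 4))  # 3.0141
print(average_return(rewards))                          # 1.0
```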

Reinforcement Learning Model: Policy

- Policy Π:
- A mapping from states to distributions over actions

- Optimal policy Π*:
- Attains optimal return from any start state.

- Theorem:
There exists a stationary deterministic optimal policy

Planning and Learning in MDPs

- Planning:
- Input: a complete model
- Output: an optimal policy Π*

- Learning:
- Interaction with the environment
- Achieve near optimal return.

- For MDPs both planning and learning can be done efficiently
- Polynomial in the number of states
- representation in tabular form
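Efficient MDP planning in tabular form can be sketched with value iteration; the small two-state example below is mine, not from the talk (P[a, s, s'] is the next-state distribution, R[s, a] the expected reward):

```python
import numpy as np

# Value-iteration sketch for planning in a tabular MDP.
def value_iteration(P, R, gamma=0.9, iters=1000):
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        # Bellman backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V = Q.max(axis=1)
    # The argmax gives a stationary deterministic policy, as the
    # theorem above guarantees one exists.
    return V, Q.argmax(axis=1)

# Two states, two actions: action 0 stays put (pays 1 in state 0),
# action 1 jumps to state 0 (pays nothing).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay
              [[1.0, 0.0], [1.0, 0.0]]])   # action 1: go to state 0
R = np.array([[1.0, 0.0],
              [0.0, 0.0]])
V, policy = value_iteration(P, R)
print(V, policy)
```

With γ = 0.9 the fixed point is V = (10, 9), with the policy staying in state 0 and jumping from state 1.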

Partially Observable Agent – Environment Interaction

[Slide diagram: the agent sends an action to the environment; the environment returns a reward and a signal correlated with the state, rather than the state itself.]

Partially Observable Markov Decision Process

- S – the states
- A – actions
- P_sa(·) – next-state distribution
- R(s,a) – reward distribution
- O – observations
- O(s,a) – observation distribution

[Slide diagram: a three-state POMDP example. Each state s_i emits observation O_i with probability .8 and each of the other two observations with probability .1; transitions are shown with probabilities 0.3 and 0.7; the reward satisfies E[R(s3,a)] = 10.]
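The observation and reward parts of the three-state example can be written down directly; the array layout (O[s, o]) is my own convention, and the transition picture is not fully recoverable from the slide, so it is omitted:

```python
import numpy as np

n_states, n_obs = 3, 3

# Each state s_i emits observation o_i w.p. 0.8 and each other w.p. 0.1.
O = np.full((n_states, n_obs), 0.1) + 0.7 * np.eye(n_states)

# Expected reward: only state s3 pays, E[R(s3, a)] = 10.
R = np.array([0.0, 0.0, 10.0])

print(O.sum(axis=1))  # each row is a probability distribution
```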

Partial Observables – Problems in Planning

- The optimal policy is not stationary; furthermore, it is history dependent
- Example:

Partial Observables – Complexity

Hardness results [LGM01, L95]

Learning in POMDPs – Difficulties

- Suppose an agent knows its state initially; can it keep track of its state?
- Easy given a completely accurate model.
- Inaccurate model: Our new tracking result.

- How can the agent return to the same state?
- What is the meaning of very long histories?
- Do we really need to keep all the history?!

Planning in POMDPs – Belief State Algorithm

- A Bayesian setting
- Prior over initial state
- Given an action and observation defines a posterior
- belief state: distribution over states

- View the possible belief states as “states”
- Infinite number of states

- Also assumes a “perfect model”
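One step of the belief-state update, assuming a known model, is a predict-then-correct Bayes rule: b'(s') ∝ O[s', o] · ∑_s P(s'|s,a) b(s). A minimal sketch with my own array conventions (P[a, s, s'], O[s, o]):

```python
import numpy as np

def belief_update(b, a, o, P, O):
    predicted = P[a].T @ b            # predict: push belief through P(.|s,a)
    posterior = O[:, o] * predicted   # correct: weight by observation likelihood
    return posterior / posterior.sum()

# Tiny usage: two states, one "swap" action, informative observations.
P = np.array([[[0.0, 1.0], [1.0, 0.0]]])
O = np.array([[0.9, 0.1], [0.2, 0.8]])
b = belief_update(np.array([1.0, 0.0]), a=0, o=1, P=P, O=O)
print(b)  # all mass moves to the second state
```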

Learning in POMDPs – Popular methods

- Policy gradient methods :
- Find a locally optimal policy in a restricted class of policies (parameterized policies)
- Need to assume a reset to the start state!
- Cannot guarantee asymptotic results
- [Peshkin et al, Baxter & Bartlett,…]

Learning in POMDPs

- Trajectory trees [KMN]:
- Assume a generative model
- A strong RESET procedure

- Find a “near best” policy in a restricted class of policies
- finite horizon policies
- parameterized policies

- Assume a generative model

Our setting

- Return: Average reward criteria
- One long trajectory
- No RESET
- Connected environment (unichain POMDP)

- Goal: Achieve the optimal return (average reward) with probability 1

Homing strategies - POMDPs

- Homing strategy is a strategy that identifies the state.
- Knows how to return “home”

- Enables an “approximate reset” during a long trajectory.

Homing strategies

- Learning finite automata [Rivest & Schapire]
- Use a homing sequence to identify the state
- The homing sequence is exact
- It can lead to many states
- Uses the finite-automata learning of [Angluin 87]

- Diversity-based learning [Rivest & Schapire]
- Similar to our setting
- Major difference: deterministic transitions

Homing strategies - POMDPs

Definition:

H is an (ε,K)-homing strategy if, for every two belief states x1 and x2, after K steps of following H the expected belief states b1 and b2 are within distance ε.
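In expectation over observations, a belief state evolves by the transition matrix alone, so the (ε,K) property can be checked by comparing two expected beliefs in L1 distance. A sketch in which the strategy H is modeled as one repeated action (an illustrative stand-in, not the paper's construction):

```python
import numpy as np

def expected_belief_after(b, P_a, K):
    # Expected belief after K steps of the fixed action with matrix P_a.
    for _ in range(K):
        b = P_a.T @ b
    return b

def is_homing(P_a, K, eps, b1, b2):
    d1 = expected_belief_after(np.array(b1, float), P_a, K)
    d2 = expected_belief_after(np.array(b2, float), P_a, K)
    return np.abs(d1 - d2).sum() <= eps  # L1 distance between expected beliefs

# A uniformly mixing action homes in one step: any two beliefs coincide.
P_mix = np.full((2, 2), 0.5)
print(is_homing(P_mix, K=1, eps=1e-9, b1=[1, 0], b2=[0, 1]))  # True
```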

Homing strategies – Random Walk

- If the POMDP is strongly connected, then the random-walk Markov chain is irreducible
- Following the random walk assures that we converge to the steady state

Homing strategies – Random Walk

- What if the Markov chain is periodic?
- a cycle

- Use “stay action” to overcome periodicity problems

Homing strategies – Amplifying

Claim:

If H is an (ε,K)-homing strategy, then repeating H T times yields an (ε^T, KT)-homing strategy

Reinforcement learning with homing

- Usually algorithms should balance between exploration and exploitation
- Now they should balance between exploration, exploitation and homing
- Homing is performed in both exploration and exploitation

Policy testing algorithm

Theorem:

For any connected POMDP the policy testing algorithm obtains the optimal average reward with probability 1

After T time steps it competes with policies of horizon log log T.

Policy testing

- Enumerate the policies
- Gradually increase the horizon

- Run in phases:
- Test policy πk
- Average several runs, resetting between runs

- Run the best policy so far
- Ensures a good average return
- Again, reset between runs.
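The phases above can be sketched as a loop; this is a high-level illustration, not the authors' code, and `run_policy` / `approximate_reset` are hypothetical hooks into the environment (the latter standing in for the homing-based reset):

```python
def policy_testing(policies, run_policy, approximate_reset, n_test_runs=10):
    best, best_return = None, float("-inf")
    for pi in policies:
        # Test phase: average several runs, resetting between them.
        total = 0.0
        for _ in range(n_test_runs):
            approximate_reset()
            total += run_policy(pi)
        if total / n_test_runs > best_return:
            best, best_return = pi, total / n_test_runs
        # Exploitation phase: run the best policy found so far.
        approximate_reset()
        run_policy(best)
    return best

# Toy check with fixed per-policy returns.
best = policy_testing([0, 1, 2], lambda pi: [1.0, 3.0, 2.0][pi], lambda: None)
print(best)  # 1 — the policy with the highest tested return
```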

Model based algorithm

Theorem:

For any connected POMDP the model based algorithm obtains the optimal average reward with probability 1

After T time steps it competes with policies of horizon log T.

Model based algorithm

Exploration:

- For t = 1 to ∞
- For K1(t) times do
- Run a random policy for t steps and build an empirical model
- Use the homing sequence to approximate a reset

- Compute the optimal policy on the empirical model

Exploitation:

- For K2(t) times do
- Run the empirical optimal policy for t steps
- Use the homing sequence to approximate a reset
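The phased loop above can be sketched as follows; the hooks `random_walk`, `home`, `plan`, and `run`, and the K1/K2 schedules, are hypothetical placeholders for environment access and planning on the empirical model (the true schedules come from the analysis):

```python
def model_based(max_t, K1, K2, random_walk, home, plan, run):
    model = {}                       # empirical model, filled in by random_walk
    for t in range(1, max_t + 1):
        for _ in range(K1(t)):       # exploration
            random_walk(t, model)    # t random steps, update empirical counts
            home()                   # homing sequence ~ approximate reset
        policy = plan(model, t)      # optimal t-horizon policy on the model
        for _ in range(K2(t)):       # exploitation
            run(policy, t)
            home()
    return model

# Toy check that the phase schedule is respected.
calls = {"walk": 0, "home": 0, "run": 0}
model_based(
    max_t=2, K1=lambda t: t, K2=lambda t: 1,
    random_walk=lambda t, m: calls.__setitem__("walk", calls["walk"] + 1),
    home=lambda: calls.__setitem__("home", calls["home"] + 1),
    plan=lambda m, t: None,
    run=lambda p, t: calls.__setitem__("run", calls["run"] + 1),
)
print(calls)  # {'walk': 3, 'home': 5, 'run': 2}
```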

Model Based Algorithm – Computing the Optimal Policy

- Bounding the error in the model
- Significant nodes: sampling, approximate reset
- Insignificant nodes

- Compute an ε-optimal t-horizon policy at each step

Model Based Algorithm – Convergence w.p. 1 Proof

- Proof idea:
- At any stage, K1(t) is large enough that we compute an ε_t-optimal t-horizon policy
- K2(t) is large enough that the influence of all earlier phases is bounded by ε_t
- For a large enough horizon, the influence of the homing sequence is also bounded

Model Based Algorithm – Convergence Rate

- The model-based algorithm produces an ε-optimal policy with probability 1 − δ in time polynomial in 1/ε, |A|, |O|, log(1/δ), and the homing sequence length, and exponential in the horizon time of the optimal policy
- Note: the algorithm does not depend on |S|

Planning in POMDP

- Unfortunately, not today …
- Basic results:
- Tight connections with multiplicity automata
- A well-established theory dating back to the 60’s

- Rank of the Hankel matrix
- Similar to PSRs
- Always less than the number of states

- Planning algorithm:
- Exponential in the rank of the Hankel matrix

Tracking in POMDPs

- Belief-state algorithm
- Assumes perfect tracking
- Requires a perfect model

- With an imperfect model, tracking may be impossible
- For example: no observations

- New results:
- “Informative observables” imply efficient tracking.

- Towards a spectrum of “partially” …
