Policy Evaluation &amp; Policy Iteration

Policy Evaluation &amp; Policy Iteration. S&amp;B: Sec 4.1, 4.3; 6.5. The Bellman equation. The final recursive equation is known as the Bellman equation : Unique soln to this eqn gives value of a fixed policy π when operating in a known MDP M = 〈 S , A ,T,R 〉

The Bellman equation
• The final recursive equation is known as the Bellman equation (reconstructed below):
• Its unique solution gives the value of a fixed policy π when operating in a known MDP M=〈S,A,T,R〉
• When the state/action spaces are discrete and finite, V and R can be treated as vectors and Tπ as a matrix, giving a matrix equation:
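The equation images from the original slides are not preserved in this transcript. With the state-based rewards R(s) and transition model T(s, a, s′) defined here, the standard forms would be:

    V^π(s) = R(s) + γ · Σ_{s′∈S} T(s, π(s), s′) · V^π(s′)

and, in matrix form, writing T^π[s, s′] = T(s, π(s), s′):

    V^π = R + γ T^π V^π   ⇒   V^π = (I − γ T^π)⁻¹ R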
Exercise
• Solve the matrix Bellman equation (i.e., find V):
• I formulated the Bellman equations for “state-based” rewards: R(s)
• Formulate & solve the B.E. for “state-action” rewards (R(s,a)) and “state-action-state” rewards (R(s,a,s’))
Policy values in practice

“Robot” navigation in a grid maze

[Figures not preserved in the transcript: the optimal policy π* and its value function V* for the simple grid maze, for a harder “maze”, and for a still more complex maze]

Planning: finding π*
• So we know how to evaluate a single policy, π
• How do you find the best policy?
• Remember: still assuming that we know M=〈S,A,T,R〉
• Non-solution: iterate through all possible π, evaluating each one, and keep the best; there are |A|^|S| deterministic policies, so this is intractable
Policy iteration & friends
• Many different solutions available.
• All exploit some characteristics of MDPs:
• For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (there may be more than one, all achieving the same optimal value)
• The Bellman equation expresses recursive structure of an optimal policy
• Leads to a series of closely related policy solutions: policy iteration, value iteration, generalized policy iteration, etc.
The policy iteration alg.
• Function: policy_iteration
• Input: MDP M=〈S,A,T,R〉, discount γ
• Output: optimal policy π*; optimal value function V*
• Initialization: choose π0 arbitrarily
• Repeat {
•   Vi = eval_policy(M, πi, γ) // from the Bellman eqn
•   πi+1 = local_update_policy(πi, Vi)
• } Until (πi+1 == πi)
• Function: π’ = local_update_policy(π, V)
• for i = 1..|S| {
•   π’(si) = argmax_{a∈A} { Σ_j T(si, a, sj) · V(sj) }
• }
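A minimal numpy sketch of the algorithm above, assuming state-based rewards R(s), a transition tensor T indexed as T[s, a, s′], and a deterministic policy stored as an integer array; the function names follow the slide’s pseudocode:

    import numpy as np

    def eval_policy(T, R, pi, gamma):
        # Exact policy evaluation: solve (I - gamma * T_pi) V = R (the Bellman eqn).
        S = R.shape[0]
        T_pi = T[np.arange(S), pi]                  # T_pi[s, s'] = T(s, pi(s), s')
        return np.linalg.solve(np.eye(S) - gamma * T_pi, R)

    def local_update_policy(T, V):
        # Greedy one-step improvement: argmax_a sum_s' T(s, a, s') * V(s').
        return np.argmax(T @ V, axis=1)

    def policy_iteration(T, R, gamma):
        S, A, _ = T.shape
        pi = np.zeros(S, dtype=int)                 # arbitrary initial policy pi_0
        while True:
            V = eval_policy(T, R, pi, gamma)        # evaluation step
            pi_new = local_update_policy(T, V)      # improvement step
            if np.array_equal(pi_new, pi):          # stop when the policy is unchanged
                return pi, V
            pi = pi_new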
Why does this work?
• 2 explanations:
• Theoretical:
• The local update w.r.t. the policy value is a contractive mapping, ergo a fixed point exists and will be reached
• See, “contraction mapping”, “Banach fixed-point theorem”, etc.
• http://math.arizona.edu/~restrepo/475A/Notes/sourcea/node22.html
• http://planetmath.org/encyclopedia/BanachFixedPointTheorem.html
• Contracts w.r.t. the Bellman Error:
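The contraction statement itself appeared as an image on the slide. A standard form of the property it refers to: writing B^π for the policy-evaluation backup operator, (B^π V)(s) = R(s) + γ Σ_{s′} T(s, π(s), s′) V(s′), the backup is a γ-contraction in the max norm,

    ‖B^π V − B^π V′‖_∞ ≤ γ ‖V − V′‖_∞

so by the Banach fixed-point theorem repeated backups converge to the unique fixed point V^π, and the Bellman error ‖V − B^π V‖_∞ shrinks by at least a factor of γ per backup.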
Why does this work?
• The intuitive explanation
• It’s doing a dynamic-programming “backup” of reward from reward “sources”
• At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step
• Value “propagates away” from sources and the policy is able to say “hey! there’s reward over there! I can get some of that action if I change a bit!”
P.I. in action

[Figures not preserved in the transcript: the maze policy and value function after each iteration of policy iteration, from iteration 0 through iteration 6, at which point the algorithm has converged]
Properties & Variants
• Policy iteration
• Known to converge (provable)
• Observed to converge exponentially quickly
• # iterations is O(ln(|S|))
• Empirical observation; strongly believed but no proof (yet)
• O(|S|³) time per iteration (policy evaluation)
• Other methods possible
• Linear program (a polynomial-time solution exists)
• Value iteration (a sketch follows this list)
• Generalized policy iter. (often best in practice)
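For the value iteration variant listed above, a minimal sketch in the same numpy setup as the earlier policy-iteration code (the tolerance and array shapes are assumptions):

    import numpy as np

    def value_iteration(T, R, gamma, tol=1e-8):
        # T: (S, A, S') transition tensor; R: (S,) state-based rewards.
        S, A, _ = T.shape
        V = np.zeros(S)
        while True:
            V_new = R + gamma * (T @ V).max(axis=1)   # Bellman optimality backup
            if np.max(np.abs(V_new - V)) < tol:       # stop when the backup barely changes V
                pi = (T @ V_new).argmax(axis=1)       # greedy policy w.r.t. the converged V
                return pi, V_new
            V = V_new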
Q: A key operative
• Critical step in policy iteration
• π’(si) = argmax_{a∈A} { Σ_j T(si, a, sj) · V(sj) }
• Asks “What happens if I ignore π for just one step, and do a instead (and then resume doing π thereafter)?”
• This operation is used often enough that it gets a special name:
• Definition: the Q function is (reconstructed below):
• Policy iter says: “Figure out Q, act greedily according to Q, then update Q and repeat, until you can’t do any better...”
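The definition itself appeared as an equation image on the slide. With the state-based rewards used throughout, the standard form would be:

    Q^π(s, a) = R(s) + γ · Σ_{s′∈S} T(s, a, s′) · V^π(s′)

i.e., the expected discounted reward of taking action a in state s for one step and following π thereafter.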
What to do with Q
• Can think of Q as a big table: one entry for each state/action pair
• “If I’m in state s and take action a, this is my expected discounted reward...”
• A “one-step” exploration: “In state s, if I deviate from my policy π for one timestep, then keep doing π, is my life better or worse?”
• Can get V and π from Q:
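The extraction equations on the slide are not preserved; the standard greedy extraction is

    V(s) = max_{a∈A} Q(s, a),   π(s) = argmax_{a∈A} Q(s, a)

or, as a minimal sketch in the numpy setup used earlier (Q computed from the same T, R, γ, and a value vector V):

    Q = R[:, None] + gamma * (T @ V)    # Q[s, a], one entry per state/action pair
    V_greedy = Q.max(axis=1)            # V(s) = max_a Q(s, a)
    pi_greedy = Q.argmax(axis=1)        # pi(s) = argmax_a Q(s, a)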