Policy Evaluation & Policy Iteration

S&B: Sec 4.1, 4.3; 6.5

The Bellman equation
  • The final recursive equation is known as the Bellman equation:
    • Vπ(s) = R(s) + γ · Σs′ T(s, π(s), s′) · Vπ(s′)
  • Its unique solution gives the value of a fixed policy π operating in a known MDP M=〈S,A,T,R〉
  • When the state and action spaces are discrete, V and R can be viewed as vectors and Tπ as a matrix, giving the matrix equation (solved numerically in the sketch below):
    • Vπ = R + γ Tπ Vπ,  i.e.  Vπ = (I − γ Tπ)⁻¹ R
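A minimal NumPy sketch of solving this matrix equation for Vπ, assuming state-based rewards R(s) and a transition array T[s, a, s′]; the helper name evaluate_policy is just for illustration:

    import numpy as np

    def evaluate_policy(T, R, policy, gamma):
        """Solve the matrix Bellman equation V = R + gamma * T_pi V for a fixed policy.

        T      : (|S|, |A|, |S|) array, T[s, a, s2] = P(s2 | s, a)
        R      : (|S|,) array of state-based rewards R(s)
        policy : (|S|,) integer array, policy[s] = action chosen in state s
        gamma  : discount factor, 0 <= gamma < 1
        """
        n = len(R)
        T_pi = T[np.arange(n), policy, :]                    # |S| x |S| matrix induced by pi
        return np.linalg.solve(np.eye(n) - gamma * T_pi, R)  # V = (I - gamma*T_pi)^(-1) R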
Exercise
  • Solve the matrix Bellman equation (i.e., find V):
  • I formulated the Bellman equations for “state-based” rewards: R(s)
    • Formulate & solve the B.E. for “state-action” rewards (R(s,a)) and “state-action-state” rewards (R(s,a,s’))
Policy values in practice

[Figure: “Robot” navigation in a grid maze]
[Figure: Optimal policy, π*]
[Figure: Value function for optimal policy, V*]

A harder “maze”...

[Figure: Optimal policy, π*]
[Figure: Value function for optimal policy, V*]

Still more complex...

[Figure: Optimal policy, π*]
[Figure: Value function for optimal policy, V*]
Planning: finding π*
  • So we know how to evaluate a single policy, π
  • How do you find the best policy?
    • Remember: still assuming that we know M=〈S,A,T,R〉
  • Non-solution: enumerate all |A|^|S| possible deterministic policies, evaluate each one, and keep the best (a sketch follows below)
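To see why this is a non-solution, here is a toy sketch of the enumeration, assuming the same array layout as in the earlier evaluation sketch (brute_force_planning is an illustrative name); the loop runs |A|^|S| times, which is hopeless beyond tiny MDPs:

    import itertools
    import numpy as np

    def brute_force_planning(T, R, gamma):
        """Evaluate every one of the |A|**|S| deterministic policies (toy MDPs only)."""
        n_states, n_actions, _ = T.shape
        best_policy, best_V = None, None
        for choice in itertools.product(range(n_actions), repeat=n_states):
            policy = np.array(choice)
            T_pi = T[np.arange(n_states), policy, :]                  # transitions under this policy
            V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)   # exact policy evaluation
            # An optimal policy's value dominates every other policy's value pointwise,
            # so comparing total value is enough to recover an optimal policy.
            if best_V is None or V.sum() > best_V.sum():
                best_policy, best_V = policy, V
        return best_policy, best_V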
Policy iteration & friends
  • Many different solution methods are available.
  • All exploit key characteristics of MDPs:
    • For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (there may be more than one equally good policy)
    • The Bellman equation expresses the recursive structure of an optimal policy
  • This leads to a series of closely related planning methods: policy iteration, value iteration, generalized policy iteration, etc.
The policy iteration alg.
  • Function: policy_iteration
  • Input: MDP M=〈S,A,T,R〉, discount γ
  • Output: optimal policy π*; optimal value function V*
  • Initialization: choose π0 arbitrarily
  • Repeat {
    • Vi = eval_policy(M, πi, γ) // from Bellman eqn
    • πi+1 = local_update_policy(πi, Vi)
  • } Until (πi+1 == πi)
  • Function: π’ = local_update_policy(π, V)
  • for i = 1..|S| {
    • π’(si) = argmaxa∈A{ sumj( T(si,a,sj) * V(sj) ) }
  • }
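A compact NumPy rendering of the pseudocode above (a sketch under the same assumptions as the earlier evaluation snippet: state-based rewards R(s) and a transition array T[s, a, s′]):

    import numpy as np

    def policy_iteration(T, R, gamma):
        """Policy iteration for a known MDP with state-based rewards R(s)."""
        n_states, n_actions, _ = T.shape
        policy = np.zeros(n_states, dtype=int)                       # pi_0: arbitrary initial policy
        while True:
            # Policy evaluation: solve the matrix Bellman equation V = R + gamma * T_pi V.
            T_pi = T[np.arange(n_states), policy, :]
            V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
            # Local (greedy) policy update: pi'(s) = argmax_a sum_s2 T(s, a, s2) * V(s2).
            new_policy = np.argmax(T @ V, axis=1)
            if np.array_equal(new_policy, policy):                   # pi_{i+1} == pi_i  ->  done
                return new_policy, V
            policy = new_policy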
Why does this work?
  • Two explanations:
  • Theoretical:
    • The local update w.r.t. the policy value is a contraction mapping, so a fixed point exists and will be reached
    • See “contraction mapping”, “Banach fixed-point theorem”, etc.:
      • http://math.arizona.edu/~restrepo/475A/Notes/sourcea/node22.html
      • http://planetmath.org/encyclopedia/BanachFixedPointTheorem.html
    • Contracts w.r.t. the Bellman error: for the fixed-policy backup operator B, maxs |BV1(s) − BV2(s)| ≤ γ · maxs |V1(s) − V2(s)| (checked numerically below)
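A quick numerical illustration of the contraction property, on a random row-stochastic Tπ (an illustrative sketch, not from the slides):

    import numpy as np

    def backup(T_pi, R, V, gamma):
        """Fixed-policy Bellman backup: (B V)(s) = R(s) + gamma * sum_s2 T_pi(s, s2) * V(s2)."""
        return R + gamma * T_pi @ V

    rng = np.random.default_rng(0)
    n, gamma = 5, 0.9
    T_pi = rng.random((n, n))
    T_pi /= T_pi.sum(axis=1, keepdims=True)          # make each row a probability distribution
    R, V1, V2 = rng.random(n), rng.random(n), rng.random(n)

    lhs = np.max(np.abs(backup(T_pi, R, V1, gamma) - backup(T_pi, R, V2, gamma)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    print(lhs <= rhs)    # True: ||B V1 - B V2||_inf <= gamma * ||V1 - V2||_inf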
Why does this work?
  • The intuitive explanation
    • It’s doing a dynamic-programming “backup” of reward from reward “sources”
    • At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step
    • Value “propagates away” from sources and the policy is able to say “hey! there’s reward over there! I can get some of that action if I change a bit!”
P.I. in action

[Figures: the policy and value function at iterations 0 through 6 of policy iteration; the policy has converged by iteration 6]
Properties & Variants
  • Policy iteration
    • Known to converge (provable)
    • Observed to converge exponentially quickly
      • # iterations is O(ln(|S|))
      • Empirical observation; strongly believed but no proof (yet)
    • O(|S|^3) time per iteration (policy evaluation)
  • Other methods possible
    • Linear programming (a polynomial-time solution exists)
    • Value iteration (sketched below)
    • Generalized policy iteration (often best in practice)
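For comparison, a minimal sketch of value iteration, which swaps the exact O(|S|^3) evaluation for repeated one-step backups (same assumed array layout as in the earlier sketches):

    import numpy as np

    def value_iteration(T, R, gamma, tol=1e-8):
        """Iterate V(s) <- R(s) + gamma * max_a sum_s2 T(s, a, s2) * V(s2) until the change is tiny."""
        V = np.zeros(len(R))
        while True:
            V_new = R + gamma * np.max(T @ V, axis=1)       # one-step backup with a max over actions
            if np.max(np.abs(V_new - V)) < tol:             # small Bellman residual -> (near-)converged
                return np.argmax(T @ V_new, axis=1), V_new  # greedy policy w.r.t. V, plus V itself
            V = V_new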
Q: A key operative
  • The critical step in policy iteration:
    • π’(si) = argmaxa∈A{ sumj( T(si,a,sj) * V(sj) ) }
  • It asks: “What happens if I ignore π for just one step, do a instead, and then resume doing π thereafter?”
  • This operation is used so often that it gets a special name:
    • Definition: the Q function for policy π is Qπ(s,a) = R(s) + γ · Σs′ T(s,a,s′) · Vπ(s′)  (state-based rewards; for R(s,a) rewards, replace R(s) with R(s,a))
  • Policy iteration says: “Figure out Q, act greedily according to Q, then update Q and repeat, until you can’t do any better...” (see the sketch below)
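A sketch of computing Q from a value function under the state-based-reward convention used here (the name q_from_v is illustrative):

    import numpy as np

    def q_from_v(T, R, V, gamma):
        """Q(s, a) = R(s) + gamma * sum_s2 T(s, a, s2) * V(s2); returns an (|S|, |A|) table."""
        return R[:, None] + gamma * (T @ V)

    # Acting greedily w.r.t. Q reproduces the local policy update above:
    #   policy = np.argmax(q_from_v(T, R, V, gamma), axis=1)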
What to do with Q
  • Can think of Q as a big table: one entry for each state/action pair
    • “If I’m in state s and take action a, this is my expected discounted reward...”
    • A “one-step” exploration: “In state s, if I deviate from my policy π for one timestep, then keep doing π, is my life better or worse?”
  • Can get V and π from Q (see the toy example below):
    • V(s) = maxa Q(s,a)
    • π(s) = argmaxa Q(s,a)
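Reading V and a greedy policy back out of a tabular Q, on a made-up toy table:

    import numpy as np

    Q = np.array([[1.0, 2.0],       # toy Q-table: rows are states, columns are actions
                  [0.5, 0.1]])
    V = Q.max(axis=1)               # V(s)  = max_a    Q(s, a)   -> [2.0, 0.5]
    policy = Q.argmax(axis=1)       # pi(s) = argmax_a Q(s, a)   -> [1, 0]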