
Policy Evaluation & Policy Iteration

- Reading: Sutton & Barto, Sec. 4.1, 4.3; 6.5

The Bellman equation

- The final recursive equation is known as the Bellman equation:
  V^π(s) = R(s) + γ·Σ_{s’} T(s, π(s), s’)·V^π(s’)
- The unique solution of this equation gives the value of a fixed policy π operating in a known MDP M=〈S,A,T,R〉
- When the state/action spaces are discrete, we can think of V and R as vectors and T_π as a matrix, and get the matrix equation V^π = R + γ·T_π·V^π, with solution V^π = (I − γ·T_π)⁻¹·R (a worked solve follows below)
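Since γ < 1 makes (I − γ·T_π) invertible, evaluation is a single linear solve. A minimal NumPy sketch (the two-state MDP below is a made-up example, not one from the slides):

```python
import numpy as np

def evaluate_policy(T_pi, R, gamma):
    """Exact policy evaluation: solve (I - gamma * T_pi) V = R.

    T_pi[i, j] = T(s_i, pi(s_i), s_j); R[i] = R(s_i) (state-based rewards).
    """
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

# Hypothetical 2-state chain where only state 1 pays reward
T_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
R = np.array([0.0, 1.0])
print(evaluate_policy(T_pi, R, gamma=0.9))
```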

Exercise

- Solve the matrix Bellman equation (i.e., find V):
- I formulated the Bellman equations above for “state-based” rewards, R(s)
- Formulate & solve the Bellman equation for “state-action” rewards R(s,a) and “state-action-state” rewards R(s,a,s’) (one standard formulation is sketched below as a starting point)
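As a starting point, one standard convention (the slides may intend a slightly different one) gives:

  V^π(s) = R(s, π(s)) + γ·Σ_{s’} T(s, π(s), s’)·V^π(s’)            (state-action rewards)
  V^π(s) = Σ_{s’} T(s, π(s), s’)·[ R(s, π(s), s’) + γ·V^π(s’) ]    (state-action-state rewards)

Both collapse to the same matrix form V^π = R_π + γ·T_π·V^π once R_π(s) is defined as the expected one-step reward under π, so the same linear solve applies.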

Policy values in practice

“Robot” navigation in a grid maze

[figure: the optimal policy, π*]
[figure: the value function for the optimal policy, V*]

A harder “maze”...

[figure: the optimal policy, π*]
[figure: the value function for the optimal policy, V*]

Still more complex...

[figure: the optimal policy, π*]
[figure: the value function for the optimal policy, V*]

Planning: finding π*

- So we know how to evaluate a single policy, π
- How do you find the best policy?
- Remember: still assuming that we know M=〈S,A,T,R〉

- Non-solution: iterate through all possible π, evaluating each one; keep best

Policy iteration & friends

- Many different solution methods are available.
- All exploit some characteristics of MDPs:
- For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (there may be more than one policy with the same optimal value)
- The Bellman equation expresses the recursive structure of an optimal policy
- This leads to a series of closely related solution methods: policy iteration, value iteration, generalized policy iteration, etc.

The policy iteration alg.

- Function: policy_iteration
- Input: MDP M=〈S,A,T,R〉, discount γ
- Output: optimal policy π*; optimal value function V*
- Initialization: choose π_0 arbitrarily
- Repeat {
-   V_i = eval_policy(M, π_i, γ)  // from the Bellman eqn
-   π_{i+1} = local_update_policy(π_i, V_i)
- } Until (π_{i+1} == π_i)
- Function: π’ = local_update_policy(π, V)
- for i = 1..|S| {
-   π’(s_i) = argmax_{a∈A} { Σ_j T(s_i, a, s_j)·V(s_j) }
- }
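A runnable Python version of the pseudocode above, under assumed tabular conventions (a `T[s, a, s’]` transition array and state-based rewards `R[s]`; these names are illustrative, not from the slides):

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Tabular policy iteration.

    T[s, a, s2] = T(s, a, s2); R[s] = R(s). Returns (pi*, V*).
    """
    n_states = T.shape[0]
    pi = np.zeros(n_states, dtype=int)          # arbitrary pi_0
    while True:
        # eval_policy: solve V = R + gamma * T_pi V exactly (Bellman eqn)
        T_pi = T[np.arange(n_states), pi]       # row s holds T(s, pi(s), .)
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # local_update_policy: greedy w.r.t. V
        # (with state-based rewards, R(s) and gamma don't affect the argmax)
        pi_new = (T @ V).argmax(axis=1)         # argmax_a sum_j T(s,a,j) V(j)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```

The termination test mirrors the pseudocode: stop as soon as the greedy update leaves the policy unchanged.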

Why does this work?

- Two explanations:
- Theoretical:
- The local update w.r.t. the policy value is a contraction mapping, ergo a fixed point exists and will be reached
- See “contraction mapping”, “Banach fixed-point theorem”, etc.:
- http://math.arizona.edu/~restrepo/475A/Notes/sourcea/node22.html
- http://planetmath.org/encyclopedia/BanachFixedPointTheorem.html
- The backup contracts w.r.t. the Bellman error: ‖BV₁ − BV₂‖∞ ≤ γ·‖V₁ − V₂‖∞ for the fixed-policy backup operator (BV)(s) = R(s) + γ·Σ_{s’} T(s, π(s), s’)·V(s’) (a tiny numerical check follows)
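A quick numerical check of the contraction property, on a random made-up MDP (not one from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9
T_pi = rng.random((n, n))
T_pi /= T_pi.sum(axis=1, keepdims=True)   # make the rows stochastic

R = rng.random(n)

def backup(V):                            # fixed-policy Bellman backup B
    return R + gamma * T_pi @ V

V1, V2 = rng.random(n), rng.random(n)
gap_after = np.max(np.abs(backup(V1) - backup(V2)))
bound = gamma * np.max(np.abs(V1 - V2))
print(gap_after, "<=", bound)             # the contraction bound always holds
```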

Why does this work?

- The intuitive explanation
- It’s doing a dynamic-programming “backup” of reward from reward “sources”
- At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step
- Value “propagates away” from sources and the policy is able to say “hey! there’s reward over there! I can get some of that action if I change a bit!”

Properties & Variants

- Policy iteration:
  - Known to converge (provable)
  - Observed to converge exponentially quickly: the # of iterations is O(ln(|S|)) (an empirical observation; strongly believed, but no proof yet)
  - O(|S|³) time per iteration (the policy-evaluation step)
- Other methods possible:
  - Linear programming (a polynomial-time solution exists)
  - Value iteration (a sketch follows below)
  - Generalized policy iteration (often best in practice)
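For comparison, a minimal value-iteration sketch under the same assumed `T[s, a, s’]`/`R[s]` layout as the earlier sketch:

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Iterate the optimality backup V <- R + gamma * max_a T V to a fixed point."""
    V = np.zeros(T.shape[0])
    while True:
        V_new = R + gamma * (T @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```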

Q: A key operative

- The critical step in policy iteration:
- π’(s_i) = argmax_{a∈A} { Σ_j T(s_i, a, s_j)·V(s_j) }
- It asks: “What happens if I ignore π for just one step, do a instead, and then resume following π thereafter?”
- This operation is used so often that it gets a special name:
- Definition: the Q function is Q^π(s, a) = R(s) + γ·Σ_{s’} T(s, a, s’)·V^π(s’)
- Policy iteration says: “Figure out Q, act greedily according to Q, then update Q and repeat, until you can’t do any better...”

What to do with Q

- Can think of Q as a big table: one entry for each state/action pair
- “If I’m in state s and take action a, this is my expected discounted reward...”
- A “one-step” exploration: “In state s, if I deviate from my policy π for one timestep, then keep following π, is my life better or worse?”
- Can get V and π from Q: V(s) = max_{a∈A} Q(s, a), π(s) = argmax_{a∈A} Q(s, a) (a sketch follows below)
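A small NumPy sketch of the Q table and these extractions, reusing the hypothetical `T[s, a, s’]`/`R[s]` layout from the earlier sketches:

```python
import numpy as np

def q_from_v(T, R, V, gamma):
    """Q[s, a] = R(s) + gamma * sum_j T(s, a, j) * V(j)."""
    return R[:, None] + gamma * (T @ V)

def greedy_from_q(Q):
    """V(s) = max_a Q(s, a); pi(s) = argmax_a Q(s, a)."""
    return Q.max(axis=1), Q.argmax(axis=1)
```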
