**QUIZ!!** • T/F: Optimal policies can be defined from an optimal value function. TRUE • T/F: “Pick the MEU action first, then follow the optimal policy” is optimal. TRUE • T/F: $\pi^*(s) = \max_{s'} V^*(s')$. FALSE • T/F: The Bellman equation can be satisfied by sub-optimal value functions. FALSE • T/F: In Value Iteration, the policy cannot converge before the value function. FALSE • Explain the difference between Policy Iteration and Value Iteration. • Why can Policy Iteration be faster than Value Iteration?

**CS 511a: Artificial Intelligence, Spring 2013** Lecture 11: MDPs / Reinforcement Learning • Feb 25, 2013 • Robert Pless • Course adapted from Kilian Weinberger, with many slides from Dan Klein, Stuart Russell, or Andrew Moore

**Announcements** • Project 2 due Thursday night. • HW 1 due Friday 5pm* • *accepted with no penalty or late-day charge until Monday 10am.

**Policy Iteration** Why do we compute V* or Q*, if all we care about is the best policy π*?

**Utilities for Fixed Policies** • Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π • Define the utility of a state s under a fixed policy π: V^π(s) = expected total discounted reward (return) starting in s and following π • Recursive relation (one-step look-ahead / Bellman equation): $V^\pi(s) = Q^\pi(s,\pi(s)) = \sum_{s'} T(s,\pi(s),s')\,[\,R(s,\pi(s),s') + \gamma V^\pi(s')\,]$

**Policy Evaluation** • How do we calculate the V^π's for a fixed policy π? • Idea one: modify the Bellman updates: $V^\pi_{i+1}(s) \leftarrow \sum_{s'} T(s,\pi(s),s')\,[\,R(s,\pi(s),s') + \gamma V^\pi_i(s')\,]$ • Idea two: the solution is a stationary point (equality); without the max it is just a linear system, solve with Matlab (or whatever)
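For idea two, a minimal numpy sketch (assuming tabular arrays T[s][a][s'] and R[s][a][s'] and states indexed 0..n-1; the function and variable names are illustrative, not course code):

```python
import numpy as np

def evaluate_policy_exact(T, R, policy, gamma):
    """Solve the fixed-policy Bellman equations V = r_pi + gamma * T_pi V exactly.

    Assumed (illustrative) layout: T[s][a][s2] and R[s][a][s2] are nested
    arrays/dicts, policy[s] is the fixed action, and states are 0..n-1.
    """
    n = len(policy)
    T_pi = np.zeros((n, n))   # T_pi[s, s2] = T(s, pi(s), s2)
    r_pi = np.zeros(n)        # r_pi[s]     = expected one-step reward under pi
    for s in range(n):
        a = policy[s]
        for s2 in range(n):
            p = T[s][a][s2]
            T_pi[s, s2] = p
            r_pi[s] += p * R[s][a][s2]
    # (I - gamma * T_pi) V = r_pi  ->  one linear solve, no iteration
    return np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)
```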

**Policy Iteration** • Policy evaluation: with fixed current policy π, find values with simplified Bellman updates: $V^\pi_{i+1}(s) \leftarrow \sum_{s'} T(s,\pi(s),s')\,[\,R(s,\pi(s),s') + \gamma V^\pi_i(s')\,]$ • Iterate until values converge • Policy improvement: with fixed utilities, find the best action according to one-step look-ahead: $\pi_{new}(s) = \arg\max_a \sum_{s'} T(s,a,s')\,[\,R(s,a,s') + \gamma V^\pi(s')\,]$
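A sketch of the full alternation, assuming T(s, a) returns (s', prob) pairs and R(s, a, s') a scalar (these names are assumptions, not the course's project code):

```python
def policy_iteration(states, actions, T, R, gamma, eval_iters=50):
    """Alternate policy evaluation and greedy policy improvement."""
    policy = {s: actions[0] for s in states}          # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Evaluation: iterate simplified Bellman updates under the frozen policy
        for _ in range(eval_iters):
            V = {s: sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                        for s2, p in T(s, policy[s]))
                 for s in states}
        # Improvement: one-step look-ahead with the frozen values
        new_policy = {s: max(actions,
                             key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                               for s2, p in T(s, a)))
                      for s in states}
        if new_policy == policy:                      # policy stable -> done
            return policy, V
        policy = new_policy
```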

**Comparison** • In value iteration: • Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy) • Policy might not change between updates (wastes computation) • In policy iteration: • Several passes to update utilities with frozen policy • Occasional passes to update policies • Value update can be solved as linear system • Can be faster, if policy changes infrequently • Hybrid approaches (asynchronous policy iteration): • Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

**Asynchronous Value Iteration** • In value iteration, we update every state in each iteration • Actually, any sequence of Bellman updates will converge if every state is visited infinitely often • In fact, we can update the policy as seldom or often as we like, and we will still converge • Idea: update states whose value we expect to change: if $|V_{i+1}(s) - V_i(s)|$ is large, then update the predecessors of s
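A rough sketch of this prioritization idea (a heuristic sketch under the same assumed tabular T/R interface as the earlier snippets, not the exact scheme from lecture):

```python
import heapq

def prioritized_value_iteration(states, actions, T, R, gamma,
                                predecessors, sweeps=10000, theta=1e-3):
    """Asynchronous value iteration: repeatedly back up the state whose value
    seems most out of date, then revisit its predecessors.

    Assumptions (illustrative): T(s, a) yields (s', prob) pairs, R(s, a, s')
    is a scalar, predecessors[s] is the set of states that can reach s, and
    states are comparable (e.g. ints or tuples) so they can break heap ties.
    """
    V = {s: 0.0 for s in states}

    def backup(s):
        # One Bellman backup: max over actions of expected reward-to-go
        return max(sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a))
                   for a in actions)

    # Priority = current Bellman error (negated because heapq is a min-heap)
    pq = [(-abs(backup(s) - V[s]), s) for s in states]
    heapq.heapify(pq)
    for _ in range(sweeps):
        if not pq:
            break
        _, s = heapq.heappop(pq)
        V[s] = backup(s)
        for sp in predecessors[s]:            # their backups may have changed
            err = abs(backup(sp) - V[sp])
            if err > theta:
                heapq.heappush(pq, (-err, sp))
    return V
```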

**Reinforcement Learning**

**Reinforcement Learning** • Reinforcement learning: • Still have an MDP: • A set of states s ∈ S • A set of actions (per state) A • A model T(s,a,s’) • A reward function R(s,a,s’) • Still looking for a policy π(s) • New twist: don’t know T or R • I.e., don’t know which states are good or what the actions do • Must actually try actions and states out to learn • [Demo]

**Example: Animal Learning** • RL studied experimentally for more than 60 years in psychology • Rewards: food, pain, hunger, drugs, etc. • Mechanisms and sophistication debated • Example: foraging • Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies • Bees have a direct neural connection from nectar intake measurement to motor planning area

**Passive Learning** • Simplified task • You don’t know the transitions T(s,a,s’) • You don’t know the rewards R(s,a,s’) • You are given a policy π(s) • Goal: learn the state values V^π(s) • … what policy evaluation did • In this case: • Learner is “along for the ride” • No choice about what actions to take • Just execute the policy and learn from experience • We’ll get to the active case soon • This is NOT offline planning! You actually take actions in the world and see what happens…

**Passive Model-Based Learning** • Idea: • Learn the model empirically through experience • Solve for values as if the learned model were correct • Simple empirical model learning • Count outcomes s’ for each s, a • Normalize to give an estimate of T(s,a,s’) • Discover R(s,a,s’) when we experience (s,a,s’) • Solving the MDP with the learned model • Iterative policy evaluation, for example
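A minimal sketch of the empirical model estimate, assuming episodes are given as lists of (s, a, s', r) transitions (the encoding and names are illustrative):

```python
from collections import defaultdict

def learn_model(episodes):
    """Estimate T(s,a,s') by normalized counts and R(s,a,s') by observed rewards."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s2]
    rewards = {}                                     # observed R(s, a, s2)
    for episode in episodes:
        for (s, a, s2, r) in episode:
            counts[(s, a)][s2] += 1
            rewards[(s, a, s2)] = r
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s2, c in outcomes.items():
            T_hat[(s, a, s2)] = c / total            # normalize the counts
    return T_hat, rewards
```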

**Example: Model-Based Learning** • γ = 1 • Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done) • Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done) • Learned model: T(<3,3>, right, <4,3>) = 1 / 3, T(<2,3>, right, <3,3>) = 2 / 2

**Passive Model-Free Learning** • Big idea: why bother learning T? • 1. Direct Estimation: • Estimate V(s) directly: average the observed discounted returns following each visit to a state. • No need to compute T or R.

**Model-Free Learning** • Want to compute an expectation weighted by P(x): $E[f(x)] = \sum_x P(x)\,f(x)$ • Model-based: estimate P(x) from samples, then compute the expectation: $\hat P(x) = \mathrm{count}(x)/N$, $E[f(x)] \approx \sum_x \hat P(x)\,f(x)$ • Model-free: estimate the expectation directly from samples: $E[f(x)] \approx \frac{1}{N}\sum_i f(x_i)$ • Why does this work? Because samples appear with the right frequencies!
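A tiny illustration of the two estimates computed from the same samples (the numbers are synthetic, purely for illustration):

```python
from collections import Counter

samples = [0, 1, 1, 2, 1, 0, 2, 1]        # draws x_i ~ P(x), made up for illustration
f = lambda x: x * x                        # any function of x

# Model-based: estimate P(x) from counts, then take the weighted sum
counts = Counter(samples)
N = len(samples)
model_based = sum((c / N) * f(x) for x, c in counts.items())

# Model-free: average f directly over the samples
model_free = sum(f(x) for x in samples) / N

print(model_based, model_free)             # identical here: 1.5 1.5
```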

**Example: Model-Free Estimation** • γ = 1, R = -1 • Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done) • Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done) • Direct estimates: V(2,3) ≈ (96 + -103) / 2 = -3.5, V(3,3) ≈ (99 + 97 + -102) / 3 ≈ 31.3
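A small sketch of direct estimation on exactly these two episodes (every-visit averaging of returns with γ = 1; the episode encoding is an assumption for illustration):

```python
from collections import defaultdict

# Each step: (state, action, reward); gamma = 1, so returns are plain sums
episode1 = [((1,1),'up',-1), ((1,2),'up',-1), ((1,2),'up',-1), ((1,3),'right',-1),
            ((2,3),'right',-1), ((3,3),'right',-1), ((3,2),'up',-1),
            ((3,3),'right',-1), ((4,3),'exit',100)]
episode2 = [((1,1),'up',-1), ((1,2),'up',-1), ((1,3),'right',-1),
            ((2,3),'right',-1), ((3,3),'right',-1), ((3,2),'up',-1),
            ((4,2),'exit',-100)]

returns = defaultdict(list)
for episode in (episode1, episode2):
    rewards = [r for (_, _, r) in episode]
    for i, (s, _, _) in enumerate(episode):
        returns[s].append(sum(rewards[i:]))   # return from this visit onward

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V[(2, 3)], V[(3, 3)])                   # -3.5 and about 31.3
```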

**Sample-Based Policy Evaluation?** • Update V^π(s) without building T or R • Idea: approximate the expectation over next states with actual samples s1’, s2’, s3’, … obtained by taking the action π(s), and average them.

**Passive Model-Free Learning** • Big idea: why bother learning T? • 1. Direct Estimation: • Estimate V(s) directly: average the observed discounted returns following each visit to a state. • No need to compute T or R. • 2. Temporal-Difference Learning: • Update the value function towards whatever successor occurs – maintain a running average.

**Temporal-Difference Learning** • Big idea: learn from every experience! • Update V^π(s) each time we experience (s,a,s’,r) • Likely s’ will contribute updates more often • Temporal difference learning • Policy still fixed! • Move values toward value of whatever successor occurs: running average! • Sample of V(s): $\mathit{sample} = R(s,\pi(s),s') + \gamma V^\pi(s')$ • Update to V(s): $V^\pi(s) \leftarrow (1-\alpha)\,V^\pi(s) + \alpha \cdot \mathit{sample}$ • Same update: $V^\pi(s) \leftarrow V^\pi(s) + \alpha\,(\mathit{sample} - V^\pi(s))$
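A minimal TD(0) sketch, assuming an environment object whose step(a) returns (s', r, done) (this interface and all names are assumptions, not course code):

```python
from collections import defaultdict

def td_policy_evaluation(env, policy, gamma, alpha, num_steps=10000):
    """TD(0): after each transition, nudge V(s) toward r + gamma * V(s')."""
    V = defaultdict(float)
    s = env.reset()                              # assumed env interface
    for _ in range(num_steps):
        a = policy[s]                            # policy is fixed
        s2, r, done = env.step(a)                # assumed to return (s', r, done)
        sample = r if done else r + gamma * V[s2]
        V[s] = V[s] + alpha * (sample - V[s])    # exponential moving average
        s = env.reset() if done else s2
    return dict(V)
```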

**Exponential Moving Average** • Exponential moving average: $\bar x_n = (1-\alpha)\,\bar x_{n-1} + \alpha\, x_n$ • Makes recent samples more important • Forgets about the past (distant past values were wrong anyway) • Easy to compute from the running average • Decreasing learning rate can give converging averages

**Problems with TD Value Learning** • TD value learning is a model-free way to do policy evaluation • However, if we want to turn values into a (new) policy, we’re sunk: $\pi(s) = \arg\max_a Q(s,a)$ with $Q(s,a) = \sum_{s'} T(s,a,s')\,[\,R(s,a,s') + \gamma V(s')\,]$, which still needs T and R • Idea: learn Q-values directly • Makes action selection model-free too!

**Active Learning** • Full reinforcement learning • You don’t know the transitions T(s,a,s’) • You don’t know the rewards R(s,a,s’) • You can choose any actions you like • Goal: learn the optimal policy • … what value iteration did! • In this case: • Learner makes choices! • Fundamental tradeoff: exploration vs. exploitation • This is NOT offline planning! You actually take actions in the world and find out what happens…

**Detour: Q-Value Iteration** • Value iteration: find successive approximations of the optimal values • Start with V_0^*(s) = 0, which we know is right (why?) • Given V_i^*, calculate the values for all states for depth i+1: $V_{i+1}^*(s) = \max_a \sum_{s'} T(s,a,s')\,[\,R(s,a,s') + \gamma V_i^*(s')\,]$ • But Q-values are more useful! • Start with Q_0^*(s,a) = 0, which we know is right (why?) • Given Q_i^*, calculate the Q-values for all Q-states for depth i+1: $Q_{i+1}^*(s,a) = \sum_{s'} T(s,a,s')\,[\,R(s,a,s') + \gamma \max_{a'} Q_i^*(s',a')\,]$
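A sketch of Q-value iteration under the same assumed tabular T/R access as the earlier snippets (illustrative only):

```python
def q_value_iteration(states, actions, T, R, gamma, iterations=100):
    """Iterate Q_{i+1}(s,a) = sum_s' T(s,a,s') [R(s,a,s') + gamma * max_a' Q_i(s',a')]."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q_0 = 0 everywhere
    for _ in range(iterations):
        # Build the new table from the old one (synchronous update)
        Q = {(s, a): sum(p * (R(s, a, s2) +
                              gamma * max(Q[(s2, a2)] for a2 in actions))
                         for s2, p in T(s, a))
             for s in states for a in actions}
    return Q
```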

**[DEMO – Grid Q’s]**

**Q-Learning** • Q-Learning: sample-based Q-value iteration • Learn Q*(s,a) values • Receive a sample (s,a,s’,r) • Consider your old estimate: $Q(s,a)$ • Consider your new sample estimate: $\mathit{sample} = R(s,a,s') + \gamma \max_{a'} Q(s',a')$ • Incorporate the new estimate into a running average: $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \cdot \mathit{sample}$
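A minimal Q-learning sketch with ε-greedy action choice, using the same assumed env.step interface as the TD sketch above (all names are illustrative assumptions):

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma, alpha, epsilon, num_steps=100000):
    """Sample-based Q-value iteration: one running-average update per transition."""
    Q = defaultdict(float)                    # Q[(s, a)], defaults to 0
    s = env.reset()
    for _ in range(num_steps):
        # Epsilon-greedy exploration (the exploration/exploitation tradeoff above)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a2: Q[(s, a2)])
        s2, r, done = env.step(a)             # assumed to return (s', r, done)
        target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = env.reset() if done else s2
    return dict(Q)
```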

**Example: Tom Erez, Hopper** • http://www.youtube.com/watch?feature=player_embedded&v=kUfmnoobTHQ