Download Presentation
## Lirong Xia

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Markov Decision Processes**Lirong Xia Tue, March 4, 2014**Reminder**• Midterm Mar 7 • in-class • open book and lecture notes • simple calculators are allowed • cannot use smartphone/laptops/wifi • practice exams and solutions (check piazza) • OH tomorrow (Lirong); Thursday (Hongzhao)**MEU Principle**• Theorem: • [Ramsey, 1931; von Neumann & Morgenstern, 1944] • Given any preference satisfying these axioms, there exists a real-value function U such that: • Maximum expected utility (MEU) principle: • Choose the action that maximizes expected utility • Utilities are just a representation! • Utilities are NOT money**Different possible risk attitudes under expected utility**maximization utility • Green has decreasing marginal utility → risk-averse • Blue has constant marginal utility → risk-neutral • Red has increasing marginal utility → risk-seeking • Grey’s marginal utility is sometimes increasing, sometimes decreasing → neither risk-averse (everywhere) nor risk-seeking (everywhere) money**Open question of last class**• Which would you prefer? • A lottery ticket that pays out $10 with probability .5 and $0 otherwise, or • A lottery ticket that pays out $3 with probability 1 • How about: • A lottery ticket that pays out $100,000,000 with probability .5 and $0 otherwise, or • A lottery ticket that pays out $30,000,000 with probability 1 • u(0)=0, u(3)=3, u(10)=9, u(30M)=1000, u(100M)=1500**Acting optimally over time**• Finite number of periods: • Overall utility = sum of rewards in individual periods • Infinite number of periods: • … are we just going to add up the rewards over infinitely many periods? • Always get infinity! • (Limit of) average payoff: limn→∞Σ1≤t≤nr(t)/n • Limit may not exist… • Discounted payoff: Σtϒtr(t) for some ϒ< 1 • Interpretations of discounting: • Interest rate r: ϒ= 1/(1+r) • World ends with some probability 1-ϒ • Discounting is mathematically convenient**Today**• Markov decision processes • search with uncertain moves and “infinite” space • Computing optimal policy • value iteration • policy iteration**Grid World**• The agent lives in a grid • Walls block the agent’s path • The agent’s actions do not always go as planned: • 80% of the time, the action North takes the agent North (if there is no wall there) • 10% of the time, North takes the agent West; 10% East • If there is a wall in the direction the agent would have taken, the agent stays for this turn • Small “living” reward each step • Big rewards come at the end • Goal: maximize sum of reward**Grid Feature**Deterministic Grid World Stochastic Grid World**Markov Decision Processes**• An MDP is defined by: • A set of statess∈S • A set of actionsa∈A • A transition function T(s,a,s’) • Prob that a from s leads to s’ • i.e., p(s’|s,a) • sometimes called the model • A reward function R(s, a, s’) • Sometimes just R(s) or R(s’) • A start state (or distribution) • Maybe a terminal state • MDPs are a family of nondeterministic search problems • Reinforcement learning (next class): MDPs where we don’t know the transition or reward functions**What is Markov about MDPs?**• Andrey Markov (1856-1922) • “Markov” generally means that given the present state, the future and the past are independent • For Markov decision processes, “Markov” means:**Solving MDPs**• In deterministic single-agent search problems, want an optimal plan, or sequence of actions, from start to a goal • In an MDP, we want an optimal policy • A policy π gives an action for each state • An optimal policy maximizes expected utility if followed • Defines a reflex agent • Optimal policy when R(s, a, s’) = -0.03 for all non-terminal state**Example Optimal Policies**R(s) = -0.01 R(s) = -0.03 R(s) = -0.4 R(s) = -2.0**Example: High-Low**• Three card type: 2,3,4 • Infinite deck, twice as many 2’s • Start with 3 showing • After each card, you say “high” or “low” • If you’re right, you win the points shown on the new card • Ties are no-ops • If you’re wrong, game ends • Why not use expectimax? • #1: get rewards as you go • #2: you might play forever!**High-Low as an MDP**• States: 2, 3, 4, done • Actions: High, Low • Model: T(s,a,s’): • p(s’=4|4,low) = ¼ • p(s’=3|4,low) = ¼ • p(s’=2|4,low) = ½ • p(s’=done|4,low) =0 • p(s’=4|4,high) = ¼ • p(s’=3|4,high) = 0 • p(s’=2|4,high) = 0 • p(s’=done|4,high) = ¾ • … • Rewards: R(s,a,s’): • Number shown on s’ if • 0 otherwise • Start: 3**MDP Search Trees**• Each MDP state gives an expectimax-like search tree**Utilities of Sequences**• In order to formalize optimality of a policy, need to understand utilities of sequences of rewards • Typically consider stationary preferences: • Two coherent ways to define stationary utilities • Additive utility: • Discounted utility:**Infinite Utilities?!**• Problems: infinite state sequences have infinite rewards • Solutions: • Finite horizon: • Terminate episodes after a fixed T steps (e.g. life) • Gives nonstationary policies (π depends on time left) • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “done” for High-Low) • Discounting: for 0<ϒ<1 • Smaller ϒ means smaller “horizon”-shorter term focus**Discounting**• Typically discount rewards by each time step • Sooner rewards have higher utility than later rewards • Also helps the algorithms converge**Recap: Defining MDPs**• Markov decision processes: • States S • Start state s0 • Actions A • Transition p(s’|s,a) (or T(s,a,s’)) • Reward R(s,a,s’) (and discount ) • MDP quantities so far: • Policy = Choice of action for each (MAX) state • Utility (or return) = sum of discounted rewards**Optimal Utilities**• Fundamental operation: compute the values (optimal expectimax utilities) of states s • Why? Optimal values define optimal policies! • Define the value of a state s: • V*(s) = expected utility starting in s and acting optimally • Define the value of a q-state (s,a): • Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally • Define the optimal policy: • π*(s) = optimal action from state s**The Bellman Equations**• Definition of “optimal utility” leads to a simple one-step lookahead relationship amongst optimal utility values: Optimal rewards = maximize over first and then follow optimal policy • Formally:**Solving MDPs**• We want to find the optimal policy • Proposal 1: modified expectimax search, starting from each state s:**Why Not Search Trees?**• Why not solve with expectimax? • Problems: • This tree is usually infinite • Same states appear over and over • We would search once per state • Idea: value iteration • Compute optimal values for all states all at once using successive approximations**Value Estimates**• Calculate estimates Vk*(s) • Not the optimal value of s! • The optimal value considering only next k time steps (k rewards) • As , it approaches the optimal value • Almost solution: recursion (i.e. expectimax) • Correct solution: dynamic programming**Computing the optimal policy**• Value iteration • Policy iteration**Value Iteration**• Idea: • Start with V1(s) = 0 • Given Vi, calculate the values for all states for depth i+1: • This is called a value update or Bellman update • Repeat until converge • Use Vias evaluation function when computing Vi+1 • Theorem: will converge to unique optimal values • Basic idea: approximations get refined towards optimal values • Policy may converge long before values do**Example: γ=0.9, living reward=0, noise=0.2**Example: Bellman Updates max happens for a=right, other actions not shown**Example: Value Iteration**• Information propagates outward from terminal states and eventually all states have correct value estimates**Convergence**• Define the max-norm: • Theorem: for any two approximations U and V • I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U and value iteration converges to a unique, stable, optimal solution • Theorem: • I.e. once the change in our approximation is small, it must also be close to correct**Practice: Computing Actions**• Which action should we chose from state s: • Given optimal values V? • Given optimal q-values Q?**Utilities for a Fixed Policy**• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy • Define the utility of a state s, under a fixed policy π: Vπ(s)=expected total discounted rewards (return) starting in s and following π • Recursive relation (one-step lookahead / Bellman equation):**Policy Evaluation**• How do we calculate the V’s for a fixed policy? • Idea one: turn recursive equations into updates • Idea two: it’s just a linear system, solve with Matlab (or what ever)**Policy Iteration**• Alternative approach: • Step 1: policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) • Step 2: policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values • Repeat steps until policy converges • This is policy iteration • It’s still optimal! • Can converge faster under some conditions**Policy Iteration**• Policy evaluation: with fixed current policy π, find values with simplified Bellman updates: • Policy improvement: with fixed utilities, find the best action according to one-step look-ahead**Comparison**• Both compute same thing (optimal values for all states) • In value iteration: • Every iteration updates both utilities (explicitly, based on current utilities) and policy (implicitly, based on current utilities) • Tracking the policy isn’t necessary; we take the max • In policy iteration: • Compute utilities with fixed policy • After utilities are computed, a new policy is chosen • Both are dynamic programs for solving MDPs**Preview: Reinforcement Learning**• Reinforcement learning: • Still have an MDP: • A set of states • A set of actions (per state) A • A model T(s,a,s’) • A reward function R(s,a,s’) • Still looking for a policy π(s) • New twist: don’t know T or R • I.e. don’t know which states are good or what the actions do • Must actually try actions and states out to learn