
CS 188: Artificial Intelligence, Spring 2007

Lecture 21: Reinforcement Learning II

MDP

4/12/2007

Srini Narayanan – ICSI and UC Berkeley


Announcements

  • Othello tournament signup

    • Please send email to [email protected]

  • HW on classification out

    • Due 4/23

    • Can work in pairs


Reinforcement Learning

  • Basic idea:

    • Receive feedback in the form of rewards

    • Agent’s utility is defined by the reward function

    • Must learn to act so as to maximize expected utility

    • Change the rewards, change the behavior

  • Examples:

    • Learning optimal paths

    • Playing a game, reward at the end for winning / losing

    • Vacuuming a house, reward for each piece of dirt picked up

    • Automated taxi, reward for each passenger delivered


Recap: MDPs

  • Markov decision processes:

    • States S

    • Actions A

    • Transitions P(s’|s,a) (or T(s,a,s’))

    • Rewards R(s,a,s’)

    • Start state s0

  • Examples:

    • Gridworld, High-Low, N-Armed Bandit

    • Any process where the result of your action is stochastic

  • Goal: find the “best” policy π*

    • Policies are maps from states to actions

    • What do we mean by “best”?

    • This is like search – it’s planning using a model, not actually interacting with the environment
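
To make these ingredients concrete, here is a minimal sketch of an MDP written as plain Python tables. The states, actions, and numbers below are invented for illustration (they are not the lecture’s Gridworld or High-Low); only the structure (S, A, T, R, start state) mirrors the definition above.

```python
# A tiny, hypothetical MDP written as plain Python dictionaries.
S = ["cool", "warm", "done"]           # states
A = ["slow", "fast"]                   # actions

# T[(s, a)] is a list of (next_state, probability) pairs: P(s'|s,a)
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("done", 1.0)],
}

# R[(s, a, s')] is the reward received on that transition
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "done"): -10.0,
}

s0 = "cool"                            # start state; "done" acts as a terminal state
```

The later sketches in this transcript assume transition and reward tables in this dictionary style.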


MDP Solutions

  • In deterministic single-agent search, want an optimal sequence of actions from start to a goal

  • In an MDP, like expectimax, want an optimal policy π*(s)

    • A policy gives an action for each state

    • Optimal policy maximizes expected utility (i.e. expected rewards) if followed

    • Defines a reflex agent

Optimal policy when R(s, a, s’) = -0.04 for all non-terminals s


Example Optimal Policies

[Four gridworld figures not shown, illustrating the optimal policy for living rewards R(s) = -0.01, -0.03, -0.4, and -2.0]


Stationarity

  • In order to formalize optimality of a policy, need to understand utilities of reward sequences

  • Typically consider stationary preferences: if [r, r1, r2, …] is preferred to [r, r1', r2', …], then [r1, r2, …] is preferred to [r1', r2', …]

  • Theorem: only two ways to define stationary utilities

    • Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + …

    • Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + …


Infinite Utilities?!

  • Problem: infinite state sequences with infinite rewards

  • Solutions:

    • Finite horizon:

      • Terminate after a fixed T steps

      • Gives nonstationary policy (π depends on time left)

    • Absorbing state(s): guarantee that for every policy, agent will eventually “die” (like “done” for High-Low)

    • Discounting: for 0 < γ < 1, the utility of an infinite sequence is bounded: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + … ≤ Rmax / (1 − γ)

      • Smaller γ means a smaller effective horizon


How (Not) to Solve an MDP

  • The inefficient way:

    • Enumerate policies

    • For each one, calculate the expected utility (discounted rewards) from the start state

      • E.g. by simulating a bunch of runs

    • Choose the best policy

  • We’ll return to a (better) idea like this later
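
To see why this is inefficient, here is a rough sketch of the enumerate-and-simulate approach. The function names are made up for this sketch, and `step(s, a)` is a hypothetical simulator that samples (next_state, reward) from the known model; there are |A|^|S| deterministic policies, which is what makes this approach hopeless for nontrivial MDPs.

```python
from itertools import product

def enumerate_policies(states, actions):
    """Yield every deterministic policy: one action per state (|A|^|S| of them -- inefficient!)."""
    for choice in product(actions, repeat=len(states)):
        yield dict(zip(states, choice))

def estimate_utility(policy, start, step, gamma=0.9, runs=500, horizon=50):
    """Estimate a policy's expected discounted reward from `start` by simulating many runs."""
    total = 0.0
    for _ in range(runs):
        s, discount = start, 1.0
        for _ in range(horizon):
            s, r = step(s, policy[s])      # sample a transition from the model
            total += discount * r
            discount *= gamma
    return total / runs
```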


Utility of a State

  • Define the utility of a state under a policy:

    V^π(s) = expected total (discounted) rewards starting in s and following π

  • Recursive definition (one-step look-ahead):

    V^π(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]


Policy Evaluation

  • Idea one: turn the recursive equations into updates: V^π_{k+1}(s) ← Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ V^π_k(s') ]

  • Idea two: with the policy fixed there is no max, so these are |S| linear equations in the |S| unknowns V^π(s) – solve the linear system directly with Matlab (or Mosek, or Cplex)
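
A minimal sketch of “idea one” (iterative updates), assuming transition and reward tables in the dictionary style of the earlier MDP sketch; the function name and conventions are choices made here, not from the lecture.

```python
def evaluate_policy(states, policy, T, R, gamma=0.9, iters=100):
    """Repeatedly apply the fixed-policy Bellman update to estimate V^pi.

    T[(s, a)]      -> list of (s', prob) pairs
    R[(s, a, s')]  -> reward for that transition
    Terminal states (no entry in T) keep value 0.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        new_V = {}
        for s in states:
            a = policy.get(s)
            if a is None or (s, a) not in T:          # terminal / no available action
                new_V[s] = 0.0
                continue
            new_V[s] = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                           for s2, p in T[(s, a)])
        V = new_V
    return V
```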


Example: High-Low

  • Policy: always say “high”

  • Iterative updates:


Optimal Utilities

  • Goal: calculate the optimal utility of each state

    V*(s) = expected (discounted) rewards with optimal actions

  • Why: Given optimal utilities, MEU tells us the optimal policy: π*(s) = argmax_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]


Bellman’s Equation for Selecting Actions

(“That’s my equation!”)

  • Definition of utility leads to a simple relationship amongst optimal utility values:

    Optimal rewards = maximize over the first action and then follow the optimal policy

    Formally, Bellman’s equation: V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]



Value Iteration

  • Idea:

    • Start with bad guesses at all utility values (e.g. V0(s) = 0)

    • Update all values simultaneously using the Bellman equation (called a value update or Bellman update): V_{k+1}(s) ← max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V_k(s') ]

    • Repeat until convergence

  • Theorem: will converge to unique optimal values

    • Basic idea: bad guesses get refined towards optimal values

    • Policy may converge long before values do
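
A compact sketch of value iteration under the same assumed dictionary-style model as above; the names and the simple convergence test are choices made for this sketch.

```python
def value_iteration(states, actions, T, R, gamma=0.9, epsilon=1e-6):
    """Apply the Bellman update V(s) <- max_a sum_s' T(s,a,s')[R(s,a,s') + gamma V(s')] until convergence."""
    V = {s: 0.0 for s in states}                      # "bad" initial guess V0(s) = 0
    while True:
        new_V = {}
        for s in states:
            q_values = [sum(p * (R[(s, a, s2)] + gamma * V[s2])
                            for s2, p in T[(s, a)])
                        for a in actions if (s, a) in T]
            new_V[s] = max(q_values) if q_values else 0.0   # terminal states stay at 0
        if max(abs(new_V[s] - V[s]) for s in states) < epsilon:
            return new_V
        V = new_V
```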



Example: Value Iteration

  • Information propagates outward from terminal states and eventually all states have correct value estimates

[DEMO]


Convergence*

  • Define the max-norm: ||V|| = max_s |V(s)|

  • Theorem: for any two approximations U and V (any two utility vectors), ||U_{k+1} − V_{k+1}|| ≤ γ ||U_k − V_k||

    • I.e. any two distinct approximations must get closer to each other after a Bellman update; in particular, since the true utilities U* are a fixed point of the Bellman update, any approximation must get closer to U*, so value iteration converges to a unique, stable, optimal solution

  • Theorem: if ||U_{k+1} − U_k|| < ε(1 − γ)/γ, then ||U_{k+1} − U*|| < ε

    • I.e. once the change in our approximation is small, it must also be close to correct


Policy Iteration

  • Alternate approach:

    • Policy evaluation: calculate utilities for a fixed policy until convergence (remember the beginning of lecture)

    • Policy improvement: update policy based on resulting converged utilities

    • Repeat until policy converges

  • This is policy iteration

    • Can converge faster under some conditions


Policy Iteration

  • If we have a fixed policy π, use the simplified Bellman equation (no max) to calculate utilities:

    V^π(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]

  • For fixed utilities, it is easy to find the best action with a one-step look-ahead:

    π_new(s) = argmax_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V^π(s') ]
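
A sketch of the improvement step and the outer policy-iteration loop, reusing the `evaluate_policy` sketch from the policy-evaluation slide above; as before, the dictionary table format and the names are assumptions made for illustration.

```python
def extract_policy(states, actions, T, R, V, gamma=0.9):
    """One-step look-ahead: for each state, pick the action with the best expected value under V."""
    policy = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions:
            if (s, a) not in T:
                continue
            q = sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
            if q > best_q:
                best_a, best_q = a, q
        policy[s] = best_a
    return policy


def policy_iteration(states, actions, T, R, gamma=0.9):
    """Alternate policy evaluation and policy improvement until the policy stops changing."""
    policy = {s: next((a for a in actions if (s, a) in T), None) for s in states}
    while True:
        V = evaluate_policy(states, policy, T, R, gamma)      # evaluation (sketch above)
        new_policy = extract_policy(states, actions, T, R, V, gamma)  # improvement
        if new_policy == policy:
            return policy, V
        policy = new_policy
```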


Comparison

  • In value iteration:

    • Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)

  • In policy iteration:

    • Several passes to update utilities with frozen policy

    • Occasional passes to update policies

  • Hybrid approaches (asynchronous policy iteration):

    • Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often


Reinforcement Learning

  • Reinforcement learning:

    • Still have an MDP:

      • A set of states s ∈ S

      • A model T(s,a,s’)

      • A reward function R(s)

    • Still looking for a policy π(s)

    • New twist: don’t know T or R

      • I.e. don’t know which states are good or what the actions do

      • Must actually try actions and states out to learn


Example: Animal Learning

  • RL studied experimentally for more than 60 years in psychology

    • Rewards: food, pain, hunger, drugs, etc.

    • Mechanisms and sophistication debated

  • Example: foraging

    • Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies

    • Bees have a direct neural connection from nectar intake measurement to motor planning area


Example: Backgammon

  • Reward only for win / loss in terminal states, zero otherwise

  • TD-Gammon learns a function approximation to U(s) using a neural network

  • Combined with depth 3 search, one of the top 3 players in the world


Passive Learning

  • Simplified task

    • You don’t know the transitions T(s,a,s’)

    • You don’t know the rewards R(s,a,s’)

    • You are given a policy π(s)

    • Goal: learn the state values (and maybe the model)

  • In this case:

    • No choice about what actions to take

    • Just execute the policy and learn from experience

    • We’ll get to the general case soon


Example: Direct Estimation

[Gridworld figure not shown: terminal rewards +100 and -100]

  • Episodes (state, action, reward):

    Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right +100, (done)

    Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) right -1, (4,2) right -100, (done)

  • γ = 1, R = -1

    U(1,1) ≈ (93 + -105) / 2 = -6

    U(3,3) ≈ (100 + 98 + -101) / 3 ≈ 32.3
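
A sketch of direct (Monte-Carlo) estimation: average the observed return after each visit to a state. The episode format and the convention of including the reward listed on the visited step are choices made here; the slide’s exact bookkeeping may differ slightly.

```python
from collections import defaultdict

def direct_estimation(episodes, gamma=1.0):
    """Average the (discounted) return observed after each visit to a state.

    Each episode is a list of (state, action, reward) triples, in order.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        G = 0.0
        returns = []
        for (s, a, r) in reversed(episode):   # accumulate returns backwards through the episode
            G = r + gamma * G
            returns.append((s, G))
        for s, G in returns:
            totals[s] += G
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Example with Episode 1 from the slide: the return observed from (1,1) is 93.
episode_1 = [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
             ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1), ((3, 3), "right", 100)]
print(direct_estimation([episode_1])[(1, 1)])   # 93.0
```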


Model-Based Learning

  • Idea:

    • Learn the model empirically (rather than values)

    • Solve the MDP as if the learned model were correct

  • Empirical model learning

    • Simplest case:

      • Count outcomes for each s,a

      • Normalize to give estimate of T(s,a,s’)

      • Discover R(s,a,s’) the first time we experience (s,a,s’)

    • More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. “stationary noise”)
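
A sketch of the simplest empirical model learner described above: count outcomes for each (s, a), normalize to get T estimates, and record R the first time each transition is seen. The `(s, a, s', r)` episode format is an assumption of this sketch.

```python
from collections import defaultdict

def learn_model(episodes):
    """Estimate T(s,a,s') by normalized counts; record R(s,a,s') when first experienced.

    Each episode is a list of (s, a, s', r) transition tuples.
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = number of times observed
    R_hat = {}
    for episode in episodes:
        for (s, a, s2, r) in episode:
            counts[(s, a)][s2] += 1
            R_hat.setdefault((s, a, s2), r)          # discovered the first time we experience (s,a,s')
    T_hat = {}
    for (s, a), outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        T_hat[(s, a)] = [(s2, c / total) for s2, c in outcome_counts.items()]
    return T_hat, R_hat
```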


Example: Model-Based Learning

[Gridworld figure not shown: terminal rewards +100 and -100]

  • Episodes (state, action, reward):

    Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right +100, (done)

    Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) right -1, (4,2) right -100, (done)

  • Model estimated from the counts (γ = 1):

    T(<3,3>, right, <4,3>) = 1 / 3

    T(<2,3>, right, <3,3>) = 2 / 2


Model-Based Learning

  • In general, want to learn the optimal policy, not evaluate a fixed policy

  • Idea: adaptive dynamic programming

    • Learn an initial model of the environment from experience

    • Solve for the optimal policy for this model (value or policy iteration)

    • Refine model through experience and repeat

    • Crucial: we have to make sure we actually learn about all of the model
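
A rough sketch of this adaptive dynamic programming loop, reusing the `learn_model`, `value_iteration`, and `extract_policy` sketches above. `execute_policy` is a hypothetical stand-in for actually acting in the environment and collecting transitions, and the sketch deliberately omits the exploration needed to make sure we learn about all of the model.

```python
def adaptive_dp(states, actions, execute_policy, initial_policy, gamma=0.9, rounds=10):
    """Alternate between acting, re-learning the model, and re-planning.

    `execute_policy(policy)` is assumed to run the policy in the real environment
    and return a list of (s, a, s', r) transition tuples.
    """
    experience = []
    policy = dict(initial_policy)
    for _ in range(rounds):
        experience.append(execute_policy(policy))                      # gather fresh experience
        T_hat, R_hat = learn_model(experience)                         # re-estimate the model
        V = value_iteration(states, actions, T_hat, R_hat, gamma)      # re-plan on the learned model
        policy = extract_policy(states, actions, T_hat, R_hat, V, gamma)
    return policy
```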

