CS 188: Artificial Intelligence, Spring 2007

Lecture 21: Reinforcement Learning II

MDP

4/12/2007

Srini Narayanan – ICSI and UC Berkeley

Announcements
  • Othello tournament signup
  • HW on classification out
    • Due 4/23
    • Can work in pairs
Reinforcement Learning
  • Basic idea:
    • Receive feedback in the form of rewards
    • Agent’s utility is defined by the reward function
    • Must learn to act so as to maximize expected utility
    • Change the rewards, change the behavior
  • Examples:
    • Learning optimal paths
    • Playing a game, reward at the end for winning / losing
    • Vacuuming a house, reward for each piece of dirt picked up
    • Automated taxi, reward for each passenger delivered
Recap: MDPs
  • Markov decision processes:
    • States S
    • Actions A
    • Transitions P(s’|s,a) (or T(s,a,s’))
    • Rewards R(s,a,s’)
    • Start state s0
  • Examples:
    • Gridworld, High-Low, N-Armed Bandit
    • Any process where the result of your action is stochastic
  • Goal: find the “best” policy π*
    • Policies are maps from states to actions
    • What do we mean by “best”?
    • This is like search – it’s planning using a model, not actually interacting with the environment
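
To make these pieces concrete, here is a minimal sketch (illustrative only, not the course's project code; the tiny two-state "racing" MDP is made up and is not one of the examples above) of one way to store an MDP in Python: transitions as a dictionary mapping each state and action to a list of (probability, next state) pairs, and rewards keyed by (s, a, s'). The later sketches below reuse this format.

```python
# Minimal sketch of an MDP as plain dictionaries (illustrative only).
# T[s][a] is a list of (probability, next_state) pairs; R[(s, a, s2)] is the
# reward for taking action a in state s and landing in s2.

states = ["cool", "overheated"]
actions = {"cool": ["slow", "fast"], "overheated": []}   # terminal state: no actions

T = {
    "cool": {
        "slow": [(1.0, "cool")],
        "fast": [(0.8, "cool"), (0.2, "overheated")],
    },
    "overheated": {},
}

R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "overheated"): -10.0,
}

start_state = "cool"
```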
MDP Solutions
  • In deterministic single-agent search, want an optimal sequence of actions from start to a goal
  • In an MDP, like expectimax, want an optimal policy π*(s)
    • A policy gives an action for each state
    • Optimal policy maximizes expected utility (i.e. expected rewards) if followed
    • Defines a reflex agent

Optimal policy when R(s, a, s’) = -0.04 for all non-terminals s

Example Optimal Policies

[Figure: optimal Gridworld policies for living rewards R(s) = -0.01, -0.03, -0.4, and -2.0]

Stationarity
  • In order to formalize optimality of a policy, need to understand utilities of reward sequences
  • Typically consider stationary preferences: [r, r1, r2, …] is preferred to [r, r1′, r2′, …] exactly when [r1, r2, …] is preferred to [r1′, r2′, …]
  • Theorem: only two ways to define stationary utilities
    • Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + …
    • Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + …
Infinite Utilities?!
  • Problem: infinite state sequences with infinite rewards
  • Solutions:
    • Finite horizon:
      • Terminate after a fixed T steps
      • Gives nonstationary policy (π depends on time left)
    • Absorbing state(s): guarantee that for every policy, agent will eventually “die” (like “done” for High-Low)
    • Discounting: for 0 < γ < 1
      • Smaller γ means smaller horizon
      • With rewards bounded by Rmax, the discounted sum is bounded by Rmax / (1 - γ), so utilities stay finite
How (Not) to Solve an MDP
  • The inefficient way:
    • Enumerate policies
    • For each one, calculate the expected utility (discounted rewards) from the start state
      • E.g. by simulating a bunch of runs
    • Choose the best policy
  • We’ll return to a (better) idea like this later
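
As a rough illustration of why this is inefficient (assuming the dictionary MDP format from the earlier sketch; all function names here are made up), the sketch below enumerates every deterministic policy and scores each one by Monte Carlo simulation. With |A| choices in each of |S| states there are |A|^|S| policies, which is exactly what makes this the wrong way.

```python
import itertools
import random

def enumerate_policies(states, actions):
    """Yield every deterministic policy as a dict mapping state -> action."""
    choices = [actions[s] if actions[s] else [None] for s in states]
    for combo in itertools.product(*choices):
        yield dict(zip(states, combo))

def sample_return(policy, T, R, start, gamma=0.9, horizon=100):
    """Simulate one run of the policy and return its discounted reward."""
    s, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        if a is None:                       # terminal state
            break
        probs = [p for p, _ in T[s][a]]
        succs = [s2 for _, s2 in T[s][a]]
        s2 = random.choices(succs, weights=probs)[0]
        total += discount * R.get((s, a, s2), 0.0)
        discount *= gamma
        s = s2
    return total

def best_policy_by_enumeration(states, actions, T, R, start, gamma=0.9, runs=500):
    """Score every policy by averaging sampled returns and keep the best one."""
    def score(pi):
        return sum(sample_return(pi, T, R, start, gamma) for _ in range(runs)) / runs
    return max(enumerate_policies(states, actions), key=score)
```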
Utility of a State
  • Define the utility of a state under a fixed policy π:

Vπ(s) = expected total (discounted) rewards starting in s and following π

  • Recursive definition (one-step look-ahead): Vπ(s) = Σs’ T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπ(s’) ]
Policy Evaluation
  • Idea one: turn the recursive equation into an update rule (sketched in the code below): Vi+1π(s) ← Σs’ T(s, π(s), s’) [ R(s, π(s), s’) + γ Viπ(s’) ]
  • Idea two: it’s just a linear system, solve with Matlab (or Mosek, or Cplex)
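
A minimal sketch of idea one, assuming the dictionary MDP format from the earlier sketch (illustrative, not the course's project code): start from zero values and repeatedly apply the fixed-policy update.

```python
def evaluate_policy(policy, states, T, R, gamma=0.9, iterations=100):
    """Iterative policy evaluation: repeatedly apply the fixed-policy Bellman update.

    T[s][a] is a list of (probability, next_state) pairs and R[(s, a, s2)] is the
    transition reward; terminal states have policy[s] = None and value 0.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        new_V = {}
        for s in states:
            a = policy.get(s)
            if a is None:                  # terminal state
                new_V[s] = 0.0
            else:
                new_V[s] = sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                               for p, s2 in T[s][a])
        V = new_V
    return V
```

Idea two would instead treat the Vπ values as the solution of |S| linear equations and solve them directly (e.g. with numpy.linalg.solve), which is exact but costs roughly cubic time in the number of states.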
Example: High-Low
  • Policy: always say “high”
  • Iterative updates:
Optimal Utilities
  • Goal: calculate the optimal utility of each state

V*(s) = expected (discounted) rewards with optimal actions

  • Why: given the optimal utilities, MEU (maximum expected utility) gives the optimal policy: π*(s) = argmaxa Σs’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

[Image on slide: “That’s my equation!”]

Bellman’s Equation for Selecting Actions
  • Definition of utility leads to a simple relationship amongst optimal utility values:

Optimal rewards = maximize over the first action and then follow the optimal policy

Formally, Bellman’s equation: V*(s) = maxa Σs’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

Value Iteration
  • Idea:
    • Start with bad guesses at all utility values (e.g. V0(s) = 0)
    • Update all values simultaneously using the Bellman equation (called a value update or Bellman update): Vi+1(s) ← maxa Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vi(s’) ]
    • Repeat until convergence
  • Theorem: will converge to unique optimal values
    • Basic idea: bad guesses get refined towards optimal values
    • Policy may converge long before values do
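
A value iteration sketch in the same illustrative dictionary format (names made up; terminal states are assumed to have an empty action list), together with the one-step MEU look-ahead that reads a policy off the converged values:

```python
def value_iteration(states, actions, T, R, gamma=0.9, epsilon=1e-6):
    """Repeat the Bellman update for all states until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            if not actions[s]:             # terminal state
                new_V[s] = 0.0
            else:
                new_V[s] = max(sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                   for p, s2 in T[s][a])
                               for a in actions[s])
        if max(abs(new_V[s] - V[s]) for s in states) < epsilon:
            return new_V
        V = new_V

def extract_policy(V, states, actions, T, R, gamma=0.9):
    """One-step look-ahead (MEU): pick the action with the best expected value."""
    def q_value(s, a):
        return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2]) for p, s2 in T[s][a])
    return {s: (max(actions[s], key=lambda a: q_value(s, a)) if actions[s] else None)
            for s in states}
```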
Example: Value Iteration
  • Information propagates outward from terminal states and eventually all states have correct value estimates

[DEMO]

Convergence*
  • Define the max-norm: ||U|| = maxs |U(s)|
  • Theorem: for any two approximations Ui and Vi (any two utility vectors), ||Ui+1 − Vi+1|| ≤ γ ||Ui − Vi||
    • I.e. any two distinct approximations must get closer to each other after a Bellman update, so, in particular, any approximation must get closer to the true utilities U (which are a fixed point of the Bellman update), and value iteration converges to a unique, stable, optimal solution
  • Theorem: if ||Ui+1 − Ui|| < ε, then ||Ui+1 − U|| < εγ / (1 − γ), where U is the true utility
    • I.e. once the change in our approximation is small, it must also be close to correct
Policy Iteration
  • Alternate approach:
    • Policy evaluation: calculate utilities for a fixed policy until convergence (remember the beginning of the lecture)
    • Policy improvement: update policy based on resulting converged utilities
    • Repeat until policy converges
  • This is policy iteration
    • Can converge faster under some conditions
Policy Iteration
  • If we have a fixed policy π, use the simplified Bellman equation to calculate utilities: Vπ(s) = Σs’ T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπ(s’) ]
  • For fixed utilities, it is easy to find the best action with one-step look-ahead: π(s) ← argmaxa Σs’ T(s, a, s’) [ R(s, a, s’) + γ V(s’) ]
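
A minimal policy iteration sketch, reusing evaluate_policy and extract_policy from the earlier sketches (same illustrative dictionary format; this is a rough sketch, not the course's project code):

```python
def policy_iteration(states, actions, T, R, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    # Start from an arbitrary policy: the first listed action in each state.
    policy = {s: (actions[s][0] if actions[s] else None) for s in states}
    while True:
        V = evaluate_policy(policy, states, T, R, gamma)               # evaluation
        new_policy = extract_policy(V, states, actions, T, R, gamma)   # improvement
        if new_policy == policy:
            return policy, V
        policy = new_policy
```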
Comparison
  • In value iteration:
    • Every pass (or “backup”) updates both the utilities (explicitly, based on the current utilities) and the policy (possibly implicitly, via the max over actions)
  • In policy iteration:
    • Several passes to update utilities with frozen policy
    • Occasional passes to update policies
  • Hybrid approaches (asynchronous policy iteration):
    • Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
Reinforcement Learning
  • Reinforcement learning:
    • Still have an MDP:
      • A set of states s ∈ S
      • A model T(s,a,s’)
      • A reward function R(s)
    • Still looking for a policy π(s)
    • New twist: don’t know T or R
      • I.e. don’t know which states are good or what the actions do
      • Must actually try actions and states out to learn
Example: Animal Learning
  • RL studied experimentally for more than 60 years in psychology
    • Rewards: food, pain, hunger, drugs, etc.
    • Mechanisms and sophistication debated
  • Example: foraging
    • Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies
    • Bees have a direct neural connection from nectar intake measurement to motor planning area
Example: Backgammon
  • Reward only for win / loss in terminal states, zero otherwise
  • TD-Gammon learns a function approximation to U(s) using a neural network
  • Combined with depth 3 search, one of the top 3 players in the world
Passive Learning
  • Simplified task
    • You don’t know the transitions T(s,a,s’)
    • You don’t know the rewards R(s,a,s’)
    • You are given a policy π(s)
    • Goal: learn the state values (and maybe the model)
  • In this case:
    • No choice about what actions to take
    • Just execute the policy and learn from experience
    • We’ll get to the general case soon
Example: Direct Estimation

y

  • Episodes:

+100

(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) right -1

(4,2) right -100

(done)

-100

x

 = 1, R = -1

U(1,1) ~ (93 + -105) / 2 = -6

U(3,3) ~ (100 + 98 + -101) / 3 = 32.3
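
A hedged sketch of direct estimation: treat each visit to a state as producing one sample of its return (the discounted sum of rewards from that point to the end of the episode) and average the samples. Episodes are assumed to be lists of (state, action, reward) triples like the ones above; the function name is made up.

```python
from collections import defaultdict

def direct_estimation(episodes, gamma=1.0):
    """Average the observed returns from each state over all of its visits."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        for i, (s, _, _) in enumerate(episode):
            # Return observed from this visit: discounted sum of the remaining rewards.
            ret = sum(gamma ** k * r for k, (_, _, r) in enumerate(episode[i:]))
            totals[s] += ret
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}
```

Feeding it the two episodes above produces averages like the U(1,1) and U(3,3) estimates on the slide, up to the slide's convention for which reward is counted at each step.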

Model-Based Learning
  • Idea:
    • Learn the model empirically (rather than values)
    • Solve the MDP as if the learned model were correct
  • Empirical model learning
    • Simplest case:
      • Count outcomes for each s,a
      • Normalize to give estimate of T(s,a,s’)
      • Discover R(s,a,s’) the first time we experience (s,a,s’)
    • More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. “stationary noise”)
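
A sketch of the simplest case described above, assuming the same episode format of (state, action, reward) triples, where each step's successor is the next step's state and the final step leads to a "done" marker (an illustration, not the course's project code):

```python
from collections import defaultdict

def learn_model(episodes):
    """Estimate T(s,a,s') by counting and normalizing outcomes, and record
    R(s,a,s') the first time each transition is experienced."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s2]
    rewards = {}                                     # rewards[(s, a, s2)]
    for episode in episodes:
        for i, (s, a, r) in enumerate(episode):
            s2 = episode[i + 1][0] if i + 1 < len(episode) else "done"
            counts[(s, a)][s2] += 1
            rewards.setdefault((s, a, s2), r)        # first observation wins
    # Normalize the counts into transition probabilities, T[s][a] = [(p, s2), ...].
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T.setdefault(s, {})[a] = [(n / total, s2) for s2, n in outcomes.items()]
    return T, rewards
```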
Example: Model-Based Learning

y

  • Episodes:

+100

(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) right -1

(4,2) right -100

(done)

-100

x

 = 1

T(<3,3>, right, <4,3>) = 1 / 3

T(<2,3>, right, <3,3>) = 2 / 2

Model-Based Learning
  • In general, want to learn the optimal policy, not evaluate a fixed policy
  • Idea: adaptive dynamic programming
    • Learn an initial model of the environment:
    • Solve for the optimal policy for this model (value or policy iteration)
    • Refine model through experience and repeat
    • Crucial: we have to make sure we actually learn about all of the model
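
Putting the pieces together, a rough sketch of the adaptive dynamic programming loop just described, reusing the learn_model, value_iteration and extract_policy sketches from earlier. collect_episodes is a made-up stand-in for actually interacting with the environment, and the loop assumes every state-action pair eventually shows up in the experience (the "learn about all of the model" caveat in the last bullet).

```python
def adaptive_dynamic_programming(collect_episodes, states, actions, gamma=0.9, rounds=10):
    """Learn a model from experience, solve it, act on the result, and repeat."""
    episodes = []
    # Start from an arbitrary policy: the first listed action in each state.
    policy = {s: (actions[s][0] if actions[s] else None) for s in states}
    for _ in range(rounds):
        episodes += collect_episodes(policy)                 # gather fresh experience
        # Assumes every (s, a) pair appears in the experience, so T and R are complete.
        T, R = learn_model(episodes)                         # refit the empirical model
        V = value_iteration(states, actions, T, R, gamma)    # solve the learned MDP
        policy = extract_policy(V, states, actions, T, R, gamma)
    return policy
```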