- 125 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about ' CS 188: Artificial Intelligence Spring 2007' - cicada

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### CS 188: Artificial IntelligenceSpring 2007

Lecture 21:Reinforcement Learning: II

MDP

4/12/2007

Srini Narayanan – ICSI and UC Berkeley

Announcements

- Othello tournament signup
- Please send email to [email protected]
- HW on classification out
- Due 4/23
- Can work in pairs

Reinforcement Learning

- Basic idea:
- Receive feedback in the form of rewards
- Agent’s utility is defined by the reward function
- Must learn to act so as to maximize expected utility
- Change the rewards, change the behavior
- Examples:
- Learning optimal paths
- Playing a game, reward at the end for winning / losing
- Vacuuming a house, reward for each piece of dirt picked up
- Automated taxi, reward for each passenger delivered

Recap: MDPs

- Markov decision processes:
- States S
- Actions A
- Transitions P(s’|s,a) (or T(s,a,s’))
- Rewards R(s,a,s’)
- Start state s0
- Examples:
- Gridworld, High-Low, N-Armed Bandit
- Any process where the result of your action is stochastic
- Goal: find the “best” policy
- Policies are maps from states to actions
- What do we mean by “best”?
- This is like search – it’s planning using a model, not actually interacting with the environment

MDP Solutions

- In deterministic single-agent search, want an optimal sequence of actions from start to a goal
- In an MDP, like expectimax, want an optimal policy (s)
- A policy gives an action for each state
- Optimal policy maximizes expected utility (i.e. expected rewards) if followed
- Defines a reflex agent

Optimal policy when R(s, a, s’) = -0.04 for all non-terminals s

Stationarity

- In order to formalize optimality of a policy, need to understand utilities of reward sequences
- Typically consider stationary preferences:
- Theorem: only two ways to define stationary utilities
- Additive utility:
- Discounted utility:

Infinite Utilities?!

- Problem: infinite state sequences with infinite rewards
- Solutions:
- Finite horizon:
- Terminate after a fixed T steps
- Gives nonstationary policy ( depends on time left)
- Absorbing state(s): guarantee that for every policy, agent will eventually “die” (like “done” for High-Low)
- Discounting: for 0 < < 1
- Smaller means smaller horizon

How (Not) to Solve an MDP

- The inefficient way:
- Enumerate policies
- For each one, calculate the expected utility (discounted rewards) from the start state
- E.g. by simulating a bunch of runs
- Choose the best policy
- We’ll return to a (better) idea like this later

Utility of a State

- Define the utility of a state under a policy:

V(s) = expected total (discounted) rewards starting in s and following

- Recursive definition (one-step look-ahead):

Policy Evaluation

- Idea one: turn recursive equations into updates
- Idea two: it’s just a linear system, solve with Matlab (or Mosek, or Cplex)

Example: High-Low

- Policy: always say “high”
- Iterative updates:

Optimal Utilities

- Goal: calculate the optimal utility of each state

V*(s) = expected (discounted) rewards with optimal actions

- Why: Given optimal utilities, MEU tells us the optimal policy

Bellman’s Equation for Selecting actions

- Definition of utility leads to a simple relationship amongst optimal utility values:

Optimal rewards = maximize over first action and then follow optimal policy

Formally: Bellman’s Equation

Value Iteration

- Idea:
- Start with bad guesses at all utility values (e.g. V0(s) = 0)
- Update all values simultaneously using the Bellman equation (called a value update or Bellman update):
- Repeat until convergence
- Theorem: will converge to unique optimal values
- Basic idea: bad guesses get refined towards optimal values
- Policy may converge long before values do

Example: Value Iteration

- Information propagates outward from terminal states and eventually all states have correct value estimates

[DEMO]

Convergence*

- Define the max-norm:
- Theorem: For any two approximations U and V (any two utility vectors)
- I.e. any distinct approximations must get closer to each other (after the Bellman update), so, in particular, any approximation must get closer to the true U (Bellman update is U) and value iteration converges to a unique, stable, optimal solution
- Theorem:
- I.e. once the change in our approximation is small, it must also be close to correct

Policy Iteration

- Alternate approach:
- Policy evaluation: calculate utilities for a fixed policy until convergence (remember the beginning of lecture)
- Policy improvement: update policy based on resulting converged utilities
- Repeat until policy converges
- This is policy iteration
- Can converge faster under some conditions

Policy Iteration

- If we have a fixed policy , use simplified Bellman equation to calculate utilities:
- For fixed utilities, easy to find the best action according to one-step look-ahead

Comparison

- In value iteration:
- Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)
- In policy iteration:
- Several passes to update utilities with frozen policy
- Occasional passes to update policies
- Hybrid approaches (asynchronous policy iteration):
- Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Reinforcement Learning

- Reinforcement learning:
- Still have an MDP:
- A set of states s S
- A model T(s,a,s’)
- A reward function R(s)
- Still looking for a policy (s)
- New twist: don’t know T or R
- I.e. don’t know which states are good or what the actions do
- Must actually try actions and states out to learn

Example: Animal Learning

- RL studied experimentally for more than 60 years in psychology
- Rewards: food, pain, hunger, drugs, etc.
- Mechanisms and sophistication debated
- Example: foraging
- Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies
- Bees have a direct neural connection from nectar intake measurement to motor planning area

Example: Backgammon

- Reward only for win / loss in terminal states, zero otherwise
- TD-Gammon learns a function approximation to U(s) using a neural network
- Combined with depth 3 search, one of the top 3 players in the world

Passive Learning

- Simplified task
- You don’t know the transitions T(s,a,s’)
- You don’t know the rewards R(s,a,s’)
- You are given a policy (s)
- Goal: learn the state values (and maybe the model)
- In this case:
- No choice about what actions to take
- Just execute the policy and learn from experience
- We’ll get to the general case soon

Example: Direct Estimation

y

- Episodes:

+100

(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) right -1

(4,2) right -100

(done)

-100

x

= 1, R = -1

U(1,1) ~ (93 + -105) / 2 = -6

U(3,3) ~ (100 + 98 + -101) / 3 = 32.3

Model-Based Learning

- Idea:
- Learn the model empirically (rather than values)
- Solve the MDP as if the learned model were correct
- Empirical model learning
- Simplest case:
- Count outcomes for each s,a
- Normalize to give estimate of T(s,a,s’)
- Discover R(s,a,s’) the first time we experience (s,a,s’)
- More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. “stationary noise”)

Example: Model-Based Learning

y

- Episodes:

+100

(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) right -1

(4,2) right -100

(done)

-100

x

= 1

T(<3,3>, right, <4,3>) = 1 / 3

T(<2,3>, right, <3,3>) = 2 / 2

Model-Based Learning

- In general, want to learn the optimal policy, not evaluate a fixed policy
- Idea: adaptive dynamic programming
- Learn an initial model of the environment:
- Solve for the optimal policy for this model (value or policy iteration)
- Refine model through experience and repeat
- Crucial: we have to make sure we actually learn about all of the model

Download Presentation

Connecting to Server..