### Decision Making in Intelligent Systems Lecture 3

BSc course Kunstmatige Intelligentie 2008

Bram Bakker

Intelligent Systems Lab Amsterdam

Informatics Institute

Universiteit van Amsterdam

bram@science.uva.nl

Overview of this lecture
• Solving the full RL problem
• Given that we have the MDP model
• Dynamic Programming
• Policy iteration
• Value iteration

Markov Decision Processes

[Figure: the agent-environment interaction as a trajectory of states, actions, and rewards: $\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, \ldots$]
Returns: A Unified Notation
• In episodic tasks, we number the time steps of each episode starting from zero.
• Think of each episode as ending in an absorbing state that always produces a reward of zero.
• We can cover all cases by writing $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma = 1$ is allowed only if an absorbing (terminal) state is always reached.
Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent’s policy: $V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \}$
• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π: $Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \}$
Bellman Equation for a Policy π

The basic idea: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}$

So: $V^{\pi}(s) = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}$

Or, without the expectation operator: $V^{\pi}(s) = \sum_a \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]$

Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state: $V^{*}(s) = \max_a \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right]$

$V^{*}$ is the unique solution of this system of nonlinear equations.

Dynamic Programming (DP)
• A collection of classical solution methods for MDPs
• Policy iteration
• Value iteration
• DP can be used to compute value functions, and hence, optimal policies
• Assumes a known MDP model (state transition model and reward model); a small sketch of such a model follows after this list
• Combination of Policy Evaluation and Policy Improvement
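
What such a known tabular model can look like in code, as a minimal illustrative sketch; the dictionary layout and the names `model`, `gamma`, and `expected_backup` are assumptions of this sketch, not anything from the slides:

```python
# A tabular MDP model as DP assumes it is given:
# model[s][a] = list of (probability, next_state, reward) triples.
# The tiny two-state MDP below is purely illustrative.
model = {
    0: {"stay": [(1.0, 0, 0.0)],
        "move": [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)],
        "move": [(1.0, 0, -1.0)]},
}
gamma = 0.9  # discount factor

def expected_backup(values, s, a):
    """Expected one-step return of action a in state s: sum over successor states s'."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in model[s][a])
```

The DP backups in the rest of the lecture are built from exactly this kind of expected one-step lookahead.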
Iterative Methods

A “sweep” consists of applying a backup operation to each state, and results in a new value function for iteration k+1.

A full policy-evaluation backup: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V_k(s') \right]$ for all $s \in S$.
Bootstrapping

A full policy-evaluation backup: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V_k(s') \right]$

The new estimated value for each state s is based on the old estimated values of all possible successor states s'.

Bootstrapping: estimating values based on other estimated values (here, the estimates of successor states).
A Small Gridworld
• Nonterminal states: 1, 2, . . ., 14;
• Terminal states shown as shaded squares
• Actions that would take the agent off the grid leave the state unchanged
• Reward is –1 until a terminal state is reached (a policy-evaluation sketch for this gridworld follows below)
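
As a concrete illustration, here is a minimal sketch of iterative policy evaluation for the equiprobable random policy on this gridworld. It assumes the standard 4×4 layout with two terminal corner cells (numbered 0 and 15 below) and an undiscounted episodic task; all names (`step`, `evaluate_random_policy`, and so on) are illustrative rather than taken from the slides.

```python
import numpy as np

# Iterative policy evaluation on the small gridworld, assuming a standard
# 4x4 layout: cells 0..15 in row-major order, with cells 0 and 15 terminal.
N = 4
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic transition: moves that would leave the grid leave the state unchanged."""
    row, col = divmod(state, N)
    dr, dc = action
    r2, c2 = row + dr, col + dc
    if not (0 <= r2 < N and 0 <= c2 < N):
        r2, c2 = row, col
    return r2 * N + c2, -1.0  # (next state, reward of -1 per step)

def evaluate_random_policy(theta=1e-6, gamma=1.0):
    """Full sweeps of the policy-evaluation backup for the equiprobable random policy."""
    V = np.zeros(N * N)
    while True:
        V_new = V.copy()
        for s in range(N * N):
            if s in TERMINAL:
                continue
            # Full backup: average the one-step lookahead over the 4 equiprobable actions.
            V_new[s] = sum(0.25 * (r + gamma * V[s2])
                           for s2, r in (step(s, a) for a in ACTIONS))
        if np.max(np.abs(V_new - V)) < theta:
            return V_new.reshape(N, N)
        V = V_new

print(evaluate_random_policy())
```

The two-array update mirrors the $V_k \rightarrow V_{k+1}$ form of the full backup above; sweeping in place works as well and typically converges faster.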
Policy Iteration

Alternate policy evaluation (compute $V^{\pi}$ for the current policy $\pi$) and policy improvement (“greedification”: make the policy greedy with respect to $V^{\pi}$) until the policy no longer changes.
Value Iteration

Recall the full policy-evaluation backup: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V_k(s') \right]$

Here is the full value-iteration backup: $V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V_k(s') \right]$

Essentially, this combines policy evaluation and policy improvement in one step.
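
The same gridworld with the value-iteration backup, again reusing the helpers from the policy-evaluation sketch: the max over actions replaces the policy-weighted sum, so each backup performs evaluation and improvement at once. The greedy-policy extraction at the end is one common way to read off a policy from the converged values.

```python
def value_iteration(theta=1e-6, gamma=1.0):
    """Full value-iteration backups: V(s) <- max_a [r + gamma * V(s')]."""
    V = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            v_new = max(r + gamma * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break

    def greedy_action(s):
        returns = [r + gamma * V[s2] for s2, r in (step(s, a) for a in ACTIONS)]
        return returns.index(max(returns))

    # Extract a greedy (optimal) policy from the converged values.
    policy = {s: greedy_action(s) for s in range(N * N) if s not in TERMINAL}
    return V.reshape(N, N), policy
```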

Monte Carlo

[Figure: backup diagrams ending in terminal states (T).]
Asynchronous DP
• All the DP methods described so far require exhaustive sweeps of the entire state set.
• Asynchronous DP does not use sweeps. Instead it works like this:
• Repeat until convergence criterion is met:
• Pick a state at random and apply the appropriate backup
• Still need lots of computation, but does not get locked into hopelessly long sweeps
• Can you select states to back up intelligently? YES: an agent’s experience can act as a guide (a sketch of random-order backups follows below).
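
A rough sketch of this idea on the same gridworld, reusing the earlier helpers: back up one randomly chosen state at a time instead of sweeping. The stopping rule (a long run of backups with negligible change) is a simplistic stand-in for a real convergence criterion, and the names are assumptions of the sketch.

```python
import random

def asynchronous_value_iteration(theta=1e-6, gamma=1.0, quiet_limit=1000, seed=0):
    """Back up one randomly chosen state at a time instead of sweeping all states."""
    rng = random.Random(seed)
    V = np.zeros(N * N)
    nonterminal = [s for s in range(N * N) if s not in TERMINAL]
    quiet = 0  # consecutive backups whose change stayed below theta
    while quiet < quiet_limit:
        s = rng.choice(nonterminal)
        v_new = max(r + gamma * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
        quiet = quiet + 1 if abs(v_new - V[s]) < theta else 0
        V[s] = v_new
    return V.reshape(N, N)
```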
Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.

[Figure: a geometric metaphor for the convergence of GPI.]
Efficiency of DP
• Finding an optimal policy takes time polynomial in the number of states…
• BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called “the curse of dimensionality”).
• In practice, classical DP can be applied to problems with up to a few million states.
• Asynchronous DP can be applied to larger problems, and is well suited to parallel computation.
• It is surprisingly easy to come up with MDPs for which DP methods are not practical.
Summary
• Policy evaluation: estimate the value function of the current policy
• Policy improvement: determine a new policy by being greedy w.r.t. the current value function
• Policy iteration: alternate the above two processes
• Value iteration: combine the above two processes in one step
• Bootstrapping: updating estimates based on your own estimates
• Full backups (to be contrasted with sample backups)
• Generalized Policy Iteration (GPI)
Before we start…
• Questions? Are some concepts still unclear?
• Are you making progress in the Prakticum?
• Advice: I do not cover everything from the book in detail. Issues and algorithms that I emphasize in the lectures are the most important (also for exams).
Next Class
• Lecture next Monday:
• Chapters 5 (& 6?) of Sutton & Barto