
Monte-Carlo Planning: Policy Improvement - PowerPoint PPT Presentation



Monte-Carlo Planning: Policy Improvement. Alan Fern. Often a simulator of a planning domain is available or can be learned from data, for example in conservation planning or fire and emergency response.

Presentation Transcript

Monte-Carlo Planning:

Policy Improvement

Alan Fern


Monte-Carlo Planning

  • Often a simulator of a planning domain is available or can be learned from data

    • Example domains: conservation planning, fire and emergency response


Large Worlds: Monte-Carlo Approach

  • Often a simulator of a planning domain is available or can be learned from data

  • Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator

[Figure: the planner sends an action to the world simulator (standing in for the real world) and receives a state and reward in return.]


MDP: Simulation-Based Representation

  • A simulation-based representation gives: S, A, R, T, I:

    • finite state set S (|S| = n, generally very large)

    • finite action set A (|A| = m, assumed to be of reasonable size)

    • Stochastic, real-valued, bounded reward function R(s,a) = r

      • Stochastically returns a reward r given input s and a

    • Stochastic transition function T(s,a) = s’ (i.e. a simulator)

      • Stochastically returns a state s’ given input s and a

      • Probability of returning s’ is dictated by Pr(s’ | s,a) of MDP

    • Stochastic initial state function I.

      • Stochastically returns a state according to an initial state distribution

        These stochastic functions can be implemented in any language!
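
As one possible illustration of the representation above, here is a minimal Python sketch. The chain-shaped domain, the class name ChainMDP, and the slip parameter are hypothetical and are not part of the original slides; only the roles of I, T, and R follow the slide.

```python
import random

class ChainMDP:
    """Hypothetical 5-state chain: action 1 tends to move right, action 0 left."""

    def __init__(self, n_states=5, slip=0.2, seed=0):
        self.n = n_states
        self.slip = slip                 # probability that a move is reversed
        self.actions = [0, 1]
        self.rng = random.Random(seed)

    def I(self):
        """Initial state function: sample from the initial state distribution."""
        return self.rng.randrange(2)     # start in state 0 or 1

    def T(self, s, a):
        """Transition function: sample s' according to Pr(s' | s, a)."""
        step = 1 if a == 1 else -1
        if self.rng.random() < self.slip:
            step = -step
        return min(self.n - 1, max(0, s + step))

    def R(self, s, a):
        """Reward function: noisy, bounded reward in [0, 1], higher near the right end."""
        base = s / (self.n - 1)
        noise = self.rng.uniform(-0.1, 0.1)
        return min(1.0, max(0.0, base + noise))

mdp = ChainMDP()
s = mdp.I()
a = 1
r, s_next = mdp.R(s, a), mdp.T(s, a)
```

A planner only ever calls I, T, and R; it never needs the transition probabilities themselves.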


Outline

  • You already learned how to evaluate a policy given a simulator

    • Just run the policy multiple times for a finite horizon and average the total rewards

  • In the next two lectures we’ll learn how to select good actions
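
The evaluation procedure recalled above can be sketched as follows. This is only a sketch: the function name evaluate_policy and the toy simulator (integer states, actions 0 and 1) are assumptions made for illustration.

```python
import random

def evaluate_policy(I, T, R, policy, horizon, n_runs):
    """Average total reward of `policy` over n_runs simulated finite-horizon runs."""
    total = 0.0
    for _ in range(n_runs):
        s = I()                          # sample an initial state
        for _ in range(horizon):
            a = policy(s)
            total += R(s, a)
            s = T(s, a)
    return total / n_runs

# Hypothetical toy simulator: states are integers, actions are 0 or 1.
rng = random.Random(0)
I = lambda: 0
T = lambda s, a: s + a if rng.random() < 0.9 else s   # the move succeeds 90% of the time
R = lambda s, a: float(a)                              # reward 1 for action 1, else 0

value = evaluate_policy(I, T, R, policy=lambda s: 1, horizon=10, n_runs=100)
```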


Monte-Carlo Planning Outline

  • Single State Case (multi-armed bandits)

    • A basic tool for other algorithms

  • Monte-Carlo Policy Improvement

    • Policy rollout

    • Policy Switching

  • Monte-Carlo Tree Search

    • Sparse Sampling

    • UCT and variants



Single State Monte-Carlo Planning

  • Suppose MDP has a single state and k actions

    • Can sample rewards of actions using calls to simulator

    • Sampling action a is like pulling slot machine arm with random payoff function R(s,a)

[Figure: Multi-Armed Bandit Problem. A single state s with arms a1, a2, …, ak; pulling arm ai returns a random reward R(s,ai).]


Multi-Armed Bandits

  • We will use bandit algorithms as components for multi-state Monte-Carlo planning

    • But they are useful in their own right

  • Pure bandit problems arise in many applications

  • Applicable whenever:

    • We have a set of independent options with unknown utilities

    • There is a cost for sampling options or a limit on total samples

    • Want to find the best option or maximize utility of our samples


Multi-Armed Bandits: Examples

  • Clinical Trials

    • Arms = possible treatments

    • Arm Pulls = application of treatment to individual

    • Rewards = outcome of treatment

    • Objective = Find best treatment quickly (debatable)

  • Online Advertising

    • Arms = different ads/ad-types for a web page

    • Arm Pulls = displaying an ad upon a page access

    • Rewards = click through

    • Objective = find best ad quickly (then maximize clicks)


Simple Regret Objective

  • Different applications suggest different types of bandit objectives.

  • Today minimizing simple regret will be the objective

    • Simple Regret Minimization (informal): quickly identify arm with close to optimal expected reward


Simple Regret Objective: Formal Definition

  • Protocol: at time step n, based on all prior observations

    • Pick an “exploration” arm a_n, then pull it and observe reward r_n

    • Pick an “exploitation” arm index j_n that currently looks best (if the algorithm is stopped at time n it returns j_n); a_n, j_n and r_n are random variables

  • Let R* be the expected reward of the truly best arm

  • Expected Simple Regret E[SReg_n]: the difference between R* and the expected reward of the arm j_n selected by our strategy at time n

    • E[SReg_n] = R* - E[R(a_{j_n})]


UniformBandit Algorithm (or Round Robin)

UniformBandit Algorithm:

  • At round n pull arm with index (n mod k) + 1

  • At round n return arm (if asked) with largest average reward

    • I.e. j_n is the index of the arm with the best average so far

Theorem: The expected simple regret of Uniform after n arm pulls is upper bounded by O(exp(-cn)) for a constant c.

  • This bound is exponentially decreasing in n!

    • So even this simple algorithm has a provably small simple regret.

Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832-1852.
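
A minimal Python sketch of the round-robin bandit, assuming arms are indexed from 0 and that a caller-supplied function pull(arm) returns one sampled reward. The function name and the Bernoulli test arms are hypothetical.

```python
import random

def uniform_bandit(pull, k, n):
    """Pull the k arms round-robin for n rounds; return the index with the best average."""
    sums = [0.0] * k
    counts = [0] * k
    for t in range(n):
        arm = t % k                      # round robin over arm indices 0..k-1
        sums[arm] += pull(arm)
        counts[arm] += 1
    # Arms that were never pulled (possible only when n < k) are ranked last.
    means = [sums[i] / counts[i] if counts[i] else float("-inf") for i in range(k)]
    return max(range(k), key=lambda i: means[i])

# Hypothetical Bernoulli arms with success probabilities 0.2, 0.5 and 0.8.
rng = random.Random(0)
true_means = [0.2, 0.5, 0.8]
pull = lambda arm: 1.0 if rng.random() < true_means[arm] else 0.0
best = uniform_bandit(pull, k=3, n=300)
```

With n = 300 each arm is pulled 100 times, so the empirical averages are usually close enough to separate the arms.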


Can we do better?

Algorithm ε-GreedyBandit: (parameter 0 < ε < 1)

  • At round n, with probability ε pull arm with best average reward so far, otherwise pull one of the other arms at random.

  • At round n return arm (if asked) with largest average reward

Theorem: The expected simple regret of ε-Greedy for ε = 0.5 after n arm pulls is upper bounded by O(exp(-cn)) for a constant c that is larger than the constant for Uniform (this holds for “large enough” n).

Often ε-GreedyBandit is more effective than UniformBandit in practice.

Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.
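
The following sketch shows one way the ε-greedy bandit could be written in Python. Two details are assumptions not stated on the slide: arms that have not been pulled yet are treated as having an infinite average (so every arm gets tried), and the function also returns the pull counts for inspection.

```python
import random

def epsilon_greedy_bandit(pull, k, n, epsilon=0.5, rng=None):
    """With probability epsilon pull the empirically best arm, else another arm at random."""
    if n < 1:
        raise ValueError("need at least one pull")
    rng = rng or random.Random(0)
    sums = [0.0] * k
    counts = [0] * k

    def mean(i):
        # Assumption: unpulled arms count as +inf so that each arm gets tried.
        return sums[i] / counts[i] if counts[i] else float("inf")

    for _ in range(n):
        best = max(range(k), key=mean)
        if k == 1 or rng.random() < epsilon:
            arm = best
        else:
            arm = rng.choice([i for i in range(k) if i != best])
        sums[arm] += pull(arm)
        counts[arm] += 1

    # Return the arm with the largest average among the arms actually pulled.
    pulled = [i for i in range(k) if counts[i] > 0]
    return max(pulled, key=lambda i: sums[i] / counts[i]), counts

arm, counts = epsilon_greedy_bandit(lambda a: [0.1, 0.9, 0.5][a], k=3, n=30)
```

Unlike round robin, the pull counts are uneven: the arm that currently looks best receives about half of the pulls when ε = 0.5.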




Policy Improvement via Monte-Carlo

  • Now consider a very large multi-state MDP.

  • Suppose we have a simulator and a non-optimal policy

    • E.g. policy could be a standard heuristic or based on intuition

  • Can we somehow compute an improved policy?

[Figure: the planner, using the world simulator plus a base policy, sends an action to the real world and receives a state and reward in return.]


Policy Improvement Theorem

  • Definition: The Q-value function Qπ(s,a,h) gives the expected future reward of starting in state s, taking action a, and then following policy π until the horizon h.

    • How good is it to execute π after taking action a in state s?

  • Define: π'(s) = argmax_a Qπ(s,a,h)

  • Theorem [Howard, 1960]: For any non-optimal policy π the policy π' is strictly better than π.

    • So if we can compute π'(s) at any state we encounter, then we can execute an improved policy

  • Can we use bandit algorithms to compute π'(s)?


Policy Improvement via Bandits

[Figure: a bandit at state s with arms a1, a2, …, ak; pulling arm ai returns a sample of SimQ(s,ai,π,h).]

  • Idea: define a stochastic function SimQ(s,a,π,h) that we can implement and whose expected value is Qπ(s,a,h)

  • Then use a bandit algorithm to select (approximately) the action with the best Q-value (i.e. the action π'(s) = argmax_a Qπ(s,a,h))

How to implement SimQ?


Policy Improvement via Bandits

  • SimQ(s,a,π,h)

    • q = R(s,a)   (simulate a in s)

    • s = T(s,a)

    • for i = 1 to h-1   (simulate h-1 steps of policy)

      • q = q + R(s, π(s))

      • s = T(s, π(s))

    • Return q

[Figure: from state s, each action a1, a2, …, ak is followed by a trajectory under π; the sum of rewards along the trajectory for action ai is one sample of SimQ(s,ai,π,h).]



  • Simply simulate taking a in s and following policy π for h-1 steps, returning the sum of rewards (discounted, if the problem uses a discount factor)

  • Expected value of SimQ(s,a,π,h) is Qπ(s,a,h)

    • So the average across multiple runs of SimQ converges to Qπ(s,a,h) as the number of runs grows
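
A direct Python transcription of SimQ, offered as a sketch. It assumes an undiscounted sum and that the simulator is passed in as two callables T and R; the helper estimate_q and the toy simulator are hypothetical.

```python
import random

def sim_q(s, a, policy, h, T, R):
    """One sample of SimQ(s,a,pi,h): take a in s, then follow policy for h-1 steps."""
    q = R(s, a)                          # simulate a in s
    s = T(s, a)
    for _ in range(h - 1):               # simulate h-1 steps of the policy
        a_pi = policy(s)
        q += R(s, a_pi)
        s = T(s, a_pi)
    return q

def estimate_q(s, a, policy, h, T, R, n_samples):
    """Average n_samples runs of SimQ to estimate the Q-value of (s, a)."""
    return sum(sim_q(s, a, policy, h, T, R) for _ in range(n_samples)) / n_samples

# Hypothetical toy simulator: deterministic moves, noisy reward centred on the state index.
rng = random.Random(0)
T = lambda s, a: s + a
R = lambda s, a: s + rng.uniform(-0.5, 0.5)

q_hat = estimate_q(0, 1, lambda s: 1, 3, T, R, n_samples=200)
```

In this toy domain the noise-free return of taking action 1 in state 0 and then always playing 1 for h = 3 is 0 + 1 + 2 = 3, so q_hat should land near 3.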


Policy Improvement via Bandits

  • Now apply your favorite bandit algorithm for simple regret

  • UniformRollout: use UniformBandit

    • Parameters: number of trials n and horizon/height h

  • ε-GreedyRollout: use ε-GreedyBandit

    • Parameters: number of trials n, ε, and horizon/height h (ε = 0.5 often is a good choice)
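
Putting the pieces together, UniformRollout might look like the sketch below: a round-robin bandit whose arms are the actions and whose rewards are SimQ samples. Function names and the toy simulator are hypothetical, and rewards are summed without discounting.

```python
import random

def sim_q(s, a, policy, h, T, R):
    """One sample of SimQ(s,a,pi,h)."""
    q = R(s, a)
    s = T(s, a)
    for _ in range(h - 1):
        a_pi = policy(s)
        q += R(s, a_pi)
        s = T(s, a_pi)
    return q

def uniform_rollout(s, actions, policy, h, n, T, R):
    """Spend n SimQ trials round-robin over the actions; return the best on average."""
    k = len(actions)
    sums = [0.0] * k
    counts = [0] * k
    for t in range(n):
        i = t % k
        sums[i] += sim_q(s, actions[i], policy, h, T, R)
        counts[i] += 1
    avg = lambda i: sums[i] / counts[i] if counts[i] else float("-inf")
    return actions[max(range(k), key=avg)]

# Hypothetical toy simulator: action 1 moves right with probability 0.8, reward = state index.
rng = random.Random(0)
T = lambda s, a: s + a if rng.random() < 0.8 else s
R = lambda s, a: float(s)
base_policy = lambda s: 0                # a deliberately weak base policy

chosen = uniform_rollout(0, [0, 1], base_policy, h=3, n=20, T=T, R=R)
```

An ε-GreedyRollout variant would replace only the round-robin choice of i with the ε-greedy arm selection.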


UniformRollout

  • Each action is tried roughly the same number of times (approximately w = n/k times)

[Figure: from state s, each action ai gets w SimQ(s,ai,π,h) trajectories, each simulating action ai and then following π for h-1 steps; the samples for action ai are qi1, qi2, …, qiw.]


ε-GreedyRollout

  • Allocates a non-uniform number of trials across actions (focuses on more promising actions)

  • For ε = 0.5 we might expect it to be better than UniformRollout for the same value of n.

[Figure: from state s, the actions receive different numbers of SimQ samples, e.g. q11 … q1u for a1, q21 … q2v for a2, and only qk1 for ak.]


Executing Rollout in Real World

[Figure: the real-world state/action sequence. At each real state, policy rollout is run on simulated experience over actions a1, a2, …, ak, and the selected action (e.g. a2, then ak) is executed in the real world.]

How much time does each decision take?


Policy Rollout: # of Simulator Calls

[Figure: from state s, SimQ(s,ai,π,h) trajectories for actions a1, a2, …, ak; each simulates taking action ai and then following π for h-1 steps.]

  • Total of n SimQ calls, each using h calls to the simulator and the policy

  • Total of hn calls to the simulator and to the policy (dominates time to make decision)


Practical Issues: Accuracy

  • Selecting number of trajectories n

    • n should be at least as large as the number of available actions (so each is tried at least once)

    • In general n needs to be larger as the randomness of the simulator increases (so each action gets tried a sufficient number of times)

    • Rule-of-Thumb: start with n set so that each action can be tried approximately 5 times and then see the impact of decreasing/increasing n

  • Selecting height/horizon h of trajectories

    • A common option is to just select h to be the same as the horizon of the problem being solved

    • Suggestion: setting h = -1 in our framework, which will run all trajectories until the simulator hits a terminal state

    • Using a smaller value of h can sometimes be effective if enough reward is accumulated to give a good estimate of Q-values

In general, larger values of n and h are better, but they increase decision time.


Practical Issues: Speed

  • There are three ways to speed up decision-making time

    • Use a faster policy

    • Decrease the number of trajectories n

  • Decreasing Trajectories:

    • If n is small compared to # of actions k, then performance could be poor since actions don’t get tried very often

    • One way to get away with a smaller n is to use an action filter

  • Action Filter: a function f(s) that returns a subset of the actions in state s that rollout should consider

    • You can use your domain knowledge to filter out obviously bad actions

    • Rollout decides among the remaining actions returned by f(s)

    • Since rollout only tries actions in f(s), we can use a smaller value of n
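
An action filter can be slotted into rollout as sketched below. The filter f, the function names, and the toy simulator are hypothetical; the only change from plain rollout is that the bandit's arms come from f(s).

```python
def sim_q(s, a, policy, h, T, R):
    """One sample of SimQ(s,a,pi,h)."""
    q = R(s, a)
    s = T(s, a)
    for _ in range(h - 1):
        a_pi = policy(s)
        q += R(s, a_pi)
        s = T(s, a_pi)
    return q

def filtered_rollout(s, f, policy, h, n, T, R):
    """Uniform rollout restricted to the actions returned by the filter f(s)."""
    actions = f(s)                       # only these actions receive SimQ trials
    k = len(actions)
    sums = [0.0] * k
    counts = [0] * k
    for t in range(n):
        i = t % k
        sums[i] += sim_q(s, actions[i], policy, h, T, R)
        counts[i] += 1
    avg = lambda i: sums[i] / counts[i] if counts[i] else float("-inf")
    return actions[max(range(k), key=avg)]

# Hypothetical toy simulator: actions 0, 1, 2 move that many cells right; reward = state index.
T = lambda s, a: s + a
R = lambda s, a: float(s)
base_policy = lambda s: 0

all_actions = lambda s: [0, 1, 2]
no_big_jumps = lambda s: [0, 1]          # hypothetical domain rule excluding action 2

unfiltered = filtered_rollout(0, all_actions, base_policy, h=2, n=6, T=T, R=R)
filtered = filtered_rollout(0, no_big_jumps, base_policy, h=2, n=4, T=T, R=R)
```

With the filter, the same number of trials per action is reached with n = 4 instead of n = 6. The example also shows the risk: a filter that wrongly excludes a good action (here action 2) caps the quality of the decision.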


Practical Issues: Speed

  • There are three ways to speed up either rollout procedure

    • Use a faster policy

    • Decrease the number of trajectories n

    • Decrease the horizon h

  • Decrease Horizon h:

    • If h is too small compared to the “real horizon” of the problem, then the Q-estimates may not be accurate

    • Can get away with a smaller h by using a value estimation heuristic

  • Heuristic function: a heuristic function v(s) returns an estimate of the value of state s

    • SimQ is adjusted to run the policy for h steps, ending in state s’, and to return the sum of rewards up to s’ plus the estimate v(s’)
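
The adjusted SimQ might be written as in this sketch, which assumes that the h steps consist of the initial action followed by h-1 policy steps. The heuristic v and the toy simulator are hypothetical.

```python
def sim_q_with_heuristic(s, a, policy, h, T, R, v):
    """SimQ truncated at horizon h, with v(s') added for the final state s'."""
    q = R(s, a)
    s = T(s, a)
    for _ in range(h - 1):
        a_pi = policy(s)
        q += R(s, a_pi)
        s = T(s, a_pi)
    return q + v(s)                      # heuristic stands in for rewards beyond the horizon

# Hypothetical toy simulator and heuristic.
T = lambda s, a: s + a
R = lambda s, a: float(s)
v = lambda s: 10.0 * s                   # guess: value-to-go grows with the state index

q = sim_q_with_heuristic(0, 1, lambda s: 1, 2, T, R, v)
```

Here the two simulated steps collect rewards 0 and 1 and end in state 2, so the result is 1 + v(2) = 21.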


Multi-Stage Rollout

  • A single call to Rollout[π,h,w](s) yields one iteration of policy improvement starting at policy π

  • We can use more computation time to yield multiple iterations of policy improvement via nesting calls to Rollout

    • Rollout[Rollout[π,h,w],h,w](s) returns the action for state s resulting from two iterations of policy improvement

    • Can nest this arbitrarily

  • Gives a way to use more time in order to improve performance


Multi-Stage Rollout

[Figure: from state s, each action a1, a2, …, ak gets trajectories of SimQ(s,ai,Rollout[π,h,w],h); each step of a trajectory requires nh simulator calls to evaluate the Rollout policy.]

  • Two stage: compute rollout policy of “rollout policy of π”

  • Requires on the order of (nh)^2 calls to the simulator for 2 stages

  • In general exponential in the number of stages
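
To make the nesting and its cost concrete, the sketch below builds a one-stage and a two-stage rollout policy and counts calls to the transition simulator. It assumes uniform rollout with n total trials per decision; all names and the toy simulator are hypothetical.

```python
def sim_q(s, a, policy, h, T, R):
    """One sample of SimQ(s,a,pi,h)."""
    q = R(s, a)
    s = T(s, a)
    for _ in range(h - 1):
        a_pi = policy(s)                 # for nested rollout this call is itself expensive
        q += R(s, a_pi)
        s = T(s, a_pi)
    return q

def rollout(policy, actions, h, n, T, R):
    """Return the uniform rollout policy built on `policy`, as a function of the state."""
    def improved(s):
        k = len(actions)
        sums = [0.0] * k
        counts = [0] * k
        for t in range(n):
            i = t % k
            sums[i] += sim_q(s, actions[i], policy, h, T, R)
            counts[i] += 1
        avg = lambda i: sums[i] / counts[i] if counts[i] else float("-inf")
        return actions[max(range(k), key=avg)]
    return improved

calls = {"T": 0}                         # counts transition-simulator calls

def T(s, a):
    calls["T"] += 1
    return s + a

R = lambda s, a: float(s)
base = lambda s: 0

one_stage = rollout(base, [0, 1], h=3, n=4, T=T, R=R)
two_stage = rollout(one_stage, [0, 1], h=3, n=4, T=T, R=R)

calls["T"] = 0
a1 = one_stage(0)
one_stage_calls = calls["T"]             # n * h = 12

calls["T"] = 0
a2 = two_stage(0)
two_stage_calls = calls["T"]             # 108, below the (n * h)^2 = 144 bound
```

Each of the h-1 policy steps inside a two-stage trajectory triggers a full one-stage rollout, which is where the roughly quadratic cost comes from.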


Example: Rollout for Solitaire [Yan et al. NIPS’04]

  • Multiple levels of rollout can pay off but are expensive


Rollout in 2-Player Games

  • SimQ simply uses the base policy π to select moves for both players until the horizon

  • Rollout is biased toward playing well against π

  • Is this ok?

[Figure: from state s, players p1 and p2 alternate moves along each trajectory; action ai yields samples qi1, qi2, …, qiw.]


Another Useful Technique: Policy Switching

  • Suppose you have a set of base policies {π1, π2,…, πM}

  • Also suppose that the best policy to use can depend on the specific state of the system and we don’t know how to select.

  • Policy switching is a simple way to select which policy to use at a given step via a simulator


Another Useful Technique: Policy Switching

[Figure: a bandit at state s with arms π1, π2, …, πM; pulling arm πi returns a sample of Sim(s,πi,h).]

  • The stochastic function Sim(s,π,h) simply samples the h-horizon value of π starting in state s

  • Implement by simply simulating π starting in s for h steps and returning discounted total reward

  • Use Bandit algorithm to select best policy and then select action chosen by that policy



PolicySwitching

PolicySwitch[{π1, π2,…, πM},h,n](s)

  • Define bandit with M arms giving rewards Sim(s,πi,h)

  • Let i* be index of the arm/policy selected by your favorite bandit algorithm using n trials

  • Return action πi*(s)
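
PolicySwitch could be sketched in Python as follows, using round-robin sampling as the bandit algorithm and an undiscounted sum of rewards (both are assumptions). The toy simulator, in which the rewarded action depends on the parity of the state, is hypothetical.

```python
def sim(s, policy, h, T, R):
    """One sample of Sim(s,pi,h): follow policy for h steps and sum the rewards."""
    total = 0.0
    for _ in range(h):
        a = policy(s)
        total += R(s, a)
        s = T(s, a)
    return total

def policy_switch(policies, h, n, T, R):
    """Return the switching policy over `policies`, as a function of the state."""
    def switching_policy(s):
        m = len(policies)
        sums = [0.0] * m
        counts = [0] * m
        for t in range(n):               # uniform bandit over the M policies
            i = t % m
            sums[i] += sim(s, policies[i], h, T, R)
            counts[i] += 1
        avg = lambda i: sums[i] / counts[i] if counts[i] else float("-inf")
        best = max(range(m), key=avg)
        return policies[best](s)         # act as the selected policy would in s
    return switching_policy

# Hypothetical toy simulator: the state never changes; action 0 pays in even states,
# action 1 pays in odd states.
T = lambda s, a: s
R = lambda s, a: 1.0 if a == s % 2 else 0.0
always0 = lambda s: 0
always1 = lambda s: 1

pi = policy_switch([always0, always1], h=5, n=4, T=T, R=R)
action_even, action_odd = pi(2), pi(3)
```

Neither base policy is good everywhere, but the switching policy picks always0 in even states and always1 in odd states.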

[Figure: from state s, each policy πi gets w Sim(s,πi,h) trajectories, each simulating πi for h steps; the samples for policy πi are the discounted cumulative rewards vi1, vi2, …, viw.]


Executing Policy Switching in Real World

[Figure: the real-world state/action sequence. At each real state, policy switching is run on simulated experience over π1, π2, …, πk, and the action of the selected policy (e.g. π2(s), then πk(s’)) is executed in the real world.]


Policy Switching: Quality

  • Let πps denote the ideal switching policy

    • Always pick the best policy index at any state

  • The value of the switching policy is at least as good as the best single policy in the set

    • It will often perform better than any single policy in the set.

    • For the non-ideal case, where the bandit algorithm only picks approximately the best arm, we can add an error term to the bound.

Theorem: For any state s, Vπps(s) ≥ max_i Vπi(s), where Vπ(s) denotes the value of policy π in state s.


Policy Switching in 2-Player Games

Suppose we have two sets of policies, one for each player.

Max Policies (us): {π1, π2, …, πM}

Min Policies (them): {φ1, φ2, …, φN}

These policy sets will often be the same when the players have the same action sets.

Policies encode our knowledge of what the possible effective strategies might be in the game.

But we might not know exactly when each strategy will be most effective.


Minimax Policy Switching

[Figure: the current state s and the game simulator are used to build a game matrix with one row per max policy and one column per min policy.]

Each entry gives the estimated value (for the max player) of playing a policy pair against one another.

Each value is estimated by averaging across w simulated games.


MaxiMin Switching

[Figure: the current state s and the game simulator are used to build the game matrix; the MaxiMin policy is read off the matrix and used to select the action.]

Can switch between policies based on state of game!
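
A minimal sketch of maximin switching, assuming a caller-supplied function simulate_game(s, max_policy, min_policy) that plays one simulated game to the horizon and returns its value for the max player. That function, the payoff table, and all names below are hypothetical.

```python
def maximin_switch(s, max_policies, min_policies, simulate_game, w):
    """Build the game matrix at state s and act with the maximin row policy."""
    # matrix[i][j]: average value (for max) of max policy i against min policy j.
    matrix = [
        [sum(simulate_game(s, p, q) for _ in range(w)) / w for q in min_policies]
        for p in max_policies
    ]
    # Maximin: choose the row whose worst-case column value is largest.
    i_star = max(range(len(max_policies)), key=lambda i: min(matrix[i]))
    return max_policies[i_star](s), matrix

# Hypothetical one-shot game: each policy always plays a fixed action, and the value
# for the max player is read from a fixed payoff table.
payoff = [[3.0, 1.0],
          [4.0, 0.0]]
max_policies = [lambda s: 0, lambda s: 1]
min_policies = [lambda s: 0, lambda s: 1]

def simulate_game(s, p, q):
    return payoff[p(s)][q(s)]

action, matrix = maximin_switch(0, max_policies, min_policies, simulate_game, w=3)
```

Row 0 has worst case 1 and row 1 has worst case 0, so the maximin choice is the first max policy, even though row 1 contains the single largest entry.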


MaxiMin Switching

[Figure: the current state s and the game simulator are used to build the game matrix.]

Parameters in Library Implementation:

Policy Sets: {π1, π2, …, πM}, {φ1, φ2, …, φN}

Sampling Width w: number of simulations per policy pair

Height/Horizon h: horizon used for simulations

