
# Solving Large Markov Decision Processes






### Solving Large Markov Decision Processes

Yilan Gu

Dept. of Computer Science

University of Toronto

April 12, 2004

### Outline

• Introduction: what’s the problem?

• Temporal abstraction

• Logical representation of MDPs

• Potential future directions

### Markov Decision Processes (MDPs)

• Decision-theoretic planning and learning problems are often modeled as MDPs.

• An MDP is a model M = < S, A, T, R > consisting of

• a set of environment states S,

• a set of actions A,

• a transition function T: S × A × S → [0, 1],

T(s, a, s′) = Pr(s′ | s, a),

• a reward function R: S × A → ℝ.

• A policy is a function π: S → A.

• Expected cumulative reward -- the value function Vπ: S → ℝ.

The Bellman Eq.: Vπ(s) = R(s, π(s)) + γ Σs′ T(s, π(s), s′) Vπ(s′)
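As a concrete illustration (not from the slides), here is a minimal Python sketch of an explicitly enumerated MDP and one way to evaluate a fixed policy with the Bellman equation; the dictionary encodings, γ = 0.9, and all function names are assumptions.

```python
# Illustrative sketch only: an explicit MDP <S, A, T, R> as dictionaries, and
# iterative policy evaluation with the Bellman equation
#   V(s) = R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') * V(s').
def evaluate_policy(states, T, R, policy, gamma=0.9, sweeps=1000, tol=1e-8):
    """T[(s, a, s2)] = Pr(s2 | s, a); R[(s, a)] = reward; policy[s] = action."""
    V = {s: 0.0 for s in states}                 # start from an arbitrary V
    for _ in range(sweeps):
        delta = 0.0
        for s in states:
            a = policy[s]
            v = R[(s, a)] + gamma * sum(T.get((s, a, s2), 0.0) * V[s2]
                                        for s2 in states)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:                          # value of this policy has converged
            break
    return V
```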


S = {(1,1), (1,2), …,(8,8)}

A = {up, down, left, right}

e.g.,

T((2,2),up,(1,2)) = 0.8, T((2,2),up,(2,1))=0.1,

T((2,2),up,(2,3)) = 0.1,

T((2,2), up, s′) = 0 for s′ ∉ {(1,2), (2,1), (2,3)}

……

R((1,8)) = 1, R(s) = −1 for s ≠ (1,8).

Fig. The 8×8 grid world: taking up in (2,2) reaches (1,2) with probability 0.8 and slips to (2,1) or (2,3) with probability 0.1 each; the goal cell (1,8) gives reward +1.

Notice: this is an explicit (enumerated) representation of the model.
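To see what such an explicit representation looks like in code, here is a hypothetical construction of the grid world's tables in Python; the "stay put at a wall" rule and all variable names are assumptions beyond what the slide states.

```python
# Hypothetical explicit encoding of the 8x8 grid world as enumerated tables.
states = [(r, c) for r in range(1, 9) for c in range(1, 9)]
state_set = set(states)
actions = ["up", "down", "left", "right"]

def step(s, a):
    """Deterministic successor of the intended move (stay put at a wall)."""
    r, c = s
    r2, c2 = {"up": (r - 1, c), "down": (r + 1, c),
              "left": (r, c - 1), "right": (r, c + 1)}[a]
    return (r2, c2) if (r2, c2) in state_set else s

# T[(s, a, s2)] = Pr(s2 | s, a): 0.8 for the intended move, 0.1 for each sideways
# slip, matching T((2,2),up,(1,2)) = 0.8, T((2,2),up,(2,1)) = T((2,2),up,(2,3)) = 0.1.
side = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}
T = {}
for s in states:
    for a in actions:
        for a2, p in [(a, 0.8), (side[a][0], 0.1), (side[a][1], 0.1)]:
            key = (s, a, step(s, a2))
            T[key] = T.get(key, 0.0) + p

R = {s: (1.0 if s == (1, 8) else -1.0) for s in states}  # +1 only at the goal cell

print(len(T))   # several hundred entries already, just for this 64-state toy world
```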

• Goal: find an optimal policy π* such that

V*(s) = Vπ*(s) ≥ Vπ(s) for all s ∈ S and all policies π.

• Conventional algorithms

• Dynamic programming: value iteration and policy iteration,

• Decision tree search algorithm, etc.

• Example: Value iteration

Begin with an arbitrary V0;

In each iteration n > 0: for every s ∈ S,

Qn(s, a) := R(s, a) + γ Σs′ T(s, a, s′) Vn−1(s′) for every action a;

Vn(s) := maxa Qn(s, a);

As n → ∞, Vn(s) → V*(s).

• Problem: it does not scale up!
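A minimal tabular value-iteration sketch, assuming dictionary encodings like those above with R keyed on (s, a); the explicit sweep over every state and action in each iteration is precisely what fails to scale.

```python
# Sketch of the value iteration loop from the slide; the table encodings,
# gamma, and the stopping threshold are assumptions.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                       # arbitrary V_0
    while True:
        V_new = {}
        for s in states:                               # explicit sweep over S
            # Q_n(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V_{n-1}(s')
            q = {a: R.get((s, a), 0.0) +
                    gamma * sum(T.get((s, a, s2), 0.0) * V[s2] for s2 in states)
                 for a in actions}
            V_new[s] = max(q.values())                 # V_n(s) = max_a Q_n(s, a)
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new                               # V_n approaches V*
        V = V_new
```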

• Temporal abstraction approaches (basic idea)

• Solving MDPs hierarchically

• Using complex actions or subtasks to reduce the effective size of the state space

• Representing and solving MDPs in a logical way (basic idea)

• Logically representing environment features

• Aggregating ‘similar’ states

• Representing the effects of actions compactly using logical structures, and eliminating unaffected features during reasoning

Example


• Partition {S1, S2, S3, S4}

• A macro-action -- a local policy πi: Si → A on region Si

• E.g., EPer(S1) = {(3,5), (5,3)}

• Discounted transition model

Ti: Si × {πi} × EPer(Si) → [0, 1]

• Discounted reward model

Ri: Si × {πi} → ℝ

Fig. The grid world partitioned into four regions, with region S1 highlighted.

• Abstract MDP M′ = < S′, A′, T′, R′ >

• S′ = ∪i EPer(Si), e.g.,

• {(4,3), (3,4), (5,3), (3,5), (6,4), (4,6), (5,6), (6,5)}.

• A′ = ∪i Ai, where Ai is a set of macro-actions on region Si.

• Transition model T′: S′ × A′ × S′ → [0, 1]

• T′(s, πi, s′) = Ti(s, πi, s′) if s ∈ Si, s′ ∈ EPer(Si);

• T′(s, πi, s′) = 0 otherwise.

• Reward model R′: S′ × A′ → ℝ

• R′(s, πi) = Ri(s, πi) for any s ∈ Si.
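A sketch, under the assumption that the discounted models Ti and Ri have already been computed, of how the abstract MDP M′ could be assembled; the simplification to one macro-action per region and all identifiers are illustrative only. The abstract MDP can then be solved with ordinary value iteration over the much smaller S′ (eight peripheral states in the example).

```python
# Assumption-heavy sketch of assembling the abstract MDP M' = <S', A', T', R'>.
def build_abstract_mdp(regions, eper, macro_T, macro_R):
    """
    regions: iterable of region ids i
    eper:    dict i -> set of peripheral states EPer(S_i)
    macro_T: dict (i, s, s2) -> T_i(s, pi_i, s2)   (discounted transition model)
    macro_R: dict (i, s)     -> R_i(s, pi_i)       (discounted reward model)
    """
    S_prime = set().union(*eper.values())           # S' = union_i EPer(S_i)
    A_prime = [("pi", i) for i in regions]          # one macro-action per region

    def T_prime(s, macro, s2):                      # T'(s, pi_i, s') = T_i(s, pi_i, s')
        return macro_T.get((macro[1], s, s2), 0.0)  # ... and 0 otherwise

    def R_prime(s, macro):                          # R'(s, pi_i) = R_i(s, pi_i)
        return macro_R.get((macro[1], s), 0.0)

    return S_prime, A_prime, T_prime, R_prime
```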

• Options [Sutton 1995; Sutton, Precup and Singh 1999]

• Macro-actions [Hauskrecht et al. 1998; Parr 1998]

• Fixed policies

• Hierarchical abstract machines (HAMs) [Parr and Russell 1997; Andre and Russell 2001,2002]

• Finite controllers

• MAXQ methods [Dietterich 1998, 2000]

• Etc.

• Temporal abstraction approaches (basic idea)

• Solving MDPs hierarchically

• Using complex actions or subtasks to reduce the effective size of the state space

• Representing and solving MDPs logically (basic idea)

• Logically representing environment features

• Aggregating ‘similar’ states

• Representing the effects of actions compactly using logical structures, and eliminating unaffected features during reasoning

• Using the stochastic situation calculus to model decision-theoretic planning problems

• Underlying model: first-order MDPs (FOMDPs)

• Solving FOMDPs using symbolic dynamic programming

• Using choice axioms to specify possible outcomes ni(x) of any stochastic action a(x)

Example: choice(delCoff(x), a) ≡ a = delCoffS(x) ∨ a = delCoffF(x)

• Situations: S0 , do(a,s)

• Fluents F(x,s) – modeling environment features compactly

Examples: office(x,s), coffeeReq(x,s), holdingCoffee(s)

• Basic action theory:

• Using successor state axioms to describe the effect of the actions’ outcomes on each fluent

coffeeReq(x, do(a,s)) ≡ coffeeReq(x,s) ∧ ¬(a = delCoffS(x))

• Asserting probabilities for the outcomes (these may depend on conditions of the current situation)

Example: prob(delCoffS(x), delCoff(x), s) = case[hot, 0.9; ¬hot, 0.7]

• Specifying rewards/costs conditionally

Example: R(do(a,s)) =

case[∃x. a = delCoffS(x), 10; ¬∃x. a = delCoffS(x), 0]

• stGolog programs, policies

proc π(x)

if ¬holdingCoffee then getCoffee else (

?(coffeeReq(x)) ; delCoffee(x))

end proc

• Representing value function Vn-1(s) logically

case [1 (s) ,v1 ; … ; m (s) ,vm ]

• Input: the system described in stochastic SitCal and Vn-1(s)

• Output (also in case format):

• Q-functions

Qn(a(x), s) = R(s) + γ Σi prob(ni(x), a(x), s) · Vn−1(do(ni(x), s))

• Value function Vn(s)

Vn(s) = ( a)( b) Q n(a,s) Q n(b,s)
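A small sketch of the case[φ1, v1; …; φm, vm] format; the formulas are encoded as ordinary Python predicates purely for illustration, whereas symbolic dynamic programming manipulates the first-order formulas themselves and never enumerates states. All names and values below are assumptions.

```python
# Illustration of the case[phi_1, v_1; ...; phi_m, v_m] value-function format.
from typing import Callable, List, Tuple

Case = List[Tuple[Callable[[dict], bool], float]]      # [(phi_i, v_i), ...]

def case_value(case: Case, s: dict) -> float:
    """Look up the value of the partition phi_i that situation s satisfies."""
    for phi, v in case:
        if phi(s):
            return v
    raise ValueError("situation falls outside every partition")

# e.g. V_{n-1} = case[someone still requests coffee, 0; nobody does, 10]  (made up)
V_prev: Case = [
    (lambda s: any(s["coffeeReq"].values()), 0.0),
    (lambda s: not any(s["coffeeReq"].values()), 10.0),
]

print(case_value(V_prev, {"coffeeReq": {"ann": True}}))   # -> 0.0
```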

• First-order MDPs [e.g., Boutilier et al. 2000; Boutilier, Reiter and Price 2001]

• Factored MDPs [e.g., Boutilier and Dearden 1994; Boutilier, Dearden and Goldszmidt 1995; Hoey et al. 1999]

• Relational MDPs [e.g., Guestrin et al. 2003]

• Integrated Bayesian Agent Language (IBAL) [Pfeffer 2001]

• Etc [e.g., Bacchus 1993, Poole 1995].

### Motivation

Fig. A relational domain with cities cityA and cityB, houses houseA and houseB, and relations such as living(X, houseA) and inCity(Y, cityA).

### Prior Work

• MAXQ approaches [Dietterich 2000] and the PHAMs method [Andre and Russell 2001]

• Using variables to represent state features

• Propositional representations

• Extending DTGolog with options [Ferrein, Fritz and Lakemeyer 2003]

• Specifying options with the SitCal and Golog programs

• Benefit: reusable when entering the exact same region

• Shortcoming: options are defined over explicit regions, and therefore are not reusable in merely ‘similar’ regions

### Our Idea and Potential Directions

• Given any stGolog program (a macro-action schema)

Example: proc getCoffee(X)

if holdingCoffee then getCoffee else (

while coffeeReq(X) do delCoffee(X) )

end proc

• Basic Idea – inspired by macro-actions [Boutilier et al. 1998]:

• Analyzing the macro-action to determine which fluents are affected by it

Example: holdingCoffee, coffeeReq(X)

• Preprocessing discounted transition and reward models

Example: tr(holdingCoffee ∧ coffeeReq(X), getCoffee(X),

¬holdingCoffee ∧ ¬coffeeReq(X))
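A hypothetical sketch of what such a preprocessed model might look like: the macro-action's discounted effect is stored only over the fluents it affects, so the same entry can be reused for any binding of X and in any ‘similar’ region; the probabilities below are invented for illustration.

```python
# Hypothetical preprocessed (discounted) transition model for the macro-action
# getCoffee(X), stored only over the two affected fluents
# (holdingCoffee, coffeeReq(X)); all numbers are invented for illustration.
tr_getCoffee = {
    (True, True):  {(False, False): 0.9,    # delivery succeeds
                    (True, True):   0.1},   # delivery fails, request persists
    (False, True): {(True, True):   1.0},   # not holding coffee yet: go get it
}

def apply_getCoffee(local_state):
    """Distribution over the affected fluents after getCoffee(X); independent of
    which X is bound and of every unaffected fluent, so it is freely reusable."""
    return tr_getCoffee.get(local_state, {local_state: 1.0})

print(apply_getCoffee((True, True)))   # {(False, False): 0.9, (True, True): 0.1}
```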

### (Continued)

• Using and re-using macro-actions as primitive actions

• Benefit:

• Schematic

• Free variables in the macro-actions can represent a class of objects that share the same characteristics

• Even for infinitely many objects

• Reusable in ‘similar’ regions, not only in the exact same region

Thank you!