Advanced MDP Topics

Ron Parr

Duke University


Value Function Approximation

  • Why?

    • Duality between value functions and policies

    • Softens the problems

    • State spaces are too big

      • Many problems have continuous variables

      • “Factored” (symbolic) representations don’t always save us

  • How

    • Can tie in to vast body of

      • Machine learning methods

      • Pattern matching (neural networks)

      • Approximation methods


Implementing VFA

  • Can’t represent V as a big vector

  • Use (parametric) function approximator

    • Neural network

    • Linear regression (least squares)

    • Nearest neighbor (with interpolation)

  • (Typically) sample a subset of the states

  • Use function approximation to “generalize”


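Basic Value Function Approximation

Idea: Consider restricted class of value functions

[Figure: cycle starting from V0 that alternates value iteration (VI) on a subset of states with function approximation (FA), optionally resampling the subset; open question whether it reaches V*]

Alternate value iteration with supervised learning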


VFA Outline

1. Initialize V0(s,w0), n=1

2. Select some s0…si

3. For each sj, compute the backed-up target v(sj) = max_a [ R(sj,a) + γ Σ_s' P(s'|sj,a) Vn-1(s',wn-1) ]

4. Compute Vn(s,wn) by training w on the pairs (sj, v(sj))

5. n := n+1

6. Unless ‖Vn+1 - Vn‖ < ε goto 2

If supervised learning error is “small”,

then Vfinal “close” to V*.
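The outline can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: a hypothetical five-state chain with deterministic left/right moves, a reward of 1 for landing on the last state, a linear approximator V(s) = w0 + w1*s fitted by ordinary least squares, and every state used as the sample set. The names fit_line and fitted_value_iteration are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ w0 + w1 * x; returns (w0, w1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    return my - w1 * mx, w1


def fitted_value_iteration(n_states=5, gamma=0.9, eps=1e-9, max_iters=10_000):
    """Alternate Bellman backups on sampled states with a least squares fit."""
    goal = n_states - 1
    w = (0.0, 0.0)                    # step 1: initialize V0(s, w0)
    samples = list(range(n_states))   # step 2: assumption, sample every state

    def value(s, weights):
        return weights[0] + weights[1] * s

    def backup(s, weights):
        # step 3: one-step lookahead target under the current weights
        best = float("-inf")
        for move in (-1, +1):
            nxt = min(max(s + move, 0), goal)
            reward = 1.0 if nxt == goal else 0.0
            best = max(best, reward + gamma * value(nxt, weights))
        return best

    for n in range(1, max_iters + 1):
        targets = [backup(s, w) for s in samples]
        new_w = fit_line(samples, targets)   # step 4: supervised learning
        change = max(abs(value(s, new_w) - value(s, w)) for s in samples)
        w = new_w                            # step 5: move to the next iterate
        if change < eps:                     # step 6: stopping test
            break
    return w, n


weights, iterations = fitted_value_iteration()
```

On this toy chain the loop settles near V(s) = 9.57 + 1.07 s, which puts the goal state at about 13.9 although its exact value is 1/(1-γ) = 10. The fit error is not small here, so the closeness claim above does not apply, and convergence itself is a property of this particular example rather than of the method.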


Stability Problem

Problem: Most VFA methods are unstable

[Figure: two-state example with states s1 and s2]

No rewards, γ = 0.9: V* = 0

Example: Bertsekas & Tsitsiklis 1996


Least Squares Approximation

Restrict V to linear functions:

Find θ s.t. V(s1) = θ, V(s2) = 2θ

[Figure: V(x) plotted over the state space S, with s1 and s2 marked]

Counterintuitive result: if we do a least squares fit of θ, then

θt+1 = 1.08 θt
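The factor 1.08 can be reproduced with a short calculation. The transition diagram did not survive in this transcript, so the sketch below assumes the usual form of this counterexample: s1 moves to s2, s2 moves to itself, and all rewards are zero. Under that assumption both backed-up targets equal γ times the current value of s2, and a one-parameter least squares fit onto the features (1, 2) gives θ' = 1.2 γ θ, which is 1.08 θ at γ = 0.9.

```python
def least_squares_step(theta, gamma=0.9):
    """One backup followed by a least squares fit of V(s1)=theta, V(s2)=2*theta."""
    # Assumed dynamics: s1 -> s2 and s2 -> s2, zero reward everywhere,
    # so both targets are gamma * V(s2) = gamma * 2 * theta.
    targets = [gamma * 2.0 * theta, gamma * 2.0 * theta]
    features = [1.0, 2.0]
    # One-parameter least squares: theta = <features, targets> / <features, features>
    numerator = sum(f * t for f, t in zip(features, targets))
    denominator = sum(f * f for f in features)
    return numerator / denominator


theta = 1.0
history = [theta]
for _ in range(10):
    theta = least_squares_step(theta)
    history.append(theta)
```

Each pass multiplies θ by 1.08, so history grows geometrically even though V* = 0.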


Unbounded Growth of V

[Figure: V(x) plotted over S for successive iterations n, with points 1 and 2 marked]


What Went Wrong?

  • VI reduces error in maximum norm

  • Least squares (= projection) non-expansive in L2

  • May increase maximum norm distance

  • Grows max norm error at faster rate than VI shrinks it

  • And we didn’t even use sampling!

  • Bad news for neural networks…

  • Success depends on

    • sampling distribution

    • pairing approximator and problem
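The interplay of the two norms can be checked numerically on the two-state example. As a sketch, assume the dynamics s1 -> s2 and s2 -> s2 with zero reward (the diagram is missing from this transcript) and the linear class V = θ (1, 2); the helper names are hypothetical.

```python
import math


def max_norm(v):
    return max(abs(x) for x in v)


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def bellman_backup(v, gamma=0.9):
    # Assumed dynamics: both states move to s2, no reward.
    return [gamma * v[1], gamma * v[1]]


def project(v, features=(1.0, 2.0)):
    # Least squares projection onto the span of the feature vector.
    theta = sum(f * x for f, x in zip(features, v)) / sum(f * f for f in features)
    return [theta * f for f in features]


v = [1.0, 2.0]                  # theta = 1; V* = 0, so norms are distances to V*
backed_up = bellman_backup(v)   # value iteration step
projected = project(backed_up)  # least squares step

norms = {
    "start": (max_norm(v), l2_norm(v)),
    "after VI": (max_norm(backed_up), l2_norm(backed_up)),
    "after projection": (max_norm(projected), l2_norm(projected)),
}
```

The backup shrinks the max norm from 2 to 1.8, the projection does not increase the L2 norm, yet the max norm after projection is 2.16, larger than where the iteration started.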


Success Stories - Linear TD

  • [Tsitsiklis & Van Roy 96, Bradtke & Barto 96]

  • Start with a set of basis functions

  • Restrict V to linear space spanned by bases

  • Sample states from current policy

[Figure: space of true value functions containing a restricted linear space; VI steps are followed by the projection P back onto the restricted linear space]

N.B. linear is still expressive due to basis functions


Linear TD Formal Properties

  • Use to evaluate policies only

  • Converges w.p. 1

  • Error measured w.r.t. stationary distribution

  • Frequently visited states have low error

  • Infrequent states can have high error
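A minimal sketch of linear TD(0) for evaluating a fixed policy is given below. Everything concrete in it is an assumption made for illustration: the three-state chain CHAIN, the rewards REWARDS, the step-size schedule, and the function name linear_td0. States are sampled by following the chain itself, which is what ties the error measure to the stationary distribution.

```python
import random

# Hypothetical 3-state chain under a fixed policy: stay or advance, each w.p. 0.5.
CHAIN = [
    [(0.5, 0), (0.5, 1)],
    [(0.5, 1), (0.5, 2)],
    [(0.5, 2), (0.5, 0)],
]
REWARDS = [0.0, 0.0, 1.0]   # reward received on leaving each state


def linear_td0(features, transitions, rewards, gamma=0.9, steps=200_000, seed=0):
    """TD(0) with V(s) = sum_k w[k] * features[s][k], sampled along one trajectory."""
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    s = 0
    for t in range(steps):
        probs, nexts = zip(*transitions[s])
        s_next = rng.choices(nexts, weights=probs)[0]
        v = sum(wk * fk for wk, fk in zip(w, features[s]))
        v_next = sum(wk * fk for wk, fk in zip(w, features[s_next]))
        delta = rewards[s] + gamma * v_next - v    # temporal-difference error
        alpha = 0.1 * 1000.0 / (1000.0 + t)        # assumed decaying step size
        w = [wk + alpha * delta * fk for wk, fk in zip(w, features[s])]
        s = s_next
    return w


# Two basis functions (a constant and the state index) for three states.
basis = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
weights = linear_td0(basis, CHAIN, REWARDS)
```

With one indicator feature per state the same routine reduces to tabular TD(0), which is a convenient sanity check because the exact values can then be computed directly.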


Linear TD Methods

  • Applications

    • Inventory control: Van Roy et al.

    • Packet routing: Marbach et al.

    • Used by Morgan Stanley to value options

    • Natural idea: use for policy iteration

  • No guarantees

    • Can produce bad policies for trivial problems [Koller & Parr 99]

    • Modified for better PI: LSPI [Lagoudakis & Parr 01]

  • Can be done symbolically [Koller & Parr 00]

  • Issues

    • Selection of basis functions

    • Mixing rate of process - affects k, speed


Success Story: Averagers [Gordon 95, and others…]

  • Pick set, Y=y1…yi of representative states

  • Perform VI on Y

  • For x not in Y, V(x) = Σi βi V(yi), with βi ≥ 0 and Σi βi = 1

  • Averagers are non-expansions in max norm

  • Converge to within 1/(1-γ) factor of “best”
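As a sketch of the idea, the code below runs value iteration only on a few representative states and evaluates every other state as a convex combination of them. The problem is an assumption chosen for illustration: a chain of states 0..10 with deterministic left/right moves, a reward of 1 for landing on state 10, representatives Y = {0, 5, 10}, and linear interpolation as the averager. The function names are hypothetical.

```python
def interpolation_weights(x, reps):
    """Convex weights over sorted representative states for point x."""
    for i in range(len(reps) - 1):
        lo, hi = reps[i], reps[i + 1]
        if lo <= x <= hi:
            frac = (x - lo) / (hi - lo)
            return {lo: 1.0 - frac, hi: frac}
    raise ValueError("x lies outside the range covered by reps")


def averager_value_iteration(n=10, reps=(0, 5, 10), gamma=0.9, eps=1e-10):
    """Value iteration on reps only; other states use interpolated values."""
    v = {y: 0.0 for y in reps}

    def v_hat(x):
        # Averager: weights are non-negative and sum to one.
        return sum(beta * v[y] for y, beta in interpolation_weights(x, reps).items())

    while True:
        new_v = {}
        for y in reps:
            best = float("-inf")
            for move in (-1, +1):
                nxt = min(max(y + move, 0), n)
                reward = 1.0 if nxt == n else 0.0
                best = max(best, reward + gamma * v_hat(nxt))
            new_v[y] = best
        change = max(abs(new_v[y] - v[y]) for y in reps)
        v.update(new_v)   # synchronous sweep: all reps updated together
        if change < eps:
            return v


values = averager_value_iteration()
```

Because the interpolation step cannot increase max norm distances, the combined update remains a contraction with factor γ and the loop terminates; the resulting values are approximate (about 6.43 at state 5, where the exact value on this chain is 0.9^4 * 10, roughly 6.56).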


Interpretation of Averagers

[Figure: point x connected to representative states y1, y2, y3 with weights β1, β2, β3]


Interpretation of Averagers II

Averagers Interpolate:

[Figure: point x inside a grid cell with vertices y1, y2, y3, y4; grid vertices = Y]


General VFA Issues

  • What’s the best we can hope for?

    • We’d like to get approximate close to

    • How does this relate to

  • In practice:

    • We are quite happy if we can prove stability

    • Obtaining good results often involves an iterative process of tweaking the approximator, measuring empirical performance, and repeating…


Why I’m Still Excited About VFA

  • Symbolic methods often fail

    • Stochasticity increases branching factor

    • Many trivial problems have no exploitable structure

  • “Bad” value functions can have good performance

  • We can bound “badness” of value functions

    • By simulation

    • Symbolically in some cases [Koller & Parr 00; Guestrin, Koller & Parr 01; Dean & Kim 01]

  • Basis function selection can be systematized


Hierarchical Abstraction

  • Reduce problem into simpler subproblems

  • Chain primitive actions into macro-actions

  • Lots of results that mirror classical results

    • Improvements dependent on user-provided decompositions

    • Macro-actions great if you start with good macros

  • See Dean & Lin, Parr & Russell, Precup, Sutton & Singh, Schmidhuber & Wiering, Hauskrecht et al., etc.