MDP Reinforcement Learning

Presentation Transcript

Markov Decision Process

[Slide diagram: three states linked by the actions "Should you give money to charity?" and "Would you contribute?", ending in a state where the money ($) is collected.]


Charity MDP

  • State space : 3 states

  • Actions : “Should you give money to charity” ,“Would you contribute”

  • Observations : knowledge of current state

  • Rewards : in final state, positive reward proportional to amount of money gathered
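A minimal sketch of how this MDP could be encoded, assuming concrete state names, deterministic transitions, and a placeholder reward amount (the slides only specify three states, the two questions, and a reward in the final state):

    # Hypothetical encoding of the charity MDP.  State names, the
    # transition table, and the reward amount are illustrative
    # assumptions, not part of the original slides.

    STATES = ["start", "asked", "done"]
    ACTIONS = ["should_give", "would_contribute"]

    # f(state, action) -> next state (deterministic transitions)
    TRANSITIONS = {
        ("start", "should_give"): "asked",       # lead-in question first
        ("start", "would_contribute"): "done",   # cut straight to the request
        ("asked", "would_contribute"): "done",
    }

    # R(state): reward only arrives in the final state, proportional to
    # the money gathered (here a placeholder amount).
    REWARDS = {"start": 0.0, "asked": 0.0, "done": 10.0}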



Lecture Outline

  • Computing the Value Function

  • Finding the Optimal Policy

  • Computing the Value Function in an Online Environment


Useful definitions

Define: π to be a policy

π(j) : the action to take in state j

R(j) : the reward received in state j

f(j, a) : the next state, starting from state j and performing action a

Computing The Value Function

  • When the reward is known, we can compute the value function for a particular policy

  • V(j), the value function : the expected reward for being in state j and following a certain policy π


Calculating V(j)

  • Set V0(j) = 0, for all j

  • For i = 1 to Max_i

    • Vi(j) = R(j) + γ V(i-1)(f(j, π(j)))

  • γ = the discount rate, which measures how much future rewards can propagate to previous states

The formula above depends on the rewards being known
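A small Python sketch of this iteration for a fixed policy; the dictionary-based representation (rewards, policy, and transition table f) is an assumption for illustration:

    def evaluate_policy(states, rewards, policy, f, gamma=0.5, max_i=10):
        """Run the iteration above for a fixed policy.

        rewards maps state -> R(state), policy maps state -> action,
        and f maps (state, action) -> next state; states missing from
        f simply stay where they are.
        """
        v = {j: 0.0 for j in states}                       # V0(j) = 0 for all j
        for _ in range(max_i):
            v_prev = v
            # Vi(j) = R(j) + gamma * V(i-1)(f(j, policy(j)))
            v = {j: rewards[j] + gamma * v_prev[f.get((j, policy.get(j)), j)]
                 for j in states}
        return v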


Value Fn for the Charity MDP

  • Fixing γ at 0.5, and comparing two policies: one which asks both questions, and one which cuts to the chase

  • What is V3 if:

    1. Assume that the reward is constant at the final state (everyone gives the same amount of money)

    2. Assume that if you ask if one should give to charity, the reward is 10 times higher
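One hedged way to work this exercise, treating the final reward as collected once when the final state is reached and using 1.0 as the base amount (both assumptions, since the slides leave the reward unspecified):

    # Worked sketch of the exercise.  The start -> asked -> done chain,
    # the one-time final reward, and the base amount of 1.0 are all
    # assumptions made for illustration.
    gamma = 0.5

    def v3_ask_both(final_reward):
        """Policy that asks both questions: start -> asked -> done."""
        v_done = final_reward        # reward arrives in the final state
        v_asked = gamma * v_done     # one step away from the reward
        return gamma * v_asked       # two steps away from the reward

    def v3_cut_to_chase(final_reward):
        """Policy that cuts to the chase: start -> done."""
        return gamma * final_reward

    # Case 1: everyone gives the same amount under either policy.
    print(v3_ask_both(1.0), v3_cut_to_chase(1.0))    # 0.25 vs 0.5
    # Case 2: asking the charity question makes the reward 10x higher.
    print(v3_ask_both(10.0), v3_cut_to_chase(1.0))   # 2.5 vs 0.5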



Policy Iteration

  1. Set π0 to be an arbitrary policy

  2. Set i to 0

  3. Compute Vi(j) for all states j

  4. Compute π(i+1)(j) = argmax_a Vi(f(j, a))

  5. If π(i+1) = πi stop, otherwise i++ and go back to step 3

What would this do for the charity MDP in the two cases?
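A sketch of this loop in Python, reusing the same dictionary representation assumed earlier (states, actions, rewards, and a known transition table f):

    def policy_iteration(states, actions, rewards, f, gamma=0.5, max_i=10):
        # Step 1: start from an arbitrary policy.
        policy = {j: actions[0] for j in states}
        while True:
            # Step 3: compute Vi(j) for the current policy.
            v = {j: 0.0 for j in states}
            for _ in range(max_i):
                v_prev = v
                v = {j: rewards[j] + gamma * v_prev[f.get((j, policy[j]), j)]
                     for j in states}
            # Step 4: greedy improvement, pi(i+1)(j) = argmax_a Vi(f(j, a)).
            new_policy = {j: max(actions, key=lambda a: v[f.get((j, a), j)])
                          for j in states}
            # Step 5: stop once the policy no longer changes.
            if new_policy == policy:
                return policy, v
            policy = new_policy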


Lecture Outline

  • Computing the Value Function

  • Finding the Optimal Policy

  • Computing the Value Function in an Online Environment


MDP Learning

  • So, when the rewards are known, we can calculate the optimal policy using policy iteration.

  • But what happens in the case where we don't know the rewards?


Lecture Outline

  • Computing the Value Function

  • Finding the Optimal Policy

  • Computing the Value Function in an Online Environment


Deterministic vs. Stochastic Update

Deterministic:

  Vi(j) = R(j) + γ V(i-1)(f(j, π(j)))

Stochastic:

  V(n) = (1 - α) V(n) + α [r + γ V(n')]

  • The difference is that the stochastic version averages over all visits to the state
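The two update rules written out as plain functions (α for the learning rate and γ for the discount rate, as in the formulas above; the dictionary representation is again an assumption):

    def deterministic_update(v_prev, j, rewards, f, policy, gamma=0.5):
        # Vi(j) = R(j) + gamma * V(i-1)(f(j, pi(j)))
        return rewards[j] + gamma * v_prev[f[(j, policy[j])]]

    def stochastic_update(v, n, n_next, r, alpha=0.1, gamma=0.5):
        # V(n) = (1 - alpha) * V(n) + alpha * (r + gamma * V(n'))
        return (1 - alpha) * v[n] + alpha * (r + gamma * v[n_next])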


MDP extensions

  • Probabilistic state transitions

  • How should you calculate the value function for the first state now?

[Slide diagram: from the first state, the action "Would you like to contribute?" branches with probabilities 0.8 and 0.2 to the states Happy (reward +10) and Mad (reward -10).]
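With probabilistic transitions, the single next-state value V(f(j, π(j))) is replaced by an expectation over possible next states. A small sketch of that calculation; the pairing of the 0.8 branch with Happy is an assumption about the diagram:

    # Expected value of asking "Would you like to contribute?" from the
    # first state.  The 0.8 -> Happy pairing is assumed; swap the
    # probabilities if the diagram intended the opposite.
    gamma = 0.5

    transition_probs = {"Happy": 0.8, "Mad": 0.2}   # P(next state | first state, ask)
    next_values = {"Happy": 10.0, "Mad": -10.0}     # values of the outcome states

    expected_next = sum(p * next_values[s] for s, p in transition_probs.items())
    v_first = 0.0 + gamma * expected_next           # R(first state) assumed to be 0
    print(v_first)                                  # 0.5 * (0.8*10 + 0.2*(-10)) = 3.0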


Probabilistic Transitions

  • The online computation strategy works the same even when state transitions are probabilistic

  • It also works in the case where you don't know what the transitions are


Online V(j) Computation

  1. For each j, initialize V(j) = 0

  2. Set n = initial state

  3. Set r = reward in state n

  4. Let n' = f(n, π(n))

  5. V(n) = (1 - α) V(n) + α [r + γ V(n')]

  6. n = n', and go back to step 3


1-step Q-learning

  1. Initialize Q(n, a) arbitrarily

  2. Select π as the policy, set n = initial state

  3. Set r = reward in state n, a = π(n), and n' = f(n, a)

  4. Q(n, a) = (1 - α) Q(n, a) + α [r + γ max_a' Q(n', a')]

  5. n = n', and go back to step 3

