MDPs and Reinforcement Learning

Overview
  • MDPs
  • Reinforcement learning
Sequential decision problems
  • Find a sequence of actions in an uncertain environment that balances risks and rewards
  • Markov Decision Process (MDP):
    • In a fully observable environment we know the initial state S0 and the transition model T(Si, Ak, Sj), the probability of reaching Sj from Si when doing action Ak
    • Each state Si has an associated reward R(Si)
  • We can define a policy π that selects an action to perform given a state, i.e., π(Si)
  • Applying a policy leads to a history of actions
  • Goal: find policy maximizing expected utility of history
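
A minimal sketch of how this setup could be represented in code. The container and field names (MDP, T, R, policy) mirror the slide's notation but are otherwise illustrative assumptions, not something the presentation defines.

```python
from dataclasses import dataclass

# Illustrative sketch of the MDP ingredients listed above (names are assumptions,
# not from the slides): states Si, actions Ak, transition model T, per-state rewards R.
@dataclass
class MDP:
    states: list            # the states Si
    actions: list           # the actions Ak
    T: dict                 # T[(s, a)] -> list of (next state, probability) pairs
    R: dict                 # R[s] -> reward for being in state s
    gamma: float = 1.0      # discount factor (introduced on the Rewards slide)

# A policy maps each state to the action to perform there, e.g. a plain dictionary:
#   policy = {s: some_action for s in mdp.states}
# The goal is the policy whose histories have the highest expected utility.
```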
4x3 Grid World
  • Assume R(s) = -0.04 for every state except the marked terminal squares (+1 and -1)
  • Here’s an optimal policy
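
For concreteness, here is one way the 4x3 grid world could be written down. The slide only shows the picture, so the +1/-1 terminal squares, the blocked square, and the 0.8/0.1/0.1 "slip" transition model below are taken from the standard version of this example and should be treated as assumptions.

```python
# Sketch of the 4x3 grid world (terminals, wall, and slip probabilities assumed
# from the standard version of this example; the slide only shows the figure).
LIVING_REWARD = -0.04
WALLS = {(1, 1)}                                  # blocked square, (column, row), 0-indexed
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}          # the marked exit squares
STATES = [(x, y) for x in range(4) for y in range(3) if (x, y) not in WALLS]
ACTIONS = ['up', 'down', 'left', 'right']

def R(s):
    """R(s) = -0.04 everywhere except the marked terminal squares."""
    return TERMINALS.get(s, LIVING_REWARD)

def move(s, a):
    dx, dy = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}[a]
    s2 = (s[0] + dx, s[1] + dy)
    return s2 if s2 in STATES else s              # bumping into a wall or the edge: stay put

def T(s, a):
    """T(s, a) -> list of (next state, probability): 0.8 intended, 0.1 each sideways."""
    if s in TERMINALS:
        return []                                 # no transitions out of a terminal state
    left_of = {'up': 'left', 'left': 'down', 'down': 'right', 'right': 'up'}
    right_of = {v: k for k, v in left_of.items()}
    return [(move(s, a), 0.8), (move(s, left_of[a]), 0.1), (move(s, right_of[a]), 0.1)]
```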
4x3 Grid World

Different default rewards produce different optimal policies:

  • Life = pain: get out quick
  • Life = struggle: go for +1, accept risk
  • Life = good: avoid the exits
  • Life = OK: go for +1, minimize risk
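
To see the effect concretely, one can solve the grid world above for several default rewards. These slides stop before covering solution methods, so the value iteration sketch below is only a standard way to back up the claim; it reuses the illustrative grid-world definitions from the previous slide.

```python
# Sketch: compute an optimal policy for a given default (living) reward with value
# iteration, a standard MDP solution method not covered in this excerpt of the slides.
def value_iteration(living_reward, gamma=1.0, iters=100):
    reward = lambda s: TERMINALS.get(s, living_reward)
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        for s in STATES:
            if s in TERMINALS:
                V[s] = reward(s)
            else:
                V[s] = reward(s) + gamma * max(
                    sum(p * V[s2] for s2, p in T(s, a)) for a in ACTIONS)
    return {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for s2, p in T(s, a)))
            for s in STATES if s not in TERMINALS}

# Comparing, e.g., value_iteration(-0.04) with value_iteration(-2.0) yields different
# action maps, illustrating the slide's point that the default reward shapes the policy.
```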

Finite and infinite horizons
  • Finite Horizon
    • There’s a fixed time N when the game is over
    • U([s0, s1, …, sN, …, sN+k]) = U([s0, s1, …, sN]) — states visited after time N contribute nothing
    • Find a policy that takes that into account
  • Infinite Horizon
    • Game goes on forever
  • With a finite horizon, the best action in a state can change as the deadline approaches, so the optimal policy is nonstationary: more complicated
Rewards
  • The utility of a sequence is usually additive
    • U([s0, s1, …, sn]) = R(s0) + R(s1) + … + R(sn)
  • But future rewards might be discounted by a factor γ
    • U([s0, s1, …, sn]) = R(s0) + γ*R(s1) + γ^2*R(s2) + … + γ^n*R(sn)
  • Using discounted rewards
    • Solves some technical difficulties with very long or infinite sequences and
    • Is psychologically realistic
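
A small worked example of the discounted sum above, using the grid world's -0.04 step reward and an arbitrary γ = 0.9 (neither value is fixed by this slide):

```python
# Sketch: utility of a reward sequence, with an optional discount factor gamma.
def utility(rewards, gamma=1.0):
    """U = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... for the given per-step rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three -0.04 steps followed by the +1 exit, discounted with gamma = 0.9 (illustrative values):
print(utility([-0.04, -0.04, -0.04, 1.0], gamma=0.9))   # ~= 0.62
```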
Value Functions
  • The value of a state is the expected return starting from that state; it depends on the agent’s policy π:
    • Vπ(s) = Eπ[ U([s0, s1, s2, …]) | s0 = s ]
  • The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π:
    • Qπ(s, a) = Eπ[ U([s0, s1, s2, …]) | s0 = s, a0 = a ]
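
Under this presentation's convention that the reward depends only on the state, the two quantities are related as sketched below. The MDP container is the illustrative one from the earlier sketch, so its field names are assumptions.

```python
# Sketch: the action value Q_pi(s, a) expressed through the state values V_pi, using
# this deck's convention that rewards attach to states (field names are assumptions).
def q_value(s, a, V, mdp):
    """Expected return of doing a in s and thereafter following the policy behind V."""
    return mdp.R[s] + mdp.gamma * sum(p * V[s2] for s2, p in mdp.T[(s, a)])
```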
Bellman Equation for a Policy π

The basic idea: the utility of a sequence is the immediate reward plus the discounted utility of the rest of the sequence:

U([st, st+1, st+2, …]) = R(st) + γ*U([st+1, st+2, …])

So:

Vπ(s) = Eπ[ R(s) + γ*Vπ(s′) ], where s′ is the next state

Or, without the expectation operator:

Vπ(s) = R(s) + γ * Σs′ T(s, π(s), s′) * Vπ(s′)
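
One way to make the last equation operational is to apply it as an update over and over until the values stop changing (iterative policy evaluation). The sketch below reuses the illustrative MDP container from earlier; with γ = 1 it relies on the policy eventually reaching a terminal state.

```python
# Sketch: iterative policy evaluation -- repeatedly apply the Bellman equation for a
# fixed policy until the values converge (reuses the illustrative MDP container above).
def evaluate_policy(policy, mdp, tol=1e-6):
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            # V_pi(s) = R(s) + gamma * sum over s' of T(s, pi(s), s') * V_pi(s')
            v_new = mdp.R[s] + mdp.gamma * sum(
                p * V[s2] for s2, p in mdp.T.get((s, policy.get(s)), []))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```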
