1 / 11

# MDPs and Reinforcement Learning - PowerPoint PPT Presentation

MDPs and Reinforcement Learning. Overview. MDPs Reinforcement learning. Sequential decision problems. In an environment, find a sequence of actions in an uncertain environment that balance risks and rewards Markov Decision Process (MDP):

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' MDPs and Reinforcement Learning' - talli

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### MDPs and Reinforcement Learning

• MDPs

• Reinforcement learning

• In an environment, find a sequence of actions in an uncertain environment that balance risks and rewards

• Markov Decision Process (MDP):

• In a fully observable environment we know initial state (S0) and state transitions T(Si, Ak, Sj) = probability of reaching Sj from Si when doing Ak

• States have a reward associated with them R(Si)

• We can define a policy π that selects an action to perform given a state, i.e., π(Si)

• Applying a policy leads to a history of actions

• Goal: find policy maximizing expected utility of history

• Assume R(s) = -0.04 except where marked

• Here’s an optimal policy

Different default rewards produce different optimal policies

life=pain, get out quick

Life = struggle, go for +1, accept risk

Life = good, avoid exits

Life = ok, go for +1, minimize risk

• Finite Horizon

• There’s a fixed time N when the game is over

• U([s1…sn]) = U([s1…sn…sk])

• Find a policy that takes that into account

• Infinite Horizon

• Game goes on forever

• The best policy for with a finite horizon can change over time: more complicated

• The utility of a sequence is usually additive

• U([s0…s1]) = R(s0) + R(s1) + … R(sn)

• But future rewards might be discounted by a factor γ

• U([s0…s1]) = R(s0) + γ*R(s1) + γ2*R(s2)…+ γn*R(sn)

• Using discounted rewards

• Solves some technical difficulties with very long or infinite sequences and

• Is psychologically realistic

• The value of a state is the expected return starting from that state; depends on the agent’s policy:

• The value of taking an action in a stateunder policy p is the expected return starting from that state, taking that action, and thereafter following p :

The basic idea:

So:

Or, without the expectation operator: