MDPs and Reinforcement Learning

Overview

- MDPs
- Reinforcement learning

Sequential decision problems

- Find a sequence of actions in an uncertain environment that balances risks and rewards
- Markov Decision Process (MDP):
- In a fully observable environment we know the initial state S0 and the transition model T(Si, Ak, Sj) = probability of reaching Sj from Si when taking action Ak
- States have a reward associated with them R(Si)

- We can define a policy π that selects an action to perform given a state, i.e., π(Si)
- Executing a policy produces a history of states and actions
- Goal: find a policy that maximizes the expected utility of the history (see the sketch below)
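
As a concrete illustration, here is a minimal sketch of a finite MDP in Python. It is not from the slides: the names `transitions`, `reward`, `policy`, and `sample_history` are made up for this example. It simply encodes T(Si, Ak, Sj), R(Si), and π(Si) as plain dictionaries and samples one history by following the policy.

```python
import random

# transitions[s][a] lists (probability, next_state) pairs: T(s, a, s').
transitions = {
    "s0": {"a": [(0.8, "s1"), (0.2, "s0")],   # action "a" usually moves on
           "b": [(1.0, "s0")]},               # action "b" stays put
    "s1": {"a": [(1.0, "s1")]},               # s1 is absorbing
}
reward = {"s0": -0.04, "s1": 1.0}             # R(s)
policy = {"s0": "a", "s1": "a"}               # pi(s): one action per state

def sample_history(policy, s, steps):
    """Follow the policy, sampling successors from T; return the history."""
    history = [s]
    for _ in range(steps):
        probs, successors = zip(*transitions[s][policy[s]])
        s = random.choices(successors, weights=probs)[0]
        history.append(s)
    return history

print(sample_history(policy, "s0", 5))        # e.g. ['s0', 's1', 's1', ...]
```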

4x3 Grid World

- Assume R(s) = -0.04 for every state except the two terminal states, which are marked +1 and -1
- The slide's figure shows an optimal policy for these rewards

4x3 Grid World

Different default rewards R(s) produce different optimal policies:

- Life = pain (strongly negative R): get out quickly, via the nearest exit
- Life = struggle (moderately negative R): go for the +1 exit, accepting the risk of -1
- Life = good (positive R): avoid the exits altogether
- Life = ok (slightly negative R): go for +1, but minimize risk
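
To reproduce this effect, here is a sketch of the 4x3 world plus value iteration (a standard MDP solver; the solver itself is not covered on these slides, and the exact layout below, with terminals at (3, 2) and (3, 1) and a wall at (1, 1) in 0-indexed coordinates, is the usual textbook assumption). Changing the step reward R changes which policy comes out.

```python
# Sketch of the classic 4x3 grid world: actions succeed with prob. 0.8
# and slip sideways with prob. 0.1 each; bumping into a wall stays put.
ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SIDEWAYS = {"N": "EW", "S": "EW", "E": "NS", "W": "NS"}  # slip directions
TERMINALS = {(3, 2): 1.0, (3, 1): -1.0}
STATES = [(x, y) for x in range(4) for y in range(3) if (x, y) != (1, 1)]

def step(s, a):
    """Deterministic move; hitting the wall or the edge leaves s unchanged."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def q_value(V, s, a):
    """Expected value of taking a in s: 0.8 intended, 0.1 each sideways."""
    left, right = SIDEWAYS[a]
    return (0.8 * V[step(s, a)] + 0.1 * V[step(s, left)]
            + 0.1 * V[step(s, right)])

def value_iteration(R, gamma=1.0, iters=100):
    """Compute state values for a uniform nonterminal step reward R."""
    V = {s: TERMINALS.get(s, 0.0) for s in STATES}
    for _ in range(iters):
        for s in STATES:
            if s not in TERMINALS:
                V[s] = R + gamma * max(q_value(V, s, a) for a in ACTIONS)
    return V

def greedy_policy(V):
    """Extract the policy that acts greedily with respect to V."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a))
            for s in STATES if s not in TERMINALS}

print(greedy_policy(value_iteration(R=-0.04)))  # "minimize risk" policy
print(greedy_policy(value_iteration(R=-2.0)))   # "get out quick" policy
```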

Finite and infinite horizons

- Finite Horizon
- There’s a fixed time N when the game is over
- U([s0…sN]) = U([s0…sN…sN+k]) for any k > 0, i.e., states after step N contribute no utility
- Find a policy that takes that into account

- Infinite Horizon
- Game goes on forever

- The optimal policy with a finite horizon can change over time (it is nonstationary): more complicated. With an infinite horizon, the optimal policy is stationary.

Rewards

- The utility of a sequence is usually additive
- U([s0…sn]) = R(s0) + R(s1) + … + R(sn)

- But future rewards might be discounted by a factor γ, 0 < γ ≤ 1
- U([s0…sn]) = R(s0) + γR(s1) + γ²R(s2) + … + γⁿR(sn)

- Using discounted rewards
- Solves some technical difficulties with very long or infinite sequences (with γ < 1 and bounded rewards, the infinite sum converges), and
- Is psychologically realistic: rewards received sooner count more than the same rewards received later
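
A tiny helper (illustrative, not from the slides) makes the discounted sum concrete:

```python
def discounted_return(rewards, gamma):
    """U([s0…sn]) = R(s0) + gamma*R(s1) + … + gamma**n * R(sn)."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Four steps of -0.04 "living cost", then reaching the +1 exit:
print(discounted_return([-0.04, -0.04, -0.04, -0.04, 1.0], gamma=0.9))

# With a constant reward r forever and gamma < 1, the return is finite:
# sum over k of gamma**k * r = r / (1 - gamma).
```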

Value Functions

- The value of a state under a policy π is the expected return starting from that state; it depends on the policy the agent follows:
- The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π:
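
The slides give these definitions in words only; written out in standard notation (Sutton & Barto style, an assumption on our part rather than the slides' own symbols), they are:

```latex
% State-value function: expected discounted return when following \pi from s.
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s \,\right]

% Action-value function: take action a in s, then follow \pi thereafter.
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s,\, a_t = a \,\right]
```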
