Reinforcement Learning

16 January 2009

RG Knowledge Based Systems

Hans Kleine Büning

- Motivation
- Applications
- Markov Decision Processes
- Q-learning
- Examples

How to program a robot to ride a bicycle?

- A way of programming agents by reward and punishment without specifying how the task is to be achieved

[Figure: agent-environment loop with arrows labeled state, reward (drawn as €€€), and action]

States:
- Angle of handle bars
- Angular velocity of handle bars
- Angle of bicycle to vertical
- Angular velocity of bicycle to vertical
- Acceleration of angle of bicycle to vertical

Actions:
- Torque to be applied to the handle bars
- Displacement of the center of mass from the bicycle's plane (in cm)

Reward:
- Angle of bicycle to vertical is greater than 12° (yes): Reward = -1
- Otherwise (no): Reward = 0
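The reward check above can be sketched as a small function. This is an illustration only: it assumes the "yes" branch (the bicycle has tilted too far) is the one that yields -1, and that the limit applies to tilt in either direction. The names `bicycle_reward`, `tilt_angle_deg`, and `limit_deg` are hypothetical and do not come from the slides.

```python
def bicycle_reward(tilt_angle_deg: float, limit_deg: float = 12.0) -> int:
    """Return -1 when the bicycle tilts past the limit, otherwise 0.

    Assumption: the limit is symmetric, so the absolute tilt is compared.
    """
    return -1 if abs(tilt_angle_deg) > limit_deg else 0
```

Note that the agent is only told when it has failed; nothing in the reward says how to steer, which matches the "reward and punishment" framing above.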

Reinforcement Learning

- Board Games
  - TD-Gammon program, based on reinforcement learning, has become a world-class backgammon player
- Mobile Robot Controlling
  - Learning to Drive a Bicycle
  - Navigation
  - Pole-balancing
  - Acrobot
- Sequential Process Controlling
  - Elevator Dispatching

- Learner is not told which actions to take
- Trial and error search
- Possibility of delayed reward:
  - Sacrifice of short-term gains for greater long-term gains
- Explore/Exploit trade-off
- Considers the whole problem of a goal-directed agent interacting with an uncertain environment
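One common way to handle the explore/exploit trade-off is epsilon-greedy action selection. The slides do not name a specific strategy, so this is a sketch of one possible choice rather than the method used in the talk; `epsilon_greedy`, `action_values`, and `epsilon` are illustrative names.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else a highest-valued one.

    action_values: list of estimated values, one per action index.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))  # explore: uniform random action
    best = max(action_values)
    return action_values.index(best)              # exploit: first best action on ties
```

With `epsilon=0.0` the agent always exploits its current estimates; larger values trade short-term reward for more information about rarely tried actions.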

- Agent and environment interact at discrete time steps: $t = 0, 1, 2, \ldots$
- Agent observes state at step $t$: $s_t \in S$
- produces action at step $t$: $a_t \in A$
- gets resulting reward: $r_{t+1} \in \mathbb{R}$
- and resulting next state: $s_{t+1} \in S$
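The interaction protocol above can be sketched as a loop. The environment here, `CountdownEnv`, is a hypothetical toy invented for this example (a counter the agent can decrement), and the `reset`/`step` interface is an assumption, not something the slides define.

```python
class CountdownEnv:
    """Hypothetical toy environment: the state is a counter; reaching 0 ends the episode."""

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        # Action 1 decrements the counter, action 0 leaves it unchanged.
        if action == 1:
            self.state -= 1
        reward = -1                 # every time step costs -1
        done = self.state == 0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Roll out one episode; return a list of (s_t, a_t, r_{t+1}, s_{t+1}) tuples."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent produces a_t
        next_state, reward, done = env.step(action)  # environment returns s_{t+1}, r_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory
```

For example, `run_episode(CountdownEnv(3), lambda s: 1)` runs a policy that always decrements and yields a three-step trajectory.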

- Roughly, the agent's goal is to get as much reward as it can over the long run
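"Reward over the long run" is often formalized as a discounted sum of future rewards. The slides do not state which formalization is used, so the discount factor `gamma` and the function name below are assumptions made for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum rewards r_1, r_2, ... weighted by gamma**0, gamma**1, ...

    Computed backwards so each step is: total = reward + gamma * total.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With `gamma=1.0` this reduces to the plain sum of rewards; smaller values make the agent weigh near-term rewards more heavily.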

Policy is
- a mapping from states to actions: $\pi(s) = a$
- Reinforcement learning methods specify how the agent changes its policy as a result of experience
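As one illustration of a policy changing with experience, here is a minimal sketch of a tabular Q-learning update (Q-learning appears in the outline, but the update rule is not reproduced in this text, so the standard form is assumed). The table layout, the step size `alpha`, the discount `gamma`, and the function names are all illustrative choices.

```python
from collections import defaultdict

def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Move Q(state, action) toward reward + gamma * max_a Q(next_state, a)."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])

def greedy_policy(q, state, actions):
    """The policy pi(s) implied by the table: pick the highest-valued action."""
    return max(actions, key=lambda a: q[(state, a)])

# Usage: one observed transition changes the table, and with it the policy.
q = defaultdict(float)
q_learning_update(q, "s0", "right", 1.0, "s1", ["left", "right"], alpha=0.5)
```

The sketch does not treat terminal states specially; a fuller version would use a target of just `reward` when `next_state` ends the episode.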

[Figure: stochastic transitions, with outcome probabilities P = 0.8, P = 0.1, and P = 0.1]

Model (reward function and transition probabilities) is known
- discrete states: Dynamic Programming
- continuous states: Value Function Approximation + Dynamic Programming

Model (reward function or transition probabilities) is unknown
- discrete states: Reinforcement Learning, Monte Carlo Methods
- continuous states: Value Function Approximation + Reinforcement Learning

- Standard rules of blackjack hold
- State space:
  - element[0] - current value of player's hand (4-21)
  - element[1] - value of dealer's face-up card (2-11)
  - element[2] - whether the player has a usable ace (0/1)
- Starting states:
  - player has any 2 cards (uniformly distributed), dealer has any 1 card (uniformly distributed)
- Actions:
  - HIT
  - STICK
- Rewards:
  - -1 for a loss
  - 0 for a draw
  - 1 for a win
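The three-element state described above could be built as follows. This is a sketch under assumptions: cards are given as integers with an ace as 1 and face cards as 10, a "usable" ace is one that can count as 11 without the hand exceeding 21, and the dealer's ace is reported as 11 to match the 2-11 range. The function names are hypothetical.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a hand given as ints (ace = 1, face cards = 10)."""
    total = sum(cards)
    # An ace is usable if counting one ace as 11 (i.e. adding 10) does not bust the hand.
    usable_ace = 1 in cards and total + 10 <= 21
    if usable_ace:
        total += 10
    return total, int(usable_ace)

def make_state(player_cards, dealer_card):
    """Build (player hand value, dealer face-up card value, usable-ace flag)."""
    total, usable = hand_value(player_cards)
    dealer_value = 11 if dealer_card == 1 else dealer_card
    return (total, dealer_value, usable)
```

For instance, an ace and a six form a hand worth 17 with a usable ace; drawing a ten afterwards keeps the value at 17 but the ace is no longer usable.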

States
- Grid cells

Actions
- Left
- Up
- Right
- Down

Rewards
- Bonus: 20
- Food: 1
- Predator: -10
- Empty grid cell: -0.1

Transition probabilities
- 0.80: the agent goes where it intends to go
- 0.20: the agent moves to any other adjacent grid cell or remains where it was (if it is on the border of the grid world, it goes to the other side)
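The transition rule above can be sketched as a sampling function. Two points are assumptions not stated in the slides: the 0.20 is split uniformly between the three other directions and staying put, and "goes to the other side" is modeled as wrap-around in both axes. Cells are `(row, col)` pairs, and `sample_next_cell` is a hypothetical name.

```python
import random

# Row/column offsets for the four actions.
MOVES = {"Left": (0, -1), "Up": (-1, 0), "Right": (0, 1), "Down": (1, 0)}

def sample_next_cell(cell, action, n_rows, n_cols, rng=random):
    """Sample the next cell for the given action under the 0.8 / 0.2 rule."""
    row, col = cell
    if rng.random() < 0.8:
        d_row, d_col = MOVES[action]          # intended move
    else:
        # Assumption: uniform choice among the other three directions and staying put.
        others = [m for name, m in MOVES.items() if name != action] + [(0, 0)]
        d_row, d_col = rng.choice(others)
    # Wrap around at the border so the agent reappears on the other side.
    return ((row + d_row) % n_rows, (col + d_col) % n_cols)
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, for example `sample_next_cell((0, 0), "Left", 3, 4, rng=random.Random(0))`.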