
KI2 - 11

Reinforcement Learning

Sander van Dijk

Kunstmatige Intelligentie / RuG

What is Learning?
  • Percepts received by an agent should be used not only for acting, but also for improving the agent’s ability to behave optimally in the future to achieve its goal.
  • Interaction between an agent and the world
Learning Types
  • Supervised learning:
    • (Input, output) pairs of the function to be learned can be perceived or are given. Example: back-propagation.
  • Unsupervised learning:
    • No information at all about the correct output is given. Example: self-organising maps (SOM).
  • Reinforcement learning:
    • Agent receives no examples and starts with no model of the environment and no utility function. Agent gets feedback through rewards, or reinforcement.
Reinforcement Learning
  • Task
    • Learn how to behave successfully to achieve a goal while interacting with an external environment

    • Learn through experience, from trial and error

  • Examples
    • Game playing: The agent knows it has won or lost, but it doesn’t know the appropriate action in each state
    • Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
Elements of RL
  • Transition model, how actions influence states
  • Reward R, immediate value of a state-action transition
  • Policy π, maps states to actions

[Diagram: agent-environment loop; the agent's policy maps the current state to an action, the environment returns the next state and a reward]
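One way to read this slide is as the skeleton of every RL program. Below is a minimal sketch of that interaction loop, with the transition model and reward hidden inside the environment; all names and the toy dynamics are illustrative, not from the slides.

    # Skeleton of the agent-environment loop from the diagram (illustrative names).
    def environment_step(state, action):
        next_state = (state + action) % 4          # toy transition model
        reward = 100 if next_state == 3 else 0     # toy reward signal
        return next_state, reward

    def policy(state):
        return 1                                   # placeholder policy: always the same action

    state = 0
    for t in range(5):
        action = policy(state)                               # agent: state -> action
        state, reward = environment_step(state, action)      # environment: (state, action) -> state, reward
        print(t, state, reward)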

Elements of RL

[Figure: grid world showing r(state, action), the immediate reward values: 0 for every transition except those entering the goal state G, which have reward 100]
Elements of RL
  • Value function: maps states to state values

V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + …

  • Discount factor γ ∈ [0, 1) (here 0.9)

[Figure: grid worlds showing the immediate reward values r(state, action) and the corresponding V*(state) values, which fall off from 100 next to the goal G to 90 and 81 further away]

RL task (restated)
  • Execute actions in the environment, observe the results.
  • Learn an action policy π : state → action that maximizes the expected discounted reward

E[r(t) + γ r(t+1) + γ² r(t+2) + …]

from any starting state in S
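For a concrete feel for this quantity, the discounted sum can be computed directly for a finite reward sequence. A small sketch, with γ = 0.9 as on the earlier slide and an illustrative reward sequence:

    # Discounted return r(t) + γ r(t+1) + γ² r(t+2) + ... for a finite reward sequence.
    def discounted_return(rewards, gamma=0.9):
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    print(discounted_return([0, 0, 100]))   # 0 + 0.9*0 + 0.81*100 = 81.0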

Reinforcement Learning
  • Target function is π : state → action
  • However…
    • We have no training examples of form <state, action>
    • Training examples are of form

<<state, action>, reward>

Utility-based agents
  • Try to learn V^π* (abbreviated V*)
  • Perform lookahead search to choose the best action from any state s
  • Works well if the agent knows
    • δ : state × action → state
    • r : state × action → ℝ
  • When the agent doesn't know δ and r, it cannot choose actions this way
Q-values
  • Q-values
    • Define a new function, very similar to V*
    • If the agent learns Q, it can choose the optimal action even without knowing δ or r
  • Using Q
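The formulas this bullet refers to did not survive the transcript. In the standard deterministic formulation (an assumption here, following e.g. Mitchell), Q and the policy derived from it are:

    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

    π*(s) = argmax_a Q(s, a)

So an agent that knows Q can act optimally by taking the argmax over its own table, with no need for the model δ or the reward function r.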
Learning the Q-value
  • Note: Q and V* closely related
  • Allows us to write Q recursively as
  • Temporal Difference learning
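The recursive form and the update rule referred to above are missing from the transcript; the standard versions, under the same deterministic assumptions, are:

    V*(s) = max_a' Q(s, a')

    Q(s, a) = r(s, a) + γ max_a' Q(δ(s, a), a')

Temporal Difference learning turns this identity into a training rule: after observing (s, a, r, s'), move the estimate towards the one-step target, Q(s, a) ← r + γ max_a' Q(s', a').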
Learning the Q-value
  • FOR each <s, a> DO
    • Initialize table entry: Q(s, a) ← 0
  • Observe current state s
  • WHILE (true) DO
    • Select action a and execute it
    • Receive immediate reward r
    • Observe new state s'
    • Update table entry for <s, a> as follows: Q(s, a) ← r + γ max_a' Q(s', a')
    • Move to the new state: s ← s'
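A minimal tabular implementation of this loop, sketched in Python on a toy deterministic chain of four states; the environment, the restart-at-goal behaviour, and the pure-random action selection are illustrative assumptions, with γ = 0.9 as in the slides.

    # Tabular Q-learning on a toy deterministic environment (illustrative, not from the slides).
    import random
    from collections import defaultdict

    GAMMA = 0.9
    ACTIONS = ["left", "right"]

    def step(state, action):
        # Toy transition model: states 0..3 in a row, state 3 is the goal.
        next_state = min(state + 1, 3) if action == "right" else max(state - 1, 0)
        reward = 100 if next_state == 3 else 0
        return next_state, reward

    Q = defaultdict(float)                      # every table entry Q(s, a) starts at 0
    state = 0
    for _ in range(2000):
        action = random.choice(ACTIONS)         # action selection kept trivially random here
        next_state, reward = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] = reward + GAMMA * best_next    # deterministic Q-learning update
        state = 0 if next_state == 3 else next_state       # restart the episode at the goal

    print({k: round(v, 1) for k, v in Q.items()})

Because the toy environment is deterministic, the table settles into discounted values of the same kind as the grid figures (100 next to the goal, then 90, 81, and so on further away).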
Q-learning
  • Q-learning learns the expected utility of taking a particular action a in a particular state s (the Q-value of the pair (s, a))

[Figure: grid worlds showing the immediate reward values r(state, action), the learned Q(state, action) values (72, 81, 90, 100), and the resulting V*(state) values around the goal state G]

Representation
  • Explicit
  • Implicit
    • Weighted linear function / neural network; classical weight updating
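As a sketch of the implicit option, Q can be approximated by a weighted linear function of hand-chosen state-action features, with classical delta-rule weight updating towards a one-step target. The features, learning rate, and names below are illustrative assumptions.

    # Linear (implicit) representation of Q: Q(s, a) ≈ w · features(s, a).
    def features(state, action):
        return [1.0, float(state), float(action)]         # illustrative feature vector

    def q_value(w, state, action):
        return sum(wi * xi for wi, xi in zip(w, features(state, action)))

    def update(w, state, action, target, lr=0.05):
        # Delta-rule weight update: nudge the prediction towards the target.
        pred = q_value(w, state, action)
        x = features(state, action)
        return [wi + lr * (target - pred) * xi for wi, xi in zip(w, x)]

    w = [0.0, 0.0, 0.0]
    w = update(w, state=1, action=1, target=90.0)          # one illustrative update step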
Exploration
  • Agent follows the policy deduced from the learned Q-values
  • The agent then always performs the same action in a given state, but perhaps there is an even better action?
  • Exploration trade-off: be safe <-> learn more, greed <-> curiosity.
  • Extremely hard, if not impossible, to obtain optimal exploration policy.
  • Randomly try actions that have not been tried often before but avoid actions that are believed to be of low utility
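One simple, common way to implement this compromise is ε-greedy action selection: mostly exploit the learned Q-values, occasionally try a random action. A sketch, with an arbitrary ε value:

    # ε-greedy: exploit the learned Q-values most of the time, explore occasionally.
    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions)                            # explore
        return max(actions, key=lambda a: Q.get((state, a), 0.0))    # exploit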
Enhancement: Q(λ)
  • Q-learning estimates one time step difference
  • Why not for n steps?
Enhancement: Q(λ)
  • Q(λ) formula
  • Intuitive idea: use a constant 0 ≤ λ ≤ 1 to combine estimates from various lookahead distances (note the normalization factor (1 - λ))
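The Q(λ) formula itself is missing from the transcript; the standard form (an assumption, following the usual TD(λ) notation) combines the n-step estimates Q(n) as:

    Q(n)(s_t, a_t) = r_t + γ r_{t+1} + … + γ^{n-1} r_{t+n-1} + γ^n max_a Q(s_{t+n}, a)

    Q(λ)(s_t, a_t) = (1 - λ) [ Q(1)(s_t, a_t) + λ Q(2)(s_t, a_t) + λ² Q(3)(s_t, a_t) + … ]

With λ = 0 this reduces to the one-step Q-learning estimate; larger λ gives more weight to longer lookahead, and the factor (1 - λ) keeps the weights summing to 1.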
Enhancement: Eligibility Traces
  • Look backward instead of forward.
  • Weigh updates by the eligibility trace e(s, a).
  • On each step, decay all traces by γλ and increment the trace for the current state-action pair by 1.
  • Update all state-action pairs in proportion to their eligibility.
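A sketch of one step of this backward-view bookkeeping in code; the dictionaries, learning rate α, and λ value are illustrative, with traces decayed by γλ as described above.

    # One step of an eligibility-trace (backward-view) update.
    def trace_update(Q, e, s, a, td_error, gamma=0.9, lam=0.8, alpha=0.1):
        e[(s, a)] = e.get((s, a), 0.0) + 1.0           # increment trace for the current pair
        for key in list(e):
            Q[key] = Q.get(key, 0.0) + alpha * td_error * e[key]   # update in proportion to eligibility
            e[key] *= gamma * lam                      # decay every trace by γλ
        return Q, e

    Q, e = trace_update({}, {}, s=0, a="right", td_error=10.0)     # illustrative call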
Genetic algorithms
  • Imagine the individuals as agent functions
  • Fitness function as performance measure or reward function
  • No attempt is made to learn the relationship between the rewards and the actions taken by an agent
  • Simply search directly in the space of individuals to find one that maximizes the fitness function
Genetic algorithms
  • Represent an individual as a binary string
  • Selection works like this: if individual X scores twice as high as Y on the fitness function, then X is twice as likely to be selected for reproduction as Y.
  • Reproduction is accomplished by cross-over and mutation
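A minimal sketch of these three operators on binary strings; the population size, string length, mutation rate, and stand-in fitness function are all illustrative.

    # Fitness-proportionate selection, single-point crossover, and bit-flip mutation.
    import random

    def fitness(individual):
        return sum(individual) + 1                      # stand-in fitness: count of 1-bits

    def select(population):
        weights = [fitness(ind) for ind in population]  # twice the fitness -> twice the chance
        return random.choices(population, weights=weights, k=1)[0]

    def crossover(x, y):
        point = random.randrange(1, len(x))             # single cross-over point
        return x[:point] + y[point:]

    def mutate(ind, rate=0.01):
        return [bit ^ 1 if random.random() < rate else bit for bit in ind]

    population = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
    for _ in range(50):                                 # one generation per iteration
        population = [mutate(crossover(select(population), select(population)))
                      for _ in range(len(population))]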
Cart-Pole balancing
  • Demonstration

http://www.bovine.net/~jlawson/hmc/pole/sane.html

Summary
  • RL addresses the problem of learning control strategies for autonomous agents
  • TD-algorithms learn by iteratively reducing the differences between the estimates produced by the agent at different times
  • In Q-learning an evaluation function over states and actions is learned
  • In the genetic approach, the relation between rewards and actions is not learned. You simply search the space of individuals directly, guided by the fitness function.