Reinforcement Learning PowerPoint Presentation
Presentation Transcript

  1. Reinforcement Learning 16 January 2009 RG Knowledge Based Systems Hans Kleine Büning

  2. Outline • Motivation • Applications • Markov Decision Processes • Q-learning • Examples

  3. How to program a robot to ride a bicycle?

  4. Reinforcement Learning: The Idea • A way of programming agents by reward and punishment without specifying how the task is to be achieved

  5. Learning to Ride a Bicycle (diagram: the agent sends an action to the environment; the environment returns a state and a reward, drawn as €€€)

  6. Learning to Ride a Bicycle: States • Angle of handle bars • Angular velocity of handle bars • Angle of bicycle to vertical • Angular velocity of bicycle to vertical • Acceleration of angle of bicycle to vertical

  7. Learning to Ride a Bicycle (diagram: the agent sends an action to the environment; the environment returns a state and a reward, drawn as €€€)

  8. Learning to Ride a Bicycle: Actions • Torque to be applied to the handle bars • Displacement of the center of mass from the bicycle’s plane (in cm)

  9. Learning to Ride a Bicycle (diagram: the agent sends an action to the environment; the environment returns a state and a reward, drawn as €€€)

  10. Is the angle of the bicycle to the vertical greater than 12°? • yes: Reward = -1 • no: Reward = 0
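
The reward rule on this slide can be written as a one-line check. The following is a minimal sketch, not the presenter's code: the name `bicycle_reward`, the use of the absolute angle, and the reading that the "yes" branch (tilt above 12°) gives -1 are all assumptions.

```python
FALL_ANGLE_DEG = 12.0  # threshold taken from the slide


def bicycle_reward(angle_to_vertical_deg: float) -> int:
    """Return -1 if the bicycle has tipped past the threshold, else 0."""
    # Assumption: tilting left or right is treated the same way,
    # so only the magnitude of the angle is compared.
    if abs(angle_to_vertical_deg) > FALL_ANGLE_DEG:
        return -1
    return 0
```

Because the only non-zero reward is a penalty for falling, the agent is told what to avoid but never how to balance, which matches the idea of programming by reward and punishment.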

  11. Learning To Ride a Bicycle Reinforcement Learning

  12. Reinforcement Learning: Applications • Board games: the TD-Gammon program, based on reinforcement learning, has become a world-class backgammon player • Mobile robot control: learning to ride a bicycle, navigation, pole-balancing, Acrobot • Sequential process control: elevator dispatching

  13. Key Features of Reinforcement Learning • Learner is not told which actions to take • Trial and error search • Possibility of delayed reward: • Sacrifice of short-term gains for greater long-term gains • Explore/Exploit trade-off • Considers the whole problem of a goal-directed agent interacting with an uncertain environment
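
One common way to handle the explore/exploit trade-off is epsilon-greedy action selection. The slides do not say which strategy they use, so this is only an illustrative sketch; `epsilon_greedy` and its parameters are hypothetical names.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Choose an action index from a list of estimated action values.

    With probability epsilon a uniformly random action is taken (explore);
    otherwise the action with the highest estimate is taken (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Ties are broken in favour of the first best action.
    return q_values.index(max(q_values))
```

With `epsilon = 0` the choice is purely greedy; with `epsilon = 1` it is purely random. Values in between trade short-term reward for information about actions that have not been tried often.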

  14. The Agent-Environment Interaction • Agent and environment interact at discrete time steps: t = 0, 1, 2, … • Agent observes state at step t: $s_t \in S$ • produces action at step t: $a_t \in A$ • gets resulting reward: $r_{t+1} \in \mathbb{R}$ • and resulting next state: $s_{t+1} \in S$
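
The interaction loop described on this slide can be sketched in a few lines. Everything here is illustrative: `CounterEnv` is a hypothetical toy environment (the state is a step counter and action 1 earns a reward of 1), and the `reset`/`step` method names are an assumed convention, not something the slides define.

```python
class CounterEnv:
    """Hypothetical toy environment used only to show the loop."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial state s_0

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0  # r_{t+1}
        self.t += 1                           # s_{t+1}
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy):
    """Run one episode; return a list of (s_t, a_t, r_{t+1}, s_{t+1})."""
    trajectory = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                       # agent acts on s_t
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory
```

For example, `run_episode(CounterEnv(horizon=3), lambda s: 1)` runs three steps with a policy that always picks action 1.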

  15. The Agent’s Goal • Coarsely, the agent’s goal is to get as much reward as it can over the long run • A policy $\pi$ is a mapping from states to actions: $\pi(s) = a$ • Reinforcement learning methods specify how the agent changes its policy as a result of experience

  16. Deterministic Markov Decision Process

  17. Example

  18. Example: Corresponding MDP

  19. Example: Corresponding MDP

  20. Example: Corresponding MDP

  21. Example: Policy

  22. Value of Policy and Rewards

  23. Value of Policy and Agent’s Task
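
A common way to define the value of a policy is the discounted cumulative reward collected when following it, $V^{\pi}(s_t) = \sum_{i \ge 0} \gamma^{i} r_{t+1+i}$ with a discount factor $0 \le \gamma < 1$. The slide body is not reproduced in this transcript, so treat this definition and the helper name `discounted_return` as assumptions rather than the presenter's exact formulation.

```python
def discounted_return(rewards, gamma):
    """Sum rewards r_{t+1}, r_{t+2}, ... weighted by gamma^0, gamma^1, ...

    The sum is accumulated backwards so each step needs one multiply-add.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

A smaller `gamma` makes the agent short-sighted, while a value close to 1 lets rewards that arrive many steps later still count, which is what allows short-term sacrifices for long-term gains.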

  24. Nondeterministic Markov Decision Process P = 0.8 P = 0.1 P = 0.1
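
The three probabilities on this slide can be read as a transition distribution for a single action. The sketch below assumes a grid world in which the intended move succeeds with P = 0.8 and the agent slips to each perpendicular direction with P = 0.1; that interpretation, the grid coordinates, and all names are assumptions, since the slide's figure is not part of the transcript.

```python
import random

# (dx, dy) offsets for the four compass directions on a hypothetical grid.
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
# For each intended direction, the two perpendicular "slip" directions.
PERPENDICULAR = {"N": ("W", "E"), "S": ("E", "W"),
                 "E": ("N", "S"), "W": ("S", "N")}


def transition_distribution(action):
    """Return {actual_direction: probability} for an intended action."""
    left, right = PERPENDICULAR[action]
    return {action: 0.8, left: 0.1, right: 0.1}


def sample_next_state(state, action, rng=random):
    """Sample the next (x, y) cell given the current cell and intended action."""
    dist = transition_distribution(action)
    directions = list(dist)
    weights = [dist[d] for d in directions]
    actual = rng.choices(directions, weights=weights)[0]
    dx, dy = MOVES[actual]
    return (state[0] + dx, state[1] + dy)
```

In a nondeterministic MDP the same state and action can therefore lead to different next states, so values have to be defined as expectations over these outcomes.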

  25. Nondeterministic Markov Decision Process

  26. Nondeterministic Markov Decision Process

  27. Example with South-Eastern Wind

  28. Example with South-Eastern Wind

  29. Methods • Model (reward function and transition probabilities) is known: • discrete states: Dynamic Programming • continuous states: Value Function Approximation + Dynamic Programming • Model (reward function or transition probabilities) is unknown: • discrete states: Reinforcement Learning, Monte Carlo Methods • continuous states: Value Function Approximation + Reinforcement Learning

  30. Q-learning Algorithm

  31. Q-learning Algorithm
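
The standard tabular Q-learning update is $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]$. The sketch below applies it to a hypothetical four-state corridor; the environment `chain_step`, the hyperparameter values, and the epsilon-greedy exploration are assumptions made for illustration and are not taken from the slides.

```python
import random


def q_learning(n_states, n_actions, step, episodes=200,
               alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning; every episode starts in state 0."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)          # explore
            else:
                best = max(q[state])                       # exploit,
                action = rng.choice(                       # random tie-break
                    [a for a in range(n_actions) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # No bootstrapping from a terminal state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def chain_step(state, action):
    """Hypothetical corridor 0-1-2-3: action 1 moves right, action 0 left.

    Entering state 3 gives reward 100 and ends the episode.
    """
    next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    done = next_state == 3
    return next_state, (100.0 if done else 0.0), done
```

Calling `q_learning(4, 2, chain_step)` returns a Q-table in which moving right has the higher value in every non-terminal state. Setting `alpha = 1` reduces the update to the simpler rule $Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$ that is often used for deterministic MDPs.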

  32. Example

  33. Example: Q-table Initialization

  34. Example: Episode 1

  35. Example: Episode 1

  36. Example: Episode 1

  37. Example: Episode 1

  38. Example: Episode 1

  39. Example: Q-table

  40. Example: Episode 1

  41. Episode 1

  42. Example: Q-table

  43. Example: Episode 2

  44. Example: Episode 2

  45. Example: Episode 2

  46. Example: Q-table after Convergence

  47. Example: Value Function after Convergence

  48. Example: Optimal Policy

  49. Example: Optimal Policy

  50. Q-learning