
Introduction to Reinforcement Learning


Presentation Transcript


  1. Introduction to Reinforcement Learning. Freek Stulp

  2. Overview • General principles of RL • Markov Decision Process as model • Values of states: V(s) • Values of state-actions: Q(a,s) • Exploration vs. Exploitation • Issues in RL • Conclusion

  3. General principles of RL • Neural networks are supervised learning algorithms: for each input, we know the desired output. • What if we don't know the output for each input? Example: a flight control system. • Let the agent learn how to achieve certain goals itself, through interaction with the environment.

  4. General principles of RL • [Diagram: the agent-environment loop; the agent sends actions to the environment and receives percepts and rewards in return.] • Rewards are used to specify goals (example: training dogs). • Let the agent learn how to achieve certain goals itself, through interaction with the environment. This alone does not solve the problem!
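
A minimal sketch of this interaction loop in code (the class and method names below, such as Environment.step and Agent.act, are hypothetical placeholders, not from the slides):

```python
# Minimal agent-environment interaction loop (hypothetical interfaces).

class Environment:
    def reset(self):
        """Return the initial percept (state)."""
        raise NotImplementedError

    def step(self, action):
        """Apply an action; return (percept, reward, done)."""
        raise NotImplementedError

class Agent:
    def act(self, percept):
        """Choose an action given the current percept."""
        raise NotImplementedError

    def learn(self, percept, action, reward, next_percept):
        """Update internal estimates from one interaction step."""
        pass

def run_episode(env, agent):
    percept = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(percept)
        next_percept, reward, done = env.step(action)
        agent.learn(percept, action, reward, next_percept)
        total_reward += reward
        percept = next_percept
    return total_reward
```

Each pass through the loop is one percept-action-reward exchange; the agent's learning rule is deliberately left abstract here.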

  5. Popular model: MDPs • Markov Decision Process = {S, A, R, T} • Set of states S • Set of actions A • Reward function R • Transition function T • Markov property: the transition probability Tss' depends only on the current state s (and the chosen action), not on the earlier history • Policy: π: S → A • Problem: find the policy π that maximizes the reward • Discounted reward: r0 + γr1 + γ²r2 + ... + γⁿrn • [Diagram: a trajectory s0 -a0/r0-> s1 -a1/r1-> s2 -a2/r2-> s3]
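
To make the {S, A, R, T} notation concrete, here is a small tabular MDP written as plain Python dictionaries (the particular states, actions, rewards, and probabilities are invented for illustration):

```python
# A tiny tabular MDP {S, A, R, T} kept as plain dictionaries (illustrative).
import random

states  = ["s0", "s1", "s2", "terminal"]
actions = ["left", "right"]

# R[s]: reward received in state s.
R = {"s0": 0.0, "s1": -1.0, "s2": -1.0, "terminal": 0.0}

# T[(s, a)]: list of (next_state, probability) pairs; the Markov property
# means these probabilities depend only on s and a, not on earlier history.
T = {
    ("s0", "right"): [("s1", 1.0)],
    ("s0", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("s2", 0.9), ("s0", 0.1)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s2", "right"): [("terminal", 1.0)],
    ("s2", "left"):  [("s1", 1.0)],
}

def sample_next(s, a):
    """Draw a successor state according to T (the environment dynamics)."""
    next_states, probs = zip(*T[(s, a)])
    return random.choices(next_states, weights=probs)[0]

def discounted_return(rewards, gamma=0.9):
    """r0 + γ·r1 + γ²·r2 + ... for one trajectory of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```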

  6. Values of states: Vπ(s) • Definition of value Vπ(s): the cumulative reward obtained when starting in state s and executing policy π until a terminal state is reached. • The optimal policy yields V*(s). • [Figure: a 4x4 grid world showing the rewards R (-1 per step), the values Vπ(s) under a random policy (0, -14, -20, -22, ...), and the values V*(s) under the optimal policy (0, -1, -2, -3, ...).]
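
The Vπ(s) grid for the random policy can be reproduced with iterative policy evaluation; the sketch below assumes the usual 4x4 grid world with terminal states in two opposite corners and a reward of -1 per step (the exact layout is an assumption, not stated on the slide):

```python
# Iterative policy evaluation for a 4x4 grid world under the uniform random
# policy. Terminal states are assumed in the top-left and bottom-right corners,
# with reward -1 on every step and no discounting.
N = 4
terminal = {(0, 0), (N - 1, N - 1)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(cell, move):
    """Deterministic transition: moves off the grid leave the state unchanged."""
    r, c = cell[0] + move[0], cell[1] + move[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else cell

V = {(r, c): 0.0 for r in range(N) for c in range(N)}
for _ in range(1000):  # sweep until (approximately) converged
    for s in V:
        if s in terminal:
            continue
        # Bellman expectation backup for the random policy (undiscounted).
        V[s] = sum(0.25 * (-1.0 + V[step(s, m)]) for m in moves)

# After convergence the values should resemble the slide's grid (-14, -20, -22, ...).
for r in range(N):
    print([round(V[(r, c)]) for c in range(N)])
```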

  7. Determining Vπ(s) • Dynamic programming: Vπ(s) = R(s) + Σs' Tss' Vπ(s'). Necessary to consider all states (-). • TD-learning: V(s) ← V(s) + α(R(s) + V(s') - V(s)). Only visited states are used (+).
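
As a sketch, the TD-learning rule for V(s) can be written as a small tabular update applied along observed trajectories (the function signature and the value of alpha are illustrative, not from the slides):

```python
# Tabular TD(0) update for state values, undiscounted as on the slide.
from collections import defaultdict

def td0(episodes, alpha=0.1):
    """episodes: iterable of trajectories [(s, r, s_next), ...] from experience."""
    V = defaultdict(float)
    for trajectory in episodes:
        for s, r, s_next in trajectory:
            # Move V(s) toward the bootstrapped target r + V(s'); no model T needed.
            V[s] += alpha * (r + V[s_next] - V[s])
    return V
```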

  8. Values of state-actions: Q(a,s) • Q(a,s): the value of doing action a in state s. • Dynamic programming: Q(a,s) = R(s) + Σs' Tss' maxa' Q(a',s'). • TD-learning: Q(a,s) ← Q(a,s) + α(R(s) + maxa' Q(a',s') - Q(a,s)). • T does not appear in this formula: model-free learning!
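
The corresponding model-free update for Q(a,s), sketched in code (the names and the value of alpha are illustrative):

```python
# Tabular Q-learning update, undiscounted as on the slide.
from collections import defaultdict

Q = defaultdict(float)  # keyed by (action, state), following the slides' Q(a, s)

def q_update(s, a, r, s_next, actions, alpha=0.1):
    """One TD update of Q(a, s) after observing reward r and next state s_next."""
    best_next = max(Q[(a2, s_next)] for a2 in actions)
    # No transition model T anywhere in this update: model-free learning.
    Q[(a, s)] += alpha * (r + best_next - Q[(a, s)])
```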

  9. Exploration vs. Exploitation • Only exploitation: new (maybe better) paths are never discovered. • Only exploration: what has been learned is never exploited. • Good trade-off: explore first to learn, exploit later to benefit.
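
One common way to implement this trade-off is ε-greedy action selection, sketched below (epsilon and the Q-table layout are illustrative assumptions); decreasing epsilon over time matches the "explore first, exploit later" idea:

```python
# ε-greedy action selection over a tabular Q function keyed by (action, state).
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: Q.get((a, s), 0.0))    # exploit current estimates
```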

  10. Some issues • Hidden state: if you don't know where you are, you can't know what to do. • Curse of dimensionality: very large state spaces. • Continuous state/action spaces: the algorithms above use discrete tables; what about continuous values? • Many of your articles discuss solutions to these problems.

  11. Conclusion • RL: learning through interaction and rewards. • The Markov Decision Process is a popular model. • Values of states: V(s). • Values of state-actions: Q(a,s) (model-free!). • Still some problems: not quite ready for complex real-world problems yet, but research is underway!

  12. Literature • Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig • Machine Learning, Tom M. Mitchell • Reinforcement Learning: A Tutorial, Mance E. Harmon and Stephanie S. Harmon
