Introduction to Reinforcement Learning Dr Kathryn Merrick k.merrick@adfa.edu.au 2008 Spring School on Optimisation, Learning and Complexity Friday 7th November, 15:30-17:00
Reinforcement Learning is… … learning from trial-and-error and reward by interaction with an environment.
Today’s Lecture
• A formal framework: Markov Decision Processes
• Optimality criteria
• Value functions
• Solution methods: Q-learning
• Examples and exercises
• Alternative models
• Summary and applications
Markov Decision Processes The reinforcement learning problem can be represented as:
• A set S of states {s1, s2, s3, …}
• A set A of actions {a1, a2, a3, …}
• A transition function T: S x A → S (deterministic) or T: S x A x S → [0, 1] (stochastic)
• A reward function R: S x A → Real or R: S x A x S → Real
• A policy π: S → A (deterministic) or π: S x A → [0, 1] (stochastic)
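For illustration (not part of the original slides), a small deterministic MDP might be sketched in Python as below; the state names, action names and numbers are invented placeholders.

```python
# A minimal sketch of a small deterministic MDP (illustrative names only).
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    transition: Dict[Tuple[State, Action], State]  # T: S x A -> S (deterministic)
    reward: Dict[Tuple[State, Action], float]      # R: S x A -> Real

# A two-state example: the agent can 'stay' or 'move'.
mdp = MDP(
    states=["s1", "s2"],
    actions=["stay", "move"],
    transition={("s1", "stay"): "s1", ("s1", "move"): "s2",
                ("s2", "stay"): "s2", ("s2", "move"): "s1"},
    reward={("s1", "stay"): 0.0, ("s1", "move"): 1.0,
            ("s2", "stay"): 0.0, ("s2", "move"): 0.0},
)
```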
Optimality Criteria Suppose an agent receives a reward rt at time t. Then optimal behaviour might:
• Maximise the sum of expected future reward: E[ Σt rt ]
• Maximise over a finite horizon: E[ Σt=0..h rt ]
• Maximise over an infinite horizon: E[ Σt=0..∞ rt ]
• Maximise over a discounted infinite horizon: E[ Σt=0..∞ γt rt ], with 0 ≤ γ < 1
• Maximise average reward: limh→∞ (1/h) E[ Σt=0..h rt ]
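As a rough numeric illustration (not from the original slides; the reward sequence and the value of γ are made up), the discounted and average-reward criteria can be computed over a finite trajectory like this:

```python
# Sketch: discounted and average return for a finite reward sequence.
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over the trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """Average reward per step over the trajectory."""
    return sum(rewards) / len(rewards)

print(discounted_return([0, 0, 1, 1], gamma=0.9))  # 0.9**2 + 0.9**3 = 1.539
print(average_reward([0, 0, 1, 1]))                # 0.5
```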
Value Functions
• State value function Vπ: S → Real, written Vπ(s): the expected sum of discounted reward for following the policy π from state s to the end of time.
• State-action value function Qπ: S x A → Real, written Qπ(s, a): the expected sum of discounted reward for starting in state s, taking action a once, then following the policy π from state s’ to the end of time.
Optimal State Value Function V*(s) = maxa E{ R(s, a, s’) + γV*(s’) | s, a } = maxa Σs’ T(s, a, s’) [ R(s, a, s’) + γV*(s’) ]
• A Bellman equation
• Can be solved using dynamic programming (a sketch follows below)
• Requires knowledge of the transition function T
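A minimal dynamic-programming sketch (not from the slides) of solving this Bellman equation by value iteration; it assumes the model is given as hypothetical dictionaries T and R indexed by (s, a, s’) triples.

```python
# Sketch: value iteration for the optimal state value function V*(s).
# T[(s, a, s2)] is the transition probability, R[(s, a, s2)] the reward;
# both tables are assumed inputs, not defined in the original slides.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: max over actions of the expected one-step return.
            best = max(
                sum(T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```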
Optimal State-Action Value Function Q*(s, a) = E{ R(s, a, s’) + γ maxa’ Q*(s’, a’) | s, a } = Σs’ T(s, a, s’) [ R(s, a, s’) + γ maxa’ Q*(s’, a’) ]
• Also a Bellman equation
• Also requires knowledge of the transition function T to solve using dynamic programming
• Can now define action selection: π*(s) = argmaxa Q*(s, a)
Solution Methods
• Model based:
  • For example, dynamic programming
  • Require a model (transition function) of the environment for learning
• Model free:
  • Learn from interaction with the environment without requiring a model
  • For example, Q-learning…
Q-Learning by Example: Driving in Canberra [State transition diagram: the states Parked Clean, Parked Dirty, Driving Clean and Driving Dirty, connected by the actions Park, Drive and Clean.]
Formulating the Problem (a small code sketch follows the list)
• States: s1 Parked clean, s2 Parked dirty, s3 Driving clean, s4 Driving dirty
• Actions: a1 Drive, a2 Clean, a3 Park
• Reward: rt = 1 for transitions to a ‘clean’ state, 0 otherwise
• State-Action Table or Q-Table
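One way this formulation might look in Python (an illustrative sketch, not from the slides; the string identifiers are invented, and only the reward definition is taken from the slide):

```python
# Sketch: a Q-table for the Canberra driving example, indexed by (state, action).
states = ["parked_clean", "parked_dirty", "driving_clean", "driving_dirty"]
actions = ["drive", "clean", "park"]

# Initialise all Q-values to zero.
Q = {(s, a): 0.0 for s in states for a in actions}

def reward(next_state):
    """1 for transitions into a 'clean' state, 0 otherwise."""
    return 1.0 if next_state.endswith("clean") else 0.0
```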
A Q-Learning Agent st Learning update to πt Action selection from πt at rt Agent Environment
Q-Learning Algorithmic Components (a code sketch follows below)
• Learning update (to Q-Table): Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ maxa’ Q(s’, a’) ] or, equivalently, Q(s, a) ← Q(s, a) + α [ r + γ maxa’ Q(s’, a’) − Q(s, a) ]
• Action selection (from Q-Table): a = f(Q(s, a))
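A sketch of both components in Python, continuing the illustrative Q-table above. The slides leave the selection function f open; ε-greedy selection is used here as one common choice, and the α, γ and ε values are assumptions.

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update: Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def select_action(Q, s, actions, epsilon=0.1):
    """Epsilon-greedy action selection from the Q-table."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit
```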
Exercise You need to program a small robot to learn to find food.
• What assumptions will you make about the robot’s sensors and actuators to represent the environment?
• How could you model the problem as an MDP?
• Calculate a few learning iterations in your domain by hand.
Alternatives
• Function approximation of the Q-table:
  • Neural networks
  • Decision trees
  • Gradient descent methods
• Reinforcement learning variants:
  • Relational reinforcement learning
  • Hierarchical reinforcement learning
  • Intrinsically motivated reinforcement learning
References and Further Reading
• Sutton, R., Barto, A., (2000) Reinforcement Learning: An Introduction, The MIT Press. http://www.cs.ualberta.ca/~sutton/book/the-book.html
• Kaelbling, L., Littman, M., Moore, A., (1996) Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, 4:237-285
• Barto, A., Mahadevan, S., (2003) Recent Advances in Hierarchical Reinforcement Learning, Discrete Event Dynamic Systems: Theory and Applications, 13(4):41-77