Introduction to Reinforcement Learning Dr Kathryn Merrick k.merrick@adfa.edu.au 2008 Spring School on Optimisation, Learning and Complexity Friday 7th November, 15:30-17:00
Reinforcement Learning is… … learning from trial-and-error and reward by interaction with an environment.
Today’s Lecture
• A formal framework: Markov Decision Processes
• Optimality criteria
• Value functions
• Solution methods: Q-learning
• Examples and exercises
• Alternative models
• Summary and applications
Markov Decision Processes The reinforcement learning problem can be represented as:
• A set S of states {s1, s2, s3, …}
• A set A of actions {a1, a2, a3, …}
• A transition function T: S x A → S (deterministic) or T: S x A x S → [0, 1] (stochastic)
• A reward function R: S x A → Real or R: S x A x S → Real
• A policy π: S → A (deterministic) or π: S x A → [0, 1] (stochastic)
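For illustration (not part of the original slides), a small deterministic MDP might be sketched in Python as below; the state names, action names and numbers are invented placeholders.

```python
# A minimal sketch of a small deterministic MDP (illustrative names only).
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    transition: Dict[Tuple[State, Action], State]  # T: S x A -> S (deterministic)
    reward: Dict[Tuple[State, Action], float]      # R: S x A -> Real

# A two-state example: the agent can 'stay' or 'move'.
mdp = MDP(
    states=["s1", "s2"],
    actions=["stay", "move"],
    transition={("s1", "stay"): "s1", ("s1", "move"): "s2",
                ("s2", "stay"): "s2", ("s2", "move"): "s1"},
    reward={("s1", "stay"): 0.0, ("s1", "move"): 1.0,
            ("s2", "stay"): 0.0, ("s2", "move"): 0.0},
)
```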
Optimality Criteria Suppose an agent receives a reward rt at time t. Then optimal behaviour might:
• Maximise the sum of expected future reward: E[ Σt rt ]
• Maximise over a finite horizon: E[ Σt=0..h rt ]
• Maximise over an infinite horizon: E[ Σt=0..∞ rt ]
• Maximise over a discounted infinite horizon: E[ Σt=0..∞ γt rt ], with 0 ≤ γ < 1
• Maximise average reward: limh→∞ (1/h) E[ Σt=0..h rt ]
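As a rough numeric illustration (not from the original slides; the reward sequence and the value of γ are made up), the discounted and average-reward criteria can be computed over a finite trajectory like this:

```python
# Sketch: discounted and average return for a finite reward sequence.
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over the trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """Average reward per step over the trajectory."""
    return sum(rewards) / len(rewards)

print(discounted_return([0, 0, 1, 1], gamma=0.9))  # 0.9**2 + 0.9**3 = 1.539
print(average_reward([0, 0, 1, 1]))                # 0.5
```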
Value Functions
• State value function Vπ: S → Real, written Vπ(s): the expected sum of discounted reward for following the policy π from state s to the end of time.
• State-action value function Qπ: S x A → Real, written Qπ(s, a): the expected sum of discounted reward for starting in state s, taking action a once, then following the policy π from state s’ to the end of time.
Optimal State Value Function V*(s) = maxa E{ R(s, a, s’) + γV*(s’) | s, a } = maxa Σs’ T(s, a, s’) [ R(s, a, s’) + γV*(s’) ]
• A Bellman equation
• Can be solved using dynamic programming (a sketch follows below)
• Requires knowledge of the transition function T
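A minimal dynamic-programming sketch (not from the slides) of solving this Bellman equation by value iteration; it assumes the model is given as hypothetical dictionaries T and R indexed by (s, a, s’) triples.

```python
# Sketch: value iteration for the optimal state value function V*(s).
# T[(s, a, s2)] is the transition probability, R[(s, a, s2)] the reward;
# both tables are assumed inputs, not defined in the original slides.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: max over actions of the expected one-step return.
            best = max(
                sum(T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```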
Optimal State-Action Value Function Q*(s, a) = E{ R(s, a, s’) + γ maxa’ Q*(s’, a’) | s, a } = Σs’ T(s, a, s’) [ R(s, a, s’) + γ maxa’ Q*(s’, a’) ]
• Also a Bellman equation
• Also requires knowledge of the transition function T to solve using dynamic programming
• Can now define action selection: π*(s) = argmaxa Q*(s, a)
Solution Methods
• Model based:
  • For example, dynamic programming
  • Require a model (transition function) of the environment for learning
• Model free:
  • Learn from interaction with the environment without requiring a model
  • For example, Q-learning…
Q-Learning by Example: Driving in Canberra [State transition diagram: the states Parked Clean, Parked Dirty, Driving Clean and Driving Dirty, connected by the actions Park, Drive and Clean.]
Formulating the Problem (a small code sketch follows the list)
• States: s1 Parked clean, s2 Parked dirty, s3 Driving clean, s4 Driving dirty
• Actions: a1 Drive, a2 Clean, a3 Park
• Reward: rt = 1 for transitions to a ‘clean’ state, 0 otherwise
• State-Action Table or Q-Table
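One way this formulation might look in Python (an illustrative sketch, not from the slides; the string identifiers are invented, and only the reward definition is taken from the slide):

```python
# Sketch: a Q-table for the Canberra driving example, indexed by (state, action).
states = ["parked_clean", "parked_dirty", "driving_clean", "driving_dirty"]
actions = ["drive", "clean", "park"]

# Initialise all Q-values to zero.
Q = {(s, a): 0.0 for s in states for a in actions}

def reward(next_state):
    """1 for transitions into a 'clean' state, 0 otherwise."""
    return 1.0 if next_state.endswith("clean") else 0.0
```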
A Q-Learning Agent st Learning update to πt Action selection from πt at rt Agent Environment
Q-Learning Algorithmic Components (a code sketch follows below)
• Learning update (to Q-Table): Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ maxa’ Q(s’, a’) ] or, equivalently, Q(s, a) ← Q(s, a) + α [ r + γ maxa’ Q(s’, a’) − Q(s, a) ]
• Action selection (from Q-Table): a = f(Q(s, a))
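A sketch of both components in Python, continuing the illustrative Q-table above. The slides leave the selection function f open; ε-greedy selection is used here as one common choice, and the α, γ and ε values are assumptions.

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update: Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def select_action(Q, s, actions, epsilon=0.1):
    """Epsilon-greedy action selection from the Q-table."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit
```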
Exercise You need to program a small robot to learn to find food.
• What assumptions will you make about the robot’s sensors and actuators to represent the environment?
• How could you model the problem as an MDP?
• Calculate a few learning iterations in your domain by hand.
Alternatives
• Function approximation of the Q-table:
  • Neural networks
  • Decision trees
  • Gradient descent methods
• Reinforcement learning variants:
  • Relational reinforcement learning
  • Hierarchical reinforcement learning
  • Intrinsically motivated reinforcement learning
References and Further Reading
• Sutton, R., Barto, A., (2000) Reinforcement Learning: An Introduction, The MIT Press. http://www.cs.ualberta.ca/~sutton/book/the-book.html
• Kaelbling, L., Littman, M., Moore, A., (1996) Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, 4:237-285
• Barto, A., Mahadevan, S., (2003) Recent Advances in Hierarchical Reinforcement Learning, Discrete Event Dynamic Systems: Theory and Applications, 13(4):41-77