
General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning


Presentation Transcript


  1. General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning Duke University Machine Learning Group Discussion Leader: Kai Ni June 17, 2005

  2. Outline • Reinforcement Learning • Explicit Explore or Exploit (E3) algorithm • Implicit Explore or Exploit (R-Max) algorithm • Conclusions

  3. What is Reinforcement Learning • Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. • Two strategies for solving reinforcement-learning problems: • Search the space of behaviors to find one that performs well; • Use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world.

  4. Reinforcement Learning Model Formally, the model consists of • a discrete set of environment states, S; • a discrete set of agent actions, A; • a set of scalar reinforcement signals, typically {0,1} or the real numbers. Figure 1: The standard reinforcement-learning model

  5. Example Dialogue • The environment is non-deterministic but stationary.

  6. Some Measurements • Models of optimal behavior: • Finite-horizon; • Infinite-horizon (discounted); • Average-reward. • Learning performance: eventual convergence to (near-)optimal behavior and the speed of that convergence.
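For reference, the three criteria can be written out as follows (standard definitions, as in the cited Kaelbling, Littman and Moore survey; r_t denotes the reward at step t, γ the discount factor, h the horizon):

```latex
% Three standard models of optimal behavior.
\[
\text{Finite-horizon:}\quad E\Big[\sum_{t=0}^{h} r_t\Big]
\qquad
\text{Infinite-horizon discounted:}\quad E\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big],\; 0 \le \gamma < 1
\]
\[
\text{Average-reward:}\quad \lim_{h\to\infty} E\Big[\frac{1}{h}\sum_{t=0}^{h} r_t\Big]
\]
```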

  7. Exploitation versus Exploration • One major difference between reinforcement learning and supervised learning is that a reinforcement learner must explicitly explore its environment. • The simplest traditional reinforcement-learning problem is the k-armed bandit problem: k gambling machines and h pulls. How do you decide which machine to pull?
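One simple way to trade off exploitation against exploration in the k-armed bandit is ε-greedy action selection. The sketch below is only illustrative; the `pull` callback, the value of `epsilon`, and the Bernoulli arms in the usage line are assumptions, not something specified in the slides:

```python
import random

def epsilon_greedy_bandit(pull, k, horizon, epsilon=0.1):
    """Play a k-armed bandit for `horizon` pulls with epsilon-greedy exploration.

    `pull(arm)` is assumed to return a stochastic reward for the chosen arm.
    """
    counts = [0] * k           # number of pulls per arm
    means = [0.0] * k          # running mean reward per arm
    total = 0.0
    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.randrange(k)                      # explore: pick a random arm
        else:
            arm = max(range(k), key=lambda a: means[a])    # exploit: best estimate so far
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]       # incremental mean update
        total += r
    return total, means

# Usage example: three Bernoulli arms with different success probabilities.
probs = [0.2, 0.5, 0.8]
reward, estimates = epsilon_greedy_bandit(
    lambda a: float(random.random() < probs[a]), k=3, horizon=1000)
```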

  8. Markov Decision Process Model The MDP is defined by the tuple <S, A, T, R> • S is a finite set of states of the world; • A is a finite set of actions; • T: S × A → Π(S) is the state-transition function, giving the probability that an action changes the world state from one state to another, T(s, a, s′); • R: S × A → ℝ is the reward for the agent in a given world state after performing an action, R(s, a). The agent does not know the parameters of this process.
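Once the parameters <S, A, T, R> are known (the planning subproblem that the algorithms below repeatedly solve off-line), an optimal policy can be computed by dynamic programming. A minimal value-iteration sketch, assuming T and R are stored as dense arrays (the array layout and function name are illustrative):

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Compute an optimal policy for a fully known MDP <S, A, T, R>.

    T[s, a, s'] is the transition probability and R[s, a] the expected reward.
    """
    V = np.zeros(T.shape[0])
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * V(s')
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy optimal policy
        V = V_new
```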

  9. Near-Optimal Learning in Polynomial Time • The analysis gives a lower bound on T; we call the value of this bound the ε-horizon time for the discounted MDP M, i.e., the number of steps after which the truncated discounted return is within ε of the infinite-horizon return.

  10. Proof of the Lemma • The lower bound follows from the definitions, since all expected payoffs are nonnegative. • For the upper bound, fix any infinite path p and let R_i denote the expected payoffs along this path.
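The flavor of the bound can be recovered from the standard truncation argument, assuming payoffs lie in [0, R_max] and discount factor γ < 1:

```latex
% Truncating the discounted return after T steps costs at most
\[
\Big|\sum_{t=0}^{\infty}\gamma^{t} r_t \;-\; \sum_{t=0}^{T-1}\gamma^{t} r_t\Big|
\;\le\; \sum_{t=T}^{\infty}\gamma^{t} R_{\max}
\;=\; \frac{\gamma^{T} R_{\max}}{1-\gamma}.
\]
% Requiring this to be at most \epsilon, and using \ln(1/\gamma) \ge 1-\gamma, it suffices to take
\[
T \;\ge\; \frac{1}{1-\gamma}\,\ln\!\frac{R_{\max}}{\epsilon\,(1-\gamma)},
\]
% which is the kind of lower bound on T (the \epsilon-horizon time) the lemma refers to.
```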

  11. The Explicit Explore or Exploit (E3) Algorithm • Model-based – maintain a model of the transition probabilities and expected payoffs for some subset of the states of the unknown MDP M. • Balanced wandering – from an "unknown" state, take the action that has been tried least often; after enough visits, the state becomes a "known" state. • Known-state MDP M_S – the MDP induced on the set S of currently known states; all unknown states are represented by a single additional absorbing state s0.

  12. Initialization – the set S of known states is empty. • Balanced wandering – any time the current state is not in S, the algorithm performs balanced wandering. • Discovery of new known states – any time a state i has been visited m_known times, it enters the known set S. • Off-line optimization – upon reaching a known state i in S, the algorithm performs two off-line optimal-policy computations, on M_S and on M_S′. • Attempted exploitation: if the resulting exploitation policy achieves a T-step return from i in M_S that meets the near-optimality threshold, the algorithm executes it for the next T steps. • Attempted exploration: otherwise, the algorithm executes the exploration policy derived from M_S′ for T steps of exploration.
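To make the off-line optimization step concrete, here is a minimal sketch of how the empirical known-state MDP M_S and its exploration variant M_S′ could be assembled from observed transitions; the data layout (`stats`, `known`) and the exact reward placement for the absorbing state s0 are assumptions for illustration, not the paper's precise construction:

```python
import numpy as np

def known_state_mdp(stats, known, n_actions, r_max):
    """Build the empirical known-state MDP M_S and an exploration variant M_S'.

    stats[(s, a)] holds observed (next_state, reward) pairs for a known state s;
    every unknown state is collapsed into one absorbing state s0.
    """
    idx = {s: i for i, s in enumerate(sorted(known))}
    s0 = len(idx)                                   # index of the absorbing "unknown" state
    n = s0 + 1
    T = np.zeros((n, n_actions, n))
    R = np.zeros((n, n_actions))
    T[s0, :, s0] = 1.0                              # s0 is absorbing under every action
    for s in known:
        i = idx[s]
        for a in range(n_actions):
            samples = stats[(s, a)]
            R[i, a] = np.mean([r for _, r in samples])
            for s_next, _ in samples:
                T[i, a, idx.get(s_next, s0)] += 1.0 / len(samples)   # unknown successors -> s0
    # Exploration variant M_S': reward only for escaping to the unknown state s0.
    R_explore = np.zeros_like(R)
    R_explore[s0, :] = r_max
    return (T, R), (T, R_explore)
```

The exploitation policy can then be computed by planning on (T, R) (e.g. with the value-iteration sketch above) and the exploration policy by planning on (T, R_explore).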

  13. Explore or Exploit Lemma – informally: for any known state, either the optimal T-step policy already achieves near-optimal return inside the known-state MDP M_S (so exploiting is safe), or some policy in M_S′ escapes to an unknown state within T steps with non-negligible probability (so exploring is productive).

  14. R-Max – the Implicit Explore or Exploit Algorithm • In the spirit of the E3 algorithm, a general polynomial-time algorithm for near-optimal reinforcement learning. • The agent does not know whether its current behavior is exploitation or exploration; it only knows that it will either optimize or learn efficiently. • R-Max is described in the context of a stochastic game (SG), which also accounts for the actions of an adversary. (Maybe useful for the moving-target problem?)

  15. SG and MDP • An MDP is an SG in which the adversary has a single action at each state.

  16. Initialization – construct a model M′ consisting of N+1 stage-games, {G0, G1, …, GN}, where G0 is an additional fictitious game. Initialize all game matrices to have (Rmax, 0) in every entry, and initialize PM(Gi, G0, a, a′) = 1 for all i and all actions a, a′. • Compute and act – compute an optimal T-step policy for the current state, and execute it for T steps or until a new entry becomes known. • Observe and update – update the reward for (a, a′) in the state Gi; update the set of states reached by playing (a, a′) in Gi; if the record of states reached from this entry contains enough elements (the paper's known-ness threshold), mark this entry as KNOWN and update the transition matrix for this entry.
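A minimal sketch of the optimistic initialization and the update rule in the MDP special case (a single adversary action), assuming an illustrative known-ness threshold `m_known` in place of the paper's constant; the class and field names are placeholders:

```python
import numpy as np

class RMaxModel:
    """Optimistic empirical model: state 0 plays the role of the fictitious game G0."""

    def __init__(self, n_states, n_actions, r_max, m_known=50):
        self.n = n_states + 1                                  # +1 for the fictitious state G0
        self.m_known = m_known
        self.R = np.full((self.n, n_actions), float(r_max))   # all rewards start at R_max
        self.T = np.zeros((self.n, n_actions, self.n))
        self.T[:, :, 0] = 1.0                                  # every entry initially leads to G0
        self.counts = np.zeros((self.n, n_actions), dtype=int)
        self.samples = {}                                      # (s, a) -> list of (s', r)

    def update(self, s, a, s_next, r):
        """Record one transition; once an entry has m_known samples it becomes KNOWN
        and its optimistic row is replaced by empirical estimates."""
        if self.counts[s, a] >= self.m_known:
            return                                  # entry already known; keep its estimates
        self.samples.setdefault((s, a), []).append((s_next, r))
        self.counts[s, a] += 1
        if self.counts[s, a] == self.m_known:
            obs = self.samples[(s, a)]
            self.R[s, a] = np.mean([ri for _, ri in obs])
            row = np.zeros(self.n)
            for sp, _ in obs:
                row[sp] += 1.0 / len(obs)
            self.T[s, a] = row                      # empirical transition distribution
```

Between updates, the agent would compute an optimal T-step policy for the current optimistic model and follow it, as in the "Compute and act" step above.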

  17. Conclusion • The authors describe R-Max, a simple RL algorithm that converges in polynomial time to near-optimal reward. • R-Max is an optimistic model-based algorithm in the spirit of the E3 algorithm; however, unlike E3, it makes the trade-off between exploration and exploitation implicitly.

  18. Related to Our Work • This paper focuses on proving that such an algorithm exists and on its optimality and convergence; the detailed MDP solution is not addressed. We may be able to plug our POMDP solver into this framework as an extension. • The algorithm does not require a random walk to learn the environment in advance, which may be interesting for our robot-navigation problem.

  19. References • R. Brafman and M. Tennenholtz, "R-MAX – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning," Journal of Machine Learning Research, 2002. • M. Kearns and S. Singh, "Near-Optimal Reinforcement Learning in Polynomial Time," ICML 1998. • L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, 1996.
