
General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning


Presentation Transcript


  1. General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning Duke University Machine Learning Group Discussion Leader: Kai Ni June 17, 2005

  2. Outline • Reinforcement Learning • Explicit Explore or Exploit (E3) algorithm • Implicit Explore or Exploit (R-Max) algorithm • Conclusions

  3. What is Reinforcement Learning • Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. • Two strategies for solving reinforcement-learning problems: • Search the space of behaviors to find one that performs well; • Use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world.

  4. Reinforcement Learning Model Formally, the model consists of • a discrete set of environment states, S; • a discrete set of agent actions, A; • a set of scalar reinforcement signals, typically {0,1} or the real numbers. Figure 1: The standard reinforcement-learning model

  5. Example Dialogue • The environment is non-deterministic but stationary.

  6. Some Measurements • Models of optimal behavior: • Finite-horizon; • Infinite-horizon (discounted); • Average-reward. • Learning performance: eventual convergence to (near-)optimal behavior and the speed of that convergence.
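For reference, the three criteria can be written out as follows (standard definitions, as in the cited Kaelbling, Littman and Moore survey; r_t denotes the reward at step t, γ the discount factor, h the horizon):

```latex
% Three standard models of optimal behavior.
\[
\text{Finite-horizon:}\quad E\Big[\sum_{t=0}^{h} r_t\Big]
\qquad
\text{Infinite-horizon discounted:}\quad E\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big],\; 0 \le \gamma < 1
\]
\[
\text{Average-reward:}\quad \lim_{h\to\infty} E\Big[\frac{1}{h}\sum_{t=0}^{h} r_t\Big]
\]
```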

  7. Exploitation versus Exploration • One major difference between reinforcement learning and supervised learning is that a reinforcement learner must explicitly explore its environment. • The simplest traditional reinforcement-learning problem is the k-armed bandit problem: k gambling machines and h pulls. How do you decide which machine to pull?
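One simple way to trade off exploitation against exploration in the k-armed bandit is ε-greedy action selection. The sketch below is only illustrative; the `pull` callback, the value of `epsilon`, and the Bernoulli arms in the usage line are assumptions, not something specified in the slides:

```python
import random

def epsilon_greedy_bandit(pull, k, horizon, epsilon=0.1):
    """Play a k-armed bandit for `horizon` pulls with epsilon-greedy exploration.

    `pull(arm)` is assumed to return a stochastic reward for the chosen arm.
    """
    counts = [0] * k           # number of pulls per arm
    means = [0.0] * k          # running mean reward per arm
    total = 0.0
    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.randrange(k)                      # explore: pick a random arm
        else:
            arm = max(range(k), key=lambda a: means[a])    # exploit: best estimate so far
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]       # incremental mean update
        total += r
    return total, means

# Usage example: three Bernoulli arms with different success probabilities.
probs = [0.2, 0.5, 0.8]
reward, estimates = epsilon_greedy_bandit(
    lambda a: float(random.random() < probs[a]), k=3, horizon=1000)
```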

  8. Markov Decision Process Model The MDP is defined by the tuple <S, A, T, R> • S is a finite set of states of the world; • A is a finite set of actions; • T: S × A → Π(S) is the state-transition function, giving the probability that an action changes the world state from one state to another, T(s, a, s′); • R: S × A → ℝ is the reward for the agent in a given world state after performing an action, R(s, a). The agent does not know the parameters of this process.
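Once the parameters <S, A, T, R> are known (the planning subproblem that the algorithms below repeatedly solve off-line), an optimal policy can be computed by dynamic programming. A minimal value-iteration sketch, assuming T and R are stored as dense arrays (the array layout and function name are illustrative):

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Compute an optimal policy for a fully known MDP <S, A, T, R>.

    T[s, a, s'] is the transition probability and R[s, a] the expected reward.
    """
    V = np.zeros(T.shape[0])
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * V(s')
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy optimal policy
        V = V_new
```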

  9. Near-Optimal Learning in Polynomial Time • The analysis gives a lower bound on T; we call the value of this bound the ε-horizon time for the discounted MDP M, i.e., the number of steps after which the truncated discounted return is within ε of the infinite-horizon return.

  10. Proof of the Lemma • The lower bound follows from the definitions, since all expected payoffs are nonnegative. • For the upper bound, fix any infinite path p and let R_i denote the expected payoffs along this path.
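The flavor of the bound can be recovered from the standard truncation argument, assuming payoffs lie in [0, R_max] and discount factor γ < 1:

```latex
% Truncating the discounted return after T steps costs at most
\[
\Big|\sum_{t=0}^{\infty}\gamma^{t} r_t \;-\; \sum_{t=0}^{T-1}\gamma^{t} r_t\Big|
\;\le\; \sum_{t=T}^{\infty}\gamma^{t} R_{\max}
\;=\; \frac{\gamma^{T} R_{\max}}{1-\gamma}.
\]
% Requiring this to be at most \epsilon, and using \ln(1/\gamma) \ge 1-\gamma, it suffices to take
\[
T \;\ge\; \frac{1}{1-\gamma}\,\ln\!\frac{R_{\max}}{\epsilon\,(1-\gamma)},
\]
% which is the kind of lower bound on T (the \epsilon-horizon time) the lemma refers to.
```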

  11. The Explicit Explore or Exploit (E3) Algorithm • Model-based – maintain a model of the transition probabilities and expected payoffs for some subset of the states of the unknown MDP M. • Balanced wandering – from an "unknown" state, take the action that has been tried least often; after enough visits, the state becomes a "known" state. • Known-state MDP M_S – the MDP induced on the set S of currently known states; all unknown states are represented by a single additional absorbing state s0.

  12. Initialization – the set S of known states is empty. • Balanced wandering – any time the current state is not in S, the algorithm performs balanced wandering. • Discovery of new known states – any time a state i has been visited m_known times, it enters the known set S. • Off-line optimization – upon reaching a known state i in S, the algorithm performs two off-line optimal-policy computations, on M_S and on M_S′. • Attempted exploitation: if the resulting exploitation policy achieves a T-step return from i in M_S that meets the near-optimality threshold, the algorithm executes it for the next T steps. • Attempted exploration: otherwise, the algorithm executes the exploration policy derived from M_S′ for T steps of exploration.
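To make the off-line optimization step concrete, here is a minimal sketch of how the empirical known-state MDP M_S and its exploration variant M_S′ could be assembled from observed transitions; the data layout (`stats`, `known`) and the exact reward placement for the absorbing state s0 are assumptions for illustration, not the paper's precise construction:

```python
import numpy as np

def known_state_mdp(stats, known, n_actions, r_max):
    """Build the empirical known-state MDP M_S and an exploration variant M_S'.

    stats[(s, a)] holds observed (next_state, reward) pairs for a known state s;
    every unknown state is collapsed into one absorbing state s0.
    """
    idx = {s: i for i, s in enumerate(sorted(known))}
    s0 = len(idx)                                   # index of the absorbing "unknown" state
    n = s0 + 1
    T = np.zeros((n, n_actions, n))
    R = np.zeros((n, n_actions))
    T[s0, :, s0] = 1.0                              # s0 is absorbing under every action
    for s in known:
        i = idx[s]
        for a in range(n_actions):
            samples = stats[(s, a)]
            R[i, a] = np.mean([r for _, r in samples])
            for s_next, _ in samples:
                T[i, a, idx.get(s_next, s0)] += 1.0 / len(samples)   # unknown successors -> s0
    # Exploration variant M_S': reward only for escaping to the unknown state s0.
    R_explore = np.zeros_like(R)
    R_explore[s0, :] = r_max
    return (T, R), (T, R_explore)
```

The exploitation policy can then be computed by planning on (T, R) (e.g. with the value-iteration sketch above) and the exploration policy by planning on (T, R_explore).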

  13. Explore or Exploit Lemma – informally: for any known state, either the optimal T-step policy already achieves near-optimal return inside the known-state MDP M_S (so exploiting is safe), or some policy in M_S′ escapes to an unknown state within T steps with non-negligible probability (so exploring is productive).

  14. R-Max – the Implicit Explore or Exploit Algorithm • In the spirit of the E3 algorithm, a general polynomial-time algorithm for near-optimal reinforcement learning. • The agent does not know whether its current behavior is exploitation or exploration; it only knows that it will either optimize or learn efficiently. • R-Max is described in the context of a stochastic game (SG), which also accounts for the actions of an adversary. (Maybe useful for the moving-target problem?)

  15. SG and MDP • An MDP is an SG in which the adversary has a single action at each state.

  16. Initialization – construct a model M′ consisting of N+1 stage-games, {G0, G1, …, GN}, where G0 is an additional fictitious game. Initialize all game matrices to have (Rmax, 0) in every entry, and initialize PM(Gi, G0, a, a′) = 1 for all i and all actions a, a′. • Compute and act – compute an optimal T-step policy for the current state, and execute it for T steps or until a new entry becomes known. • Observe and update – update the reward for (a, a′) in the state Gi; update the set of states reached by playing (a, a′) in Gi; if the record of states reached from this entry contains enough elements (the paper's known-ness threshold), mark this entry as KNOWN and update the transition matrix for this entry.
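A minimal sketch of the optimistic initialization and the update rule in the MDP special case (a single adversary action), assuming an illustrative known-ness threshold `m_known` in place of the paper's constant; the class and field names are placeholders:

```python
import numpy as np

class RMaxModel:
    """Optimistic empirical model: state 0 plays the role of the fictitious game G0."""

    def __init__(self, n_states, n_actions, r_max, m_known=50):
        self.n = n_states + 1                                  # +1 for the fictitious state G0
        self.m_known = m_known
        self.R = np.full((self.n, n_actions), float(r_max))   # all rewards start at R_max
        self.T = np.zeros((self.n, n_actions, self.n))
        self.T[:, :, 0] = 1.0                                  # every entry initially leads to G0
        self.counts = np.zeros((self.n, n_actions), dtype=int)
        self.samples = {}                                      # (s, a) -> list of (s', r)

    def update(self, s, a, s_next, r):
        """Record one transition; once an entry has m_known samples it becomes KNOWN
        and its optimistic row is replaced by empirical estimates."""
        if self.counts[s, a] >= self.m_known:
            return                                  # entry already known; keep its estimates
        self.samples.setdefault((s, a), []).append((s_next, r))
        self.counts[s, a] += 1
        if self.counts[s, a] == self.m_known:
            obs = self.samples[(s, a)]
            self.R[s, a] = np.mean([ri for _, ri in obs])
            row = np.zeros(self.n)
            for sp, _ in obs:
                row[sp] += 1.0 / len(obs)
            self.T[s, a] = row                      # empirical transition distribution
```

Between updates, the agent would compute an optimal T-step policy for the current optimistic model and follow it, as in the "Compute and act" step above.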

  17. Conclusion • The authors describe R-Max, a simple RL algorithm that converges in polynomial time to near-optimal reward. • R-Max is an optimistic model-based algorithm in the spirit of the E3 algorithm; however, unlike E3, it makes the trade-off between exploration and exploitation implicitly.

  18. Related to Our Work • This paper focuses on proving that such an algorithm exists and on its optimality and convergence; the detailed MDP solution is not addressed. We may be able to plug our POMDP solver into this framework as an extension. • The algorithm does not require a random walk to learn the environment in advance, which may be interesting for our robot-navigation problem.

  19. References • R. Brafman and M. Tennenholtz, "R-MAX – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning," Journal of Machine Learning Research, 2002. • M. Kearns and S. Singh, "Near-Optimal Reinforcement Learning in Polynomial Time," ICML 1998. • L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, 1996.
