
An Overview of MAXQ Hierarchical Reinforcement Learning

Thomas G. Dietterich from Oregon State Univ.

Presenter: ZhiWei


Motivation

  • Traditional reinforcement learning algorithms treat the state space of the Markov Decision Process as a single "flat" search space.

  • Drawback of this approach: it does not scale to tasks that have a complex, hierarchical structure, e.g., robot soccer or air traffic control.

  • To overcome this problem, i.e., to make reinforcement learning hierarchical, we need to introduce mechanisms for abstraction and sharing.

This paper describes an initial effort in this direction.



A learning example (cont’d)

Task: the taxi starts in a randomly chosen cell and the passenger is at one of the four special locations (R, G, B, Y). The passenger has a desired destination, and the taxi's job is to go to the passenger, pick him/her up, drive to the passenger's destination, and drop him/her off.

Six available primitive actions:

North, South, East, West, Pickup and Putdown

Reward: every action receives -1; putting the passenger down at the destination receives +20; attempting to pick up a non-existent passenger, or to put the passenger down at the wrong place, receives -10; running into a wall has no effect but still incurs the usual -1 reward.
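As a concrete illustration (not from the slides), the reward rules above can be transcribed directly into a small function; the boolean flags delivered and illegal_action are hypothetical names, not part of the paper:

    def taxi_reward(delivered, illegal_action):
        # Reward rules for the taxi task, as described above.
        if delivered:            # passenger put down at the destination
            return 20
        if illegal_action:       # pickup with no passenger present, or putdown at a wrong place
            return -10
        return -1                # every other action, including running into a wall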


Q-learning algorithm

  • For any MDP, there exist one or more optimal policies. All these policies share the same optimal value function, which satisfies the Bellman equation:

    V*(s) = max_a Σ_{s'} P(s'|s, a) [ R(s'|s, a) + γ V*(s') ]

  • Q function: the value of taking action a in state s and acting optimally thereafter, which satisfies

    Q*(s, a) = Σ_{s'} P(s'|s, a) [ R(s'|s, a) + γ max_{a'} Q*(s', a') ]
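A minimal tabular Q-learning sketch of the update implied by these equations (not from the slides); the environment interface env.reset(), env.step(s, a) -> (s2, r, done) and env.actions is assumed for illustration:

    import random
    from collections import defaultdict

    def q_learning(env, episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
        Q = defaultdict(float)                       # Q[(s, a)] -> value estimate
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy choice over the primitive actions
                if random.random() < epsilon:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda b: Q[(s, b)])
                s2, r, done = env.step(s, a)
                # sampled one-step Bellman backup
                target = r if done else r + gamma * max(Q[(s2, b)] for b in env.actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s2
        return Q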


Q-learning algorithm (cont’d)

  • Value function example (figure on the original slide)



Hierarchical Q-learning

  • In normal Q-learning, the action a is simple, i.e., one of the available primitive actions.

  • Could the action a also be complex, e.g., a subroutine that executes many primitive actions and then exits?

  • Yes! The learning algorithm still works (Hierarchical Q-learning); a sketch of the corresponding update is given below.
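A sketch of that backup when a is a subroutine (the names and signature are illustrative; the duration-based discounting follows the semi-Markov form, and with gamma = 1 it reduces to the ordinary update):

    def hierarchical_q_update(Q, task, s, a, s_next, r_total, n_steps, child_actions,
                              alpha=0.1, gamma=1.0):
        # Composite action `a`, invoked by `task` in state `s`, ran for
        # `n_steps` primitive steps, accumulated reward `r_total`, and left
        # the environment in state `s_next`.
        best_next = max(Q.get((task, s_next, b), 0.0) for b in child_actions)
        old = Q.get((task, s, a), 0.0)
        Q[(task, s, a)] = old + alpha * (r_total + (gamma ** n_steps) * best_next - old)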


Hierarchical Q-learning (cont’d)

  • Assumption: some hierarchical structure is given.
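For the taxi task, the hierarchy from the paper can be written down as a simple parent-to-children map (a sketch; the string names are illustrative):

    # Task graph for the taxi problem: each composite task lists the
    # child actions (subtasks or primitives) it is allowed to invoke.
    TAXI_HIERARCHY = {
        "Root":        ["Get", "Put"],
        "Get":         ["Navigate(t)", "Pickup"],
        "Put":         ["Navigate(t)", "Putdown"],
        "Navigate(t)": ["North", "South", "East", "West"],
    }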



MAXQ Alg. (Value Fun. Decomposition)

  • Want to obtain some sharing (compactness) in the representation of the value function.

  • Re-write Q(p, s, a) as

    Q(p, s, a) = V(a, s) + C(p, s, a)

    where V(a, s) is the expected total reward received while executing action a, and C(p, s, a) is the expected reward for completing parent task p after a has returned.
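The decomposition is recursive: for a composite task, V is the best Q value over its children, while for a primitive action V is just its expected one-step reward. A minimal evaluation sketch (not from the slides), assuming lookup tables V for primitive actions, C for completion costs, and the parent-to-children map sketched earlier:

    def q_value(p, s, a, V, C, children):
        # Q(p, s, a) = V(a, s) + C(p, s, a)
        return value(a, s, V, C, children) + C[(p, s, a)]

    def value(a, s, V, C, children):
        # V(a, s): one-step reward for a primitive action,
        # otherwise the best child Q value of the composite task a.
        if a not in children:
            return V[(a, s)]
        return max(q_value(a, s, child, V, C, children) for child in children[a])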





State Abstraction

Three fundamental forms

  • Irrelevant variables

    e.g., the passenger's location is irrelevant for the navigate and put subtasks, and it can thus be ignored.

  • Funnel abstraction

    A funnel action is an action that causes a large number of initial states to be mapped into a small number of resulting states. E.g., the navigate(t) action maps any state into a state where the taxi is at location t. This means the completion cost is independent of the taxi's location: it is the same for all initial locations of the taxi.


State Abstraction (cont’d)

  • Structure constraints

    - E.g., if a task is terminated in a state s, then there is no need to represent its completion cost in that state.

    - Also, in some states the termination predicate of the child task implies the termination predicate of the parent task.

    Effect

    - reduces the amount of memory needed to represent the Q-function:

    14,000 Q values required for flat Q-learning

    3,000 for HSMQ (with the irrelevant-variable abstraction)

    632 for C() and V() in MAXQ

    - faster learning



Limitations

  • Recursive optimality does not imply global optimality.

  • Model-free Q-learning

    Model-based algorithms (that is, algorithms that try to learn P(s’|s,a) and R(s’|s,a)) are generally much more efficient because they remember past experience rather than having to re-experience it.
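As an illustration of what "learning a model" means here (not an algorithm from the paper), a maximum-likelihood tabular model can be estimated from counts of observed transitions:

    from collections import defaultdict

    class TabularModel:
        """Maximum-likelihood estimates of P(s'|s,a) and R(s'|s,a) from experience."""
        def __init__(self):
            self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visits}
            self.reward_sum = defaultdict(float)                 # (s, a, s') -> summed reward

        def update(self, s, a, s_next, r):
            self.counts[(s, a)][s_next] += 1
            self.reward_sum[(s, a, s_next)] += r

        def transition_probs(self, s, a):
            total = sum(self.counts[(s, a)].values())
            return {s2: n / total for s2, n in self.counts[(s, a)].items()}

        def expected_reward(self, s, a, s_next):
            return self.reward_sum[(s, a, s_next)] / self.counts[(s, a)][s_next]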