
### An Overview of MAXQ Hierarchical Reinforcement Learning

Thomas G. Dietterich, Oregon State University

Presenter: ZhiWei

Motivation

- Traditional reinforcement learning algorithms treat the state space of the Markov Decision Process as a single “flat” search space.
- Drawback of this approach: it does not scale to tasks that have a complex, hierarchical structure, e.g., robot soccer or air traffic control.
- To overcome this problem, i.e., to make reinforcement learning hierarchical, we need to introduce mechanisms for abstraction and sharing.

This paper describes an initial effort in this direction

A learning example (cont’d)

Task: the taxi is in a randomly-chosen cell and the passenger is at one of the four special locations (R, G, B, Y). The passenger has a desired destination and the job of the taxi is to go to the passenger, pick him/her up, go to the passenger’s destination, and drop him/her off.

Six available primitive actions:

North, South, East, West, Pickup and Putdown

Reward: each action receives a reward of -1; putting the passenger down at the destination earns +20; attempting to pick up a non-existent passenger, or putting the passenger down at the wrong place, earns -10; running into a wall has no effect but still incurs the usual reward of -1.
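The reward rules above can be sketched as a small function. This is only an illustration: the names `Action`, `IN_TAXI`, and `taxi_reward` are hypothetical, the +20 and -10 are assumed to replace (rather than add to) the -1 step cost, and a Putdown with no passenger on board is assumed to count as an illegal Putdown.

```python
from enum import Enum


class Action(Enum):
    NORTH = 0
    SOUTH = 1
    EAST = 2
    WEST = 3
    PICKUP = 4
    PUTDOWN = 5


IN_TAXI = "taxi"  # hypothetical marker: the passenger is riding in the taxi


def taxi_reward(action, taxi_cell, passenger_loc, destination):
    """Reward for one primitive action, following the rules on the slide.

    taxi_cell and destination are cell labels; passenger_loc is a cell
    label or IN_TAXI. Assumption: special rewards replace the -1 step cost.
    """
    if action is Action.PICKUP:
        # Picking up is only legal when the passenger is waiting in this cell.
        if passenger_loc == IN_TAXI or passenger_loc != taxi_cell:
            return -10
        return -1
    if action is Action.PUTDOWN:
        # Successful delivery: passenger on board and taxi at the destination.
        if passenger_loc == IN_TAXI and taxi_cell == destination:
            return 20
        return -10
    # Movement actions, including bumping into a wall, cost the usual -1.
    return -1
```

Keeping the reward in one pure function makes it easy to swap in the alternative reading (special rewards added on top of -1) without touching any learning code.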

Q-learning algorithm

- For any MDP, there exist one or more optimal policies. All these policies share the same optimal value function, which satisfies the Bellman equation:

- Q function:
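In the P(s’|s,a), R(s’|s,a) notation that appears later in these slides, the standard undiscounted forms of these two equations can be written as below. Treat this as a reconstruction under that assumption rather than the exact rendering from the slide; with a discount factor γ, the future-value term would be multiplied by γ.

$$
\begin{aligned}
V^*(s) &= \max_a \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s' \mid s, a) + V^*(s') \bigr] \\
Q^*(s, a) &= \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s' \mid s, a) + \max_{a'} Q^*(s', a') \bigr]
\end{aligned}
$$

The two are linked by $V^*(s) = \max_a Q^*(s, a)$.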

Q-learning algorithm (cont’d)

- Value function example:

Q-learning algorithm (cont’d)

- Learning Process:
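A minimal sketch of the tabular Q-learning loop follows. It is not taken from the slides: the corridor environment, the function names, and the parameter values (`alpha`, `epsilon`, undiscounted `gamma=1.0`) are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def q_learning(step, actions, start_state, episodes=500,
               alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Q-learning. `step(s, a)` must return (next_state, reward, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = step(s, a)
            # One-step backup toward r + gamma * max_a' Q(s', a').
            if done:
                target = r
            else:
                target = r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q


# Hypothetical 1-D corridor: cells 0..4, goal at cell 4, -1 per step, +20 at the goal.
def corridor_step(s, a):
    s2 = min(4, max(0, s + (1 if a == "right" else -1)))
    if s2 == 4:
        return s2, 20, True
    return s2, -1, False


ACTIONS = ["left", "right"]
Q = q_learning(corridor_step, ACTIONS, start_state=0)
greedy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(4)]
```

After training, `greedy` holds the greedy action for each non-goal cell; in this corridor the learned policy heads right toward the goal.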

Hierarchical Q-learning

- Action a is generally simple, e.g., one of the available primitive actions (normal Q-learning).
- Could action a also be complex, e.g., a subroutine that takes many primitive actions and then exits?
- Yes! The learning algorithm still works (hierarchical Q-learning).

Hierarchical Q-learning (cont’d)

- Assumption: some hierarchical structure is given.
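The idea of treating a subroutine as a single action can be sketched as follows. This is a hedged illustration, not the presenter's code: the `Task` class, the `reach_2` subtask, the corridor environment, and all parameter values are hypothetical. Each task keeps its own Q values, indexed by (task, state, child action), and a child subtask simply reports back the total reward it collected and the state it ended in.

```python
import random
from collections import defaultdict


class Task:
    """A composite action: child actions plus a termination predicate."""
    def __init__(self, name, children, terminated):
        self.name = name
        self.children = children      # Task objects or primitive action names
        self.terminated = terminated  # function: state -> bool


def label(a):
    return a.name if isinstance(a, Task) else a


def available(task, s):
    # A child subtask that is already finished in s cannot be invoked.
    return [c for c in task.children
            if not (isinstance(c, Task) and c.terminated(s))]


def hsmq(task, s, step, Q, rng, alpha=0.1, epsilon=0.1):
    """Run `task` from s until it terminates; return (total reward, final state)."""
    total = 0.0
    while not task.terminated(s):
        options = available(task, s)
        if rng.random() < epsilon:
            a = rng.choice(options)
        else:
            a = max(options, key=lambda c: Q[(task.name, s, label(c))])
        if isinstance(a, Task):
            r, s2 = hsmq(a, s, step, Q, rng, alpha, epsilon)  # many primitive steps
        else:
            s2, r = step(s, a)                                # one primitive step
        if task.terminated(s2):
            future = 0.0
        else:
            future = max(Q[(task.name, s2, label(c))] for c in available(task, s2))
        key = (task.name, s, label(a))
        Q[key] += alpha * (r + future - Q[key])  # same backup as flat Q-learning
        total += r
        s = s2
    return total, s


# Hypothetical corridor 0..4 with the goal at cell 4.
def step(s, a):
    s2 = min(4, max(0, s + (1 if a == "right" else -1)))
    return s2, (20.0 if s2 == 4 else -1.0)


reach_2 = Task("reach_2", ["left", "right"], lambda s: s >= 2)
root = Task("root", [reach_2, "left", "right"], lambda s: s == 4)

rng = random.Random(0)
Q = defaultdict(float)
for _ in range(300):
    total, final = hsmq(root, 0, step, Q, rng)
```

The update inside `hsmq` is the ordinary Q-learning backup; the only change is that `r` may be the reward accumulated over a whole subroutine call.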

MAXQ Alg. (Value Fun. Decomposition)

- Want to obtain some sharing (compactness) in the representation of the value function.
- Re-write Q(p, s, a) as

$$Q(p, s, a) = V(a, s) + C(p, s, a)$$

where V(a, s) is the expected total reward received while executing action a, and C(p, s, a) is the expected reward for completing parent task p after a has returned.
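As a rough illustration of the decomposition Q(p, s, a) = V(a, s) + C(p, s, a), the sketch below evaluates it recursively, taking V of a composite task to be the maximum of Q over its children, as in the MAXQ formulation. The task names, the single state `"s0"`, and every number in the tables are made up for the example.

```python
def V(task, s, children, C, V_prim):
    """Value of executing `task` in state s under the decomposition."""
    if task not in children:  # primitive action: stored expected reward
        return V_prim[(task, s)]
    # Composite task: best child according to Q(task, s, child).
    return max(Q(task, s, a, children, C, V_prim) for a in children[task])


def Q(parent, s, a, children, C, V_prim):
    # Q(p, s, a) = V(a, s) + C(p, s, a)
    return V(a, s, children, C, V_prim) + C[(parent, s, a)]


# Hypothetical two-level hierarchy and hand-filled tables.
children = {"root": ["get"], "get": ["pickup", "north"]}
V_prim = {("pickup", "s0"): -1.0, ("north", "s0"): -1.0}
C = {
    ("get", "s0", "pickup"): 0.0,   # nothing left to do in `get` after pickup
    ("get", "s0", "north"): -3.0,   # three more steps to finish `get`
    ("root", "s0", "get"): 12.0,    # reward for finishing `root` after `get`
}

v_get = V("get", "s0", children, C, V_prim)
q_root_get = Q("root", "s0", "get", children, C, V_prim)
```

Because V of a child is stored once and reused by every parent that calls it, the same entries can serve several places in the hierarchy, which is where the sharing comes from.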

MAXQ Alg. (cont’d)

- An example

State Abstraction

Three fundamental forms

- Irrelevant variables
E.g., the passenger location is irrelevant to the navigate and put subtasks, so it can be ignored.

- Funnel abstraction
A funnel action is an action that causes a large number of initial states to be mapped into a small number of resulting states. E.g., the navigate(t) action maps any state into a state where the taxi is at location t. This means the completion cost is independent of the taxi's starting location: it is the same for all initial locations of the taxi.

State Abstraction (cont’d)

- Structural constraints
E.g., if a task is terminated in a state s, then there is no need to represent its completion cost in that state.
Also, in some states, the termination predicate of the child task implies the termination predicate of the parent task.
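The irrelevant-variable form can be sketched as a per-subtask projection applied before every table lookup. Everything here is an assumption made for illustration: the `State` fields, the `RELEVANT` mapping, and the helper names. Only the passenger location is dropped, as in the example for the navigate and put subtasks.

```python
from collections import defaultdict
from typing import NamedTuple


class State(NamedTuple):
    taxi_cell: int
    passenger_loc: str
    destination: str


# Hypothetical mapping: which state variables each subtask looks at.
RELEVANT = {
    "navigate": ("taxi_cell", "destination"),
    "put": ("taxi_cell", "destination"),
}


def abstract(task, s):
    """Project s onto the variables relevant to `task` (all of them by default)."""
    fields = RELEVANT.get(task, State._fields)
    return tuple(getattr(s, f) for f in fields)


table = defaultdict(float)  # keyed by (task, abstract state, action)


def store(task, s, action, value):
    table[(task, abstract(task, s), action)] = value


def lookup(task, s, action):
    return table[(task, abstract(task, s), action)]


# Two states that differ only in the passenger location share one entry.
s1 = State(taxi_cell=7, passenger_loc="R", destination="B")
s2 = State(taxi_cell=7, passenger_loc="G", destination="B")
store("put", s1, "putdown", -10.0)
shared = lookup("put", s2, "putdown")
```

Each ignored variable divides the number of entries a subtask needs by the number of values that variable can take, which is the source of the memory savings.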

Effect

- Reduces the amount of memory needed to represent the Q-function:

14,000 Q values required for flat Q-learning

3,000 for HSMQ (with the irrelevant-variable abstraction)

632 for C() and V() in MAXQ

- Faster learning

Limitations

- The learned policy is recursively optimal, which is not necessarily optimal.
- Model-free Q-learning
Model-based algorithms (that is, algorithms that try to learn P(s’|s,a) and R(s’|s,a)) are generally much more efficient because they remember past experience rather than having to re-experience it.
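What "learning P and R" could look like in the simplest tabular case is sketched below: the model keeps counts of observed transitions and uses them as maximum-likelihood estimates. The class and method names are hypothetical, and this is only the model-estimation half of a model-based method, not a full planner.

```python
from collections import defaultdict


class TabularModel:
    """Count-based estimates of P(s'|s,a) and R(s'|s,a) from experience."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> s' -> n
        self.reward_sum = defaultdict(float)                 # (s, a, s') -> sum of r

    def observe(self, s, a, r, s2):
        """Record one experienced transition (s, a) -> s2 with reward r."""
        self.counts[(s, a)][s2] += 1
        self.reward_sum[(s, a, s2)] += r

    def P(self, s2, s, a):
        """Estimated probability of landing in s2 after taking a in s."""
        n = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)].get(s2, 0) / n if n else 0.0

    def R(self, s2, s, a):
        """Estimated mean reward for the transition (s, a) -> s2."""
        n = self.counts[(s, a)].get(s2, 0)
        return self.reward_sum[(s, a, s2)] / n if n else 0.0


# Hypothetical experience: "right" from state 0 succeeds three times out of four.
model = TabularModel()
for _ in range(3):
    model.observe(0, "right", -1.0, 1)
model.observe(0, "right", -1.0, 0)

p_success = model.P(1, 0, "right")
r_success = model.R(1, 0, "right")
```

Once stored, these estimates can be replayed in value backups as often as needed, which is the sense in which a model-based learner remembers past experience instead of re-experiencing it.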
