Hierarchical Reinforcement Learning

By uri

The Outline of the Talk

- MDPs and Bellman’s curse of dimensionality.
- RL: Simultaneous learning and planning.
- Explore avenues to speed up RL.
- Illustrate prominent HRL methods.
- Compare prominent HRL methods.
- Discuss future research.
- Summarise

Personal Printerbot

- States (S) : {loc, has-robot-printout, user-loc, has-user-printout}, map
- Actions (A) : {moven, moves, movee, movew, extend-arm, grab-page, release-pages}
- Reward (R) : +20 if has-user-printout (h-u-po) is true, else -1
- Goal (G) : All states with h-u-po true.
- Start state : A state with h-u-po false.

Episodic Markov Decision Process

Episodic MDP ≡ MDP with absorbing goals

- ⟨S, A, P, R, G, s0⟩
- S: Set of environment states.
- A: Set of available actions.
- P: Probability transition model, P(s'|s,a)*.
- R: Reward model, R(s)*.
- G: Absorbing goal states.
- s0: Start state.
- γ: Discount factor**.

* Markovian assumption.

** Bounds R for infinite horizon.

Goal of an Episodic MDP

Find a policy π (S → A) which:

- maximises expected discounted reward
- for a fully observable* Episodic MDP,
- if the agent is allowed to execute for an indefinite horizon.

* Non-noisy, complete-information perceptors.

Solution of an Episodic MDP

- Define V*(s) : Optimal reward starting in state s.
- Value Iteration : Start with an estimate of V*(s) and successively re-estimate it to converge to a fixed point.
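The re-estimation loop can be sketched in a few lines. This is a minimal illustration, not the talk's own code: the three-cell corridor, the state-based reward table `R`, the discount `gamma=0.9` and the stopping tolerance are all assumptions chosen to keep the example small.

```python
def value_iteration(states, actions, P, R, goals, gamma=0.9, tol=1e-8):
    """Re-estimate V(s) until it stops changing (a fixed point)."""
    V = {s: 0.0 for s in states}          # initial estimate of V*
    while True:
        delta = 0.0
        for s in states:
            if s in goals:
                new_v = R[s]              # absorbing goal: collect reward, stop
            else:
                # Bellman backup: reward now plus best expected future value
                new_v = R[s] + gamma * max(
                    sum(p * V[s2] for s2, p in P[(s, a)].items())
                    for a in actions
                )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


# Hypothetical corridor with cells 0, 1, 2; cell 2 is the goal.
states = [0, 1, 2]
actions = ["left", "right"]
P = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 1.0},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 1.0},
}
R = {0: -1.0, 1: -1.0, 2: 20.0}           # +20 at the goal, -1 elsewhere
V = value_iteration(states, actions, P, R, goals={2})
```

Each sweep touches every state once, which is why the cost per iteration grows with |S|.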

Complexity of Value Iteration

- Each iteration – polynomial in |S|
- Number of iterations – polynomial in |S|
- Overall – polynomial in |S|
- However, |S| is exponential in the number of features in the domain*.

* Bellman’s curse of dimensionality.

- Gain knowledge
- Gain understanding
- Gain skills
- Modification of behavioural tendency

[Figure: the Environment supplies Data to the learner.]

What action next?

Decision Making while Learning*

[Figure: the Environment sends percepts (a datum) to the agent, and the agent sends an action back.]

* Known as Reinforcement Learning.
Reinforcement Learning

- Unknown P and reward R.
- Learning Component : Estimate the P and R values via data observed from the environment.
- Planning Component : Decide which actions to take that will maximise reward.
- Exploration vs. Exploitation
  - GLIE (Greedy in the Limit with Infinite Exploration)
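One common way to get a GLIE exploration scheme is epsilon-greedy action selection with epsilon decaying as 1 / (number of visits to the state). The sketch below assumes that schedule and a tabular `Q` keyed by `(state, action)`; both are illustrative choices, and the function name is hypothetical.

```python
import random


def glie_epsilon_greedy(Q, state, actions, visits, rng=random):
    """Pick an action with epsilon = 1 / (visits to `state`).

    Early on epsilon is large, so the agent explores; as visits grow,
    epsilon shrinks towards 0 and the choice becomes greedy in the limit.
    """
    visits[state] = visits.get(state, 0) + 1
    epsilon = 1.0 / visits[state]
    if rng.random() < epsilon:
        return rng.choice(actions)                      # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit


# Usage: repeated choices in one state drift towards the better action.
Q = {("s", "a"): 1.0, ("s", "b"): 0.0}
visits = {}
rng = random.Random(0)
picks = [glie_epsilon_greedy(Q, "s", ["a", "b"], visits, rng) for _ in range(2000)]
```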

Learning

- Model-based learning
  - Learn the model, and do planning
  - Requires less data, more computation
- Model-free learning
  - Plan without learning an explicit model
  - Requires a lot of data, less computation

Q-Learning

- Instead of learning P and R, learn Q* directly.
- Q*(s,a) : Optimal reward starting in s, if the first action is a, and after that the optimal policy is followed.
- Q* directly defines the optimal policy:

The optimal policy picks, in each state, the action with the maximum Q* value: π*(s) = argmax_a Q*(s,a).

Q-Learning

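- Given an experience tuple ⟨s,a,s',r⟩, update:

  $$Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s',a') \,\right]$$

  The left-hand side is the new estimate of the Q value, Q(s,a) on the right is the old estimate, and α is the learning rate.
- Under suitable assumptions, and with GLIE exploration, Q-Learning converges to the optimal Q*.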
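A minimal sketch of the tabular Q-Learning step, assuming the standard update in which the new estimate blends the old estimate with the one-step target r + γ max Q(s',a'). The learning rate `alpha`, the dictionary-based table and the helper names are assumptions made for illustration.

```python
def q_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one Q-Learning update for the experience tuple (s, a, s_next, r)."""
    old = Q.get((s, a), 0.0)                              # old estimate
    future = 0.0 if terminal else max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * future)  # new estimate
    return Q[(s, a)]


def greedy_action(Q, s, actions):
    """The policy defined by Q: the action with the maximum Q value in s."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))


# Usage on two hand-written experiences.
actions = ["left", "right"]
Q = {}
first = q_update(Q, 0, "right", 1, -1.0, actions)
Q[(1, "right")] = 10.0                                    # pretend this was learned
second = q_update(Q, 0, "right", 1, -1.0, actions)
best = greedy_action(Q, 0, actions)
```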

Semi-MDP: When actions take time.

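- The Semi-MDP equation:

  $$Q^*(s,a) = \mathbb{E}\left[\, r + \gamma^{N} \max_{a'} Q^*(s',a') \,\right]$$
- Semi-MDP Q-Learning equation:

  $$Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left[\, r + \gamma^{N} \max_{a'} Q(s',a') \,\right]$$

  where the experience tuple is ⟨s,a,s',r,N⟩, N is the number of time steps for which action a executed, and r is the accumulated discounted reward while action a was executing.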
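The Semi-MDP update differs from the one-step update only in how the reward is accumulated and how far the future is discounted. The sketch below assumes the standard Semi-MDP Q-Learning rule with discount γ^N, where N is taken to be the number of time steps the action ran; `alpha` and the function names are illustrative.

```python
def discounted_return(rewards, gamma):
    """r = sum over k of gamma^k * r_k for the N steps of one action."""
    return sum((gamma ** k) * rk for k, rk in enumerate(rewards))


def smdp_q_update(Q, s, a, s_next, r, N, actions, alpha=0.1, gamma=0.9):
    """One update for the experience tuple (s, a, s_next, r, N)."""
    old = Q.get((s, a), 0.0)
    future = max(Q.get((s_next, b), 0.0) for b in actions)
    # The future value is discounted by gamma^N because N steps elapsed.
    Q[(s, a)] = (1 - alpha) * old + alpha * (r + (gamma ** N) * future)
    return Q[(s, a)]


# Usage: an action that ran for three steps, each costing -1.
r = discounted_return([-1.0, -1.0, -1.0], gamma=0.5)
Q = {("B", "x"): 8.0}
new_q = smdp_q_update(Q, "A", "x", "B", r, 3, ["x"], alpha=0.5, gamma=0.5)
```

With N = 1 this reduces to the ordinary Q-Learning step.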

Printerbot

- Paul G. Allen Center has 85000 sq ft space
- Each floor ≈ 85000/7 ≈ 12000 sq ft
- Discretise location on a floor: 12000 parts.
- State space (without map) : 2 × 2 × 12000 × 12000 = 576,000,000 states, which is very large!
- How do humans do the decision making?
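The size of the flat state space is the product of the domain sizes of the state variables, which is what makes it grow exponentially with the number of features. A small worked calculation, using the numbers quoted for the Printerbot (the dictionary layout is just an illustration):

```python
from math import prod

# Domain size of each state variable, using the figures quoted for one floor.
domain_sizes = {
    "has-robot-printout": 2,
    "has-user-printout": 2,
    "loc": 12000,
    "user-loc": 12000,
}

flat_states = prod(domain_sizes.values())
print(f"{flat_states:,} states")          # 576,000,000 states

# Every additional binary feature doubles the table again.
with_extra_feature = flat_states * 2
```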

1. The Mathematical Perspective: A Structure Paradigm

- S: Relational MDP
- A: Concurrent MDP
- P: Dynamic Bayes Nets
- R: Continuous-state MDP
- G: Conjunction of state variables
- V: Algebraic Decision Diagrams
- π: Decision List (RMDP)

2. Modular Decision Making

- Example: go out of the room, walk in the hallway, go into the room.
- Humans plan modularly at different granularities of understanding.
- Going out of one room is similar to going out of another room.
- Navigation steps do not depend on whether we have the printout or not.

3. Background Knowledge

- Classical Planners using additional control knowledge can scale up to larger problems (e.g. HTN planning, TLPlan).
- What forms of control knowledge can we provide to our Printerbot?
  - First pick printouts, then deliver them.
  - Navigation: consider rooms, hallway, separately, etc.

A mechanism that exploits all three avenues : Hierarchies

- Way to add a special (hierarchical) structure on different parameters of an MDP.
- Draws from the intuition and reasoning in human decision making.
- Way to provide additional control knowledge to the system.

Hierarchy

- Hierarchy of : Behaviour, Skill, Module, SubTask, Macro-action, etc.
  - picking the pages
  - collision avoidance
  - fetch pages phase
  - walk in hallway
- HRL ≡ RL with temporally extended actions

Hierarchical Algos ≡ Gating Mechanism

- Hierarchical Learning
  - Learning the gating function
  - Learning the individual behaviours
  - Learning both

[Figure: a gate g selects among behaviours b_i.]*

*Can be a multi-level hierarchy.

Option : Movee until end of hallway

- Start : Any state in the hallway.
- Execute : policy as shown.
- Terminate : when s is end of hallway.

Options [Sutton, Precup, Singh’99]

- An option is a well defined behaviour.
- o = ⟨I_o, π_o, β_o⟩
- I_o: Set of states (I_o ⊆ S) in which o can be initiated.
- π_o(s): Policy (S → A*) when o is executing.
- β_o(s): Probability that o terminates in s.

*Can be a policy over lower level options.
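The three components of an option map directly onto a small data structure. The sketch below is an illustration under stated assumptions: the five-cell hallway, the `step` function, the reward of -1 per move and the names `Option` and `run_option` are all hypothetical. Running the option to termination yields the Semi-MDP experience (s', r, N).

```python
import random
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class Option:
    initiation: Set[int]                  # I_o: states where o may start
    policy: Callable[[int], str]          # pi_o: action to take in each state
    termination: Callable[[int], float]   # beta_o: probability of stopping in s


def run_option(option, s, step, gamma=0.9, rng=random):
    """Execute an option until it terminates; return (s_end, r, N)."""
    if s not in option.initiation:
        raise ValueError("option cannot be initiated in this state")
    r_total, discount, n = 0.0, 1.0, 0
    while True:
        s, r = step(s, option.policy(s))
        r_total += discount * r           # accumulate discounted reward
        discount *= gamma
        n += 1
        if rng.random() < option.termination(s):
            return s, r_total, n


# Assumed hallway: cells 0..4, the end of the hallway is cell 4.
def step(s, action):
    return (min(s + 1, 4), -1.0) if action == "movee" else (s, -1.0)


move_east = Option(
    initiation={0, 1, 2, 3},
    policy=lambda s: "movee",
    termination=lambda s: 1.0 if s == 4 else 0.0,
)
s_end, r, n = run_option(move_east, 1, step)
```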

Learning

- An option is a temporally extended action with a well defined policy.
- The set of options (O) replaces the set of actions (A).
- Learning occurs outside options.
- Learning over options ≡ Semi-MDP Q-Learning.

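Machine: Movee + Collision Avoidance

[Figure: finite-state machine with nodes Movee, Choose, Call M1, Call M2 and Return, and transitions labelled Obstacle and End of hallway. The two sub-machines M1 and M2 are drawn with the node sequences Movew, Moven, Moven, Return and Movew, Moves, Moves, Return.]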

Hierarchies of Abstract Machines [Parr, Russell’97]

- A machine is a partial policy represented by a Finite State Automaton.
- Node :
  - Execute a ground action.
  - Call a machine as a subroutine.
  - Choose the next node.
  - Return to the calling machine.

Learning

- Learning occurs within machines, as machines are only partially defined.
- Flatten all machines out and consider states [s,m], where s is a world state and m is a machine node ≡ MDP.
- reduce(S∘M) : Consider only states where the machine node is a choice node ≡ Semi-MDP.
- Learning ≈ Semi-MDP Q-Learning
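The flatten-then-reduce idea can be illustrated by enumerating joint states and keeping only the choice points. This is a rough sketch, not the HAM algorithm itself: the node table is hypothetical, the hallway has an assumed five cells, and reachability of joint states is ignored for simplicity.

```python
from itertools import product

# Hypothetical node table for one machine: node name -> node type.
# The four types mirror the HAM node kinds: action, call, choose, return.
MACHINE_NODES = {
    "Movee": "action",
    "Choose": "choose",
    "Call M1": "call",
    "Call M2": "call",
    "Return": "return",
}


def flatten(world_states, machine_nodes):
    """All joint states [s, m] of the flattened MDP."""
    return list(product(world_states, machine_nodes))


def reduce_to_choice_points(joint_states, machine_nodes):
    """Keep only joint states whose machine node is a choice node."""
    return [(s, m) for s, m in joint_states if machine_nodes[m] == "choose"]


hallway = range(5)                        # assumed 5-cell hallway
joint = flatten(hallway, MACHINE_NODES)
choice_points = reduce_to_choice_points(joint, MACHINE_NODES)
```

Only the choice points need Q values, since everywhere else the machine already dictates what happens next.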

Task Hierarchy: MAXQ Decomposition [Dietterich’00]

[Figure: task hierarchy tree. Node labels, from top to bottom: Root; Fetch, Deliver; Take, Give, Navigate(loc); Extend-arm, Grab, Release, Extend-arm; Moven, Moves, Movew, Movee.]

Children of a task are unordered.

MAXQ Decomposition

- Augment the state s by adding the subtask i : [s,i].
- Define C([s,i],j) as the reward received in i after j finishes.
- Q([s,Fetch],Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch],Navigate(prr))*
  - V([s,Navigate(prr)]) : Reward received while navigating.
  - C([s,Fetch],Navigate(prr)) : Reward received after navigation.
- Express V in terms of C.
- Learn C, instead of learning Q.

*Observe the context-free nature of the Q-value.
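Expressing V in terms of C gives a simple recursion: the value of a composite task is the best Q over its children, and each Q is the child's own value plus the completion term. The sketch below assumes that recursion; the small hierarchy, the single state `"s0"` and all numbers in the tables are made up for illustration.

```python
def maxq_v(task, s, children, v_primitive, C):
    """V([s, task]): a primitive's stored value, or the best Q over children."""
    if task not in children:              # primitive action
        return v_primitive[(s, task)]
    return max(maxq_q(task, s, j, children, v_primitive, C)
               for j in children[task])


def maxq_q(i, s, j, children, v_primitive, C):
    """Q([s, i], j) = V([s, j]) + C([s, i], j)."""
    return maxq_v(j, s, children, v_primitive, C) + C[(s, i, j)]


# Hypothetical hierarchy and tables for a single state "s0".
children = {
    "Root": ["Fetch"],
    "Fetch": ["Navigate", "Grab"],
    "Navigate": ["Moven", "Movee"],
}
v_primitive = {("s0", "Moven"): -1.0, ("s0", "Movee"): -1.0, ("s0", "Grab"): -1.0}
C = {
    ("s0", "Navigate", "Moven"): -3.0,
    ("s0", "Navigate", "Movee"): -1.0,
    ("s0", "Fetch", "Navigate"): -2.0,
    ("s0", "Fetch", "Grab"): -10.0,
    ("s0", "Root", "Fetch"): 15.0,
}
v_navigate = maxq_v("Navigate", "s0", children, v_primitive, C)
q_root_fetch = maxq_q("Root", "s0", "Fetch", children, v_primitive, C)
```

Because V for Navigate does not mention the parent task, the same table can be reused wherever Navigate is called.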

1. State Abstraction

- Abstract state : A state having fewer state variables; different world states map to the same abstract state.
- If we can drop some state variables, then we can reduce the learning time considerably!
- We may use different abstract states for different macro-actions.

State Abstraction in MAXQ

- Relevance : Only some variables are relevant for the task.
  - Fetch : user-loc irrelevant
  - Navigate(printer-room) : h-r-po, h-u-po, user-loc irrelevant
  - Fewer params for V of lower levels.
- Funnelling : Subtask maps many states to a smaller set of states.
  - Fetch : All states map to h-r-po=true, loc=pr.room.
  - Fewer params for C of higher levels.
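Relevance-based abstraction amounts to projecting a world state onto the variables a subtask cares about, so that states differing only in irrelevant variables share one table entry. A minimal sketch, assuming states are dictionaries; the `RELEVANT` table reads the relevance bullets as "everything not listed as irrelevant is relevant", which is an interpretation.

```python
def abstract(state, relevant):
    """Project a world state (dict) onto the variables relevant to a subtask."""
    return tuple((k, state[k]) for k in sorted(relevant))


# Assumed relevant variables per subtask.
RELEVANT = {
    "Navigate": {"loc"},
    "Fetch": {"loc", "h-r-po", "h-u-po"},
}

s1 = {"loc": 17, "h-r-po": True, "user-loc": 3, "h-u-po": False}
s2 = {"loc": 17, "h-r-po": False, "user-loc": 9, "h-u-po": False}

# Same abstract state for Navigate, different abstract states for Fetch.
same_for_navigate = abstract(s1, RELEVANT["Navigate"]) == abstract(s2, RELEVANT["Navigate"])
same_for_fetch = abstract(s1, RELEVANT["Fetch"]) == abstract(s2, RELEVANT["Fetch"])
```

Under this projection a Navigate value table needs one entry per location rather than one per full world state.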

State Abstraction in Options, HAM

- Options : Learning required only in states that are terminal states for some option.
- HAM : Original work has no abstraction.
  - Extension: Three-way value decomposition*:
    Q([s,m],n) = V([s,n]) + C([s,m],n) + Cex([s,m])
  - Similar abstractions are employed.

*[Andre, Russell’02]

2. Optimality

- Options : Hierarchical
  - Use (A ∪ O) : Global**
  - Interrupt options
- HAM : Hierarchical*
- MAXQ : Recursive*
  - Interrupt subtasks
  - Use Pseudo-rewards
  - Iterate!

* Can define equations for both optimalities.

** Advantage of using macro-actions may be lost.

3. Language Expressiveness

- Option
  - Can only input a complete policy.
- HAM
  - Can input a complete policy.
  - Can input a task hierarchy.
  - Can represent “amount of effort”.
  - Later extended to partial programs.
- MAXQ
  - Cannot input a policy (full/partial).

4. Knowledge Requirements

- Options
  - Requires complete specification of policy.
  - One could learn option policies, given subtasks.
- HAM
  - Medium requirements
- MAXQ
  - Minimal requirements

5. Models advanced

- Options : Concurrency
- HAM : Richer representation, Concurrency
- MAXQ : Continuous time, state, actions; Multi-agents, Average-reward.
- In general, more researchers have followed MAXQ
  - Less input knowledge
  - Value decomposition

6. Structure Paradigm

- S: Options, MAXQ
- A: All
- P: None
- R: MAXQ
- G: All
- V: MAXQ
- π: All

Directions for Future Research

- Bidirectional State Abstractions
- Hierarchies over other RL research
  - Model based methods
  - Function Approximators
- Probabilistic Planning
  - Hierarchical P and Hierarchical R
- Imitation Learning

Directions for Future Research

- Theory
  - Bounds (goodness of hierarchy)
  - Non-asymptotic analysis
- Automated Discovery
  - Discovery of Hierarchies
  - Discovery of State Abstraction
- Apply…

Applications

- Toy Robot
- Flight Simulator
- AGV Scheduling
- Keepaway soccer

[Figure: application images; recoverable labels are Warehouse, Parts, Assemblies, P1, P3, P4, D1, D2, D3 and D4. Images courtesy of various sources.]

Thinking Big…

"... consider maze domains. Reinforcement learning researchers, including this author, have spent countless years of research solving a solved problem! Navigating in grid worlds, even with stochastic dynamics, has been far from rocket science since the advent of search techniques such as A*.” -- David Andre

- Use planners, theorem provers, etc. as components in big hierarchical solver.

How to choose appropriate hierarchy

- Look at available domain knowledge
- If some behaviours are completely specified – options
- If some behaviours are partially specified – HAM
- If less domain knowledge available – MAXQ

- We can use all three to specify different behaviours in tandem.

Main ideas in HRL community

- Hierarchies speed up learning
- Value function decomposition
- State Abstractions
- Greedy non-hierarchical execution
- Context-free learning and pseudo-rewards
- Policy improvement by re-estimation and re-learning.
