
Hierarchical Reinforcement Learning

Mausam

[A Survey and Comparison of HRL techniques]


The Outline of the Talk

  • MDPs and Bellman’s curse of dimensionality.

  • RL: Simultaneous learning and planning.

  • Explore avenues to speed up RL.

  • Illustrate prominent HRL methods.

  • Compare prominent HRL methods.

  • Discuss future research.

  • Summarise


Decision Making

[Figure: the agent receives a percept from the environment, asks "What action next?", and sends an action back to the environment.]

Slide courtesy Dan Weld


Personal Printerbot

  • States (S) : {loc, has-robot-printout, user-loc, has-user-printout}, map

  • Actions (A) : {move_n, move_s, move_e, move_w, extend-arm, grab-page, release-pages}

  • Reward (R) : +20 if h-u-po, else -1

  • Goal (G) : All states with h-u-po true.

  • Start state: A state with h-u-po false.


Episodic Markov Decision Process

Episodic MDP ≡ MDP with absorbing goals

  • ⟨S, A, P, R, G, s0⟩

  • S : Set of environment states.

  • A: Set of available actions.

  • P: Probability Transition model. P(s’|s,a)*

  • R: Reward model. R(s)*

  • G: Absorbing goal states.

  • s0 : Start state.

  • γ : Discount factor**.

* Markovian assumption.

** Bounds R for an infinite horizon.
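The tuple above maps naturally onto a small data structure. A minimal Python sketch, assuming a dictionary-of-dictionaries transition model; the class and field names are illustrative, not from the talk:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

State = str    # e.g. a discretised Printerbot configuration
Action = str   # e.g. "move_e", "grab-page"

@dataclass
class EpisodicMDP:
    """The tuple <S, A, P, R, G, s0> together with a discount factor gamma."""
    states: Set[State]                                          # S
    actions: Set[Action]                                        # A
    transition: Dict[Tuple[State, Action], Dict[State, float]]  # P(s'|s,a)
    reward: Callable[[State], float]                            # R(s)
    goals: Set[State]                                           # G (absorbing)
    start: State                                                # s0
    gamma: float = 0.95                                         # discount factor
```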


Goal of an Episodic MDP

Find a policy π (S→A) which:

  • maximises the expected discounted reward for a fully observable* episodic MDP,

  • if the agent is allowed to execute for an indefinite horizon.

* Non-noisy, complete-information perceptors.


Solution of an Episodic MDP

  • Define V*(s) : Optimal reward starting in state s.

  • Value Iteration : Start with an estimate of V*(s) and successively re-estimate it to converge to a fixed point.
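A minimal sketch of that successive re-estimation loop, assuming the EpisodicMDP container sketched earlier (R(s) plus the discounted expected next-state value, repeated until the largest change falls below a threshold):

```python
def value_iteration(mdp, epsilon=1e-6):
    """Re-estimate V*(s) until no state changes by more than epsilon."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            if s in mdp.goals:
                continue                      # absorbing goal states stay at 0
            best = max(
                mdp.reward(s) + mdp.gamma * sum(
                    p * V[s2] for s2, p in mdp.transition[(s, a)].items())
                for a in mdp.actions if (s, a) in mdp.transition)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < epsilon:
            return V
```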


Complexity of Value Iteration

  • Each iteration – polynomial in |S|

  • Number of iterations – polynomial in |S|

  • Overall – polynomial in |S|

  • Polynomial in |S|, but |S| is exponential in the number of features in the domain*.

* Bellman's curse of dimensionality.


The Outline of the Talk

  • MDPs and Bellman’s curse of dimensionality.

  • RL: Simultaneous learning and planning.

  • Explore avenues to speed up RL.

  • Illustrate prominent HRL methods.

  • Compare prominent HRL methods.

  • Discuss future research.

  • Summarise


Learning

  • Gain knowledge

  • Gain understanding

  • Gain skills

  • Modification of behavioural tendency

[Figure: the environment produces data, and learning operates on that data.]

Decision Making while Learning*

  • Gain knowledge

  • Gain understanding

  • Gain skills

  • Modification of behavioural tendency

[Figure: the agent receives percepts from the environment, learns from each datum, asks "What action next?", and sends an action back to the environment.]

* Known as Reinforcement Learning


Reinforcement Learning

  • Unknown transition model P and reward R.

  • Learning Component : Estimate the P and R values via data observed from the environment.

  • Planning Component : Decide which actions to take to maximise reward.

  • Exploration vs. Exploitation

    • GLIE (Greedy in the Limit with Infinite Exploration)
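One common way to get GLIE behaviour is ε-greedy action selection with a per-state ε that decays as 1/k, so every action keeps being tried yet the policy becomes greedy in the limit. A hypothetical sketch, with the Q table stored as a dict keyed by (state, action):

```python
import random

def glie_action(Q, state, actions, visit_count):
    """Epsilon-greedy with epsilon = 1 / (1 + number of visits to this state)."""
    epsilon = 1.0 / (1.0 + visit_count.get(state, 0))
    if random.random() < epsilon:
        return random.choice(list(actions))                    # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```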


Learning

  • Model-based learning

    • Learn the model, and do planning

    • Requires less data, more computation

  • Model-free learning

    • Plan without learning an explicit model

    • Requires a lot of data, less computation


Q-Learning

  • Instead of learning P and R, learn Q* directly.

  • Q*(s,a) : Optimal reward starting in s, if the first action is a, and after that the optimal policy is followed.

  • Q* directly defines the optimal policy:

the optimal policy picks, in each state, the action with the maximum Q* value.


Q-Learning

  • Given an experience tuple ⟨s, a, s′, r⟩

  • Under suitable assumptions and GLIE exploration, Q-Learning converges to the optimal Q*.

Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a′ Q(s′,a′) ]

(new estimate of the Q value = weighted blend of the old estimate and the sampled return)
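A one-line realisation of that update on an experience tuple ⟨s, a, s′, r⟩; a sketch only, with a fixed learning rate α and the same dict-style Q table as in the earlier sketch:

```python
def q_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.95):
    """Blend the old estimate with the sampled return r + gamma * max_a' Q(s', a')."""
    sample = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```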


Semi-MDP: When actions take time.

  • The Semi-MDP equation (the action's random duration N enters through the discount):

    Q(s,a) = E[ r + γ^N max_a′ Q(s′,a′) ]

  • Semi-MDP Q-Learning update:

    Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ^N max_a′ Q(s′,a′) ]

    where the experience tuple is ⟨s, a, s′, r, N⟩ and

    r = accumulated discounted reward while action a was executing.
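The only change from ordinary Q-learning is that the future value is discounted by γ^N, where N is how long the action ran. A sketch mirroring the q_update above:

```python
def smdp_q_update(Q, s, a, s_next, r, N, actions, alpha=0.1, gamma=0.95):
    """Semi-MDP step on <s, a, s', r, N>; r is the discounted reward
    accumulated while a was executing for N time steps."""
    sample = r + (gamma ** N) * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```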


Printerbot

  • Paul G. Allen Center has 85,000 sq ft of space

  • Each floor ~ 85000/7 ~ 12000 sq ft

  • Discretise location on a floor: 12000 parts.

  • State Space (without map) : 2 × 2 × 12000 × 12000 --- very large! (See the quick count after this list.)

  • How do humans do the decision making?
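The count referenced above, spelled out as a quick, illustrative check:

```python
# 2 values of h-r-po x 2 values of h-u-po x 12000 robot locations x 12000 user locations
num_states = 2 * 2 * 12000 * 12000
print(f"{num_states:,}")   # 576,000,000 states -- hopeless for flat tabular RL
```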


The Outline of the Talk

  • MDPs and Bellman’s curse of dimensionality.

  • RL: Simultaneous learning and planning.

  • Explore avenues to speed up RL.

  • Illustrate prominent HRL methods.

  • Compare prominent HRL methods.

  • Discuss future research.

  • Summarise


1. The Mathematical Perspective: A Structure Paradigm

  • S: Relational MDP

  • A: Concurrent MDP

  • P: Dynamic Bayes Nets

  • R: Continuous-state MDP

  • G: Conjunction of state variables

  • V: Algebraic Decision Diagrams

  • π: Decision List (RMDP)



2. Modular Decision Making

  • Go out of room

  • Walk in hallway

  • Go in the room


2. Modular Decision Making

  • Humans plan modularly at different granularities of understanding.

  • Going out of one room is similar to going out of another room.

  • Navigation steps do not depend on whether we have the printout or not.


3. Background Knowledge

  • Classical Planners using additional control knowledge can scale up to larger problems.

  • (e.g., HTN planning, TLPlan)

  • What forms of control knowledge can we provide to our Printerbot?

    • First pick printouts, then deliver them.

    • Navigation – consider rooms and hallway separately, etc.


A mechanism that exploits all three avenues : Hierarchies

  • A way to add a special (hierarchical) structure on different parameters of an MDP.

  • Draws from the intuition and reasoning in human decision making.

  • A way to provide additional control knowledge to the system.


The Outline of the Talk

  • MDPs and Bellman’s curse of dimensionality.

  • RL: Simultaneous learning and planning.

  • Explore avenues to speed up RL.

  • Illustrate prominent HRL methods.

  • Compare prominent HRL methods.

  • Discuss future research.

  • Summarise


Hierarchy

  • Hierarchy of : Behaviour, Skill, Module, SubTask, Macro-action, etc.

    • picking the pages

    • collision avoidance

    • fetch pages phase

    • walk in hallway

  • HRL ≡ RL with temporally extended actions


Hierarchical Algorithms ≡ Gating Mechanism

  • Hierarchical Learning

    • Learning the gating function

    • Learning the individual behaviours

    • Learning both

[Figure: a gate g selects among behaviours b_i.]*

*Can be a multi-level hierarchy.
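As a sketch, the gating view is just function composition: a (learned or hand-written) gate g picks which behaviour b_i is in charge, and that behaviour emits the primitive action. The names here are illustrative:

```python
def gated_policy(state, gate, behaviours):
    """g : state -> index of the active behaviour; b_i : state -> primitive action.
    Either the gate, the behaviours, or both may be learned."""
    i = gate(state)
    return behaviours[i](state)
```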


Option : Move_e until end of hallway

  • Start : Any state in the hallway.

  • Execute : policy as shown (move east along the hallway).

  • Terminate : when s is end of hallway.


Options [Sutton, Precup, Singh’99]

  • An option is a well defined behaviour.

  • o = ⟨I_o, π_o, β_o⟩

  • I_o: Set of states (I_o ⊆ S) in which o can be initiated.

  • π_o(s): Policy (S→A*) followed while o is executing.

  • β_o(s): Probability that o terminates in s.

*Can be a policy over lower-level options.
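The triple ⟨I_o, π_o, β_o⟩ maps directly onto a small record; a sketch with illustrative field names (State and Action as in the earlier MDP sketch):

```python
from dataclasses import dataclass
from typing import Callable, Set

State = str
Action = str

@dataclass
class Option:
    """o = <I_o, pi_o, beta_o>."""
    initiation: Set[State]                 # I_o: states where o may be started
    policy: Callable[[State], Action]      # pi_o: action (or lower-level option) to take
    termination: Callable[[State], float]  # beta_o(s): probability o terminates in s
```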


Learning

  • An option is a temporally extended action with a well-defined policy.

  • Set of options (O) replaces the set of actions (A)

  • Learning occurs outside options.

  • Learning over options ≡ Semi-MDP Q-Learning.
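Concretely, executing one option produces exactly the ⟨s, o, s′, r, N⟩ experience the Semi-MDP update needs. A sketch, where env.step(s, a) is a hypothetical one-step simulator returning (next state, reward):

```python
import random

def run_option(env, option, s, gamma=0.95):
    """Run an option to termination; return (s', accumulated discounted reward, duration)."""
    r, N, discount = 0.0, 0, 1.0
    while True:
        a = option.policy(s)
        s, reward = env.step(s, a)
        r += discount * reward
        discount *= gamma
        N += 1
        if random.random() < option.termination(s):
            return s, r, N
```

The returned triple plugs straight into the smdp_q_update sketch above, with the option o in place of a primitive action.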


Machine: Move_e + Collision Avoidance

[Figure: finite-state machines. The top-level machine executes Move_e; a Choose node branches on the observation (Obstacle / End of hallway), calling sub-machines M1 and M2 (short detour routines composed of Move_w, Move_n, Move_s and Return) or Returning at the end of the hallway.]

Hierarchies of Abstract Machines [Parr, Russell’97]

  • A machine is a partial policy represented by a Finite State Automaton.

  • Node :

    • Execute a ground action.

    • Call a machine as a subroutine.

    • Choose the next node.

    • Return to the calling machine.



Learning

  • Learning occurs within machines, as machines are only partially defined.

  • Flatten all machines out and consider states [s,m], where s is a world state and m a machine node ≡ MDP.

  • reduce(S∘M) : Consider only states where the machine node is a choice node ≡ Semi-MDP.

  • Learning ≈ Semi-MDP Q-Learning
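A sketch of the machinery this implies, with hypothetical names: machine nodes have one of four types, and in the flattened [s, m] process only the choice nodes are decision points, so Q-values are learned (Semi-MDP style) only there:

```python
from enum import Enum

class NodeType(Enum):
    ACTION = 1   # execute a ground action
    CALL = 2     # call another machine as a subroutine
    CHOICE = 3   # choose the next node -- the only place learning happens
    RETURN = 4   # return control to the calling machine

def is_decision_point(joint_state):
    """joint_state = [s, m]: a world state paired with the current machine node."""
    s, m = joint_state
    return m.node_type == NodeType.CHOICE
```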


Task Hierarchy: MAXQ Decomposition [Dietterich’00]

[Task graph (children of a task are unordered):

  Root → Fetch, Deliver

  Fetch → Navigate(loc), Take

  Deliver → Navigate(loc), Give

  Take → Extend-arm, Grab

  Give → Extend-arm, Release

  Navigate(loc) → Move_n, Move_s, Move_w, Move_e]
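One plausible encoding of that task graph as plain data (the names follow the reading of the diagram above; leaves of the hierarchy are primitive actions):

```python
TASK_CHILDREN = {
    "Root":          {"Fetch", "Deliver"},
    "Fetch":         {"Navigate(loc)", "Take"},
    "Deliver":       {"Navigate(loc)", "Give"},
    "Take":          {"Extend-arm", "Grab"},
    "Give":          {"Extend-arm", "Release"},
    "Navigate(loc)": {"Move_n", "Move_s", "Move_w", "Move_e"},
}

def is_primitive(task):
    """Tasks with no children are primitive actions."""
    return task not in TASK_CHILDREN
```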


MAXQ Decomposition

  • Augment the state s by adding the subtask i : [s,i].

  • Define C([s,i],j) as the reward received in i after j finishes.

  • Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))*

  • Express V in terms of C

  • Learn C instead of learning Q

V([s,Navigate(prr)]) : reward received while navigating.

C([s,Fetch],Navigate(prr)) : reward received after navigation.

*Observe the context-free nature of the Q-value.
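A sketch of the decomposition as a recursive evaluation, assuming a learned table R of primitive one-step rewards, a learned completion table C keyed by (state, parent task, child task), and the TASK_CHILDREN map from the earlier sketch:

```python
def maxq_value(s, task, R, C, children):
    """V([s, task]) = R[(s, task)] for a primitive action,
       otherwise  max over children j of ( V([s, j]) + C([s, task], j) )."""
    kids = children.get(task)
    if not kids:                               # primitive action
        return R.get((s, task), 0.0)
    return max(maxq_value(s, j, R, C, children) + C.get((s, task, j), 0.0)
               for j in kids)
```

Only the C entries (and the primitive rewards) need to be learned; every composite V is recomputed from them on demand.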


The Outline of the Talk

  • MDPs and Bellman’s curse of dimensionality.

  • RL: Simultaneous learning and planning.

  • Explore avenues to speed up RL.

  • Illustrate prominent HRL methods.

  • Compare prominent HRL methods.

  • Discuss future research.

  • Summarise


1. State Abstraction

  • Abstract state : A state having fewer state variables; different world states map to the same abstract state.

  • If we can reduce some state variables, then we can reduce the learning time considerably!

  • We may use different abstract states for different macro-actions.


State Abstraction in MAXQ

  • Relevance : Only some variables are relevant for the task.

    • Fetch : user-loc irrelevant

    • Navigate(printer-room) : h-r-po, h-u-po, user-loc irrelevant

    • Fewer params for V of lower levels.

  • Funnelling : A subtask maps many states to a smaller set of states.

    • Fetch : All states map to h-r-po=true, loc=pr.room.

    • Fewer params for C of higher levels.
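Relevance lists translate into a simple projection of the state before it is used as a table index, so many world states share one entry. A hypothetical sketch for the Printerbot, representing the state as a dict of variables:

```python
RELEVANT_VARS = {
    "Fetch":                  ("loc", "h-r-po", "h-u-po"),  # user-loc is irrelevant
    "Navigate(printer-room)": ("loc",),                     # h-r-po, h-u-po, user-loc irrelevant
}

def abstract_state(state, task):
    """Project the full state onto the variables relevant for `task`."""
    return tuple(state[v] for v in RELEVANT_VARS[task])
```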


State Abstraction in Options, HAM

  • Options : Learning required only in states that are terminal states for some option.

  • HAM : Original work has no abstraction.

    • Extension: Three-way value decomposition*:

      Q([s,m],n) = V([s,n]) + C([s,m],n) + C^ex([s,m])

    • Similar abstractions are employed.

*[Andre,Russell’02]


2. Optimality

Hierarchical Optimality

vs.

Recursive Optimality


Optimality

  • Options : Hierarchical

    • Use (A ∪ O) : Global**

    • Interrupt options

  • HAM : Hierarchical*

  • MAXQ : Recursive*

    • Interrupt subtasks

    • Use Pseudo-rewards

    • Iterate!

* Can define equations for both optimalities.

** Advantage of using macro-actions may be lost.


3. Language Expressiveness

  • Option

    • Can only input a complete policy

  • HAM

    • Can input a complete policy.

    • Can input a task hierarchy.

    • Can represent “amount of effort”.

    • Later extended to partial programs.

  • MAXQ

    • Cannot input a policy (full/partial)


4. Knowledge Requirements

  • Options

    • Requires complete specification of policy.

    • One could learn option policies – given subtasks.

  • HAM

    • Medium requirements

  • MAXQ

    • Minimal requirements


5. Models advanced

  • Options : Concurrency

  • HAM : Richer representation, Concurrency

  • MAXQ : Continuous time, state, actions; Multi-agents, Average-reward.

  • In general, more researchers have followed MAXQ

    • Less input knowledge

    • Value decomposition


6. Structure Paradigm

  • S: Options, MAXQ

  • A: All

  • P: None

  • R: MAXQ

  • G: All

  • V: MAXQ

  • π: All


The Outline of the Talk

  • MDPs and Bellman’s curse of dimensionality.

  • RL: Simultaneous learning and planning.

  • Explore avenues to speed up RL.

  • Illustrate prominent HRL methods.

  • Compare prominent HRL methods.

  • Discuss future research.

  • Summarise


Directions for Future Research

  • Bidirectional State Abstractions

  • Hierarchies over other RL research

    • Model based methods

    • Function Approximators

  • Probabilistic Planning

    • Hierarchical P and Hierarchical R

  • Imitation Learning


Directions for Future Research

  • Theory

    • Bounds (goodness of hierarchy)

    • Non-asymptotic analysis

  • Automated Discovery

    • Discovery of Hierarchies

    • Discovery of State Abstraction

  • Apply…


Applications

[Figure: an AGV scheduling domain with a warehouse, parts stations P1–P4, delivery stations D1–D4, and assemblies.]

  • Toy Robot

  • Flight Simulator

  • AGV Scheduling

  • Keepaway soccer

Images courtesy various sources


Thinking Big…

"... consider maze domains. Reinforcement learning researchers, including this author, have spent countless years of research solving a solved problem! Navigating in grid worlds, even with stochastic dynamics, has been far from rocket science since the advent of search techniques such as A*.” -- David Andre

  • Use planners, theorem provers, etc. as components in a big hierarchical solver.


The Outline of the Talk

  • MDPs and Bellman’s curse of dimensionality.

  • RL: Simultaneous learning and planning.

  • Explore avenues to speed up RL.

  • Illustrate prominent HRL methods.

  • Compare prominent HRL methods.

  • Discuss future research.

  • Summarise


How to choose an appropriate hierarchy

  • Look at available domain knowledge

    • If some behaviours are completely specified – options

    • If some behaviours are partially specified – HAM

    • If less domain knowledge available – MAXQ

  • We can use all three to specify different behaviours in tandem.


Main ideas in HRL community

  • Hierarchies speed up learning

  • Value function decomposition

  • State Abstractions

  • Greedy non-hierarchical execution

  • Context-free learning and pseudo-rewards

  • Policy improvement by re-estimation and re-learning.

