Life: play and win in 20 trillion moves (Stuart Russell)
White to play and win in 2 moves.
Presentation Transcript
Life
  • 100 years x 365 days x 24 hrs x 3600 seconds x 640 muscles x 10/second = 20 trillion actions
  • (Not to mention choosing brain activities!)
  • The world has a very large, partially observable state space and an unknown model
  • So, being intelligent is “provably” hard
  • How on earth do we manage?
Hierarchical structure!! (among other things)
  • Deeply nested behaviors give modularity
    • E.g., tongue controls for [t] independent of everything else given decision to say “tongue”
  • High-level decisions give scale
    • E.g., “go to IJCAI 13” ≈ 3,000,000,000 actions
    • Look further ahead, reduce computation exponentially

[NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10, IJCAI 11]

Reminder: MDPs
  • Markov decision process (MDP) defined by
    • Set of states S
    • Set of actions A
    • Transition model P(s’ | s,a)
      • Markovian, i.e., depends only on st and not on history
    • Reward function R(s,a,s’)
    • Discount factor γ
  • Optimal policy π* maximizes the value function Vπ = Σt γ^t R(st,π(st),st+1)
  • Q(s,a) = Σs’ P(s’ | s,a) [R(s,a,s’) + γ V(s’)]
Reminder: policy invariance
  • Given an MDP M with reward function R(s,a,s’) and any potential function Φ(s):
  • Any policy that is optimal for M is optimal for any version M’ with reward function R’(s,a,s’) = R(s,a,s’) + γΦ(s’) - Φ(s)
  • So we can provide hints to speed up RL by shaping the reward function
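
As a concrete illustration of the shaping trick (a minimal Python sketch; the potential, base reward, and discount below are made-up examples, not from the slides):

# Sketch: potential-based reward shaping R'(s,a,s') = R(s,a,s') + gamma*phi(s') - phi(s).
def shaped_reward(reward, phi, gamma):
    """Wrap a base reward function R(s, a, s') with a shaping potential phi(s)."""
    def r_shaped(s, a, s_next):
        return reward(s, a, s_next) + gamma * phi(s_next) - phi(s)
    return r_shaped

# Example hint: states closer to a goal cell get higher potential.
goal = (10, 10)
phi = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))    # negative Manhattan distance
base_reward = lambda s, a, s_next: 1.0 if s_next == goal else -0.01
r_prime = shaped_reward(base_reward, phi, gamma=0.99)

By the invariance theorem above, optimizing r_prime yields the same optimal policy as optimizing base_reward, but the hint can speed up learning.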
Reminder: Bellman equation
  • Bellman equation for a fixed policy π:

Vπ(s) = Σs’ P(s’|s,π(s)) [R(s,π(s),s’) + γ Vπ(s’)]

  • Bellman equation for the optimal value function:

V(s) = maxa Σs’ P(s’|s,a) [R(s,a,s’) + γ V(s’)]
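
For concreteness, a minimal value-iteration loop in Python that solves the optimal Bellman equation for a tiny, explicitly enumerated MDP (the transition and reward tables here are invented for the example):

# Sketch: value iteration on a small explicit MDP.
# P[s][a] is a list of (s_next, prob); R[(s, a, s_next)] is the reward; gamma is the discount.
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[s][a])
                        for a in actions]
            new_v = max(q_values)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

# Toy two-state example.
states, actions = ["s0", "s1"], ["stay", "go"]
P = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 1.0)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
R = {(s, a, s2): (1.0 if s2 == "s1" else 0.0) for s in states for a in actions for s2 in states}
print(value_iteration(states, actions, P, R, gamma=0.9))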

Reminder: Reinforcement learning
  • Agent observes transitions and rewards:
    • Doing a in s leads to s’ with reward r
  • Temporal-difference (TD) learning:
    • V(s) ← (1-α) V(s) + α [r + γ V(s’)]
  • Q-learning:
    • Q(s,a) ← (1-α) Q(s,a) + α [r + γ maxa’ Q(s’,a’)]
  • SARSA (after observing the next action a’):
    • Q(s,a) ← (1-α) Q(s,a) + α [r + γ Q(s’,a’)]
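
The three update rules, written as short Python functions over dictionary-backed tables (a sketch; the table layout and hyperparameters are my own choices, not from the slides):

# Sketch: tabular TD(0), Q-learning, and SARSA updates.
from collections import defaultdict

alpha, gamma = 0.1, 0.99
V = defaultdict(float)          # state -> value
Q = defaultdict(float)          # (state, action) -> value

def td_update(s, r, s_next):
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])

def q_learning_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def sarsa_update(s, a, r, s_next, a_next):
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])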
Reminder: RL issues
  • Credit assignment problem: which earlier choices deserve credit or blame for a delayed reward?
  • Generalization problem: far too many states to learn a value for each one separately
Abstract states
  • First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc.
  • Oops – lost the Markov property! Action outcomes depend on the low-level state, which depends on history

Abstract actions
  • Forestier & Varaiya, 1978: supervisory control:
    • Abstract action is a fixed policy for a region
    • “Choice states” where policy exits region

Decision process on choice states is semi-Markov

Semi-Markov decision processes
  • Add random duration N for action executions:

P(s’,N | s,a)

  • Modified Bellman equation:

Vπ(s) = Σs’,N P(s’,N | s,π(s)) [R(s,π(s),s’,N) + γ^N Vπ(s’)]

  • Modified Q-learning with abstract actions:
    • For transitions within an abstract action πa:

rc ← rc + γc R(s,πa(s),s’);   γc ← γc γ

    • For transitions that terminate an action begun at s0:

Q(s0,πa) ← (1-α) Q(s0,πa) + α [rc + γc maxa’ Q(s’,πa’)]
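
A Python sketch of this bookkeeping (the data structures and the env/abstract-action interfaces are assumptions for illustration): while the abstract action runs we accumulate the discounted reward rc and the compounded discount γc, and on termination we apply one SMDP Q-update at the state where the action began.

# Sketch: SMDP Q-learning for a temporally extended (abstract) action.
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)                       # (state, abstract_action) -> value

def run_abstract_action(env, s0, abstract_action, abstract_actions):
    """Execute the abstract action (a policy plus a termination test) from s0,
    accumulating discounted reward, then do one SMDP Q-learning update at s0."""
    s, rc, gc = s0, 0.0, 1.0
    while not abstract_action.terminates(s):
        a = abstract_action.policy(s)        # low-level action chosen by the abstract action
        s, r = env.step(s, a)                # assumed interface: returns (next_state, reward)
        rc, gc = rc + gc * r, gc * gamma     # accumulate reward, compound the discount
    best = max(Q[(s, b)] for b in abstract_actions)
    Q[(s0, abstract_action)] = (1 - alpha) * Q[(s0, abstract_action)] + alpha * (rc + gc * best)
    return s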

Hierarchical abstract machines
  • Hierarchy of NDFAs; states come in 4 types:
    • Action states execute a physical action
    • Call states invoke a lower-level NDFA
    • Choice points nondeterministically select next state
    • Stop states return control to invoking NDFA
  • NDFA internal transitions dictated by percept
Running example
  • Peasants can move, pickup and dropoff
  • Penalty for collision
  • Cost-of-living each step
  • Reward for dropping off resources
  • Goal: gather 10 gold + 10 wood
  • (3L)^n (or more) states s
  • 7^n primitive actions a
Joint SMDP
  • Physical states S, machine states Θ, joint state space Ω = S x Θ
  • Joint MDP:
    • Transition model for physical state (from physical MDP) and machine state (as defined by HAM)
    • A(ω) is fixed except when θ is at a choice point
  • Reduction to SMDP:
    • States are just those where θ is at a choice point
Hierarchical optimality
  • Theorem: the optimal solution to the SMDP is hierarchically optimal for the original MDP, i.e., maximizes value among all behaviors consistent with the HAM
Partial programs
  • HAMs (and Dietterich’s MAXQ hierarchies and Sutton’s Options) are all partially constrained descriptions of behaviors with internal state
  • A partial program is like an ordinary program in any programming language, with a nondeterministic choice operator added to allow for ignorance about what to do
  • HAMs, MAXQs, Options are trivially expressible as partial programs
  • Standard MDP: loop forever choose an action
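
To make “an ordinary program plus a choice operator” concrete, here is a toy Python rendering (purely illustrative; the actual formalism used in this work is ALisp, shown below). The choice operator is left unresolved, and the learning algorithm supplies its completion.

# Toy partial program: an ordinary program plus an unresolved `choose`.
import random

def choose(label, options):
    # Placeholder completion: uniform random. A learner would instead return
    # argmax_u Q(omega, u), where omega includes the program state.
    return random.choice(options)

def standard_mdp_program(env, s, actions):
    # The degenerate partial program from the slide: loop forever, choose an action.
    while True:
        a = choose('act', actions)
        s, r = env.step(s, a)        # assumed environment interface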
RL and partial programs
[Figure: the partial program constrains the agent. The learning algorithm observes states s and rewards r from the environment, emits actions a, and outputs a completion of the partial program; the learned completion is hierarchically optimal for all terminating programs.]

Single-threaded Alisp program

Program state θ includes:
  • Program counter
  • Call stack
  • Global variables

(defun top ()
  (loop do
    (choose 'top-choice
      (gather-wood)
      (gather-gold))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))

Q-functions
  • Represent completions using Q-function
  • Joint state ω = [s, θ] = environment state + program state
  • MDP + partial program = SMDP over {ω}
  • Qπ(ω,u) = “Expected total reward if I make choice u in ω and follow completion π thereafter”
  • Modified Q-learning [AR 02] finds optimal completion
Internal state
  • Availability of internal state (e.g., goal stack) can greatly simplify value functions and policies
  • E.g., while navigating to location (x,y), moving towards (x,y) is a good idea
  • Natural local shaping potential (distance from destination) impossible to express in external terms
Temporal Q-decomposition
[Figure: the choice hierarchy Top → {GatherGold, GatherWood} → {Nav(Mine2), Nav(Forest1)}, with the Q-value at each choice split into components Qr, Qc, Qe.]
  • Temporal decomposition Q = Qr + Qc + Qe where
    • Qr(ω,u) = reward while doing u (may be many steps)
    • Qc(ω,u) = reward in current subroutine after doing u
    • Qe(ω,u) = reward after current subroutine
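
A small Python sketch of how such a decomposition can be stored and used (my own rendering, not the paper's implementation): three tables indexed by (ω, u), summed when choosing greedily.

# Sketch: temporally decomposed Q-function Q = Qr + Qc + Qe.
from collections import defaultdict

Qr = defaultdict(float)   # reward accrued while executing choice u
Qc = defaultdict(float)   # reward in the current subroutine after u
Qe = defaultdict(float)   # reward after the current subroutine exits

def q(omega, u):
    return Qr[(omega, u)] + Qc[(omega, u)] + Qe[(omega, u)]

def greedy_choice(omega, options):
    return max(options, key=lambda u: q(omega, u))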
State abstraction
  • Temporal decomposition => state abstraction
    • E.g., while navigating, Qc independent of gold reserves
  • In general, local Q-components can depend on few variables => fast learning
Handling multiple effectors
[Figure: several peasants, each effector assigned to a task thread such as Get-gold, Get-wood, or Defend-Base.]
  • Multithreaded agent programs
    • Threads = tasks
    • Each effector assigned to a thread
    • Threads can be created/destroyed
    • Effectors can be reassigned
    • Effectors can be created/destroyed

An example Concurrent ALisp program

For comparison, the single-threaded ALisp program:

(defun top ()
  (loop do
    (choose 'top-choice
      (gather-gold)
      (gather-wood))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

The Concurrent ALisp version:

(defun top ()
  (loop do
    (until (my-effectors)
      (choose 'dummy))
    (setf peas (first (my-effectors)))
    (choose 'top-choice
      (spawn gather-wood peas)
      (spawn gather-gold peas))))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))

Concurrent Alisp semantics
[Figure: two peasant threads running concurrently. Thread 1 controls Peasant 1 and is executing gather-wood, currently inside (nav Wood1), repeatedly choosing a direction from (N S E W R) and issuing it with (action d); Thread 2 controls Peasant 2 and is executing gather-gold, currently inside (nav Gold2). At any moment each thread is Running, Paused, Waiting for a joint action, or Making a joint choice, and the environment advances one timestep (e.g., from 26 to 27) when a joint action is executed.]

Concurrent Alisp semantics

  • Threads execute independently until they hit a choice or action
  • Wait until all threads are at a choice or action
    • If all effectors have been assigned an action, do that joint action in environment
    • Otherwise, make joint choice
Q-functions
  • To complete partial program, at each choice state ω, need to specify choices for all choosing threads
  • So Q(ω,u) as before, except u is a joint choice
  • Suitable SMDP Q-learning gives optimal completion

[Figure: example Q-function]

Problems with concurrent activities
  • Temporal decomposition of Q-function lost
  • No credit assignment among threads
    • Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly
    • Peasant 2 thinks he’s done very well!!
    • Significantly slows learning as number of peasants increases
Threadwise decomposition
  • Idea: decompose reward among threads (Russell & Zimdars, 2003)
  • E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants
  • Qjπ(ω,u) = “Expected total reward received by thread j if we make joint choice u and then do π”
  • Threadwise Q-decomposition Q = Q1 + … + Qn
  • Recursively distributed SARSA => global optimality
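
A compact Python sketch of the resulting coordination scheme (illustrative only): each thread keeps its own table Qj and receives its own reward share, and the joint choice maximizes the sum of the per-thread Q-values.

# Sketch: threadwise Q-decomposition with a joint greedy choice and SARSA-style updates.
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = [defaultdict(float) for _ in range(3)]     # one table per thread, keyed by (omega, u)

def joint_choice(omega, joint_options):
    # Choose the joint choice u maximizing the summed per-thread Q-values.
    return max(joint_options, key=lambda u: sum(Qj[(omega, u)] for Qj in Q))

def sarsa_updates(omega, u, rewards, omega_next, u_next):
    # rewards[j] is thread j's share of the reward (threadwise credit assignment).
    for Qj, rj in zip(Q, rewards):
        Qj[(omega, u)] = (1 - alpha) * Qj[(omega, u)] + alpha * (rj + gamma * Qj[(omega_next, u_next)])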
Learning threadwise decomposition
[Figure: at an action state, the global reward r is decomposed into per-thread rewards r1, r2, r3 delivered to the Peasant 1, 2, and 3 threads, while the chosen action a is executed.]
[Figure: at a choice state ω, each peasant thread runs its own Q-update and reports Qj(ω,·); the top thread sums Q1(ω,·) + Q2(ω,·) + Q3(ω,·) to obtain Q(ω,·) and selects the joint choice u by argmax.]

Threadwise and temporal decomposition
  • Qj = Qj,r + Qj,c + Qj,e where
    • Qj,r(ω,u) = expected reward gained by thread j while doing u
    • Qj,c(ω,u) = expected reward gained by thread j after u but before leaving the current subroutine
    • Qj,e(ω,u) = expected reward gained by thread j after the current subroutine

Resource gathering with 15 peasants
[Figure: reward of the learnt policy vs. number of learning steps (x 1000), comparing four learners: Threadwise + Temporal, Threadwise, Undecomposed, and Flat.]

Summary of Part I
  • Structure in behavior seems essential for scaling up
  • Partial programs
    • Provide natural structural constraints on policies
    • Decompose value functions into simple components
    • Include internal state (e.g., “goals”) that further simplifies value functions, shaping rewards
  • Concurrency
    • Simplifies description of multieffector behavior
    • Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)
Outline
  • Hierarchical RL with abstract actions
  • Hierarchical RL with partial programs
  • Efficient hierarchical planning
Abstract lookahead
[Figure: two lookahead trees. A chess position is searched over concrete moves such as Rxa1, Kxa1, Qa4, c2, Kc3; a text-writing tree branches over single characters ‘a’ and ‘b’ into aa, ab, ba, bb.]
  • k-step lookahead >> 1-step lookahead
    • e.g., chess
  • k-step lookahead no use if steps are too small
    • e.g., first k characters of a NIPS paper
  • Abstract plans (high-level executions of partial programs) are shorter
    • Much shorter plans for given goals => exponential savings
    • Can look ahead much further into the future
High-level actions
[Figure: refining the abstract plan [Act], e.g., into [GoSFO, Act] or [GoOAK, Act], and [GoSFO, Act] further into [WalkToBART, BARTtoSFO, Act].]
  • Start with restricted classical HTN form of partial program
  • A high-level action (HLA)
    • Has a set of possible immediate refinements into sequences of actions, primitive or high-level
    • Each refinement may have a precondition on its use
    • Executable plan space = all primitive refinements of Act
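
A toy Python encoding of this notion (entirely illustrative; the names GoSFO, WalkToBART, etc. follow the figure, but the data structures and refinements are my own): an HLA carries a list of guarded refinements, and the executable plan space is obtained by recursively expanding high-level steps.

# Sketch: HLAs as named actions with guarded refinements into sequences of steps.
from itertools import islice

class HLA:
    def __init__(self, name, refinements):
        # refinements: list of (precondition, steps) pairs, where precondition is a
        # function of the state and each step is an HLA or a primitive action name (str).
        self.name = name
        self.refinements = refinements

def primitive_refinements(plan, state):
    """Lazily enumerate primitive refinements of `plan` whose refinement preconditions
    hold in `state` (state change along the plan is ignored; this only illustrates
    the plan space, it is not a planner)."""
    if not plan:
        yield []
        return
    first, rest = plan[0], list(plan[1:])
    if isinstance(first, str):                       # already primitive
        for tail in primitive_refinements(rest, state):
            yield [first] + tail
    else:                                            # expand the HLA
        for pre, steps in first.refinements:
            if pre(state):
                for seq in primitive_refinements(list(steps) + rest, state):
                    yield seq

# Hypothetical example: Act refines into nothing (stop) or [GoSFO, Act].
GoSFO = HLA('GoSFO', [(lambda s: True, ['WalkToBART', 'BARTtoSFO'])])
Act = HLA('Act', [])
Act.refinements = [(lambda s: True, []), (lambda s: True, [GoSFO, Act])]

print(list(islice(primitive_refinements([Act], state={}), 3)))
# [[], ['WalkToBART', 'BARTtoSFO'], ['WalkToBART', 'BARTtoSFO', 'WalkToBART', 'BARTtoSFO']]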

Lookahead with HLAs
  • To do offline or online planning with HLAs, need a model
  • See classical HTN planning literature – not!
  • Drew McDermott (AI Magazine, 2000):
    • “The semantics of hierarchical planning have never been clarified … no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions”
Downward refinement
  • Downward refinement property (DRP): every apparently successful high-level plan has a successful primitive refinement
  • “Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking
  • Bacchus and Yang, 1991: it is naïve to expect the DRP to hold in general
  • MRW, 2007: Theorem: if the assertions made about HLAs are true, then the DRP always holds
  • Problem: how to say true things about the effects of HLAs?
Angelic semantics for HLAs
[Figure: reachable sets in state space. From s0, primitive actions a1, a2, a3, a4 reach individual states; the reachable set of h1, followed by that of h2, contains the goal g, so the high-level plan h1h2 is a solution.]
  • Begin with atomic state-space view, start in s0, goal state g
  • Central idea is the reachable set of an HLA from each state
    • When extended to sequences of actions, allows proving that a plan can or cannot possibly reach the goal
  • May seem related to nondeterminism
    • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not an adversary or nature
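
A small explicit-state Python sketch of the angelic reading (the real machinery uses compact NCSTRIPS descriptions; here each HLA is just a function from a state to its exact reachable set, and all names are invented):

# Sketch: angelic plan evaluation with explicit reachable sets.
def reachable_set(plan, start_states):
    """States reachable by some primitive refinement of `plan` from `start_states`."""
    current = set(start_states)
    for hla in plan:
        current = set().union(*[hla(s) for s in current])
    return current

def plan_can_reach(plan, s0, goal_states):
    # Angelic nondeterminism: the agent resolves the choices, so the high-level plan
    # is a solution iff some reachable state satisfies the goal.
    return bool(reachable_set(plan, {s0}) & set(goal_states))

# Toy example with integer states.
h1 = lambda s: {s + 1, s + 2}        # reachable set of h1 from s
h2 = lambda s: {s * 2}               # reachable set of h2 from s
print(plan_can_reach([h1, h2], s0=0, goal_states={4}))   # True: 0 -> 2 -> 4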

Technical development
  • NCSTRIPS to describe reachable sets
    • STRIPS add/delete plus “possibly add”, “possibly delete”, etc
    • E.g., GoSFO adds AtSFO, possibly adds CarAtSFO
  • Reachable sets may be too hard to describe exactly
  • Upper and lower descriptions bound the exact reachable set (and its cost) above and below
    • Still support proofs of plan success/failure
    • Possibly-successful plans must be refined
  • Sound, complete, optimal offline planning algorithms
  • Complete, eventually-optimal online (real-time) search
Example – Warehouse World
  • Has similarities to blocks and taxi domains, but more choices and constraints
    • Gripper must stay in bounds
    • Can’t pass through blocks
    • Can only turn around at top row
  • Goal: have C on T4
    • Can’t just move directly
    • Final plan has 22 steps

Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown

Representing descriptions: NCSTRIPS
  • Assume S = set of truth assignments to propositions
  • Descriptions specify propositions (possibly) added/deleted by HLA
  • An efficient algorithm exists to progress state sets (represented as DNF formulae) through descriptions

Navigate(xt,yt)   (Pre: At(xs,ys))
  Upper: -At(xs,ys), +At(xt,yt), FacingRight
  Lower: IF (Free(xt,yt) ∧ Free(x,ymax)): -At(xs,ys), +At(xt,yt), FacingRight
         ELSE: nil
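
To make the semantics of such descriptions concrete, here is a crude explicit-state Python sketch (the actual algorithm progresses DNF formulae; the propositions below and the treatment of FacingRight as a "possibly add" are illustrative assumptions):

# Sketch: optimistic (upper-bound) progression of an explicit set of states through an
# NCSTRIPS-style description: definite adds/deletes always apply, and every combination
# of the "possibly add"/"possibly delete" effects is allowed.
from itertools import chain, combinations

def powerset(props):
    props = list(props)
    return chain.from_iterable(combinations(props, r) for r in range(len(props) + 1))

def progress_upper(states, add, delete, poss_add, poss_del):
    out = set()
    for s in states:                                  # each state: frozenset of true propositions
        base = (set(s) - set(delete)) | set(add)
        for pa in powerset(poss_add):
            for pd in powerset(poss_del):
                out.add(frozenset((base | set(pa)) - set(pd)))
    return out

# Toy use, loosely modeled on the Navigate description above.
s0 = frozenset({"At(xs,ys)", "Free(xt,yt)"})
reachable = progress_upper({s0},
                           add={"At(xt,yt)"}, delete={"At(xs,ys)"},
                           poss_add={"FacingRight"}, poss_del=set())
print(reachable)    # two states: with and without FacingRight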

Experiment
  • Instance 3
    • 5x8 world
    • 90-step plan
  • Flat/hierarchical without descriptions did not terminate within 10,000 seconds