Reinforcement learning 2: action selection - PowerPoint PPT Presentation

Presentation Transcript

  1. Reinforcement learning 2: action selection Peter Dayan (thanks to Nathaniel Daw)

  2. Global plan • Reinforcement learning I: (Wednesday) • prediction • classical conditioning • dopamine • Reinforcement learning II: • dynamic programming; action selection • sequential sensory decisions • vigor • Pavlovian misbehaviour • Chapter 9 of Theoretical Neuroscience

  3. Learning and Inference • Learning: predict; control • ∆weight = (learning rate) × (error) × (stimulus) • dopamine: phasic prediction error for future reward • serotonin: phasic prediction error for future punishment • acetylcholine: expected uncertainty boosts learning • norepinephrine: unexpected uncertainty boosts learning
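The learning rule on this slide can be sketched in code; the two-cue example below is illustrative (the function and variable names are assumptions, not from the slides):

```python
def delta_rule(weights, stimulus, reward, lr=0.1):
    """One step of the delta / Rescorla-Wagner rule:
    delta-weight = (learning rate) x (prediction error) x (stimulus)."""
    prediction = sum(w * x for w, x in zip(weights, stimulus))
    error = reward - prediction
    return [w + lr * error * x for w, x in zip(weights, stimulus)]

# two cues presented together with reward 1: they come to share the prediction
w = [0.0, 0.0]
for _ in range(100):
    w = delta_rule(w, [1.0, 1.0], 1.0)
```

Because both cues are always present, the prediction error is shared and each weight converges to 0.5, the same competitive allocation that underlies blocking.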

  4. Action Selection • Evolutionary specification • Immediate reinforcement: • leg flexion • Thorndike puzzle box • pigeon; rat; human matching • Delayed reinforcement: • these tasks • mazes • chess Bandler; Blanchard

  5. Immediate Reinforcement • stochastic policy: • based on action values:
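A minimal sketch of such a stochastic policy, using the common softmax (Gibbs) form; the inverse-temperature parameter `beta` is an illustrative assumption:

```python
import math

def softmax_policy(action_values, beta=1.0):
    """Map action values to choice probabilities; larger beta makes the
    choice of the higher-valued action more deterministic."""
    exps = [math.exp(beta * q) for q in action_values]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax_policy([1.0, 0.0], beta=2.0)
```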

  6. Indirect Actor • use the RW (delta) rule to learn action values • reward contingencies switch every 100 trials
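An illustrative sketch of an indirect actor on a two-armed bandit: action values are learned with the RW (delta) rule, and choices are drawn from a softmax on those values (all parameter values here are assumptions):

```python
import math
import random

def indirect_actor(reward_probs, trials=500, lr=0.1, beta=3.0, seed=1):
    """Learn Q-values with the RW delta rule; choose via softmax on them."""
    rng = random.Random(seed)
    q = [0.0] * len(reward_probs)
    for _ in range(trials):
        # sample an action from the softmax over current action values
        exps = [math.exp(beta * v) for v in q]
        u, a, acc = rng.random() * sum(exps), 0, exps[0]
        while u > acc:
            a += 1
            acc += exps[a]
        r = 1.0 if rng.random() < reward_probs[a] else 0.0
        q[a] += lr * (r - q[a])   # RW delta rule on the chosen action only
    return q

q = indirect_actor([0.8, 0.2])
```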

  7. Direct Actor

  8. Direct Actor
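By contrast, a direct actor adjusts action propensities themselves, using reward relative to a running average as the reinforcement signal. A hedged sketch (the baseline update and step sizes are illustrative choices):

```python
import math
import random

def direct_actor(reward_probs, trials=500, lr=0.1, seed=0):
    """Adjust propensities m directly; reinforce with (reward - baseline)."""
    rng = random.Random(seed)
    m = [0.0] * len(reward_probs)
    r_bar = 0.0
    for _ in range(trials):
        exps = [math.exp(v) for v in m]
        z = sum(exps)
        p = [e / z for e in exps]
        # sample an action from the softmax propensities
        u, a, acc = rng.random(), 0, p[0]
        while u > acc:
            a += 1
            acc += p[a]
        r = 1.0 if rng.random() < reward_probs[a] else 0.0
        for b in range(len(m)):
            grad = (1.0 if b == a else 0.0) - p[b]
            m[b] += lr * (r - r_bar) * grad
        r_bar += 0.05 * (r - r_bar)  # running average-reward baseline
    return m

m = direct_actor([0.8, 0.2])
```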

  9. Could we Tell? • correlate past rewards, actions with present choice • indirect actor (separate clocks): • direct actor (single clock):

  10. Matching: Concurrent VI-VI Lau, Glimcher, Corrado, Sugrue, Newsome

  11. Matching • income not return • approximately exponential in r • alternation choice kernel

  12. Action at a (Temporal) Distance • learning an appropriate action at u=1: • depends on the actions at u=2 and u=3 • gains no immediate feedback • idea: use prediction as surrogate feedback

  13. Action Selection • start with a policy • evaluate it • improve it [figure: advantage values 0.025, −0.175, −0.125, 0.125] • thus choose L more frequently than R
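The evaluate/improve loop is policy iteration. A toy sketch on a two-action choice state (the states, rewards, and numbers here are invented for illustration and are not the ones in the slide's figure):

```python
# state 0 is the choice state; states 1 and 2 are terminal with fixed rewards
TERMINAL_REWARD = {1: 1.0, 2: 0.0}
ACTIONS = {'L': 1, 'R': 2}

def evaluate(policy):
    """Value of the choice state under a stochastic policy {action: prob}."""
    return sum(p * TERMINAL_REWARD[ACTIONS[a]] for a, p in policy.items())

def improve(policy):
    """Shift all probability onto the action with the largest advantage."""
    v = evaluate(policy)
    advantages = {a: TERMINAL_REWARD[ACTIONS[a]] - v for a in policy}
    best = max(advantages, key=advantages.get)
    return {a: (1.0 if a == best else 0.0) for a in policy}

policy = {'L': 0.5, 'R': 0.5}
for _ in range(3):          # alternate evaluation and improvement
    policy = improve(policy)
value = evaluate(policy)
```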

  14. Policy • value is too pessimistic • action is better than average

  15. actor/critic [figure: m1, m2, m3, …, mn] • dopamine signals to both motivational & motor striatum appear, surprisingly, the same • suggestion: training both values & policies

  16. Variants: SARSA Morris et al, 2006

  17. Variants: Q learning Roesch et al, 2007
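The two variants differ only in the bootstrap target: SARSA backs up the value of the action actually taken next (on-policy), while Q-learning backs up the best available next action (off-policy). A minimal tabular sketch (state and value numbers are illustrative):

```python
def sarsa_update(q, s, a, r, s2, a2, lr=0.1, gamma=0.9):
    """SARSA: bootstrap from the action actually taken next (on-policy)."""
    target = r + gamma * q[s2][a2]
    q[s][a] += lr * (target - q[s][a])

def q_learning_update(q, s, a, r, s2, lr=0.1, gamma=0.9):
    """Q-learning: bootstrap from the best next action (off-policy)."""
    target = r + gamma * max(q[s2].values())
    q[s][a] += lr * (target - q[s][a])

q = {0: {'L': 0.0, 'R': 0.0}, 1: {'L': 0.5, 'R': 1.0}}
sarsa_update(q, 0, 'L', 0.0, 1, 'L')   # target uses q[1]['L'] = 0.5
q_learning_update(q, 0, 'R', 0.0, 1)   # target uses max over q[1] = 1.0
```

The distinction matters experimentally: Morris et al (2006) report dopamine signals consistent with SARSA, Roesch et al (2007) with Q-learning.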

  18. Summary • prediction learning • Bellman evaluation • actor-critic • asynchronous policy iteration • indirect method (Q learning) • asynchronous value iteration

  19. Sensory Decisions as Optimal Stopping • consider listening to [audio stimuli] • decision: choose now, or sample further

  20. Optimal Stopping • the equivalent of state u=1 is [equation], and of states u=2,3 is [equation]

  21. Transition Probabilities

  22. Evidence Accumulation Gold & Shadlen, 2007
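The choose-or-sample policy on the preceding slides is closely related to the sequential probability ratio test, in which evidence accumulates until a log-likelihood-ratio bound is crossed. A sketch with Bernoulli observations (all parameters are assumptions):

```python
import math
import random

def sprt(p1=0.6, p0=0.4, threshold=2.0, true_p=0.6, seed=0):
    """Sequential probability ratio test: keep sampling until the
    accumulated log-likelihood ratio crosses a decision bound."""
    rng = random.Random(seed)
    llr, n = 0.0, 0
    while abs(llr) < threshold:
        x = 1 if rng.random() < true_p else 0
        # log-likelihood ratio contributed by one observation
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        n += 1
    return ('H1' if llr >= threshold else 'H0'), n

choice, samples = sprt()
```

The stopping bound trades speed against accuracy, the same tradeoff the drift-diffusion account of Gold & Shadlen formalizes.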

  23. Current Topics • Vigour & tonic dopamine • Priors over decision problems (LH) • Pavlovian-instrumental interactions • impulsivity • behavioural inhibition • framing • Model-based, model-free and episodic control • Exploration vs exploitation • Game theoretic interactions (inequity aversion)

  24. Vigour • Two components to choice: • what: • lever pressing • direction to run • meal to choose • when/how fast/how vigorous • free operant tasks • real-valued DP

  25. The model • how fast to act? (the goal) • [figure: states S0, S1, S2; choose (action, latency) = (LP, τ1) or (LP, τ2); actions include LP (lever press), NP (nose poke), and Other; each choice incurs costs (unit cost, vigour cost, reward cost) and yields rewards (labels: PR, UR)]

  26. The model [figure: states S0, S1, S2; choose (action, latency) = (LP, τ1) or (LP, τ2), with costs and rewards at each transition] • Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs, per unit time) (ARL)

  27. Average Reward RL • compute differential values of actions • Q_{L,τ}(x) = differential value of taking action L with latency τ when in state x • Q_{L,τ}(x) = Rewards − Costs − ρτ + future returns • ρ = average rewards minus costs, per unit time • steady-state behavior (not learning dynamics) (extension of Schwartz 1993)

  28. Average Reward Cost/benefit Tradeoffs • Which action to take? • choose the action with the largest expected reward minus cost • How fast to perform it? • slower → less costly (vigour cost) • slower → delays (all) rewards • the net rate of rewards sets the cost of delay (opportunity cost of time) • choose the rate that balances vigour and opportunity costs • explains faster (even irrelevant) actions under hunger, etc. • masochism
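The balance between vigour and opportunity costs has a simple closed form under the assumption that the only latency-dependent terms are a vigour cost C_v/τ and an opportunity cost ρτ: minimizing their sum gives τ* = sqrt(C_v/ρ), so a higher average reward rate (e.g. under hunger) speeds all responding:

```python
import math

def optimal_latency(vigour_cost, avg_reward_rate):
    """Latency minimizing C_v/tau + rho*tau.
    Setting d/dtau [C_v/tau + rho*tau] = 0 gives tau* = sqrt(C_v / rho)."""
    return math.sqrt(vigour_cost / avg_reward_rate)

tau_hungry = optimal_latency(1.0, 4.0)   # high average reward rate
tau_sated = optimal_latency(1.0, 1.0)    # low average reward rate
```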

  29. Optimal response rates [figure: probability of 1st nose poke vs seconds since reinforcement; experimental data alongside model simulation] Niv, Dayan, Joel, unpublished

  30. Effects of motivation (in the model): energizing effect [figure: RR25; mean latency of LP and Other responses under low vs high utility]

  31. Effects of motivation (in the model): energizing and directing effects [figure: RR25, UR 50%; response rate per minute vs seconds from reinforcement, and mean latency of LP vs Other, under low vs high utility]

  32. Relation to Dopamine • phasic dopamine firing = reward prediction error • what about tonic dopamine?

  33. Tonic dopamine = average reward rate [figure: # LPs in 30 minutes, control vs DA-depleted, across ratio requirements 1, 4, 16, 64; data (Aberman and Salamone 1999) alongside model simulation] • explains pharmacological manipulations • dopamine control of vigour through BG pathways • NB. phasic signal: RPE for choice/value learning • eating time confound • context/state dependence (motivation & drugs?) • less switching = perseveration

  34. Tonic dopamine hypothesis • …also explains effects of phasic dopamine on response times [figure: firing rate and reaction time; Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992]

  35. Pavlovian & Instrumental Conditioning • Pavlovian • learning values and predictions • using TD error • Instrumental • learning actions: • by reinforcement (leg flexion) • by (TD) critic • (actually different forms: goal directed & habitual)

  36. Pavlovian-Instrumental Interactions • synergistic • conditioned reinforcement • Pavlovian-instrumental transfer • Pavlovian cue predicts the instrumental outcome • behavioural inhibition to avoid aversive outcomes • neutral • Pavlovian-instrumental transfer • Pavlovian cue predicts outcome with same motivational valence • opponent • Pavlovian-instrumental transfer • Pavlovian cue predicts opposite motivational valence • negative automaintenance

  37. -ve Automaintenance in Autoshaping • simple choice task • N: nogo gives reward r=1 • G: go gives reward r=0 • learn three quantities • average value • Q value for N • Q value for G • instrumental propensity is

  38. -ve Automaintenance in Autoshaping • Pavlovian action • assert: Pavlovian impetus towards G is v(t) • weight Pavlovian and instrumental advantages by ω – competitive reliability of Pavlov • new propensities • new action choice
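The ω-weighted competition between Pavlovian and instrumental tendencies can be sketched as below; the exact mixing scheme and numbers are illustrative assumptions, not the slide's equations:

```python
import math

def combined_choice_probs(inst_go, inst_nogo, pav_go, omega):
    """Mix instrumental propensities with a Pavlovian impetus toward 'go',
    weighted by omega, then pass the result through a softmax."""
    go = (1 - omega) * inst_go + omega * pav_go
    nogo = (1 - omega) * inst_nogo
    z = math.exp(go) + math.exp(nogo)
    return math.exp(go) / z, math.exp(nogo) / z

# instrumental learning favors nogo (it is rewarded), Pavlov pushes toward go
p_go_low, _ = combined_choice_probs(-1.0, 1.0, 2.0, omega=0.1)
p_go_high, _ = combined_choice_probs(-1.0, 1.0, 2.0, omega=0.9)
```

With large ω the Pavlovian impetus overrides the instrumental propensities, reproducing the persistent (unrewarded) pecking of negative automaintenance.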

  39. -ve Automaintenance in Autoshaping • basic –ve automaintenance effect (μ=5) • lines are theoretical asymptotes • equilibrium probabilities of action

  40. Impulsivity & Hyperbolic Discounting • humans (and animals) show impulsivity in: • diets • addiction • spending, … • intertemporal conflict between short and long term choices • often explained via hyperbolic discount functions • alternative is Pavlovian imperative to an immediate reinforcer • framing, trolley dilemmas, etc
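Hyperbolic discounting, V/(1 + kD), produces the preference reversals behind this intertemporal conflict: viewed from afar the larger-later reward dominates, but near the smaller-sooner reward the ranking flips. A worked sketch (reward sizes and delays are invented):

```python
def hyperbolic(value, delay, k=1.0):
    """Hyperbolically discounted value: V / (1 + k*D)."""
    return value / (1.0 + k * delay)

# smaller-sooner (5 units) vs larger-later (10 units, 2 time units after it)
far_ss = hyperbolic(5.0, 10.0)    # both rewards far away: larger-later wins
far_ll = hyperbolic(10.0, 12.0)
near_ss = hyperbolic(5.0, 0.0)    # sooner reward imminent: preference reverses
near_ll = hyperbolic(10.0, 2.0)
```

Under exponential discounting the ranking would be the same at both distances; the crossing is what hyperbolic curves (or a Pavlovian pull toward the immediate reinforcer) add.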

  41. Kalman Filter • Markov random walk (or OU process) • no punctate changes • additive model of combination • forward inference
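A one-cue sketch of the forward inference step: the weight is assumed to follow a random walk, so uncertainty grows by a drift term and then shrinks by a Kalman-gain update; the result looks like Rescorla-Wagner with an adaptive, uncertainty-dependent learning rate (the noise parameters are illustrative):

```python
def kalman_step(w_hat, sigma2, x, r, obs_noise=1.0, drift=0.01):
    """One Kalman-filter update for a single cue weight w, under the model
    w(t+1) = w(t) + noise (random walk) and r = w*x + observation noise."""
    sigma2 = sigma2 + drift                       # diffusion: uncertainty grows
    if x == 0:
        return w_hat, sigma2                      # cue absent: no information
    gain = sigma2 * x / (sigma2 * x * x + obs_noise)
    error = r - w_hat * x                         # prediction error
    w_hat = w_hat + gain * error                  # RW-like, but adaptive rate
    sigma2 = (1 - gain * x) * sigma2              # uncertainty shrinks
    return w_hat, sigma2

w, s = 0.0, 1.0
for _ in range(20):
    w, s = kalman_step(w, s, x=1.0, r=1.0)
```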

  42. Kalman Posterior • posterior mean ŵ updated by the Kalman gain × the prediction error ε • posterior covariance Σ shrinks with each informative observation

  43. Assumed Density KF • Rescorla-Wagner error correction • competitive allocation of learning • Pearce-Hall, Mackintosh

  44. Blocking • forward blocking: error correction • backward blocking: negative off-diagonal covariance terms

  45. Mackintosh vs P&H • under the diagonal approximation: • for slow learning, the effect is like Mackintosh's

  46. Summary • Kalman filter models many standard conditioning paradigms • elements of RW, Mackintosh, P&H • but: • downwards unblocking • negative patterning L→r; T→r; L+T→· • recency vs primacy (Kruschke) • predictor competition • stimulus/correlation re-representation (Daw)

  47. Uncertainty (Yu) • expected uncertainty: ignorance • amygdala, cholinergic basal forebrain for conditioning • ?basal forebrain for top-down attentional allocation • unexpected uncertainty: 'set' change • noradrenergic locus coeruleus • part opponent, part synergistic interaction

  48. Experimental Data • ACh & NE have similar physiological effects: • suppress recurrent & feedback processing (e.g. Kimura et al, 1995; Kobayashi et al, 2000) • enhance thalamocortical transmission (e.g. Gil et al, 1997) • boost experience-dependent plasticity (e.g. Bear & Singer, 1986; Kilgard & Merzenich, 1998) • ACh & NE have distinct behavioral effects: • ACh boosts learning to stimuli with uncertain consequences (e.g. Bucci, Holland, & Gallagher, 1998) • NE boosts learning upon encountering global changes in the environment (e.g. Devauges & Sara, 1990)

  49. Model Schematics [diagram: sensory inputs feed bottom-up processing into cortical processing (prediction, learning, ...), with top-down processing providing context; ACh carries expected uncertainty, NE carries unexpected uncertainty]

  50. Attention Example 1: Posner's Task [figure: trial sequence of cue (high or low validity), delay, target at a stimulus location, then response; timings 0.1s, 0.1s, 0.2-0.5s, 0.15s] (Phillips, McAlonan, Robb, & Brown, 2000) • attentional selection for (statistically) optimal processing, above and beyond the traditional view of resource constraint • generalize to the case that cue identity changes with no notice