
Reinforcement Learning in the Control of Attention



  1. Reinforcement Learning in the Control of Attention Roderic A Grupen Luiz M G Gonçalves Laboratory for Analysis and Architecture of Systems (State University of Campinas in the near future) www.laas.fr/~lmgarcia Laboratory for Perceptual Robotics State University of Massachusetts (USA) www-robotics.cs.umass.edu

  2. Objective • To develop a robotic system to perform tasks involving attention and pattern categorization, integrating multi-modal (haptic and visual) information in a behaviorally cooperative active system.

  3. Motivation • Towards a useful robotic system able to: • foveate (verge) the eyes onto a region of interest (ROI); • keep attention on the ROI as long as needed; • choose another ROI (shift the focus of attention). • The result is a behaviorally cooperative active system that provides on-line feedback to environmental stimuli in the form of actions

  4. Method • Use of (real-time) visual information from a stereo head and a simulator • Selective attention (bottom-up salience maps) • Multi-feature extraction (perceptual state) • Associative memory (pattern address identification) • Efficient topological mapping • Learn policies to program the system

  5. Task Specification (Objectives) • Visual monitoring or environment inspection • Construction of an attentional map • Keep this map consistent with current perception (update it) • Categorize all patterns

  6. Markov Process • A stochastic process whose past does not influence the future once its present is completely specified • Examples: checkers, chess
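
A minimal sketch of the Markov property in code: the next state is sampled from the current state alone, never from the history. The three-state transition matrix is illustrative, not from the presentation.

    import random

    # Hypothetical 3-state Markov chain: P[s][t] is the probability of
    # moving from state s to state t.
    P = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]

    def step(s):
        # The next state depends only on s; the path that led to s is
        # irrelevant (the Markov property).
        return random.choices(range(3), weights=P[s])[0]

    s = 0
    for _ in range(10):
        s = step(s)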

  7. Dynamic Programming • Naive approach: traverse all possible states, testing every possibility (executing all actions without bound) • A better solution (DP): • Reduce the complexity of a problem that would be solved in dimension D to two or more problems in smaller dimensions • Ex: stereo disparity: • one 3D problem (x, y, d) is reduced to two 2D problems, (x, d) and (y, d), as sketched below
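
A sketch of this decomposition under a common per-scanline formulation: rather than searching the full 3D (x, y, d) volume, each scanline y becomes an independent 2D (x, d) dynamic program with a smoothness penalty along x. The function name, cost measure, and penalty are illustrative assumptions, not the presentation's implementation.

    import numpy as np

    def scanline_dp_disparity(left, right, max_d, smooth=1.0):
        # left, right: rectified grayscale images of equal shape (H, W).
        H, W = left.shape
        D = max_d + 1
        disp = np.zeros((H, W), dtype=int)
        # penalty[d, d_prev] discourages abrupt disparity jumps along x
        penalty = smooth * np.abs(np.arange(D)[:, None] - np.arange(D)[None, :])
        for y in range(H):                           # one 2D (x, d) problem per row
            cost = np.full((W, D), 1e9)              # matching cost for each (x, d)
            for d in range(D):
                cost[d:, d] = np.abs(left[y, d:].astype(float)
                                     - right[y, :W - d].astype(float))
            acc = cost.copy()                        # DP accumulation along x
            back = np.zeros((W, D), dtype=int)       # back-pointers for the path
            for x in range(1, W):
                prev = acc[x - 1][None, :] + penalty  # candidates for each (d, d_prev)
                back[x] = np.argmin(prev, axis=1)
                acc[x] = cost[x] + prev[np.arange(D), back[x]]
            d = int(np.argmin(acc[W - 1]))           # backtrack the optimal path
            for x in range(W - 1, -1, -1):
                disp[y, x] = d
                d = back[x, d]
        return disp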

  8. Pavlov • If the animal does the right thing, it gets food • If the animal does the wrong thing, it gets punished • In theory, it has been shown that only one of the two (reward or punishment) is needed: do the wrong thing, get no food • Thus: • robot does the right thing => reward

  9. Reinforcement Learning (Related Work) • Watkins: Learning from Delayed Rewards (1989). • Sutton/Barto: Reinforcement Learning: An Introduction (1998). • Araujo: Learning a Control Composition in a Complex Environment (1996). • Huber: A Feedback Control Structure for On-line Learning Tasks (1997). • Coelho: A Control Basis for Learning Multifingered Grasps (1997).

  10. Modelling a problem with delayed reinforcement as an MDP: • a set of states S, • a set of actions A, • a reward function R: S×A → ℝ, and • a state transition function T: S×A → Π(S), which maps state-action pairs to probability distributions over successor states. • Q-learning equation: Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') - Q(s,a)]

  11. Q-learning equation • a = the action executed • r = the reward received • s' = the state that results from applying a • A = the set of all actions a' that can be executed in s' • α = learning rate (typically 0.1) • γ = discount factor (typically 0.5)
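
The update rule as a small function, assuming a tabular Q stored as a 2D numpy array indexed by (state, action); the names and sizes are illustrative.

    import numpy as np

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.5):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

    Q = np.zeros((8, 3))            # illustrative: 8 states, 3 actions
    q_update(Q, s=0, a=1, r=1.0, s_next=2)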

  12. Observations • A transition in the state space is completely characterized by the tuple (s, a, r, s') • Provided that Q(s,a) is updated infinitely often for every pair (s, a), Q(s,a) converges with probability 1 to the best possible value for that pair.

  13. Exploration and exploitation • Exploration: choose an action at random • Exploitation: once the system begins to converge, choose the actions known to be contributing to convergence • The two must be balanced • A temperature parameter (reminiscent of simulated annealing) controls the balance • Actions are chosen randomly as a function of the temperature (high at first, lowered over time) • In practice, even at the end, about 10% of the choices remain random (see the sketch below)
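
A sketch of temperature-controlled (Boltzmann) action selection consistent with the slide; the function name and the exact schedule are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng()

    def select_action(Q, s, temperature):
        # Boltzmann exploration: a high temperature gives a nearly uniform
        # choice (exploration); a low one is nearly greedy (exploitation).
        prefs = Q[s] / temperature
        prefs = prefs - prefs.max()          # for numerical stability
        probs = np.exp(prefs)
        probs /= probs.sum()
        a = rng.choice(len(probs), p=probs)
        # Keep ~10% fully random choices even late in learning, as the
        # slide notes is done in practice.
        if rng.random() < 0.10:
            a = rng.integers(len(probs))
        return int(a)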

  14. Q-learning algorithm • 1) Determine the current state s by decoding the available sensory information; • 2) Use a stochastic action selector to determine action a; • 3) Perform action a, generating a new state s' and a reinforcement r; • 4) Compute the temporal-difference error: δ = r + γ max_{a'} Q(s',a') - Q(s,a); • 5) Update the Q-value of the state/action pair (s,a): Q(s,a) ← Q(s,a) + α δ; • 6) Go to 1;
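
The six steps as a loop, reusing q_update and select_action from the sketches above. The env object is a hypothetical interface (observe decodes sensors into a discrete state, step executes an action and returns the new state and reward); it is not part of the original system.

    def q_learning(env, Q, steps=10000, temperature=1.0):
        for _ in range(steps):
            s = env.observe()                      # step 1: decode current state
            a = select_action(Q, s, temperature)   # step 2: stochastic selector
            s_next, r = env.step(a)                # step 3: act, observe s' and r
            q_update(Q, s, a, r, s_next)           # steps 4-5: TD error and update
            temperature = max(0.1, temperature * 0.999)  # cool down gradually
        return Q                                   # step 6 is the loop itself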

  15. Eligibility traces • Update not just one state-action pair at a time, but a whole sequence of pairs (after executing a series of actions) • Faster convergence
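
A sketch of the idea: keep a decaying trace e(s,a) over recently visited pairs and apply each temporal-difference error to all of them at once. The trace array e has the same shape as Q; lam is the usual λ parameter, and the subtleties of resetting traces after exploratory actions are omitted.

    def q_lambda_step(Q, e, s, a, r, s_next, alpha=0.1, gamma=0.5, lam=0.9):
        # TD error for the current transition
        delta = r + gamma * Q[s_next].max() - Q[s, a]
        e[s, a] += 1.0              # mark (s, a) as recently visited
        Q += alpha * delta * e      # credit the whole recent sequence at once
        e *= gamma * lam            # traces decay as time passes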

  16. In practice • A table (the Q-table) • Rows are the states s • Columns are the actions a • The elements Q(s,a) are the Q-values, given by the function that evaluates the utility of taking action a when the state is s
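
In this tabular form the learner's entire state is a single S×A array; a minimal sketch with illustrative sizes:

    import numpy as np

    n_states, n_actions = 16, 4              # illustrative sizes
    Q = np.zeros((n_states, n_actions))      # rows: states, columns: actions

    def greedy_action(s):
        # The greedy policy simply reads the row for state s.
        return int(np.argmax(Q[s]))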

  17. Roger-the-Crab

  18. Stereo Head Environment

  19. Degrees of Freedom (Controllers)

  20. System Control Architecture

  21. Low-level Control • Defining a target • Pre-attentional phase (stimuli + internal bias) • Shifting attention (saccade generation) • Fine saccade (using a target model) • Verging the eyes onto a target (correlation) • Movements are computed from the errors between the target and the image centers (see the sketch below)
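
A sketch of the last bullet under a simple pinhole-camera assumption: the pan/tilt correction that centers a target follows from its pixel offset from the image center and the focal length. The names, the gain, and the model itself are illustrative, not the presentation's controllers.

    import math

    def saccade_command(u, v, cx, cy, focal_px, gain=1.0):
        # (u, v): target pixel; (cx, cy): image center; focal_px: focal
        # length in pixels. Returns (pan, tilt) corrections in radians
        # that move the target toward the image center.
        pan = gain * math.atan2(u - cx, focal_px)
        tilt = gain * math.atan2(v - cy, focal_px)
        return pan, tilt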

  22. Low-level Control • Identifying Objects • Selecting a region of interest • Extracting features • Associative memory match • Mapping objects and/or updating memory • Pre-attentional maps • Automatic supervised learning

  23. Behavioral Program

  24. A straightforward control algorithm • Step 0: Initialize the associative memory and start the concurrent controllers of arms, neck, and eyes. • Step 1: Re-direct attention; if a representation is activated, update the attentional maps and re-do this step (1). • Step 2: Try a visual improvement; if a representation is activated, update the attentional maps and return to step 1. • Step 3: Try an arm improvement; if a representation is activated, update the attentional maps and return to step 1. • Step 4: Activate the “supervised learning” module, update the attentional maps and return to step 1.
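
The steps above as a sketch of the control loop; the system object and its method names (redirect_attention, visual_improvement, and so on) are hypothetical stand-ins for the concurrent controllers.

    def control_loop(system):
        system.init_memory_and_controllers()       # step 0
        while True:
            while system.redirect_attention():     # step 1: repeat while a
                system.update_attentional_maps()   # representation activates
            if system.visual_improvement():        # step 2
                system.update_attentional_maps()
                continue
            if system.arm_improvement():           # step 3
                system.update_attentional_maps()
                continue
            system.supervised_learning()           # step 4
            system.update_attentional_maps()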

  25. Finite state machine

  26. Results • Q-learning convergence

  27. Partial Evaluation of Strategies: Attentional Shifts

  28. Partial Evaluation of Strategies: Visual/Arm Improvements

  29. Partial Evaluation of Strategies: Objects Identified

  30. Partial Evaluation of Strategies: New Objects

  31. Global Evaluation: Mapped Objects

  32. Task Accomplishment: Mapped Objects

  33. Times for each phase or process (in seconds)

    Phase               Min     Max     Mean
    Computing retina    0.145   0.189   0.166
    Transfer to host    0.017   0.059   0.020
    Total acquiring     0.162   0.255   0.186
    Pre-attention       0.139   0.205   0.149
    Salience map        0.067   0.134   0.075
    Total attention     0.324   0.395   0.334
    Total saccade       0.466   0.903   0.485
    Features for match  0.135   0.158   0.150
    Memory match        0.012   0.028   0.019
    Total matching      0.323   0.353   0.333

  34. Conclusions • The system can support other sensors. • Attention and categorization act together: tasks must be formulated accordingly. • The inspection task was successfully accomplished. • The system currently supports a frame rate of 10-15 frames per second. • The reinforcement learning approach worked well in simulation.

  35. Future work • Consider focus for saccade generation and accommodation (vergence) • Test with partially occluded objects • Derive policies (with Q-learning) for the control of top-down attention • Increase the state space and/or the set of actions • Define other hierarchical tasks (several policies, each appropriate for a given task) • Test the learning architecture in a real environment

  36. Thanks • Thanks to CNPq, CAPES, FAPERJ, NSF and UMass (USA) • To all of you for your patience • To Mimmo and Dr. Arcangelo Distante for hosting me :-)
