1 / 42

Linking Vision, Brain and Language

Linking Vision, Brain and Language. Part I: Hearing Gazes – eye movements and visual scene recognition. An unexpected visitor (I.E. Repin, 1884-1888).

nate
Download Presentation

Linking Vision, Brain and Language

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Linking Vision, Brain and Language

  2. Part I: Hearing Gazes – eye movements and visual scene recognition An unexpected visitor (I.E. Repin, 1884-1888)

  3. SemRep (Semantic Representation) is a hierarchically organized graph-like representation for encoding semantics and concepts that are possibly anchored on a visual scene. edges to specify relations between nodes WOMAN HITTING ACTION objects or actions as nodes MAN BLUE CLOTHE Original image from: “Invisible Man Jangsu Choi”, Korean Broadcasting System (KBS) Another whole SemRep structure embedded in WOMAN node

  4. SemRep is proposed as a ‘bridge’ between the vision and the language system. “A woman in blue hits a man” “A woman hits a man and she wears a blue dress” “A man is hit by a woman who wears a blue dress” “A pretty woman is hitting a man” Language System Vision System • Dynamically changing (even for static scenes) • Cognitively important entities encoded only (different case by case) • Temporally stored in WM SemRep

  5. What’s happening? http://icanhascheezburger.files.wordpress.com/2008/10/funny-pictures-fashion-sense-cat-springs-into-action.jpg

  6. A single structure of SemRep usually encodes a cognitive event that corresponds to a ‘minimal subscene (Itti & Arbib, 2006)’, which can later be extended into an ‘anchored subscene’. 1. Firstly the man’s face captures attention (human faces are cognitively salient); minimal subscene 2. Then the high contrast of the cat’s paw captures attention  action recognition is performed and it biases the attention system onto the agent (cat) 3. The cat on the roof found as the agent of the paw action; incremental building 4. It completes a bigger subscene  building the Agent-Action-Patient propositional structure; anchored subscene 3 4 2 1

  7. The entities encoded in SemRep are NOT logical but perceptually grounded symbolic representations. Theories and findings on conceptual deficits (Damasio, 1989; Warrington & Shallice 1984; Caramazza & Shelton, 1998; Tyler & Moss, 2001) suggest modality-specific representations distributed categorically over the brain Perceptual Symbol System (PSS) (Barsalou, 1999): stimulus inputs are captured in the modality-specific feature maps, then integrated in an association area re-enactment (simulation) of the learned concept CAR Figure 2. Barsalou et al., 2003 learning (encoding) the concept CAR representing the encoded concepts as perceptual symbols (SemRep)

  8. The schematic structure (relations, hierarchies, etc.) of SemRep can be viewed as a cross-modal association between very high level sensori-motor integration areas in the brain. Conceptual Topography Theory (CTT) (Simmons & Barsalou, 2003): hierarchically organized convergence zones  the higher, the more symbolic and amodal X Figure 3. Simmons & Barsalou, 2003

  9. Since SemRep is still in its hypothetical stage, an eye-tracking study (but it only captures ‘overt’ attention) has been designed and conducted to evaluate its validity. Microphone • Capturing human eye movements and verbal description simultaneously while showing images/videos •  To investigate the temporal relationship between eye-movements and the verbal description

  10. Mainly four hypotheses are proposed and two types of tasks are suggested in order to verify each hypothesis. • Hypothesis I – Mental Preparation • Hypothesis II – Structural Message • Hypothesis III – Dynamic Interaction • Hypothesis IV – Spatial Anchor Task I – “describe what you are seeing as quickly as possible” Task II – “describe what you have seen during the scene display” Table 1. Possible experiment configurations between hypotheses and tasks/visual scenes

  11. ‘Task I’ is to investigate the eye movements for estimating possible SemRep structures built and their connection to the spoken description. • Instruction • “Describe what you are seeing as quickly as possible” • Real-time description (with speech recording) during the visual scene display • Expectations • 1) There would be rich dynamic interaction with the eye movements and the utterance • 2) Having the subjects speak as quickly as possible would reveal the structure of the internal structure that they are using for speaking NOTE: Experiments for only this with static images were conducted

  12. ‘Task II’ is to investigate the eye movements after the scene display for evaluating the validity of the spatial component of SemRep. • Instruction • “Describe what you have seen during the previous scene display” • Post-event description (with speech recording) after the visual scene display • Expectations • 1) The eye movements will sincerely reflect the remembered object locations being described • 2) The sentence structure of the description would be more well-formed, and the order of description and the order of fixations would be less congruent

  13. The Mental Preparation Hypothesis (Hypothesis I) asserts that gazes are not merely meaningless gestures but they actually reflect attention and mental effort in verbal description production. • Speakers plan (building an internal representation) what to say while gazing, and only after it is relatively complete they begin speaking (Griffin, 2005) • There is a tight correlation between gazing and description  usually gazing comes 100ms ~ 300ms earlier than the actual description (Griffin, 2001) An overlapping utterance pattern for two consecutive nouns A and B Figure 2. Griffin, 2001

  14. The experiment result suggests that there is a very strong correlation between the eye movements and the verbal description such that gazes come before the utterances, indicating what to be spoken next. • Eye gazes ‘right before’ verbal descriptions • 01_soldiers [fairman] • 4.08~5.2::“one soldier smiles at him”  gaze on a soldier’s face • 7.33~8.79::“he’s got sandals and shoes on”  gaze on person’s shoes • 05_basketball [fairman] • 2.82 ~ 3.52:: “he has the ball”  gaze on ball • 06_conversation [fairman] • 4.08~5.01:: “reading books”  gazes on books • 14.86~16.62:: “the woman on the right wears boots”  gazes on boots and a woman • …and a lot more!!! • Others • The delay between gazes and descriptions seem to be a bit longer than suggested by Griffin (2005; 2001) – about 500ms ~ 1,500ms  this might be due to the fact that Griffin’s case is in the word level but ours is in the sentential/clausal level • There seems to be a parallelism between the vision system (gaze) and the language system (description) since subjects already moved their gazes to some other locations during speaking about an object/event

  15. The Structural Message Hypothesis (Hypothesis II) asserts that the internal representation for speakers is an organized structure that can be directly translated into a proposition or clause. • van der Meulen (2003): speakers outline messages (some semantic representation similar to SemRep, used for producing utterances) approximately one proposition or clause at a time • Itti & Arbib (2006): for whatever reason, an agent, action or object may attract the viewer’s attention then attention would be directed to seek to place this ‘anchor’ in context within the scene, completing the ‘minimal subscene’ in an incremental manner; a minimal subscene can be extended by adding details to elements, resulting in ‘anchored subscene’  each subscene corresponds to a propositional structure that in turn corresponds to a clause (or sentence) • There must be a difference between the eye movement patterns for “a woman hits a man and she wears a hat” and “a woman wearing a hat hits a man” although the used lexicons are about the same

  16. The experiment result indicates that a relatively complex eye movements are followed by utterances either with more complex sentence structures or with more thematic information involved. • Eye gazes of a complex movement pattern • 03_athletic [fairman] • 2.81~3.93::“one team just lost”  after inspecting at the both groups • 3.93~5.52::“they are looking at the team that won”  this involves inspecting the both of the groups (whereas [karl] “they’re really skinny… got a baton”  only focuses on the left group) • 11_pool [fairman] • 6.26~7.57:: “sort of talking to some girls…”  after inspecting the guy and the girls at the bar • 7.57~9.62:: “but he's actually watching these two guys play pool..”  but right after switched the gaze between the guy and the pool table (hole) • 06_converstaion [fairman] • 9.26~11.07:: “...looks like he's wondering what they shared”  looking at both sides of the groups and the guy’s face (probably for figuring out the eye-gaze direction) and body • Eye gazes of a simple pattern • Conversely, eye movements involving with simple sentence structures (e.g. “there is a man...”, “He’s wearing sandals…”, “…got something”, etc.) are quite simple and they are more like a serial movement from one item to another whereas complex ones involve visiting several locations

  17. The Dynamic Interaction Hypothesis (Hypothesis III) asserts that the vision and language system are constantly interacting so that not only the visual attention guides the verbal expression but also does the other way. • Fixations followed by producing complex nouns are longer than those followed by simple nouns (Myer, 2005) Both the elemental (low-level elemental attention attraction) and structural (more high-level/top-down structure) considerations are done for producing sentences on the given scene (Bock et al., 2005) Passive sentence  simple causal schemas might have been involved a relatively complex structure lengthens the fixation duration Active sentence  occur more frequently and easier to produce Figure 1. Bock et al., 2005 Figure 1. Myer, 2005

  18. The tendency of longer fixation for complex expressions was found from the result; also, there were several cases where the description being produced seems to bias the gaze. • Construction requiring more information • 01_soldiers [fairman] • 5.21~5.72  5.72~7.33:: “His hat…”  “he’s wearing a military hat…” • 10.18~11.45  11.45~12.70:: “…and he’s got a bag”  “it’s like a crumpled up bag”  longer gazes that follow right after short gazes; probably the subjects thought the construction is too simple or missing crucial information to be produced as a sentence • More dynamic interaction between gaze and utterance • 08_disabled [fairman] • 4.5~5.68  5.68~7.42:: “...someone who just fell...”  “..black woman just fell...number seven”  probably the construction required more detailed information to fill in instead of just ‘someone’ • 05_basketball [fairman] • 10.35~12.28  13.41~14.39:: “...being guarded by another guy in the black who...”  firstly produced sentence seemed to require more information (more than the gender and the color of uniform) on the guy who blocks the player  “...looks kinda out of it...”  later it turned out to be irrelevant

  19. The Spatial Anchor Hypothesis (Hypothesis IV) asserts that the attentional focus serves the function of associating spatial pointers with elements perceived; accessing them in working memory (WM) later requires covert/overt attention shifts. • SemRep (Arbib & Lee, 2008): an internal structure that encodes semantic information associated with the elements of the scene by coarsely coded spatial coordinates • Hollywood Squares experiment (Richardson & Spivey, 2000) • The tropical fish experiment (Laeng & Teodorescu, 2002) NOTE: This experiment is not conducted yet More likely to see here! Figure 2. Richardson & Spivey, 2000

  20. The current study can be extended by including: (1) account for gist (covert attention), and (2) more thorough account for the language bias on the attentional shifts. • For almost all experiment results, ‘gist’ (without particular gazes involves) plays a role at least at the initial phase of the scene recognition  the current study focuses only on the overt-type attention, but the covert-type would be important as well • 05_basketball [fairman] • 1.04~1.79:: “basket ball”  after a short gaze only on a player • 06_converstation [fairman] • 1.48~2.53:: “Home situation”  there is no specific gaze found • This is usually at the beginning of the scene display  later description goes on to more specific objects/events in the scene • van der Meulen et al.(2001): even if an object is repeatedly mentioned, the subjects still looks back at the object although less frequently  a good example of the influence of the language system on the vision system (the Dynamic Interaction Hypothesis) • Firstly mentioned: 82%, Secondly: 67% (with noun) / 50% (with pronoun) •  It needs to devise a more thorough way on accounting for the hypothesis

  21. Part II: A New Approach – Construction Grammar and Template Construction Grammar (TCG) Sidney Harris’s Science Cartoons (S. Harris, 2003)

  22. Constructions are basically ‘form-meaning pairings’ that serve as basic building blocks for grammatical structure, thus blurring the distinction between semantic and syntax. • Goldberg (2003): Constructions are stored pairings of form and function, including morphemes, words, idioms, partially lexically filled and fully general linguistic patterns Table 1. Examples of constructions, varying in size and complexity; form and function are specified if not readily transparent (Goldberg, 2003) They are not grammatical components and might be different among individuals

  23. Seriously, there is virtually NO difference between semantic and grammatical components – semantic constraints and grammatical constraints are interchangable. • Examples where the distinction between semantic and syntactic constraints blurs out (a) Bill hit Bob’s arm. (b) Bill broke Bob’s arm. (a) Bill hit Bob on the arm. (b) *Bill broke Bob on the arm. [X V Y on Z]  V has to be a ‘contact verb (e.g. caress, kiss, lick, pat, stroke)’ not ‘break verb (e.g. break, rip, smash, spliter, tear)’ (Kemmerer, 2005) (a) John sent the package to Jane. (b) John sent Jane the package. (a) John sent the package to the dormitory. (b) *John sent the dormitory the package. [X transfer-verb Y Z] X and (especially) Y have to be ‘animate’ objects The semantic of the verb, not the syntactic category, is what decides the syntactic category of the sentence

  24. Construction grammar paradigm is also supported (1) from the neurophysiological perspective and (2) by the ontogenetic account. • Bergen & Wheeler (2005): not only the content words but also (or even more) the function words contribute to the context of a sentence • Vocabulary among each individual expands/shrinks through time  grammatical categories are at theoretical, not necessarily cognitive, convenience • (a) John is closing the drawer.  motor activation (strong ACE found) • (b) John has closed the drawer.  visual activation (less or no ACE found) • ACE: Action-sentence Compatibility Effect •  Although the same content words are used, the mental interpretations vary “want milk” “want X” (holophrase) (general construction) (Hill, 1983) “I’m-sorry” “I[NP] am[VP] sorry[ADJ]” (holophrase) (general construction)

  25. Constructions defined in Template Construction Grammar (TCG) represent the ‘form-meaning’ pairings as ‘templates’. class defines syntactic hierarchy level as well as the ‘competition and cooperation’ style construction name Template  SemRep as the semantic meaning and a lexical sequence (with slots) as the syntactic form

  26. There are broadly two types of constructions; one encodes rather simple lexical information while the other encodes a more abstract syntactic structure. Complex one encodes a higher level abstract structure  slots define syntactic connections while red lines indicate the expression order Simple one only encodes lexical information  it corresponds to a word

  27. In the production mode of TCG, constructions are trying to ‘cover’ the SemRep generated by the vision system, and the matching is done based on similarity. more than one construction is attached at this moment  it will be sorted out later

  28. Constructions interact each other under the ‘competition and cooperation’ paradigm; they compete when there are spatial conflicts (among the same classes). COMPETITION COMPETITION

  29. Constructions interact each other under the ‘competition and cooperation’ paradigm; they cooperate (conjoin) when a grammatical connection is possible. Conjoined constructions form a hierarchical structure that resembles a parse tree COOPERATION (all over the place)

  30. The proposed system basically consists of three main parts, the vision system, language system and visuo-linguistic working memory, with attention on top of them. building SemRep from visual scenes Three parts communicate via attention TCG works here SemRep

  31. Concurrency is the virtue of the system; the system is designed as a real-time multi-schema network, and it allows the system a great deal of flexibility. • The result of eye experiment strongly implicate parallelism • Constructions are basically schemas that run parallel (the C&C paradigm) The vision system provides parts of SemRep produced as the interpretation of the scene The language system attaches constructions and reads off the (partially) complete sentence at the moment VLWM holds SemRep (with constructions attached) changing dynamically X

  32. Attention is proposed as a computational resource, rather than a procedural unit, which affects all over the system – thus it can act as a communication medium. • Zoom Lens model of attention (Eriksen & Yeh, 1985): attention is a kind of multi-resolution focus whose magnification is inversely proportional to the level of detail • Low resolution  used for large region, encompassing more objects, fewer details; perceiving groups of entities as a coherent whole • High resolution  used for small region, fewer objects, more details; perceiving individual entities There is an attentional map that covers over the regions of the visual input via SemRep  this affects the operation of the vision system and VLWM Attention requesting more details on a certain part of SemRep by biasing attention distribution Vision System VLWM Language System More attention is required to retrieve more concrete details  it corresponds to the PSS (Barsalou, 1999) account There is a forgetting effect which discards SemRep components or constructions  a certain amount of computational resource for a certain time duration required total computational resource required = attention at each time unit X time duration

  33. Constructions are distributed, being centered around the perisylvian regions which include classical Broca’s (BA 45/44), Wernicke’s (BA 22) and the left superior temporal lobe areas. a ‘functional web’ for linking phonological information related to the articulatory and acoustic pattern of a word form is developed (Pullvermuler, 2001) a phonological rehearsal device as well as a working memory circuit for complex syntactic verbal processes (Aboitiz & Garcia, 1997) Kaan & Swaab 2002; Just et al. 1996; Keller et al. 2001; Stowe et al. 2002

  34. Concrete words, or the lexicon constructions are topographically distributed across brain areas associated with the process of corresponding categorical properties. concepts of tools or actions correlated to the motor and parietal areas  action and tool use concepts of animals mostly associated with the temporal  visual properties It is also supported by the category-specific deficit accounts (Warrington & McCarthy 1987; Warrington & Shallice 1984; Caramazza & Shelton 1998; Tyler & Moss 2001).

  35. The higher-level constructions that encode grammatical information might be distributed across the areas stretched from the inferior to medial part of the left temporal cortex. the left-inferior-temporal and fusiform gyri are activated during processing of pragmatic, semantic and syntactic linguistic information (Kuperberg et al., 2000) The perirhinal cortex is involved in object identification and its representation formation by integrating multi-modal attributes (Murray & Richmond 2001)

  36. The syntacti manipulation of constructions (competition and cooperation) mainly happens in Broca’s area, which is believed to be activated when handling complex symbolic structure. Broca’s area is activated more when handling sentences of complex syntactic structure than of simple structure (Stromwold et al., 1996) for handling the arrangement of lexical items for retrieving semantic or linguistic components during the matching and ordering process BA 45 is activated by both speech and signing during the production of language narratives done by bilingual subjects whereas BA 44 is activated by the generation of complex articulatory movements of oral/laryngeal or limb musculature (Horwitz et al., 2003)

  37. VLWM is built around the left dorsolateral prefrontal cortex (DLPFC; BA 9/46) stretched to the left posterior parietal cortex (BA40). the executive component for processing of the contents of working memory (Smith et al., 1998) the storage buffer for verbal working memory circuit (Smith et al., 1998) The monkey prefrontal cortex is found to involve in sustaining memory for object identity and location (Rainer et al., 1998)

  38. References • Aboitiz, F; Garcia, R (1997) The evolutionary origin of the language areas in the human brain. A neuroanatomical perspective, BRAIN RESEARCH REVIEWS 25 (3):381-396 • Arbib, MA; Lee, J (2008) Describing visual scenes: towards a neurolinguistics based on Construction Grammar, BRAIN RESEARCH 1225:146-162 • Barsalou, LW (1999) Perceptual symbol systems, BEHAVIORAL AND BRAIN SCIENCES 22 (4):577-+ • Price, C; Thierry, G; Griffiths, T (2005) Speech-specific auditory processing: where is it?, TRENDS IN COGNITIVE SCIENCES 9 (6):271-276 • Barsalou, LW; Simmons, WK; Barbey, AK; Wilson, CD (2003) Grounding conceptual knowledge in modality-specific systems, TRENDS IN COGNITIVE SCIENCES 7 (2):84-91 • Bergen, BK; Wheeler, KB (2005) Sentence understanding engages motor processes, PROCEEDINGS OF THE TWENTY-SEVENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY • Bock, JK; Irwin, DE; Davidson, DJ (2005) Putting first things first, The Interface of Language, Vision, and Action (Henderson, JM; Ferreira F ed), Psychology Press:249-278 (Chapter 8) • Caramazza, A; Shelton, JR (1998) Domain-specific knowledge systems in the brain: The animate-inanimate distinction, JOURNAL OF COGNITIVE NEUROSCIENCE 10 (1):1-34

  39. References • Damasio, AR (1989) Time-locked multiregional retroactivation - A systems-level proposal for the neural substrates of recall and recognition, COGNITION 33 (1-2):25-62 • Eriksen, CW; Yeh, YY (1985) Allocation of attention in the visual-field, JOURNAL OF EXPERIMENTAL PSYCHOLOGY-HUMAN PERCEPTION AND PERFORMANCE 11 (5):583-597 • Goldberg, AE (2003) Constructions: a new theoretical approach to language, TRENDS IN COGNITIVE SCIENCES 7 (5):219-224 • Griffin, ZM (2001) Gaze durations during speech reflect word selection and phonological encoding, COGNITION 82 (1):B1-B14 • Griffin, ZM (2005) Why look? Reasons for eye movements related to language production, The Interface of Language, Vision, and Action (Henderson, JM; Ferreira F ed), Psychology Press:213-248 (Chapter 7) • Horowitz, TS; Wolfe, JM (1998) Visual search has no memory, NATURE 394 (6):575-577 • Itti, L; Arbib, MA (2006) Attention and the minimal subscene, Action to Language: via the Mirror Neuron System (Arbib, MA ed), Cambridge University Press:289-346 (Chapter 9) • Just, MA; Carpenter, PA; Keller, TA; Eddy, WF; Thulborn, KR (1996) Brain activation modulated by sentence comprehension, SCIENCE 274 (5284):114-116

  40. References • Kaan, E; Swaab, TY (2002) The brain circuitry of syntactic comprehension, TRENDS IN COGNITIVE SCIENCES 6 (8):350-356 • Keller, TA; Carpenter, PA; Just, MA (2001) The neural bases of sentence comprehension: a fMRI examination of syntactic and lexical processing, CEREBRAL CORTEX 11 (3):223-237 • Kuperberg, GR; McGuire, PK; Bullmore, ET; Brammer, MJ; Rabe-Hesketh, S; Wright, IC; Lythgoe, DJ; Williams, SCR; David, AS (2000) Common and distinct neural substrates for pragmatic, semantic, and syntactic processing of spoken sentences: an fMRI study, JOURNAL OF COGNITIVE NEUROSCIENCE 12 (2):321-341 • Laeng, B; Teodorescu, D (2002) Eye scanpaths during visual imagery reenact those of perception of the same visual scene, COGNITIVE SCIENCE 26 (2):207-231 • Meyer, AS (2005) The use of eye tracking in studies of sentence generation, The Interface of Language, Vision, and Action (Henderson, JM; Ferreira F ed), Psychology Press:191-211 (Chapter 6) • Murray, EA; Richmond, BJ (2001) Role of perirhinal cortex in object perception, memory, and associations, CURRENT OPINION IN NEUROBIOLOGY 11 (2):188-193 • Pulvermuller, F (2001) Brain reflections of words and their meaning, TRENDS IN COGNITIVE SCIENCES 5 (12):517-5242001

  41. References • Rainer, G; Asaad, WF; Miller, EK (1998) Memory fields of neurons in the primate prefrontal cortex, PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA 95 (25):15008-15013 • Richardson, DC; Spivey, MJ (2000) Representation, space and Hollywood Squares: looking at things that aren’t there anymore, COGNITION 76 (3):269-295 • Simmons, WK; Barsalou, LW (2003) The similarity-in-topography principle: Reconciling theories of conceptual deficits, COGNITIVE NEUROPSYCHOLOGY 20 (3-6):451-486 • Smith, EE; Jonides, J; Marshuetz, C; Koeppe, RA (1998) Components of verbal working memory: Evidence from neuroimaging, PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA 95 (3):876-882 • Stowe, L; Withaar, R; Wijers, A; Broere, C; Paans, A (2002) Encoding and storage in working memory during sentence comprehension. Sentence Processing and the Lexicon: Formal, Computational and Experimental Perspectives (Merlo, P; Stevenson, S eds), John Benjamins Publishing Company:181-206 • Stromswold, K; Caplan, D; Alpert, N; Rauch, S (1996) Localization of syntactic comprehension by positron emission tomography, BRAIN AND LANGUAGE 52 (3):452-473 • Tyler, LK; Moss, HE (2001) Towards a distributed account of conceptual knowledge, TRENDS IN COGNITIVE SCIENCES (6):244-252

  42. References • van der Meulen, FF; Meyer, AS; Levelt, WJ (2001) Eye movements during the production of nouns and pronouns, MEMORY AND COGNITION 29 (3):512-521 • van der Meulen, FF (2003) Coordination of eye gaze and speech in sentence production, Mediating • Warrington, EK; Shallice, T (1984) Category specific semantic impairments, BRAIN 107 (SEP):829-854 • Warrington, EK; McCarthy, RA (1987) Categories of knowledge: Further fractionations and an attempted integration, BRAIN 110 (5):1273-1296

More Related