
The role of pitch range and facial gestures in conveying prosodic meaning


Presentation Transcript


  1. Joan Borràs-Comes. The role of pitch range and facial gestures in conveying prosodic meaning. Ph.D. Project. Supervisor: Dr. Pilar Prieto

  2. Introduction (1/3)
  • Vision has a strong influence upon speech perception in normal verbal communication
  • Gesture and speech form a fully integrated system
    • Gestures are framed by speech (Goldin-Meadow 2005)
    • Only by looking at both can we predict how people learn, remember, and solve problems
  • Gestures and speech are co-expressive but not redundant
    • Gesture allows speakers to convey thoughts that may not easily fit into the categorical system that their conventional language offers (McNeill 1992/2005)
  • Vision has a clear role in the perception of various aspects typically associated with verbal prosody
    • Audiovisual cues for prosodic functions such as focus (Dohen & Lœvenbruck 2009) and question intonation (Srinivasan & Massaro 2003) have been successfully investigated

  3. Introduction (2/3)
  • Visual cues (eyebrow flashes, head nods, beat gestures) boost the perceived prominence of the words they occur with (Cavé et al. 1996, Erickson et al. 1998, Hadar et al. 1983, Krahmer & Swerts 2004/2007, Swerts & Krahmer 2008)
  • Audiovisual cues for traditional prosodic functions have been explored, such as:
    • phrasing (Barkhuysen et al. 2008)
    • face-to-face grounding (Nakano et al. 2003)
    • question intonation (Srinivasan & Massaro 2003)
  • …as have the audiovisual expressions of affective functions, such as:
    • signaling basic emotions (Barkhuysen et al. 2009, de Gelder et al. 1999)
    • uncertainty (Krahmer & Swerts 2005, Swerts & Krahmer 2005)
    • frustration (Barkhuysen, Krahmer & Swerts 2005)
  • In sign language prosody there is no "audio" component: the "articulatory effort" has shifted to the hands
  • Does visual prosody (eyebrow movements, eye blinks…) work in similar ways across signed and non-signed languages?
    • Different visual signs have specific prosodic functions
    • The combination of these visual signs gives rise to a subtle yet meaningful layer on top of the signing (Dachkovsky & Sandler 2009, for Israeli SL; Wilbur 2009, for American SL)
  (Figure: Dachkovsky & Sandler 2009)

  4. Introduction (3/3)
  • Most of the work has described a correlated mode of processing:
    • Vision partially duplicates acoustic information
    • Vision provides a powerful assist in decoding speech, e.g. in noisy environments
  • Many studies have found a weak visual effect relative to a robustly strong auditory effect
  • Dohen (2009): production and perception of contrastive informational focus
    • Suprasegmental perception of speech is multimodal
    • Production reveals visible correlates of contrastive focus (Dohen et al. 2006)
    • Prosodic contrastive focus is detectable from the visual modality alone (Krahmer & Swerts 2006)
    • Perception cues partly correspond to those used in production (Dohen & Loevenbruck 2005)
  • Srinivasan & Massaro (2003): discrimination between statements and questions
    • Much larger influence of auditory cues than of visual cues
    • Visual cues do not strongly signal interrogative intonation (House 2002)
  • In whispered speech (no F0):
    • auditory-only perception is degraded
    • adding vision clearly improves prosodic focus detection
    • RTs: adding vision reduces processing time (Dohen & Loevenbruck 2009)

  5. Goals
  • Vision provides a powerful assist:
    • in noisy environments
    • in whispered speech
  • So… what happens when acoustic information is ambiguous?
  • In Catalan intonation:
    • The nuclear pattern L+H* L% can be used to express a statement, a contrastive focus, or an echo question
    • Production analyses show that these three sentence types differ in their pitch accent height and may be distributed in three well-differentiated areas of the pitch range
    • L+¡H* L% can be used to express both contrastive foci and echo questions
  • Main hypothesis: for more ambiguous or underspecified parts of the speech stream, a complementary mode of processing may be possible, whereby vision provides information more efficiently than hearing

  6. Structure of the thesis
  • How do participants identify 3 pragmatic meanings across an auditory continuum?
  • Does the categorical contrast between 2 intonational contours elicit a specific MMN?
  • What is the contribution of visual and acoustic cues in perceiving an acoustically ambiguous intonational contrast?
  • Which gestural elements guide speakers' interpretations?
  • Future projects

  7. Study 1. Introduction
  • In Catalan, the same nuclear configuration L+H* L% is used to express 3 sentence types: statements, contrastive foci, and echo questions
  • The peak height indicates sentence type. Our initial hypothesis was that these three sentence types may be distributed in three well-differentiated areas of the pitch range
  • Com la vols, la cullera? — Petita[, sisplau]. ("How do you want the spoon?" "Small[, please].") [statement]
  • Volies una cullera gran, no? — Petita[, la vull, i no gran]. ("You wanted a big spoon, right?" "Small[ is what I want, not big].") [contrastive focus]
  • Jo la vull petita, la cullera. — Petita?[, n'estàs segur?] ("I want the spoon small." "Small?[ Are you sure?]") [echo question]

  8. Study 1. Methodology
  • 20 native speakers
  • 2 semantically motivated identification tasks:
    • Congruency test: participants rated the acceptability of each stimulus occurring within each of the three communicative contexts
    • Identification test: participants had to identify one of the three meanings for each isolated stimulus
  • Stimuli: an 11-step continuum created by modifying the F0 height of the noun phrase petita, with a distance of 1.2 semitones between adjacent steps (see the sketch below)
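As an illustration of how such a continuum is spaced, here is a minimal sketch; the 180 Hz base F0 is an invented placeholder, since the slides do not state the speaker's actual pitch:

```python
# Sketch: an 11-step F0 continuum spaced 1.2 semitones apart.
# The 180 Hz base value is an assumption, not a value from the study.
BASE_F0_HZ = 180.0
STEP_SEMITONES = 1.2

def step_f0(step: int) -> float:
    """F0 at a continuum step: one semitone is a factor of 2 ** (1 / 12)."""
    return BASE_F0_HZ * 2 ** (step * STEP_SEMITONES / 12)

continuum = [round(step_f0(k), 1) for k in range(11)]
print(continuum)  # step 10 sits 12 semitones (one octave) above step 0
```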

  9. Study 1. Results
  • Congruency test
    • One-way ANOVA: effect of linguistic context on sentence interpretation (F(2, 3582) = 16.579, p < .001)
    • Tukey HSD post-hoc tests:
      • Statement/Question (p < .001)
      • Correction/Question (p < .001)
      • Statement/Correction (p = .549)
  • Identification test
    • Statistical mode: S (Mo = 1), C (Mo = 4), Q (Mo = 10)
    • Wilcoxon signed-ranks tests:
      • Stat/Ques (T = 194964, p = .019, r = .001)
      • Corr/Ques (T = 178451.5, p = .001, r = .001)
      • Stat/Corr (T = 162770, p = .765, r < .001)
  (A sketch of these test families follows below.)
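For reference, both test families map onto standard SciPy calls. This is a minimal sketch on placeholder data; none of the numbers below are from the study:

```python
# Sketch: the two test families reported above, run on random
# placeholder ratings rather than the study's responses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
statement, correction, question = (rng.normal(m, 1.0, 100) for m in (2.0, 2.1, 3.0))

# One-way ANOVA: effect of linguistic context on congruency ratings.
f, p = stats.f_oneway(statement, correction, question)
posthoc = stats.tukey_hsd(statement, correction, question)  # pairwise post-hoc

# Wilcoxon signed-ranks test on one pair of paired response sets.
t_w, p_w = stats.wilcoxon(statement, question)
print(f"ANOVA: F = {f:.3f}, p = {p:.3g}; Wilcoxon: T = {t_w:.1f}, p = {p_w:.3g}")
```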

  10. Study 2. Introduction
  • Mismatch negativity (MMN) is sensitive to categorical contrasts in segmental and suprasegmental phonology
    • The phonological role of a vowel stimulus has a decisive role in the elicitation of the MMN (Näätänen et al. 1997, Kazanina et al. 2006, Shestakova et al. 2002)
    • A clear MMN is found when comparing the processing of lexical-tonal and intonational (statement vs. question) contrasts, although with different activation patterns between groups (Fournier et al., accepted)
  • Does the categorical contrast between statements and echo questions elicit a specific MMN response?
  • Collaboration with Carles Escera (UB)
  • MMN: an electrophysiological response that can be measured by subtracting the averaged response to a set of standard stimuli from the averaged response to rarer deviant stimuli, and taking the amplitude of this difference wave in a given time window (see the sketch below)
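The definition above translates directly into a difference-wave computation. A minimal NumPy sketch follows; the sampling rate, time window, and epoch arrays are illustrative assumptions, not the study's recordings:

```python
# Sketch: MMN = mean(deviant ERPs) - mean(standard ERPs), quantified
# as the mean amplitude of the difference wave in a chosen window.
import numpy as np

FS = 500  # sampling rate in Hz (assumed)

def mmn_amplitude(std_epochs, dev_epochs, window_ms=(100, 250)):
    """Epochs are (n_trials, n_samples) arrays for one electrode."""
    difference_wave = dev_epochs.mean(axis=0) - std_epochs.mean(axis=0)
    lo, hi = (int(ms * FS / 1000) for ms in window_ms)
    return float(difference_wave[lo:hi].mean())

# Placeholder epochs covering 600 ms from stimulus onset:
std = np.random.randn(160, 300)  # standards, 80% of trials
dev = np.random.randn(40, 300)   # deviants, 20% of trials
print(mmn_amplitude(std, dev))
```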

  11. Study 2. Methodology
  • 24 Central Catalan native speakers participated in an ERP study
  • Subjects were instructed to watch a silent video movie and to ignore the auditory stimulation
  • Materials: 4 auditory stimuli (0, 5, 10, 15)
    • Selected from a behavioural identification task (statement vs. question)
    • Same physical distance between every pair of stimuli (3 semitones)
    • Two allophonic differences (0-5 are both statements; 10-15 are both questions) and one categorical difference (between 5 and 10)
  • Each pair of stimuli constitutes an oddball block in our ERP study (sketched below)
    • The lower-pitch stimulus acted as the standard (STD, 80%); the higher acted as the deviant (DEV, 20%)
  • Hypothesis: the stimulus pair 5-10 should trigger the largest MMN amplitude
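A minimal sketch of how one such oddball block could be sequenced; the block length and the no-consecutive-deviants constraint are common conventions assumed here, not details given in the slides:

```python
# Sketch: a pseudo-random oddball block, 80% standards / 20% deviants.
import random

def oddball_block(n_trials=500, p_deviant=0.2, seed=1):
    rng = random.Random(seed)
    n_dev = int(n_trials * p_deviant)
    n_std = n_trials - n_dev
    # Insert each deviant into a distinct gap between standards, so no
    # two deviants are adjacent (an assumed sequencing constraint).
    gaps = set(rng.sample(range(n_std + 1), n_dev))
    seq = []
    for i in range(n_std + 1):
        if i in gaps:
            seq.append("DEV")
        if i < n_std:
            seq.append("STD")
    return seq

block = oddball_block()
print(block[:12], block.count("DEV") / len(block))  # proportion is 0.2
```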

  12. Study 2. Results
  • One-sample t-tests: MMN at each condition (see the sketch below)
    • 1st contrast (0-5, red): t(18) = -2.476, p < .05
    • 2nd contrast (5-10, black): t(18) = -6.119, p < 10^-5
    • 3rd contrast (10-15, green): t(18) = -3.467, p < .005
  • One-factor ANOVA (3 levels): effect of the acoustic contrast in the N1 time range, increasing in negativity with the pitch of the stimulus (F(2, 36) = 3.633, p < .05)
  • Crucially, the pair of stimuli that implies the phonological-categorical change elicited a greater MMN amplitude, although this was only marginally significant (F(2, 36) = 2.270, p = .118, including the Fz, F4, FC1, FC2 and Cz electrodes)
  • Stronger MMN brain response when contrasting intonational contours that were phonologically contrastive
  • This suggests that intonational contrasts in the target language are encoded automatically in the auditory cortex
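Each per-condition test amounts to a one-sample t-test of per-subject MMN amplitudes against zero. A minimal sketch on placeholder values follows; the sample of 19 is inferred from the reported df of 18, and the amplitudes are invented:

```python
# Sketch: one-sample t-test of per-subject MMN amplitudes against 0 uV.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
amplitudes = rng.normal(-1.0, 1.5, 19)  # placeholder amplitudes; df = 19 - 1 = 18

t, p = stats.ttest_1samp(amplitudes, popmean=0.0)
print(f"t(18) = {t:.3f}, p = {p:.3g}")  # a negative t is the MMN-like deflection
```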

  13. Study 3. Introduction
  • In Catalan, L+¡H* L% is partially ambiguous: it can be used to express both contrastive foci and echo questions
  • GOAL: to investigate the role of visual cues in disambiguating the meaning of two otherwise ambiguous F0 patterns
  • Our main hypothesis is that visual cues play a crucial role in this disambiguation
  (Figure: pitch contours for the statement, contrastive focus, and echo question readings)

  14. Study 3. Methodology
  • Semantically motivated identification task dealing with the contrast between contrastive foci and echo questions
  • 20 native subjects
  • Audiovisual stimuli
    • Auditory stimuli: the same as in Study 1 (only 6 steps)
    • Visual stimuli: a male native speaker was videotaped pronouncing the 2 possible interpretations of the intonational contour
    • From those 2 video files, 3 static images were extracted:
      • the initial neutral gesture
      • the one simultaneous with the H intonational peak
      • one representing the final state of the utterance
    • These three static images were aligned in time with the syllables of the auditory stimuli

  15. (Visual materials) Focus sequence; question sequence

  16. Study 3. Results (1/2)
  • Identification rate
    • One-way ANOVA
      • Clear preference for visual cues in the listeners' main decisions (F(1, 1175) = 77.000, p < .001)
      • Also an effect of the auditory stimulus (F(5, 1175) = 77.000, p < .001)
  • Reaction times
    • One-way ANOVA
      • Interaction between the auditory and visual information (F(5, 1175) = 1.716, p = .005)
    • When a question-based visual stimulus occurred with a low-pitched auditory stimulus, even if the identification response was "Question", there was a marked delay in the response
    • This was also the case when focus-based visual stimuli occurred with high-pitched auditory stimuli
  (Figure: identification rates and reaction times for the echo movie and the focus movie)

  17. Study 4
  • Fine-grained analysis of the gestural cues involved in the perception of meaning, by means of 3D-modelled stimuli
  • Goal: to find out which gestural elements are responsible for guiding the speakers' interpretations
  • Collaboration with Josep Blat & Núria Sebastian-Gallés (Dept. Tecnologia, UPF)
  • Wierzbicka (2000):
    • Certain components of facial behaviour have constant context-independent meanings
    • Facial expressions can convey meanings comparable to the meanings of verbal utterances
  • Auditory stimuli: a continuum of the name Marina
  • Visual stimuli:
    • Several computer-generated 3D avatars, in which each facial gesture implied in the change of pragmatic meaning will be manipulated separately
    • Each gesture element will appear in 3 degrees from its typical configuration

  18. Future projects
  • Study 5
    • Additional experiment using the gating paradigm (Grosjean 1996); a sketch of how gates are cut follows below
    • Stimuli: a set of gated utterances (statement, focus, question) in 3 modalities (A, V, AV)
    • Recognition point:
      • found first in the AV condition (between the 1st and 4th gates), closely followed by the V condition
      • responses in the A condition were late (after the 9th gate)
    • (Audio)visual:
      • Echo questions are recognized immediately (from the 1st gate), with no differences depending on the presence of simultaneous auditory input
      • Statements and contrastive foci are discriminated later (after the 5th gate), when participants perceive the gestural configuration as eminently marked and therefore belonging to a focused type
  • Study 6
    • We have explored the audiovisual influence on pitch accents. What about boundary tones?
      • L+H* LM% (obviousness statement)
      • L+H* LH% (anti-expectational question)
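The gating paradigm amounts to presenting successively longer onsets of each utterance. A minimal sketch of cutting the audio gates; the sampling rate, gate increment, and placeholder signal are assumptions, not the study's materials:

```python
# Sketch: gate construction for a gating task. Gate k presents the
# first k * GATE_MS milliseconds of the utterance.
import numpy as np

FS = 44100    # sampling rate in Hz (assumed)
GATE_MS = 80  # increment per gate in ms (assumed)

def make_gates(waveform, n_gates):
    """Return progressively longer onsets of the waveform."""
    step = int(FS * GATE_MS / 1000)
    return [waveform[: step * k] for k in range(1, n_gates + 1)]

utterance = np.zeros(FS)          # placeholder 1 s signal
gates = make_gates(utterance, 9)  # the slide reports effects up to the 9th gate
print([round(len(g) / FS, 2) for g in gates])
```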

  19. Thank you! Joan Borràs-Comes. The role of pitch range and facial gestures in conveying prosodic meaning. Ph.D. Project. Supervisor: Dr. Pilar Prieto
