
Presentation Transcript


  1. Getting to the Heart of the Matter; (or, “Speech is more than just the expression of text or language”) Nick Campbell, ATR Network Informatics Labs, Keihanna Science City, Kyoto, Japan, nick@atr.jp — LREC 2004

  2. Overview … • This talk addresses the current demand for so-called ‘emotion’ in speech, but points out that the issue is better described as the expression of ‘relationships’ and ‘attitudes’ than by the currently favoured raw (or ‘big-six’) emotional states.

  3. Comment • For decades now, we have been producing and improving methods for the input and output of speech signals by computer, but the market seems very slow to take up these technologies. • In spite of the early promises for human-computer voice-based interactions, the man or woman in the street has yet to make much use of this technology in their daily lives. • There are already applications where computers mediate in human spoken communications, but in only a few limited domains. • Our technology appears to have fallen short of its earlier promises!

  4. The latest buzz-word in speech technology research: ‘emotion’ • Why is it that the latest promises make so much of the word ‘emotion’? • Perhaps because the current technology is based so much upon written text as the core of its processing? • Speech recognition is evaluated by the extent to which it can ‘accurately’ transliterate a spoken utterance; and speech synthesis is driven, in the majority of cases, just from the input text alone.

  5. Real interactive speech (cf. read speech) • “spontaneous speech is ill-formed and often includes redundant information such as disfluencies, fillers, repetitions, repairs, and word fragments” S. Furui 2003 (and many others) • But we don’t just talk text! • Natural speech is interactive, so we show relationships as much as we give information … • And we don’t just talk sentences … – grunts are common!

  6. Example Dialogue(a person talking to a robot) • The human speaks • The robot employs speech recognition • (and presumably some form of processing) then replies using speech synthesis • (which the human supposedly understands) • The interaction is ‘successful’ if the robot responds in an intended manner

  7. Example dialogue 1 • Excuse me • Yes, can I help you? • Errm, I’d like you to come here and take a look at this … • Certainly, wait a minute please. • Can you see the problem? • No, I’m afraid I don’t understand what you are trying to show me. • But look at this, here … • Oh yes, I see what you mean!

  8. Example dialogue 2 • Oi! • Uh? • Here ! • Oh … • Okay? • Eh? • Look! • Ah!

  9. Which do we want? • As engineers: • The former – we can do it now • As humans: • The latter – it’s what we are used to • And the robots? • They should behave in the least obtrusive way – naturally!

  10. How should we talk with robots? • First, let’s take a look at how we talk with each other … not using actors – but real people • in everyday conversational situations … • Labov : the Observer’s Paradox • interactions lose their naturalness when an observer intrudes!

  11. Overcoming the Observer’s Paradox: analysis of a very large corpus of spoken interaction • The JST/CREST ESP project

  12. Japan Science and Technology Corporation (JST) Bulletin No. 131: “Information Technology for Everyday Life in an Advanced Media Society” / “A Computer Processing System for Expressive Speech” • JST/CREST ESP Project: expressive speaking styles • Nick Campbell, ATR Human Information Science Laboratories, Principal Investigator

  13. Project Goals • Speech technology • Speech synthesis with 'feeling' • Speaking-style feature analysis/detection • Corpus of spontaneous speech • 1000 hours of natural speech • Scientific contribution • Paralinguistics & communication

  14. Progress to date • More than 1000 hours recorded • 500 hours speech collected • 250 hours transcribed • 75 hours labelled • 25 voices • Interfaces & specs are evolving • We foresee some very new unit-selection techniques being developed

  15. The ‘Pirelli-calendar’ approach • In 1970 a team of photographers took 1000 rolls of 36-exposure film on location to an island in the Pacific in order to produce a calendar of twelve (glamour) images. • Similarly, if we record an ‘almost infinite’ corpus of speech, and develop techniques to extract the interesting portions, then we will produce data which is both representative and sufficient for studying the full range of speaking-styles used in ordinary human communication.

  16. Long-term recordings: daily interactive speech • MD & small head-mounted lavalier mic • conversations with parents / husband / friends / colleagues / clinic / others • Japanese native-language speakers, both sexes, mixed ages, mixed scenarios • recording over a continuing period • speaking-style correlates of changes in familiarity / interlocutor to be studied

  17. Problem – Solution – Proposal

  18. Transcription (anonymised)

  19. Labelling emotion: free input

  20. Discourse Act Labelling • a あいさつ greeting • b 会話終了 closing • c 自己紹介 introduce-self • d 話題紹介 introduce-topic • e 情報提供 give-information • f 意見、希望 give-opinion • g 応答肯定 affirm • h 応答否定 negate • i 受け入れ accept • j 拒絶 reject • k 了解、理解、納得 acknowledge • l 割り込み、相づち interject • m 感謝 thank • n 謝罪 apologize • o 反論 argue • p 提案、申し出 suggest, offer • q 気づき notice • s つなぎ connector • r 依頼、命令 request-action • t 文句 complain • u 褒める flatter • w 独り言 talking-to-self • x 言い詰まり disfluency • y 演技 acting • z 繰り返し repeat • r* 要求 request (a~z) • v* 確認を与える verify (a~z) • *? when the act is unclear (see LREC 2004, 09-SE Wednesday 4pm)
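
A minimal sketch of how this label inventory might be held in code, for readers who want to work with such transcripts: the dictionary below simply restates the codes listed on this slide, while the helper function and its handling of the r*/v*/? conventions are illustrative assumptions, not part of the ESP project's actual tools.

```python
# Discourse-act codes from the slide above, as a simple lookup table.
# The expand_label() helper is an illustrative assumption, not ESP tooling.
DISCOURSE_ACTS = {
    "a": "greeting",        "b": "closing",          "c": "introduce-self",
    "d": "introduce-topic", "e": "give-information", "f": "give-opinion",
    "g": "affirm",          "h": "negate",           "i": "accept",
    "j": "reject",          "k": "acknowledge",      "l": "interject",
    "m": "thank",           "n": "apologize",        "o": "argue",
    "p": "suggest/offer",   "q": "notice",           "s": "connector",
    "r": "request-action",  "t": "complain",         "u": "flatter",
    "w": "talking-to-self", "x": "disfluency",       "y": "acting",
    "z": "repeat",
}

def expand_label(code: str) -> str:
    """Expand a raw label code, handling the r*/v* prefixes and '?' for unclear cases."""
    if code.endswith("?"):
        return "unclear"
    if code.startswith("r") and len(code) == 2:   # r<x>: request of act <x>
        return f"request ({DISCOURSE_ACTS.get(code[1], '?')})"
    if code.startswith("v") and len(code) == 2:   # v<x>: verify act <x>
        return f"verify ({DISCOURSE_ACTS.get(code[1], '?')})"
    return DISCOURSE_ACTS.get(code, "unknown")

print(expand_label("e"))   # give-information
print(expand_label("re"))  # request (give-information)
```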

  21. Acoustic Analysis / Visualisation Tool • The tool displays: quasi-syllable boundaries; boundaries of quasi-syllabic nuclei; phonetic labels (if available); a sonorant-energy contour; the F0 contour; (a) variance in the delta-cepstrum; (b) formant / FFT cepstral distance; a composite (a & b) measure of reliability; glottal AQ (pressed vs. breathy); and estimated vocal-tract area functions.
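
The contours listed above are standard acoustic measures. As a rough illustration only (this is not the ATR visualisation tool itself), two of them can be approximated with the open-source librosa library; the file name below is a placeholder.

```python
# Rough sketch of extracting two of the contours named on this slide
# (F0 and an energy contour); "utterance.wav" is a placeholder file name.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)

# F0 contour via probabilistic YIN
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Frame-level RMS energy as a crude stand-in for the sonorant-energy contour
rms = librosa.feature.rms(y=y)[0]

# Per-frame variance of the delta-cepstrum, loosely related to measure (a)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta_var = np.var(librosa.feature.delta(mfcc), axis=0)

print(f0.shape, rms.shape, delta_var.shape)
```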

  22. Voice Quality & Affect • 13,604 conversational utterances • 1 female Japanese speaker (age 32) • listener / speech-act / emotion labels • Interlocutor categories: Child, Family, Friends, Others, Self

  23. Listener relations • Talking to: • child • family • friends • others • self

  24. NAQ & F0 by family: m1 - mother, m2 - father, m3 - baby girl, m4 - husband, m5 - big sister, m6 - nephew, m8 - aunt

  25. Meaningful speech is a uniquely human characteristic, but … • Apes use gestural communication, but not for communicating propositional content. • Birds and seals can mimic human sounds, but their tunes don't contain semantic meaning. • Bees can communicate precise geographical locations with their dances …. • African wild dogs show a high degree of social organisation, and they use body postures and the prosody of their barks to guide the hunt and keep the pack together.

  26. Human language development • It is likely that early humans used their voices in similar ways to the hunting dogs, and that the use of voice to complement or replace face-to-face communication (and touch) for social interaction and reassurance pre-dated propositional communication. • In this case, prosody would have been a precursor to meaningful speech, which developed later.

  27. Language as Distal Communication • The ‘park or ride’ hypothesis (Ross, 2001) on the development of language in humans: “Human mothers would have had to put down their helpless but heavy babies (who had difficulty in clinging on by themselves) in order to forage for food, but they maintained contact with each other through voice, or tone-of-voice” (my italics) • This distal communication would have reassured both mother and child that all was well, even though they might actually be out of direct sight of each other.

  28. Non-linguistic speech • “it is all too tempting to think of language as consisting of a set (infinite, of course) of independent meaning-form pairs. This way of thinking has become habitual in modern linguistics” (Hurford 1999) • But part of being human, and of taking one's place in a social network, also involves making inferences about the feelings of others and having an empathy for those feelings. (me)

  29. “Motherese” “If the origins of human language, or distal communication, can be traced back to the music of motherese, or infant-directed prosody, then it is easy to speculate that the sounds of the human voice replaced the vision of the face (and body) for the identification of social and security-related information.” (Falk, 2003, my italics).

  30. Prosody and Cognitive Neurology • “Just as stereoscopic vision yields more than the simple sum of input from the two eyes alone, so binaural listening probably gives us more than just the sum of the text and its linguistic prosody alone” (Auchlin 2003) • Language may be predominantly processed in the left brain, but much of its prosody is processed in the right.

  31. Right-brain prosody • Several studies have confirmed that understanding of propositional content activates the prefrontal cortex bilaterally, more on the left than on the right, and that, in contrast, responding to emotional prosody activates the right prefrontal cortex more. • “the right frontal lobe is perhaps particularly critical, maybe because of its central role in the neural network, for social cognition, including inferences about feelings of others and empathy for those feelings.” (Pandya et al., 1996) • See also Monrad-Krohn (1945~), etc …

  32. Binaural Speech Processing(an extreme view!) • Information coming into the right ear and the left ear is processed separately in the brain before being perceived as speech. • Speculation: • if the left brain (right ear) is more tuned for linguistic processing, and the right brain (left ear) more tuned for affective processing, then it is likely that the separate activation of the two hemispheres gives an extra-linguistic ‘depth’ to an utterance. (but cf telephones!)

  33. A two-tiered view of speech communication • Two types of utterance: • I-type express linguistic information • A-type express affect • The former can be well described by their text alone; but the latter also need prosodic info. • Any utterance may contain both I-type and A-type information, but is primarily of one type or the other. • The expressivity of an utterance is realised through a socially-motivated framework that determines its physical characteristics.
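
As a toy illustration of the I-type / A-type split, and nothing more than that: the features and threshold below are hypothetical stand-ins, not the analysis actually used in the ESP corpus.

```python
# Toy sketch of the I-type / A-type distinction described on this slide.
# The features and threshold are hypothetical illustrations, not the ESP method.
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str          # transcription ("uh", "Look!", or a full sentence)
    f0_range: float    # pitch excursion in semitones
    n_words: int       # amount of lexical content

def utterance_type(u: Utterance) -> str:
    """Label an utterance as mainly information-bearing or affect-bearing."""
    # Very short, prosodically lively utterances ("grunts") behave like A-type;
    # longer, lexically rich utterances are closer to I-type.
    if u.n_words <= 2 and u.f0_range > 6.0:
        return "A-type (affect: needs prosodic information)"
    return "I-type (information: largely recoverable from the text)"

print(utterance_type(Utterance("Oi!", f0_range=9.0, n_words=1)))
print(utterance_type(Utterance("I'd like you to take a look at this.", 3.0, 8)))
```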

  34. A framework for utterance specification: Self + Other + Event • An utterance is realised as an event (=E*) taking place within the framework of mood and interest (=S) and friend and friendly (=O) constraints • mood & interest, friend & friendly: • If motivation or interest in the content of the utterance is high, then the speech is typically more expressive. If the speaker is in a good mood then more so … • If the listener (other) is a friend, then the speech is more relaxed, and in a friendly situation, then even more so • * I-type or A-type
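
A minimal sketch of how the Self + Other + Event specification might be written down as a data structure; the field names and the weighting in expressivity() are my own illustrative paraphrase of the slide, not an ESP schema.

```python
# Sketch of the Self + Other + Event utterance-specification framework.
# Field names paraphrase the slide; the weights are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Self:
    mood: float       # how good a mood the speaker is in (0..1)
    interest: float   # motivation / interest in the content (0..1)

@dataclass
class Other:
    friend: bool      # is the listener a friend?
    friendly: bool    # is the situation friendly?

@dataclass
class Event:
    kind: str         # "I-type" or "A-type"
    text: str

def expressivity(s: Self, o: Other) -> float:
    """Higher interest, mood, and familiarity -> more expressive, relaxed speech."""
    score = 0.5 * s.interest + 0.2 * s.mood
    score += 0.2 if o.friend else 0.0
    score += 0.1 if o.friendly else 0.0
    return score

e = Event(kind="A-type", text="Look!")
s, o = Self(mood=0.8, interest=0.9), Other(friend=True, friendly=True)
print(e.kind, expressivity(s, o))   # a lively A-type event in a friendly setting
```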

  35. Realising Conversational Speech Utterances

  36. Discussion • Our analysis of a very large corpus of natural spontaneous conversational speech indicates that both Information & Affect may be realised in parallel in speech, for both social and historical reasons • Speech synthesis (and recognition) should soon start to take these two different types of communication into consideration - i.e., not emotion, but function & interaction

  37. Conclusion • Speech conveys multiple tiers of information, not all of which are considered in present linguistic or speech technology research. • Prosody has an important and extra-linguistic communicative function which can be explained by language evolution and cognitive neurology. • If speech technology is to consider ‘language-in-action’ as well as ‘language-as-system’, then those levels of information which cannot be accurately portrayed by a transcription of the speech alone must be taken into consideration.

  38. Thank you

  39. speech language expression noise

  40. Monrad-Krohn • uses of speech prosody categorised into four main groups: • i) Intrinsic prosody, for the intonation contours which distinguish e.g., a declarative from an interrogative sentence, • ii) Intellectual prosody, for the intonation which gives a sentence its particular situated meaning by placing emphasis on certain words rather than others, • iii) Emotional prosody, for expressing anger, joy, and the other emotions, and • iv) Inarticulate prosody, which consists of grunts or sighs and conveys approval or hesitation

  41. Estimated glottal-flow waveforms • Definition of the glottal AQ (Amplitude Quotient): AQ = fac / dpeak (= T2 for a stylised, triangular glottal-flow pulse), where fac is the AC amplitude of the glottal flow and dpeak is the negative peak of the glottal-flow derivative • 1 – Breathy phonation: ~ “effective decay time” (Fant et al., 1994) • 2 – Pressed phonation • Figures taken from Alku et al. (JASA, August 2002) • With thanks to Parham Mokhtari

  42. Normalised Amplitude Quotient (NAQ) -- Alku et al. (2002) -- NAQ = AQ / T0 = AQ · F0 • Normalises the approximately inverse relationship with F0 • Yields a parameter more closely associated with phonation quality • NAQ is closely related to the glottal Closing Quotient (CQ), but is more reliably measured than CQ (Alku et al., 2002) • (figures: speakers FIA and FAN)
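
A worked sketch of the AQ and NAQ definitions from the last two slides; the numerical values are made-up illustrations, not measurements from the FIA or FAN data.

```python
# Worked example of the AQ and NAQ definitions (Alku et al., 2002),
# using made-up illustrative values rather than measured data.
def amplitude_quotient(f_ac: float, d_peak: float) -> float:
    """AQ = f_ac / d_peak: AC flow amplitude over the negative peak of the flow derivative."""
    return f_ac / d_peak

def normalised_aq(aq: float, f0: float) -> float:
    """NAQ = AQ / T0 = AQ * F0, removing the roughly inverse dependence on F0."""
    return aq * f0

aq = amplitude_quotient(f_ac=0.4, d_peak=250.0)   # result in seconds (illustrative)
naq = normalised_aq(aq, f0=200.0)                 # dimensionless
print(f"AQ = {aq:.4f} s, NAQ = {naq:.3f}")
```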
