Centrum för talteknologi

Multi-modal expression of Swedish prominence

Björn Granström, Centre for Speech Technology, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden

Historical background

  • Prosody for speech synthesis at KTH, together with Rolf Carlson

  • The Lund intonation model – Gösta Bruce et al.

Several joint projects

Profs – Prosodic phrasing in Swedish ~1989-1992

Gösta Bruce, Björn Granström and more

First reference: G. Bruce and B. Granström. Modelling Swedish intonation in a text-to-speech system. STL-QPSR, 30(1):17-21, 1989. (on the KTH web)

Several joint projects, cont.

Prosodiag - Prosodic Segmentation and Structuring of Dialogue (HSFR + NUTEK) 1993–1996

Gösta Bruce, Björn Granström, Kjell Gustafson, David House, Paul Touati

Project Description

The object of study is the prosody of dialogue in a language technology framework. The primary goal of the project is to increase our understanding of how prosodic aspects of speech are exploited interactively in dialogue and on the basis of this increased knowledge to be able to create a more powerful prosody model.

Late reference: Gösta Bruce, Johan Frid, Björn Granström, Kjell Gustafson, Merle Horne, and David House. Prosodic segmentation and structuring of dialogue. TMH-QPSR, 37(3):1-6, 1996.

More than 20 joint publications – and then?

Is prosody more than sound?

  • Our bias: communication is multi-modal

  • Traditionally prosodic functions are signaled by “gestures”, perceived by “eye and ear”

  • This concerns both body and face gestures

  • Preliminary hypothesis: F0~eyebrow height - e.g. Cavé et al. (1996)

  • Easy to put to a test with multimodal speech synthesis

Eyebrow vs intonation

1 No eyebrow motion

2 Eyebrow motion controlled by the fundamental frequency of the voice

3 Eyebrow motion at focal accents +

4 Eyebrow motion at the first focal accent +

“Jag heter Axel, inte Axell” (translation: “My name is Axel, not Axell”). In Sweden Axel is a first name as opposed to Axell, which is a family name.

Goals and research context

  • How are visual expressions used to convey and strengthen prosodic functions?

  • Understand interactions between visual expressions, dialog functions and speech acoustics

  • Context: animated talking agent

    • Realistic communicative behavior using multimodal speech synthesis

Visual prosodic functions

  • Prominence

  • Utterance type

  • Dialogue functions (back channeling)

Visual prosody cont.

  • What is underlying?

  • How tight is the AV connection?

  • What are the important visual gestures?

  • More optional than acoustic prosodic parameters?

  • Individual and cultural variation

  • Reinforcing or qualifying acoustics?


Formal experiment

Prominence due to eyebrow rise

5 content words: ”När pappa fiskar stör piper Putte” (When dad is fishing sturgeon, Putte is whimpering)

Example of stimuli

Task: “which word is most prominent” (identical acoustics – varied location of eyebrow movement)

Eyebrow movement

No eyebrow movement (neutral)

Feedback experiment

  • Mini dialogues (two turns)

  • Travel agent application

  • Both visual and acoustic feedback cues

  • Affirmative cues – agent understands/accepts the request

  • Negative cues – agent is unsure about the request (seeks confirmation)

  • Six cues hypothesised

    Granström, House & Swerts (2002)

Pos/Neg feedback experiment

(Granström, House & Swerts 2002)

Recording of communicative interactions

Automatic tracking of reflective spots in 3D (Qualisys)

Interactions: emotion and articulation (resynthesis)

(from AV speech database – EU/PF_STAR project)

Measurement points for lip coarticulation analysis

Vertical distance

left mouth corner

Lateral distance

The expressive mouth

”left mouth corner”

  • All vowels


    • Encouraging

    • Happy

    • Angry

    • Sad

    • Neutral

(Svanfeldt et al. 2003)

Prompted read speech database

  • Expressive modes:

    • Confirming, questioning, certain, uncertain, happy, (angry)

  • 39 short, content neutral sentences with three possible focal accent positions each, e.g.

    • Båten seglade förbi (The boat sailed by)

    • Dom flyttade möblerna (They moved the furniture)

  • Nonsense words (VCV, VCCV, CVC)

  • Digits

    Nose marker traces with automatic (blue) and two human (red) annotated head nods (adapted from Cerrato & Svanfeldt 2006)

    Examples from the database

    Focal accent on: Båten seglade förbi

    Confirming Happy

    Exploitation of visual parameters

    • Visual cues exploited at focal accent

    • Mouth cues

      • Happy, encouraging

    • Eyebrow cues

      • Happy, questioning

    • Vertical head nods

      • Confirming

    Analysis in terms of FAP and FMQ

    MPEG-4 Facial Animation Parameters (FAPs): a subset of 31 of the 68 FAPs defined in the MPEG-4 standard, including only those that we were able to calculate directly from our measured point data

    Focal Motion Quotient, FMQ, defined as the standard deviation of a FAP parameter taken over a word in focal position, divided by the average standard deviation of the same FAP in the same word in non-focal position.
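The FMQ definition above can be illustrated with a minimal sketch. It assumes each FAP trajectory is available as a plain list of samples for one word, and it uses the population standard deviation; the slides do not say which estimator was used, and the function name and sample values here are hypothetical.

```python
from statistics import mean, pstdev


def focal_motion_quotient(focal_track, nonfocal_tracks):
    """Return the FMQ for one FAP and one word.

    focal_track: FAP samples over the word when it carries focal accent.
    nonfocal_tracks: one list of FAP samples per recording of the same
        word in non-focal position.
    Assumption: population standard deviation (pstdev) is used throughout.
    """
    focal_sd = pstdev(focal_track)
    # Average the per-recording standard deviations of the non-focal readings.
    nonfocal_sd = mean(pstdev(track) for track in nonfocal_tracks)
    if nonfocal_sd == 0:
        raise ValueError("non-focal readings show no variation; FMQ is undefined")
    return focal_sd / nonfocal_sd


# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
focal = [0, 2, 0, 2]                      # standard deviation 1.0
nonfocal = [[0, 1, 0, 1], [1, 2, 1, 2]]   # standard deviation 0.5 each
fmq = focal_motion_quotient(focal, nonfocal)
print(fmq)  # 2.0
```

Under this reading, an FMQ above 1 means the parameter varies more when the word is in focal position than when it is not.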


    The focal motion quotient, FMQ, averaged across all sentences, for all measured MPEG-4 FAPs for several expressive modes

    articulation | smile | brows | head

    The effect of focus on the variation of several groups of MPEG-4 FAP parameters, for different expressive modes

    FMQ (Focal Motion Quotient)

    The effect of focal accent on selected parameter variations in Certain and Uncertain readings

    FMQ (Focal Motion Quotient)

    What's next?

    • Better recordings

    • Detailed analysis of the eye region: ”Gaze and wrinkles”

    • Use in applications, e.g. spoken dialogue systems

    • And more audible prosody…….

    New cooperative project

    SIMULEKT - Simulering av svenskans prosodiska dialekttyper (Simulating intonational varieties of Swedish)

    VR 2007-2009

    And finally………..

    Congratulations!

    Well done Gösta!