Cues to Emotion: Language - PowerPoint PPT Presentation

Cues to Emotion: Language. Suzanne Yuen, Monday Oct 5, 2009, COMS 6998. Overview: Two-Stream Emotion Recognition for Call Center Monitoring; Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis.

Cues to Emotion: Language

Suzanne Yuen

Monday Oct 5, 2009

COMS 6998

Overview

  • Two-Stream Emotion Recognition for Call Center Monitoring

  • Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis

Two Stream Emotion Recognition for Call Center Monitoring

  • Background: To aid supervisors in the evaluation of agents at call centers*

  • Objective: To present a two-stream processing technique for detecting strong emotion

  • Previous Work:

    • Fernandez categorized affect into four main components: intonation, loudness, rhythm, and voice quality

    • Yang studied feature selection methods in text categorization and suggested that information gain should be used

    • Petrushin and Yacoub examined agitation and calm states in people-machine interaction

*A typical medium-sized call center receives about 100,000 calls per day
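
The information-gain criterion mentioned under Previous Work can be illustrated with a small sketch. It scores a term by how much knowing whether the term is present reduces the entropy of the class labels. The toy documents, labels, and function names below are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(term, documents, labels):
    """Entropy reduction obtained by splitting documents on presence of `term`."""
    with_term = [lab for doc, lab in zip(documents, labels) if term in doc]
    without_term = [lab for doc, lab in zip(documents, labels) if term not in doc]
    n = len(labels)
    # Weighted entropy of the two subsets; empty subsets contribute nothing.
    conditional = sum(
        len(subset) / n * entropy(subset)
        for subset in (with_term, without_term)
        if subset
    )
    return entropy(labels) - conditional


# Hypothetical toy data: each document is a set of words, with one label each.
docs = [
    {"thanks", "pleasure"},
    {"useless", "service"},
    {"thanks", "great"},
    {"disgusting", "service"},
]
labels = ["happy", "hot-anger", "happy", "hot-anger"]

ig_thanks = information_gain("thanks", docs, labels)
ig_great = information_gain("great", docs, labels)
```

In this toy set, "thanks" separates the two classes perfectly, so it receives a higher score than "great", which appears in only one document.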

Two-Stream Recognition

  • Semantic Stream

  • Performed speech-to-text conversion

  • Text classification algorithms identified phrases such as “pleasure,” “thanks,” “useless,” & “disgusting.”

  • Acoustic Stream

  • Extracted features based on pitch and energy

  • Trained on 900 calls (~60 hours of speech)

  • Vocabulary of more than 10,000 words

  • TF-IDF scheme (Term Frequency – Inverse Document Frequency)
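
As a rough illustration of the TF-IDF scheme named above, the following sketch weights each term by its relative frequency in an utterance times the log of the inverse document frequency. This is one common textbook variant; the slides do not say which variant the paper used, and the toy utterances and the `tf_idf` function are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy utterances, already split into words.
utterances = [
    "thanks it was a pleasure".split(),
    "this service is useless".split(),
    "thanks this is great".split(),
]
weights = tf_idf(utterances)
```

Terms that occur in few utterances, such as "pleasure" or "useless", get larger weights than terms shared across utterances, such as "thanks".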

Implementation

  • Method:

    • Two streams analyzed separately:

      • speech utterance/acoustic features

      • spoken text/semantics/speech recognition of conversation

    • Confidence levels of two streams combined

    • Examined 3 emotions

      • Neutral

      • Hot-anger

      • Happy

  • Tested two data sets:

    • LDC data

    • 20 real-world call-center calls
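
The slides say only that the confidence levels of the two streams were combined, not how. A minimal sketch, assuming a simple weighted average of per-emotion confidences, might look like this. The weighting rule, the `combine_streams` function, and the confidence values are all assumptions made for illustration.

```python
EMOTIONS = ("neutral", "hot-anger", "happy")


def combine_streams(acoustic, semantic, acoustic_weight=0.5):
    """Fuse per-emotion confidences from two streams by weighted average.

    The weighted average is an assumed fusion rule, not the paper's method.
    """
    if not 0.0 <= acoustic_weight <= 1.0:
        raise ValueError("acoustic_weight must be between 0 and 1")
    combined = {
        emotion: acoustic_weight * acoustic[emotion]
        + (1.0 - acoustic_weight) * semantic[emotion]
        for emotion in EMOTIONS
    }
    # Pick the emotion with the highest fused confidence.
    label = max(combined, key=combined.get)
    return label, combined


# Hypothetical confidences for one utterance.
acoustic_conf = {"neutral": 0.5, "hot-anger": 0.3, "happy": 0.2}
semantic_conf = {"neutral": 0.2, "hot-anger": 0.7, "happy": 0.1}
label, scores = combine_streams(acoustic_conf, semantic_conf)
```

Here the acoustic stream alone would pick "neutral", but the fused scores favor "hot-anger" because the semantic stream is much more confident in it.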

Two Stream - Conclusion

  • Table 2 suggested that two-stream analysis is more accurate than acoustic or semantic alone

  • LDC data recognition significantly higher than real-world data

  • Neutral emotion was recognized less accurately

  • Combination of two-stream processing showed improvement (~20%) in identification of “happy” and “anger” emotions

  • Low acoustic-stream accuracy may be attributed to the length of sentences in the real-world data: people typically do not express markedly different emotions over long sentences

Discussion

  • Gupta analyzed three emotions (happy, neutral, hot-anger): Why break it down into these categories? Implications? Can this technique be applied to a wider range of emotions? For other applications?

  • Speech to text may not translate the complete conversation. Would further examination greatly improve results? What are the pros and cons?

  • Pitch range was 50–400 Hz. The research may not be applicable outside this range. Do you think it is necessary to examine other frequencies?

  • In this paper, the TF-IDF (Term Frequency – Inverse Document Frequency) technique is used to classify utterances. Accuracy for acoustics only is about 55%. Previous research suggests that alternative techniques may be better. Would implementing them yield better results? What are the pros and cons of using the TF-IDF technique?
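
To make the 50–400 Hz pitch range in the discussion above concrete, here is a minimal sketch of pitch and energy extraction from one frame of audio. It assumes a plain autocorrelation pitch estimator whose lag search is limited to that range. The slides do not state which extraction method the paper used, so the functions and the synthetic test tone are hypothetical.

```python
import math


def estimate_pitch(samples, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate pitch (Hz) by finding the autocorrelation peak in [f_min, f_max]."""
    min_lag = int(sample_rate / f_max)  # shortest period considered
    max_lag = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # Return None when no positive correlation peak is found in the range.
    return sample_rate / best_lag if best_lag else None


def frame_energy(samples):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in samples) / len(samples)


# Synthetic 200 Hz tone, 0.1 s at 8 kHz, standing in for a voiced frame.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
pitch = estimate_pitch(tone, sample_rate)
energy = frame_energy(tone)
```

Because the lag search is bounded by `f_min` and `f_max`, any periodicity outside 50–400 Hz is not searched for directly, which is one way to read the limitation raised in the discussion question.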

Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis

  • Previous work:

    • 1995: Mozziconacci suggested that VQ combined with f0 could create affect

    • 2002: Gobl suggested that synthesized stimuli with VQ can add affective coloring. The study suggested that “VQ + f0” stimuli are more affective than “f0 only”

    • 2003: Gobl tested VQ with a large f0 range but did not examine the contribution of affect-related f0 contours

  • Objective: To examine the effects of VQ and f0 on affect expression

Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis

  • 3 series of stimuli based on the Swedish utterance “ja adjö”:

    • Stimuli exemplifying VQ

    • Stimuli with modal voice quality with different affect-related f0 contours

    • Stimuli combining both

  • Tested parameters exemplifying 5 voice qualities (VQ):

    • Modal voice

    • Breathy voice

    • Whispery voice

    • Lax-creaky voice

    • Tense voice

  • 15 synthesized stimuli test samples (see Table 1)

What is Voice Quality? Phonation Gestures

  • Derived from a variety of laryngeal and supralaryngeal features

  • Adductive tension: interarytenoid muscles adduct the arytenoid cartilages

  • Medial compression: adductive force on the vocal processes; adjustment of the ligamental glottis

  • Longitudinal tension: tension of the vocal folds

Tense Voice

  • Very strong tension of vocal folds, very high tension in vocal tract

Whispery Voice

  • Very low adductive tension

  • Medial compression moderately high

  • Longitudinal tension moderately high

  • Little or no vocal fold vibration

  • Turbulence generated by friction of air in and above larynx

Creaky Voice

  • Vocal fold vibration at low frequency, irregular

  • Low tension (only ligamental part of glottis vibrates)

  • The vocal folds strongly adducted

  • Longitudinal tension weak

  • Moderately high medial compression

Breathy Voice

  • Tension low

    • Minimal adductive tension

    • Weak medial compression

  • Medium longitudinal vocal fold tension

  • Vocal folds do not come together completely, leading to frication

Modal Voice

  • “Neutral” mode

  • Muscular adjustments moderate

  • Vibration of vocal folds periodic, full closing of glottis, no audible friction

  • Frequency of vibration and loudness in low to mid range for conversational speech

Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis

  • Six sub-tests with 20 native speakers of Hiberno-English

  • Rated on 12 affective attributes, presented as six opposing pairs:

    • Sad – happy

    • Intimate – formal

    • Relaxed – stressed

    • Bored – interested

    • Apologetic – indignant

    • Fearless – scared

  • Participants were asked to mark their response on a scale

(Rating-scale figure; surviving label: “No affective load”)

Voice Quality and f0 Test: Conclusion

  • Categorized results into 4 groups. No simple one-to-one mapping between quality and affect

  • “Happy” was the most difficult to synthesize

  • Suggested that, in addition to f0, VQ should be used to synthesize affectively colored speech. VQ appears to be crucial for expressive synthesis

Voice Quality and f0 Test: Discussion

  • If the scale runs from 1 to 7, then the midpoint (4) should be “neutral”; however, most ratings are less than 2. Do the conclusions (see Fig 2) seem strong?

  • In terms of VQ and f0, the groupings in Fig 2 seem to suggest that certain affects are closely related. What are the implications of this? For example, are happy and indignant affects closer than relaxed or formal? Do you agree?

  • Do you consider an intimate voice more “breathy” or “whispery”? Does your intuition agree with the paper?

  • Yanushevskaya found that VQ accounts for the highest affect ratings overall. How can the range of voice quality be compared with frequency? Do you think they are comparable? Is there a different way to describe these qualities?