Cues to Emotion: Language. Suzanne Yuen Monday Oct 5, 2009 COMS 6998 . Overview. Two-Stream Emotion Recognition for Call Center Monitoring Voice Quality and f 0 Cues for Affect Expression: Implications for Synthesis. Two Stream Emotion Recognition for Call Center Monitoring.
Cues to Emotion: Language Suzanne Yuen Monday Oct 5, 2009 COMS 6998
Overview • Two-Stream Emotion Recognition for Call Center Monitoring • Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
Two-Stream Emotion Recognition for Call Center Monitoring • Background: To aid supervisors in the evaluation of agents at call centers* • Objective: To present a two-stream processing technique to detect strong emotion • Previous Work: • Fernandez categorized affect into four main components: intonation, loudness, rhythm, and voice quality • Yang studied feature selection methods in text categorization and suggested that information gain should be used • Petrushin and Yacoub examined agitation and calm states in people-machine interaction *A typical medium-sized call center receives about 100,000 calls per day
Two-Stream Recognition • Semantic Stream • Performed speech-to-text conversion • Text classification algorithms identified phrases such as “pleasure,” “thanks,” “useless,” & “disgusting.” • Acoustic Stream • Extracted features based on pitch and energy • Trained on 900 calls, ~60 hours of speech • Vocabulary system of more than 10,000 words • TF-IDF scheme = Term Frequency – Inverse Document Frequency
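To make the TF-IDF bullet concrete, here is a minimal sketch of the common textbook weighting, tf × log(N / df). The slides do not say which TF-IDF variant the paper used, so the formula, the function name `tf_idf`, and the sample utterances are all assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: (count / document length) * log(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical transcribed utterances (not taken from the paper's data).
utterances = [
    "thanks it was a pleasure".split(),
    "this is useless and disgusting".split(),
    "thanks this is fine".split(),
]
weights = tf_idf(utterances)
```

Under this weighting, a word that occurs in every utterance gets weight zero, while a word confined to one utterance (such as “pleasure” here) is weighted more heavily than one shared across several (such as “thanks”).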
Implementation • Method: • Two streams analyzed separately: • speech utterance/acoustic features • spoken text/semantics/speech recognition of conversation • Confidence levels of two streams combined • Examined 3 emotions • Neutral • Hot-anger • Happy • Tested two data sets: • LDC data • 20 real-world call-center calls
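The slide says only that the confidence levels of the two streams were combined; it does not give the rule. One simple possibility is a weighted sum of per-emotion confidences, sketched below. The function name `fuse_confidences`, the `acoustic_weight` parameter, and the example scores are hypothetical and not taken from the paper.

```python
def fuse_confidences(acoustic, semantic, acoustic_weight=0.5):
    """Combine per-emotion confidences from two streams by weighted sum.

    This is an assumed fusion rule for illustration, not the paper's method.
    Returns the winning emotion label and the combined scores.
    """
    if not 0.0 <= acoustic_weight <= 1.0:
        raise ValueError("acoustic_weight must lie in [0, 1]")
    # Only score emotions that both streams produced a confidence for.
    emotions = acoustic.keys() & semantic.keys()
    combined = {
        emotion: acoustic_weight * acoustic[emotion]
        + (1.0 - acoustic_weight) * semantic[emotion]
        for emotion in emotions
    }
    label = max(combined, key=combined.get)
    return label, combined


# Hypothetical confidences for one utterance.
acoustic_scores = {"neutral": 0.5, "hot-anger": 0.3, "happy": 0.2}
semantic_scores = {"neutral": 0.2, "hot-anger": 0.7, "happy": 0.1}
label, combined = fuse_confidences(acoustic_scores, semantic_scores)
```

In this made-up case the acoustic stream alone would pick “neutral”, but the semantic evidence shifts the combined decision to “hot-anger”, which is the kind of correction a second stream is meant to provide.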
Two-Stream - Conclusion • Table 2 suggested that two-stream analysis is more accurate than acoustic or semantic analysis alone • Recognition on LDC data was significantly higher than on real-world data • Neutral emotion was recognized with lower accuracy • Combining the two streams improved identification of “happy” and “anger” emotions (~20%) • Low acoustic-stream accuracy may be attributed to the length of sentences in the real-world data: people do not usually exhibit markedly different emotions over long sentences
Discussion • Gupta analyzed three emotions (happy, neutral, hot-anger): Why break it down into these categories? Implications? Can this technique be applied to a wider range of emotions? To other applications? • Speech-to-text may not transcribe the complete conversation. Would further examination greatly improve results? What are the pros and cons? • Pitch range was 50-400 Hz. The research may not be applicable outside this range. Do you think it necessary to examine other frequencies? • In this paper, the TF-IDF (Term Frequency – Inverse Document Frequency) technique is used to classify utterances. Accuracy for acoustics only is about 55%. Previous research suggests that alternative techniques may be better. Would implementing them yield better results? What are the pros and cons of using the TF-IDF technique?
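To see what a fixed 50-400 Hz pitch range implies, here is a minimal autocorrelation-based f0 sketch that searches only the lags corresponding to that range. It is not the paper's pitch tracker; the function name `estimate_f0`, the 8 kHz sample rate, and the synthetic test tone are assumptions.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate f0 (Hz) for one frame from its autocorrelation peak.

    Only lags corresponding to frequencies in [f0_min, f0_max] are searched.
    Returns None if no lag in that range has positive autocorrelation.
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


def tone(freq_hz, sample_rate=8000, n_samples=400):
    """Synthetic sine wave used as a stand-in for a voiced speech frame."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


f0_in_range = estimate_f0(tone(200.0), 8000)
f0_out_of_range = estimate_f0(tone(600.0), 8000)
```

Because the search is restricted to lags inside the range, a tone outside it (600 Hz here) cannot be reported at its true frequency; whatever value comes back lies between 50 and 400 Hz. That is one concrete sense in which results may not carry over to voices outside the range.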
Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis • Previous work: • 1995; Mozziconacci suggested that VQ combined with f0 could create affect • 2002; Gobl suggested synthesized stimuli with VQ can add affective coloring. Study suggested that “VQ + f0” stimuli are more affective than “f0 only” • 2003; Gobl tested VQ with large f0 range. Did not examine contribution of affect-related f0 contours • Objective: To examine effects of VQ and f0 on affect expression
Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis • 3 series of stimuli based on a Swedish utterance – “jaadjo”: • Stimuli exemplifying VQ • Stimuli with modal voice quality and different affect-related f0 contours • Stimuli combining both • Tested parameters exemplifying 5 voice qualities (VQ): • Modal voice • Breathy voice • Whispery voice • Lax-creaky voice • Tense voice • 15 synthesized test stimuli (see Table 1)
What is Voice Quality? Phonation Gestures • Derived from a variety of laryngeal and supralaryngeal features • Adductive tension: interarytenoid muscles adduct the arytenoid cartilages • Medial compression: adductive force on vocal processes; adjustment of ligamental glottis • Longitudinal tension: tension of vocal folds
Tense Voice • Very strong tension of vocal folds, very high tension in vocal tract
Whispery Voice • Very low adductive tension • Medial compression moderately high • Longitudinal tension moderately high • Little or no vocal fold vibration • Turbulence generated by friction of air in and above larynx
Creaky Voice • Vocal fold vibration at low frequency, irregular • Low tension (only ligamental part of glottis vibrates) • The vocal folds strongly adducted • Longitudinal tension weak • Moderately high medial compression
Breathy Voice • Tension low • Minimal adductive tension • Weak medial compression • Medium longitudinal vocal fold tension • Vocal folds do not come together completely, leading to frication
Modal Voice • “Neutral” mode • Muscular adjustments moderate • Vibration of vocal folds periodic, full closing of glottis, no audible friction • Frequency of vibration and loudness in low to mid range for conversational speech
Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis • Six sub-tests with 20 native speakers of Hiberno-English • Rated on 12 different affective attributes (6 opposing pairs): • Sad – happy • Intimate – formal • Relaxed – stressed • Bored – interested • Apologetic – indignant • Fearless – scared • Participants were asked to mark their response on a scale [Scale illustration: “Intimate” and “Formal” labels, with “No affective load” marked]
Voice Quality and f0 Test: Conclusion • Categorized results into 4 groups. No simple one-to-one mapping between quality and affect • “Happy” was the most difficult to synthesize • Suggested that, in addition to f0, VQ should be used to synthesize affectively colored speech. VQ appears to be crucial for expressive synthesis
Voice Quality and f0 Test: Discussion • If the scale runs from 1 to 7, then the midpoint (4) should be “neutral”; however, most ratings are less than 2. Do the conclusions (see Fig 2) seem strong? • In terms of VQ and f0, the groupings in Fig 2 seem to suggest that certain affects are closely related. What are the implications of this? For example, are happy and indignant affects closer than relaxed or formal? Do you agree? • Do you consider an intimate voice more “breathy” or “whispery”? Does your intuition agree with the paper? • Yanushevskaya found that VQ accounts for the highest affect ratings overall. How can the range of voice quality be compared with frequency? Do you think they are comparable? Is there a different way to describe these qualities?