Voice quality: functions, analysis and synthesis VOQUAL’03 Geneva, August 27-29, 2003

Institute of Cognitive Sciences and Technologies - CNR Department of Phonetics and Dialectology - Padova. Emotions and Voice Quality: Experiments with Sinusoidal Modeling Authors: Carlo Drioli, Graziano Tisato, Piero Cosi, Fabio Tesser.


Presentation Transcript


  1. Emotions and Voice Quality: Experiments with Sinusoidal Modeling
     Authors: Carlo Drioli, Graziano Tisato, Piero Cosi, Fabio Tesser
     Institute of Cognitive Sciences and Technologies - CNR, Department of Phonetics and Dialectology - Padova
     Voice quality: functions, analysis and synthesis, VOQUAL’03, Geneva, August 27-29, 2003

  2. Outline
     - Objectives and motivations
     - Voice material
     - Acoustic indexes and statistical analysis
     - Neutral to emotive utterance mapping
     - Experimental results

  3. Objectives and motivations
     Long-term goals:
     - emotive speech analysis/synthesis
     - improvement of ASR/TTS systems
     Short-term goal:
     - preliminary evaluation of processing tools for the reproduction of different voice qualities
     Focus of talk:
     - analysis/synthesis of different voice qualities corresponding to different emotive intentions
     Method:
     - analysis of voice quality acoustic correlates
     - definition of a sinusoidal modeling framework to control voice timbre and phonation quality

  4. Voice material
     An emotive voice corpus was recorded with the following characteristics:
     - two phonological structures ’VCV: /’aba/ and /’ava/
     - neutral (N) plus six emotional states: anger (A), joy (J), fear (F), sadness (SA), disgust (D), surprise (SU)
     - 1 speaker, 7 recordings for each emotive intention, for each word

  5. Analysis of emotive speech: acoustic correlates
     Cue extraction and analysis:
     • Intensity, duration, pitch, pitch range, formants.
     • F0 stressed vowel mean and F0 mid values are strongly correlated.
     [Figure: F0 mean (global and for stressed vowel), F0 “mid”, and F0 range for neutral (N), anger (A), joy (J), fear (F), sadness (SA), disgust (D), surprise (SU)]

  6. Analysis of emotive speech: acoustic correlates
     Cue extraction and analysis (acoustic correlates of voice quality):
     • Shimmer, Jitter
     • HNR
     • Hammarberg’s index (HammI): difference between energy max in the 0-2000 Hz and 2000-5000 Hz frequency bands
     • Spectral flatness (SFM): ratio of the geometric to the arithmetic mean
     • Drop-off of spectral energy above 1000 Hz (Do1000): LS approx. of the spectral tilt above 1000 Hz
     • High- versus low-frequency range relative energy amount (Pe1000)
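The spectral indexes listed on this slide can be illustrated with a short NumPy sketch. It is not the authors' implementation: the Hann window, the use of a dB power spectrum, the dB/kHz unit for Do1000, and the reading of Pe1000 as the ratio of energy above 1000 Hz to energy below it are all assumptions made for illustration, and `spectral_indexes` is a hypothetical helper name.

```python
import numpy as np

def spectral_indexes(frame, sr):
    """Rough per-frame estimates of HammI, SFM, Do1000 and Pe1000 (illustrative only)."""
    windowed = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2 + 1e-12  # small floor avoids log(0)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    power_db = 10.0 * np.log10(power)

    # HammI: energy maximum in 0-2000 Hz minus energy maximum in 2000-5000 Hz (dB).
    low_band = freqs <= 2000.0
    mid_band = (freqs > 2000.0) & (freqs <= 5000.0)
    hamm_i = power_db[low_band].max() - power_db[mid_band].max()

    # SFM: geometric mean over arithmetic mean of the power spectrum (0..1).
    sfm = np.exp(np.mean(np.log(power))) / np.mean(power)

    # Do1000: least-squares line fitted to the dB spectrum above 1000 Hz.
    above = freqs > 1000.0
    slope_db_per_hz, _ = np.polyfit(freqs[above], power_db[above], 1)
    do1000 = slope_db_per_hz * 1000.0  # assumed unit: dB per kHz

    # Pe1000 (assumed definition): energy above 1000 Hz relative to energy below.
    pe1000 = power[above].sum() / power[~above].sum()

    return {"HammI": hamm_i, "SFM": sfm, "Do1000": do1000, "Pe1000": pe1000}
```

Calling `spectral_indexes(frame, 16000)` on a 1024-sample vowel frame returns one value per index; in practice the values would be averaged over the frames of the stressed or unstressed vowel before any statistical analysis.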

  7. Analysis of emotive speech: voice quality
     [Figure: voice quality patterns (distance from Neutral)]
     Voice quality characterization:
     - Anger: harsh voice (/’a/)
     - Disgust: creaky voice (/a/)
     - Joy, Fear, Surprise: breathy voice
     Discriminant analysis:
     - classification scores: 60/70 % for stressed and unstressed vowel
     - best score: Fear, Anger
     [Table: classification matrix for stressed vowel]

  8. Processing of emotive speech: method
     Neutral → Emotive transformation based on sinusoidal modeling and spectral processing
     Spectral conversion function design:
     - Spectral conversion model (Stylianou et al., 1998)
     - Spectral envelope conversion function ( : mfcc from )
     [Diagram labels: Neutral; Emotion j; Gaussian mixture model; conversion parameters; Neutral sinus. spectral envelope; Neutral sinus. spectral envelope after pitch shift; Emotive sinus. spectral envelope after Ts]
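The slide cites the Gaussian-mixture spectral conversion model of Stylianou et al. (1998) without reproducing its formula. As a rough sketch, that family of models maps a source feature vector x to F(x) = Σ_i p(C_i | x) [ν_i + Γ_i Σ_i⁻¹ (x − μ_i)], where the mixture weights, means μ_i and covariances Σ_i describe the source (here, Neutral) features and ν_i, Γ_i are the trained conversion parameters. The code below assumes this form; all function and parameter names are hypothetical, and the training of ν_i and Γ_i is not shown.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at x."""
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** len(mean) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def convert_envelope(x, weights, means, covs, nus, gammas):
    """Apply the assumed GMM conversion function to one feature vector x."""
    x = np.asarray(x, dtype=float)
    # Posterior probability of each mixture component given x.
    likelihoods = np.array(
        [w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs)]
    )
    posteriors = likelihoods / likelihoods.sum()
    # Posterior-weighted sum of the per-component linear transforms.
    y = np.zeros(len(nus[0]))
    for p, m, c, nu, gamma in zip(posteriors, means, covs, nus, gammas):
        y += p * (nu + gamma @ np.linalg.solve(c, x - m))
    return y
```

With a single component whose Γ equals Σ, the function reduces to a constant offset, which is a convenient sanity check before plugging in trained parameters.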

  9. Processing of emotive speech: method
     Neutral → Emotive transformation based on trained model
     [Figure panels: Target Disgust; Neutral; Disgust (Ps+Ts); Disgust; Sadness (Ps+Ts); Sadness; Target Sadness]

  10. Processing of emotive speech: results
      Neutral → Emotive transformation based on sinusoidal modeling
      [Figure: conditions Neutral, Ps+Ts, Ps+Ts+Sc, Target; emotions Anger, Disgust, Joy, Fear, Surprise, Sadness]
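Assuming "Ps" stands for the formant-preserving pitch shift mentioned elsewhere in the talk, one simple way to realize it in a sinusoidal model is to scale the partial frequencies and then re-read their amplitudes from the original spectral envelope, so that formant peaks stay where they were. The sketch below approximates the envelope by linear interpolation between the original partials; this is an illustrative simplification, not the authors' procedure, and `pitch_shift_partials` is a hypothetical name.

```python
import numpy as np

def pitch_shift_partials(freqs, amps, factor):
    """Scale partial frequencies by `factor` while keeping the spectral envelope.

    freqs must be in increasing order; amplitudes beyond the highest original
    partial are held at its value (np.interp clamps at the edges).
    """
    freqs = np.asarray(freqs, dtype=float)
    amps = np.asarray(amps, dtype=float)
    new_freqs = freqs * factor
    # Sample the (piecewise-linear) envelope at the shifted frequencies.
    new_amps = np.interp(new_freqs, freqs, amps)
    return new_freqs, new_amps
```

For example, `pitch_shift_partials([100, 200, 300, 400], [1.0, 0.5, 0.25, 0.125], 1.5)` moves the partials to 150, 300, 450 and 600 Hz and assigns each the envelope value found at its new position.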

  11. Processing of emotive speech: results
      Results:
      • Time-stretch and (formant-preserving) pitch shift alone can’t account for the principal emotion-related cues
      • Spectral conversion can account for some of the emotion cues
      • In general, the method can’t account for cues related to period-to-period variability (i.e., Shimmer, Jitter)
      • The inclusion of a noise model is required to evaluate the effect on HNR
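The period-to-period cues named above can be made concrete with the commonly used "local" definitions: jitter as the mean absolute difference between consecutive pitch periods divided by the mean period, and shimmer as the same ratio computed on per-period peak amplitudes. The talk does not state which variant was used, so the sketch below is an assumption, and `local_perturbation` is a hypothetical helper.

```python
import numpy as np

def local_perturbation(values):
    """Mean absolute cycle-to-cycle difference, relative to the mean value.

    Pass pitch periods (seconds) to get local jitter, or per-period peak
    amplitudes to get local shimmer.
    """
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(np.diff(values))) / np.mean(values)

# Illustrative input: periods alternating between 10 ms and 11 ms.
jitter = local_perturbation([0.010, 0.011, 0.010, 0.011])
```

A transformation that only rescales durations and pitch uniformly leaves these ratios unchanged, which is consistent with the slide's remark that such cues are not captured.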

  12. Conclusions
      • The sinusoidal framework was found adequate to process emotive information
      • It needs refinements (e.g., noise model, harshness model) to account for all the acoustic correlates of emotions
      • Results of processing are perceptually good
      Future work
      • Refinements of the model (e.g., noise model)
      • Adaptation to a TTS system
      • Search for speaker-independent transformation patterns (using multi-speaker corpora)
