Analysis and Synthesis of Shouted Speech

Analysis and Synthesis of Shouted Speech Tuomo Raitio Jouni Pohjalainen Manu Airaksinen Paavo Alku Antti Suni Martti Vainio

Shout • Shout is the loudest mode of vocal communication • It is used for increasing the signal-to-noise ratio (SNR) when communicating • over an interfering noise • over a distance • Shouting is also used for expressing emotions or intentions

Properties of shout • Shout is produced by raising the subglottal pressure and increasing the vocal fold tension • In effect, shout is characterized by • Increased sound pressure level (SPL) • Increased fundamental frequency (f0) • Increased amplitudes in mid-frequencies (1—4 kHz) • Increased duration and energy of vowels • Decreased duration and energy of consonants • Less accurate articulation

Why perform shout synthesis? • Fortunately, shouting is used rarely, but it is an essential part of human vocal communication • Shout synthesis may be required e.g. for creating speech with emotional content, and it can be used in human-computer interaction or in creating virtual worlds and characters

In this study… • In this study • Normal and shouted speech was recorded • Properties of normal and shouted speech were analyzed • Methods for producing natural sounding HMM-based synthetic shout are investigated

Recording of normal and shouted speech • Normal and shouted speech was recorded in an anechoid chamber • 22 Finnish speakers • 24 sentences of speech and shout from each speaker • A total of 1056 sentences • Subjects were asked to use very loud voice in shouting • In addition, a larger shouting corpus of 100 sentences was recorded from one male and one female for TTS purposes

Acoustic analysis of shout • The following acoustic properties were analyzed from the recorded shouted and normal speech: • sound pressure level (SPL) • duration • fundamental frequency (f0) • spectrum • properties of the voice source: • shape of the glottal pulse • H1-H2 parameter • NAQ parameter

Acoustic analysis of shout – Results • On average (speech  shout) • SPL increased 21 dB for females and 22 dB for males • Sentence duration increased 20% for females and 24% for males • f0 increased 71% for females and 152% for males • Spectrum was emphasized in the 1–4 kHz area

Female Male Overall Voiced Unvoiced

Problems… • Differences between normal speech and shout are large • This induces problems in many speech processing algorithms: • Due to high f0, the accurate estimation of speech spectrum is difficult • This is due to the biasing effect of the sparse harmonic structure of the shouted voice source • Especially linear prediction (LP) is prone to this type of bias

Spectrum estimation of shout • The biasing effect of the harmonics must be reduced • For this purpose, e.g. weighted linear prediction (WLP)can be used • In WLP, the effect of the excitation to spectrum is reduced • This is done by weighting the squared residual with a specific function

LP vs. weighted linear prediction (WLP) Conventional LP: Weighted LP:

Spectrum estimation of shout • Following spectrum estimation methods were compared for normal speech and shout: • Conventional linear prediction (LP) • WLP with STE weight (STE-WLP) • WLP with AME weight (AME-WLP) • STE – short time energy • AME – attenuation of the main excitation

LP vs. WLP in resynthesis • Subjective listening tests indicate that • WLP-AME performs best with normal speech • WLP-STE performs best with shout LP WLP-STE WLP-AME

LP vs. WLP in HMM-based speech synthesis • Subjective listening tests indicate that WLP-STE is preferred in the synthesis of shout (by adaptation) Female Male

Synthesis of shout (1) • HMM-based synthesis is a very flexible means to produce different speaking styles, such as shout Text Speech data Synthetic speech Synthesis Training Statistical model

Synthesis of shout (2) • It is difficult to obtain large amounts of shout data, enough for constructing a TTS voice Shout data

Synthesis of shout (3) • Statistical adaptation of the normal speech model was used to generate synthetic shouted speech Synthetic shout Text Speech data Synthesis Training Statistical model Adaptation Shout data

Synthesis of shout (4) • Alternatively, using simple voice conversion technique, the synthetic speech can be converted into shouted speech Synthetic shout Text Speech data Synthesis Training Statistical model Voice conversion Shout data

Evaluation (1) • The following speech types were selected for the test: • Natural normal speech • Natural shout • Synthetic normal speech • Synthetic shout (adapted) • Synthetic shout (voice conversion)

Evaluation (2) • MOS style listening test: the following properties were rated: • How would you rate the quality of the speech sample? • How much the sample resembles shouting? • How much effort did speaker use for producing speech? • Scale from 1 to 5 with verbal anchors • Loudness of the speech samples was normalized so that the ratings are based on other aspects than SPL • 11 test subjects evaluated 50 samples each

Results – Naturalness • Shout synthesis is rated lower in quality compared to normal speech synthesis (as expected) Normal synthesis Shout synthesis 26

Results – Impression of shouting • The impression of shouting is, however, fairly well preserved Natural shout Synthetic shout 27

Results – Vocal effort • Adaptation produces better impression of the used vocal effort compared to voice conversion method Adapted shout Voice conversion shout 28

Summary (1) • Synthesis of shout is challenging for many reasons: • It is difficult to obtain large amounts of shout data with consistent quality • Differences between normal speech and shout are large, which induces problems in many speech processing algorithms • In this work, the biasing effect of high-pitched shout was reduced by using weighted linear predictive (WLP) methods • Subjective listening tests show the that WLP models work better with shout than conventional LP

Summary (2) • In this study, synthetic shout was produced with two different techniques: • Adaptation • Voice conversion of the synthetic normal speech • Methods were rated equal in quality • Impression of shouting and the use of vocal effort were better preserved in the adapted shout

Samples Male Female Thank you!

Analysis and Synthesis of Shouted Speech

Analysis and Synthesis of Shouted Speech

Presentation Transcript

Speech synthesis

Speech Processing Text to Speech Synthesis

Synthesis and Analysis

SPEECH PRODUCTION,RECOGNITION, ANALYSIS, AND SYNTHESIS

Synthesis and Analysis of Aspirin

3. SPEECH RECOGNITION, ANALYSIS, AND SYNTHESIS

3. SPEECH RECOGNITION, ANALYSIS, AND SYNTHESIS

Speech Synthesis

Combined Gesture-Speech Analysis and Synthesis

Speech Synthesis

Combined Gesture-Speech Analysis and Synthesis

Speech Synthesis Technology

Speech Synthesis

Application of Speech Recognition, Synthesis, Dialog

Visible Speech Synthesis

4. Speech Synthesis

5- Speech Synthesis