Analysis and synthesis of shouted speech
This presentation is the property of its rightful owner.
Sponsored Links
1 / 31

Analysis and Synthesis of Shouted Speech PowerPoint PPT Presentation


  • 77 Views
  • Uploaded on
  • Presentation posted in: General

Analysis and Synthesis of Shouted Speech. Tuomo Raitio Jouni Pohjalainen Manu Airaksinen Paavo Alku Antti Suni Martti Vainio. Shout. Shout is the loudest mode of vocal communication It is used for increasing the signal-to-noise ratio ( SNR) when communicating over an interfering noise

Download Presentation

Analysis and Synthesis of Shouted Speech

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Analysis and synthesis of shouted speech

Analysis and Synthesis of Shouted Speech

Tuomo Raitio

Jouni Pohjalainen

Manu Airaksinen

Paavo Alku

Antti Suni

Martti Vainio


Analysis and synthesis of shouted speech

Shout

  • Shout is the loudest mode of vocal communication

  • It is used for increasing the signal-to-noise ratio (SNR) when communicating

    • over an interfering noise

    • over a distance

  • Shouting is also used for expressing emotions or intentions


Analysis and synthesis of shouted speech

Properties of shout

  • Shout is produced by raising the subglottal pressure and increasing the vocal fold tension

  • In effect, shout is characterized by

    • Increased sound pressure level (SPL)

    • Increased fundamental frequency (f0)

    • Increased amplitudes in mid-frequencies (1—4 kHz)

    • Increased duration and energy of vowels

    • Decreased duration and energy of consonants

    • Less accurate articulation


Analysis and synthesis of shouted speech

Why perform shout synthesis?

  • Fortunately, shouting is used rarely, but it is an essential part of human vocal communication

  • Shout synthesis may be required e.g. for creating speech with emotional content, and it can be used in human-computer interaction or in creating virtual worlds and characters


Analysis and synthesis of shouted speech

In this study…

  • In this study

  • Normal and shouted speech was recorded

  • Properties of normal and shouted speech were analyzed

  • Methods for producing natural sounding HMM-based synthetic shout are investigated


Analysis and synthesis of shouted speech

Recording of normal and shouted speech

  • Normal and shouted speech was recorded in an anechoid chamber

  • 22 Finnish speakers

  • 24 sentences of speech and shout from each speaker

  • A total of 1056 sentences

  • Subjects were asked to use very loud voice in shouting

  • In addition, a larger shouting corpus of 100 sentences was recorded from one male and one female for TTS purposes


Analysis and synthesis of shouted speech

Acoustic analysis of shout

  • The following acoustic properties were analyzed from the recorded shouted and normal speech:

    • sound pressure level (SPL)

    • duration

    • fundamental frequency (f0)

    • spectrum

    • properties of the voice source:

      • shape of the glottal pulse

      • H1-H2 parameter

      • NAQ parameter


Analysis and synthesis of shouted speech

Acoustic analysis of shout – Results

  • On average (speech  shout)

    • SPL increased 21 dB for females and 22 dB for males

    • Sentence duration increased 20% for females and 24% for males

    • f0 increased 71% for females and 152% for males

    • Spectrum was emphasized in the 1–4 kHz area


Analysis and synthesis of shouted speech

Female

Male

Overall

Voiced

Unvoiced


Analysis and synthesis of shouted speech

Problems…

  • Differences between normal speech and shout are large

  • This induces problems in many speech processing algorithms:

    • Due to high f0, the accurate estimation of speech spectrum is difficult

    • This is due to the biasing effect of the sparse harmonic structure of the shouted voice source

    • Especially linear prediction (LP) is prone to this type of bias


Analysis and synthesis of shouted speech

Spectrum estimation of shout

  • The biasing effect of the harmonics must be reduced

  • For this purpose, e.g. weighted linear prediction (WLP)can be used

  • In WLP, the effect of the excitation to spectrum is reduced

    • This is done by weighting the squared residual with a specific function


Analysis and synthesis of shouted speech

LP vs. weighted linear prediction (WLP)

Conventional LP:

Weighted LP:


Analysis and synthesis of shouted speech

Spectrum estimation of shout

  • Following spectrum estimation methods were compared for normal speech and shout:

    • Conventional linear prediction (LP)

    • WLP with STE weight (STE-WLP)

    • WLP with AME weight (AME-WLP)

  • STE – short time energy

  • AME – attenuation of the main excitation


Analysis and synthesis of shouted speech

LP vs. WLP in resynthesis

  • Subjective listening tests indicate that

    • WLP-AME performs best with normal speech

    • WLP-STE performs best with shout

LP

WLP-STE

WLP-AME


Analysis and synthesis of shouted speech

LP vs. WLP in HMM-based speech synthesis

  • Subjective listening tests indicate that WLP-STE is preferred in the synthesis of shout (by adaptation)

Female

Male


Analysis and synthesis of shouted speech

Synthesis of shout (1)

  • HMM-based synthesis is a very flexible means to produce different speaking styles, such as shout

Text

Speech data

Synthetic speech

Synthesis

Training

Statistical model


Analysis and synthesis of shouted speech

Synthesis of shout (2)

  • It is difficult to obtain large amounts of shout data, enough for constructing a TTS voice

Shout data


Analysis and synthesis of shouted speech

Synthesis of shout (3)

  • Statistical adaptation of the normal speech model was used to generate synthetic shouted speech

Synthetic shout

Text

Speech data

Synthesis

Training

Statistical model

Adaptation

Shout data


Analysis and synthesis of shouted speech

Synthesis of shout (4)

  • Alternatively, using simple voice conversion technique, the synthetic speech can be converted into shouted speech

Synthetic shout

Text

Speech data

Synthesis

Training

Statistical model

Voice conversion

Shout data


Analysis and synthesis of shouted speech

Evaluation (1)

  • The following speech types were selected for the test:

    • Natural normal speech

    • Natural shout

    • Synthetic normal speech

    • Synthetic shout (adapted)

    • Synthetic shout (voice conversion)


Analysis and synthesis of shouted speech

Evaluation (2)

  • MOS style listening test: the following properties were rated:

    • How would you rate the quality of the speech sample?

    • How much the sample resembles shouting?

    • How much effort did speaker use for producing speech?

  • Scale from 1 to 5 with verbal anchors

  • Loudness of the speech samples was normalized so that the ratings are based on other aspects than SPL

  • 11 test subjects evaluated 50 samples each


  • Analysis and synthesis of shouted speech

    Results – Naturalness

    • Shout synthesis is rated lower in quality compared to normal speech synthesis (as expected)

    Normal synthesis

    Shout synthesis

    26


    Analysis and synthesis of shouted speech

    Results – Impression of shouting

    • The impression of shouting is, however, fairly well preserved

    Natural shout

    Synthetic shout

    27


    Analysis and synthesis of shouted speech

    Results – Vocal effort

    • Adaptation produces better impression of the used vocal effort compared to voice conversion method

    Adapted shout

    Voice conversion shout

    28


    Analysis and synthesis of shouted speech

    Summary (1)

    • Synthesis of shout is challenging for many reasons:

      • It is difficult to obtain large amounts of shout data with consistent quality

      • Differences between normal speech and shout are large, which induces problems in many speech processing algorithms

    • In this work, the biasing effect of high-pitched shout was reduced by using weighted linear predictive (WLP) methods

    • Subjective listening tests show the that WLP models work better with shout than conventional LP


    Analysis and synthesis of shouted speech

    Summary (2)

    • In this study, synthetic shout was produced with two different techniques:

      • Adaptation

      • Voice conversion of the synthetic normal speech

    • Methods were rated equal in quality

    • Impression of shouting and the use of vocal effort were better preserved in the adapted shout


    Analysis and synthesis of shouted speech

    Samples

    Male

    Female

    Thank you!


  • Login