Analysis and Synthesis of Shouted Speech

Tuomo Raitio

Jouni Pohjalainen

Manu Airaksinen

Paavo Alku

Antti Suni

Martti Vainio

slide2

Shout

  • Shout is the loudest mode of vocal communication
  • It is used for increasing the signal-to-noise ratio (SNR) when communicating
    • over an interfering noise
    • over a distance
  • Shouting is also used for expressing emotions or intentions
slide3

Properties of shout

  • Shout is produced by raising the subglottal pressure and increasing the vocal fold tension
  • In effect, shout is characterized by
    • Increased sound pressure level (SPL)
    • Increased fundamental frequency (f0)
    • Increased amplitudes in mid-frequencies (1–4 kHz)
    • Increased duration and energy of vowels
    • Decreased duration and energy of consonants
    • Less accurate articulation
slide4

Why perform shout synthesis?

  • Fortunately, shouting is used rarely, but it is an essential part of human vocal communication
  • Shout synthesis may be required e.g. for creating speech with emotional content, and it can be used in human-computer interaction or in creating virtual worlds and characters
slide5

In this study…

  • Normal and shouted speech were recorded
  • Properties of normal and shouted speech were analyzed
  • Methods for producing natural-sounding HMM-based synthetic shout were investigated
slide6

Recording of normal and shouted speech

  • Normal and shouted speech were recorded in an anechoic chamber
  • 22 Finnish speakers
  • 24 sentences each of normal speech and shout from every speaker
  • A total of 1056 sentences (22 speakers × 24 sentences × 2 styles)
  • Subjects were asked to use a very loud voice when shouting
  • In addition, a larger shouting corpus of 100 sentences was recorded from one male and one female speaker for TTS purposes
slide8

Acoustic analysis of shout

  • The following acoustic properties were analyzed from the recorded shouted and normal speech:
    • sound pressure level (SPL)
    • duration
    • fundamental frequency (f0)
    • spectrum
    • properties of the voice source:
      • shape of the glottal pulse
      • H1-H2 parameter
      • NAQ parameter
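
The two voice source parameters listed above can be sketched in a few lines. This is a minimal illustration using the commonly cited definitions: NAQ is the peak-to-peak flow amplitude divided by the product of the negative peak of the flow derivative and the period length, and H1-H2 is the level difference between the first two harmonics. It assumes the glottal flow has already been estimated (for example by inverse filtering, which is not shown). The function names and the band-search strategy are assumptions, not the authors' code.

```python
import numpy as np

def naq(flow, fs, f0):
    """Normalized amplitude quotient of one glottal flow cycle.

    NAQ = f_ac / (d_peak * T): f_ac is the peak-to-peak flow amplitude,
    d_peak the magnitude of the negative peak of the flow derivative,
    and T the length of the fundamental period in seconds.
    """
    f_ac = flow.max() - flow.min()
    d_peak = -np.min(np.diff(flow)) * fs  # finite-difference derivative, per second
    period = 1.0 / f0
    return f_ac / (d_peak * period)

def h1_h2(source, fs, f0):
    """Level difference in dB between the first and second harmonic."""
    windowed = source * np.hanning(len(source))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(source), 1.0 / fs)

    def harmonic_amp(k):
        # Strongest spectral peak within half an f0 of the k-th harmonic.
        band = (freqs > (k - 0.5) * f0) & (freqs < (k + 0.5) * f0)
        return spectrum[band].max()

    return 20 * np.log10(harmonic_amp(1) / harmonic_amp(2))
```

A smaller NAQ and a smaller H1-H2 both generally indicate a more pressed, less breathy voice source, which is the direction one would expect when moving from normal speech to shout.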
slide9

Acoustic analysis of shout – Results

  • On average (speech → shout)
    • SPL increased 21 dB for females and 22 dB for males
    • Sentence duration increased 20% for females and 24% for males
    • f0 increased 71% for females and 152% for males
    • Spectrum was emphasized in the 1–4 kHz area
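
To put these averages on more familiar scales, the sketch below converts the SPL increase into a sound pressure ratio and the relative f0 increase into semitones. Only the four numbers come from the slide; the helper names are illustrative.

```python
import math

def db_to_pressure_ratio(db):
    """SPL difference in dB to a ratio of sound pressures (20*log10 rule)."""
    return 10 ** (db / 20)

def percent_increase_to_semitones(percent):
    """Relative f0 increase in percent to a musical interval in semitones."""
    return 12 * math.log2(1 + percent / 100)

# (SPL increase in dB, f0 increase in %) as reported on the slide
results = {"female": (21, 71), "male": (22, 152)}

for group, (spl_db, f0_pct) in results.items():
    ratio = db_to_pressure_ratio(spl_db)
    semitones = percent_increase_to_semitones(f0_pct)
    print(f"{group}: pressure x{ratio:.1f}, f0 +{semitones:.1f} semitones")
```

In other words, the sound pressure grows by roughly an order of magnitude, and the male f0 rises by well over an octave.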
slide11

[Figure labels: Female, Male; Overall, Voiced, Unvoiced]

slide13

Problems…

  • Differences between normal speech and shout are large
  • This induces problems in many speech processing algorithms:
    • Due to the high f0, accurate estimation of the speech spectrum is difficult
    • This is due to the biasing effect of the sparse harmonic structure of the shouted voice source
    • Especially linear prediction (LP) is prone to this type of bias
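
A quick way to see why a high f0 makes the harmonic structure sparse is to count the harmonics that fall below a given frequency. In the sketch below, the 120 Hz starting value is an assumed, typical male f0 (the slides do not give absolute values); the 152% increase is the male average reported earlier.

```python
def harmonics_below(f0_hz, limit_hz=4000.0):
    """Number of harmonics of f0 that lie at or below limit_hz."""
    return int(limit_hz // f0_hz)

normal_f0 = 120.0             # assumed typical male f0, not from the slides
shout_f0 = normal_f0 * 2.52   # +152 %, the reported male average

print(harmonics_below(normal_f0), "harmonics below 4 kHz in normal speech")
print(harmonics_below(shout_f0), "harmonics below 4 kHz in shout")
```

With fewer, more widely spaced harmonics sampling the vocal tract envelope, an all-pole model fitted by LP tends to lock onto individual harmonics instead of the formants.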
slide14

Spectrum estimation of shout

  • The biasing effect of the harmonics must be reduced
  • For this purpose, e.g. weighted linear prediction (WLP) can be used
  • In WLP, the effect of the excitation on the spectrum is reduced
    • This is done by weighting the squared residual with a specific function
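
The weighting idea can be sketched with NumPy as follows. This is a minimal covariance-style formulation solved through the weighted normal equations, not the authors' implementation. The short-time energy (STE) weight used here, the energy of the M samples preceding each sample, is the form commonly given in the WLP literature; the function names, the default window length, and the small regularizing constant are assumptions.

```python
import numpy as np

def ste_weights(x, order, window=None):
    """STE weight: W[n] = sum over i = 1..M of x[n-i]^2 (M defaults to the order)."""
    m = window or order
    energy = np.convolve(x ** 2, np.ones(m))[:len(x)]  # sum over i = 0..M-1
    shifted = np.concatenate(([0.0], energy[:-1]))     # delay by one sample
    return shifted + 1e-9                              # keep every weight positive

def weighted_lp(x, order, weights=None):
    """Predictor coefficients a_1..a_p minimizing sum_n W[n] * e[n]^2.

    With weights=None every sample is weighted equally, which reduces to
    conventional (covariance-method) linear prediction.
    """
    n = np.arange(order, len(x))
    w = np.ones(len(n)) if weights is None else weights[n]
    # Columns hold the delayed signals x[n-1], ..., x[n-p].
    past = np.column_stack([x[n - k] for k in range(1, order + 1)])
    # Weighted normal equations: (P^T W P) a = P^T W x
    lhs = past.T @ (w[:, None] * past)
    rhs = past.T @ (w * x[n])
    return np.linalg.solve(lhs, rhs)

# Usage on a synthetic frame (a stand-in for a real speech frame)
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
a_lp = weighted_lp(frame, order=10)
a_wlp = weighted_lp(frame, order=10, weights=ste_weights(frame, order=10))
```

Because the STE weight is large where the recent signal energy is high, the fit emphasizes the samples that follow the main excitation and gives less weight to the samples where the excitation itself occurs.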
slide15

LP vs. weighted linear prediction (WLP)

Conventional LP:

Weighted LP:
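
In their standard textbook form, the two criteria can be written as below. The notation is assumed here and is not taken from the slides: x_n is the speech sample, a_k are the predictor coefficients, p is the prediction order, and W_n is the weight. The STE weight is shown as one common choice, with M the length of the energy window.

```latex
% Conventional LP: minimize the energy of the prediction residual
E_{\mathrm{LP}} = \sum_{n} \Bigl( x_n - \sum_{k=1}^{p} a_k x_{n-k} \Bigr)^{2}

% Weighted LP: each squared residual sample is multiplied by a weight W_n
E_{\mathrm{WLP}} = \sum_{n} W_n \Bigl( x_n - \sum_{k=1}^{p} a_k x_{n-k} \Bigr)^{2}

% Short-time energy (STE) weight over the M preceding samples
W_n^{\mathrm{STE}} = \sum_{i=1}^{M} x_{n-i}^{2}
```

Setting W_n = 1 for all n reduces the weighted criterion to conventional LP.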

slide17

Spectrum estimation of shout

  • The following spectrum estimation methods were compared for normal speech and shout:
    • Conventional linear prediction (LP)
    • WLP with STE weight (WLP-STE)
    • WLP with AME weight (WLP-AME)
  • STE – short-time energy
  • AME – attenuation of the main excitation
slide18

LP vs. WLP in resynthesis

  • Subjective listening tests indicate that
    • WLP-AME performs best with normal speech
    • WLP-STE performs best with shout

[Labels: LP, WLP-STE, WLP-AME]

slide19

LP vs. WLP in HMM-based speech synthesis

  • Subjective listening tests indicate that WLP-STE is preferred in the synthesis of shout (by adaptation)

[Figure panels: Female, Male]

slide20

Synthesis of shout (1)

  • HMM-based synthesis is a very flexible means to produce different speaking styles, such as shout

[Diagram: Speech data → Training → Statistical model; Text + Statistical model → Synthesis → Synthetic speech]

slide21

Synthesis of shout (2)

  • It is difficult to obtain large amounts of shout data, enough for constructing a TTS voice

Shout data

slide22

Synthesis of shout (3)

  • Statistical adaptation of the normal speech model was used to generate synthetic shouted speech

[Diagram: Speech data → Training → Statistical model; Statistical model + Shout data → Adaptation; Text → Synthesis → Synthetic shout]
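
The slides do not name the adaptation algorithm, so the following is only a toy illustration of the general idea: a small amount of shout data is used to estimate a transform that moves the parameters of the normal-speech model toward shout. The sketch fits one global affine transform of state mean vectors by ordinary least squares. Practical HMM adaptation methods estimate such transforms by maximum likelihood, take covariances into account, and use several regression classes. All names and shapes here are hypothetical.

```python
import numpy as np

def fit_affine_mean_transform(normal_means, shout_means):
    """Least-squares affine map from normal-speech means to shout means.

    Both inputs have shape (num_states, dim) and are paired row by row.
    Returns a (dim + 1, dim) matrix; its last row is the bias term.
    """
    ones = np.ones((normal_means.shape[0], 1))
    extended = np.hstack([normal_means, ones])
    transform, *_ = np.linalg.lstsq(extended, shout_means, rcond=None)
    return transform

def adapt_means(all_normal_means, transform):
    """Apply the affine map to every state mean of the normal-speech model."""
    ones = np.ones((all_normal_means.shape[0], 1))
    return np.hstack([all_normal_means, ones]) @ transform
```

The point of such a transform is that it can be estimated from the few states observed in the shout data and then applied to all states of the model, including those never seen in shout.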

slide23

Synthesis of shout (4)

  • Alternatively, using a simple voice conversion technique, the synthetic speech can be converted into shouted speech

[Diagram: Speech data → Training → Statistical model; Text → Synthesis → Voice conversion (using Shout data) → Synthetic shout]

slide24

Evaluation (1)

  • The following speech types were selected for the test:
      • Natural normal speech
      • Natural shout
      • Synthetic normal speech
      • Synthetic shout (adapted)
      • Synthetic shout (voice conversion)
slide25

Evaluation (2)

  • MOS-style listening test in which the following questions were rated:
      • How would you rate the quality of the speech sample?
      • How much does the sample resemble shouting?
      • How much effort did the speaker use to produce the speech?
  • Scale from 1 to 5 with verbal anchors
  • Loudness of the speech samples was normalized so that the ratings are based on aspects other than SPL
  • 11 test subjects evaluated 50 samples each
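
Two steps of this procedure are easy to sketch: equalizing the level of the samples and summarizing the ratings. The slides do not say how loudness was normalized or how the scores were aggregated, so the RMS-based normalization, the target level, and the normal-approximation confidence interval below are all assumptions.

```python
import math
import statistics

def normalize_rms(samples, target_rms=0.1):
    """Scale a signal so that its RMS level equals target_rms."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)  # silent input: nothing to scale
    gain = target_rms / rms
    return [s * gain for s in samples]

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and the half-width of an approximate 95% interval."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# 11 listeners rating 50 samples each gives 550 ratings per question
ratings_per_question = 11 * 50
```

RMS level is only a rough proxy for perceived loudness; a perceptual loudness measure would be the more careful choice for a real test.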
slide26

Results – Naturalness

  • Shout synthesis is rated lower in quality compared to normal speech synthesis (as expected)

[Figure labels: Normal synthesis, Shout synthesis]

slide27

Results – Impression of shouting

  • The impression of shouting is, however, fairly well preserved

[Figure labels: Natural shout, Synthetic shout]

slide28

Results – Vocal effort

  • Adaptation conveys the impression of vocal effort better than the voice conversion method

[Figure labels: Adapted shout, Voice conversion shout]

slide29

Summary (1)

  • Synthesis of shout is challenging for many reasons:
    • It is difficult to obtain large amounts of shout data with consistent quality
    • Differences between normal speech and shout are large, which induces problems in many speech processing algorithms
  • In this work, the biasing effect of high-pitched shout was reduced by using weighted linear predictive (WLP) methods
  • Subjective listening tests show that WLP models work better with shout than conventional LP
slide30

Summary (2)

  • In this study, synthetic shout was produced with two different techniques:
    • Adaptation
    • Voice conversion of the synthetic normal speech
  • The two methods were rated equal in quality
  • Impression of shouting and the use of vocal effort were better preserved in the adapted shout
slide31

Samples

Male

Female

Thank you!
