Joint prosody prediction and unit selection for concatenative speech synthesis

Joint Prosody Prediction and Unit Selection for Concatenative Speech Synthesis. Ivan Bulyko and Mari Ostendorf, Electrical Engineering Department, University of Washington, Seattle.

Presentation Transcript

Joint Prosody Prediction and Unit Selection for Concatenative Speech Synthesis

Ivan Bulyko and Mari Ostendorf

Electrical Engineering Department

University of Washington, Seattle


Limited Domain Synthesis

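[Figure: flow diagram contrasting two synthesis pipelines for the example "Will you return from Seattle to Boston?" The labels below are recovered from the slide and grouped by the pipeline they most plausibly belong to.]

  • Standard Approach: Concept → Canonical Pronunciation → Prosody Prediction → Prosodic Target → Unit Selection (Dynamic Search over the Unit DB) → Sequence of Units → Waveform Concatenation → Speech

  • Our Approach: Concept → A Network of Pronunciations → Compose with the Unit DB → Find best path → Sequence of Units → Waveform Concatenation → Speech

  • Prosodic labels shown for the example: H*, L*, L*, H-H%

  • Alternative labels shown in the diagram: return[H*] or return[L*+H]; Seattle[L*] or Seattle[none]; Boston[L*][H-H%] or Boston[H*][H-H%]

  • C(i,j) appears as a cost label in the diagram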


Choice of Units and Prosodic Categories

Example: Will you return[H*] from Seattle[L*] to Boston[L*][H-H%]

  • Why symbolic prosodic targets?

    • They capture categorical perceptual differences

Boundary Tones: L-L%, L-H%, H-L%, H-H%

Pitch Accents:

  • high: H*, L+H*

  • low: L*, L*+H

  • downstepped: !H*, L+!H*, H+!H*


Modeling Prosody with WFSTs

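[Figure: the prosody model for "Will you return from Seattle to Boston", built as the union (+) of a template and a prosody prediction component. Categories allowed per word: return low/high, Seattle low/none, Boston low/high, followed by the boundary tone H-H%.]

Template arcs (label / weight):

  • Will you return[high] from / 0.4

  • Will you return[low] from / 1.2

  • Seattle[low] / 0.5

  • Seattle[none] / 0.9

  • to

  • Boston[low][H-H%]

  • Boston[high][H-H%]

Prosody prediction arcs (label / weight; "ds" presumably denotes downstepped):

  • from[none] / 0.2, from[low] / 1.8, from[high] / 2.2, from[ds] / 2.7

  • Seattle[none] / 1.2, Seattle[low] / 0.3, Seattle[high] / 0.8, Seattle[ds] / 2.1

  • ...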
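
One way to read the union step, sketched under the assumption that the weights are costs in the tropical semiring (costs add along a path and the cheapest alternative wins). The slide does not state the semiring, and the function name `union_min` is hypothetical. The weights are the "Seattle" arcs from the slide.

```python
def union_min(*arc_sets):
    """Pool weighted alternatives from several sources.

    Assumption: tropical semiring, so when the same label appears in
    more than one source, only its lowest cost matters for the best path.
    """
    merged = {}
    for arcs in arc_sets:
        for label, weight in arcs.items():
            merged[label] = min(weight, merged.get(label, float("inf")))
    return merged

# Arc weights for "Seattle" as they appear on the slide.
template = {"Seattle[low]": 0.5, "Seattle[none]": 0.9}
prediction = {
    "Seattle[none]": 1.2,
    "Seattle[low]": 0.3,
    "Seattle[high]": 0.8,
    "Seattle[ds]": 2.1,
}

combined = union_min(template, prediction)
best_label = min(combined, key=combined.get)
print(best_label, combined[best_label])
```

Under this reading the union makes every category reachable, including those missing from the template, while the weights still rank the alternatives.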


Representing Decision Trees with WFSTs

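[Figure: a decision tree and the equivalent WFST.]

Decision tree leaves:

  • F=a: P(X=s)=0.8, P(X=t)=0.2

  • F=b: P(X=s)=0.3, P(X=t)=0.7

WFST arcs (input:output/weight):

  • a:s/c(0.8)

  • a:t/c(0.2)

  • b:s/c(0.3)

  • b:t/c(0.7)

where c(p) = -log(p)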
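
The mapping from tree leaves to weighted arcs can be sketched as follows. This is a minimal illustration using the probabilities on the slide; the names `cost`, `leaves`, and `tree_to_arcs` are hypothetical, and the natural logarithm is an assumption since the slide does not give the base.

```python
import math

def cost(p: float) -> float:
    """Arc cost c(p) = -log(p), as defined on the slide (natural log assumed)."""
    return -math.log(p)

# Leaf distributions from the slide: feature value F -> P(X | F).
leaves = {
    "a": {"s": 0.8, "t": 0.2},
    "b": {"s": 0.3, "t": 0.7},
}

def tree_to_arcs(leaves):
    """Return one (input, output, weight) arc per (feature value, outcome) pair."""
    return [
        (feature, outcome, cost(p))
        for feature, dist in leaves.items()
        for outcome, p in dist.items()
    ]

arcs = tree_to_arcs(leaves)
for feature, outcome, weight in arcs:
    print(f"{feature}:{outcome}/{weight:.3f}")
```

Because the costs are negative log probabilities, adding costs along a path corresponds to multiplying probabilities, so the lowest-cost path is the most probable one.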


Modular Structure of Prosody Model

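[Figure: the prosody model is assembled level by level.]

  • Utterance level: Phrase Break Template + Prosody Prediction WFST, combined into a Prosody WFST for phrase breaks

  • Phrase level: Accent & Tone Template + Prosody Prediction WFST, combined into a Prosody WFST for accents and tones

  • Other levels (if necessary)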


Representing Unit DB as WFST

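[Figure: the unit database as a WFST for the word sequence "Seattle to Boston". An arc labeled to:uk/C(ui,uk) maps the word "to" to unit uk, weighted by the cost of concatenating uk after unit ui. The diagram also shows the neighboring units ui+1 and uk-1 together with the quantities d1 and d2.]

Concatenation Cost: C(ui,uk) = 0.5(d1+d2)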
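
The concatenation cost and a best-path search over candidate units can be sketched as below. This is an illustration only: the slides do not define how d1 and d2 are measured, so they are treated here as given distances, and the unit names, the distance values, the default for unlisted joins, and the helper names are all hypothetical. A plain dynamic-programming search stands in for best-path search over the composed WFST.

```python
def concat_cost(d1: float, d2: float) -> float:
    """C(ui, uk) = 0.5 * (d1 + d2), the formula given on the slide."""
    return 0.5 * (d1 + d2)

def best_unit_sequence(candidates, join_cost):
    """Find the cheapest unit sequence by dynamic programming.

    candidates[t] lists the unit ids available for word t;
    join_cost(u, v) is the cost of concatenating u followed by v.
    Returns (total cost, list of chosen unit ids).
    """
    best = {u: (0.0, [u]) for u in candidates[0]}
    for options in candidates[1:]:
        nxt = {}
        for v in options:
            total, path = min(
                ((c + join_cost(u, v), p) for u, (c, p) in best.items()),
                key=lambda cp: cp[0],
            )
            nxt[v] = (total, path + [v])
        best = nxt
    return min(best.values(), key=lambda cp: cp[0])

# Hypothetical toy database: two recorded instances of each word.
candidates = [
    ["Seattle#1", "Seattle#2"],
    ["to#1", "to#2"],
    ["Boston#1", "Boston#2"],
]
# Hypothetical (d1, d2) pairs; joins not listed get a default of (1.0, 1.0).
distances = {
    ("Seattle#1", "to#1"): (0.0, 0.0),
    ("Seattle#2", "to#2"): (0.2, 0.4),
    ("to#1", "Boston#2"): (0.2, 0.4),
    ("to#2", "Boston#2"): (0.6, 0.2),
}

def join_cost(u, v):
    d1, d2 = distances.get((u, v), (1.0, 1.0))
    return concat_cost(d1, d2)

total, path = best_unit_sequence(candidates, join_cost)
print(path, round(total, 3))
```

In the joint approach described in the slides, prosody costs would be added to these concatenation costs before the search; the sketch leaves them out to keep the example short.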


Experiments

  • 14 target utterances in 3 versions:

    A. no prosody prediction; unit selection is based entirely on concatenation costs

    B. only one zero-cost prosodic target in the template (all others have very high and equal costs)

    C. a prosody template that allows alternative paths weighted according to their relative frequency

  • Travel domain corpus from University of Colorado (~2hrs)

    • Automatically segmented

    • Annotated with ToBI labels (220 utterances)

  • 4 subjects, native speakers of American English


Conclusions and Future Work

  • Combining prosody prediction and unit selection improves naturalness

  • The WFST architecture is

    • flexible: accommodates variable-size units and different forms of prosody generation

    • efficient: composition and best-path search are fast operations, allowing real-time synthesis

  • Future work will focus on making these techniques applicable to subword units

