
Speech Synthesis in the SPACE Reading Tutor Closing Symposium of the SPACE Project 06 FEB 2009



  1. Speech Synthesis in the SPACE Reading Tutor, Closing Symposium of the SPACE Project, 06 FEB 2009. Yuk On Kong, Lukas Latacz, Werner Verhelst, Laboratory for Digital Speech and Audio Processing, Vrije Universiteit Brussel

  2. Introduction

  3. To Record or Not to Record: That’s the question. • Pre-recorded speech in existing reading tutors • Advantages / disadvantages?

  4. Application-specific TTS • Speaker / voice • Material in speech corpus • How to synthesize • Any extra mode necessary? • the child is too slow… • How to maximize quality

  5. Speaker / Voice • Appealing to children • Female speaker • Standard Flemish pronunciation, no noticeable regional accent • Experienced speaker

  6. Material in Speech Corpus • Database (about 6 hours) • Material from stories for children • Words expected at 6 years of age • Diphones

  7. How to synthesize • Based on the general unit-selection paradigm. • Heterogeneous units: units can be of various sizes. • Rationale: • Using longer chunks leads to quality improvement. • Longer chunks are used for synthesizing domain-specific utterances. [Figure: the word “oma” to be synthesized, with its multi-tier segmentation into word (“oma”), syllables (“o”, “ma”) and segments/diphones (“_-o”, “o-m”, “m-a”).]
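The multi-tier view in the figure can be sketched as follows; this is a toy illustration with assumed names, not the SPACE code. The word “oma” is represented at word, syllable and segment level, with the diphone tier derived from the phone sequence (the trailing silence diphone is omitted, as in the figure).

```python
def diphone_tier(phones):
    """Pair each phone with its predecessor; '_' marks initial silence."""
    padded = ["_"] + phones
    return [f"{a}-{b}" for a, b in zip(padded, phones)]

# Multi-tier segmentation of the word "oma" (toy example).
tiers = {
    "word": ["oma"],
    "syllable": ["o", "ma"],
    "segment": ["o", "m", "a"],
    "diphone": diphone_tier(["o", "m", "a"]),  # ['_-o', 'o-m', 'm-a']
}
```

Every tier covers the same underlying phone sequence, only at a different granularity, which is what lets the selection search switch levels.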

  8. How to synthesize • Basic algorithm: search top-down, select the longest sequence of targets at each level, and go to lower levels only if no candidates are found. • Coarticulation is modelled even across word boundaries. • Levels: diphone, syllable, word, phrase
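The top-down idea can be sketched as a toy greedy search (function and inventory names are assumptions, not the SPACE implementation): try the longest unit first and descend a level wherever no candidate exists in the corpus.

```python
def select_units(word, syllables, inventory):
    """Return (level, unit) pairs covering the word, longest units first."""
    if word in inventory["word"]:
        return [("word", word)]          # whole word found in the corpus
    units = []
    for syl in syllables:
        if syl in inventory["syllable"]:
            units.append(("syllable", syl))
        else:
            # descend to the segment level for this syllable
            units.extend(("segment", p) for p in syl)
    return units

# Toy inventory: the corpus contains the word "mama" and the syllable "ma".
inventory = {"word": {"mama"}, "syllable": {"ma"}}
```

For example, “mama” is covered by a single word-level unit, while “oma” falls back to a segment for “o” plus the syllable “ma”.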

  9. How to synthesize • Front-end (example input: “Als het flink vriest, kunnen we schaatsen.” – “If it freezes hard, we can skate.”): Tokenisation → Text Normalisation → Phrase and Pause Prediction → Part of Speech → Word Pronunciation → Silence Insertion → ToDI Intonation → Word Accent • Back-end: Unit Selection → Unit Concatenation, drawing on the Speech DB
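The front-end can be sketched as a chain of stages applied in order; the stage bodies below are hypothetical stubs standing in for the real modules, which are far richer.

```python
def tokenise(text):
    """Toy tokeniser: split off commas and sentence-final periods."""
    return text.replace(",", " ,").replace(".", " .").split()

def normalise(tokens):
    """Toy text normalisation: lowercase every token."""
    return [t.lower() for t in tokens]

def front_end(text, stages):
    """Run the input through each pipeline stage in sequence."""
    for stage in stages:
        text = stage(text)
    return text

tokens = front_end("Als het flink vriest, kunnen we schaatsen.",
                   [tokenise, normalise])
```

The real pipeline would continue with pause prediction, pronunciation, and ToDI intonation stages before handing targets to the back-end.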

  10. How to synthesize • Target prosody is described symbolically. • The best sequence of units is selected: • Weighted sum of target and join costs • Viterbi search • Joins: costs based on spectrum, pitch, energy, duration and adjacency • Concatenation: PSOLA-based algorithm with optimal coupling • [Table note: features marked with a * are also calculated for the neighboring segments, syllables or words. “Neighboring syllables” are restricted to the syllables of the current word; for segments and words, three neighbors on the left and three on the right are taken into account.]
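The selection step is a standard Viterbi search, which can be sketched generically as below; the cost functions here are toy stand-ins, not the real spectral/pitch/energy/duration costs.

```python
def viterbi_select(candidates, target_cost, join_cost):
    """Pick the min-cost unit sequence; candidates[i] lists units for position i."""
    # best[i][c] = (cumulative cost of the best path ending in c, predecessor)
    best = [{c: (target_cost(0, c), None) for c in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for c in candidates[i]:
            prev, cum = min(
                ((p, best[-1][p][0] + join_cost(p, c)) for p in candidates[i - 1]),
                key=lambda pc: pc[1],
            )
            layer[c] = (cum + target_cost(i, c), prev)
        best.append(layer)
    # Backtrack from the cheapest final unit.
    unit = min(best[-1], key=lambda c: best[-1][c][0])
    path = [unit]
    for layer in reversed(best[1:]):
        unit = layer[unit][1]
        path.append(unit)
    return path[::-1]
```

In the real system the weighted sum of target and join costs plays the role of these toy cost functions, with the weights trained as described later.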

  11. Extra Modes? • Phoneme-by-phoneme mode • Stress • Syllable mode

  12. Extra Modes? Demonstration: • Phoneme-by-phoneme • Stress on first phoneme • Syllable • Normal mode

  13. The Child is Too Slow… Choosing the appropriate reading speed for the child • Uniform WSOLA time-scaling • Insertion of additional silences between neighboring words • Reading along
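The second strategy, inserting silences between neighboring words, can be sketched as follows (function names and the sample rate are assumptions; each word’s waveform is kept intact, so no time-scaling artifacts are introduced).

```python
def insert_silences(word_signals, gap_ms, sample_rate=16000):
    """Concatenate word waveforms with gap_ms of silence between words."""
    gap = [0.0] * (sample_rate * gap_ms // 1000)
    out = []
    for i, samples in enumerate(word_signals):
        if i:                    # no silence before the first word
            out.extend(gap)
        out.extend(samples)
    return out
```

Uniform WSOLA time-scaling, by contrast, slows the speech itself while preserving pitch, which is why the two strategies can be combined.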

  14. The Child is Too Slow… [Architecture diagram: the reading tutor (error detection, tracking, teacher’s module, playback module) on Windows XP exchanges commands & timing info and audio with the synthesis module (synthesizer, assessment) running under Cygwin.]

  15. How to Maximize Quality Major synthesis problems • Join artifacts • Inappropriate prosody Interactive tuning of synthesis • Assisted by quality management • User can make small changes to the input text

  16. How to Maximize Quality • Approach: for each word, calculate the average target and join costs. • Predictor: a threshold based on the maximum and minimum of the cost c. • uj usually lies between 0 and 1 because of the training settings. • Accept the word if uj < 0.5, reject otherwise. • Weights: linear regression. • The best alphas are found iteratively (maximizing the F-score).
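The acceptance rule can be sketched like this; the min/max normalization and the alpha weights below are assumptions standing in for the values learned by linear regression on the training data.

```python
def accept_word(avg_target_cost, avg_join_cost, alphas, c_min, c_max):
    """Normalize the weighted cost to u_j and accept the word if u_j < 0.5."""
    c = alphas[0] * avg_target_cost + alphas[1] * avg_join_cost
    u = (c - c_min) / (c_max - c_min)  # usually lands in [0, 1] after training
    return u < 0.5
```

Words the predictor rejects are the ones flagged for interactive tuning, e.g. small user edits to the input text.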

  17. Other Special Aspects • Phrase and Silence Prediction • Context-dependent Weight Training

  18. Phrase and Silence Prediction • Types of pauses: heavy, medium and light • Phrase breaks: both heavy and medium pauses • Training: • No manual labeling; based on the pauses automatically labeled in the speech database • Iterative classification based on these pauses • Training of a memory-based learner (features such as POS, punctuation, ...)
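A memory-based learner classifies by storing labelled training instances and retrieving the nearest one; a minimal 1-nearest-neighbour sketch in that spirit is below. The (POS, punctuation, position) feature triples and labels are toy examples, not the real feature set or training data.

```python
def overlap_distance(a, b):
    """Count the symbolic features on which two instances disagree."""
    return sum(x != y for x, y in zip(a, b))

def classify_pause(instance, memory):
    """memory: list of (features, pause_label); pick the nearest neighbour."""
    return min(memory, key=lambda m: overlap_distance(instance, m[0]))[1]

# Toy memory of auto-labelled pause instances.
memory = [
    (("N", ",", "phrase-final"), "medium"),
    (("V", "", "phrase-medial"), "none"),
    (("N", ".", "phrase-final"), "heavy"),
]
```

An unseen instance such as an adjective before a sentence-final period would land nearest the “heavy” exemplar.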

  19. Context-dependent Weight Training Automatic adaptation (tuning) of weights Context-dependent weights • Context is described symbolically per phone Training: • Optimizing weights • Clustering optimized weights (decision trees)

  20. Context-dependent Weight Training • 7 subjects • 4 conditions: • Randomly selected corpus; context-dependent weights • Randomly selected corpus; untrained weights • Corpus selected based on word frequency; context-dependent weights • Corpus selected based on word frequency; untrained weights • 25 test utterances, AVI levels 1–5 (5 utterances per level) • Results: [chart not preserved in the transcript]

  21. Demonstration • Hierarchical unit selection: • AVI 1: “Dit is te gek, gilt ze.” (“This is crazy, she screams.”) • AVI 3: “Toch had hij liever de hond gehad.” (“Still, he would rather have had the dog.”) • AVI 5: “Roel ligt nog een paar dagen in het ziekenhuis.” (“Roel stays in the hospital for a few more days.”) • AVI 7: “De kleine huizen staan dicht tegen elkaar aan.” (“The small houses stand close together.”) • AVI 9: “Nou Henk, zie je nu wel dat je moeder hier fantastisch verzorgd wordt!” (“Well Henk, now you see that your mother is wonderfully cared for here!”)

  22. WSOLA [Figure: illustration of the WSOLA strategy. Top: original signal; bottom: WSOLA time-scaled signal.]

  23. Other Application • Audio-visual TTS • Example: “The sentence you hear is made out of many combinations of original sound and video, selected from the recordings of natural speech.” • Database containing about 20 minutes (LIPS Challenge ’08) • For better audio quality, the database should be much larger

  24. Future Work • Optimizing synthesis • User feedback • Expressive speech synthesis • Automated prosodic annotations • Quality Management • Evaluation & optimization of the algorithm • Compare with the perceived quality of synthesized sentences (MOS)

  25. Questions? • Thank you for your attention. • Acknowledgments: • Prof. Wivine Decoster (our speaker) • Jacques, Leen and other SPACE members • Wesley and other DSSP people • IWT

  26. THE END
