HMM-based speech synthesis: the new generation of artificial voices

HMM-based speech synthesis: the new generation of artificial voices. Thomas Drugman (thomas.drugman@umons.ac.be), TCTS Lab, « Laboratoire de Théorie des Circuits et de Traitement du Signal » (Circuit Theory and Signal Processing Lab). 25 people: 3 professors, 10 PhD students. Image & Video, Numerical Arts, Audio & Speech.

Presentation Transcript


  1. HMM-based speech synthesis: the new generation of artificial voices Thomas Drugman thomas.drugman@umons.ac.be

  2. TCTS Lab « Laboratoire de Théorie des Circuits et de Traitement du Signal » • 25 people: 3 Profs, 10 PhD students • Image & Video • Numerical Arts • Audio & Speech

  3. Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions

  4. Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions

  5. Speech Synthesis • Text-to-speech system: « Hello » • GOAL: read aloud an unknown text typed by the user

  6. Challenges • Naturalness • Intelligibility • Cost-effectiveness • Expressivity

  7. Challenge 3: Cost-effectiveness • Industry expects Intelligibility + Naturalness + … • Small footprint: a few megabytes • Small CPU requirements (embedded market) • Easy extension to other languages • Possibility to create new voices as fast as possible, through an automatic recording/segmentation process or through efficient voice conversion • Possibility to bootstrap an existing TTS voice into any voice

  8. Challenge 4 (new): Expressivity = “Emotional speech synthesis” (art!) • Being able to render an expressive voice, in terms of prosody and in terms of voice quality • Knowing when to do it (yet unsolved) • Today’s holy grail for the industry • Strategic advantage for whoever gets it first • New markets (ebooks?)

  9. Methods for Speech Synthesis • Expert-based (rule-based) approach • Corpus-based approach • Diphone concatenation • Unit Selection • Statistical parametric synthesis (“HMM-based synthesis”)

  10. Von Kempelen’s talking machine (1791) • Figure labels: main bellows, nostrils, mouth, small bellows, 'S' pipe, 'S' lever, 'Sh' lever, 'Sh' pipe • Prof. Thierry Dutoit

  11. Homer Dudley’s Voder (Bell Labs, 1936)

  12. And other developments in articulatory synthesis • Work by: K. Stevens, G. Fant, P. Mermelstein, R. Carré (GNUSpeech), S. Maeda, J. Schroeter & M. Sondhi… • More recently: O. Engwall, S. Fels (ArtiSynth), Birkholz and Kröger, A. Alwan & S. Narayanan (MRI)…

  13. Rule-based synthesis • Intelligibility • Naturalness • Mem/CPU/Voices • Expressivity [rating icons not preserved in transcript]

  14. Methods for Speech Synthesis • Expert-based (rule-based) approach • Corpus-based approach • Diphone concatenation • Unit Selection • Statistical parametric synthesis (“HMM-based synthesis”)

  15. Diphone concatenation • Intelligibility • Naturalness: ~ • Mem/CPU/Voices • Expressivity [other rating icons not preserved in transcript]

  16. Unit selection • Intelligibility • Naturalness • Mem/CPU/Voices: ~ • Expressivity: ~ [other rating icons not preserved in transcript]

  17. Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions

  18. Statistical Parametric Speech Synthesis • TRAINING: database → speech analysis → speech parameters → statistical modeling → SPS synthesizer • SYNTHESIS: « Hello! » (text) → statistical generation → speech parameters → speech processing → “Hello!” (speech)

  19. HMM-based speech synthesis (http://hts.sp.nitech.ac.jp/) • Intelligibility • Naturalness: ? • Mem/CPU/Voices • Expressivity: ? [other rating icons not preserved in transcript]

  20. TRAINING OF THE HMM-BASED SYNTHESIZER

  21. Parameter extraction

  22. Parameter extraction • Source-filter model: a pulse train (voiced) or white noise (unvoiced) excites a filter, which outputs the synthetic speech
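The pulse train / white noise / filter diagram on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the deck's actual vocoder: the function names, the 16 kHz sampling rate, the 100 Hz pitch and the one-pole filter coefficient are all hypothetical choices.

```python
import random

def pulse_train(n_samples, f0_hz, fs_hz):
    """Voiced excitation: one unit pulse per pitch period."""
    period = int(round(fs_hz / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Unvoiced excitation: zero-mean, unit-variance Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, a):
    """y[n] = x[n] - sum_k a[k] * y[n-1-k], where a holds a_1..a_p."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y

def synthesize_frame(voiced, n_samples, f0_hz=100.0, fs_hz=16000, a=(-0.9,)):
    """Pick the source from the voicing decision, then apply the filter.
    The default one-pole filter is a stand-in for a real spectral envelope."""
    if voiced:
        source = pulse_train(n_samples, f0_hz, fs_hz)
    else:
        source = white_noise(n_samples)
    return all_pole_filter(source, a)
```

In a real system the filter coefficients and the pitch would come from the parameters extracted for each frame, rather than from fixed defaults.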

  23. Labels

  24. Labels • Labels consist of a description of the phonetic environment • Contextual factors: phone identity, syntactic factors, stress-related factors, locational factors, …

  25. Labels Example

  26. HMM training

  27. System architecture • Contextual factors may affect duration, source and filter differently • Context-oriented clustering using decision trees

  28. System architecture • State Duration Model • HMM for Source and Filter • Decision tree for State Duration • Decision trees for Filter • Decision trees for Source

  29. Training decision trees • An exhaustive list of possible questions is first drawn up • Example:
      QS "LL-Nasal" {m^*,n^*,en^*,ng^*}
      QS "LL-Fricative" {ch^*,dh^*,f^*,hh^*,hv^*,s^*,sh^*,th^*,v^*,z^*,zh^*}
      QS "LL-Liquid" {el^*,hh^*,l^*,r^*,w^*,y^*}
      QS "LL-Front" {ae^*,b^*,eh^*,em^*,f^*,ih^*,ix^*,iy^*,m^*,p^*,v^*,w^*}
      QS "LL-Central" {ah^*,ao^*,axr^*,d^*,dh^*,dx^*,el^*,en^*,er^*,l^*,n^*,r^*,s^*,t^*,th^*,z^*,zh^*}
      QS "LL-Back" {aa^*,ax^*,ch^*,g^*,hh^*,jh^*,k^*,ng^*,ow^*,sh^*,uh^*,uw^*,y^*}
      QS "LL-Front_Vowel" {ae^*,eh^*,ey^*,ih^*,iy^*}
      QS "LL-Central_Vowel" {aa^*,ah^*,ao^*,axr^*,er^*}
      QS "LL-Back_Vowel" {ax^*,ow^*,uh^*,uw^*}
      QS "LL-Long_Vowel" {ao^*,aw^*,el^*,em^*,en^*,en^*,iy^*,ow^*,uw^*}
      QS "LL-Short_Vowel" {aa^*,ah^*,ax^*,ay^*,eh^*,ey^*,ih^*,ix^*,oy^*,uh^*}
      QS "LL-Dipthong_Vowel" {aw^*,axr^*,ay^*,el^*,em^*,en^*,er^*,ey^*,oy^*}
      QS "LL-Front_Start_Vowel" {aw^*,axr^*,er^*,ey^*}
      Total: about 1500 questions
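To make the role of these questions concrete, here is a minimal sketch of how one question could be answered for a context label. It assumes, as the `LL-` patterns suggest, that a label starts with the left-left phone followed by `^`, and that `*` is a wildcard; the example labels and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# Two of the questions from the slide, written as glob patterns.
QUESTIONS = {
    "LL-Nasal": ["m^*", "n^*", "en^*", "ng^*"],
    "LL-Front_Vowel": ["ae^*", "eh^*", "ey^*", "ih^*", "iy^*"],
}

def answer(question, label):
    """True if the label matches at least one pattern of the question."""
    return any(fnmatchcase(label, pattern) for pattern in QUESTIONS[question])

def split_by_question(question, labels):
    """Partition labels into the 'yes' and 'no' children of a tree node."""
    yes = [lab for lab in labels if answer(question, lab)]
    no = [lab for lab in labels if not answer(question, lab)]
    return yes, no

# Hypothetical labels: only the leading "phone^" part matters here.
labels = ["m^ah-n+iy=z", "iy^t-ah+k=s", "s^p-iy+ch=sil"]
yes, no = split_by_question("LL-Nasal", labels)
```

Each question therefore defines one candidate binary split of the training data; the tree-building step chooses among the roughly 1500 candidates.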

  30. Training decision trees • Decision trees are trained using a Maximum Likelihood criterion • Example: [figure not preserved in transcript]
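A Maximum Likelihood split criterion can be illustrated as follows. This sketch assumes one-dimensional observations and a single Gaussian per node, and it omits any stopping rule; the function names are hypothetical and this is not claimed to be the exact procedure used in the deck.

```python
import math

def node_log_likelihood(values, var_floor=1e-6):
    """Log-likelihood of the data under the ML Gaussian fitted to it:
    -N/2 * (log(2*pi*var) + 1). The floor avoids log(0)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, var_floor)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def split_gain(values, answers):
    """Likelihood gained by splitting a node with a yes/no question."""
    yes = [v for v, a in zip(values, answers) if a]
    no = [v for v, a in zip(values, answers) if not a]
    return (node_log_likelihood(yes) + node_log_likelihood(no)
            - node_log_likelihood(values))

def best_question(values, answers_by_question):
    """Pick the question whose split increases the likelihood the most."""
    return max(answers_by_question,
               key=lambda q: split_gain(values, answers_by_question[q]))

# Toy data: question "good" separates the two clusters, "bad" does not.
values = [0.0, 0.1, 5.0, 5.1]
candidates = {
    "good": [True, True, False, False],
    "bad": [True, False, True, False],
}
chosen = best_question(values, candidates)
```

The same greedy choice is repeated at every node, so contexts that behave alike end up sharing a leaf.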

  31. Emission likelihood and training • Finally, each leaf is modeled by a Gaussian Mixture Model (GMM) • Training is guided by the Viterbi and Baum-Welch re-estimation algorithms
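As an illustration of the emission likelihood of a leaf, the sketch below evaluates the log-likelihood of a scalar observation under a GMM. It assumes one-dimensional features and made-up mixture parameters; real systems use multivariate features, and the function names are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_log_likelihood(x, weights, means, variances):
    """log sum_m w_m N(x; mu_m, var_m), computed with the log-sum-exp trick
    so that very small component likelihoods do not underflow."""
    logs = [math.log(w) + gaussian_log_pdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

# Hypothetical two-component leaf.
score = gmm_log_likelihood(0.3, [0.6, 0.4], [0.0, 1.0], [1.0, 0.5])
```

Both Viterbi alignment and Baum-Welch re-estimation rely on scores of this kind for every frame and every state.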

  32. SYNTHESIS BY THE HMM-BASED SYNTHESIZER

  33. Text analysis

  34. Parameters generation

  35. Parameters generation • Given the sequence of labels, durations are determined by maximizing the state sequence likelihood • A trajectory through context-dependent HMM states is then known!
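A small sketch of the duration step, assuming (this is not stated on the slide) that each state's duration is modeled by a Gaussian. Under that assumption the most likely durations are the means, shifted in proportion to the variances when a total length is imposed. Function names and numbers are hypothetical.

```python
def state_durations(means, variances, total_frames=None):
    """Most likely state durations under Gaussian duration models.
    With no target length, each duration is the (rounded) mean.
    With a target length T, d_k = m_k + rho * var_k, where
    rho = (T - sum(m)) / sum(var) spreads the surplus by variance."""
    if total_frames is None:
        rho = 0.0
    else:
        rho = (total_frames - sum(means)) / sum(variances)
    # Rounding keeps durations integral; each state lasts at least one frame.
    return [max(1, int(round(m + rho * v))) for m, v in zip(means, variances)]

def expand_states(durations):
    """Turn durations into the frame-by-frame state trajectory."""
    trajectory = []
    for state, d in enumerate(durations):
        trajectory.extend([state] * d)
    return trajectory

natural = state_durations([3.2, 5.8, 4.1], [1.0, 2.0, 1.0])
slower = state_durations([3.0, 6.0, 4.0], [1.0, 2.0, 1.0], total_frames=17)
```

Note that rounding means the durations are not guaranteed to sum exactly to the requested total in every case.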

  36. Parameters generation • Using this trajectory, source and filter parameters are generated by maximizing the output probability • Dynamic features make the evolution more realistic and smooth
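The effect of dynamic features can be sketched with a small maximum-likelihood generation example. It assumes a one-dimensional static feature, a single delta feature defined as 0.5 * (c[t+1] - c[t-1]) with edge frames repeated, and diagonal covariances; the trajectory c then solves (WᵀΣ⁻¹W) c = WᵀΣ⁻¹μ. All names are hypothetical and the dense solver is for illustration only.

```python
def build_window_matrix(T):
    """Rows 2t and 2t+1 map the static trajectory to the static and delta
    features of frame t (edges are handled by repeating the end frames)."""
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0
        W[2 * t + 1][max(t - 1, 0)] -= 0.5
        W[2 * t + 1][min(t + 1, T - 1)] += 0.5
    return W

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def generate_trajectory(means, variances):
    """means/variances are interleaved per frame: [static_0, delta_0, ...]."""
    T = len(means) // 2
    W = build_window_matrix(T)
    prec = [1.0 / v for v in variances]
    A = [[sum(W[k][i] * prec[k] * W[k][j] for k in range(2 * T))
          for j in range(T)] for i in range(T)]
    b = [sum(W[k][i] * prec[k] * means[k] for k in range(2 * T))
         for i in range(T)]
    return solve(A, b)

# Static means jump from 0 to 1 between two states; delta means are 0.
step_means = [0.0, 0.0] * 3 + [1.0, 0.0] * 3
smooth = generate_trajectory(step_means, [1.0] * 12)
```

Without the delta rows the solution would simply copy the stepwise static means; with them, the jump between the two states is spread over neighbouring frames.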

  37. Speech synthesizers comparison

  38. Speech synthesizers comparison • Chart axes: Quality, Footprint • Systems shown: Unit Selection, HTS, Diphone Concatenation • Footprint values shown: 200 MB, 5 MB, <1 MB

  39. Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions

  40. Problem positioning • Parametric speech synthesizers generally suffer from a typical buzziness, as encountered in LPC-like vocoders • Source-filter approach: enhance the excitation signal • Diagram: pulse train / white noise → filter → synthetic speech

  41. Proposed solution • SOURCE • FILTER • T. Drugman, G. Wilfart, T. Dutoit, « A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis », Interspeech 2009

  42. Results Traditional: Proposed:

  43. Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions

  44. Problem of oversmoothing

  45. Compensation of oversmoothing

  46. Global Variance

  47. Global Variance

  48. Results

  49. Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions

  50. Speech synthesizers comparison [most rating icons not preserved in transcript]
      • Rule-based synthesis: Intelligibility • Naturalness • Mem/CPU/Voices • Expressivity
      • Diphone concatenation: Intelligibility • Naturalness: ~ • Mem/CPU/Voices • Expressivity
      • Unit selection: Intelligibility • Naturalness • Mem/CPU/Voices: ~ • Expressivity: ~
      • HMM-based speech synthesis: Intelligibility • Naturalness: ? • Mem/CPU/Voices • Expressivity: ?
