
Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011





Presentation Transcript


  1. Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011. Kei Hashimoto, Shinji Takaki, Keiichiro Oura, and Keiichi Tokuda, Nagoya Institute of Technology, 2 September 2011

  2. Background • HMM-based speech synthesis • The quality of synthesized speech depends on the acoustic models • Model estimation is one of the most important problems • An appropriate training algorithm is required • Deterministic annealing EM (DAEM) algorithm • To overcome the local maxima problem • Step-wise model selection • To perform joint optimization of model structures and state sequences

  3. Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work

  4. Overview of HMM-based system • Training part: excitation and spectral parameters are extracted from the speech database and, together with labels, used to train context-dependent HMMs and duration models • Synthesis part: input text is analyzed into labels, excitation and spectral parameters are generated from the HMMs, and excitation generation followed by a synthesis filter produces the synthesized speech

  5. Base techniques • Hidden semi-Markov model (HSMM) • HMM with an explicit state duration probability distribution • Estimates state output and duration probability distributions • STRAIGHT • A high-quality speech vocoding method • Spectrum, F0, and aperiodicity measures • Parameter generation considering GV • GV features are calculated only from speech regions, excluding silence and pauses • Context-dependent GV models
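As a concrete illustration of the GV computation mentioned above, here is a minimal sketch (the function and variable names are ours, not the system's): GV features are per-dimension variances of a parameter trajectory, computed only over frames marked as speech, with silence and pause frames masked out.

```python
import numpy as np

def global_variance(params, speech_mask):
    """Per-dimension global variance (GV) of a parameter trajectory,
    computed only over frames marked as speech (silence/pause excluded)."""
    return np.var(params[speech_mask], axis=0)

# Toy trajectory: 100 frames of 3-dimensional spectral parameters,
# with the first 10 frames treated as silence.
rng = np.random.default_rng(0)
traj = rng.normal(size=(100, 3))
mask = np.ones(100, dtype=bool)
mask[:10] = False
gv = global_variance(traj, mask)
```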

  6. Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work

  7. EM algorithm • Maximum likelihood (ML) criterion • Expectation-Maximization (EM) algorithm, with λ the model parameters, o the training data, and q the HMM state sequence • E-step: compute the posterior probability P(q | o, λ) • M-step: update λ to maximize the expected complete-data log likelihood Σ_q P(q | o, λ) log P(o, q | λ) • Suffers from the local maxima problem

  8. DAEM algorithm • Posterior probability tempered by a temperature parameter β: P_β(q | o, λ) ∝ P(o, q | λ)^β • Model update process • E-step: compute the tempered posterior • M-step: update λ as in standard EM • Gradually increase the temperature parameter β from near 0 toward 1 (standard EM)
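The tempered posterior at the heart of DAEM can be sketched for a simple 1-D Gaussian mixture (a stand-in for the HMM state posteriors; all names here are illustrative, not the system's code). Raising the likelihood to the power β flattens the posterior when β is small and recovers the standard EM posterior at β = 1:

```python
import numpy as np

def daem_posteriors(x, means, variances, weights, beta):
    """Tempered E-step for a 1-D Gaussian mixture: component posteriors
    proportional to (weight * likelihood)^beta.  beta near 0 gives almost
    uniform posteriors; beta = 1 recovers the standard EM posterior."""
    log_lik = (-0.5 * np.log(2 * np.pi * variances)
               - 0.5 * (x[:, None] - means) ** 2 / variances
               + np.log(weights))                    # shape (N, K)
    tempered = beta * log_lik
    tempered -= tempered.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(tempered)
    return post / post.sum(axis=1, keepdims=True)

def daem_fit(x, means, variances, weights,
             betas=(0.2, 0.5, 1.0), steps_per_beta=5):
    """Annealing loop: run EM updates at each temperature, increasing
    beta toward 1 (standard EM) as on the slide."""
    for beta in betas:
        for _ in range(steps_per_beta):
            g = daem_posteriors(x, means, variances, weights, beta)   # E-step
            nk = g.sum(axis=0)                                        # M-step
            means = (g * x[:, None]).sum(axis=0) / nk
            variances = (g * (x[:, None] - means) ** 2).sum(axis=0) / nk
            weights = nk / len(x)
    return means, variances, weights
```

Note how the low-temperature E-step spreads posterior mass almost uniformly over components, mirroring the uniform state-sequence probabilities shown on the following slides.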

  9. Optimization of state sequence • Likelihood function in the DAEM algorithm (trellis of state output and state transition probabilities over time) • At a low temperature, all state sequences have almost uniform probability

  10. Optimization of state sequence • As the temperature increases, the state sequence probabilities change from uniform to sharp

  11. Optimization of state sequence • At the final temperature, reliable acoustic models are estimated

  12. Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work

  13. Problem of context clustering • Context-dependent models • Appropriate model structures are required • Decision-tree based context clustering (e.g., questions such as Vowel?, /a/?, Silence?) • Assumption: state occupancies are not changed by clustering • However, state occupancies depend on the model structures • State sequences and model structures should therefore be optimized simultaneously

  14. Step-wise model selection • Gradually change the size of the decision trees • Perform joint optimization of model structures and state sequences • Minimum Description Length (MDL) criterion, with a tuning parameter that scales the penalty term • The penalty depends on the amount of training data assigned to the root node, the number of leaf nodes, and the dimension of the feature vector
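As a rough sketch of how an MDL-based split decision could look (the exact formula is our assumption following MDL-based clustering conventions, and the function and argument names are illustrative; α is the slide's tuning parameter scaling the penalty): a split is accepted when the log-likelihood gain exceeds a penalty that grows with the feature dimension and the training data at the root node.

```python
import math

def mdl_split_gain(ll_parent, ll_left, ll_right, alpha, dim, root_occupancy):
    """MDL change when splitting one leaf node into two.
    alpha: tuning parameter scaling the penalty (decreased step-wise, e.g. 4, 2, 1)
    dim: dimension of the feature vector
    root_occupancy: amount of training data assigned to the root node
    Returns a positive value when the split should be accepted."""
    gain = (ll_left + ll_right) - ll_parent           # log-likelihood gain
    penalty = alpha * dim * math.log(root_occupancy)  # cost of one extra leaf
    return gain - penalty
```

Decreasing α shrinks the penalty, so later selection passes accept more splits and grow larger trees, which is how the step-wise schedule gradually enlarges the model.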

  15. Model training process • 1. Estimate monophone models (DAEM) • Number of temperature parameter updates: 10 • Number of EM steps at each temperature: 5 • 2. Select decision trees by the MDL criterion using the tuning parameter • 3. Estimate context-dependent models (EM) • Number of EM steps: 5 • 4. Decrease the tuning parameter (4, 2, 1) and repeat from step 2
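The training process above can be enumerated as an explicit schedule; a small sketch (phase names are ours) that lists the (phase, parameter) steps in order:

```python
def training_schedule(n_temp_updates=10, em_steps_per_temp=5,
                      tuning_schedule=(4, 2, 1), em_steps=5):
    """Enumerate the training schedule: DAEM monophone training with an
    increasing temperature, then repeated MDL tree selection with a
    shrinking tuning parameter, each followed by EM re-estimation."""
    steps = []
    for i in range(1, n_temp_updates + 1):
        beta = i / n_temp_updates              # temperature rises toward 1
        steps += [("daem_em", beta)] * em_steps_per_temp
    for alpha in tuning_schedule:
        steps.append(("select_trees_mdl", alpha))
        steps += [("em", alpha)] * em_steps
    return steps

schedule = training_schedule()
```

With the slide's settings this yields 50 DAEM steps followed by three selection-and-re-estimation passes.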

  16. Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work

  17. Speech analysis conditions

  18. Likelihood & model structure • Average log likelihood of the monophone models • Number of leaf nodes • Phone set: Unilex (58 phonemes) • Number of leaf nodes (full-context): 6,175,466

  19. Experimental results • Compared with the benchmark HMM-based system • The NIT system achieved comparable performance • High intelligibility • Compared with the benchmark unit-selection system • Worse in speaker similarity • Better in intelligibility

  20. Speech samples • Generates highly intelligible speech • Some samples include voiced/unvoiced errors • Feature extraction and excitation modeling need improvement

  21. Conclusion • NIT HMM-based speech synthesis system • DAEM algorithm • Overcomes the local maxima problem • Step-wise model selection • Performs joint optimization of state sequences and model structures • Generates highly intelligible speech • Future work • Improve feature extraction and excitation modeling • Investigate the schedules of the temperature parameter and of step-wise model selection

  22. Thank you
