
Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis

João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi. The Centre for Speech Technology Research, The University of Edinburgh.


Presentation Transcript


  1. Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis • João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi • The Centre for Speech Technology Research, The University of Edinburgh

  2. Outline • Introduction • Voice source model • System • Perceptual evaluation • Concluding remarks • Future work

  3. Introduction: HMM-based speech synthesizer [Tokuda et al.] • (Block diagram; labels: training speech, F0 extraction, spectral features estimation, HMMs, text, text analysis, F0, spectrum, pulse train, noise component, synthesis filter, synthetic speech)

  4. Voice source model: Obtaining the glottal source signal • Source-filter model: source Ug → vocal tract A(z) → lip radiation d/dz → speech • Inverse filtering: speech → inverse filter 1/A(z) → lip radiation cancellation (∫)
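
The inverse-filtering chain on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes autocorrelation-method LPC (Levinson-Durbin) and a leaky integrator with a hypothetical leak of 0.99 for the lip radiation cancellation. It also uses the common LPC convention in which the vocal tract is the all-pole filter 1/A(z) and the inverse filter is the FIR polynomial A(z), which is the reverse of the labels on the slide.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson(r, order):
    """Levinson-Durbin recursion; returns [1, a1, ..., ap], the coefficients of A(z)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # updated prediction error
    return a

def inverse_filter(speech, a):
    """FIR filtering with A(z): e[n] = sum_k a[k] * s[n - k]."""
    return [sum(a[k] * speech[n - k] for k in range(min(n + 1, len(a))))
            for n in range(len(speech))]

def integrate(x, leak=0.99):
    """Leaky integrator (assumed form) approximating lip radiation cancellation."""
    y, prev = [], 0.0
    for v in x:
        prev = v + leak * prev
        y.append(prev)
    return y
```

A typical call would be `integrate(inverse_filter(frame, levinson(autocorr(frame, p), p)))` for some LPC order `p`. Dropping the integration step leaves the derivative of the glottal signal instead.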

  5. Voice source model: Liljencrants-Fant model (LF-model) • T: period • to: opening instant • tp: instant of maximum airflow • te: instant of maximum excitation • ta: return phase duration • tc: closing instant • Ee: excitation amplitude
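
To make these parameters concrete, here is a sketch that generates one period of the LF-model waveform (the glottal flow derivative) from the standard LF equations: an exponentially growing sinusoid up to te, then an exponential return phase. The function names, the choices to = 0 and tc = T, and the use of bisection to solve the two implicit equations (for epsilon, and for alpha from the zero-net-area condition) are assumptions of this example, not details from the slides.

```python
import math

def bisect_root(f, lo, hi, iters=200):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lf_pulse(fs, T, tp, te, ta, Ee=1.0):
    """One LF period sampled at fs (Hz); times in seconds, to = 0, tc = T."""
    tc = T
    d = tc - te
    assert tp < te < 2 * tp and 0 < ta < d
    wg = math.pi / tp
    # Return phase constant: eps * ta = 1 - exp(-eps * (tc - te)).
    eps = bisect_root(lambda e: 1.0 - math.exp(-e * d) - e * ta,
                      (d - ta) / d ** 2, 1.0 / ta)
    # Closed-form area under the return phase (negative).
    area_ret = -(Ee / (eps * ta)) * ((1.0 - math.exp(-eps * d)) / eps
                                     - d * math.exp(-eps * d))
    s, c = math.sin(wg * te), math.cos(wg * te)

    def total_area(alpha):
        # Open-phase area with E0 chosen so that E(te) = -Ee, plus return area.
        h = alpha * s - wg * c + wg * math.exp(-alpha * te)
        return -Ee * h / ((alpha * alpha + wg * wg) * s) + area_ret

    lo, hi = -1.0, 1.0
    while total_area(lo) <= 0:
        lo *= 2.0
    while total_area(hi) >= 0:
        hi *= 2.0
    alpha = bisect_root(total_area, lo, hi)   # zero net flow over the period
    E0 = -Ee / (math.exp(alpha * te) * s)

    pulse = []
    for n in range(int(round(T * fs))):
        t = n / fs
        if t <= te:
            pulse.append(E0 * math.exp(alpha * t) * math.sin(wg * t))
        else:
            pulse.append(-(Ee / (eps * ta))
                         * (math.exp(-eps * (t - te)) - math.exp(-eps * d)))
    return pulse
```

For example, `lf_pulse(16000, 0.008, 0.0033, 0.0045, 0.0003)` gives a 128-sample period whose most negative value, -Ee, falls at te.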

  6. Voice source model: Other parameters of the LF-model • Open quotient (OQ) • Speed quotient (SQ) • Return quotient (RQ)
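
The formulas for these three quotients are not preserved in the transcript. The sketch below uses definitions that are common in the LF-model literature, with to = 0: OQ = (te + ta) / T, SQ = tp / (te - tp) and RQ = ta / T. These are an assumption here and may differ from what the slide showed.

```python
def lf_quotients(T, tp, te, ta):
    """Assumed definitions (to = 0); all arguments in the same time unit."""
    oq = (te + ta) / T        # open quotient: open fraction of the period
    sq = tp / (te - tp)       # speed quotient: opening vs closing duration
    rq = ta / T               # return quotient: relative return phase
    return oq, sq, rq
```

With T = 8 ms, tp = 3 ms, te = 4 ms and ta = 0.4 ms, these definitions give OQ = 0.55, SQ = 3 and RQ = 0.05.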

  7. Voice source model: Description of the LF-model spectrum • Fg: glottal spectral peak • Fc: spectral tilt • Linear stylization of the LF-model spectrum [Doval and d’Alessandro]

  8. Voice source model: Feature extraction • Utterances sampled at 16 kHz • Pitch-synchronous analysis (ESPS tools) • LPCs calculated with 20 ms windows centered at the glottal epochs • Inverse filtering to estimate the DGS • Pre-emphasis filter (α=0.97) • Low-pass filtering of the residual at 4 kHz
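
Two of the steps listed here are simple enough to sketch directly: the first-order pre-emphasis filter with α = 0.97, and cutting a 20 ms analysis frame centered at a glottal epoch. The function names and the zero-padding at the signal edges are assumptions of this example. The slide does not say which window shape was used, so none is applied.

```python
def preemphasis(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n - 1]."""
    return [x[n] - alpha * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def frame_at_epoch(x, epoch, fs=16000, dur=0.020):
    """Frame of `dur` seconds centered at sample index `epoch`.

    Samples outside the signal are zero-padded (an assumption of this sketch).
    """
    length = int(round(dur * fs))          # 320 samples at 16 kHz
    start = epoch - length // 2
    return [x[i] if 0 <= i < len(x) else 0.0
            for i in range(start, start + length)]
```

Each frame obtained this way would then go through LPC analysis and inverse filtering, one frame per glottal epoch.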

  9. Voice source model: Estimation of te and Ee • te and Ee are estimated from the pitch marks

  10. Voice source model: Estimation of tc, tp and to [Gobl & Ní Chasaide]

  11. Voice source model: Estimation of ta • Fs: sampling frequency • m: slope of the tangent at t=te
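
The slide's formula for ta is not preserved in the transcript. In the LF-model, ta is conventionally the time at which the tangent to the return phase at te crosses zero, so one plausible reading is ta = Ee / (m · Fs), with m measured per sample. The sketch below follows that reading. Both the formula and the forward-difference slope estimate are assumptions, not the authors' stated method.

```python
def estimate_ta(dgs, ne, fs):
    """Estimate ta (seconds) from the inverse-filtered signal `dgs`.

    ne : sample index of te (the negative peak, dgs[ne] = -Ee)
    fs : sampling frequency in Hz
    """
    Ee = -dgs[ne]
    m = dgs[ne + 1] - dgs[ne]      # slope at te, in amplitude per sample
    if m <= 0:
        raise ValueError("signal does not rise after te")
    return Ee / (m * fs)           # tangent reaches zero after Ee / m samples
```

On real inverse-filtered speech, a single forward difference is sensitive to noise, so a slope fitted over a few samples after te may be more robust.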

  12. Voice source model: Examples of the estimated parameters • Curves of the LF-parameters for two voiced regions of an utterance

  13. System: General description • Nitech-HTS 2005 system • STRAIGHT method for analysis and synthesis • Mixed multi-band excitation with phase manipulation / pulse train • Mel Log Spectrum Approximation (MLSA) filter • How was the LF-model integrated in the synthesizer?

  14. System: Generation of the periodic excitation (pulse signal) • Pulse centered within the frame • Multiplied by asymmetric windows • Summed with Gaussian noise

  15. System: Periodic excitation with the LF-model • 2 LF-waveforms centered at the instant te • Multiplied by asymmetric windows • Summed with Gaussian noise

  16. System: Technical problem • Problem: the synthesis filter assumes the excitation to have a flat spectrum, like the pulse train • Solution: post-filter • Linear-phase FIR filter: -6 dB/dec for 1 Hz ≤ f ≤ Fg; +6 dB/dec for Fg < f ≤ Fc; +12 dB/dec for Fc < f ≤ 16 kHz
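
The three slopes of the post-filter define a piecewise-linear magnitude response on a log-frequency axis. The sketch below evaluates that target gain in dB relative to 1 Hz. It assumes the segments join continuously at Fg and Fc and that 1 Hz ≤ Fg ≤ Fc. It only describes the target response and does not design the linear-phase FIR filter itself.

```python
import math

def postfilter_gain_db(f, fg, fc):
    """Target post-filter gain in dB at frequency f (Hz), relative to 1 Hz.

    Slopes from the slide: -6 dB/dec up to Fg, +6 dB/dec from Fg to Fc,
    +12 dB/dec above Fc. Continuity at the breakpoints is assumed.
    """
    if f < 1.0:
        raise ValueError("response is defined from 1 Hz upwards")
    gain = -6.0 * math.log10(min(f, fg))
    if f > fg:
        gain += 6.0 * math.log10(min(f, fc) / fg)
    if f > fc:
        gain += 12.0 * math.log10(f / fc)
    return gain
```

Sampling this function on a frequency grid would give a magnitude specification that a standard frequency-sampling FIR design could approximate.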

  17. System: Effect of the post-filtering

  18. Perceptual evaluation: Generation of the stimuli • Built US-English voice EM001, provided by ATR for the Blizzard Challenge • Glottal parameters were measured in 8 utterances and the mean values were calculated • Simple excitation, without multi-band noise or phase manipulation • Ten utterances were synthesized, using the LF-model and the pulse model

  19. Perceptual evaluation: Experiment • Forced-choice test • Presented via a web browser interface • Subjects were asked if they used headphones or speakers, and if they were native speakers (U.K./U.S.) • 18 listeners (7 native speakers of English) • The listener panel was mainly university students and staff • Examples of test speech signals (audio): pulse, LF-model

  20. Perceptual evaluation: Results

  21. Conclusions • The Nitech-HTS 2005 speech synthesizer was implemented with the LF-model for the voice source • Results showed that the LF-model can give better speech quality than the traditionally used pulse train • The direct methods used to estimate the mean LF-parameters seemed to perform well • A technical problem with the integration of the LF-model in the system was solved using a post-filter

  22. Future work • To find better analysis/synthesis methods to use with the LF-model in HMM-based speech synthesis • To evaluate the speech quality when using mixed excitation with the LF-model • To implement voice quality transformations using the LF-model • To evaluate the parameterization methods • To model the glottal parameters with HMMs

  23. Acknowledgements • This work was financially supported by the Marie Curie EdSST programme. • Thank you!
