Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis

João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi
The Centre for Speech Technology Research, The University of Edinburgh

Outline
• Introduction
• Voice source model
• System
• Perceptual evaluation
• Concluding remarks
• Future work

Introduction: HMM-based speech synthesizer [Tokuda et al]
[Figure: block diagram of the synthesizer. Labels: training speech, F0 extraction, spectral features estimation, text, text analysis, HMMs, F0, spectrum, pulse train, noise component, synthesis filter, synthetic speech.]

Voice source model: Obtaining the glottal source signal
• Source-filter model: source Ug → vocal tract A(z) → lip radiation (d/dz) → speech
• Inverse filtering: speech → inverse filter 1/A(z) → lip radiation cancellation (∫)
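The two chains above can be written compactly in the z-domain. This is a sketch in the slide's own notation (A(z) for the vocal tract, 1/A(z) for the inverse filter); the first-order approximations for the lip radiation L(z) and for its cancellation, and the symbol ρ, are assumptions not stated in the slides.

```latex
\begin{align}
S(z)   &= U_g(z)\,A(z)\,L(z), & L(z) &\approx 1 - z^{-1} \\
U_g(z) &= S(z)\,\frac{1}{A(z)}\,\frac{1}{L(z)}, & \frac{1}{L(z)} &\approx \frac{1}{1 - \rho z^{-1}}, \quad \rho \lesssim 1
\end{align}
```

Under these assumptions, skipping the integration step 1/L(z) leaves the derivative of the glottal source rather than the source itself.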

Voice source model: Liljencrants-Fant model (LF-model)
• T : period
• to : opening instant
• tp : instant of max airflow
• te : instant of max excitation
• ta : return phase duration
• tc : closing instant
• Ee : excitation amplitude
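To make the role of these parameters concrete, the following is a minimal sketch of one period of the LF-model waveform, using the standard two-segment formulation (exponentially growing sinusoid up to te, exponential return phase up to tc). The function name `lf_pulse`, the choice to = 0, the default tc = T and the numerical solvers are illustrative assumptions, not the authors' implementation.

```python
import math

def lf_pulse(fs, T, tp, te, ta, tc=None, Ee=1.0):
    """One period of the LF-model glottal flow derivative, assuming to = 0.

    Requires tp < te < 2*tp and ta < tc - te. Times are in seconds.
    """
    tc = T if tc is None else tc
    tb = tc - te                      # time available for the return phase
    wg = math.pi / tp                 # the flow (integral) peaks at tp

    # Return-phase constant from eps * ta = 1 - exp(-eps * tb),
    # solved by fixed-point iteration.
    eps = 1.0 / ta
    for _ in range(100):
        eps = (1.0 - math.exp(-eps * tb)) / ta

    # Closed-form area under the return phase (a negative number).
    area_ret = -(Ee / (eps * ta)) * (
        (1.0 - math.exp(-eps * tb)) / eps - tb * math.exp(-eps * tb))

    s, c = math.sin(wg * te), math.cos(wg * te)

    def net_area(alpha):
        # Area of the open phase, with E0 chosen so that E(te) = -Ee,
        # plus the area of the return phase.
        open_area = -Ee * (alpha * s - wg * c + wg * math.exp(-alpha * te)) \
            / ((alpha * alpha + wg * wg) * s)
        return open_area + area_ret

    # Growth factor alpha from the zero-net-flow condition, by bisection.
    lo, hi = -1.0e4, 1.0e5
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if net_area(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    E0 = -Ee / (math.exp(alpha * te) * s)

    wave = []
    for i in range(int(round(T * fs))):
        t = i / fs
        if t <= te:                   # open phase
            wave.append(E0 * math.exp(alpha * t) * math.sin(wg * t))
        elif t <= tc:                 # return phase
            wave.append(-(Ee / (eps * ta))
                        * (math.exp(-eps * (t - te)) - math.exp(-eps * tb)))
        else:                         # closed phase
            wave.append(0.0)
    return wave
```

For example, `lf_pulse(16000, 0.01, 0.004, 0.0055, 0.0003)` gives a 160-sample period whose most negative value, -Ee, falls at te. The parameter values in this call are made up for illustration.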

Voice source model: Other parameters of the LF-model
• Open quotient
• Speed quotient
• Return quotient
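The expressions for the three quotients are not given in the text above. One convention found in the LF-model literature, taking the opening instant to as the time origin, is sketched below. It is an assumption, and the definitions used by the authors may differ (for example, some authors define the open quotient as te/T, or the return quotient as (tc - te)/T).

```latex
\begin{align}
OQ &= \frac{t_e + t_a}{T} \\
SQ &= \frac{t_p}{t_e - t_p} \\
RQ &= \frac{t_a}{T}
\end{align}
```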

Voice source model: Description of the LF-model spectrum
Linear stylization of the LF-model spectrum [Doval and d’Alessandro]
• Fg : glottal spectral peak
• Fc : spectral tilt

Voice source model: Feature extraction
• utterances sampled at 16 kHz
• pitch-synchronous analysis (ESPS tools)
• LPC coefficients calculated over 20 ms windows centered at the glottal epochs
• inverse filtering to estimate DGS
• pre-emphasis filter (α=0.97)
• low-pass filtering of the residual at 4 kHz
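The analysis chain above can be sketched for a single frame as follows. This is a plain-Python illustration under several assumptions: LPC uses the autocorrelation method with a Hamming window, the LPC order (18) is hypothetical, pre-emphasis is applied only when estimating the filter, and the 4 kHz low-pass step and the pitch-synchronous framing done with the ESPS tools are omitted. All function names are illustrative.

```python
import math

def preemphasis(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def lpc(frame, order):
    """LPC polynomial [1, a1, ..., ap] via autocorrelation + Levinson-Durbin."""
    n = len(frame)
    # Hamming window applied to the analysis frame.
    w = [frame[i] * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
         for i in range(n)]
    r = [sum(w[i] * w[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    if err == 0.0:                    # silent frame: nothing to predict
        return a
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def inverse_filter(x, a):
    """Apply the FIR filter A(z): e[n] = sum_j a[j] * x[n-j]."""
    p = len(a) - 1
    return [sum(a[j] * x[n - j] for j in range(min(p, n) + 1))
            for n in range(len(x))]

def estimate_dgs(frame, order=18):
    """Residual of the frame under an LPC filter estimated after pre-emphasis.

    The order is a hypothetical choice; the slides do not state it.
    """
    a = lpc(preemphasis(frame), order)
    return inverse_filter(frame, a)
```

At 16 kHz, a 20 ms window is 320 samples, so `estimate_dgs` would be called on 320-sample frames centered at each glottal epoch.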

Voice source model: Estimation of te and Ee
• te and Ee are estimated from the pitch-marks
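The slide does not say how the pitch-marks are turned into te and Ee. One plausible direct method, sketched here as an assumption, is to look for the most negative sample of the inverse-filtered signal in a short window around each pitch-mark: its time gives te and its magnitude gives Ee. The function name and the 1 ms search width are hypothetical.

```python
def estimate_te_ee(dgs, pitch_marks, fs, search_ms=1.0):
    """Return a list of (te_seconds, Ee) pairs, one per pitch-mark.

    dgs         : samples of the inverse-filtered signal
    pitch_marks : sample indices of the pitch-marks
    search_ms   : half-width of the search window (hypothetical value)
    """
    half = max(1, int(round(search_ms * 1e-3 * fs)))
    result = []
    for pm in pitch_marks:
        lo = max(0, pm - half)
        hi = min(len(dgs), pm + half + 1)
        # Index of the most negative sample near the pitch-mark.
        idx = min(range(lo, hi), key=lambda i: dgs[i])
        result.append((idx / fs, -dgs[idx]))
    return result
```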

Voice source model: Estimation of tc, tp and to [Gobl & Chasaide]

Voice source model: Estimation of ta
• Fs : sampling frequency
• m : slope of the tangent at t=te
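In the LF-model, ta is the time at which the tangent to the return phase at te reaches zero, so a slope-based estimate follows directly. The sketch below assumes m is measured in amplitude units per sample as a one-sample forward difference, which gives ta = Ee / (m · Fs). The slide's own expression is not preserved, so this form and the function name are assumptions.

```python
def estimate_ta(dgs, te_index, fs):
    """Estimate ta (seconds) from the slope of the signal just after te.

    dgs      : samples of the inverse-filtered signal
    te_index : sample index of te (most negative sample of the period)
    Returns None when the slope is not positive.
    """
    ee = -dgs[te_index]                       # excitation amplitude
    m = dgs[te_index + 1] - dgs[te_index]     # slope in amplitude per sample
    if m <= 0.0:
        return None
    return ee / (m * fs)
```

A one-sample difference underestimates the slope when the return phase is short relative to the sampling period, so this estimate tends to be somewhat too large in that case.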

Voice source model: Examples of the estimated parameters
Curves of the LF-parameters for 2 voiced regions of an utterance

System: General description
- Nitech-HTS 2005 system
- STRAIGHT method for analysis and synthesis
- mixed multi-band excitation with phase manipulation / pulse train
- Mel Log Spectrum Approximation (MLSA) filter
How was the LF-model integrated in the synthesizer?

System: Generation of the periodic excitation (pulse signal)
• Pulse centered within the frame
• multiplied by asymmetric windows
• summed with Gaussian noise

System: Periodic excitation with the LF-model
• 2 LF-waveforms centered at the instant te
• multiplied by asymmetric windows
• summed with Gaussian noise

System: Technical problem
• Problem: the synthesis filter assumes the excitation to have a flat spectrum like the pulse train
• Solution: post-filter (linear phase FIR filter):
  -6dB/dec for 1Hz ≤ f ≤ Fg (Hz)
  +6dB/dec for Fg < f ≤ Fc (Hz)
  +12dB/dec for Fc < f ≤ 16 kHz
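The piecewise slopes of the post-filter can be turned into a target gain curve and, from there, into a linear-phase FIR filter. The sketch below uses frequency sampling for the FIR design, which is a swapped-in technique: the slides do not say how the filter was designed. The 0 dB reference at 1 Hz, the continuity of the gain at Fg and Fc, the number of taps and all function names are also assumptions.

```python
import math

def target_gain_db(f, fg, fc):
    """Target gain in dB: -6 dB/dec up to fg, +6 up to fc, +12 above fc."""
    f = max(f, 1.0)                           # slopes are defined from 1 Hz
    if f <= fg:
        return -6.0 * math.log10(f)
    gain_at_fg = -6.0 * math.log10(fg)
    if f <= fc:
        return gain_at_fg + 6.0 * math.log10(f / fg)
    gain_at_fc = gain_at_fg + 6.0 * math.log10(fc / fg)
    return gain_at_fc + 12.0 * math.log10(f / fc)

def design_post_filter(fs, fg, fc, num_taps=511):
    """Odd-length, symmetric (linear-phase) FIR by frequency sampling."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    m = (num_taps - 1) // 2
    # Desired amplitude at the DFT bin frequencies k * fs / num_taps.
    amp = [10.0 ** (target_gain_db(k * fs / num_taps, fg, fc) / 20.0)
           for k in range(m + 1)]
    taps = []
    for n in range(num_taps):
        acc = amp[0]
        for k in range(1, m + 1):
            acc += 2.0 * amp[k] * math.cos(2.0 * math.pi * k * (n - m) / num_taps)
        taps.append(acc / num_taps)
    return taps
```

With hypothetical values Fg = 100 Hz and Fc = 1000 Hz, the target gain is -12 dB at Fg, -6 dB at Fc and +6 dB at 10 kHz. The filter matches the target exactly at the sampled bin frequencies and only approximately in between.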

System: Effect of the post-filtering

Perceptual evaluation: Generation of the stimuli
• Built US-English voice EM001, provided by ATR for the Blizzard Challenge
• Glottal parameters were measured in 8 utterances and the mean values were calculated
• Simple excitation, without multi-band noise or phase manipulation
• Ten utterances were synthesized, using the LF-model and the pulse model

Perceptual evaluation: Experiment
• Forced-choice test
• Presented via a web browser interface
• Subjects were asked if they used headphones or speakers, and if they were native speakers (U.K./U.S.)
• 18 listeners (7 native speakers of English)
• The listener panel was mainly university students and staff
Example of test speech signals: Pulse, LF-model
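Preference counts from a two-alternative forced-choice test like this one are often checked with an exact two-sided binomial (sign) test against the 50% chance level. The sketch below shows that calculation; it is not taken from the slides, which do not say which test, if any, was used, and the function name is illustrative.

```python
import math

def sign_test_p_value(n_prefer_a, n_total):
    """Exact two-sided binomial p-value for H0: both systems equally preferred."""
    k = max(n_prefer_a, n_total - n_prefer_a)
    # Upper-tail probability P(X >= k) for X ~ Binomial(n_total, 0.5).
    tail = sum(math.comb(n_total, i) for i in range(k, n_total + 1)) / 2.0 ** n_total
    return min(1.0, 2.0 * tail)
```

As a made-up illustration (not data from this study), 8 preferences out of 10 judgments give `sign_test_p_value(8, 10)`, which evaluates to 0.109375.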

Perceptual evaluation: Results

Conclusions
• The Nitech-HTS 2005 speech synthesizer was implemented with the LF-model for the voice source
• Results showed that the LF-model can give better speech quality than the traditionally used pulse train
• Direct methods used for the estimation of the mean LF-parameters seemed to perform well
• A technical problem with the integration of the LF-model in the system was solved using a post-filter

Future work
• To find better analysis/synthesis methods to use with the LF-model in HMM-based speech synthesis
• To evaluate the speech quality when using the mixed excitation with the LF-model
• To implement voice quality transformations using the LF-model
• To evaluate the parameterization methods
• To model the glottal parameters with HMMs

Acknowledgements
This work was financially supported by the Marie Curie EdSST programme.
Thank you!
