A Database of Vocal Tract Resonance Trajectories for Research in Speech Processing L. Deng (Microsoft Research, Redmond) X. Cui & A. Alwan (U. California, Los Angeles) R. Pruvenok (Georgia Institute of Technology, Atlanta) J. Huang (Carnegie Mellon U., Pittsburgh) S. Momen (Princeton U., Princeton) Y. Chen (Cornell U., Ithaca)
Introduction • Joint research project between MSR & IPAM of UCLA • Carried out during 2005 NSF-RIPS summer program • Main Goals: • Create a database of VTR/formant trajectories for research in speech processing (ground truth). • Quantitatively assess various existing automatic VTR/formant tracking algorithms
Background • Vocal tract resonance (VTR, or formant-I) --- acoustic resonance of the human vocal tract during speech production • May differ from spectral peaks measured from the speech signal (formant-II) • Importance of VTR/formants for speech perception and production • Many techniques exist for automatic VTR or formant-II extraction
Background (cont’d) • Difficulty of automatic VTR/formant tracking • When two formants are close to each other (e.g., /iy, y, uw, r/) • Consonants whose VTRs are not directly visible in the spectrogram (e.g., nasals, fricatives, stops) • CV or VC transitions • Lack of a standard database for quantitative evaluation of tracking algorithms • Requirement for extensive human expertise
Data Selection • Subset of TIMIT utterances • 538 utterances in total • 192 utterances in core test set • 346 utterances in training set (173 speakers; one SX & one SI for each) • Balance of speaker, dialect, gender, & phoneme distributions
VTR Trajectory Labeling • Start from the results of a previous VTR tracking algorithm (ICASSP 2004 paper) • Develop a software tool for manual error correction using spectrogram display • Use human expertise
Human Expertise • Prior knowledge of nominal VTR target values for individual phones • Contextual effects on VTR values (target-directed trajectories) • Overall spectral properties across the entire utterance (same phones at different times) • Effects of anti-resonances in splitting VTRs of nasalized vowels • Special formant movement patterns (e.g., velar pinch, etc.)
Two Automatic Algorithms • WaveSurfer (http://www.speech.kth.se/wavesurfer) (same algorithm as ESPS/xwaves; Talkin et al.) • based on LPC analysis and dynamic programming • MSR hidden-dynamic-model-based algorithm • Implemented by Kalman filter/smoother • Piecewise-linearized mapping from VTR to cepstra • By-product of a speech recognizer • Tying all phone VTR targets • Details in ICASSP 2004 paper
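Neither tracker's implementation is reproduced in these slides. As a rough illustration of the LPC-analysis step underlying WaveSurfer-style trackers, the sketch below estimates per-frame formant candidates from the roots of the LPC polynomial; all function names, the model order, and the bandwidth threshold are illustrative assumptions, and the dynamic-programming stage that links candidates across frames is omitted.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LPC coefficients via the Levinson-Durbin recursion."""
    n = len(frame)
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a_new = a.copy()
        a_new[1:i] += k * a[i - 1:0:-1]   # reflect previous coefficients
        a_new[i] = k
        a = a_new
        err *= 1.0 - k * k
    return a

def formant_candidates(frame, fs, order=12):
    """Formant candidates (Hz) from the angles of the LPC polynomial roots."""
    a = lpc(frame * np.hamming(len(frame)), order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]      # keep one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bands = -np.log(np.maximum(np.abs(roots), 1e-12)) * fs / np.pi
    # keep sharp resonances away from 0 Hz and the Nyquist frequency
    return sorted(f for f, b in zip(freqs, bands) if 90 < f < fs / 2 - 90 and b < 400)
```

A full tracker then runs dynamic programming over these per-frame candidates to select smooth F1-F4 trajectories; that continuity constraint, not the raw LPC analysis, is what the slides' difficult cases (close formants, consonants, transitions) stress.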
Comparisons of Two Algorithms • [Figure: tracking results of the two algorithms for the TIMIT utterance "His failure to open the store by eight cost him his job"; spectrogram not reproduced]
Comparisons of Two Algorithms • [Figure: tracking results of the two algorithms for the TIMIT utterance "We always thought we would die with our boots on"; spectrogram not reproduced]
Computing Formant Tracking Errors: Focusing on Transitions
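The slides do not spell out the exact error metric; as a hedged sketch of the idea, the helper below computes a mean absolute deviation between a tracker's formant trajectory and the hand-corrected reference, restricted to frames flagged as CV/VC transitions. The function name, the transition mask, and all frequency values are illustrative.

```python
import numpy as np

def transition_tracking_error(ref_hz, hyp_hz, transition_mask):
    """Mean absolute deviation (Hz) between reference and tracked formant
    trajectories, evaluated only over transition frames."""
    ref = np.asarray(ref_hz, dtype=float)
    hyp = np.asarray(hyp_hz, dtype=float)
    mask = np.asarray(transition_mask, dtype=bool)
    return float(np.mean(np.abs(ref[mask] - hyp[mask])))

# illustrative F2 trajectories (Hz) over six frames spanning a CV transition
ref = [1200, 1350, 1500, 1650, 1800, 1800]
hyp = [1180, 1300, 1560, 1700, 1790, 1805]
mask = [False, True, True, True, False, False]
print(transition_tracking_error(ref, hyp, mask))  # mean of 50, 60, 50 -> ~53.3 Hz
```

Restricting the average to transition frames is what lets the evaluation isolate the regions the slides identify as hardest for automatic trackers.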
Summary and Conclusion • VTR/formants are critical for speech production, perception, and processing • Prior to this work, no standard database existed • Created a database using human expertise • Immediate application: quantitative evaluation of automatic VTR/formant tracking algorithms • Second-pass verification & correction at MSR recently completed • Data soon to be publicly released from both the MSR and UCLA sites