Recent Work on Acoustic Modeling for CTS at ISL
Florian Metze, Hagen Soltau, Christian Fügen, Hua Yu
Interactive Systems Laboratories
Universität Karlsruhe, Carnegie Mellon University
Overview
• ISL's RT-03 system revisited: system combination of Tree-150 & Tree-6
• Richer acoustic modeling
  • Across-phone clustering
  • Gaussian transition modeling
• Modalities
• Articulatory features
EARS Workshop, December 2003, St. Thomas
Decoding Strategy
• System combination
  • Combine Tree-150 and Tree-6, with 8 ms and 10 ms output
  • Confusion networks over multiple lattices, then ROVER
  • Confidences computed from the combined confusion networks
• Best single output (Tree-150): 25.4
• CNC + ROVER: 24.9
• Results on eval03
  • Tree-150 single system: 24.2
  • CNC + ROVER: 23.4
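The combination step above merges several system outputs by voting. As a minimal illustration (not the actual ISL implementation), the sketch below votes over word slots that are assumed to be already aligned across systems; the weighting `alpha` between vote counts and confidences, the `"@"` null-arc symbol, and the function name are hypothetical simplifications of real confusion-network combination:

```python
from collections import defaultdict

def rover_vote(slots, alpha=0.5):
    """ROVER-style voting: in each aligned slot, score every candidate
    word by a mix of its vote share and its confidence, keep the winner.
    `slots` is a list of slots; each slot lists (word, confidence) pairs,
    one per system. "@" marks a null (deletion) arc."""
    output = []
    for slot in slots:
        scores = defaultdict(float)
        n = len(slot)
        for word, conf in slot:
            # alpha trades off raw vote count against confidence
            scores[word] += alpha * (1.0 / n) + (1.0 - alpha) * conf
        best = max(scores, key=scores.get)
        if best != "@":  # a winning null arc emits nothing
            output.append(best)
    return output
```

With three systems agreeing on most slots, the voted output simply follows the majority unless a minority word carries much higher confidence.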
Vocabulary
• Vocabulary size: 41k words selected from SWB, BN, CNN
• Pronunciation variants: 95k entries generated by a rule-based approach
• Pronunciation probabilities: from frequencies (forced alignment of the training data)
  • Viterbi decoding: penalties (max = 1)
  • Confusion networks: true probabilities (sum = 1)
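The two normalizations above can be sketched directly: the same forced-alignment counts yield penalties (best variant scored 1) for Viterbi decoding, or a proper distribution (variants summing to 1) for confusion networks. The function name and dictionary layout are illustrative, not the ISL code:

```python
def pron_scores(counts, mode="sum"):
    """Turn forced-alignment counts of pronunciation variants into scores.
    mode="sum": true probabilities (sum to 1), for confusion networks.
    mode="max": penalties relative to the most frequent variant (max = 1),
    for Viterbi decoding."""
    total = sum(counts.values())
    best = max(counts.values())
    denom = total if mode == "sum" else best
    return {variant: c / denom for variant, c in counts.items()}
```

For example, a variant seen 70 times out of 100 gets probability 0.7 in "sum" mode but penalty 1.0 (it is the best variant) in "max" mode.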
Clustering
• Entropy-based divisive clustering
• Standard approach:
  • Grow a tree for each context-independent HMM state
  • 50 phones, 3 states: 150 trees
• Alternative: clustering across phones
  • Global tree: parameter sharing across phones
  • A single global tree is computationally expensive, so we cluster 6 trees (begin, middle, end for vowels and consonants)
• Quint-phone context
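Divisive clustering grows each tree by repeatedly picking the question with the best split criterion. The real criterion is a likelihood gain over Gaussian statistics; as a hedged stand-in, the sketch below uses discrete count entropy, which shows the same greedy question-selection mechanics:

```python
import math

def entropy(counts):
    """Entropy (in bits) of a discrete count distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total)
                for c in counts.values() if c)

def split_gain(parent, yes, no):
    """Information gain of splitting the `parent` counts into the
    yes/no answer sets of one candidate decision-tree question.
    The question maximizing this gain is applied, and the two
    children are split recursively."""
    n_p = sum(parent.values())
    n_y, n_n = sum(yes.values()), sum(no.values())
    return (entropy(parent)
            - (n_y / n_p) * entropy(yes)
            - (n_n / n_p) * entropy(no))
```

A question that separates two equally frequent classes perfectly has gain 1 bit; a question whose yes/no sides mirror the parent distribution has gain 0.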
Motivation for Alternative Clustering
• Pronunciation modeling is important for recognizing conversational speech
• Adding pronunciation variants often gives only marginal improvements due to increased confusability
• Case study: flapping of /T/
  BETTER    B EH T AXR
  BETTER(2) B EH DX AXR
• The dictionary then contains only a single pronunciation, and the phonetic decision tree chooses whether or not to flap /T/
Clustering Across Phones: Tree Construction
• How to grow a single tree? We expand the question set to allow questions about the sub-state identity and the center phone identity.
• Computationally expensive on 600k SWB quint-phones
• Two dictionaries:
  • Conventional dictionary with 2.2 variants per word
  • (Almost) single-pronunciation dictionary with 1.1 variants per word
• A simple procedure reduces the number of pronunciation variants: variants with a relative frequency below 20% are removed; for unobserved words, only the baseform is kept.
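The variant-reduction rule stated above (drop variants under 20% relative frequency, keep only the baseform for unobserved words) is simple enough to sketch end to end; the data structures and function name are hypothetical:

```python
def prune_variants(lexicon, counts, threshold=0.2):
    """Reduce a multi-pronunciation lexicon toward a (near-)single
    pronunciation one. `lexicon` maps word -> list of variants, with
    the baseform first; `counts` maps (word, variant) -> forced-alignment
    count. Variants whose relative frequency is below `threshold` are
    removed; words never observed keep only their baseform."""
    pruned = {}
    for word, variants in lexicon.items():
        obs = [counts.get((word, v), 0) for v in variants]
        total = sum(obs)
        if total == 0:
            pruned[word] = [variants[0]]  # unobserved: baseform only
        else:
            kept = [v for v, c in zip(variants, obs)
                    if c / total >= threshold]
            pruned[word] = kept or [variants[0]]  # never leave a word empty
    return pruned
```

Applied to the BETTER example, if the flapped form dominates the alignments, the unflapped variant falls below 20% and is pruned, leaving 1 variant for the word.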
Clustering Across Phones
[Figure: example decision tree with questions such as 0=vowel?, 0=obstruent?, 0=begin-state?, -1=syllabic?, 0=mid-state?, -1=obstruent?, 0=end-state?, leading to tied models AX-b, IX-m, AX-m]
• Allows better parameter tying (tying now possible across phones and sub-states)
• Alleviates lexical problems (over-specification and inconsistencies): no need for an optimal phone set; preferable for multilingual / non-native speech recognition
• Implicitly models subtle reductions in sloppy speech
Clustering Across Phones: Experiments
• Cross-substate clustering does not make any difference
• Cross-phone clustering with 6 trees: {vowel|consonant}-{b|m|e}
• The single-pronunciation lexicon has 1.1 variants per word (instead of 2.2)
Analysis
[Figure: part of the Vowel-b tree, with questions such as -1=voiced?, -1=consonant?, 0=high-vowel?, 1=front-vowel?, -1=obstruent?, 0=L|R|W?]
• Flexible tying works better with the single-pronunciation lexicon: higher consistency, data-driven approach
• Significant cross-phone sharing: ~30% of the leaf nodes are shared by multiple phones
• Commonly tied vowels: AXR & ER, AE & EH, AH & AX; commonly tied consonants: DX & HH, L & W, N & NG
Gaussian Transition Modeling
• A linear sequence of GMMs may contain a mix of different model sequences.
• To further distinguish these paths, we model transitions between Gaussians in adjacent states.
Frame-Independence Assumption
• The HMM assumes each speech frame to be conditionally independent given the hidden state sequence
[Figure: HMM as a generative model — a sequence of frames emitted by a sequence of models]
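Under this assumption the likelihood of the frames given a fixed state path factorizes into per-frame terms, so the joint log-probability is just a sum. A minimal sketch (function names are illustrative):

```python
import math

def path_log_likelihood(frames, states, emission_logp):
    """Log-likelihood of a frame sequence given a fixed HMM state path
    under the frame-independence assumption: frames are conditionally
    independent given the states, so
        log p(x_1..x_T | q_1..q_T) = sum_t log p(x_t | q_t).
    `emission_logp(state, frame)` returns one per-frame emission score."""
    return sum(emission_logp(q, x) for q, x in zip(states, frames))
```

It is exactly this factorization that Gaussian transition modeling relaxes, by letting the score of frame t depend on which Gaussian explained frame t-1.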
Gaussian Transition Modeling
• GTM models transition probabilities between Gaussians
GTM for Modeling Sloppy Speech
• Partial reduction/realization may be better modeled at the sub-phoneme level
• GTM can be thought of as a pronunciation network at the Gaussian level
• GTM can handle a large number of trajectories
• Advantages over parallel-path HMMs / segmental HMMs, where:
  • The number of paths is very limited
  • It is hard to determine the right number of paths
Experiments
• GTM can be readily trained with the Baum-Welch algorithm
• Data sufficiency is an issue, since we are modeling a first-order variable
• Pruning transitions is important (backing off)
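The pruning step above can be sketched as follows: estimate Gaussian-to-Gaussian transition probabilities from aligned counts, drop transitions below a floor, and redistribute the pruned mass. This is a simplified stand-in for the actual backing-off scheme; the floor value, mass redistribution, and function name are assumptions:

```python
from collections import defaultdict

def estimate_gtm(pairs, floor=0.01):
    """Estimate Gaussian transition probabilities from a list of
    (prev_gaussian, next_gaussian) pairs taken from aligned training
    frames. Transitions with probability below `floor` are pruned;
    their mass is spread uniformly over the surviving transitions
    (a crude stand-in for a proper backoff distribution)."""
    counts = defaultdict(lambda: defaultdict(int))
    for g1, g2 in pairs:
        counts[g1][g2] += 1
    model = {}
    for g1, nxt in counts.items():
        total = sum(nxt.values())
        kept = {g2: c / total for g2, c in nxt.items()
                if c / total >= floor}
        if not kept:  # degenerate case: keep at least the best transition
            top = max(nxt, key=nxt.get)
            kept = {top: nxt[top] / total}
        leftover = 1.0 - sum(kept.values())
        model[g1] = {g2: p + leftover / len(kept) for g2, p in kept.items()}
    return model
```

The floor directly controls the data-sufficiency trade-off noted above: a higher floor gives a compact model but risks discarding trajectories that are merely unseen in limited training data.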
Experiments II
• GTM offers better discrimination between trajectories
  • All trajectories are nonetheless still allowed.
  • Pruning away unlikely transitions leads to a more compact and prudent model.
  • However, we need to be careful not to prune away trajectories that are merely unseen in a limited training set.
• Using a first-order acoustic model in decoding requires maintaining the left history, which is expensive at word boundaries. A Viterbi approximation is used in the current implementation.
• Log-likelihood improvement during Baum-Welch training: -50.67 to -49.18
Modalities
• We would like to include additional information in divisive clustering, e.g.:
  • Gender
  • Signal-to-noise ratio
  • Speaking rate
  • Speaking style (normal vs. hyper-articulated)
  • Dialect
  • Show type, data type (CNN, NBC, ...)
• Data-driven approach: sharing is still possible
Modalities II
[Figure: decision tree with modality questions such as -1=vowel?, -1=obstruent?, 0=Bavarian?, -1=syllabic?, 0=Swabian?, 0=female?]
• Suitable for different corpora?
• Examples:
  • German dialects
  • Male / female
Modalities III
• Tested on German Verbmobil data
  • Not enough time to test on SWB / RT-03
• Proved beneficial in several applications
• Labeled data needed
• Our tests were not done on highly optimized systems (VTLN)
• Hyper-articulation: -1.7% for hyper-articulated speech, +0.3% for normal speech
Modalities Results
Articulatory Features
• Idea: combine very specific sub-phone models with generic models
• Articulatory features are linguistically motivated: /F/ = UNVOICED, FRICATIVE, LAB-DNT, ...
• Introduce new degrees of freedom for:
  • Modeling
  • Adaptation
• Integrate into the existing architecture; use existing training techniques (GMMs) for the feature detectors
• Articulatory (voicing) features in the front-end did not help
Articulatory Features
• Output from feature detectors: p(FEAT) - p(NON_FEAT) + p0
Articulatory Features
• Asymmetric stream setup: ~4k models
  • ~4k GMMs in stream 0
  • 2 GMMs in streams 1...N ("feature streams")
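In a multi-stream setup of this kind, the final acoustic score for a state is typically a weighted (log-linear) combination of the main stream-0 GMM score with the detector scores from streams 1..N. A minimal sketch; the weight values and function name are illustrative assumptions, not the ISL configuration:

```python
def combined_score(main_logp, feature_logps, weights):
    """Log-linear combination of an asymmetric stream setup:
    `main_logp` is the stream-0 score (one GMM per tied state),
    `feature_logps` are the scores of the binary feature streams
    (each backed by a feature / anti-feature GMM pair), and
    `weights[0], weights[1:]` are the per-stream weights."""
    score = weights[0] * main_logp
    for w, lp in zip(weights[1:], feature_logps):
        score += w * lp
    return score
```

Keeping stream 0 dominant (weight near 1) and the feature streams small lets the detectors nudge scores without overriding the main acoustic model, which fits the "new degrees of freedom" framing on the previous slides.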
Articulatory Features: Results I
• Test on read speech (BN-F0): 13.4% → 11.6% with articulatory features
• Test on multilingual data: 13.1% → 11.5% (English with ML detectors)
• Significant improvements also seen on:
  • Hyper-articulated speech
  • Spontaneous, clean speech (ESST)
Articulatory Features: Results II
• Test on Switchboard (RT-03 devset):

           Corr  Sub   Del   Ins  WER   S.Err
  Baseline 72.5  20.0   7.5  4.4  31.9  67.2
  Features 68.3  18.3  13.4  2.2  33.9  68.4

• Result: substitutions and insertions go down, deletions go up
• No overall improvement yet; we will work on the setup
Thank You, ... the ISL team!
Related Work
• D. Jurafsky et al.: What kind of pronunciation variation is hard for triphones to model? ICASSP 2001
• T. Hain: Implicit pronunciation modeling in ASR. ISCA Pronunciation Modeling Workshop, 2002
• M. Saraclar et al.: Pronunciation modeling by sharing Gaussian densities across phonetic models. Computer Speech and Language, Apr. 2000
Related Work
• R. Iyer et al.: Hidden Markov models for trajectory modeling. ICSLP 1998
• M. Ostendorf et al.: From HMMs to segment models: A unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing, 1996
Publications
• F. Metze and A. Waibel: A Flexible Stream Architecture for ASR Using Articulatory Features. ICSLP 2002, Denver, CO
• C. Fügen and I. Rogina: Integrating Dynamic Speech Modalities into Context Decision Trees. ICASSP 2000, Istanbul, Turkey
• H. Yu and T. Schultz: Enhanced Tree Clustering with Single Pronunciation Dictionary for Conversational Speech Recognition. Eurospeech 2003, Geneva
• H. Soltau, H. Yu, F. Metze, C. Fügen, Q. Jin, and S. Jou: The ISL Transcription System for Conversational Telephony Speech. Submitted to ICASSP 2004, Vancouver
• ISL web page: http://isl.ira.uka.de