Discriminative Feature Optimization for Speech Recognition

Bing Zhang, College of Computer & Information Science, Northeastern University

Outline • Introduction • Problem to attack • Methodology • Region-dependent feature transform • Discriminative optimization of the feature transform • Implementation • System description & results • Conclusions

Introduction • Speech recognition • Goal: transcribe speech into text • Performance measurement: word error rate (WER) • Typical approach: • Training: statistically model the acoustic and linguistic knowledge • Recognition: search for the most probable word sequence using the models • Speech feature extraction • Reason: raw signals cannot be robustly modeled because of their high dimensionality, so compact features have to be extracted • Two stages of feature extraction: • speech analysis → cepstral coefficients • speech feature transformation • In this thesis: a better feature transformation approach is developed to reduce the WER of the speech recognition system
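WER, the performance measure named above, is commonly computed as the word-level edit distance between the reference and the recognizer output, divided by the number of reference words. A minimal sketch under that standard definition; the function name and the example sentences are illustrative, not taken from the thesis:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("this is a test sentence", "this is a guest sentence")
```

Here one of five reference words is substituted, so `wer` is 0.2. Because insertions are counted, WER can exceed 1.0.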

Introduction (cont.) • A typical speech recognition system [Figure: speech signal → feature extraction → features → search engine → word sequence; the search engine draws on the acoustic model and the language model]

Language Model • N-grams • Model the conditional probability of a word given the N-1 preceding words • The product of N-gram probabilities approximates the probability of a word sequence: • $P(w_1, w_2, \ldots, w_k) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_1, \ldots, w_{N-1}) \cdots P(w_{k-1} \mid w_{k-N}, \ldots, w_{k-2})\, P(w_k \mid w_{k-N+1}, \ldots, w_{k-1})$ • Special cases: • Unigram: $P(w_i)$ • Bigram: $P(w_i \mid w_{i-1})$ • Trigram: $P(w_i \mid w_{i-2}, w_{i-1})$
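The bigram special case can be sketched in a few lines. This is an illustration only: it uses unsmoothed maximum-likelihood counts on a made-up three-sentence corpus, and the `<s>` / `</s>` boundary tokens are an assumption of the sketch. A real recognizer would use smoothing and back-off.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams, with <s> and </s> as sentence boundaries."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts


def sentence_probability(sentence, history_counts, bigram_counts):
    """Product of maximum-likelihood bigram probabilities P(w_i | w_{i-1})."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        prob *= bigram_counts[(prev, word)] / history_counts[prev]
    return prob


corpus = ["this is a test", "this is a sentence", "that is a test"]  # toy data
hist, bi = train_bigram(corpus)
p = sentence_probability("this is a test", hist, bi)
```

With this toy corpus, `p` is (2/3) · 1 · 1 · (2/3) · 1 = 4/9. Any sentence containing an unseen bigram gets probability 0, which is why smoothing matters in practice.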

HMM-based Acoustic Model • Repository of unit HMMs (hidden Markov models) • Each HMM is a probabilistic finite state machine with outputs at each hidden state • Transition probabilities • Observation probabilities (modeled by a mixture of Gaussians for each state) • Each HMM represents a basic unit of speech, e.g., phonemes, crossword/non-crossword multiphones • HMM state-clusters: specify which HMM states can share which parameters • Pronunciation dictionary: phonetic spelling of the words

Example of an HMM [Figure: left-to-right HMM with states Start, 1, 2, 3, 4, End; self-loops a11, a22, a33, a44; forward transitions a12, a23, a34; skip transitions a13, a24; observations o1, o2, …, o6]

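Example of an HMM [Figure: two state paths through the HMM for observations o1, …, o6. Path 1 visits states 1, 1, 2, 3, 3, 4 (transitions a11, a12, a23, a33, a34) with observation probabilities b1(o1), b1(o2), b2(o3), b3(o4), b3(o5), b4(o6). Path 2 visits states 1, 2, 2, 2, 4, 4 (transitions a12, a22, a24, a44) with observation probabilities b1(o1), b2(o2), b2(o3), b2(o4), b4(o5), b4(o6)]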
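The score of one state path is the product of the transition probabilities a_ij and the observation probabilities b_j(o_t) along it. A minimal sketch in the log domain, scoring the two paths from the figure. All numbers are hypothetical, and each state uses a single 1-D Gaussian instead of the mixture mentioned earlier, purely to keep the example short:

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def path_log_likelihood(path, observations, trans, emis):
    """Log of the product of transition and observation probabilities on one path."""
    if len(path) != len(observations):
        raise ValueError("need exactly one state per observation")
    states = ["Start"] + list(path) + ["End"]
    logp = sum(math.log(trans[(p, c)]) for p, c in zip(states, states[1:]))
    for state, obs in zip(path, observations):
        mean, var = emis[state]
        logp += log_gaussian(obs, mean, var)
    return logp


# Hypothetical parameters for a 4-state left-to-right topology with skips.
trans = {
    ("Start", 1): 1.0,
    (1, 1): 0.5, (1, 2): 0.3, (1, 3): 0.2,
    (2, 2): 0.5, (2, 3): 0.3, (2, 4): 0.2,
    (3, 3): 0.6, (3, 4): 0.4,
    (4, 4): 0.7, (4, "End"): 0.3,
}
emis = {1: (0.0, 1.0), 2: (1.0, 1.0), 3: (2.0, 1.0), 4: (3.0, 1.0)}  # (mean, variance)
observations = [0.1, -0.2, 1.1, 2.2, 1.9, 3.0]

path_a = [1, 1, 2, 3, 3, 4]  # first path in the figure
path_b = [1, 2, 2, 2, 4, 4]  # second path, using the skip 2 -> 4
score_a = path_log_likelihood(path_a, observations, trans, emis)
score_b = path_log_likelihood(path_b, observations, trans, emis)
```

Summing such path scores over all paths gives the total likelihood of the observations; recognition typically keeps only the best path (Viterbi).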


Acoustic Training • Maximum likelihood (ML) training • Objective: maximize the conditional likelihood of the observed features given the model • Algorithm: Expectation-maximization (EM) • Discriminative training • Objective: train the model to distinguish the correct word sequence from other hypotheses • Criterion • Minimum phoneme error (MPE) • Representation of hypotheses: lattices • Algorithm: Extended EM [Figure: example word lattice with arcs labeled this, the, is, a, test, quest, guest, sense, sentence, and SIL]

Feature Extraction • Speech analysis • Deals with the problem of extracting distinguishing characteristics (e.g., formant locations) of speech from digital signals • Examples: MFCC (Mel-frequency cepstral coefficients), PLP (perceptual linear prediction) • Resulting features: cepstral coefficients • Speech feature transformation • Applied on top of the cepstral coefficients • Transform the cepstral features to better fit the model • help the HMM to model the trajectory of the cepstral features • fit the diagonal covariance assumption of the Gaussian components

Commonly Used Feature Transforms • LDA (linear discriminant analysis) • Transforms the features to maximize the distance between different classes while keeping each class as compact as possible • Assumes that all classes have equal covariance • HLDA (heteroscedastic linear discriminant analysis) • Removes the equal covariance assumption of LDA • Finds the feature transform that maximizes the likelihood of the data with respect to the acoustic model in the transformed space • Others • HDA (heteroscedastic discriminant analysis) • MLLT (maximum likelihood linear transform)
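For two classes in two dimensions, LDA reduces to Fisher's discriminant: the 1-D projection direction is w = Sw⁻¹(m_a − m_b), where Sw is the within-class scatter pooled over both classes (this pooling is where the equal-covariance assumption enters). A minimal sketch on made-up data; it is not HLDA or MLLT and not the multi-class estimator used in a recognizer:

```python
def mean(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]


def scatter(points, m):
    """2x2 scatter matrix: sum over samples of (x - m)(x - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Return w = Sw^{-1} (m_a - m_b) for two 2-D classes."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-class scatter is singular")
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # Closed-form inverse of a 2x2 matrix applied to the mean difference.
    return [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
            (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]


# Toy classes: wide along the first axis, separated along the second.
class_a = [(-1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 1.2), (0.0, 0.8)]
class_b = [(-1.0, -1.0), (0.0, -1.0), (1.0, -1.0), (0.0, -0.8), (0.0, -1.2)]
w = fisher_direction(class_a, class_b)
```

The resulting direction points along the second axis, the only one that separates the classes. The singularity check is the failure mode that arises when input dimensions are linearly dependent.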

Drawbacks of Traditional Feature Transforms • Inaccurate assumptions about the acoustic model • LDA assumes equal class covariances • HDA & LDA ignore the diagonal covariance assumption • Linear transform • A linear transform has limited power for feature extraction • Using more powerful transforms can be risky when the criterion does not correlate with the WER • The criteria do not correlate with the WER • Performance degrades on high-dimensional input features • Experimental results in the thesis • Performance degrades on highly correlated input features • Example on the next slide

Example • The data has a linear dependency between two dimensions such that Z = 2X [Figure: scatter plots of the data with axes X, Y, and Z] • If projected to 1-D: • HLDA will map all samples to a single point • LDA will fail to find a solution at all, because the covariance matrix of each class is singular
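The singularity is easy to verify: when Z = 2X, the covariance matrix of (X, Z) has determinant exactly zero and cannot be inverted. A small sketch with hypothetical sample values, using exact rational arithmetic so that the zero is not blurred by rounding:

```python
from fractions import Fraction


def covariance_2d(xs, zs):
    """Exact 2x2 covariance matrix of paired integer samples."""
    n = len(xs)
    mx, mz = Fraction(sum(xs), n), Fraction(sum(zs), n)
    cxx = sum((x - mx) ** 2 for x in xs) / n
    czz = sum((z - mz) ** 2 for z in zs) / n
    cxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / n
    return [[cxx, cxz], [cxz, czz]]


xs = [1, 2, 4, 7]          # hypothetical samples of X for one class
zs = [2 * x for x in xs]   # Z = 2X, as on the slide
cov = covariance_2d(xs, zs)
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
```

Here `det` is exactly 0 even though the variance of X is nonzero, so any method that inverts the class covariance has no solution.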

A Better Approach • Region-dependent transform • Nonlinear • Computationally inexpensive to train • Discriminative training of the feature transform • Criterion correlates well with the WER • Detailed acoustic model in feature training

Region Dependent Transform (RDT) • RDT: • Divides the acoustic space into multiple regions • e.g., r1, r2, …, rN • Applies a different transform depending on which region the input feature vector belongs to • e.g., f1, f2, …, fN [Figure: acoustic space partitioned into regions r1, r2, …, rN with transforms f1, f2, …, fN] • To avoid making hard decisions when choosing which transform to apply, the posterior probabilities of the regions are used to interpolate the transformed results: $F_{RDT}(o) = \sum_{i=1}^{N} p(r_i \mid o)\, f_i(o)$, where $o$ is the input feature vector

More Details of RDT • Input features: long-span features • A long-span feature vector is formed by concatenating the cepstral features from consecutive frames, centered at the current frame • Advantage: contains information about the acoustic context of the current frame • Division of the regions: global Gaussian mixture model (GMM) • Trained via unsupervised clustering • Each Gaussian component in the GMM corresponds to a region • Region-specific transforms • In general, they can be any projections of long-span feature vectors • In this thesis, linear projections are studied
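The pieces described so far (GMM regions, posterior weighting, one linear projection per region) fit together as sketched below. This illustrates the idea and is not the thesis implementation: it assumes diagonal-covariance Gaussians and a region transform of the form P_i x + b_i, and every name and number is made up for the example.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def region_posteriors(x, weights, means, variances):
    """Posterior probability of each GMM component (region) given x."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def rdt(x, weights, means, variances, projections, offsets):
    """Posterior-weighted sum of the region transforms P_i x + b_i."""
    post = region_posteriors(x, weights, means, variances)
    y = [0.0] * len(offsets[0])
    for gamma, proj, off in zip(post, projections, offsets):
        fx = matvec(proj, x)
        for d in range(len(y)):
            y[d] += gamma * (fx[d] + off[d])
    return y


# Two hypothetical regions in a 2-D input space, projecting to 1-D.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 10.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
projections = [[[1.0, 0.0]], [[0.0, 1.0]]]  # region 1 keeps dim 0, region 2 keeps dim 1
offsets = [[0.0], [1.0]]

y_near_region_1 = rdt([0.1, -0.3], weights, means, variances, projections, offsets)
y_near_region_2 = rdt([10.0, 9.5], weights, means, variances, projections, offsets)
```

An input deep inside one region is transformed almost entirely by that region's projection. Inputs between regions get a smooth blend, which is what makes the overall transform nonlinear yet differentiable.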

Special Cases of RDT [Figure: hierarchy of transforms. RDT (generic projection) → RDLT (linear projection) → special cases MPE-HLDA, fMPE#, SPLICE, and mean-offset fMPE#, distinguished by the constraints "only one region", "only offset", "rotation matrix plus offset", and "P is not region-dependent"] • Note (#): fMPE also includes a context-expansion layer, which does not fit this categorization (see thesis for details)

Projections vs. Offsets in RDT • The projection and the offset in RDT: [Equation: the region-specific transform, with one term labeled "Projection" and one term labeled "Offset"] • Different regions can share the same projections and/or offsets, so the number of unique projections/offsets can be smaller than the number of regions

Optimization Criterion of RDT • Minimum Phoneme Error (MPE) criterion • Gives significant gains when used to train the HMM • Correlates well with WER [Figure: plot of WER against MPE score] • Can be rewritten as a function of the feature transform: $\mathcal{F}_{MPE}(O; \lambda, F_{RDT}) = \sum_r \sum_k P(W_{rk} \mid F_{RDT}(O_r), \lambda)\, \alpha(W_{rk})$ • $O$, $O_r$: original feature vectors; $\lambda$: the HMM; $F_{RDT}$: the feature transform; $\alpha(W_{rk})$: the accuracy score of hypothesized word sequence $W_{rk}$
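In its usual form, the MPE score of an utterance is the posterior-weighted average of the accuracy scores of the competing hypotheses. A minimal sketch over an explicit hypothesis list; a real system sums over lattices, and the log scores and accuracies below are made-up numbers:

```python
import math


def mpe_score(hypotheses):
    """Expected accuracy under the hypothesis posteriors.

    hypotheses: list of (log_acoustic_score, log_lm_score, accuracy) tuples.
    """
    logs = [la + ll for la, ll, _ in hypotheses]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logs]
    total = sum(weights)
    return sum(w / total * acc for w, (_, _, acc) in zip(weights, hypotheses))


# Hypothetical utterance with two competing hypotheses.
ambiguous = [(-100.0, -5.0, 4.0), (-100.0, -5.0, 2.0)]  # equally likely
confident = [(-95.0, -5.0, 4.0), (-100.0, -5.0, 2.0)]   # the accurate one is favoured
score_ambiguous = mpe_score(ambiguous)
score_confident = mpe_score(confident)
```

A feature transform that raises the acoustic score of the accurate hypothesis relative to its competitors raises the criterion, which is what the optimization exploits.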

HMM Updating Methods • In MPE, the HMM depends on the transformed features, so we must define how it is updated • When choosing the HMM updating method, the goal is to make the trained transform more generic, i.e., reusable across different training setups, including: • both ML and MPE training • different types of HMMs • This goal can be achieved if the feature transform focuses on separating the data • To ensure that, the HMM should describe the data itself rather than anything else

HMM Updating Methods (cont.) • If the HMM is updated discriminatively, e.g., under MPE • Some Gaussians in the HMM will model decision boundaries and sit away from the mass of the data • The feature transform will be diverted from separating the real data • The resulting transform is less generic • This method is acceptable if there is only one HMM to train • If the HMM is updated under ML • The Gaussians will stay on the data • The feature transform will also focus on the data • The resulting transform is more generic • This method is preferred if there are different HMMs to train • We assume ML updating of the HMM in this thesis

Example [Figure: data and model shown before and after the transform, for an ML model and for a discriminative model] • Since the model is already discriminative, nothing needs to be done here.

Training the Feature Transform • The transform is trained using a numerical optimization algorithm • Derivative of MPE with respect to the transform • Two terms in the derivative • MPE depends on the transformed features directly → direct derivative • MPE depends on the transform through the HMM, which in turn depends on the transformed features → indirect derivative • Two passes of data processing • The first pass computes the direct derivative using lattices • The second pass computes the indirect derivative using reference transcripts
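The split into a direct and an indirect term can be seen in a toy problem. In the sketch below the "transform" is a single scale factor a, the "model" is one mean re-estimated by ML from the transformed features, and the objective is a weighted squared distance that stands in for MPE. Everything here is a made-up illustration of the chain rule, not the thesis derivation:

```python
def transformed(a, xs):
    return [a * x for x in xs]


def ml_mean(ys):
    """ML re-estimate of the model mean from the transformed features."""
    return sum(ys) / len(ys)


def objective(ys, mu, ws):
    """Toy stand-in for the criterion: weighted negative squared distance to the mean."""
    return -sum(w * (y - mu) ** 2 for w, y in zip(ws, ys))


def derivative_terms(a, xs, ws):
    ys = transformed(a, xs)
    mu = ml_mean(ys)
    # Direct term: the objective depends on each y_t with the model held fixed.
    direct = sum(-2.0 * w * (y - mu) * x for w, y, x in zip(ws, ys, xs))
    # Indirect term: the objective depends on mu, and mu depends on every y_t.
    d_obj_d_mu = sum(2.0 * w * (y - mu) for w, y in zip(ws, ys))
    d_mu_d_a = sum(xs) / len(xs)
    return direct, d_obj_d_mu * d_mu_d_a


def numerical_derivative(a, xs, ws, eps=1e-6):
    """Central difference of the objective with the model re-estimated each time."""
    def f(v):
        ys = transformed(v, xs)
        return objective(ys, ml_mean(ys), ws)
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)


xs = [1.0, 2.0, 4.0]   # hypothetical 1-D features
ws = [1.0, 0.5, 2.0]   # hypothetical per-frame weights
direct, indirect = derivative_terms(1.5, xs, ws)
check = numerical_derivative(1.5, xs, ws)
```

The sum `direct + indirect` matches the finite-difference value, while `direct` alone does not: ignoring the path through the model gives the wrong gradient whenever the model is re-estimated from the transformed features.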

Training Procedure • Iterative update of RDT using numerical optimization [Figure: training loop. Original features → Apply Transform (RDT) → projected features → Train/Update HMM (using reference transcripts) → HMM → Compute MPE Derivative (using lattices) → derivative → Update RDT]

Implementation • Feature transform network • A directed acyclic network of primitive components • Design goals: • reuse primitive components (e.g., linear projection, frame-concatenation) • reuse the algorithm that applies the transform or computes the derivative • easy to extend to other transforms • efficient usage of CPU time & memory • Impact: • enables numerical optimization of any differentiable components, including but not limited to RDT • simplifies the BBN system by providing a unified representation of various transforms • adds flexibility to the front-end processing in the BBN system [Figure: example network with nodes Cepstra, Concatenation, Projection, Gauss. Mixture, and RDT]
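One way to picture such a network is as components that each know how to apply themselves (`forward`) and how to pass a derivative back to their input (`backward`). The sketch below is a hypothetical miniature, not the BBN implementation: it supports only a simple chain (a special case of a directed acyclic network) with two primitives, and all class names are invented for the example.

```python
class Concatenate:
    """Stacks each frame with its neighbours; edge frames are repeated."""

    def __init__(self, context):
        self.context = context

    def forward(self, frames):
        self.n, self.dim = len(frames), len(frames[0])
        out = []
        for t in range(self.n):
            stacked = []
            for k in range(-self.context, self.context + 1):
                stacked.extend(frames[min(max(t + k, 0), self.n - 1)])
            out.append(stacked)
        return out

    def backward(self, grads):
        # Route each slice of the stacked gradient back to its source frame.
        back = [[0.0] * self.dim for _ in range(self.n)]
        for t, g in enumerate(grads):
            for slot, k in enumerate(range(-self.context, self.context + 1)):
                src = min(max(t + k, 0), self.n - 1)
                for d in range(self.dim):
                    back[src][d] += g[slot * self.dim + d]
        return back


class LinearProjection:
    def __init__(self, matrix):
        self.matrix = matrix

    def forward(self, frames):
        self.inputs = frames
        return [[sum(m * x for m, x in zip(row, f)) for row in self.matrix]
                for f in frames]

    def backward(self, grads):
        rows, cols = len(self.matrix), len(self.matrix[0])
        # Derivative with respect to the matrix, kept for an optimizer to use.
        self.matrix_grad = [[sum(g[r] * f[c] for g, f in zip(grads, self.inputs))
                             for c in range(cols)] for r in range(rows)]
        # Derivative with respect to the input: transpose(matrix) times gradient.
        return [[sum(self.matrix[r][c] * g[r] for r in range(rows))
                 for c in range(cols)] for g in grads]


class Network:
    """A chain of components applied in order."""

    def __init__(self, components):
        self.components = components

    def forward(self, frames):
        for comp in self.components:
            frames = comp.forward(frames)
        return frames

    def backward(self, grads):
        for comp in reversed(self.components):
            grads = comp.backward(grads)
        return grads


projection = LinearProjection([[1.0, 0.0, -1.0]])
net = Network([Concatenate(context=1), projection])
cepstra = [[1.0], [2.0], [3.0]]                 # three 1-D frames (toy data)
outputs = net.forward(cepstra)
input_grads = net.backward([[1.0], [1.0], [1.0]])  # derivative of sum(outputs)
```

Because every component exposes the same two methods, the code that applies a transform or computes its derivative is written once and reused, and a new differentiable component only needs its own `forward` and `backward`.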

RDT and the State-of-the-art System • The state-of-the-art system at BBN • Two sub-systems • Speaker-independent (SI) system • Speaker-adaptive (SA) system • Two phases of training • ML (initialize MPE training) • MPE • Three pass decoding • Three tied-mixture acoustic models • How RDT interacts with the system • Trained once, used in three types of acoustic models • Integrated with speaker adaptation

RDT in Speaker-independent (SI) Training [Figure: flowcharts for bootstrapping, the SI training baseline, and SI training with RDT; blocks: LDA+MLLT initial transform, ML training, ML-SI HMM, lattice generation, lattices, RDT training, RDT & HMM, MPE training, MPE-SI HMM]

Experimental Setup • Data • Training: English Conversational Telephone Speech (CTS), 2300 hours SWB+Fisher • Testing: Eval03+Dev04, 3 hours SWB-II, 6 hours Fisher • Analysis • 14 Perceptual Linear Prediction (PLP) cepstral coefficients and normalized energy • Vocal Tract Length Normalization (VTLN) • RDT • 15-frame long-span features projected to 60 dimensions • initialized from LDA+MLLT • 1000 regions, one linear projection per region • crossword state-cluster tied model (SCTM), 7K clusters. • number of Gaussians per state-cluster in the HMM varies in different experiments
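A quick calculation shows the size of the transform implied by this setup. It assumes each frame contributes exactly 15 values (the 14 cepstra plus energy, with no derivative features) and counts projection matrices only, ignoring offsets and the region GMM. Both are assumptions of this sketch, not figures from the slides.

```python
frames_in_span = 15   # 15-frame long-span features
base_dim = 14 + 1     # 14 PLP cepstra + normalized energy (assumed: nothing else per frame)
output_dim = 60       # projected dimensionality
num_regions = 1000    # one linear projection per region

input_dim = frames_in_span * base_dim
params_per_projection = output_dim * input_dim
total_projection_params = num_regions * params_per_projection
```

Under these assumptions the long-span input has 225 dimensions, each projection has 13,500 entries, and the full set has 13.5 million. That scale is one reason a large training set and a criterion tied to WER matter when training it.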

SI Results (ML) • Description • Two RDTs were trained using the HMMs with 12 Gaussians per state-cluster (GPS) and 44 GPS, respectively • For decoding, several ML crossword SCTM models with different sizes were trained using either LDA+MLLT or RDT • Only the lattice-rescoring pass was run in decoding for simplicity • (#): After other two models (STM, SCTM-NX) were retrained, the WER was further reduced to 20.4%, i.e., 9.3% relatively better than the LDA+MLLT result

SI Results (MPE) • Description • Same as the ML experiments, except that the final models were trained under MPE • (#): After other two models (STM, SCTM-NX) were trained, the WER was further reduced to 19.2%, i.e., 5.8% relatively better than the LDA+MLLT result
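Relative WER reduction is (baseline − new) / baseline. The sketch below applies that definition and inverts it to show the baseline WER implied by the figures quoted in the two results notes. The implied baselines are back-computed for illustration and are not numbers reported on the slides.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction of new_wer with respect to baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer, relative_gain):
    """Baseline WER that would yield new_wer after the given relative reduction."""
    return new_wer / (1.0 - relative_gain)


# (WER with RDT in %, stated relative gain) from the ML and MPE notes.
quoted = {"ML": (20.4, 0.093), "MPE": (19.2, 0.058)}
implied = {name: implied_baseline(wer, gain) for name, (wer, gain) in quoted.items()}
```

The implied baselines come out near 22.5% for the ML system and 20.4% for the MPE system.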

Speaker Adaptation • Speaker adaptation [Figure: an SI model mapped by speaker-specific transforms A(1), A(2), A(3), …, A(N) to speaker models S(1), S(2), S(3), …, S(N)] • Assumption: the speaker-dependent models are linearly transformed from an SI model • Variations • MLLR: assumes that only the Gaussian means are transformed • CMLLR: both means & covariances are transformed → equivalent to applying the inverse transform to the features while keeping the model fixed • Speaker-Adaptive Training (SAT) • The SI model is not optimal for adaptation • SAT tries to estimate a better model that, when transformed, gives the best likelihood of the data
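The CMLLR equivalence can be checked numerically in one dimension. Transforming a Gaussian's mean and variance by (a, b) gives the same likelihood as leaving the Gaussian alone, applying the inverse transform to the feature, and multiplying by the Jacobian factor 1/|a|. The numbers below are arbitrary:

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def model_space_likelihood(x, mean, var, a, b):
    """Adapt the Gaussian: mean -> a * mean + b, variance -> a^2 * variance."""
    return gaussian_pdf(x, a * mean + b, a * a * var)


def feature_space_likelihood(x, mean, var, a, b):
    """Keep the Gaussian fixed, inverse-transform the feature, apply the Jacobian 1/|a|."""
    return gaussian_pdf((x - b) / a, mean, var) / abs(a)


x, mean, var = 1.7, 0.4, 2.0   # arbitrary observation and SI Gaussian
a, b = -1.5, 0.3               # arbitrary speaker transform
lik_model = model_space_likelihood(x, mean, var, a, b)
lik_feature = feature_space_likelihood(x, mean, var, a, b)
```

The two values agree up to rounding. In the multivariate case the factor 1/|a| becomes the determinant of the inverse transform, and the practical benefit is that one feature-space transform per speaker replaces an update of every Gaussian in the model.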

RDT in Speaker-adaptive Training (SAT) • Straightforward approach • Use SI-RDT transparently • Simple • But RDT is not optimized for SAT [Figure: Train SI RDT → SI RDT & HMM → CMLLR estimation → SD transforms → ML SAT → ML-SAT HMM → MPE training → MPE-SAT HMM]

RDT in Speaker-adaptive Training (SAT) • Iterative approach (SA-RDT) • Alternately update RDT and the speaker-dependent (SD) transforms • Back-propagation is used to compute the derivative, since SD transforms are applied on top of RDT • RDT is optimized for SAT [Figure: Train SI RDT → SI RDT & HMM → CMLLR estimation → SD transforms → ML SAT → ML-SAT HMM → Update RDT → SA RDT & HMM → MPE training → MPE-SAT HMM]

Adapted Results • Description • Same training & testing data, state-cluster and LM as the unadapted experiments • 10.9% relative WER reduction for the ML system • 7.0% relative WER reduction for the MPE system

Alternative Procedure for SA-RDT • Simplified SA-RDT • Similar to the original SA-RDT • But the speaker-dependent transforms are estimated using the baseline model & features [Figure: SI LDA+MLLT & HMM → CMLLR estimation → SD transforms → ML SAT → ML-SAT HMM → Update RDT → SA RDT & HMM → MPE training → MPE-SAT HMM]

Adapted Results • Description • 500 hours of training data • Another set of SD transforms was applied before LDA/RDT • SA-RDT1 used the simplified procedure • SA-RDT2 used the original procedure • The simplified procedure gave 2/3 of the gain while training the RDT only once

Conclusions • Original work • Region-dependent transform • Improved discriminative feature training that leads to a more generic feature transform • Improved SAT procedure using RDT • Impact • RDT encompasses several other feature transforms, including MPE-HLDA, SPLICE, and the core of fMPE and mean-offset fMPE • The method gives a significant WER reduction: 7% relative for the SAT-MPE English CTS system • The method is potentially helpful for exploring novel acoustic features • Adding new features to the input of the feature transform carries little risk, because training decides whether and how to use them based on a criterion that correlates with WER

Publications • B. Zhang, S. Matsoukas, J. Ma, and R. Schwartz. Long span features and minimum phoneme error heteroscedastic linear discriminant analysis. In Proceedings of EARS RT-04 Workshop, 2004. • B. Zhang and S. Matsoukas. Minimum phoneme error based heteroscedastic linear discriminant analysis for speech recognition. In Proceedings of ICASSP, 2005. • B. Zhang, S. Matsoukas, and R. Schwartz. Discriminatively trained region-dependent transform for speech recognition. In Proceedings of ICASSP, 2006. • Nominated for the Student Paper Award • Awarded the Spoken Language Processing Grant by the IEEE Signal Processing Society • B. Zhang, S. Matsoukas, and R. Schwartz. Recent progress on the discriminative region-dependent transform for speech feature extraction. In Proceedings of ICSLP, 2006.
