Discriminative Feature Optimization for Speech Recognition

Bing Zhang

College of Computer & Information Science Northeastern University

- Introduction
- Problem to attack
- Methodology
- Region-dependent feature transform
- Discriminative optimization of the feature transform

- Implementation
- System description & results
- Conclusions

- Speech recognition
- Goal: transcribe speech into text
- Performance measurement: word error rate (WER)
- Typical approach:
- Training: statistically model the acoustic and linguistic knowledge
- Recognition: search for the most probable word sequence using the models
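As a minimal sketch of the WER measurement named above, the function below counts word-level substitutions, deletions, and insertions with a standard edit-distance recurrence and divides by the reference length. The function name and the whitespace tokenization are assumptions made for illustration, not details from the thesis.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "this is a guest sentence" against the reference "this is a test sentence" counts one substitution over five reference words.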

- Speech feature extraction
- Reason: raw signals cannot be modeled robustly because of their high dimensionality, so compact features have to be extracted
- Two stages of feature extraction:
- speech analysis → cepstral coefficients
- speech feature transformation

- In this thesis: A better feature transformation approach is developed to reduce the WER of the speech recognition system

[Figure: A typical speech recognition system. Components: speech signal, feature extraction, features, search engine (drawing on the acoustic model and the language model), word sequence]

Language Model

- N-grams
- Models the conditional probability of any word given N-1 words in history
- The product of N-gram probabilities can be used to approximate the probability of a sequence of words
- $P(w_1, w_2, \ldots, w_k) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_1, \ldots, w_{N-1}) \cdots P(w_{k-1} \mid w_{k-N}, \ldots, w_{k-2})\, P(w_k \mid w_{k-(N-1)}, \ldots, w_{k-1})$
- Special cases:
- Unigram: $P(w_i)$
- Bigram: $P(w_i \mid w_{i-1})$
- Trigram: $P(w_i \mid w_{i-2}, w_{i-1})$
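The bigram special case can be sketched in a few lines. This is an illustrative, unsmoothed maximum-likelihood estimate; the `<s>` start marker, the function names, and the absence of smoothing are assumptions made here and are not a description of the language model used in the thesis.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams; "<s>" marks the start of each sentence."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        for prev, word in zip(words, words[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts

def sentence_probability(sentence, history_counts, bigram_counts):
    """Approximate P(w_1, ..., w_k) as the product of P(w_i | w_{i-1})."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if history_counts[prev] == 0:
            return 0.0  # unseen history; a real system would smooth instead
        prob *= bigram_counts[(prev, word)] / history_counts[prev]
    return prob
```

A practical language model would add smoothing or back-off so that unseen word pairs do not receive zero probability.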

- Repository of unit HMMs (hidden Markov models)
- Each HMM is a probabilistic finite state machine with outputs at each hidden state
- Transition probabilities
- Observation probabilities (modeled by a mixture of Gaussians for each state)

- Each HMM represents a basic unit of speech, e.g., phoneme, crossword/non-crossword multiphones

- HMM state-clusters: specify which HMM states can share which parameters
- Pronunciation dictionary: phonetic spelling of the words
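To make the roles of the transition and observation probabilities concrete, here is a small sketch that scores one state path and sums over all paths with the forward recursion. The array layout (`start`, `trans`, `end`, `obs_like`) is an assumption chosen for illustration; in a real system `obs_like[t, s]` would come from the Gaussian mixture of state `s`.

```python
import itertools
import numpy as np

def path_likelihood(path, start, trans, end, obs_like):
    """Joint likelihood of the observations and one state path.

    start[s]: probability of entering state s from Start
    trans[s, s2]: transition probability a_{s,s2}
    end[s]: probability of leaving state s for End
    obs_like[t, s]: observation likelihood b_s(o_t)
    """
    p = start[path[0]] * obs_like[0, path[0]]
    for t in range(1, len(path)):
        p *= trans[path[t - 1], path[t]] * obs_like[t, path[t]]
    return float(p * end[path[-1]])

def forward_likelihood(start, trans, end, obs_like):
    """Likelihood of the observations summed over all state paths."""
    alpha = start * obs_like[0]
    for t in range(1, obs_like.shape[0]):
        alpha = (alpha @ trans) * obs_like[t]
    return float(alpha @ end)

def brute_force_likelihood(start, trans, end, obs_like):
    """Same quantity by explicit enumeration; only feasible for tiny models."""
    n_frames, n_states = obs_like.shape
    return sum(
        path_likelihood(path, start, trans, end, obs_like)
        for path in itertools.product(range(n_states), repeat=n_frames)
    )
```

The forward recursion costs a number of operations proportional to the number of frames times the square of the number of states, whereas enumeration grows exponentially with the number of frames.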

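[Figure: A four-state left-to-right HMM (Start, states 1 to 4, End) with self-loop transitions a11, a22, a33, a44, forward transitions a12, a23, a34, and skip transitions a13, a24, generating the observations o1 to o6. Two example state paths are shown: one with output probabilities b1(o1), b1(o2), b2(o3), b3(o4), b3(o5), b4(o6), and one that skips state 3 with output probabilities b1(o1), b2(o2), b2(o3), b2(o4), b4(o5), b4(o6)]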

[Figure: network of competing word hypotheses for an utterance; word labels include "this", "is", "a", "test", "sentence", "the", "guest", "quest", "sense", and SIL (silence)]

- Maximum likelihood (ML) training
- Objective: maximize the conditional likelihood of the observed features given the model
- Algorithm: Expectation-maximization (EM)

- Discriminative training
- Objective: train the model to distinguish the correct word sequence from other hypotheses
- Criterion
- Minimum phoneme error (MPE)

- Representation of hypotheses: lattices
- Algorithm: Extended EM

- Speech analysis
- Deals with the problem of extracting distinguishing characteristics (e.g., formant locations) of speech from digital signals
- Examples: MFCC (Mel-frequency cepstral coefficients), PLP (perceptual linear prediction)
- Resulting features: cepstral coefficients

- Speech feature transformation
- Applied on top of the cepstral coefficients
- Transform the cepstral features to better fit the model
- help the HMM to model the trajectory of the cepstral features
- fit the diagonal covariance assumption of the Gaussian components

- LDA (linear discriminant analysis)
- Transform the features to maximize the distance between different classes while keeping each class as compact as possible
- Assumes that all classes have equal covariance

- HLDA (heteroscedastic linear discriminant analysis)
- Remove the equal covariance assumption of LDA
- Find the feature transform that maximizes the likelihood of the data with respect to the acoustic model in the transformed space

- Others
- HDA (heteroscedastic discriminant analysis)
- MLLT (maximum likelihood linear transform)

- Inaccurate assumptions about the acoustic model
- LDA assumes equal-class covariance
- HDA & LDA ignore the diagonal covariance assumption

- Linear transform
- Linear transform has limited power for feature extraction
- Using more powerful transforms can be risky when the criterion does not correlate with the WER

- The criteria do not correlate with the WER
- Performance degrades on high-dimensional input features
- Experimental results in the thesis

- Performance degrades on highly-correlated input features
- Example on the next slide

The data has linear dependency between two dimensions such that: Z=2X

[Figure: the example data plotted on the X, Y, and Z axes]

If projected to 1-D:

- HLDA will map all samples to one single point
- LDA will fail to find the answer at all because the covariance matrix of each class is singular
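The failure mode can be reproduced numerically. The sketch below builds synthetic three-dimensional samples with Z = 2X and shows that the class covariance is rank deficient and that one projection direction collapses every sample to the same value. The data, the sample size, and the function names are assumptions made for illustration; they are not the data from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_class(mean_xy, n_samples=200):
    """Draw (X, Y) from a Gaussian and append the dependent dimension Z = 2X."""
    xy = rng.normal(loc=mean_xy, scale=1.0, size=(n_samples, 2))
    z = 2.0 * xy[:, :1]
    return np.hstack([xy, z])  # columns: X, Y, Z

def covariance_rank(samples):
    """Rank of the sample covariance matrix; 3 would mean it is invertible."""
    return int(np.linalg.matrix_rank(np.cov(samples, rowvar=False)))

def projection_spread(samples, direction):
    """Range of the samples after projecting onto one direction."""
    projected = samples @ direction
    return float(projected.max() - projected.min())

class_a = make_class([0.0, 0.0])
class_b = make_class([3.0, 1.0])
degenerate_direction = np.array([2.0, 0.0, -1.0])  # computes 2X - Z
```

Projecting either class onto `degenerate_direction` gives zero spread, which is the single-point solution described above, while the rank-2 covariance is what prevents the matrix inversion that LDA needs.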

- Region-dependent transform
- Nonlinear
- Computationally inexpensive to train

- Discriminative training of the feature transform
- Criterion correlates well with the WER

- Detailed acoustic model in feature training

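[Figure: the acoustic space divided into regions r1, r2, …, rN, with a transform f1, f2, …, fN attached to each region]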

- RDT:
- Divides the acoustic space into multiple regions
- e.g., r1, r2, …, rN

- Applies a different transform based on which region the input feature vector belongs to
- e.g., f1, f2, …, fN

To avoid making hard decisions when choosing which transform to apply, the posterior probabilities of the regions are used to interpolate the transformed results:
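In notation assumed here rather than taken from the slide (with $o_t$ the input feature vector at frame $t$ and $P(r_i \mid o_t)$ the posterior probability of region $r_i$), the interpolation described above can be written as:

```latex
F_{\mathrm{RDT}}(o_t) \;=\; \sum_{i=1}^{N} P(r_i \mid o_t)\, f_i(o_t),
\qquad \sum_{i=1}^{N} P(r_i \mid o_t) = 1 .
```

When one posterior is close to 1, this reduces to applying the single transform of that region.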

- Input features: long-span features
- A long-span feature vector is formed by concatenating the cepstral features from consecutive frames, centered at the current frame
- Advantage: contains information about the acoustic context of the current frame

- Division of the regions: global Gaussian mixture model (GMM)
- Trained via unsupervised clustering
- Each Gaussian component in the GMM corresponds to a region

- Region-specific transforms
- In general, they can be any projections of long-span feature vectors
- In this thesis, linear projections are studied
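A minimal NumPy sketch of a region-dependent linear transform along these lines is shown below. It assumes a diagonal-covariance GMM for the regions and one projection matrix plus one offset vector per region; the array shapes, the function names, and the choice to compute the posteriors from the same input vector are illustrative assumptions, not the implementation from the thesis.

```python
import numpy as np

def region_posteriors(x, weights, means, variances):
    """Posterior of each GMM component (region) for each frame.

    x: (T, D) input features; weights: (N,); means, variances: (N, D).
    Returns an array of shape (T, N) whose rows sum to 1.
    """
    diff = x[:, None, :] - means[None, :, :]                       # (T, N, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )
    log_joint = np.log(weights) + log_gauss                        # (T, N)
    log_joint -= log_joint.max(axis=1, keepdims=True)              # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def rdt_apply(x, weights, means, variances, projections, offsets):
    """Posterior-weighted sum of per-region linear transforms.

    projections: (N, d, D); offsets: (N, d). Returns (T, d).
    """
    post = region_posteriors(x, weights, means, variances)            # (T, N)
    per_region = np.einsum("ndk,tk->tnd", projections, x) + offsets   # (T, N, d)
    return np.einsum("tn,tnd->td", post, per_region)
```

With a single region the function reduces to one global linear transform, and if every region shares the same projection and offset the posteriors have no effect on the output.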

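[Figure: relationship between RDT and other feature transforms. RDT uses a generic projection; RDLT restricts it to a linear projection. MPE-HLDA, fMPE (#), SPLICE, and mean-offset fMPE (#) appear as special cases, distinguished by the annotations "only one region", "only offset", "rotation matrix plus offset", and "P is not region-dependent"]

Note (#): fMPE also includes a context-expansion layer, which does not fit this categorization (see thesis for details).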

The projection and the offset in RDT:

[Equation: the RDT formula, with its projection term and its offset term labeled]

Different regions can share the same projections and/or offsets, so the number of unique projections/offsets can be less than the number of regions.

- Minimum Phoneme Error (MPE) criterion
- Gives significant gains when used to train the HMM
- Correlates well with WER
- Can be rewritten as a function of the feature transform:

[Figure: plot of WER against MPE score]

$O$, $O_r$: original feature vectors; $\lambda$: the HMM; $F_{\mathrm{RDT}}$: the feature transform; $\alpha(W_{rk})$: the accuracy score of hypothesized word sequence $W_{rk}$
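A common way to write the MPE criterion with these symbols, given here as an assumption because the slide's own equation was not recovered, is the posterior-weighted accuracy $\sum_r \sum_k P(W_{rk} \mid F_{\mathrm{RDT}}(O_r), \lambda)\, \alpha(W_{rk})$. The sketch below evaluates that sum for a single utterance over an explicit list of hypotheses; the thesis uses lattices rather than lists, and the function name, the acoustic scale `kappa`, and the inputs are hypothetical.

```python
import numpy as np

def mpe_score(acoustic_loglikes, lm_logprobs, accuracies, kappa=1.0):
    """Expected accuracy of one utterance under the hypothesis posteriors.

    acoustic_loglikes[k]: log-likelihood of the transformed features given hypothesis k
    lm_logprobs[k]: language model log-probability of hypothesis k
    accuracies[k]: accuracy score of hypothesis k against the reference
    kappa: acoustic scale applied to the log-likelihoods (an assumed detail)
    """
    scores = kappa * np.asarray(acoustic_loglikes, dtype=float) + np.asarray(lm_logprobs, dtype=float)
    scores = scores - scores.max()          # avoid overflow before exponentiating
    posteriors = np.exp(scores)
    posteriors /= posteriors.sum()
    return float(posteriors @ np.asarray(accuracies, dtype=float))
```

Because the acoustic log-likelihoods are computed on transformed features, the score is a differentiable function of the transform parameters, which is what makes gradient-based training of the transform possible.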

- In MPE, the HMM depends on the transformed features, so we must define how it is updated
- When choosing the HMM updating method, the concern is to make the trained transform more generic, i.e., reusable for different training setups, including:
- both ML and MPE training
- different types of HMMs

- This goal can be achieved if the feature transform focuses on separating the data
- To ensure that, the HMM should describe the data rather than anything else

- If the HMM is updated discriminatively, e.g., under MPE
- Some Gaussians in the HMM will model decision boundaries, being away from the mass of the data
- The feature transform will be drawn away from separating the real data
- The resulting transform is less generic
- This method is OK if there is only one HMM to train

- If the HMM is updated under ML
- The Gaussians will stay on the data
- The feature transform will also focus on the data
- The resulting transform is more generic
- This method is preferred if there are different HMMs to train

- We assume ML updating of the HMM in this thesis

[Figure: an ML model and a discriminative model, each shown before and after the transform. Annotation on the discriminative model: since the model is already discriminative, nothing needs to be done here]

- The transform is trained using a numerical optimization algorithm
- Derivative of MPE with respect to the transform
- Two terms in the derivative
- MPE depends on the transformed features directly → direct derivative
- MPE depends on the transform through the HMM, which in turn depends on the transformed features → indirect derivative

- Two passes of data processing
- The first pass computes the direct derivative using lattices
- The second pass computes the indirect derivative using reference transcripts
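The split into a direct and an indirect term can be illustrated on a toy problem. In the sketch below, the "model" is a single unit-variance Gaussian whose mean is re-estimated by ML from the transformed features, and the criterion is a weighted log-likelihood with per-frame weights `c` standing in for the statistics an MPE pass would produce. All of these choices are simplifying assumptions for illustration; no lattices or HMMs are involved.

```python
import numpy as np

def objective(W, X, c):
    """Toy criterion J(W) = sum_t c_t * (-0.5 * ||W x_t - mu||^2).

    mu is the ML mean of the transformed features, so it changes with W.
    X: (T, D) original features; W: (d, D) transform; c: (T,) frame weights.
    """
    Y = X @ W.T
    mu = Y.mean(axis=0)
    return float(np.sum(c * -0.5 * np.sum((Y - mu) ** 2, axis=1)))

def direct_and_indirect(W, X, c):
    """Return the two terms of dJ/dW, each of shape (d, D)."""
    Y = X @ W.T
    mu = Y.mean(axis=0)
    # Direct term: dJ/dy_t with the model held fixed, chained through y_t = W x_t.
    dJ_dY = -c[:, None] * (Y - mu)
    direct = dJ_dY.T @ X
    # Indirect term: dJ/dmu, chained through the ML update mu = W * mean(x).
    dJ_dmu = np.sum(c[:, None] * (Y - mu), axis=0)
    indirect = np.outer(dJ_dmu, X.mean(axis=0))
    return direct, indirect
```

The sum of the two terms matches a finite-difference estimate of the full derivative, while the direct term alone does not whenever the frame weights differ, which is why the model's dependence on the features cannot be ignored.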

[Figure: Iterative update of RDT using numerical optimization. Blocks: original features, apply transform (RDT), projected features, train/update HMM (using reference transcripts), compute MPE derivative (using lattices and the HMM), update RDT]

- Feature transform network
- A directed acyclic network of primitive components
- Design goals:
- reuse primitive components (e.g., linear projection, frame-concatenation)
- reuse the algorithm that applies the transform or computes the derivative
- easy to extend to other transforms
- efficient usage of CPU time & memory

- Impact:
- enables numerical optimization of any differentiable component, including but not limited to RDT
- simplifies the BBN system by providing a unified representation of various transforms
- adds flexibility to the front-end processing in the BBN system
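The idea of composing a transform from reusable primitives, each able to apply itself and to propagate a derivative, can be sketched as follows. For brevity the network here is a simple chain, which is the simplest directed acyclic network; the class names and the `forward`/`backward` interface are hypothetical and are not the BBN implementation.

```python
import numpy as np

class FrameConcat:
    """Concatenate each frame with `context` neighbors on each side (edges repeated)."""
    def __init__(self, context):
        self.context = context

    def forward(self, X):
        T, c = X.shape[0], self.context
        padded = np.pad(X, ((c, c), (0, 0)), mode="edge")
        return np.hstack([padded[k:k + T] for k in range(2 * c + 1)])

    def backward(self, X, grad_out):
        T, D = X.shape
        c = self.context
        grad_padded = np.zeros((T + 2 * c, D))
        for k in range(2 * c + 1):
            grad_padded[k:k + T] += grad_out[:, k * D:(k + 1) * D]
        grad_in = grad_padded[c:c + T].copy()
        grad_in[0] += grad_padded[:c].sum(axis=0)        # repeated first frame
        grad_in[-1] += grad_padded[c + T:].sum(axis=0)   # repeated last frame
        return grad_in

class LinearProjection:
    """y = P x + b, with gradients stored for the parameters."""
    def __init__(self, P, b):
        self.P, self.b = P, b

    def forward(self, X):
        return X @ self.P.T + self.b

    def backward(self, X, grad_out):
        self.grad_P = grad_out.T @ X
        self.grad_b = grad_out.sum(axis=0)
        return grad_out @ self.P

class Chain:
    """Apply components in order; back-propagate derivatives in reverse order."""
    def __init__(self, components):
        self.components = components

    def forward(self, X):
        self.inputs = []
        for component in self.components:
            self.inputs.append(X)
            X = component.forward(X)
        return X

    def backward(self, grad_out):
        for component, X in zip(reversed(self.components), reversed(self.inputs)):
            grad_out = component.backward(X, grad_out)
        return grad_out
```

Because every component exposes the same two operations, the code that applies the transform and the code that computes the derivative do not need to know which primitives the network contains.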

[Figure: example feature transform network built from the components cepstra, concatenation, projection, Gaussian mixture, and RDT]

- The state-of-the-art system at BBN
- Two sub-systems
- Speaker-independent (SI) system
- Speaker-adaptive (SA) system

- Two phases of training
- ML (initialize MPE training)
- MPE

- Three-pass decoding
- Three tied-mixture acoustic models

- How RDT interacts with the system
- Trained once, used in three types of acoustic models
- Integrated with speaker adaptation

[Figure: SI training baseline and SI training with RDT. Boxes: LDA+MLLT initial transform, bootstrapping, RDT training, RDT & HMM, ML training, ML-SI HMM, lattice generation, lattices, MPE training, MPE-SI HMM]

- Data
- Training: English Conversational Telephone Speech (CTS), 2300 hours SWB+Fisher
- Testing: Eval03+Dev04, 3 hours SWB-II, 6 hours Fisher

- Analysis
- 14 Perceptual Linear Prediction (PLP) cepstral coefficients and normalized energy
- Vocal Tract Length Normalization (VTLN)

- RDT
- 15-frame long-span features projected to 60 dimensions
- initialized from LDA+MLLT
- 1000 regions, one linear projection per region
- crossword state-cluster tied model (SCTM), 7K clusters.
- number of Gaussians per state-cluster in the HMM varies in different experiments

- Description
- Two RDTs were trained using the HMMs with 12 Gaussians per state-cluster (GPS) and 44 GPS, respectively
- For decoding, several ML crossword SCTM models with different sizes were trained using either LDA+MLLT or RDT
- Only the lattice-rescoring pass was run in decoding for simplicity
- (#): After the other two models (STM, SCTM-NX) were retrained, the WER was further reduced to 20.4%, i.e., 9.3% relatively better than the LDA+MLLT result

- Description
- Same as the ML experiments, except that the final models were trained under MPE
- (#): After the other two models (STM, SCTM-NX) were trained, the WER was further reduced to 19.2%, i.e., 5.8% relatively better than the LDA+MLLT result
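The relative figures quoted above follow the usual definition of relative WER reduction, (baseline - system) / baseline. The helper below states that arithmetic explicitly and also inverts it; the function names are illustrative, and any baseline value obtained by inversion is back-computed from the quoted numbers rather than reported in the slides.

```python
def relative_reduction(baseline_wer, system_wer):
    """Relative WER reduction of a system with respect to a baseline."""
    return (baseline_wer - system_wer) / baseline_wer

def implied_baseline(system_wer, relative):
    """Baseline WER implied by a system WER and a quoted relative reduction."""
    return system_wer / (1.0 - relative)
```

For instance, `implied_baseline(20.4, 0.093)` recovers the LDA+MLLT WER that would make 20.4% a 9.3% relative improvement, up to the rounding of the quoted figures.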

[Figure: an SI model mapped by the speaker-specific linear transforms A(1), A(2), A(3), …, A(N) to the speaker-dependent models S(1), S(2), S(3), …, S(N)]

- Speaker adaptation (figure)
- Assumption: the speaker-dependent models are linearly transformed from an SI model
- Variations
- MLLR: assume that only Gaussian means are transformed
- CMLLR: both means and covariances are transformed, which is equivalent to applying the inverse transform to the features while keeping the model fixed
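The equivalence stated for CMLLR can be checked numerically for a single Gaussian. The sketch below compares the log-density under a model whose mean and covariance are transformed by (A, b) with the log-density of the inverse-transformed feature under the unchanged model, including the log-determinant term that the change of variables introduces. The function names and the single-Gaussian setting are assumptions made for illustration.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a full-covariance Gaussian at the vector x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return float(-0.5 * (x.shape[0] * np.log(2.0 * np.pi) + logdet + quad))

def model_space_logpdf(x, mean, cov, A, b):
    """Transform the model: mean -> A mean + b, covariance -> A cov A^T."""
    return gaussian_logpdf(x, A @ mean + b, A @ cov @ A.T)

def feature_space_logpdf(x, mean, cov, A, b):
    """Keep the model fixed and apply the inverse transform to the feature."""
    x_inverse = np.linalg.solve(A, x - b)
    _, logdet_A = np.linalg.slogdet(A)
    return gaussian_logpdf(x_inverse, mean, cov) - logdet_A
```

The two functions return the same value for any invertible A, which is why a constrained transform can be applied once to the features instead of to every Gaussian in the model.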

- Speaker-Adaptive Training (SAT)
- The SI model is not optimal for adaptation
- SAT estimates a model that, when transformed, gives the best likelihood of the data

- Use SI-RDT transparently
- Simple
- But RDT is not optimized for SAT

[Figure: Straightforward approach. Train SI RDT → SI RDT & HMM → CMLLR estimation → SD transforms → ML SAT → ML-SAT HMM → MPE training → MPE-SAT HMM]

- Alternately update RDT and the speaker-dependent (SD) transforms
- Back-propagation is used to compute the derivative, since SD transforms are applied on top of RDT
- RDT is optimized for SAT

[Figure: Iterative approach (SA-RDT). Train SI RDT → SI RDT & HMM → CMLLR estimation → SD transforms → ML SAT → ML-SAT HMM → update RDT → SA RDT & HMM → MPE training → MPE-SAT HMM]

- Description
- Same training & testing data, state-cluster and LM as the unadapted experiments
- 10.9% relative WER reduction for the ML system
- 7.0% relative WER reduction for the MPE system

- Similar to the original SA-RDT
- But the speaker-dependent transforms are estimated using the baseline model & features

[Figure: Simplified SA-RDT. SI LDA+MLLT & HMM → CMLLR estimation → SD transforms → ML SAT → ML-SAT HMM → update RDT → SA RDT & HMM → MPE training → MPE-SAT HMM]

- Description
- 500 hours of training data
- Another set of SD transforms was used before LDA/RDT
- SA-RDT1 used the simplified procedure
- SA-RDT2 used the original procedure
- The simplified procedure gave two-thirds of the gain while training the RDT only once

- Original work
- Region-dependent transform
- Improved discriminative feature training that leads to a more generic feature transform
- Improved SAT procedure using RDT

- Impact
- RDT encompasses several other feature transforms, including MPE-HLDA, SPLICE and the core of fMPE and mean-offset fMPE
- The method gives a significant WER reduction: 7% relative for the SAT-MPE English CTS system
- The method is potentially helpful for exploring novel acoustic features
- Adding new features to the input of the feature transform carries little risk, because the training decides whether and how to use them based on a criterion that correlates with WER

- B. Zhang, S. Matsoukas, J. Ma, and R. Schwartz. Long span features and minimum phoneme error heteroscedastic linear discriminant analysis. In Proceedings of EARS RT-04 Workshop, 2004.
- B. Zhang and S. Matsoukas. Minimum phoneme error based heteroscedastic linear discriminant analysis for speech recognition. In Proceedings of ICASSP, 2005.
- B. Zhang, S. Matsoukas and R. Schwartz. Discriminatively trained region-dependent transform for speech recognition. In Proceedings of ICASSP, 2006.
- Nominated for the Student Paper Award
- Awarded the Spoken Language Processing Grant by the IEEE Signal Processing Society

- B. Zhang, S. Matsoukas and R. Schwartz. Recent progress on the discriminative region-dependent transform for speech feature extraction. In Proceedings of ICSLP, 2006.