Segmental HMMs: Modelling Dynamics and Underlying Structure for Automatic Speech Recognition

Wendy Holmes, 20/20 Speech Limited, UK (a DERA/NXT Joint Venture)

Overview
• Hidden Markov models (HMMs): advantages and limitations
• Overcoming limitations with segment-based HMMs
• Modelling trajectories of acoustic features
• Theory of trajectory-based segmental HMMs
• Experimental investigations: comparing performance of different segmental HMMs
• Choice of parameters for trajectory modelling: recognition using formant trajectories
• A “unified” model for both recognition and synthesis
• Challenges and further issues

Typical speech spectral characteristics
[Figure: spectrogram with phone labels s i k s, th r ee, o ne]
• Each sound has particular spectral characteristics.
• Characteristics change continuously with time.
• Patterns of change give cues to phone identity.
• Spectrum includes speaker identity information.

Useful properties of HMMs
1. Appropriate general structure
  • Underlying Markov process allows for time-varying nature of utterances.
  • Probability distributions associated with states represent short-term spectral variability.
  • Can incorporate speech knowledge - e.g. context-dependent models, choice of features.
2. Tractable mathematical framework
  • Algorithms for automatically training model parameters from natural speech data.
  • Straightforward recognition algorithms.

Modelling observations with an HMM
[Figure: model states generating one observation per frame at times t, t+1 and t+2]

Conventional HMM assumptions
• Piece-wise stationarity: assume speech is produced by a piece-wise stationary process with instantaneous transitions between stationary states.
• Independence assumption: probability of an acoustic vector given a model state depends ONLY on the vector and the state. Assume no dependency between observations, other than through the state sequence.
• Duration model: state duration conforms to a geometric pdf (given by the self-loop transition probability).
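
The geometric duration model implied by a self-loop can be made concrete with a few lines of code. This is a minimal sketch, not taken from the slides: the function name and the self-loop value of 0.8 are illustrative assumptions. It shows that a one-frame duration is always the most probable, whatever the self-loop probability.

```python
import numpy as np

def geometric_duration_pmf(self_loop_prob, max_duration):
    """P(d) = a**(d-1) * (1 - a) for d = 1..max_duration, with a the self-loop probability."""
    d = np.arange(1, max_duration + 1)
    return self_loop_prob ** (d - 1) * (1.0 - self_loop_prob)

# Hypothetical self-loop probability, chosen only for illustration.
a = 0.8
pmf = geometric_duration_pmf(a, max_duration=10)

# Mean of the (untruncated) geometric distribution, in frames.
mean_duration = 1.0 / (1.0 - a)
```

The probabilities fall monotonically from d = 1, so the model can never prefer a typical duration of several frames over a single frame.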

Limitations of HMM assumptions
• Speech production is not a piece-wise stationary process, but a continuous one.
• Changes are mostly smoothly time varying.
• Constraints of articulation are such that any one frame of speech is highly correlated with previous and following frames.
• Time derivatives capture correlation to some extent - but not within the model.
• Long-term correlations (e.g. speaker identity) are not represented.
• Speech sounds have a typical duration, with shorter and longer durations being less likely, and limitations on maximum duration.

Addressing HMM limitations
AIMS WERE TO:
• retain advantages of HMMs:
  • automatic and tractable algorithms for training models on large quantities of speech data;
  • manageable recognition algorithms (principle of dynamic programming).
• improve the underlying model structure to address HMM shortcomings as models of speech.
ACHIEVING THE AIMS:
• Associate states with sequences of feature vectors => SEGMENTAL HMMS

Modelling observations with Segmental HMMs
[Figure: successive segments starting at time t (d=3), time t+3 (d=2) and time t+5 (d=5)]

Segmental HMMs
• Associate states with sequences of feature vectors, where these sequences can vary in duration.
• Each state is associated with a meaningful acoustic-phonetic event (phones or parts of phones).
• Can easily incorporate a realistic duration model.
• Enable the relationship between the frames comprising a segment to be modelled explicitly.
• Characterize dynamic behaviour during a segment.

Recognition calculations with HMMs
[Figure: state-time trellis over frame times 1 to 7]
• Compute most likely path through model (or sequence of models).
• Evaluate efficiently using dynamic programming (Viterbi algorithm).
• To compute the probability of emitting observations up to a given frame time, for any one state need only consider states which could be occupied at the previous frame.
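
The frame-by-frame recursion described here can be sketched as follows. This is a generic log-domain Viterbi implementation written for illustration; the two-state model and its probabilities are hypothetical values, not parameters from the talk.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state path.

    log_init: (N,) initial log probabilities
    log_trans: (N, N) log transition probabilities, from state i to state j
    log_emit: (T, N) log probability of each frame under each state
    """
    T, N = log_emit.shape
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        # cand[i, j]: best score in state i at frame t-1, followed by a move to j.
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(N)] + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score[-1]))

# Hypothetical two-state left-to-right model and four frames of emission probabilities.
with np.errstate(divide="ignore"):
    log_init = np.log(np.array([1.0, 0.0]))
    log_trans = np.log(np.array([[0.5, 0.5], [0.0, 1.0]]))
log_emit = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]))

path, log_prob = viterbi(log_init, log_trans, log_emit)
```

Only the scores at frame t-1 are needed to update frame t, which is what keeps the standard HMM search cheap.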

Segmental HMM recognition calculation
[Figure: state-time trellis over frame times 1 to 7]
• Principle of dynamic programming still applies.
• BUT, it is more complex and computationally intensive.
• For the probability in any one state at any given frame time t:
  • assume that the frame is the last frame of a segment
  • consider all possible segment durations from 1 to some maximum D
  • therefore, must consider all possible previous states at all possible previous frame times from t-1 back to t-D.
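
A sketch of this extended recursion is given below. It is an illustration under stated assumptions, not the implementation used in the experiments: the segment score is supplied as a callback, the toy data, state means and transition probabilities are hypothetical, and duration probabilities are taken to be uniform (and so omitted).

```python
import numpy as np

def segmental_viterbi(T, n_states, max_dur, seg_log_prob, log_trans, log_init):
    """best[t, j]: best log score for frames 0..t-1 with a segment of state j ending at frame t-1."""
    best = np.full((T + 1, n_states), -np.inf)
    back = {}
    for t in range(1, T + 1):
        for j in range(n_states):
            # Consider every duration d for a segment ending at frame t-1.
            for d in range(1, min(max_dur, t) + 1):
                start = t - d
                seg = seg_log_prob(j, start, t)  # log P(y[start:t] | state j)
                if start == 0:
                    cand, prev = log_init[j] + seg, None
                else:
                    prev_scores = best[start] + log_trans[:, j]
                    i = int(np.argmax(prev_scores))
                    cand, prev = prev_scores[i] + seg, i
                if cand > best[t, j]:
                    best[t, j] = cand
                    back[(t, j)] = (start, prev)
    # Trace back the best sequence of (state, start, end) segments.
    j = int(np.argmax(best[T]))
    total = float(best[T, j])
    segments, t = [], T
    while t > 0:
        start, prev = back[(t, j)]
        segments.append((j, start, t))
        t, j = start, prev
    return segments[::-1], total

# Toy 1-D data and two hypothetical states with means 0 and 5 and unit variance.
y = np.array([0.0, 0.0, 0.0, 5.0, 5.0])
means = np.array([0.0, 5.0])

def seg_log_prob(j, start, end):
    resid = y[start:end] - means[j]
    return float(-0.5 * np.sum(resid ** 2 + np.log(2 * np.pi)))

log_init = np.log(np.array([0.5, 0.5]))
log_trans = np.log(np.full((2, 2), 0.5))
segments, total = segmental_viterbi(len(y), 2, 3, seg_log_prob, log_trans, log_init)
```

The extra loop over d is the source of the added cost: roughly D times as many candidate predecessors per state and frame as in the standard recursion.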

Trajectory-based segmental HMMs
[Figure: feature value plotted against time t, with a trajectory through the observations]
• Approximate the relation between successive feature vectors by some trajectory through feature space.
• Simple trajectory-based segmental HMM: associate a state with a single mean trajectory, in place of the (static) single mean value used for a standard HMM.

Segmental HMM probability calculations
• Generate observations independently, but conditioned on the trajectory.
• Aim to provide a constraining model of dynamics without requiring a complex model of correlations.
• BUT, the trajectory may be different for different utterances of the same sound.
• So, if a single trajectory is used to represent all examples of a given model unit, it will not be a very accurate representation for any one example.
• One possible solution is a mixture of trajectories, but this needs many components to capture all the different trajectories.
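
The first bullet can be illustrated with a small segment-probability function. This is a minimal sketch assuming one-dimensional features, a linear trajectory (the form the slides adopt later), a time axis centred on the segment mid-point, and a single Gaussian variance about the trajectory; all names and values are hypothetical.

```python
import numpy as np

def fixed_trajectory_log_prob(y, midpoint, slope, variance):
    """log P(y | trajectory): frames are independent Gaussians about a linear trajectory."""
    y = np.asarray(y, dtype=float)
    # Time index centred on the segment mid-point, e.g. [-1, 0, 1] for three frames.
    t = np.arange(len(y)) - (len(y) - 1) / 2.0
    resid = y - (midpoint + slope * t)
    return float(-0.5 * np.sum(resid ** 2 / variance + np.log(2 * np.pi * variance)))

# A rising segment scored against a matching linear trajectory and a static one.
segment = [1.0, 2.0, 3.0]
log_p_linear = fixed_trajectory_log_prob(segment, midpoint=2.0, slope=1.0, variance=0.5)
log_p_static = fixed_trajectory_log_prob(segment, midpoint=2.0, slope=0.0, variance=0.5)
```

With the slope set to zero the function reduces to the per-frame score of a standard HMM state, which is why the static trajectory scores the rising segment less well.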

Intra- and Extra-segmental variability
[Figure: feature value plotted against time t]
• Model feature dynamics across all segment examples by, in effect, a continuous mixture of trajectories.
• This is achieved by modelling separately:
  • extra-segmental variation (underlying trajectory)
  • intra-segmental variation (about trajectory)
=> Probabilistic-trajectory segmental HMMs

Comparing different models
[Figure: generating a sequence of 5 observations from HMM states with a standard HMM, a segmental HMM and a probabilistic-trajectory segmental HMM]

Probabilistic-trajectory segmental HMMs
[Figure: target trajectory over frames t = 1 to D, showing extra-segmental variability and intra-segmental variability]
• Parametric trajectory model and Gaussian distributions.
• Simple linear trajectory - characterized by mid-point and slope.
• For illustration, shown with slope = 0.

PTSHMM probability (general)
• A segment of observations is $y = y_0, \ldots, y_T$.
• Probability of y and trajectory f given state S is $P(y, f \mid S) = P(f \mid S)\, P(y \mid f, S)$, where $P(f \mid S)$ is the extra-segmental term and $P(y \mid f, S)$ is the intra-segmental term.
• Alternative segmental models:
  1. Define trajectory; model variation in trajectory.
  2. Fix trajectory and model observations - the HMM is the limiting case.

Linear Gaussian PTSHMM
• Linear trajectory: slope m and mid-point c.
• Joint probability of y and linear trajectory is $P(y, m, c \mid S) = P(m \mid S)\, P(c \mid S)\, \prod_{t=0}^{T} P(y_t \mid m, c, S)$, with one term each for the slope, the mid-point and the intra-segment variation.
• Gaussian distributions for slope, mid-point and intra-segment variation.
• To use the model in recognition, need to compute P(y|S), but the values of the trajectory parameters m and c are not known - they are “hidden” from the observer.

Hidden-trajectory probability calculation
• One possibility: estimate the location of the trajectory, and compute the probability for that trajectory.
• Used this approach in early work, but it suffers problems due to the difficulty of making an unbiased trajectory estimate.
• A better alternative is to allow for all possible locations of the trajectory by integrating out the unknown parameters.
• In the case of the linear model, the calculation is: $P(y \mid S) = \int\!\!\int P(y, m, c \mid S)\, dm\, dc$
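
For Gaussian distributions the integration over slope and mid-point has a closed form, because the segment is then jointly Gaussian. The following is a minimal sketch of that calculation, assuming one-dimensional features, independent Gaussian priors on mid-point and slope, and a time axis centred on the segment mid-point. The function and parameter names are hypothetical and this is not presented as the exact formulation used in the talk.

```python
import numpy as np

def ptshmm_log_prob(y, mu_c, var_c, mu_m, var_m, var_intra):
    """log P(y | S) with the mid-point c and slope m integrated out.

    Model: y_t = c + m * t' + e_t, with c ~ N(mu_c, var_c), m ~ N(mu_m, var_m),
    e_t ~ N(0, var_intra), and t' the frame index centred on the mid-point.
    Marginally, y ~ N(mu_c*1 + mu_m*t', var_intra*I + var_c*11' + var_m*t't'').
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(n) - (n - 1) / 2.0
    ones = np.ones(n)
    mean = mu_c * ones + mu_m * t
    cov = var_intra * np.eye(n) + var_c * np.outer(ones, ones) + var_m * np.outer(t, t)
    resid = y - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return float(-0.5 * (n * np.log(2 * np.pi) + logdet + quad))

# With no trajectory variability the model reduces to a fixed linear trajectory.
log_p_fixed = ptshmm_log_prob([1.0, 2.0, 3.0], mu_c=2.0, var_c=0.0,
                              mu_m=1.0, var_m=0.0, var_intra=0.5)

# Allowing mid-point variability lets a segment offset from the mean trajectory score better.
offset_segment = [2.0, 3.0, 4.0]
log_p_rigid = ptshmm_log_prob(offset_segment, 2.0, 0.0, 1.0, 0.0, 0.5)
log_p_flexible = ptshmm_log_prob(offset_segment, 2.0, 1.0, 1.0, 0.0, 0.5)
```

Setting `var_m` or `var_c` to zero removes the corresponding source of extra-segmental variability, so the same function also covers simpler models in which some parameters are fixed.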

Parameters of the linear PTSHMM
• Linear PTSHMM has five model parameters: mid-point mean and variance, slope mean and variance, and intra-segment variance.
• Simpler models arise as special cases, by fixing various parameters.
• If the trajectory slope is set to zero => “static” PTSHMM.
• If variability in the trajectory is prevented => “fixed-trajectory” SHMM.
• Fixed-trajectory static SHMM = standard HMM with explicit duration model.

Digit recognition experiments
• Speaker-independent connected-digit recognition.
• 8 mel cepstrum features + overall energy.
• Three-state monophone models.
• Segmental HMM maximum segment duration 10 frames (=> maximum phone duration = 300 ms).
• Compared probabilistic-trajectory SHMMs with fixed-trajectory SHMMs and with standard HMMs.
• Initialised all SHMMs from segmented training data (using HMM Viterbi alignment).
• Interested in acoustic-modelling aspects, so fixed all transition and duration probabilities to be equal.
• 5 training iterations.

Digit recognition results: simple SHMMs
                           % Sub.   % Del.   % Ins.   % Err.
Standard HMM                 6.2      1.5      0.9      8.6
Add duration constraint      5.2      0.7      0.7      6.6
Linear fixed trajectory      3.8      0.5      0.6      4.9
• Some benefit from simply imposing duration constraints by introducing the segmental structure (prevents “silly” segmentations).
• Further benefit from representing dynamics by incorporating a linear trajectory (one trajectory per model state).

Digit recognition results: static PTSHMMs
                            % Sub.   % Del.   % Ins.   % Err.
Static fixed SHMM             5.2      0.7      0.7      6.6
Static probabilistic SHMM     5.2      2.2      0.1      7.5
• For static models, no advantage from distinguishing between extra- and intra-segmental variability.

Digit recognition results: linear SHMMs
                                  % Sub.   % Del.   % Ins.   % Err.
Static fixed SHMM                   5.2      0.7      0.7      6.6
Linear fixed trajectory             3.8      0.5      0.6      4.9
Linear PTSHMM (slope var=0)         2.0      0.8      0.1      2.9
Linear PTSHMM (flexible slope)      4.9      4.0      0.1      9.0
• Some advantage for linear trajectory.
• Considerable further benefit from modelling variability in mid-point.
• But modelling variability in both mid-point and slope is detrimental to recognition performance.

Conclusions from digit experiments
Best trajectory model gives nearly 70% reduction in error-rate (2.9%) compared with standard HMMs (8.6% error-rate).
=> advantages from a trajectory-based segmental HMM which also incorporates the distinction between intra- and extra-segmental variability, but:
• Trajectory assumption must be reasonably accurate (advantage for linear but not for static models).
• Not beneficial to model variability in the slope parameter - possibly too variable between speakers, or too difficult to estimate reliably for short segments.
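
The arithmetic behind these figures can be checked directly from the substitution, deletion and insertion rates quoted in the digit results. The helper names below are illustrative only.

```python
def total_error(sub, dele, ins):
    """Total error rate (%) as the sum of substitution, deletion and insertion rates."""
    return sub + dele + ins

def relative_reduction(baseline, improved):
    """Relative error-rate reduction, in percent of the baseline error rate."""
    return 100.0 * (baseline - improved) / baseline

hmm_err = total_error(6.2, 1.5, 0.9)    # standard HMM
best_err = total_error(2.0, 0.8, 0.1)   # linear PTSHMM with slope variance fixed at 0
reduction = relative_reduction(hmm_err, best_err)
```

The relative reduction works out at a little over 66%, which the slide rounds to “nearly 70%”.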

Phonetic classification: TIMIT
• Training and recognition with given segment boundaries.
• Train on complete training set (male speakers), with classification on core test set.
• 12 mel cepstrum features + overall energy.
• Evaluated (constrained) linear PTSHMMs.
• Compared performance with standard-HMM performance for:
  • context-dependent (biphone) versus context-independent (monophone) models
  • feature set using only the mel cepstrum features versus one which also included time derivative features.

TIMIT classification results
• Improvement with linear PTSHMM is greatest for more accurate (context-dependent) models.
  => more benefit from modelling trajectories when not including different phonetic events in one model.
• Most advantage when not using delta features.
  => most benefit from modelling dynamics when not attempting to represent dynamics in the front-end.

Benefit of PTSHMMs for some different phone classes
Phone class                           No. examples   HMM % error   PTSHMM % error   % improvement
Fricatives (f v th dh s z sh hh)           710           41.7            38.9             6.8
Vowels (iy ih eh ae ah uw uh er)          1178           53.8            48.9             9.1
Semivowels and glides (l r y w)             97           39.2            33.2            15.4
Diphthongs (ey ay oy aw ow)                376           48.9            41.2            15.8
Stops (p t dx k b d g)                     566           56.7            54.8             3.4
Most benefit from linear PTSHMM for sounds characterised by continuous, smoothly changing dynamics.

Summary of findings
• Probabilistic-trajectory segmental HMMs can outperform standard HMMs and fixed-trajectory segmental HMMs.
• Separately modelling variability within/between segments is a powerful approach, provided that:
  • trajectory assumptions are appropriate (linear trajectory)
  • variability in the parameter can be usefully modelled (not useful to model variability in slope parameter with current approach).
• The models have been shown to give useful performance gains.

Issues of modelling speech dynamics
• Compare error rates on TIMIT task:
  • HMMs with time derivatives: 29.8%
  • best segmental HMM result WITHOUT time derivatives: 38.2%
  => time derivatives capture some aspects of dynamics not modelled in segmental HMMs.
• Time derivative features provide some measure of dynamics for every frame.
• Current segmental HMMs only model dynamics within a segment.

Modelling issues and questions (1)
• Choice of model unit (e.g. phone, diphone).
• How to model dynamics and continuity effects across segment boundaries, to represent dynamics throughout an utterance.
• How to model context effects (e.g. could define trajectories according to previous and following sounds - but this complicates search).
• How to define trajectories (e.g. linear or higher-order polynomial, versus a dynamical-system type model with filtered output of hidden states).

Modelling issues and questions (2)
• Incorporating a realistic duration model.
• How to model any systematic effects of duration on trajectory realisation - should reduce remaining variability in trajectories.
• How to model speaker-dependent effects and speaker continuity.
• How to deal with other systematic influences - e.g. speaker stress, speaking rate.
• Dealing with external influences - e.g. noise.
• Choice of features for trajectory modelling.

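Spectral representations (1)
• Typical wideband spectrogram - for display, compute the spectrum at frequent time intervals (e.g. 2 ms).
[Figure: wideband spectrogram with phone labels th r ee, s I x, s I x]
• Typical features for ASR: MFCCs computed from the FFT of 25 ms windows at 10 ms intervals.
[Figure: the same utterance analysed with 25 ms windows at 10 ms intervals]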
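
The fixed-window analysis described here starts by cutting the signal into overlapping frames and taking an FFT of each. The sketch below covers only that first stage, assuming a Hamming window, a hypothetical 8 kHz sample rate and a synthetic test tone; the mel filterbank and cosine transform needed to obtain MFCCs are omitted.

```python
import numpy as np

def frame_signal(signal, sample_rate, window_ms=25.0, step_ms=10.0):
    """Split a signal into overlapping, Hamming-windowed frames at fixed positions."""
    win = int(round(sample_rate * window_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    n_frames = 1 + (len(signal) - win) // step
    # idx[f, k] is the sample index of the k-th point in frame f.
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(win)

def power_spectra(frames):
    """Power spectrum of each frame from a real FFT."""
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

# Hypothetical input: one second of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 1000.0 * t)

frames = frame_signal(signal, sample_rate)   # 25 ms windows every 10 ms
spectra = power_spectra(frames)
```

At this sample rate a 25 ms window gives 40 Hz frequency resolution; shortening the window improves time resolution at the expense of frequency resolution.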

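Spectral representations (2)
• Using long windows at fixed positions blurs rapid events - stop bursts and rapid formant transitions.
• An alternative: use a shorter window “excitation synchronously”.
[Figure: excitation-synchronous short-window analysis, with phone labels th r ee, s I x, s I x]
• Compare with long fixed-window analysis.
[Figure: long fixed-window analysis of the same utterance]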

Standard HMM digit recognition experiments
• Compared excitation-synchronous analysis with fixed analysis for different window lengths.
• In all cases computed FFT then mel cepstrum.
• Shorter window gives lower frequency resolution, but the effect is not so great on the mel scale.
• Best fixed-window condition, 20 or 25 ms: 2.1% error (increased to 4.6% for a 5 ms window).
• Best synchronous-window condition, 10 ms: 1.9% error (only increased to 2.1% for a 5 ms window).
=> some advantage to capturing rapid events. But note that a short window may be a disadvantage for fricatives. Maybe combine different analyses?

Moving beyond cepstrum trajectories
• Start with spectral analysis: this must preserve all relevant information.
• But is it appropriate to then model trajectories directly in the spectral/cepstral domain?
• Motivation for modelling dynamics is from the nature of articulation, and its acoustic consequences.
  => should be modelling in a domain closer to articulation.
• One possibility is an articulatory description.
• Another option is formants - closely related to articulation but also to acoustics.

Problems with formant analysis
• Unambiguous formant labelling may not be possible from a single spectral cross-section (e.g. close formants may merge to give a single spectral peak).
• A formant may not be apparent in the spectrum (e.g. the formant is weakly excited, such as F1 in unvoiced sounds).
• NOT useful for certain distinctions where low amplitude is the main feature (e.g. identifying silence or weak fricatives).
=> difficult to identify formants independently from the recognition process, so not generally used as features for automatic speech recognition.

Estimating formant trajectories
[Figure: spectrogram with formant tracks and phone labels s i k s, th r ee, o ne]
• Where there is clear formant structure, F1, F2 and F3 can be identified.
• In voiceless fricatives, higher formant movements are usually continuous with those in adjacent vowels.
• For F1, arbitrarily connect between adjacent vowels.

Formant analysis method: John Holmes (Proc. EUROSPEECH’97)
• Aims to emulate human abilities:
  • ability to label single spectrum cross-sections
  • rely heavily on continuity over time
  • sometimes need knowledge of what is being said to disambiguate alternatives
• Two fundamental features of the method:
  • Outputs alternatives when uncertain (“delayed decisions”).
  • Notion of “confidence” in formant measurement: when formants cannot be estimated (e.g. during silence), confidence is low and the estimate is not useful for recognition => rely on other features (general spectrum shape).

Example of formant analyser output
[Figure: formant analyser output for the utterance “four seven”]
• Up to two sets of formants for each frame.
• Alternatives are in terms of sets - F1, F2, F3.
• Specified frame by frame, but are usually alternative trajectories.

Segmental HMM experiments
• Each segment model is associated with a linear trajectory.
• Model each phone by a sequence of one or more segments, e.g.:
  • monophthongal vowels, fricatives - 1 segment
  • diphthongs - sequence of 2 segments
  • aspirated voiceless stops - sequence of 3 segments.
• Set allowed minimum and maximum segment duration dependent on identity of phone segment (loose constraint).
• Incorporate confidence estimate (as a variance) in recognition calculations.
• Resolve formant alternatives based on probability.
• Use formants + low-order cepstrum features.
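
One plausible way to use a confidence estimate expressed as a variance is to add it to the model variance when scoring a formant observation. The slides do not spell out the calculation, so the sketch below is an assumption made for illustration, with hypothetical names and values, rather than the method used in these experiments.

```python
import numpy as np

def log_gauss_with_confidence(obs, model_mean, model_var, conf_var):
    """Gaussian log-likelihood with the measurement's confidence variance added to the model variance.

    Assumption for illustration: low confidence is expressed as a large conf_var.
    """
    total_var = model_var + conf_var
    return float(-0.5 * ((obs - model_mean) ** 2 / total_var + np.log(2 * np.pi * total_var)))

# Hypothetical F1 measurement (Hz) scored against two models with different means.
obs, model_var = 500.0, 100.0 ** 2
gap_confident = (log_gauss_with_confidence(obs, 500.0, model_var, 0.0)
                 - log_gauss_with_confidence(obs, 1500.0, model_var, 0.0))
gap_unsure = (log_gauss_with_confidence(obs, 500.0, model_var, 1e8)
              - log_gauss_with_confidence(obs, 1500.0, model_var, 1e8))
```

Under this assumption a low-confidence formant estimate barely discriminates between models, so the decision falls back on the other features, consistent with the behaviour described for the formant analyser.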

Some connected-digit recognition results
Word error rates                                      8 cep.   5 cep. + 3 for.
Standard-HMM baseline with 3 states per phone          3.5 %        2.5 %
Standard HMMs with variable state allocation           6.4 %        5.9 %
  • Performance drops when the new state allocation is introduced (total number of states about half that of baseline).
Introduce segment structure                            3.2 %        2.9 %
  • Need segment structure for good performance.
Introduce linear trajectory                            2.6 %        2.3 %
  • Some advantage from linear trajectory.
• Formants show small, but consistent, advantage.

Formant modelling
• Expressing a model in terms of formant dynamics offers:
  • Potential for modelling systematic effects in a meaningful way: e.g. speaker identity, speaker stress, speaking rate.
  • Potential for a constrained model for speech, which should be more robust to noise (assuming the noise is also modelled).
• BUT: analysis of formants separately from hypotheses about what is being said will always be prone to errors.
• FUTURE AIM: integrate formant analysis within the recognition scheme: provided the speech model is accurate, this should overcome any formant tracking errors.
• A good model for speech should be appropriate for synthesis as well as for recognition: a trajectory-based formant model offers this possibility.

A “unified” speech model: applied to coding

A simple coding scheme
• Demonstrate principles of coding using the same model for both recognition and synthesis.
• Model represents linear formant trajectories.
• Recognition: linear trajectory segmental HMMs of formant features.
• Synthesis: JSRU parallel-formant synthesizer.
• Coding is applied to analysed formant trajectories => relatively high bit-rate (up to about 1000 bits/s).
• Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.

Segment coding scheme overview

Speech coding results
[Audio demonstrations, natural and coded at about 600 bits/s: Speaker 1: digits; Speaker 2: digits; Speaker 3: digits; Speaker 1: ARM report]
Achievements of study: established the principle of using a formant trajectory model for both recognition and synthesis, including using information from recognition to assist in coding.
Future work: better quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.
