Segmental HMMs: Modelling Dynamics and Underlying Structure for Automatic Speech Recognition


Wendy Holmes

20/20 Speech Limited, UK

A DERA/NXT Joint Venture

Overview
  • Hidden Markov models (HMMs): advantages and limitations
  • Overcoming limitations with segment-based HMMs
  • Modelling trajectories of acoustic features
  • Theory of trajectory-based segmental HMMs
  • Experimental investigations: comparing performance of different segmental HMMs
  • Choice of parameters for trajectory modelling: recognition using formant trajectories
  • A “unified” model for both recognition and synthesis
  • Challenges and further issues
Typical speech spectral characteristics

[Figure: speech spectrogram with phone labels "s i k s th r ee o ne" (six three one)]

  • Each sound has particular spectral characteristics.
  • Characteristics change continuously with time.
  • Patterns of change give cues to phone identity.
  • Spectrum includes speaker identity information.
Useful properties of HMMs

1. Appropriate general structure

  • Underlying Markov process allows for time-varying nature of utterances.
  • Probability distributions associated with states represent short-term spectral variability.
  • Can incorporate speech knowledge - e.g. context-dependent models, choice of features.

2. Tractable mathematical framework

  • Algorithms for automatically training model parameters from natural speech data.
  • Straightforward recognition algorithms.
Conventional HMM assumptions
  • Piece-wise stationarity

Assume speech produced by piece-wise stationary process with instantaneous transitions between stationary states.

  • Independence Assumption

Probability of an acoustic vector given a model state depends ONLY on the vector and the state. Assume no dependency of observations, other than through the state sequence.

  • Duration model

State duration conforms to geometric pdf (given by self-loop transition probability).
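
To see what the geometric duration assumption implies, here is a minimal Python sketch. The function name and the self-loop probability of 0.8 are illustrative assumptions, not values from the slides.

```python
def geometric_duration_pmf(self_loop_prob, max_duration):
    """P(d) = a**(d-1) * (1 - a): stay d-1 times with prob. a, then leave."""
    a = self_loop_prob
    return [a ** (d - 1) * (1.0 - a) for d in range(1, max_duration + 1)]

# Illustrative self-loop probability (an assumption, not a value from the slides).
pmf = geometric_duration_pmf(0.8, 10)
most_likely_duration = 1 + pmf.index(max(pmf))
```

Whatever the self-loop probability, the most likely duration under this model is a single frame, and every longer duration is less likely than the one before it.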

Limitations of HMM assumptions
  • Speech production is not a piece-wise stationary process, but a continuous one.
  • Changes are mostly smoothly time varying.
  • Constraints of articulation are such that any one frame of speech is highly correlated with previous and following frames.
  • Time derivatives capture correlation to some extent - but not within the model.
  • Long-term correlations, e.g. speaker identity.
  • Speech sounds have a typical duration, with shorter and longer durations being less likely, and limitations on maximum duration.
Addressing HMM limitations


  • retain advantages of HMMs:
    • automatic and tractable algorithms for training models on large quantities of speech data;
    • manageable recognition algorithms (principle of dynamic programming).
  • improve the underlying model structure to address HMM shortcomings as models of speech.


  • Associate states with sequences of feature vectors


Segmental HMMs
  • Associate states with sequences of feature vectors, where these sequences can vary in duration.
  • Each state is associated with meaningful acoustic-phonetic event (phones or parts of phones).
  • Can easily incorporate realistic duration model.
  • Enable relationship between frames comprising a segment to be modelled explicitly.
  • Characterize dynamic behaviour during a segment.

Recognition calculations with HMMs
  • Compute most likely path through model (or sequence of models).
  • Evaluate efficiently using dynamic programming (Viterbi algorithm).
  • To compute probability of emitting observations up to a given frame time, for any one state need only consider states which could be occupied at previous frame.
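
The frame-by-frame recursion described above can be sketched in plain Python. This is a minimal illustration in the log domain, not the implementation used in the work; the two-state model and its scores are made-up values.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Most likely state path for a standard HMM (all scores are log probabilities).

    log_init[j]: start in state j; log_trans[i][j]: move from i to j;
    log_emit[t][j]: emit frame t from state j.
    """
    n_frames, n_states = len(log_emit), len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    back = []
    for t in range(1, n_frames):
        # For each state, only the states occupied at the previous frame matter.
        prev = [max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
                for j in range(n_states)]
        score = [score[prev[j]] + log_trans[prev[j]][j] + log_emit[t][j]
                 for j in range(n_states)]
        back.append(prev)
    state = max(range(n_states), key=lambda j: score[j])
    best = score[state]
    path = [state]
    for prev in reversed(back):  # trace the best path backwards
        state = prev[state]
        path.append(state)
    path.reverse()
    return path, best

# Hypothetical two-state left-to-right model and four frames of emission scores.
NEG_INF = -math.inf
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_emit = [[-1.0, -5.0], [-1.0, -5.0], [-5.0, -1.0], [-5.0, -1.0]]
path, best = viterbi(log_init, log_trans, log_emit)
```

The work per frame depends only on the number of states, which is what keeps standard HMM recognition cheap.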

Segmental HMM recognition calculation
  • Principle of dynamic programming still applies.
  • BUT, is more complex and computationally intensive.
  • For probability in any one state at any given frame time:
    • assume that the frame represents the last frame of a segment
    • consider all possible segment durations from 1 to some maximum D
    • therefore, must consider all possible previous states at all possible previous frame times from t-1 back to t-D.
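
The extra loop over segment durations can be sketched as follows. This is a minimal illustration under stated assumptions: `seg_log_prob(state, start, end)` is a hypothetical callback that scores frames `start..end-1` as one segment, and the toy model at the bottom (squared-error scores, two states) is invented for the example.

```python
import math

def segmental_viterbi(n_frames, log_init, log_trans, seg_log_prob, max_dur):
    """Best segmentation when each state emits a whole segment of 1..max_dur frames."""
    n_states = len(log_init)
    # best[t][j]: best log score of frames 0..t-1 with a segment of state j ending at t-1
    best = [[-math.inf] * n_states for _ in range(n_frames + 1)]
    back = [[None] * n_states for _ in range(n_frames + 1)]
    for t in range(1, n_frames + 1):
        for j in range(n_states):
            for d in range(1, min(max_dur, t) + 1):  # all allowed durations
                start = t - d
                seg = seg_log_prob(j, start, t)
                if start == 0:
                    cands = [(log_init[j] + seg, None)]
                else:  # all possible previous states at frame time t-d
                    cands = [(best[start][i] + log_trans[i][j] + seg, i)
                             for i in range(n_states)]
                score, prev = max(cands, key=lambda c: c[0])
                if score > best[t][j]:
                    best[t][j] = score
                    back[t][j] = (start, prev)
    j = max(range(n_states), key=lambda s: best[n_frames][s])
    total = best[n_frames][j]
    segments, t = [], n_frames
    while t > 0:  # trace back segment by segment
        start, prev = back[t][j]
        segments.append((j, start, t))
        t, j = start, prev
    segments.reverse()
    return total, segments

# Toy example: state 0 expects values near 0, state 1 values near 5.
obs = [0.0, 0.0, 0.0, 5.0, 5.0]
means = [0.0, 5.0]

def toy_seg_log_prob(state, start, end):
    return -sum((x - means[state]) ** 2 for x in obs[start:end])

NEG_INF = -math.inf
total, segments = segmental_viterbi(
    len(obs), [0.0, NEG_INF], [[NEG_INF, 0.0], [NEG_INF, NEG_INF]],
    toy_seg_log_prob, max_dur=3)
```

Compared with the frame-level recursion, each cell now costs a factor of up to D more, plus the cost of scoring each candidate segment.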



Trajectory-based segmental HMMs
  • Approximate relation between successive feature vectors by some trajectory through feature space.
  • Simple trajectory-based segmental HMM: associate a state with a single mean trajectory, in place of (static) single mean value used for a standard HMM.
Segmental HMM probability calculations
  • Generate observations independently, but conditioned on the trajectory.
  • Aim to provide constraining model of dynamics without requiring a complex model of correlations.
  • BUT, trajectory may be different for different utterances of the same sound.
  • So, if a single trajectory is used to represent all examples of a given model unit, will not be a very accurate representation for any one example.
  • One possible solution is a mixture of trajectories, but needs many components to capture all different trajectories.



Intra- and Extra-segmental variability
  • Model feature dynamics across all segment examples by, in effect, a continuous mixture of trajectories.
  • This is achieved by modelling separately:
    • extra-segmental variation (underlying trajectory)
    • intra-segmental variation (about trajectory)

=> Probabilistic-trajectory segmental HMMs

Comparing different models

Generating a sequence of 5 observations

[Figure: three panels, labelled "segmental HMM", "standard HMM" and "segmental HMM"]

Probabilistic-trajectory segmental HMMs

[Figure: trajectory diagram with the label "extra-segmental variability"]
  • Parametric trajectory model and Gaussian distributions.
  • Simple linear trajectory - characterized by mid-point and slope.
  • For illustration show with slope=0.
PTSHMM probability (general)
  • A segment of observations is y = y0,...,yT.
  • Probability of y and trajectory f given state S is
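
The formula from the slide is not preserved in this transcript. A form consistent with the description elsewhere in the talk (observations generated independently, but conditioned on the trajectory) would be the following sketch. It ignores any duration term, and writing the trajectory value at frame t as f(t) is a notational assumption.

```latex
P(\mathbf{y}, f \mid S) \;=\; P(f \mid S)\,\prod_{t=0}^{T} P\bigl(y_t \mid f(t), S\bigr)
```

The first factor carries the extra-segmental variability (which trajectory was chosen) and the product carries the intra-segmental variability (how the frames scatter about it).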


Alternative segmental models:

1. Define trajectory; model variation in trajectory

2. Fix trajectory and model observations - HMM is limiting case:

Linear Gaussian PTSHMM


  • Linear trajectory: slope m and mid-point c.
  • Gaussian distributions for the slope, the mid-point and the intra-segment variation.
  • To use the model in recognition, need to compute P(y|S), but the values of the trajectory parameters m and c are not known - they are “hidden” from the observer.
  • Joint probability of y and linear trajectory is:
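
The slide's formula is not preserved here. Under the Gaussian assumptions just listed, and for a single scalar feature, one plausible form is sketched below. The symbols for the means and variances, and the convention that time is measured from the segment centre T/2 so that c is the mid-point value, are assumptions made for illustration.

```latex
P(\mathbf{y}, m, c \mid S) \;=\;
  \mathcal{N}\bigl(m;\, \mu_m, \sigma_m^2\bigr)\,
  \mathcal{N}\bigl(c;\, \mu_c, \sigma_c^2\bigr)\,
  \prod_{t=0}^{T} \mathcal{N}\bigl(y_t;\; c + m\,(t - T/2),\; \sigma_w^2\bigr)
```

Here the first two factors describe the slope and mid-point distributions, and \sigma_w^2 is the intra-segment variance.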
Hidden-trajectory probability calculation
  • One possibility: estimate the location of the trajectory, and compute the probability for that trajectory.
  • Used this approach in early work, but suffers problems due to difficulty in making unbiased trajectory estimate.
  • A better alternative is to allow for all possible locations of the trajectory by integrating out the unknown parameters.
  • In the case of the linear model, the calculation is:
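
The integral itself is not preserved in this transcript. For a single scalar feature with Gaussian slope, mid-point and intra-segment distributions, integrating out m and c leaves a joint Gaussian over the frames of the segment, which can be evaluated directly. The sketch below relies on that standard result; the parameter names are assumptions, time is measured from the segment centre, and it requires a non-zero intra-segment variance.

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def linear_ptshmm_log_prob(y, mu_c, var_c, mu_m, var_m, var_intra):
    """log P(y | S) for one scalar feature, with slope m and mid-point c integrated out."""
    n = len(y)
    # Time is measured from the segment centre so that c is the mid-point value.
    times = [t - (n - 1) / 2.0 for t in range(n)]
    mean = [mu_c + mu_m * t for t in times]
    # Marginal covariance: shared mid-point and slope terms plus per-frame noise.
    cov = [[var_c + var_m * times[i] * times[j] + (var_intra if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    # Forward substitution: solve low @ z = y - mean.
    z = []
    for i in range(n):
        r = y[i] - mean[i] - sum(low[i][k] * z[k] for k in range(i))
        z.append(r / low[i][i])
    half_log_det = sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * sum(v * v for v in z) - half_log_det - 0.5 * n * math.log(2.0 * math.pi)

# Illustrative values only (not from the slides).
log_p = linear_ptshmm_log_prob([0.1, 0.9, 2.2], mu_c=1.0, var_c=0.5,
                               mu_m=1.0, var_m=0.1, var_intra=0.2)
```

Setting `var_m` and `var_c` to zero reduces the covariance to a diagonal, which corresponds to a fixed trajectory with independent frames.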
Parameters of the linear PTSHMM
  • Linear PTSHMM has five model parameters: mid-point mean and variance, slope mean and variance, and intra-segment variance.

  • Simpler models arise as special cases, by fixing various parameters.
  • If trajectory slope is set to zero => “static” PTSHMM.
  • If prevent variability in trajectory => “fixed-trajectory” SHMM.

  • Fixed-trajectory static SHMM = standard HMM with explicit duration model.
Digit recognition experiments
  • Speaker-independent connected-digit recognition
  • 8 mel cepstrum features + overall energy
  • three-state monophone models
  • Segmental HMM max. segment dur. 10 frames (=> with three states per phone, maximum phone duration = 30 frames = 300 ms).

  • Compared probabilistic-trajectory SHMMs with fixed-trajectory SHMMs and with standard HMMs.
  • Initialised all SHMMs from segmented training data (using HMM Viterbi alignment).
  • Interested in acoustic-modelling aspects, so fixed all transition and duration probabilities to be equal.
  • 5 training iterations.
Digit recognition results: simple SHMMs

                            % Sub.   % Del.   % Ins.   % Err.
Standard HMM                  6.2      1.5      0.9      8.6
Add duration constraint       5.2      0.7      0.7      6.6
Linear fixed trajectory       3.8      0.5      0.6      4.9

  • Some benefit from simply imposing duration constraints by introducing the segmental structure (prevents “silly” segmentations).
  • Further benefit from representing dynamics by incorporating linear trajectory (one trajectory per model state).
Digit recognition results: static PTSHMMs

                            % Sub.   % Del.   % Ins.   % Err.
Static fixed SHMM             5.2      0.7      0.7      6.6
Static probabilistic SHMM     5.2      2.2      0.1      7.5

  • For static models, no advantage from distinguishing between extra- and intra-segmental variability.
Digit recognition results: linear SHMMs

                                  % Sub.   % Del.   % Ins.   % Err.
Static fixed SHMM                   5.2      0.7      0.7      6.6
Linear fixed trajectory             3.8      0.5      0.6      4.9
Linear PTSHMM (slope var=0)         2.0      0.8      0.1      2.9
Linear PTSHMM (flexible slope)      4.9      4.0      0.1      9.0

  • Some advantage for linear trajectory.
  • Considerable further benefit from modelling variability in mid-point.
  • But modelling variability in both mid-point and slope is detrimental to recognition performance.
Conclusions from digit experiments

Best trajectory model gives nearly 70% reduction in error rate (2.9%) compared with standard HMMs (8.6% error rate).

=> advantages from trajectory-based segmental HMM which also incorporates distinction between intra- and extra-segmental variability, but:

  • Trajectory assumption must be reasonably accurate (advantage for linear but not for static models).
  • Not beneficial to model variability in slope parameter - possibly too variable between speakers, or too difficult to estimate reliably for short segments.
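
As a check on the headline figure for this slide, the relative reduction can be computed from the two quoted error rates. The helper name is hypothetical.

```python
def relative_error_reduction(baseline_pct, new_pct):
    """Relative reduction in error rate, as a percentage of the baseline."""
    return 100.0 * (baseline_pct - new_pct) / baseline_pct

# Standard HMM (8.6%) versus best linear PTSHMM (2.9%), as quoted on the slide.
reduction = relative_error_reduction(8.6, 2.9)
print(f"{reduction:.1f}% relative reduction")  # prints "66.3% relative reduction"
```

This works out at roughly two-thirds, in line with the "nearly 70%" quoted.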
Phonetic classification: TIMIT
  • Training and recognition with given segment boundaries.
  • Train on complete training set (male speakers), with classification on core test set.
  • 12 mel cepstrum features + overall energy.
  • Evaluated (constrained) linear PTSHMMs.
  • Compared performance with standard-HMM performance for:
    • context-dependent (biphone) versus context-independent (monophone) models
    • feature set using only the mel cepstrum features versus one which also included time derivative features.
TIMIT classification results
  • Improvement with linear PTSHMM is greatest for more accurate (context-dependent) models.

=> more benefit from modelling trajectories when not including different phonetic events in one model.

  • Most advantage when not using delta features.

=> most benefit from modelling dynamics when not attempting to represent dynamics in front-end.

Benefit of PTSHMMs for some different phone classes

                                      No. of     HMM       PTSHMM    %
                                      examples   % error   % error   improvement
Fricatives (f v th dh s z sh hh)         710      41.7      38.9        6.8
Vowels (iy ih eh ae ah uw uh er)        1178      53.8      48.9        9.1
Semivowels and glides (l r y w)           97      39.2      33.2       15.4
Diphthongs (ey ay oy aw ow)              376      48.9      41.2       15.8
Stops (p t dx k b d g)                   566      56.7      54.8        3.4

Most benefit from linear PTSHMM for sounds characterised by continuous, smoothly changing dynamics.

Summary of findings
  • Probabilistic-trajectory segmental HMMs can outperform standard HMMs and fixed-trajectory segmental HMMs.
  • Separately modelling variability within/between segments is a powerful approach, provided that:
    • trajectory assumptions are appropriate (linear trajectory)
    • variability in the parameter can be usefully modelled (not useful to model variability in slope parameter with current approach).
  • The models have been shown to give useful performance gains.
Issues of modelling speech dynamics

Compare error rates on TIMIT task:

  • HMMs with time derivatives: 29.8%
  • best segmental HMM result WITHOUT time derivatives: 38.2%.

=> time derivatives capture some aspects of dynamics not modelled in segmental HMMs.

  • Time derivative features provide some measure of dynamics for every frame.
  • Current segmental HMMs only model dynamics within a segment.
Modelling issues and questions (1)
  • Choice of model unit (e.g. phone, diphone)
  • How to model dynamics and continuity effects across segment boundaries, to represent dynamics throughout an utterance.
  • How to model context effects. (e.g. could define trajectories according to previous and following sounds - but complicates search)
  • How to define trajectories. (e.g. linear or higher-order polynomial; versus dynamical-system type model with filtered output of hidden states)
Modelling issues and questions (2)
  • Incorporating a realistic duration model.
  • How to model any systematic effects of duration on trajectory realisation - should reduce remaining variability in trajectories.
  • How to model speaker-dependent effects and speaker continuity.
  • How to deal with other systematic influences - e.g. speaker stress, speaking rate.
  • Dealing with external influences - e.g. noise.
  • Choice of features for trajectory modelling.
Spectral representations (1)
  • Typical wideband spectrogram - for display compute spectrum at frequent time intervals (e.g. 2 ms)

[Figure: wideband spectrogram with phone labels "th r ee s I x s I x" (three six six)]

  • Typical features for ASR: MFCCs computed from FFT of 25 ms windows at 10 ms intervals.
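
To make the 25 ms / 10 ms framing concrete, here is a small sketch that lists where each analysis window starts. The 16 kHz sample rate is an assumption for illustration; the slides do not state one.

```python
def frame_starts(n_samples, sample_rate, window_ms=25.0, hop_ms=10.0):
    """Start sample of every full analysis window (fixed-position framing)."""
    win = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    if n_samples < win:
        return []
    return list(range(0, n_samples - win + 1, hop))

# One second of audio at an assumed 16 kHz sample rate.
starts = frame_starts(16000, 16000)
n_frames = len(starts)
```

Each window spans 400 samples and successive windows overlap by 240 samples, so any event shorter than the window is averaged together with its surroundings.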
Spectral representations (2)
  • Using long windows at fixed positions blurs rapid events - stop bursts and rapid formant transitions.
  • An alternative: use a shorter window “excitation synchronously”:

[Figure: excitation-synchronous spectrogram with phone labels "th r ee s I x s I x" (three six six)]

  • Compare with long fixed-window analysis:
Standard HMM digit recognition experiments
  • Compared excitation-synchronous analysis with fixed analysis for different window lengths.
  • In all cases computed FFT then mel cepstrum.
  • Shorter window gives lower frequency resolution, but effect is not so great on mel scale.
  • Best fixed-window condition 20 or 25 ms: 2.1% err. (increased to 4.6% for a 5 ms window).
  • Best synchronous-window condition 10 ms: 1.9% err., but only increased to 2.1% for a 5 ms window.

=> some advantage to capturing rapid events. But note short window may be disadvantage for fricatives.

Maybe combine different analyses?

Moving beyond cepstrum trajectories
  • Start with spectral analysis: this must preserve all relevant information.
  • But is it appropriate to then model trajectories directly in the spectral/cepstral domain?
  • Motivation for modelling dynamics is from nature of articulation, and its acoustic consequences.

=> should be modelling in domain closer to articulation.

  • One possibility is an articulatory description.
  • Another option is formants - closely related to articulation but also to acoustics.
Problems with formant analysis
  • Unambiguous formant labelling may not be possible from a single spectral cross-section.

e.g. close formants may merge to give single spectral peak

  • A formant may not be apparent in the spectrum.

e.g. formant is weakly excited (F1 in unvoiced sounds).

  • NOT useful for certain distinctions, where low amplitude is the main feature.

e.g. identifying silence or weak fricatives.

=> difficult to identify formants independently from recognition process, so not generally used as features for automatic speech recognition.

Estimating formant trajectories

[Figure: spectrogram with formant tracks and phone labels "s i k s th r ee o ne" (six three one)]

  • Where see clear formant structure, F1, F2 and F3 can be identified.
  • In voiceless fricatives, higher formant movements are usually continuous with those in adjacent vowels.
  • For F1, arbitrarily connect between adjacent vowels.
Formant analysis method: John Holmes (Proc. EUROSPEECH’97)
  • Aims to emulate human abilities:
    • ability to label single spectrum cross-sections
    • rely heavily on continuity over time
    • sometimes need knowledge of what is being said to disambiguate alternatives
  • Two fundamental features of the method:
    • outputs alternatives when uncertain (“delayed decisions”).
    • Notion of “confidence” in formant measurement

when formants cannot be estimated (e.g. during silence), confidence is low and estimate not useful for recognition

=> rely on other features (general spectrum shape).

Example of formant analyser output
  • Up to two sets of formants for each frame.
  • Alternatives are in terms of sets - F1, F2, F3.
  • Specified frame by frame, but are usually alternative trajectories.

[Figure: formant analyser output for the utterance “four seven”]

Segmental HMM experiments
  • Each segment model is associated with a linear trajectory.
  • Model each phone by a sequence of one or more segments.

e.g. monophthongal vowels, fricatives - 1 segment

diphthongs - sequence of 2 segments

aspirated voiceless stops - sequence of 3 segments.

  • Set allowed minimum and maximum segment duration dependent on identity of phone segment (loose constraint).
  • Incorporate confidence estimate (as a variance) in recognition calculations.
  • Resolve formant alternatives based on probability.
  • Use formants + low-order cepstrum features.
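
The slides do not say how the confidence estimate enters the calculation beyond "as a variance". One simple reading, shown here purely as an assumption, is to add the analyser's uncertainty to the model variance for the formant feature, so that unreliable measurements discriminate less between competing models. The 500 Hz and 1500 Hz model means and the variances are made-up values.

```python
import math

def formant_log_likelihood(measured_hz, model_mean_hz, model_var, confidence_var):
    """Gaussian log likelihood with the analyser's uncertainty added to the model variance."""
    total_var = model_var + confidence_var  # assumption: the two variances simply add
    diff = measured_hz - model_mean_hz
    return -0.5 * (math.log(2.0 * math.pi * total_var) + diff * diff / total_var)

def preference(measured_hz, confidence_var):
    """Log-likelihood margin between two hypothetical models (means 500 Hz and 1500 Hz)."""
    model_var = 100.0 ** 2
    return (formant_log_likelihood(measured_hz, 500.0, model_var, confidence_var)
            - formant_log_likelihood(measured_hz, 1500.0, model_var, confidence_var))

confident_margin = preference(500.0, confidence_var=0.0)
unsure_margin = preference(500.0, confidence_var=1000.0 ** 2)
```

With a confident measurement the first model is strongly preferred; with a very uncertain one the margin nearly vanishes, leaving the decision to the other features such as general spectrum shape.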
Some connected-digit recognition results

Word error rates

                                                  8 cep.    5 cep. + 3 for.
Standard-HMM baseline with 3 states per phone      3.5%        2.5%
Standard HMMs with variable state allocation       6.4%        5.9%
Introduce segment structure                        3.2%        2.9%
Introduce linear trajectory                        2.6%        2.3%

  • Performance drops when introduce new state allocation (total number of states about half that of baseline).
  • Need segment structure for good performance.
  • Some advantage from linear trajectory.
  • Formants show small, but consistent, advantage.
Formant modelling
  • Expressing a model in terms of formant dynamics offers:
    • Potential for modelling systematic effects in a meaningful way: e.g. speaker identity, speaker stress, speaking rate.
    • Potential for a constrained model for speech, which should be more robust to noise (assuming also model the noise).
  • BUT: analysis of formants separately from hypotheses about what is being said will always be prone to errors.
  • FUTURE AIM: integrate formant analysis within recognition scheme: provided speech model is accurate, this should overcome any formant tracking errors.
  • A good model for speech should be appropriate for synthesis as well as for recognition: a trajectory-based formant model offers this possibility.
A simple coding scheme
  • Demonstrate principles of coding using same model for both recognition and synthesis.
  • Model represents linear formant trajectories.
  • Recognition: linear trajectory segmental HMMs of formant features.
  • Synthesis: JSRU parallel-formant synthesizer.
  • Coding is applied to analysed formant trajectories

=> relatively high bit-rate (up to about 1000 bits/s).

  • Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.
Speech coding results

Coded at about 600 bps

[Audio examples, listed twice on the slide: Speaker 1: digits; Speaker 2: digits; Speaker 3: digits; Speaker 1: ARM report]

Achievements of study: Established principle of using formant trajectory model for both recognition and synthesis, including using information from recognition to assist in coding.

Future work: better quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.