
Segmental HMMs: Modelling Dynamics and Underlying Structure for Automatic Speech Recognition

Wendy Holmes

20/20 Speech Limited, UK

A DERA/NXT Joint Venture


Overview

  • Hidden Markov models (HMMs): advantages and limitations

  • Overcoming limitations with segment-based HMMs

  • Modelling trajectories of acoustic features

  • Theory of trajectory-based segmental HMMs

  • Experimental investigations: comparing performance of different segmental HMMs

  • Choice of parameters for trajectory modelling: recognition using formant trajectories

  • A “unified” model for both recognition and synthesis

  • Challenges and further issues


Typical speech spectral characteristics

[Spectrogram of the utterance "six three one", annotated with phone labels: s i k s | th r ee | o ne.]

  • Each sound has particular spectral characteristics.

  • Characteristics change continuously with time.

  • Patterns of change give cues to phone identity.

  • Spectrum includes speaker identity information.


Useful properties of HMMs

1. Appropriate general structure

  • Underlying Markov process allows for time-varying nature of utterances.

  • Probability distributions associated with states represent short-term spectral variability.

  • Can incorporate speech knowledge - e.g. context-dependent models, choice of features.

    2. Tractable mathematical framework

  • Algorithms for automatically training model parameters from natural speech data.

  • Straightforward recognition algorithms.


Modelling observations with an HMM

[Diagram: model states emitting one observation per frame, at times t, t+1, t+2.]


Conventional HMM assumptions

  • Piece-wise stationarity

    Assume speech produced by piece-wise stationary process with instantaneous transitions between stationary states.

  • Independence Assumption

    Probability of an acoustic vector given a model state depends ONLY on the vector and the state. Assume no dependency of observations, other than through the state sequence.

  • Duration model

    State duration conforms to geometric pdf (given by self-loop transition probability).
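
As a worked example of this assumption: with self-loop probability a_ii, the state stays for d frames by looping d-1 times and then leaving once, so the implied duration pdf is

```latex
P(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, 3, \ldots
```

This is monotonically decreasing, so d = 1 is always the most likely duration - at odds with the typical durations of speech sounds (see the limitations below).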


Limitations of HMM assumptions

  • Speech production is not a piece-wise stationary process, but a continuous one.

  • Changes are mostly smoothly time varying.

  • Constraints of articulation are such that any one frame of speech is highly correlated with previous and following frames.

  • Time derivatives capture correlation to some extent - but not within the model.

  • Long-term correlations, e.g. speaker identity.

  • Speech sounds have a typical duration, with shorter and longer durations being less likely, and limitations on maximum duration.


Addressing HMM limitations

AIMS WERE TO:

  • retain advantages of HMMs:

    • automatic and tractable algorithms for training models on large quantities of speech data;

    • manageable recognition algorithms (principle of dynamic programming).

  • improve the underlying model structure to address HMM shortcomings as models of speech.

    ACHIEVING THE AIMS:

  • Associate states with sequences of feature vectors

    => SEGMENTAL HMMS


Modelling observations with Segmental HMMs

[Diagram: each state emits a variable-duration sequence of observations - e.g. duration d=3 starting at time t, d=2 at time t+3, d=5 at time t+5.]


Segmental HMMs

  • Associate states with sequences of feature vectors, where these sequences can vary in duration.

  • Each state is associated with meaningful acoustic-phonetic event (phones or parts of phones).

  • Can easily incorporate realistic duration model.

  • Enable relationship between frames comprising a segment to be modelled explicitly.

  • Characterize dynamic behaviour during a segment.


Recognition calculations with HMMs

[Diagram: trellis of model states against frame times 1-7.]

  • Compute most likely path through model (or sequence of models).

  • Evaluate efficiently using dynamic programming (Viterbi algorithm).

  • To compute the probability of emitting the observations up to a given frame time for any one state, need only consider the states which could be occupied at the previous frame (sketched below).
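
A minimal Viterbi sketch in Python, assuming arrays log_A (log transition probabilities) and log_B (per-frame log observation probabilities); these names are illustrative, not from the talk:

```python
import numpy as np

def viterbi(log_A, log_B):
    """Most likely state path. log_A[i, j]: log P(state j | state i);
    log_B[t, j]: log P(observation at frame t | state j)."""
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)   # best log score ending in state j at frame t
    psi = np.zeros((T, N), dtype=int)  # backpointer to the best previous state
    delta[0] = log_B[0]                # uniform initial-state probability assumed
    for t in range(1, T):
        for j in range(N):
            # only states occupied at the previous frame need be considered
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] + log_B[t, j]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):      # trace back through the backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```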


Segmental HMM recognition calculation

[Diagram: trellis of model states against frame times 1-7, with segments of varying duration.]

  • Principle of dynamic programming still applies.

  • BUT, is more complex and computationally intensive.

  • For the probability in any one state at any given frame time (see the sketch after this list):

    • assume that frame represents the last frame of a segment

    • consider all possible segment durations from 1 to some maximum D

    • therefore, must consider all possible previous states at all possible previous frame times from t-1 up to t-D.
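
The corresponding segmental recursion, sketched under the assumption of a whole-segment scoring function seg_logp(state, start, end) (a hypothetical name); note the extra loop over durations that makes the search more expensive:

```python
import numpy as np

def segmental_viterbi(T, N, log_A, seg_logp, D_MAX=10):
    """Best log score of emitting T frames; seg_logp(j, start, end) scores
    frames start..end-1 as one segment in state j (hypothetical function)."""
    delta = np.full((T + 1, N), -np.inf)
    delta[0] = 0.0                       # any state may start (simplification)
    for t in range(1, T + 1):
        for j in range(N):
            for d in range(1, min(D_MAX, t) + 1):
                # predecessor states at frame t-d, plus the new segment
                best_prev = np.max(delta[t - d] + log_A[:, j])
                score = best_prev + seg_logp(j, t - d, t)
                if score > delta[t, j]:
                    delta[t, j] = score
    return float(np.max(delta[T]))
```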


Trajectory-based segmental HMMs

[Diagram: feature value against time t, with a trajectory drawn through the observations.]

  • Approximate relation between successive feature vectors by some trajectory through feature space.

  • Simple trajectory-based segmental HMM: associate a state with a single mean trajectory, in place of the (static) single mean value used for a standard HMM (see the sketch below).
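
A minimal sketch of such a state's segment score, assuming a one-dimensional feature, a stored mid-point value and slope, and frames scored independently about the mean trajectory (names are illustrative):

```python
import numpy as np

def fixed_trajectory_logp(y, mid, slope, var):
    """log P(y | state) with frames independent about the mean trajectory."""
    D = len(y)
    t = np.arange(D) - (D - 1) / 2.0      # frame times relative to mid-point
    resid = y - (mid + slope * t)         # deviation from the trajectory
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + resid ** 2 / var)
```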


Segmental HMM probability calculations

  • Generate observations independently, but conditioned on the trajectory.

  • Aim to provide constraining model of dynamics without requiring a complex model of correlations.

  • BUT, trajectory may be different for different utterances of the same sound.

  • So, if a single trajectory is used to represent all examples of a given model unit, it will not be a very accurate representation of any one example.

  • One possible solution is a mixture of trajectories, but this needs many components to capture all the different trajectories.


Intra- and Extra-segmental variability

[Diagram: feature value against time t, showing a family of trajectories varying about an underlying trajectory.]

  • Model feature dynamics across all segment examples by, in effect, a continuous mixture of trajectories.

  • This is achieved by modelling separately:

    • extra-segmentalvariation (underlying trajectory)

    • intra-segmentalvariation (about trajectory)

      => Probabilistic-trajectory segmental HMMs
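
To make the two levels of variability concrete, here is a generative sketch in Python: one linear trajectory is drawn per segment (extra-segmental variation), then frames are scattered independently about it (intra-segmental variation). All parameter names and values are illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_segment(mu_c, var_c, mu_m, var_m, var_intra, duration):
    """Draw one segment: trajectory first, then frames about it."""
    c = rng.normal(mu_c, np.sqrt(var_c))   # extra-segmental: mid-point draw
    m = rng.normal(mu_m, np.sqrt(var_m))   # extra-segmental: slope draw
    t = np.arange(duration) - (duration - 1) / 2.0
    trajectory = c + m * t
    # intra-segmental: independent scatter about this utterance's trajectory
    return trajectory + rng.normal(0.0, np.sqrt(var_intra), size=duration)

# Two "utterances" of the same sound follow different underlying trajectories:
print(sample_segment(1.0, 0.05, 0.1, 0.01, 0.02, duration=6))
print(sample_segment(1.0, 0.05, 0.1, 0.01, 0.02, duration=6))
```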


Comparing different models

[Diagram: generating a sequence of 5 observations from the states of a standard HMM, a segmental HMM and a probabilistic-trajectory segmental HMM.]


Probabilistic-trajectory segmental HMMs

[Diagram: a target trajectory over frames 1 to D, with extra-segmental variability in the underlying trajectory and intra-segmental variability about it.]

  • Parametric trajectory model and Gaussian distributions.

  • Simple linear trajectory - characterized by its mid-point and slope.

  • For illustration, shown with slope = 0.


PTSHMM probability (general)

  • A segment of observations is y = y_0, ..., y_T.

  • The probability of y and a trajectory f given state S factorises as

    P(y, f | S) = P(f | S) · ∏_t P(y_t | f, S)

    where the first factor models extra-segmental variation and the product over frames models intra-segmental variation.

Alternative segmental models:

1. Define a trajectory, and model the variation in the trajectory itself.

2. Fix the trajectory and model the variation in the observations about it; the standard HMM is the limiting case (a static, fixed trajectory).

Linear Gaussian PTSHMM

  • Linear trajectory with slope m and mid-point c: the trajectory value at time t (measured from the segment mid-point) is c + m·t.

  • Gaussian distributions for the slope, the mid-point, and the intra-segment variation.

  • The joint probability of y and a linear trajectory is

    P(y, m, c | S) = N(m; μ_m, σ_m²) · N(c; μ_c, σ_c²) · ∏_t N(y_t; c + m·t, σ²)
                         (slope)          (mid-point)       (intra-segment)

  • To use the model in recognition, need to compute P(y|S) - but the values of the trajectory parameters m and c are not known; they are "hidden" from the observer.


Hidden-trajectory probability calculation

  • One possibility: estimate the location of the trajectory, and compute the probability for that trajectory.

  • This approach was used in early work, but it suffers from the difficulty of making an unbiased trajectory estimate.

  • A better alternative is to allow for all possible locations of the trajectory by integrating out the unknown parameters.

  • In the case of the linear model, the calculation is

    P(y | S) = ∫∫ N(m; μ_m, σ_m²) · N(c; μ_c, σ_c²) · ∏_t N(y_t; c + m·t, σ²) dm dc
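
Because everything here is linear-Gaussian, this double integral has a closed form: y is Gaussian with mean Xθ̄ and covariance XΣ_θXᵀ + σ²I, where θ = (c, m). A minimal sketch of that closed form, assuming one feature dimension and illustrative parameter names (not the paper's implementation):

```python
# Closed-form P(y|S) for a linear PTSHMM state: integrate out the hidden
# mid-point c and slope m of the trajectory analytically.
import numpy as np
from scipy.stats import multivariate_normal

def segment_log_likelihood(y, mu_c, var_c, mu_m, var_m, var_intra):
    """log P(y | S) with y = X @ (c, m) + noise, (c, m) Gaussian."""
    D = len(y)
    t = np.arange(D) - (D - 1) / 2.0        # frame times about the mid-point
    X = np.column_stack([np.ones(D), t])    # design matrix for c + m*t
    mean = X @ np.array([mu_c, mu_m])       # X times the trajectory means
    cov = X @ np.diag([var_c, var_m]) @ X.T + var_intra * np.eye(D)
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)
```

Setting var_m = 0 in this sketch gives the constrained "slope variance = 0" model that performs best in the digit experiments reported later.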


Parameters of the linear PTSHMM

  • Linear PTSHMM has five model parameters:

    mid-point mean and variance,

    slope mean and variance,

    and intra-segment variance.

  • Simpler models arise as special cases, by fixing various parameters.

  • If trajectory slope is set to zero

    => “static” PTSHMM.

  • If variability in the trajectory is prevented

    => “fixed-trajectory” SHMM.

  • Fixed-trajectory static SHMM = standard HMM with explicit duration model.


Digit recognition experiments

  • Speaker-independent connected-digit recognition

  • 8 mel cepstrum features + overall energy

  • three-state monophone models

  • Segmental HMM max. segment dur. 10 frames

    (=> maximum phone duration = 300 ms).

  • Compared probabilistic-trajectory SHMMs with fixed-trajectory SHMMs and with standard HMMs.

  • Initialised all SHMMs from segmented training data (using HMM Viterbi alignment).

  • Interested in acoustic-modelling aspects, so fixed all transition and duration probabilities to be equal.

  • 5 training iterations.


Digit recognition results: simple SHMMs

                          %Sub.   %Del.   %Ins.   %Err.
Standard HMM               6.2     1.5     0.9     8.6
Add duration constraint    5.2     0.7     0.7     6.6
Linear fixed trajectory    3.8     0.5     0.6     4.9

  • Some benefit from simply imposing duration constraints by introducing the segmental structure (prevents “silly” segmentations).

  • Further benefit from representing dynamics by incorporating linear trajectory (one trajectory per model state).


Digit recognition results: static PTSHMMs

                             %Sub.   %Del.   %Ins.   %Err.
Static fixed SHMM             5.2     0.2     0.7     6.6
Static probabilistic SHMM     5.2     2.2     0.1     7.5

  • For static models, no advantage from distinguishing between extra- and intra-segmental variability.


Digit recognition results: linear SHMMs

                                  %Sub.   %Del.   %Ins.   %Err.
Static fixed SHMM                  5.2     0.2     0.7     6.6
Linear fixed trajectory            3.8     0.5     0.6     4.9
Linear PTSHMM (slope var = 0)      2.0     0.8     0.1     2.9
Linear PTSHMM (flexible slope)     4.9     4.0     0.1     9.0

  • Some advantage for linear trajectory.

  • Considerable further benefit from modelling variability in mid-point.

  • But modelling variability in both mid-point and slope is detrimental to recognition performance.


Conclusions from digit experiments

Best trajectory model gives nearly a 70% reduction in error rate (2.9%) compared with standard HMMs (8.6% error rate).

=> advantages from trajectory-based segmental HMM which also incorporates distinction between intra- and extra-segmental variability, but:

  • Trajectory assumption must be reasonably accurate (advantage for linear but not for static models).

  • Not beneficial to model variability in slope parameter - possibly too variable between speakers, or too difficult to estimate reliably for short segments.


Phonetic classification: TIMIT

  • Training and recognition with given segment boundaries.

  • Train on complete training set (male speakers), with classification on core test set.

  • 12 mel cepstrum features + overall energy.

  • Evaluated (constrained) linear PTSHMMs.

  • Compared performance with standard-HMM performance for:

    • context-dependent (biphone) versus context-independent (monophone) models

    • feature set using only the mel cepstrum features versus one which also included time derivative features.


TIMIT classification results

  • Improvement with linear PTSHMM is greatest for more accurate (context-dependent) models.

    => more benefit from modelling trajectories when not including different phonetic events in one model.

  • Most advantage when not using delta features.

    => most benefit from modelling dynamics when not attempting to represent dynamics in front-end.


Benefit of PTSHMMs for some different phone classes

                                     No. of     HMM      PTSHMM   % improve-
                                     examples   %error   %error   ment
Fricatives (f v th dh s z sh hh)        710      41.7     38.9      6.8
Vowels (iy ih eh ae ah uw uh er)       1178      53.8     48.9      9.1
Semivowels and glides (l r y w)          97      39.2     33.2     15.4
Diphthongs (ey ay oy aw ow)             376      48.9     41.2     15.8
Stops (p t dx k b d g)                  566      56.7     54.8      3.4

Most benefit from linear PTSHMM for sounds characterised by continuous smooth-changing dynamics.


Summary of findings

  • Probabilistic-trajectory segmental HMMs can outperform standard HMMs and fixed-trajectory segmental HMMs.

  • Separately modelling variability within/between segments is a powerful approach, provided that:

    • trajectory assumptions are appropriate (linear trajectory)

    • variability in the parameter can be usefully modelled (not useful to model variability in slope parameter with current approach).

  • The models have been shown to give useful performance gains.


Issues of modelling speech dynamics

Compare error rates on TIMIT task:

  • HMMs with time derivatives: 29.8%

  • Best segmental HMM result WITHOUT time derivatives: 38.2%.

    => time derivatives capture some aspects of dynamics not modelled in segmental HMMs.

  • Time derivative features provide some measure of dynamics for every frame.

  • Current segmental HMMs only model dynamics within a segment.


Modelling issues and questions (1)

  • Choice of model unit (e.g. phone, diphone)

  • How to model dynamics and continuity effects across segment boundaries, to represent dynamics throughout an utterance.

  • How to model context effects (e.g. could define trajectories according to the previous and following sounds - but this complicates the search).

  • How to define trajectories (e.g. linear or a higher-order polynomial, versus a dynamical-system type model with filtered output of hidden states).


Modelling issues and questions (2)

  • Incorporating a realistic duration model.

  • How to model any systematic effects of duration on trajectory realisation - should reduce remaining variability in trajectories.

  • How to model speaker-dependent effects and speaker continuity.

  • How to deal with other systematic influences - e.g. speaker stress, speaking rate.

  • Dealing with external influences - e.g. noise.

  • Choice of features for trajectory modelling.


Spectral representations (1)

  • Typical wideband spectrogram - for display, compute the spectrum at frequent time intervals (e.g. 2 ms):

    [Wideband spectrogram of "three six six", annotated with phone labels: th r ee | s i x | s i x.]

  • Typical features for ASR: MFCCs computed from an FFT of 25 ms windows at 10 ms intervals:

    [The same utterance analysed with fixed 25 ms windows at 10 ms intervals.]


Spectral representations (2)

  • Using long windows at fixed positions blurs rapid events - stop bursts and rapid formant transitions.

  • An alternative: use a shorter window "excitation-synchronously":

    [Excitation-synchronous spectrogram of "three six six", annotated with phone labels.]

  • Compare with the long fixed-window analysis:

    [The same utterance with the long fixed-window analysis.]


Standard HMM digit recognition experiments

  • Compared excitation-synchronous analysis with fixed analysis for different window lengths.

  • In all cases computed FFT then mel cepstrum.

  • Shorter window gives lower frequency resolution, but effect is not so great on mel scale.

  • Best fixed-window condition, 20 or 25 ms: 2.1% err. (increasing to 4.6% for a 5 ms window).

  • Best synchronous-window condition, 10 ms: 1.9% err., increasing only to 2.1% for a 5 ms window.

    => some advantage to capturing rapid events. But note that a short window may be a disadvantage for fricatives.

    Maybe combine different analyses?


Moving beyond cepstrum trajectories

  • Start with spectral analysis: this must preserve all relevant information.

  • But is it appropriate to then model trajectories directly in the spectral/cepstral domain?

  • Motivation for modelling dynamics is from nature of articulation, and its acoustic consequences.

    => should be modelling in domain closer to articulation.

  • One possibility is an articulatory description.

  • Another option is formants - closely related to articulation but also to acoustics.


Problems with formant analysis

  • Unambiguous formant labelling may not be possible from a single spectral cross-section.

    e.g. close formants may merge to give single spectral peak

  • A formant may not be apparent in the spectrum.

    e.g. formant is weakly excited (F1 in unvoiced sounds).

  • NOT useful for certain distinctions, where low amplitude is the main feature.

    e.g. identifying silence or weak fricatives.

=> difficult to identify formants independently from recognition process, so not generally used as features for automatic speech recognition.


Estimating formant trajectories

[Spectrogram of "six three one" (phone labels: s i k s | th r ee | o ne) with estimated formant trajectories overlaid.]

  • Where clear formant structure is visible, F1, F2 and F3 can be identified.

  • In voiceless fricatives, higher formant movements are usually continuous with those in adjacent vowels.

  • For F1, arbitrarily connect the trajectory between adjacent vowels.


Formant analysis method (John Holmes, Proc. EUROSPEECH'97)

  • Aims to emulate human abilities:

    • ability to label single spectrum cross-sections

    • rely heavily on continuity over time

    • sometimes need knowledge of what is being said to disambiguate alternatives

  • Two fundamental features of the method:

    • outputs alternatives when uncertain (“delayed decisions”).

    • Notion of “confidence” in formant measurement

      when formants cannot be estimated (e.g. during silence), confidence is low and the estimate is not useful for recognition

      => rely on other features (general spectrum shape).


Example of formant analyser output

  • Up to two sets of formants for each frame.

  • Alternatives are in terms of sets - F1, F2, F3.

  • Specified frame by frame, but the alternatives usually form alternative trajectories.

[Analyser output for the utterance "four seven", showing alternative formant trajectories.]


Segmental HMM experiments

  • Each segment model is associated with a linear trajectory.

  • Model each phone by a sequence of one or more segments.

    e.g. monophthongal vowels, fricatives - 1 segment

    diphthongs - sequence of 2 segments

    aspirated voiceless stops - sequence of 3 segments.

  • Set allowed minimum and maximum segment duration dependent on identity of phone segment (loose constraint).

  • Incorporate the confidence estimate (as a variance) in the recognition calculations (see the sketch after this list).

  • Resolve formant alternatives based on probability.

  • Use formants + low-order cepstrum features.
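
One plausible reading of "confidence as a variance", sketched under the assumption that confidence enters as a per-frame measurement variance added to the model variance; the exact formulation is not given in the slides, and all names are illustrative:

```python
import numpy as np

def formant_log_prob(obs, traj_mean, model_var, conf_var):
    """Gaussian score for one formant value; a large conf_var (low
    confidence, e.g. in silence) makes the score nearly uninformative."""
    var = model_var + conf_var
    return -0.5 * (np.log(2.0 * np.pi * var) + (obs - traj_mean) ** 2 / var)

def resolve_alternatives(alternatives, traj_means, model_vars):
    """Pick the formant set - a list of (value, conf_var) pairs for
    F1-F3 - that best matches the model's trajectory at this frame."""
    scores = [sum(formant_log_prob(val, mu, mv, cv)
                  for (val, cv), mu, mv in zip(alt, traj_means, model_vars))
              for alt in alternatives]
    return int(np.argmax(scores))
```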


Some connected-digit recognition results

Word error rates:

                                                8 cep.   5 cep. + 3 form.
Standard-HMM baseline (3 states per phone)       3.5%         2.5%
Standard HMMs with variable state allocation     6.4%         5.9%
Introduce segment structure                      3.2%         2.9%
Introduce linear trajectory                      2.6%         2.3%

  • Performance drops when the new state allocation is introduced (the total number of states is about half that of the baseline).

  • Need the segment structure for good performance.

  • Some advantage from the linear trajectory.

  • Formants show a small, but consistent, advantage.


Formant modelling

  • Expressing a model in terms of formant dynamics offers:

    • Potential for modelling systematic effects in a meaningful way: e.g. speaker identity, speaker stress, speaking rate.

    • Potential for a constrained model of speech, which should be more robust to noise (assuming the noise is also modelled).

  • BUT: analysis of formants separately from hypotheses about what is being said will always be prone to errors.

  • FUTURE AIM: integrate formant analysis within recognition scheme: provided speech model is accurate, this should overcome any formant tracking errors.

  • A good model for speech should be appropriate for synthesis as well as for recognition: a trajectory-based formant model offers this possibility.


A “unified” speech model: applied to coding


A simple coding scheme

  • Demonstrate principles of coding using same model for both recognition and synthesis.

  • Model represents linear formant trajectories.

  • Recognition: linear trajectory segmental HMMs of formant features.

  • Synthesis: JSRU parallel-formant synthesizer.

  • Coding is applied to analysed formant trajectories

    => relatively high bit-rate (up to about 1000 bits/s).

  • Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.


Segment coding scheme overview

[Block diagram of the segment coding scheme.]


Speech coding results

[Audio examples, coded at about 600 bits/s and natural (uncoded): digit strings from speakers 1-3, plus an ARM report passage from speaker 1.]

Achievements of the study: established the principle of using a formant trajectory model for both recognition and synthesis, including using information from recognition to assist in coding.

Future work: better-quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.

