Segmental HMMs: Modelling Dynamics and Underlying Structure for Automatic Speech Recognition

Wendy Holmes

20/20 Speech Limited, UK

A DERA/NXT Joint Venture

- Hidden Markov models (HMMs): advantages and limitations
- Overcoming limitations with segment-based HMMs
- Modelling trajectories of acoustic features
- Theory of trajectory-based segmental HMMs
- Experimental investigations: comparing performance of different segmental HMMs

- Choice of parameters for trajectory modelling: recognition using formant trajectories
- A “unified” model for both recognition and synthesis
- Challenges and further issues

[Figure: example utterance with phone labels: s i k s, th r ee, o ne]

- Each sound has particular spectral characteristics.
- Characteristics change continuously with time.
- Patterns of change give cues to phone identity.
- Spectrum includes speaker identity information.

1. Appropriate general structure

- Underlying Markov process allows for time-varying nature of utterances.
- Probability distributions associated with states represent short-term spectral variability.
- Can incorporate speech knowledge - e.g. context-dependent models, choice of features.
2. Tractable mathematical framework

- Algorithms for automatically training model parameters from natural speech data.
- Straightforward recognition algorithms.

[Figure: standard HMM, with the model shown against the observations at times t, t+1 and t+2]

- Piece-wise stationarity
Assume speech produced by piece-wise stationary process with instantaneous transitions between stationary states.

- Independence Assumption
Probability of an acoustic vector given a model state depends ONLY on the vector and the state. Assume no dependency of observations, other than through the state sequence.

- Duration model
State duration conforms to geometric pdf (given by self-loop transition probability).
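The geometric duration assumption can be made concrete with a small sketch. The function name and the example self-loop probability of 0.8 are illustrative assumptions, not values from the slides. For self-loop probability a, the probability of staying in a state for exactly d frames is a^(d-1) (1 - a), so the most likely duration is always a single frame.

```python
def geometric_duration_pmf(self_loop_prob, max_dur):
    """Return P(d) for d = 1..max_dur under a standard HMM state.

    P(d) = a**(d - 1) * (1 - a), where a is the self-loop transition
    probability: stay for d - 1 frames, then leave.
    """
    a = self_loop_prob
    return [a ** (d - 1) * (1.0 - a) for d in range(1, max_dur + 1)]


# Illustrative value only: a = 0.8 gives a mean duration of 1 / (1 - a) = 5 frames,
# yet the single most probable duration is still d = 1.
pmf = geometric_duration_pmf(0.8, 10)
most_likely_duration = 1 + pmf.index(max(pmf))
```

The distribution decreases monotonically with d, so it cannot represent a sound whose typical duration is several frames with shorter and longer durations both less likely.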

- Speech production is not a piece-wise stationary process, but a continuous one.
- Changes are mostly smoothly time varying.
- Constraints of articulation are such that any one frame of speech is highly correlated with previous and following frames.
- Time derivatives capture correlation to some extent - but not within the model.
- Long-term correlations, e.g. speaker identity.
- Speech sounds have a typical duration, with shorter and longer durations being less likely, and limitations on maximum duration.

AIMS WERE TO:

- retain advantages of HMMs:
- automatic and tractable algorithms for training models on large quantities of speech data;
- manageable recognition algorithms (principle of dynamic programming).

- improve the underlying model structure to address HMM shortcomings as models of speech.

ACHIEVING THE AIMS:

- Associate states with sequences of feature vectors
=> SEGMENTAL HMMS

[Figure: segmental HMM, with successive segments starting at time t (d=3), time t+3 (d=2) and time t+5 (d=5)]

- Associate states with sequences of feature vectors, where these sequences can vary in duration.
- Each state is associated with meaningful acoustic-phonetic event (phones or parts of phones).
- Can easily incorporate realistic duration model.
- Enable relationship between frames comprising a segment to be modelled explicitly.
- Characterize dynamic behaviour during a segment.

[Figure: time axis, frames 1 to 7]

- Compute most likely path through model (or sequence of models).
- Evaluate efficiently using dynamic programming (Viterbi algorithm).

- To compute probability of emitting observations up to a given frame time, for any one state need only consider states which could be occupied at previous frame.
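The recursion described above can be sketched as a log-domain Viterbi search. This is a minimal illustration, not the implementation used in the work; the function name and the argument layout (`log_init`, `log_trans`, `log_emit`) are assumptions made for the example.

```python
def viterbi(log_init, log_trans, log_emit):
    """Most likely state path for a standard HMM (log domain).

    log_init[j]     : log probability of starting in state j
    log_trans[i][j] : log probability of moving from state i to state j
    log_emit[t][j]  : log probability of the frame-t observation in state j
    """
    n_frames = len(log_emit)
    n_states = len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    back = []
    for t in range(1, n_frames):
        new_score, ptr = [], []
        for j in range(n_states):
            # Only states occupied at the previous frame need to be considered.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    last = max(range(n_states), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path
```

The cost per frame is proportional to the square of the number of states, because each state looks back exactly one frame.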

[Figure: time axis, frames 1 to 7]

- Principle of dynamic programming still applies.
- BUT, is more complex and computationally intensive.
- For probability in any one state at any given frame time t:
- assume that frame t is the last frame of a segment
- consider all possible segment durations from 1 to some maximum D
- therefore, must consider all possible previous states at all possible previous frame times from t-1 back to t-D.
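The segmental search described above can be sketched as follows. This is an illustration under stated assumptions, not the author's code: `seg_logprob` is a hypothetical callback that scores a whole segment (duration score included), and frames are indexed from 0 with exclusive end indices.

```python
def segmental_viterbi(n_frames, log_init, log_trans, seg_logprob, max_dur):
    """Best segmentation of frames 0..n_frames-1 into state-labelled segments.

    seg_logprob(j, start, end) is assumed to return the log probability that
    state j emits frames start..end-1 as a single segment.
    """
    neg_inf = float("-inf")
    n_states = len(log_init)
    # best[t][j]: best log score of a path whose last segment, in state j,
    # ends just before frame index t (frames 0..t-1 are accounted for).
    best = [[neg_inf] * n_states for _ in range(n_frames + 1)]
    back = [[None] * n_states for _ in range(n_frames + 1)]
    for t in range(1, n_frames + 1):
        for j in range(n_states):
            for d in range(1, min(max_dur, t) + 1):  # every duration up to D
                start = t - d
                seg = seg_logprob(j, start, t)
                if start == 0:
                    candidates = [(log_init[j] + seg, None)]
                else:
                    candidates = [(best[start][i] + log_trans[i][j] + seg, i)
                                  for i in range(n_states)]
                for score, prev in candidates:
                    if score > best[t][j]:
                        best[t][j] = score
                        back[t][j] = (start, prev)
    end_state = max(range(n_states), key=lambda j: best[n_frames][j])
    if best[n_frames][end_state] == neg_inf:
        raise ValueError("no valid segmentation within max_dur")
    segments, t, j = [], n_frames, end_state
    while t > 0:
        start, prev = back[t][j]
        segments.append((j, start, t))
        t, j = start, prev
    segments.reverse()
    return best[n_frames][end_state], segments
```

Compared with the frame-by-frame search, the extra loop over durations multiplies the work by roughly D, and each segment score must be evaluated for every candidate start time.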

[Figure: feature value plotted against time t]

- Approximate relation between successive feature vectors by some trajectory through feature space.

- Simple trajectory-based segmental HMM: associate a state with a single mean trajectory, in place of (static) single mean value used for a standard HMM.

- Generate observations independently, but conditioned on the trajectory.
- Aim to provide constraining model of dynamics without requiring a complex model of correlations.
- BUT, trajectory may be different for different utterances of the same sound.
- So, if a single trajectory is used to represent all examples of a given model unit, will not be a very accurate representation for any one example.
- One possible solution is a mixture of trajectories, but needs many components to capture all different trajectories.

[Figure: feature value plotted against time t]

- Model feature dynamics across all segment examples by, in effect, a continuous mixture of trajectories.
- This is achieved by modelling separately:
- extra-segmental variation (underlying trajectory)
- intra-segmental variation (about trajectory)
=> Probabilistic-trajectory segmental HMMs

[Figure: generating a sequence of 5 observations with a standard HMM, a segmental HMM and a probabilistic-trajectory segmental HMM. Labels: HMM states, target, intra-segmental variability, extra-segmental variability; time axis t from 1 to D]

- Parametric trajectory model and Gaussian distributions.
- Simple linear trajectory - characterized by mid-point c and slope m.
- For illustration show with slope=0.

- A segment of observations is y = y0,...,yT.
- Probability of y and trajectory f given state S is

[Equation not recovered; its two factors are labelled "extra-segmental" and "intra-segmental".]
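The slide's equation did not survive extraction. Given the two term labels and the earlier statement that observations are generated independently but conditioned on the trajectory, a plausible reconstruction of the factorisation is shown here. The notation is assumed for illustration and may differ from the original slide.

```latex
P(y, f \mid S) \;=\;
\underbrace{P(f \mid S)}_{\text{extra-segmental}}
\;\;
\underbrace{\prod_{t=0}^{T} P(y_t \mid f, S)}_{\text{intra-segmental}}
```

The first factor scores how plausible the underlying trajectory f is for the state; the second scores how closely the individual frames follow that trajectory.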

Alternative segmental models:

1. Define trajectory; model variation in trajectory

2. Fix trajectory and model observations - HMM is limiting case.

[Equation not recovered; its terms are labelled slope, mid-point and intra-segment.]

- Gaussian distributions for slope, mid-point and intra-segment variance.
- To use model in recognition, need to compute P(y|S).
- but values of trajectory parameters m and c are not known - they are “hidden” from the observer.

- Linear trajectory: slope m and mid-point c.
- Joint probability of y and linear trajectory is:

- One possibility: estimate the location of the trajectory, and compute the probability for that trajectory.
- Used this approach in early work, but suffers problems due to difficulty in making unbiased trajectory estimate.
- A better alternative is to allow for all possible locations of the trajectory by integrating out the unknown parameters.
- In the case of the linear model, the calculation is:
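The joint probability and the integral referred to above are missing from the extracted slides. The following is a sketch of the form they would take for a single feature with Gaussian distributions. The symbols for the means and variances are assumptions introduced here: mid-point mean and variance, slope mean and variance, and intra-segment variance are written as mu_c, gamma_c, mu_m, gamma_m and tau. The choice to measure time from the segment centre, so that c is the trajectory value at the mid-point, is also an assumption.

```latex
P(y, m, c \mid S) \;=\;
\mathcal{N}(m;\, \mu_m, \gamma_m)\;
\mathcal{N}(c;\, \mu_c, \gamma_c)\;
\prod_{t=0}^{T} \mathcal{N}\!\big(y_t;\; c + m\,(t - T/2),\; \tau\big)

P(y \mid S) \;=\; \int\!\!\int P(y, m, c \mid S)\; dm\; dc
```

Because every factor is Gaussian and the trajectory is linear in m and c, the double integral has a closed form, so no numerical integration is needed during recognition.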

- Linear PTSHMM has five model parameters: mid-point mean and variance, slope mean and variance, and intra-segment variance.

- Simpler models arise as special cases, by fixing various parameters.
- If the trajectory slope is set to zero
=> “static” PTSHMM.

- If variability in the trajectory is prevented
=> “fixed-trajectory” SHMM.

- Fixed-trajectory static SHMM = standard HMM with explicit duration model.
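The special cases listed above can be illustrated with a sketch of the segment log-likelihood for one scalar feature. This is not the author's implementation: the function name and argument names are hypothetical, and time is assumed to be measured from the segment centre. Integrating out Gaussian mid-point and slope leaves a Gaussian over the segment with covariance tau*I + gamma_c*1*1' + gamma_m*s*s', where s is the centred time vector; because s sums to zero, the determinant and inverse have the simple closed forms used below.

```python
import math


def linear_ptshmm_loglik(y, mid_mean, mid_var, slope_mean, slope_var, intra_var):
    """log P(y | S) for a linear probabilistic-trajectory segment (scalar feature).

    Requires intra_var > 0. Setting slope_mean = slope_var = 0 gives the
    "static" model; setting mid_var = slope_var = 0 gives a fixed trajectory.
    """
    n = len(y)
    s = [t - (n - 1) / 2.0 for t in range(n)]  # centred time, sums to zero
    r = [y[t] - mid_mean - slope_mean * s[t] for t in range(n)]  # residuals
    ss = sum(v * v for v in s)
    sum_r = sum(r)
    s_dot_r = sum(a * b for a, b in zip(s, r))
    r_dot_r = sum(v * v for v in r)
    log_det = ((n - 2) * math.log(intra_var)
               + math.log(intra_var + n * mid_var)
               + math.log(intra_var + slope_var * ss))
    quad = (r_dot_r
            - mid_var * sum_r ** 2 / (intra_var + n * mid_var)
            - slope_var * s_dot_r ** 2 / (intra_var + slope_var * ss)) / intra_var
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

With all three of slope mean, slope variance and mid-point variance set to zero, the result equals the sum of independent per-frame Gaussian log-probabilities, which is the standard-HMM emission score for a segment spent in one state.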

- Speaker-independent connected-digit recognition
- 8 mel cepstrum features + overall energy
- three-state monophone models
- Segmental HMM max. segment duration 10 frames
(=> with three states per phone, maximum phone duration = 30 frames = 300 ms).

- Compared probabilistic-trajectory SHMMs with fixed-trajectory SHMMs and with standard HMMs.
- Initialised all SHMMs from segmented training data (using HMM Viterbi alignment).
- Interested in acoustic-modelling aspects, so fixed all transition and duration probabilities to be equal.
- 5 training iterations.

| Model | % Sub. | % Del. | % Ins. | % Err. |
|---|---|---|---|---|
| Standard HMM | 6.2 | 1.5 | 0.9 | 8.6 |
| Add duration constraint | 5.2 | 0.7 | 0.7 | 6.6 |
| Linear fixed trajectory | 3.8 | 0.5 | 0.6 | 4.9 |

- Some benefit from simply imposing duration constraints by introducing the segmental structure (prevents “silly” segmentations).
- Further benefit from representing dynamics by incorporating linear trajectory (one trajectory per model state).

| Model | % Sub. | % Del. | % Ins. | % Err. |
|---|---|---|---|---|
| Static fixed SHMM | 5.2 | 0.2 | 0.7 | 6.6 |
| Static probabilistic SHMM | 5.2 | 2.2 | 0.1 | 7.5 |

- For static models, no advantage from distinguishing between extra- and intra-segmental variability.

| Model | % Sub. | % Del. | % Ins. | % Err. |
|---|---|---|---|---|
| Static fixed SHMM | 5.2 | 0.2 | 0.7 | 6.6 |
| Linear fixed trajectory | 3.8 | 0.5 | 0.6 | 4.9 |
| Linear PTSHMM (slope var=0) | 2.0 | 0.8 | 0.1 | 2.9 |
| Linear PTSHMM (flexible slope) | 4.9 | 4.0 | 0.1 | 9.0 |

- Some advantage for linear trajectory.
- Considerable further benefit from modelling variability in mid-point.
- But modelling variability in both mid-point and slope is detrimental to recognition performance.

Best trajectory model gives nearly 70% reduction in error-rate (2.9%) compared with standard HMMs (8.6% error-rate).
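As a check on the arithmetic behind this comparison, the relative reduction can be computed directly from the two quoted error rates. The helper name is illustrative only.

```python
def relative_error_reduction(baseline_err, new_err):
    """Relative reduction in error rate, as a percentage of the baseline."""
    return 100.0 * (baseline_err - new_err) / baseline_err


# Standard HMM (8.6%) versus the best linear PTSHMM (2.9%).
reduction = relative_error_reduction(8.6, 2.9)
print(f"{reduction:.1f}%")  # prints "66.3%"
```

The exact figure is about 66%, which the slide rounds to "nearly 70%".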

=> advantages from trajectory-based segmental HMM which also incorporates distinction between intra- and extra-segmental variability, but:

- Trajectory assumption must be reasonably accurate (advantage for linear but not for static models).
- Not beneficial to model variability in slope parameter - possibly too variable between speakers, or too difficult to estimate reliably for short segments.

- Training and recognition with given segment boundaries.
- Train on complete training set (male speakers), with classification on core test set.
- 12 mel cepstrum features + overall energy.
- Evaluated (constrained) linear PTSHMMs.
- Compared performance with standard-HMM performance for:
- context-dependent (biphone) versus context-independent (monophone) models
- feature set using only the mel cepstrum features versus one which also included time derivative features.

- Improvement with linear PTSHMM is greatest for more accurate (context-dependent) models.
=> more benefit from modelling trajectories when not including different phonetic events in one model.

- Most advantage when not using delta features.
=> most benefit from modelling dynamics when not attempting to represent dynamics in front-end.

| Phone class | No. examples | HMM %error | PTSHMM %error | % improvement |
|---|---|---|---|---|
| Fricatives (f v th dh s z sh hh) | 710 | 41.7 | 38.9 | 6.8 |
| Vowels (iy ih eh ae ah uw uh er) | 1178 | 53.8 | 48.9 | 9.1 |
| Semivowels and glides (l r y w) | 97 | 39.2 | 33.2 | 15.4 |
| Diphthongs (ey ay oy aw ow) | 376 | 48.9 | 41.2 | 15.8 |
| Stops (p t dx k b d g) | 566 | 56.7 | 54.8 | 3.4 |

Most benefit from linear PTSHMM for sounds characterised by continuous smooth-changing dynamics.

- Probabilistic-trajectory segmental HMMs can outperform standard HMMs and fixed-trajectory segmental HMMs.
- Separately modelling variability within/between segments is a powerful approach, provided that:
- trajectory assumptions are appropriate (linear trajectory)
- variability in the parameter can be usefully modelled (not useful to model variability in slope parameter with current approach).

- The models have been shown to give useful performance gains.

Compare error rates on TIMIT task:

- HMMs with time derivatives: 29.8%
- best segmental HMM result WITHOUT time derivatives: 38.2%.
=> time derivatives capture some aspects of dynamics not modelled in segmental HMMs.

- Time derivative features provide some measure of dynamics for every frame.
- current segmental HMMs only model dynamics within a segment.

- Choice of model unit (e.g. phone, diphone)
- How to model dynamics and continuity effects across segment boundaries, to represent dynamics throughout an utterance.
- How to model context effects. (e.g. could define trajectories according to previous and following sounds - but complicates search)
- How to define trajectories. (e.g. linear or higher-order polynomial; versus dynamical-system type model with filtered output of hidden states)

- Incorporating a realistic duration model.
- How to model any systematic effects of duration on trajectory realisation - should reduce remaining variability in trajectories.
- How to model speaker-dependent effects and speaker continuity.
- How to deal with other systematic influences - e.g. speaker stress, speaking rate.
- Dealing with external influences - e.g. noise.
- Choice of features for trajectory modelling.

- Typical wideband spectrogram - for display compute spectrum at frequent time intervals (e.g. 2 ms)
[Figure: wideband spectrogram with phone labels: th r ee, s I x, s I x]

- Typical features for ASR: mfccs computed from FFT of 25 ms windows at 10 ms intervals:

- Using long windows at fixed positions blurs rapid events - stop bursts and rapid formant transitions.
- An alternative: use a shorter window “excitation synchronously”:
[Figure: excitation-synchronous analysis with phone labels: th r ee, s I x, s I x]

- Compare with long fixed-window analysis:

- Compared excitation-synchronous analysis with fixed analysis for different window lengths.
- In all cases computed FFT then mel cepstrum.
- Shorter window gives lower frequency resolution, but effect is not so great on mel scale.
- Best fixed-window condition 20 or 25 ms: 2.1% err. (increased to 4.6% for a 5 ms window).
- Best synchronous-window condition 10ms: 1.9% err. But only increased to 2.1 % for a 5 ms window.
=> some advantage to capturing rapid events. But note short window may be disadvantage for fricatives.
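The fixed-window analysis discussed above (for example 25 ms windows at 10 ms intervals) can be sketched as a framing step that precedes the FFT. The function name and the 16 kHz sample rate in the usage lines are assumptions for illustration; the slides do not state the sample rate. The sketch covers only fixed positions, not the excitation-synchronous placement.

```python
def frame_bounds(n_samples, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Return (start, end) sample indices of fixed-position analysis windows."""
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    if n_samples < win:
        return []
    return [(s, s + win) for s in range(0, n_samples - win + 1, hop)]


# One second of audio at an assumed 16 kHz sample rate.
long_frames = frame_bounds(16000, 16000, win_ms=25.0)   # 400-sample windows
short_frames = frame_bounds(16000, 16000, win_ms=5.0)   # 80-sample windows
```

At a 10 ms frame interval a 25 ms window overlaps its neighbours, so a brief event such as a stop burst is smeared across two or three frames, whereas a 5 ms window covers only part of each interval.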

Maybe combine different analyses?

- Start with spectral analysis: this must preserve all relevant information.
- But is it appropriate to then model trajectories directly in the spectral/cepstral domain?
- Motivation for modelling dynamics is from nature of articulation, and its acoustic consequences.
=> should be modelling in domain closer to articulation.

- One possibility is an articulatory description.
- Another option is formants - closely related to articulation but also to acoustics.

- Unambiguous formant labelling may not be possible from a single spectral cross-section.
e.g. close formants may merge to give single spectral peak

- A formant may not be apparent in the spectrum.
e.g. formant is weakly excited (F1 in unvoiced sounds).

- NOT useful for certain distinctions, where low amplitude is the main feature.
e.g. identifying silence or weak fricatives.

=> difficult to identify formants independently from recognition process, so not generally used as features for automatic speech recognition.

[Figure: example utterance with phone labels: s i k s, th r ee, o ne]

- Where clear formant structure is visible, F1, F2 and F3 can be identified.
- In voiceless fricatives, higher formant movements are usually continuous with those in adjacent vowels.
- For F1, arbitrarily connect between adjacent vowels.

- Aims to emulate human abilities:
- ability to label single spectrum cross-sections
- rely heavily on continuity over time
- sometimes need knowledge of what is being said to disambiguate alternatives

- Two fundamental features of the method:
- outputs alternatives when uncertain (“delayed decisions”)
- notion of “confidence” in formant measurement: when formants cannot be estimated (e.g. during silence), confidence is low and the estimate is not useful for recognition
=> rely on other features (general spectrum shape).

- Up to two sets of formants for each frame.
- Alternatives are in terms of sets - F1, F2, F3.
- Specified frame by frame, but are usually alternative trajectories.

[Figure: example utterance “four seven”]

- Each segment model is associated with a linear trajectory.
- Model each phone by a sequence of one or more segments, e.g.
- monophthongal vowels, fricatives: 1 segment
- diphthongs: sequence of 2 segments
- aspirated voiceless stops: sequence of 3 segments.

- Set allowed minimum and maximum segment duration dependent on identity of phone segment (loose constraint).
- Incorporate confidence estimate (as a variance) in recognition calculations.
- Resolve formant alternatives based on probability.
- Use formants + low-order cepstrum features.
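The slides say only that the confidence estimate enters the recognition calculations "as a variance" and do not give the formula. One plausible reading, shown here purely as an assumption, is that the confidence variance is added to the model variance when scoring a formant value, so that a low-confidence measurement discriminates less between models. The function names and the example numbers are hypothetical.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log of a univariate Gaussian density."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def formant_loglik(x, model_mean, model_var, confidence_var):
    """Score a formant measurement whose uncertainty is given as a variance.

    Assumption: measurement uncertainty is independent of the model's own
    variability, so the two variances add.
    """
    return gaussian_loglik(x, model_mean, model_var + confidence_var)


# Hypothetical F2 measurement of 1500 Hz scored against two candidate models.
def score_gap(confidence_var):
    near = formant_loglik(1500.0, 1500.0, 100.0 ** 2, confidence_var)
    far = formant_loglik(1500.0, 1900.0, 100.0 ** 2, confidence_var)
    return near - far
```

Under this reading, `score_gap(0.0)` is large, while `score_gap(1000.0 ** 2)` is close to zero, so an unreliable formant estimate has little influence on the choice between models and other features dominate.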

Word error rates

| Model | 8 cepstrum features | 5 cepstrum + 3 formant features |
|---|---|---|
| Standard-HMM baseline with 3 states per phone | 3.5 % | 2.5 % |
| Standard HMMs with variable state allocation | 6.4 % | 5.9 % |
| Introduce segment structure | 3.2 % | 2.9 % |
| Introduce linear trajectory | 2.6 % | 2.3 % |

- Performance drops when the new state allocation is introduced (total number of states about half that of baseline).
- Need segment structure for good performance.
- Some advantage from linear trajectory.
- Formants show small, but consistent, advantage.

- Expressing a model in terms of formant dynamics offers:
- Potential for modelling systematic effects in a meaningful way: e.g. speaker identity, speaker stress, speaking rate.
- Potential for a constrained model for speech, which should be more robust to noise (assuming also model the noise).

- BUT: analysis of formants separately from hypotheses about what is being said will always be prone to errors.
- FUTURE AIM: integrate formant analysis within recognition scheme: provided speech model is accurate, this should overcome any formant tracking errors.
- A good model for speech should be appropriate for synthesis as well as for recognition: a trajectory-based formant model offers this possibility.

- Demonstrate principles of coding using same model for both recognition and synthesis.
- Model represents linear formant trajectories.
- Recognition: linear trajectory segmental HMMs of formant features.
- Synthesis: JSRU parallel-formant synthesizer.
- Coding is applied to analysed formant trajectories
=> relatively high bit-rate (up to about 1000 bits/s).

- Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.

[Audio examples: speech coded at about 600 bps, and natural speech for comparison, each for Speaker 1 digits, Speaker 2 digits, Speaker 3 digits and Speaker 1 ARM report]

Achievements of study: Established principle of using formant trajectory model for both recognition and synthesis, including using information from recognition to assist in coding.

Future work: better quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.