### Segmental HMMs: Modelling Dynamics and Underlying Structure for Automatic Speech Recognition

Wendy Holmes

20/20 Speech Limited, UK

A DERA/NXT Joint Venture

Overview

- Hidden Markov models (HMMs): advantages and limitations
- Overcoming limitations with segment-based HMMs
- Modelling trajectories of acoustic features
- Theory of trajectory-based segmental HMMs
- Experimental investigations: comparing performance of different segmental HMMs

- Choice of parameters for trajectory modelling: recognition using formant trajectories
- A “unified” model for both recognition and synthesis
- Challenges and further issues

Typical speech spectral characteristics

[Figure: spectrogram with phone labels "s i k s", "th r ee", "o ne"]

- Each sound has particular spectral characteristics.
- Characteristics change continuously with time.
- Patterns of change give cues to phone identity.
- Spectrum includes speaker identity information.

Useful properties of HMMs

1. Appropriate general structure
   - Underlying Markov process allows for time-varying nature of utterances.
   - Probability distributions associated with states represent short-term spectral variability.
   - Can incorporate speech knowledge - e.g. context-dependent models, choice of features.
2. Tractable mathematical framework
   - Algorithms for automatically training model parameters from natural speech data.
   - Straightforward recognition algorithms.

Modelling observations with an HMM

[Figure: model states aligned with observations at time t, time t+1 and time t+2]

Conventional HMM assumptions

- Piece-wise stationarity
Assume speech produced by piece-wise stationary process with instantaneous transitions between stationary states.

- Independence Assumption
Probability of an acoustic vector given a model state depends ONLY on the vector and the state. Assume no dependency of observations, other than through the state sequence.

- Duration model
State duration conforms to geometric pdf (given by self-loop transition probability).
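The duration assumption can be made concrete with a few lines of Python. This is a minimal sketch, not taken from the slides: the function name and the self-loop value 0.8 are illustrative assumptions. It shows that a geometric duration distribution always makes the shortest duration the most likely one.

```python
def geometric_duration_pmf(self_loop_prob, max_duration):
    """P(d) for d = 1..max_duration: stay (d - 1) times, then leave the state."""
    a = self_loop_prob
    return [(a ** (d - 1)) * (1.0 - a) for d in range(1, max_duration + 1)]


# Hypothetical self-loop probability, chosen only for illustration.
self_loop = 0.8
pmf = geometric_duration_pmf(self_loop, 10)

# The most likely duration is always a single frame.
most_likely_duration = 1 + pmf.index(max(pmf))

# Mean of a geometric distribution over d = 1, 2, ...
expected_duration = 1.0 / (1.0 - self_loop)
```

Whatever the self-loop probability, the probabilities fall monotonically with duration, which is the mismatch with real speech sounds raised in the limitations that follow.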

Limitations of HMM assumptions

- Speech production is not a piece-wise stationary process, but a continuous one.
- Changes are mostly smoothly time varying.
- Constraints of articulation are such that any one frame of speech is highly correlated with previous and following frames.
- Time derivatives capture correlation to some extent - but not within the model.
- Long-term correlations, e.g. speaker identity.
- Speech sounds have a typical duration, with shorter and longer durations being less likely, and limitations on maximum duration.

Addressing HMM limitations

AIMS WERE TO:

- retain advantages of HMMs:
  - automatic and tractable algorithms for training to model quantity of speech data;
  - manageable recognition algorithms (principle of dynamic programming).
- improve the underlying model structure to address HMM shortcomings as models of speech.

ACHIEVING THE AIMS:

- Associate states with sequences of feature vectors
=> SEGMENTAL HMMS

Modelling observations with Segmental HMMs

[Figure: model states aligned with observation segments starting at time t (d=3), time t+3 (d=2) and time t+5 (d=5)]

Segmental HMMs

- Associate states with sequences of feature vectors, where these sequences can vary in duration.
- Each state is associated with meaningful acoustic-phonetic event (phones or parts of phones).
- Can easily incorporate realistic duration model.
- Enable relationship between frames comprising a segment to be modelled explicitly.
- Characterize dynamic behaviour during a segment.

Recognition calculations with HMMs

[Figure: state-time trellis over frames 1 to 7]

- Compute most likely path through model (or sequence of models).
- Evaluate efficiently using dynamic programming (Viterbi algorithm).
- To compute probability of emitting observations up to a given frame time, for any one state need only consider states which could be occupied at previous frame.
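The frame-by-frame recursion can be sketched as follows. This is a generic textbook Viterbi decoder in log probabilities, not code from the presentation. The two-state model and its numbers are made up for illustration.

```python
import math


def viterbi(log_init, log_trans, log_obs):
    """Return (best log probability, best state path).

    log_init[j]     : log prob of starting in state j
    log_trans[i][j] : log prob of moving from state i to state j
    log_obs[t][j]   : log prob of frame t given state j
    """
    n_states = len(log_init)
    score = [log_init[j] + log_obs[0][j] for j in range(n_states)]
    back = []
    for t in range(1, len(log_obs)):
        prev_states, new_score = [], []
        for j in range(n_states):
            # Only the scores at the previous frame are needed.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            prev_states.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + log_obs[t][j])
        back.append(prev_states)
        score = new_score
    state = max(range(n_states), key=lambda j: score[j])
    best = score[state]
    path = [state]
    for prev_states in reversed(back):
        state = prev_states[state]
        path.append(state)
    path.reverse()
    return best, path


# Illustrative two-state left-to-right model (all values are assumptions).
NEG_INF = float("-inf")
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_obs = [[math.log(0.9), math.log(0.1)],
           [math.log(0.9), math.log(0.1)],
           [math.log(0.1), math.log(0.9)],
           [math.log(0.1), math.log(0.9)]]
best_log_prob, best_path = viterbi(log_init, log_trans, log_obs)
```

The cost per frame is proportional to the square of the number of states, independent of how long a state has already been occupied.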

Segmental HMM recognition calculation

[Figure: state-time trellis over frames 1 to 7]

- Principle of dynamic programming still applies.
- BUT, is more complex and computationally intensive.
- For probability in any one state at any given frame time t:
  - assume that the frame represents the last frame of a segment
  - consider all possible segment durations from 1 to some maximum D
  - therefore, must consider all possible previous states at all possible previous frame times from t-1 back to t-D.
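The extra loop over segment durations can be sketched like this. It is a minimal illustration under stated assumptions: the function names, the scoring callback `seg_log_prob`, and the toy data are hypothetical, and the first segment is assumed to start in any state with equal probability (the constant is dropped).

```python
import math


def segmental_viterbi(n_frames, n_states, max_dur, log_trans, seg_log_prob):
    """Return (best log score, list of (state, start, end)) with end exclusive.

    seg_log_prob(j, start, end) scores frames start..end-1 as one segment of state j.
    """
    neg_inf = float("-inf")
    # score[t][j]: best score of frames 0..t-1 with a segment of state j ending at frame t-1
    score = [[neg_inf] * n_states for _ in range(n_frames + 1)]
    back = [[None] * n_states for _ in range(n_frames + 1)]
    for t in range(1, n_frames + 1):
        for j in range(n_states):
            # Frame t-1 is taken to be the last frame of a segment of duration d.
            for d in range(1, min(max_dur, t) + 1):
                start = t - d
                emit = seg_log_prob(j, start, t)
                if start == 0:
                    cand, prev = emit, None
                else:
                    prev = max(range(n_states),
                               key=lambda i: score[start][i] + log_trans[i][j])
                    cand = score[start][prev] + log_trans[prev][j] + emit
                if cand > score[t][j]:
                    score[t][j] = cand
                    back[t][j] = (start, prev)
    j = max(range(n_states), key=lambda k: score[n_frames][k])
    best = score[n_frames][j]
    segments, t = [], n_frames
    while t > 0:
        start, prev = back[t][j]
        segments.append((j, start, t))
        t, j = start, prev
    segments.reverse()
    return best, segments


# Toy one-dimensional data and two unit-variance states (assumed values).
frames = [0.0, 0.1, -0.1, 5.0, 5.2]
means = [0.0, 5.0]


def seg_log_prob(j, start, end):
    return sum(-0.5 * (math.log(2 * math.pi) + (y - means[j]) ** 2)
               for y in frames[start:end])


NEG_INF = float("-inf")
log_trans = [[NEG_INF, 0.0], [NEG_INF, NEG_INF]]  # only state 0 -> state 1 allowed
best_score, best_segments = segmental_viterbi(len(frames), 2, 3, log_trans, seg_log_prob)
```

Compared with the frame-level recursion, each state and frame now needs up to D segment evaluations, which is the source of the extra computation.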

Trajectory-based segmental HMMs

[Figure: feature value plotted against time t]

- Approximate relation between successive feature vectors by some trajectory through feature space.
- Simple trajectory-based segmental HMM: associate a state with a single mean trajectory, in place of (static) single mean value used for a standard HMM.

Segmental HMM probability calculations

- Generate observations independently, but conditioned on the trajectory.
- Aim to provide constraining model of dynamics without requiring a complex model of correlations.
- BUT, trajectory may be different for different utterances of the same sound.
- So, if a single trajectory is used to represent all examples of a given model unit, will not be a very accurate representation for any one example.
- One possible solution is a mixture of trajectories, but needs many components to capture all different trajectories.
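Generating observations independently but conditioned on a trajectory can be sketched as below, assuming a one-dimensional feature, a linear trajectory and a Gaussian about it. The function name and the convention of measuring time from the segment mid-point are assumptions made for this illustration.

```python
import math


def fixed_trajectory_log_prob(y, mid_point, slope, variance):
    """Log probability of segment y given one fixed linear trajectory.

    Frames are independent given the trajectory, so per-frame terms add up.
    """
    n = len(y)
    total = 0.0
    for t, obs in enumerate(y):
        centred_t = t - (n - 1) / 2.0          # time relative to segment mid-point
        mean_t = mid_point + slope * centred_t  # trajectory value at this frame
        total += -0.5 * (math.log(2 * math.pi * variance)
                         + (obs - mean_t) ** 2 / variance)
    return total


segment = [1.0, 2.0, 3.0]
rising = fixed_trajectory_log_prob(segment, mid_point=2.0, slope=1.0, variance=0.25)
flat = fixed_trajectory_log_prob(segment, mid_point=2.0, slope=0.0, variance=0.25)
```

Here the rising trajectory passes through every frame, so it scores higher than the flat one, which corresponds to the single static mean of a standard HMM state.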

Intra- and Extra-segmental variability

[Figure: feature value plotted against time t]

- Model feature dynamics across all segment examples by, in effect, a continuous mixture of trajectories.
- This is achieved by modelling separately:
  - extra-segmental variation (underlying trajectory)
  - intra-segmental variation (about trajectory)

=> Probabilistic-trajectory segmental HMMs

Comparing different models

Generating a sequence of 5 observations

[Figure: HMM states generating 5 observations under three models: standard HMM, segmental HMM, and probabilistic-trajectory segmental HMM]

Probabilistic-trajectory segmental HMMs

[Figure: target trajectory over frames t = 1 to D, with extra-segmental variability and intra-segmental variability marked]

- Parametric trajectory model and Gaussian distributions.
- Simple linear trajectory - characterized by mid-point and slope.
- For illustration show with slope=0.
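The difference between the generative stories for the slope=0 case can be sketched by sampling. This is an illustrative assumption-laden example: the function names and all numeric values are hypothetical, and features are one-dimensional.

```python
import math
import random


def generate_standard_hmm(mean, variance, n_frames, rng):
    """Standard HMM state: every frame is drawn independently about one mean."""
    return [rng.gauss(mean, math.sqrt(variance)) for _ in range(n_frames)]


def generate_static_ptshmm(mean, extra_var, intra_var, n_frames, rng):
    """Static PTSHMM state: draw a target once per segment, then frames about it."""
    target = rng.gauss(mean, math.sqrt(extra_var))      # extra-segmental variability
    frames = [rng.gauss(target, math.sqrt(intra_var))   # intra-segmental variability
              for _ in range(n_frames)]
    return target, frames


rng = random.Random(0)
hmm_frames = generate_standard_hmm(0.0, 1.0, 5, rng)
target, ptshmm_frames = generate_static_ptshmm(0.0, 0.8, 0.2, 5, rng)
```

In the second function the five frames share one target, so they cluster around it; in the first, each frame scatters over the full state variance.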

PTSHMM probability (general)

- A segment of observations is y = y0,...,yT.
- Probability of y and trajectory f given state S is

[Equation missing from transcript: a product of two factors labelled extra-segmental and intra-segmental]

Alternative segmental models:

1. Define trajectory; model variation in trajectory
2. Fix trajectory and model observations - HMM is limiting case:

[Equation missing from transcript]
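A hedged reconstruction of the general factorization, based only on the chain rule and the two labels that survive in the transcript, would be the following. The per-frame product in the second line assumes, as stated earlier in the talk, that observations are generated independently given the trajectory; writing the trajectory value at frame t as f(t) is notation introduced here.

```latex
P(y, f \mid S) \;=\;
  \underbrace{P(f \mid S)}_{\text{extra-segmental}}
  \;\underbrace{P(y \mid f, S)}_{\text{intra-segmental}},
\qquad
P(y \mid f, S) \;=\; \prod_{t} P\bigl(y_t \mid f(t), S\bigr)
```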

Linear Gaussian PTSHMM

- Linear trajectory: slope m and mid-point c.
- Joint probability of y and linear trajectory is:

[Equation missing from transcript: three factors labelled slope, mid-point and intra-segment]

- Gaussian distributions for slope, mid-point and intra-segment variance.
- To use model in recognition, need to compute P(y|S),
  but values of trajectory parameters m and c are not known - they are “hidden” from the observer.
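One way to write the joint probability for the linear Gaussian case, consistent with the three labelled factors (slope, mid-point, intra-segment), is sketched below. The symbols for the means and variances, and the centred time index, are assumptions introduced for this sketch and are not taken from the transcript.

```latex
P(y, m, c \mid S) \;=\;
  \underbrace{\mathcal{N}(m;\, \mu_m, \gamma_m)}_{\text{slope}}
  \;\underbrace{\mathcal{N}(c;\, \mu_c, \gamma_c)}_{\text{mid-point}}
  \;\underbrace{\prod_{t} \mathcal{N}\bigl(y_t;\, c + m\,t',\, \tau\bigr)}_{\text{intra-segment}}
```

Here t' denotes frame time measured from the segment mid-point, so that c is the trajectory value at the centre of the segment, and τ is the intra-segment variance.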

Hidden-trajectory probability calculation

- One possibility: estimate the location of the trajectory, and compute the probability for that trajectory.
- Used this approach in early work, but suffers problems due to difficulty in making unbiased trajectory estimate.
- A better alternative is to allow for all possible locations of the trajectory by integrating out the unknown parameters.
- In the case of the linear model, the calculation is:
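The equation itself is missing from the transcript. Integrating out the unknown slope m and mid-point c of a linear trajectory takes the general form below; the Gaussian symbols are assumptions introduced here (means μ, variances γ for the trajectory parameters, τ for the intra-segment variance, t' for time measured from the segment mid-point).

```latex
P(y \mid S)
  \;=\; \int\!\!\int P(y, m, c \mid S)\; dm\, dc
  \;=\; \int\!\!\int \mathcal{N}(m;\, \mu_m, \gamma_m)\,
        \mathcal{N}(c;\, \mu_c, \gamma_c)
        \prod_{t} \mathcal{N}\bigl(y_t;\, c + m\,t',\, \tau\bigr)\; dm\, dc
```

Because every factor is Gaussian and the trajectory is linear in m and c, the integral has a closed form: y is jointly Gaussian with mean μ_c + μ_m t' at each frame.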

Parameters of the linear PTSHMM

- Linear PTSHMM has five model parameters: mid-point mean and variance, slope mean and variance, and intra-segment variance.
- Simpler models arise as special cases, by fixing various parameters.
- If trajectory slope is set to zero => “static” PTSHMM.
- If variability in trajectory is prevented => “fixed-trajectory” SHMM.
- Fixed-trajectory static SHMM = standard HMM with explicit duration model.
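The five parameters and their special cases can be illustrated with a closed-form segment score for a one-dimensional feature. This is a sketch derived from standard Gaussian algebra under the assumption that time is measured from the segment mid-point (so the constant and linear terms are orthogonal); it is not code from the presentation, and all names and values are illustrative.

```python
import math


def ptshmm_segment_log_prob(y, mid_mean, mid_var, slope_mean, slope_var, intra_var):
    """log P(y | S) with the linear trajectory's mid-point and slope integrated out.

    Requires intra_var > 0. Setting mid_var or slope_var to 0 removes that variability.
    """
    n = len(y)
    times = [t - (n - 1) / 2.0 for t in range(n)]   # centred: sums to zero
    resid = [obs - (mid_mean + slope_mean * tc) for obs, tc in zip(y, times)]
    s_tt = sum(tc * tc for tc in times)
    sum_r = sum(resid)
    sum_tr = sum(tc * r for tc, r in zip(times, resid))
    sum_rr = sum(r * r for r in resid)
    mid_total = intra_var + n * mid_var
    slope_total = intra_var + s_tt * slope_var
    # Determinant and quadratic form of the covariance
    # intra_var * I + mid_var * 1 1^T + slope_var * t t^T
    log_det = (n - 2) * math.log(intra_var) + math.log(mid_total) + math.log(slope_total)
    quad = (sum_rr
            - mid_var * sum_r ** 2 / mid_total
            - slope_var * sum_tr ** 2 / slope_total) / intra_var
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)


segment = [1.2, 1.9, 3.1]
# Linear PTSHMM with variability in mid-point only (slope variance fixed at zero).
linear_pt = ptshmm_segment_log_prob(segment, 2.0, 0.5, 1.0, 0.0, 0.1)
# "Static" PTSHMM: slope mean and variance both zero.
static_pt = ptshmm_segment_log_prob(segment, 2.0, 0.5, 0.0, 0.0, 0.1)
# "Fixed-trajectory" SHMM: no variability in the trajectory at all.
fixed_linear = ptshmm_segment_log_prob(segment, 2.0, 0.0, 1.0, 0.0, 0.1)
```

With both trajectory variances set to zero the expression reduces to a sum of independent per-frame Gaussian log probabilities about the fixed trajectory.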

Digit recognition experiments

- Speaker-independent connected-digit recognition
- 8 mel cepstrum features + overall energy
- three-state monophone models
- Segmental HMM max. segment dur. 10 frames (=> maximum phone duration = 300 ms).

- Compared probabilistic-trajectory SHMMs with fixed-trajectory SHMMs and with standard HMMs.
- Initialised all SHMMs from segmented training data (using HMM Viterbi alignment).
- Interested in acoustic-modelling aspects, so fixed all transition and duration probabilities to be equal.
- 5 training iterations.
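The 300 ms figure follows from simple arithmetic, assuming a 10 ms frame interval. That interval is an assumption here: it is the typical ASR frame rate quoted later in the talk, but it is not stated for this experiment.

```python
states_per_phone = 3          # three-state monophone models
max_segment_frames = 10       # maximum segment duration per state
frame_interval_ms = 10        # assumed frame interval

max_phone_duration_ms = states_per_phone * max_segment_frames * frame_interval_ms
```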

Digit recognition results: simple SHMMs

| Model | % Sub. | % Del. | % Ins. | % Err. |
|---|---|---|---|---|
| Standard HMM | 6.2 | 1.5 | 0.9 | 8.6 |
| Add duration constraint | 5.2 | 0.7 | 0.7 | 6.6 |
| Linear fixed trajectory | 3.8 | 0.5 | 0.6 | 4.9 |

- Some benefit from simply imposing duration constraints by introducing the segmental structure (prevents “silly” segmentations).
- Further benefit from representing dynamics by incorporating linear trajectory (one trajectory per model state).

Digit recognition results: static PTSHMMs

| Model | % Sub. | % Del. | % Ins. | % Err. |
|---|---|---|---|---|
| Static fixed SHMM | 5.2 | 0.7 | 0.7 | 6.6 |
| Static probabilistic SHMM | 5.2 | 2.2 | 0.1 | 7.5 |

- For static models, no advantage from distinguishing between extra- and intra-segmental variability.

Digit recognition results: linear SHMMs

| Model | % Sub. | % Del. | % Ins. | % Err. |
|---|---|---|---|---|
| Static fixed SHMM | 5.2 | 0.7 | 0.7 | 6.6 |
| Linear fixed trajectory | 3.8 | 0.5 | 0.6 | 4.9 |
| Linear PTSHMM (slope var=0) | 2.0 | 0.8 | 0.1 | 2.9 |
| Linear PTSHMM (flexible slope) | 4.9 | 4.0 | 0.1 | 9.0 |

- Some advantage for linear trajectory.
- Considerable further benefit from modelling variability in mid-point.
- But modelling variability in both mid-point and slope is detrimental to recognition performance.

Conclusions from digit experiments

Best trajectory model gives nearly 70% reduction in error-rate (2.9%) compared with standard HMMs (8.6% error-rate).

=> advantages from trajectory-based segmental HMM which also incorporates distinction between intra- and extra-segmental variability, but:

- Trajectory assumption must be reasonably accurate (advantage for linear but not for static models).
- Not beneficial to model variability in slope parameter - possibly too variable between speakers, or too difficult to estimate reliably for short segments.
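The headline comparison can be checked with a few lines of arithmetic. The error figures are those reported in the results tables; the helper names are introduced here for illustration.

```python
def total_error(sub, dele, ins):
    """Word error rate as the sum of substitution, deletion and insertion rates."""
    return sub + dele + ins


def relative_reduction(baseline_err, new_err):
    """Percentage reduction in error rate relative to the baseline."""
    return 100.0 * (baseline_err - new_err) / baseline_err


standard_hmm_err = total_error(6.2, 1.5, 0.9)      # 8.6
best_ptshmm_err = total_error(2.0, 0.8, 0.1)       # 2.9
reduction = relative_reduction(standard_hmm_err, best_ptshmm_err)  # about 66%
```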

Phonetic classification: TIMIT

- Training and recognition with given segment boundaries.
- Train on complete training set (male speakers), with classification on core test set.
- 12 mel cepstrum features + overall energy.
- Evaluated (constrained) linear PTSHMMs.
- Compared performance with standard-HMM performance for:
- context-dependent (biphone) versus context-independent (monophone) models
- feature set using only the mel cepstrum features versus one which also included time derivative features.

TIMIT classification results

- Improvement with linear PTSHMM is greatest for more accurate (context-dependent) models.
=> more benefit from modelling trajectories when not including different phonetic events in one model.

- Most advantage when not using delta features.
=> most benefit from modelling dynamics when not attempting to represent dynamics in front-end.

Benefit of PTSHMMs for some different phone classes

| Phone class | No. of examples | HMM % error | PTSHMM % error | % improvement |
|---|---|---|---|---|
| Fricatives (f v th dh s z sh hh) | 710 | 41.7 | 38.9 | 6.8 |
| Vowels (iy ih eh ae ah uw uh er) | 1178 | 53.8 | 48.9 | 9.1 |
| Semivowels and glides (l r y w) | 97 | 39.2 | 33.2 | 15.4 |
| Diphthongs (ey ay oy aw ow) | 376 | 48.9 | 41.2 | 15.8 |
| Stops (p t dx k b d g) | 566 | 56.7 | 54.8 | 3.4 |

Most benefit from linear PTSHMM for sounds characterised by continuous smooth-changing dynamics.

Summary of findings

- Probabilistic-trajectory segmental HMMs can outperform standard HMMs and fixed-trajectory segmental HMMs.
- Separately modelling variability within/between segments is a powerful approach, provided that:
- trajectory assumptions are appropriate (linear trajectory)
- variability in the parameter can be usefully modelled (not useful to model variability in slope parameter with current approach).

- The models have been shown to give useful performance gains.

Issues of modelling speech dynamics

Compare error rates on TIMIT task:

- HMMs with time derivatives: 29.8%
- best segmental HMM result WITHOUT time derivatives: 38.2%.
=> time derivatives capture some aspects of dynamics not modelled in segmental HMMs.

- Time derivative features provide some measure of dynamics for every frame.
- current segmental HMMs only model dynamics within a segment.

Modelling issues and questions (1)

- Choice of model unit (e.g. phone, diphone)
- How to model dynamics and continuity effects across segment boundaries, to represent dynamics throughout an utterance.
- How to model context effects. (e.g. could define trajectories according to previous and following sounds - but complicates search)
- How to define trajectories. (e.g. linear or higher-order polynomial; versus dynamical-system type model with filtered output of hidden states)

Modelling issues and questions (2)

- Incorporating a realistic duration model.
- How to model any systematic effects of duration on trajectory realisation - should reduce remaining variability in trajectories.
- How to model speaker-dependent effects and speaker continuity.
- How to deal with other systematic influences - e.g. speaker stress, speaking rate.
- Dealing with external influences - e.g. noise.
- Choice of features for trajectory modelling.

Spectral representations (1)

- Typical wideband spectrogram - for display compute spectrum at frequent time intervals (e.g. 2 ms):

[Figure: wideband spectrogram with phone labels "th r ee", "s I x", "s I x"]

- Typical features for ASR: MFCCs computed from FFT of 25 ms windows at 10 ms intervals.
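The fixed-window framing behind these features can be sketched as follows. Only the 25 ms window and 10 ms interval come from the slide; the function name and the 16 kHz sampling rate in the example are assumptions.

```python
def frame_bounds(n_samples, sample_rate, window_ms=25.0, shift_ms=10.0):
    """Return (start, end) sample indices of each fixed-position analysis window."""
    window = int(round(sample_rate * window_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if n_samples < window:
        return []
    return [(start, start + window)
            for start in range(0, n_samples - window + 1, shift)]


# One second of audio at an assumed 16 kHz sampling rate.
bounds = frame_bounds(n_samples=16000, sample_rate=16000)
```

Each window spans 2.5 frame intervals, so consecutive windows overlap heavily and any event shorter than the window is averaged with its surroundings.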

Spectral representations (2)

- Using long windows at fixed positions blurs rapid events - stop bursts and rapid formant transitions.
- An alternative: use a shorter window “excitation synchronously”:

[Figure: excitation-synchronous spectrogram with phone labels "th r ee", "s I x", "s I x"]

- Compare with long fixed-window analysis:

Standard HMM digit recognition experiments

- Compared excitation-synchronous analysis with fixed analysis for different window lengths.
- In all cases computed FFT then mel cepstrum.
- Shorter window gives lower frequency resolution, but effect is not so great on mel scale.
- Best fixed-window condition 20 or 25 ms: 2.1% err. (increased to 4.6% for a 5 ms window).
- Best synchronous-window condition 10 ms: 1.9% err., increasing only to 2.1% for a 5 ms window.
=> some advantage to capturing rapid events. But note short window may be a disadvantage for fricatives.

Maybe combine different analyses?

Moving beyond cepstrum trajectories

- Start with spectral analysis: this must preserve all relevant information.
- But is it appropriate to then model trajectories directly in the spectral/cepstral domain?
- Motivation for modelling dynamics is from nature of articulation, and its acoustic consequences.
=> should be modelling in domain closer to articulation.

- One possibility is an articulatory description.
- Another option is formants - closely related to articulation but also to acoustics.

Problems with formant analysis

- Unambiguous formant labelling may not be possible from a single spectral cross-section.
e.g. close formants may merge to give single spectral peak

- A formant may not be apparent in the spectrum.
e.g. formant is weakly excited (F1 in unvoiced sounds).

- NOT useful for certain distinctions, where low amplitude is the main feature.
e.g. identifying silence or weak fricatives.

=> difficult to identify formants independently from recognition process, so not generally used as features for automatic speech recognition.

Estimating formant trajectories

[Figure: spectrogram with formant trajectories and phone labels "s i k s", "th r ee", "o ne"]

- Where see clear formant structure, F1, F2 and F3 can be identified.
- In voiceless fricatives, higher formant movements are usually continuous with those in adjacent vowels.
- For F1, arbitrarily connect between adjacent vowels.

Formant analysis method

John Holmes (Proc. EUROSPEECH’97)

- Aims to emulate human abilities:
  - ability to label single spectrum cross-sections
  - rely heavily on continuity over time
  - sometimes need knowledge of what is being said to disambiguate alternatives
- Two fundamental features of the method:
  - outputs alternatives when uncertain (“delayed decisions”).
  - notion of “confidence” in formant measurement: when formants cannot be estimated (e.g. during silence), confidence is low and estimate not useful for recognition
    => rely on other features (general spectrum shape).

Example of formant analyser output

[Figure: formant analyser output for the utterance “four seven”]

- Up to two sets of formants for each frame.
- Alternatives are in terms of sets - F1, F2, F3.
- Specified frame by frame, but are usually alternative trajectories.

Segmental HMM experiments

- Each segment model is associated with a linear trajectory.
- Model each phone by a sequence of one or more segments, e.g.
  - monophthongal vowels, fricatives - 1 segment
  - diphthongs - sequence of 2 segments
  - aspirated voiceless stops - sequence of 3 segments.

- Set allowed minimum and maximum segment duration dependent on identity of phone segment (loose constraint).
- Incorporate confidence estimate (as a variance) in recognition calculations.
- Resolve formant alternatives based on probability.
- Use formants + low-order cepstrum features.
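The slides do not say exactly how the confidence estimate enters the calculation or how alternatives are resolved. One plausible reading, sketched here purely as an assumption, is to add the confidence variance to the model variance when scoring a formant measurement, and to keep whichever alternative formant set scores higher. All names and numbers are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, variance):
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)


def formant_log_prob(measured_hz, model_mean, model_var, confidence_var):
    """Assumed scheme: low confidence (large variance) flattens the score."""
    return gaussian_log_pdf(measured_hz, model_mean, model_var + confidence_var)


def pick_alternative(alternatives, model_means, model_vars):
    """alternatives: formant sets, each a list of (frequency_hz, confidence_var)."""
    def score(formant_set):
        return sum(formant_log_prob(f, m, v, cv)
                   for (f, cv), m, v in zip(formant_set, model_means, model_vars))
    return max(alternatives, key=score)


# Hypothetical model for one segment (F1, F2, F3 means and variances in Hz, Hz^2).
model_means = [500.0, 1500.0, 2500.0]
model_vars = [100.0 ** 2] * 3
set_a = [(520.0, 50.0 ** 2), (1450.0, 50.0 ** 2), (2600.0, 50.0 ** 2)]
set_b = [(520.0, 50.0 ** 2), (2600.0, 50.0 ** 2), (3400.0, 50.0 ** 2)]
chosen = pick_alternative([set_a, set_b], model_means, model_vars)
```

Under this reading, a measurement with very low confidence contributes almost the same score whatever its value, so recognition falls back on the other features.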

Some connected-digit recognition results

Word error rates

| Model | 8 cep. | 5 cep. + 3 for. |
|---|---|---|
| Standard-HMM baseline with 3 states per phone | 3.5% | 2.5% |
| Standard HMMs with variable state allocation | 6.4% | 5.9% |
| Introduce segment structure | 3.2% | 2.9% |
| Introduce linear trajectory | 2.6% | 2.3% |

- Performance drops when introduce new state allocation (total number of states about half that of baseline).
- Need segment structure for good performance.
- Some advantage from linear trajectory.
- Formants show small, but consistent, advantage.

Formant modelling

- Expressing a model in terms of formant dynamics offers:
  - Potential for modelling systematic effects in a meaningful way: e.g. speaker identity, speaker stress, speaking rate.
  - Potential for a constrained model for speech, which should be more robust to noise (assuming also model the noise).

- BUT: analysis of formants separately from hypotheses about what is being said will always be prone to errors.
- FUTURE AIM: integrate formant analysis within recognition scheme: provided speech model is accurate, this should overcome any formant tracking errors.
- A good model for speech should be appropriate for synthesis as well as for recognition: a trajectory-based formant model offers this possibility.

A “unified” speech model: applied to coding

A simple coding scheme

- Demonstrate principles of coding using same model for both recognition and synthesis.
- Model represents linear formant trajectories.
- Recognition: linear trajectory segmental HMMs of formant features.
- Synthesis: JSRU parallel-formant synthesizer.
- Coding is applied to analysed formant trajectories
=> relatively high bit-rate (up to about 1000 bits/s).

- Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.

Segment coding scheme overview

[Figure: overview of the segment coding scheme]

Speech coding results

[Audio examples, each given both coded at about 600 bps and natural: Speaker 1 digits; Speaker 2 digits; Speaker 3 digits; Speaker 1 ARM report]

Achievements of study: Established principle of using formant trajectory model for both recognition and synthesis, including using information from recognition to assist in coding.

Future work: better quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.
