Adaptation techniques in automatic speech recognition

Tor André Myrvoll

Telektronikk 99(2), Issue on Spoken Language Technology in Telecommunications, 2003.


Goal and Objective

  • Make ASR robust to speaker and environmental variability.

  • Model adaptation: automatically adapt an HMM using limited but representative new data to improve performance.

  • Train ASR systems for applications with insufficient training data.


What Do We Have/Adapt?

  • An HMM-based ASR system trained in the usual manner.

  • The state output probabilities are parameterized by GMMs.

  • Adapting the state transition probabilities and the mixture weights gives no improvement.

  • Covariances are difficult to estimate robustly.

  • Mixture means can be adapted “optimally” and have proven useful.
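To make "output probability parameterized by GMMs" concrete, here is a minimal sketch of the log output probability of one HMM state, assuming diagonal covariances (a common choice, not stated on the slide). The function name and argument layout are illustrative only.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log output probability log b_j(x) of one HMM state modeled by a
    diagonal-covariance GMM (assumed layout, for illustration).

    x:         (D,) feature vector
    weights:   (M,) mixture weights summing to 1
    means:     (M, D) mixture means
    variances: (M, D) diagonal covariances
    """
    diff = x - means                                    # (M, D)
    # Per-component log N(x; mu_m, diag(var_m))
    log_gauss = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                        + np.sum(diff ** 2 / variances, axis=1))
    # log sum_m w_m N(x; ...), computed stably with log-sum-exp
    a = np.log(weights) + log_gauss
    a_max = np.max(a)
    return a_max + np.log(np.sum(np.exp(a - a_max)))
```

Of these parameters (weights, means, variances), the slide singles out the means as the ones worth adapting.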


Adaptation Principles

  • Main assumption: the original model is “good enough”, so model adaptation must not amount to re-training!


Offline Vs. Online

  • Adapt offline if possible (performance is then not compromised for computational reasons).

  • Decode the adaptation speech data using the current model.

  • Use the resulting transcription to estimate the statistics of the “speaker-dependent” model.
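A minimal sketch of the statistics step, assuming the decoding pass has already produced a hard frame-to-Gaussian alignment (the `alignment` array is a hypothetical input; soft posteriors would be accumulated the same way with weights). It collects the zeroth-order (occupancy) and first-order (sum of frames) statistics that the mean-adaptation methods below consume.

```python
import numpy as np

def accumulate_stats(frames, alignment, num_gaussians):
    """Accumulate per-Gaussian statistics from decoded adaptation data.

    frames:    (T, D) adaptation feature vectors
    alignment: (T,) integer index of the Gaussian each frame was assigned to
               when the data were decoded with the current model
    Returns occupancy counts (M,) and first-order sums (M, D).
    """
    D = frames.shape[1]
    counts = np.zeros(num_gaussians)
    sums = np.zeros((num_gaussians, D))
    # np.add.at handles repeated indices correctly (unbuffered accumulation)
    np.add.at(counts, alignment, 1.0)
    np.add.at(sums, alignment, frames)
    return counts, sums
```

The sample mean of Gaussian m is then `sums[m] / counts[m]` wherever `counts[m] > 0`.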


Online Adaptation Using Prior Evolution

  • The present posterior becomes the next prior.
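Prior evolution is easiest to see in a case where a conjugate prior does exist. The sketch below assumes a single scalar Gaussian mean with known observation variance and known frame assignment (much simpler than a full HMM, where no such closed form is available). Each call returns a posterior that is fed back in as the prior for the next batch.

```python
def update_gaussian_mean_prior(prior_mean, prior_var, batch, obs_var):
    """One step of prior evolution for a scalar Gaussian mean with known
    observation variance (Normal prior, Normal likelihood).

    Returns the posterior (mean, var); the caller passes it back in as the
    prior when the next batch of adaptation data arrives.
    """
    n = len(batch)
    post_precision = 1.0 / prior_var + n / obs_var
    post_mean = (prior_mean / prior_var + sum(batch) / obs_var) / post_precision
    return post_mean, 1.0 / post_precision
```

Processing the data in two batches, with the first posterior as the second prior, gives the same result as one batch containing all the data, which is what makes online adaptation by prior evolution consistent in this idealized setting.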


MAP Adaptation

  • HMMs have no sufficient statistics, so conjugate prior-posterior pairs cannot be used; find the posterior mode (the MAP estimate) via EM.

  • Estimate the prior empirically (it is multi-modal; the initial model is estimated using ML training).
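A minimal sketch of the widely used MAP mean update, which interpolates between the prior (original) mean and the adaptation data. It assumes the occupancy counts and first-order sums from an EM/alignment pass are given; the prior weight `tau` default is a hypothetical value that would normally be tuned.

```python
import numpy as np

def map_adapt_means(prior_means, counts, sums, tau=10.0):
    """MAP re-estimation of Gaussian means.

    prior_means: (M, D) means of the original (prior) model
    counts:      (M,) occupancy of each Gaussian in the adaptation data
    sums:        (M, D) sums of (posterior-weighted) frames per Gaussian
    tau:         prior weight (assumed value; tune on development data)
    """
    counts = counts[:, None]
    # Gaussians with no data keep their prior mean; well-observed ones
    # move toward their sample mean.
    return (tau * prior_means + sums) / (tau + counts)
```

As the occupancy grows, the estimate approaches the sample mean, i.e. the speaker-dependent ML estimate; Gaussians that never occur in the adaptation data are not changed at all, which is the weakness EMAP addresses.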


EMAP

  • Not every phoneme in every context occurs in the adaptation data, so correlations between model parameters must be stored in order to adapt the unseen ones.

  • EMAP considers only correlations between mean vectors, under a jointly Gaussian assumption.

  • For large model sizes, share means across models.
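The idea of adapting unseen units through stored correlations can be sketched with plain Gaussian conditioning. This is a simplified illustration, not the full EMAP procedure: each unit's mean is a scalar, the joint prior covariance is assumed known, and the variance of each observed sample mean is supplied by the caller.

```python
import numpy as np

def emap_posterior_means(prior_means, prior_cov, observed_idx, sample_means, noise_var):
    """Posterior mean of all (scalar) unit means under a jointly Gaussian prior.

    prior_means:  (N,) prior means of all units
    prior_cov:    (N, N) prior covariance, i.e. correlations between units
    observed_idx: indices of the units seen in the adaptation data
    sample_means: (K,) sample means of those units from the adaptation data
    noise_var:    (K,) variance of each sample mean (e.g. sigma^2 / n), assumed given
    """
    H = np.eye(len(prior_means))[observed_idx]          # selects observed units
    S = H @ prior_cov @ H.T + np.diag(noise_var)        # covariance of the observations
    gain = prior_cov @ H.T @ np.linalg.inv(S)
    # Unseen units move in proportion to their correlation with observed ones
    return prior_means + gain @ (sample_means - H @ prior_means)
```

With a correlation of 0.9 between two units and a noise-free observation that shifts unit 0 from 0 to 2, the unseen unit 1 is moved to 1.8; with zero correlation it would stay put. The N-by-N covariance is also why sharing means across models becomes necessary for large model sizes.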


Transformation Based Model Adaptation

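  • Estimate a transform T, parameterized by a small set of parameters, that maps the original model to the adapted model.

  • The transform parameters can be estimated by ML, or by MAP when a prior over the transform parameters is available.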
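The slide's formulas did not survive extraction. In one common notation (our assumption: λ is the original model, T_η the transform with parameters η, X the adaptation data), the ML and MAP criteria for the transform read:

```latex
\hat{\eta}_{\mathrm{ML}}  = \arg\max_{\eta}\; p\bigl(X \mid T_{\eta}(\lambda)\bigr),
\qquad
\hat{\eta}_{\mathrm{MAP}} = \arg\max_{\eta}\; p\bigl(X \mid T_{\eta}(\lambda)\bigr)\, p(\eta),
\qquad
\hat{\lambda} = T_{\hat{\eta}}(\lambda).
```

Because η is much smaller than λ, the transform can be estimated from far less data than the model itself.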


Bias, Affine and Nonlinear Transformations

  • ML estimation of bias.

  • Affine transformation.

  • Nonlinear transformation (the transform may be a neural network).
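As an illustration of the simplest case, here is a sketch of ML estimation of a single bias vector b added to every Gaussian mean, assuming diagonal covariances and occupation probabilities from the current model. Setting the derivative of the expected log-likelihood to zero gives a closed form per dimension; the function name is ours.

```python
import numpy as np

def ml_bias(frames, posteriors, means, variances):
    """ML estimate of one bias vector b shared by all Gaussians
    (adapted means mu_m + b), assuming diagonal covariances.

    frames:     (T, D) adaptation data
    posteriors: (T, M) occupation probabilities gamma_t(m) from the current model
    means:      (M, D) original means
    variances:  (M, D) diagonal covariances
    """
    inv_var = 1.0 / variances                               # (M, D)
    weighted_x = posteriors.T @ frames                      # (M, D): sum_t gamma_tm x_t
    occ = posteriors.sum(axis=0)[:, None]                   # (M, 1): sum_t gamma_tm
    # b_d = sum_m (sum_t gamma (x_td - mu_md)) / var_md  /  sum_m occ_m / var_md
    num = np.sum(inv_var * (weighted_x - occ * means), axis=0)
    den = np.sum(inv_var * occ, axis=0)
    return num / den
```

The affine case generalizes this from a shift b to a full mapping A·mu + b, which is what MLLR estimates.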


MLLR

  • Apply separate affine transformations to different parts of the model, e.g. to different regression classes of Gaussians (implemented by HEAdapt in HTK).
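A minimal sketch of estimating one MLLR mean transform for one regression class, assuming diagonal covariances so that each row of W = [b, A] has its own closed-form solution. This is an illustrative NumPy version, not the HEAdapt implementation; all names are ours.

```python
import numpy as np

def estimate_mllr_mean_transform(frames, posteriors, means, variances):
    """Estimate one MLLR mean transform W = [b, A] of shape (D, D+1) for the
    Gaussians of one regression class, assuming diagonal covariances.
    Adapted means are W @ [1, mu_m].

    frames:     (T, D) adaptation data
    posteriors: (T, M) occupation probabilities of the class's Gaussians
    means:      (M, D) original means
    variances:  (M, D) diagonal covariances
    """
    D = frames.shape[1]
    M = means.shape[0]
    xi = np.hstack([np.ones((M, 1)), means])          # extended means (M, D+1)
    occ = posteriors.sum(axis=0)                      # (M,)
    weighted_x = posteriors.T @ frames                # (M, D)
    W = np.zeros((D, D + 1))
    for i in range(D):
        w_m = occ / variances[:, i]                   # occupancy / variance, per Gaussian
        G = (xi * w_m[:, None]).T @ xi                # sum_m w_m xi_m xi_m^T
        k = (weighted_x[:, i] / variances[:, i]) @ xi # sum_m (sum_t gamma x_ti / var) xi_m
        W[i] = np.linalg.solve(G, k)                  # row i of W
    return W
```

Each regression class needs enough data (at least D+1 Gaussians with affinely independent means) for G to be invertible; this is why classes are usually merged when adaptation data are scarce.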


SMAP

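  • Model the mismatch between the speaker-independent (SI) model and the test environment.

  • No mismatch: the test data follow the distribution predicted by the SI model.

  • Mismatch: the deviation from the SI model is described by a small set of mismatch parameters.

  • The mismatch parameters are estimated by the usual ML methods on the adaptation data.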


Adaptive Training

  • Gender-dependent model selection

  • Vocal tract length normalization (VTLN; in HTK via the WARPFREQ configuration parameter)
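VTLN rescales the frequency axis with a speaker-specific warping factor, commonly chosen by a grid search that maximizes likelihood. The sketch below shows one simple piecewise-linear warping function; it is an assumed illustrative form, not necessarily the exact function HTK applies when WARPFREQ is set, and `f_cut` is a hypothetical parameter.

```python
def piecewise_linear_warp(freq, alpha, f_cut, f_nyquist):
    """Warp frequency `freq` (Hz) with warping factor `alpha`.

    Below f_cut the axis is scaled by alpha; above it, a second linear segment
    joins (f_cut, alpha * f_cut) to (f_nyquist, f_nyquist) so the warped axis
    still ends at the Nyquist frequency. Requires alpha * f_cut < f_nyquist.
    """
    if freq <= f_cut:
        return alpha * freq
    slope = (f_nyquist - alpha * f_cut) / (f_nyquist - f_cut)
    return alpha * f_cut + slope * (freq - f_cut)
```

The warp would be applied to the filterbank center frequencies before feature extraction, once per candidate alpha.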


Speaker Adaptive Training

  • Assumption: there exists a compact (canonical) model that relates to every speaker-dependent model via an affine transformation T (as in MLLR). The canonical model and the transformations are found using EM.
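One half of an SAT EM iteration can be sketched as follows: with the per-speaker affine transforms held fixed, the canonical mean of one Gaussian has a closed form obtained by setting the derivative of the expected log-likelihood to zero. The data layout is a hypothetical choice for illustration; the other half (re-estimating each speaker's transform) is an MLLR-style estimate per speaker.

```python
import numpy as np

def sat_canonical_mean(speaker_data):
    """Re-estimate one canonical Gaussian mean given fixed per-speaker affine
    transforms (speaker s sees the mean A_s @ mu + b_s).

    speaker_data: list of (frames, gammas, A, b, cov_inv) per speaker, where
      frames (T, D), gammas (T,) occupation probabilities of this Gaussian,
      A (D, D) and b (D,) the speaker's transform, cov_inv (D, D) inverse covariance.
    Solves  sum_s sum_t gamma * A^T C^-1 (x_t - A mu - b) = 0  for mu.
    """
    D = speaker_data[0][2].shape[0]
    lhs = np.zeros((D, D))
    rhs = np.zeros(D)
    for frames, gammas, A, b, cov_inv in speaker_data:
        occ = gammas.sum()
        AtCi = A.T @ cov_inv
        lhs += occ * AtCi @ A
        rhs += AtCi @ (gammas @ (frames - b))
    return np.linalg.solve(lhs, rhs)
```

Because speaker variability is absorbed by the transforms, the canonical model can stay compact while each speaker is still well fitted.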


Cluster Adaptive Training

  • Group the speakers in the training set into clusters, then find the cluster closest to the test speaker.

  • Use canonical models.
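The "find the closest cluster" step can be sketched as picking the cluster model with the highest log-likelihood on the test speaker's adaptation data. As a simplifying assumption, each cluster is reduced here to a single diagonal Gaussian; real cluster models would be full HMMs scored against a transcription.

```python
import numpy as np

def closest_cluster(frames, cluster_means, cluster_vars):
    """Return the index of the cluster whose model gives the highest total
    log-likelihood on the adaptation frames (T, D).

    cluster_means, cluster_vars: sequences of (D,) arrays, one per cluster
    (each cluster simplified to one diagonal Gaussian).
    """
    scores = []
    for mean, var in zip(cluster_means, cluster_vars):
        # Sum of per-frame diagonal-Gaussian log-likelihoods
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)
        scores.append(ll)
    return int(np.argmax(scores))
```

The same likelihood-based selection covers the gender-dependent model selection mentioned above, with two clusters.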


Eigenvoices

  • Similar to Cluster Adaptive Training.

  • Concatenate the means of each of the R speaker-dependent models into a supervector. Perform PCA on the resulting vectors. Store K << R eigenvoice vectors.

  • Form a mean supervector from the SI model as well.

  • Given a new speaker, the mean supervector is a linear combination of the SI supervector and the eigenvoice vectors.
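A minimal sketch of the eigenvoice steps, assuming the supervectors are already stacked as rows of a matrix. The weights are obtained here by a least-squares projection of a (hypothetical) rough estimate of the new speaker's supervector; the maximum-likelihood weight estimate normally used needs HMM statistics and is not shown.

```python
import numpy as np

def build_eigenvoices(sd_supervectors, K):
    """PCA on R speaker-dependent mean supervectors (R, dim); keep K << R."""
    centered = sd_supervectors - sd_supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:K]                                   # (K, dim), orthonormal rows

def eigenvoice_adapt(si_supervector, eigenvoices, target_supervector):
    """Adapted supervector = SI supervector + linear combination of eigenvoices.

    Because the eigenvoices are orthonormal, the least-squares weights are a
    plain projection of the offset from the SI supervector.
    """
    weights = eigenvoices @ (target_supervector - si_supervector)
    return si_supervector + eigenvoices.T @ weights, weights
```

Only K weights are estimated per speaker, which is why eigenvoices can adapt from very little data.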


Summary

  • Two major approaches: MAP (and EMAP) and MLLR.

  • MAP needs more adaptation data than MLLR (it uses a simple prior). As the amount of data grows, MAP converges to the speaker-dependent (SD) model.

  • Adaptive training is gaining popularity.

  • For mobile applications, complexity and memory are major concerns.

