  1. An EKF-based algorithm for learning statistical hidden dynamic model parameters for phonetic recognition Roberto Togneri University of Western Australia Li Deng Microsoft Research, Redmond ICASSP'2001

  2. Contents • The Hidden Dynamic Model (HDM) • Parameter Estimation by EM • Parameter Estimation by EKF • Comparison between EM and EKF • Phone Recognition Evaluations • Recognition Results • Model Convergence Results • Discussion of Results • Conclusion and Future Work ICASSP'2001

  3. Hidden Dynamic Model (HDM) • Target-directed, vocal tract resonance (VTR) state dynamics • Static, non-linear mapping to MFCC • Switching state-space parameters • (Tj, Φj) parameters switch when crossing into a new phone dynamic regime, j • Continuity of dynamic state, z(k) • z(k) is continuous across phone regimes • MLP non-linear mapping, h(.) • h(.) is a 3-layer MLP with z(k) on the input layer and MFCC observations, O(k), on the output layer • hyperbolic tangent activation function • Phone segmentation assumed known ICASSP'2001
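The equation images from this slide are not in the transcript; the following is a reconstruction from the bullet descriptions, using the standard target-directed HDM form, with w(k) and v(k) assumed to be zero-mean Gaussian noises:

    z(k+1) = Φj z(k) + (I − Φj) Tj + w(k)    (target-directed VTR dynamics in regime j)
    O(k)   = h(z(k)) + v(k)                  (static MLP mapping from VTR state to MFCC)

Assuming the diagonal entries of Φj lie in (0, 1), z(k) decays towards the target Tj for as long as regime j is active, while remaining continuous at regime switches.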

  4. Parameter estimation by EM • E-step • Use the EKF to provide estimates of the hidden dynamic state sequence, Z, given the known observations, O, and the current parameter estimates (Tj, Φj) • M-step • Maximise the Q-function with respect to (Tj, Φj) • Solution of non-linear, high-order equations • Use a generalised form of EM • Gradient descent or Newton-Raphson • Backprop algorithm for h(.) MLP weights • After each EM iteration use the back-propagation algorithm to estimate the MLP weights • smoothed EKF estimates of Z as input • given observations, O, as output ICASSP'2001
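A minimal sketch of one generalised-EM pass as described above; run_ekf_smoother, grad_Q and train_mlp_backprop are hypothetical helper names, not routines from the paper:

    # One generalised-EM iteration for the HDM (sketch, hypothetical helpers).
    def gem_iteration(O, params, mlp, segmentation, lr=1e-3, n_grad_steps=10):
        # E-step: EKF smoother estimates E[z(k) | O] under the current
        # (Tj, Phij) parameters and MLP weights.
        Z = run_ekf_smoother(O, params, mlp, segmentation)

        # Generalised M-step: gradient ascent on the Q-function, since the
        # stationarity equations are non-linear and high-order.
        for _ in range(n_grad_steps):
            for j, p in params.items():
                g_target, g_phi = grad_Q(Z, O, p, j, segmentation)
                p.target += lr * g_target
                p.phi += lr * g_phi

        # MLP update: back-propagation with the smoothed states as inputs
        # and the observed MFCC vectors as targets.
        mlp = train_mlp_backprop(mlp, inputs=Z, targets=O)
        return params, mlp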

  5. Parameter Estimation by EKF • ADDENDUM to paper • Paper only covers EKF estimation of (Tj, Φj), but implemented version also estimates MLP weights, Wr, as described here • Augmented form of state vector • Augmented state equation • Observation equation • Nonlinear mapping hr(.) only depends on z(k) and Wr(k) ICASSP'2001
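The augmented-state equations on this slide were images and are missing from the transcript; the following reconstruction fills them in from the bullet descriptions, stacking the per-regime parameters into the state:

    x(k) = [ z(k); Tj(k); Φj(k); Wr(k) ]                        (augmented state vector)
    z(k+1)  = Φj(k) z(k) + (I − Φj(k)) Tj(k) + w(k)             (dynamic part)
    Tj(k+1) = Tj(k),  Φj(k+1) = Φj(k),  Wr(k+1) = Wr(k)         (parameters held constant)
    O(k)    = hr(z(k), Wr(k)) + v(k)                            (observation equation)

The EKF then estimates the state and the parameters jointly; consistent with slide 6, the process-noise entries for the parameter rows are set to zero.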

  6. Parameter Estimation by EKF • State equation Jacobian matrix • m-dim z(k), p-dim Wr(k), n-dim O(k) • Observation equation Jacobian matrix • Initialisation • Parameter vector, x(0|0) • State error covariance matrix, P(0|0) • Important to control convergence • State noise covariance, Q(k) • Set to zero for parameter equations • Observation noise covariance, R(k) ICASSP'2001
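A compact sketch of one EKF recursion over the augmented state, with the two Jacobians from this slide supplied as callables; f, F_jac, h and H_jac are hypothetical names for the state equation, its Jacobian, the observation equation and its Jacobian:

    import numpy as np

    def ekf_step(x, P, O_k, f, F_jac, h, H_jac, Q, R):
        # Time update: propagate the augmented state and error covariance.
        x_pred = f(x)                        # state equation
        F = F_jac(x)                         # state-equation Jacobian
        P_pred = F @ P @ F.T + Q             # Q is zero in the parameter rows

        # Measurement update: correct with the observed MFCC vector O(k).
        H = H_jac(x_pred)                    # observation-equation Jacobian
        S = H @ P_pred @ H.T + R             # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
        x_new = x_pred + K @ (O_k - h(x_pred))
        P_new = (np.eye(len(x)) - K @ H) @ P_pred
        return x_new, P_new

The sensitivity to P(0|0) noted on the slide shows up in the parameter blocks of P: set too small, the parameter estimates barely move; set too large, early poorly-linearised updates can diverge.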

  7. Comparison between EM and EKF • Cons of EM algorithm • Requires additional back-propagation algorithm to estimate MLP weights • Convergence problems in M-step due to non-linear equations • Slower rate of convergence • Cons of EKF algorithm • Sensitive to initial conditions, especially initialisation of P(0|0) and Q • Computationally expensive for large augmented state vectors ICASSP'2001

  8. Phone Recognition Evaluations • N-best rescoring • Use the baseline HMM to provide time-aligned 5-best and 100-best transcriptions • Optionally include the reference transcription • 100-best, 100-best+ref, 5-best, 5-best+ref • Calculate the HDM log-likelihood score across all transcriptions and select the highest-scoring transcription to calculate the WER of the HDM • Perform HMM forced alignment across all transcriptions and select the highest-scoring transcription to calculate the WER of the HMM • Baseline HMM • Context-dependent phone HMM • 3-state, left-right triphone model • cross-word triphone network • 39-dim (13+13+13) MFCC observation vectors • HTK v2.2 software • Trained on all TIMIT training data • Tested on TIMIT dr8 testing data ICASSP'2001
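A minimal sketch of the rescoring protocol described above; hdm_loglik and hmm_forced_alignment_score are hypothetical stand-ins for the actual scoring routines:

    # N-best rescoring (sketch): each model picks its own best transcription.
    def rescore(nbest, ref=None):
        # Candidate list: the HMM's time-aligned N-best, optionally + reference.
        candidates = list(nbest) + ([ref] if ref is not None else [])
        hdm_pick = max(candidates, key=hdm_loglik)
        hmm_pick = max(candidates, key=hmm_forced_alignment_score)
        # WER is then computed from each model's pick against the reference.
        return hdm_pick, hmm_pick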

  9. Phone Recognition Evaluations • Evaluation HDM • 3-dim VTR state vector • 13-dim observation MFCC vector • Per-phone model j • 3-dim target, Tj • 3-dim “diagonal” time-constant, Φj • HDMm variant • one 3-12-13 MLP per phone model • 42 phone models • HDMc variant • one 3-16-13 MLP per broad class • 3 broad classes: Silence, Voiced, Unvoiced • Trained on TIMIT dr8 training data • Tested on TIMIT dr8 testing data • Differences in training data • baseline HMM performance superior when using all TIMIT data • HDM computational requirements limit training data to the dr8 subset ICASSP'2001
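A rough parameter count from the configurations above (my arithmetic, assuming one bias unit per MLP layer, not figures from the slides): each 3-12-13 MLP has (3+1)×12 + (12+1)×13 = 217 weights, so HDMm carries 42 × (217 + 6) ≈ 9.4k parameters including the six (Tj, Φj) values per phone; HDMc's three 3-16-13 MLPs have (3+1)×16 + (16+1)×13 = 285 weights each, giving 3 × 285 + 42 × 6 ≈ 1.1k parameters. Counts of this order lie behind the parsimony ratios on slide 12.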

  10. Recognition Results • WER results • Both HMM and HDM perform little better than chance when presented with the N-best list • phone recognition is a difficult problem • HDM performance improves significantly when presented with the N-best+ref list • the HDM is able to select the reference transcription, whereas the HMM is not ICASSP'2001

  11. Model Convergence Results • Generative properties of HDM • Compare MFCC acoustic feature vector with generated outputs from HDMm and HDMc • HDMm and HDMc convergence to observed features is evident ICASSP'2001

  12. Modelling Results • Parsimony of HDM • HDM is a more structured modelling paradigm • HDMm parameters = 0.015 x HMM parameters • HDMc parameters = 0.0014 x HMM parameters • Identification of HDM parameters • The estimated (Tj, Φj) do not appear to have converged to the expected values • The estimated Tj do not necessarily correspond to the measured VTRs for phone j • Problems due to incorrect modelling assumption, insufficient training, or over-specification of model parameters • e.g. MLP observation non-linearity may be too general ICASSP'2001

  13. Discussion of Results • Baseline HMM fails to select reference transcription whereas HDM is successful with a WER reduction of ~ 10% • HDM represents a more parsimonious modelling paradigm compared to baseline HMM • HDM does not yield physically reliable parameters but does converge to the given observation features • HDM is possibly not identifiable in its current implementation • The N-best rescoring is not a reliable means of evaluating system performance • HDM results based on sub-optimal time-aligned transcriptions from HMM • WER results indicate the potential of the HDM paradigm for acoustic modelling ICASSP'2001

  14. Conclusion and Future Work • The HDM is a promising alternative to the current state-of-the-art HMM • parsimonious model • requires less training data and fewer iterations • easier to adapt • better generalisation capabilities • More work is required to properly evaluate the performance of the HDM • implement a lattice scoring algorithm with optimal segmentation to produce our own N-best transcriptions • More work is required to find efficient and less restricted estimation and decoding algorithms • compare EKF and EM estimation algorithms • estimation of segmentation boundaries • efficient decoding algorithms • train models on larger data-sets ICASSP'2001

  15. Conclusion and Future Work • More work is needed to confirm the proposed dynamic and observation modelling structure • More constrained non-linear mapping from hidden state to MFCC observations • Less accurate but much more efficient linear mapping from hidden state to MFCC observations (see the sketch below) • Reduce the number of parameters to be estimated to allow the system to be identified • Use known VTR resonance values as Tj • estimate Φj and Wr • Use known mapping from VTR resonances to MFCC observations • estimate Φj and Tj ICASSP'2001
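A minimal sketch of the linear alternative mentioned above, assuming a single global matrix H and offset b (both hypothetical):

    O(k) = H z(k) + b + v(k)

With a linear observation model the measurement update becomes an exact Kalman step with no Jacobian linearisation, and H and b add only n×m + n parameters (52 for the 3-dim state and 13-dim MFCC used here), at the cost of a cruder VTR-to-MFCC mapping.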
