  1. An EKF-based algorithm for learning statistical hidden dynamic model parameters for phonetic recognition Roberto Togneri University of Western Australia Li Deng Microsoft Research, Redmond ICASSP'2001

  2. Contents • The Hidden Dynamic Model (HDM) • Parameter Estimation by EM • Parameter Estimation by EKF • Comparison between EM and EKF • Phone Recognition Evaluations • Recognition Results • Model Convergence Results • Discussion of Results • Conclusion and Future Work ICASSP'2001

  3. Hidden Dynamic Model (HDM) • Target-directed, vocal tract resonance (VTR) state dynamics • Static, non-linear mapping to MFCC • Switching state-space parameters • (Tj, Φj) parameters switch when crossing into a new phone dynamic regime, j • Continuity of dynamic state, z(k) • z(k) is continuous across phone regimes • MLP non-linear mapping, h(.) • h(.) is a 3-layer MLP with z(k) on the input layer and MFCC observations, O(k), on the output layer • hyperbolic tangent activation function • Phone segmentation assumed known ICASSP'2001
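The equation images from this slide are not in the transcript; the following is a reconstruction from the bullet descriptions, using the standard target-directed HDM form, with w(k) and v(k) assumed to be zero-mean Gaussian noises:

    z(k+1) = Φj z(k) + (I − Φj) Tj + w(k)    (target-directed VTR dynamics in regime j)
    O(k)   = h(z(k)) + v(k)                  (static MLP mapping from VTR state to MFCC)

Assuming the diagonal entries of Φj lie in (0, 1), z(k) decays towards the target Tj for as long as regime j is active, while remaining continuous at regime switches.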

  4. Parameter estimation by EM • E-step • Use the EKF to provide estimates of the hidden dynamic state sequence, Z, given the known observations, O, and the current parameter estimates (Tj, Φj) • M-step • Maximise the Q-function with respect to (Tj, Φj) • Solution of non-linear, high-order equations • Use a generalised form of EM • Gradient descent or Newton-Raphson • Backprop algorithm for h(.) MLP weights • After each EM iteration use the back-propagation algorithm to estimate the MLP weights • smoothed EKF estimates of Z as input • given observations, O, as output ICASSP'2001
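A minimal sketch of one generalised-EM pass as described above; run_ekf_smoother, grad_Q and train_mlp_backprop are hypothetical helper names, not routines from the paper:

    # One generalised-EM iteration for the HDM (sketch, hypothetical helpers).
    def gem_iteration(O, params, mlp, segmentation, lr=1e-3, n_grad_steps=10):
        # E-step: EKF smoother estimates E[z(k) | O] under the current
        # (Tj, Phij) parameters and MLP weights.
        Z = run_ekf_smoother(O, params, mlp, segmentation)

        # Generalised M-step: gradient ascent on the Q-function, since the
        # stationarity equations are non-linear and high-order.
        for _ in range(n_grad_steps):
            for j, p in params.items():
                g_target, g_phi = grad_Q(Z, O, p, j, segmentation)
                p.target += lr * g_target
                p.phi += lr * g_phi

        # MLP update: back-propagation with the smoothed states as inputs
        # and the observed MFCC vectors as targets.
        mlp = train_mlp_backprop(mlp, inputs=Z, targets=O)
        return params, mlp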

  5. Parameter Estimation by EKF • ADDENDUM to paper • Paper only covers EKF estimation of (Tj, Φj), but implemented version also estimates MLP weights, Wr, as described here • Augmented form of state vector • Augmented state equation • Observation equation • Nonlinear mapping hr(.) only depends on z(k) and Wr(k) ICASSP'2001
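The augmented-state equations on this slide were images and are missing from the transcript; the following reconstruction fills them in from the bullet descriptions, stacking the per-regime parameters into the state:

    x(k) = [ z(k); Tj(k); Φj(k); Wr(k) ]                        (augmented state vector)
    z(k+1)  = Φj(k) z(k) + (I − Φj(k)) Tj(k) + w(k)             (dynamic part)
    Tj(k+1) = Tj(k),  Φj(k+1) = Φj(k),  Wr(k+1) = Wr(k)         (parameters held constant)
    O(k)    = hr(z(k), Wr(k)) + v(k)                            (observation equation)

The EKF then estimates the state and the parameters jointly; consistent with slide 6, the process-noise entries for the parameter rows are set to zero.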

  6. Parameter Estimation by EKF • State equation Jacobian matrix • m-dim z(k), p-dim Wr(k), n-dim O(k) • Observation equation Jacobian matrix • Initialisation • Parameter vector, x(0|0) • State error covariance matrix, P(0|0) • Important to control convergence • State noise covariance, Q(k) • Set to zero for parameter equations • Observation noise covariance, R(k) ICASSP'2001
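A compact sketch of one EKF recursion over the augmented state, with the two Jacobians from this slide supplied as callables; f, F_jac, h and H_jac are hypothetical names for the state equation, its Jacobian, the observation equation and its Jacobian:

    import numpy as np

    def ekf_step(x, P, O_k, f, F_jac, h, H_jac, Q, R):
        # Time update: propagate the augmented state and error covariance.
        x_pred = f(x)                        # state equation
        F = F_jac(x)                         # state-equation Jacobian
        P_pred = F @ P @ F.T + Q             # Q is zero in the parameter rows

        # Measurement update: correct with the observed MFCC vector O(k).
        H = H_jac(x_pred)                    # observation-equation Jacobian
        S = H @ P_pred @ H.T + R             # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
        x_new = x_pred + K @ (O_k - h(x_pred))
        P_new = (np.eye(len(x)) - K @ H) @ P_pred
        return x_new, P_new

The sensitivity to P(0|0) noted on the slide shows up in the parameter blocks of P: set too small, the parameter estimates barely move; set too large, early poorly-linearised updates can diverge.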

  7. Comparison between EM and EKF • Cons of EM algorithm • Requires additional back-propagation algorithm to estimate MLP weights • Convergence problems in M-step due to non-linear equations • Slower rate of convergence • Cons of EKF algorithm • Sensitive to initial conditions, especially initialisation of P(0|0) and Q • Computationally expensive for large augmented state vectors ICASSP'2001

  8. Phone Recognition Evaluations • N-best rescoring • Use the baseline HMM to provide time-aligned 5-best and 100-best transcriptions • Optionally include the reference transcription • 100-best, 100-best+ref, 5-best, 5-best+ref • Calculate the HDM log-likelihood score across all transcriptions and select the highest-scoring transcription to calculate the WER of the HDM • Perform HMM forced alignment across all transcriptions and select the highest-scoring transcription to calculate the WER of the HMM • Baseline HMM • Context-dependent phone HMM • 3-state, left-right triphone model • cross-word triphone network • 39-dim (13+13+13) MFCC observation vectors • HTK v2.2 software • Trained on all TIMIT training data • Tested on TIMIT dr8 testing data ICASSP'2001
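A minimal sketch of the rescoring protocol described above; hdm_loglik and hmm_forced_alignment_score are hypothetical stand-ins for the actual scoring routines:

    # N-best rescoring (sketch): each model picks its own best transcription.
    def rescore(nbest, ref=None):
        # Candidate list: the HMM's time-aligned N-best, optionally + reference.
        candidates = list(nbest) + ([ref] if ref is not None else [])
        hdm_pick = max(candidates, key=hdm_loglik)
        hmm_pick = max(candidates, key=hmm_forced_alignment_score)
        # WER is then computed from each model's pick against the reference.
        return hdm_pick, hmm_pick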

  9. Phone Recognition Evaluations • Evaluation HDM • 3-dim VTR state vector • 13-dim observation MFCC vector • Per-phone model j • 3-dim target, Tj • 3-dim “diagonal” time-constant, Φj • HDMm variant • one 3-12-13 MLP per phone model • 42 phone models • HDMc variant • one 3-16-13 MLP per broad class • 3 broad classes: Silence, Voiced, Unvoiced • Trained on TIMIT dr8 training data • Tested on TIMIT dr8 testing data • Differences in training data • baseline HMM performance superior when using all TIMIT data • HDM computational requirements limit training data to the dr8 subset ICASSP'2001
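A rough parameter count from the configurations above (my arithmetic, assuming one bias unit per MLP layer, not figures from the slides): each 3-12-13 MLP has (3+1)×12 + (12+1)×13 = 217 weights, so HDMm carries 42 × (217 + 6) ≈ 9.4k parameters including the six (Tj, Φj) values per phone; HDMc's three 3-16-13 MLPs have (3+1)×16 + (16+1)×13 = 285 weights each, giving 3 × 285 + 42 × 6 ≈ 1.1k parameters. Counts of this order lie behind the parsimony ratios on slide 12.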

  10. Recognition Results • WER results • Both HMM and HDM perform little better than chance when presented with the N-best list • phone recognition is a difficult problem • HDM performance improves significantly when presented with the N-best+ref list • the HDM is able to select the reference transcription, whereas the HMM is not ICASSP'2001

  11. Model Convergence Results • Generative properties of HDM • Compare MFCC acoustic feature vector with generated outputs from HDMm and HDMc • HDMm and HDMc convergence to observed features is evident ICASSP'2001

  12. Modelling Results • Parsimony of HDM • HDM is a more structured modelling paradigm • HDMm parameters = 0.015 x HMM parameters • HDMc parameters = 0.0014 x HMM parameters • Identification of HDM parameters • The estimated (Tj, Φj) do not appear to have converged to the expected values • The estimated Tj do not necessarily correspond to the measured VTRs for phone j • Problems due to incorrect modelling assumption, insufficient training, or over-specification of model parameters • e.g. MLP observation non-linearity may be too general ICASSP'2001

  13. Discussion of Results • Baseline HMM fails to select reference transcription whereas HDM is successful with a WER reduction of ~ 10% • HDM represents a more parsimonious modelling paradigm compared to baseline HMM • HDM does not yield physically reliable parameters but does converge to the given observation features • HDM is possibly not identifiable in its current implementation • The N-best rescoring is not a reliable means of evaluating system performance • HDM results based on sub-optimal time-aligned transcriptions from HMM • WER results indicate the potential of the HDM paradigm for acoustic modelling ICASSP'2001

  14. Conclusion and Future Work • The HDM is a promising alternative to the current state-of-the-art HMM • parsimonious model • requires less training data and fewer iterations • easier to adapt • better generalisation capabilities • More work is required to properly evaluate the performance of the HDM • implement a lattice scoring algorithm with optimal segmentation to produce our own N-best transcriptions • More work is required to find efficient and less restricted estimation and decoding algorithms • compare EKF and EM estimation algorithms • estimation of segmentation boundaries • efficient decoding algorithms • train models on larger data-sets ICASSP'2001

  15. Conclusion and Future Work • More work is needed to confirm the proposed dynamic and observation modelling structure • More constrained non-linear mapping from hidden state to MFCC observations • Less accurate but much more efficient linear mapping from hidden state to MFCC observations (see the sketch below) • Reduce the number of parameters to be estimated to allow the system to be identified • Use known VTR resonance values as Tj • estimate Φj and Wr • Use known mapping from VTR resonances to MFCC observations • estimate Φj and Tj ICASSP'2001
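A minimal sketch of the linear alternative mentioned above, assuming a single global matrix H and offset b (both hypothetical):

    O(k) = H z(k) + b + v(k)

With a linear observation model the measurement update becomes an exact Kalman step with no Jacobian linearisation, and H and b add only n×m + n parameters (52 for the 3-dim state and 13-dim MFCC used here), at the cost of a cruder VTR-to-MFCC mapping.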
