This paper presents a comprehensive study on hyperparameter estimation for speech recognition systems using a Variational Bayesian approach. It explores the relationships between phoneme accuracy and the lower bound F, emphasizing the importance of model structures and prior distributions. The authors demonstrate that optimizing hyperparameters can significantly improve recognition performance, particularly when large datasets are used. Experimental results show promising gains in accuracy, supporting the benefits of integrating prior knowledge into recognition systems.
Hyperparameter Estimation for Speech Recognition Based on Variational Bayesian Approach
Kei Hashimoto, Heiga Zen, Yoshihiko Nankaku, Akinobu Lee and Keiichi Tokuda (Nagoya Institute of Technology)

Notation: observation vectors; hidden variables; model parameters; hyperparameters; mean vector; inverse covariance; number of dimensions; prior, posterior and predictive distributions; likelihood; evidence; variational posterior; posterior parameters; root-node mean vector and covariance matrix; leaf-node index, occupation and mean vector.

1. Introduction
• Recent speech recognition systems
  • ML (Maximum Likelihood) criterion ⇒ reduced estimation accuracy
  • MDL (Minimum Description Length) criterion ⇒ based on an asymptotic approximation
• Variational Bayesian (VB) approach
  • Higher generalization ability
  • Appropriate model structures can be selected
  • Performance depends on the hyperparameters
• Objective: estimate appropriate hyperparameters by maximizing the marginal likelihood ⇒ maximize F w.r.t. the hyperparameters

2. Bayesian Framework
• Model parameters are regarded as probabilistic variables
• Recognition is based on the posterior distributions
• Advantages
  • Prior knowledge can be integrated
  • Model structure can be selected
  • Robust classification
• Disadvantage
  • Includes integral and expectation calculations ⇒ an effective approximation technique is required
• Output probability distribution ⇒ Gaussian distribution
• Conjugate prior distribution ⇒ Gauss-Wishart distribution
• Likelihood function ⇒ proportional to a Gauss-Wishart distribution

3. Variational Bayesian Approach
• Variational Bayes [Attias; 1999]
  • Approximate the posterior distributions by a variational method
  • Define a lower bound F on the log-likelihood
  • Maximize F w.r.t. the variational posteriors
• Context clustering based on VB [Watanabe et al.; 2002]
  • Split nodes by phonetic questions Q (Yes/No)
  • ΔF = F_Y + F_N − F_P (F_P: parent node; F_Y, F_N: "yes"/"no" child nodes)
  • Stop the node split if ΔF < 0

[Figure: decision-tree context clustering of HMM states (e.g. /a/.state[2], /a/.state[4]) by phonetic questions]

4. Hyperparameter Estimation
• Conventional: use monophone HMM state statistics ⇒ maximize F at the root node
• Proposed: use the statistics of all leaf nodes ⇒ maximize F of the tree structure
• Tying structure of prior distributions
  • Consider four kinds of tying structure (all / phone / state / leaf)
  • The appropriate tying structure depends on the amount of training data
    ・ Large training data set ⇒ tie few prior distributions
    ・ Small training data set ⇒ tie many prior distributions
• Define a new hyperparameter T representing the amount of prior data

5. Experimental Conditions
• Training sets of 20,000, 2,500 and 200 sentences

6. Experimental Results
• Relationships between F and recognition accuracy
  • F and the recognition accuracy behaved similarly
  • The proposed technique gives a consistent improvement in the value of F
• If the prior distributions have a tying structure ⇒ F is good for model selection
• Otherwise ⇒ F increases monotonically as the tree size increases
• The VB clustering with an appropriate prior distribution improves the recognition performance

[Figure: the value of F and phoneme accuracy (%) as functions of the hyperparameter T (0–10); 20,000 sentences]
[Figure: phoneme accuracy (%) vs. tree size (%) for 20,000, 2,500 and 200 training sentences]
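The poster's node-splitting gain ΔF = F_Y + F_N − F_P (parent P, yes/no children Y and N), with splitting stopped once ΔF turns negative, can be sketched with a toy per-node score. Everything below is hypothetical: `node_score` is a stand-in (Gaussian log-likelihood minus a fixed per-node penalty), not the variational lower bound, and numeric threshold questions replace real phonetic questions; only the greedy control flow mirrors the clustering procedure.

```python
import math

def node_score(xs, penalty=4.0):
    # Hypothetical stand-in for the per-node value of F:
    # Gaussian log-likelihood of the node's data minus a fixed
    # penalty (the real F would come from the VB lower bound).
    n = len(xs)
    if n < 2:
        return -penalty
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n + 1e-6
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    return loglik - penalty

def split_node(xs, questions):
    """Greedily split one node: pick the question maximizing
    dF = F_yes + F_no - F_parent; keep the node a leaf if dF < 0."""
    f_parent = node_score(xs)
    best = None
    for q in questions:
        yes = [x for x in xs if x < q]
        no = [x for x in xs if x >= q]
        d_f = node_score(yes) + node_score(no) - f_parent
        if best is None or d_f > best[0]:
            best = (d_f, q, yes, no)
    d_f, q, yes, no = best
    if d_f < 0:          # stop criterion from the poster
        return None      # node stays a leaf
    return q, yes, no

# Two clear clusters: the split is accepted.
print(split_node([0.1, 0.2, 0.15, 5.0, 5.2, 4.9], questions=[1.0, 3.0, 6.0]))
```

Running the same routine on homogeneous data (e.g. `[0.1, 0.2, 0.3, 0.4]`) makes every candidate ΔF negative, so the node is kept as a leaf.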
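Estimating hyperparameters by maximizing the marginal likelihood (the quantity that F lower-bounds) can be illustrated in one dimension, where the Gauss-Wishart prior reduces to a Normal-Gamma prior and the evidence has a closed form. The data, the grid, and all prior values below are invented for illustration; the prior mean (0.5) is deliberately a little off the data mean, so an intermediate prior-data weight wins, which is the behaviour the poster's hyperparameter T is meant to capture.

```python
import math

def log_evidence(xs, mu0, kappa0, alpha0, beta0):
    """Closed-form log marginal likelihood of 1-D data under a
    Normal-Gamma prior (scalar analogue of the Gauss-Wishart)."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0 + 0.5 * ss
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    return (math.lgamma(alpha_n) - math.lgamma(alpha0)
            + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
            + 0.5 * (math.log(kappa0) - math.log(kappa_n))
            - 0.5 * n * math.log(2.0 * math.pi))

# Treat kappa0 as the "amount of prior data" and pick the grid value
# with the highest evidence (a grid-search stand-in for maximizing F
# w.r.t. the hyperparameters).
data = [0.9, 1.1, 1.0, 1.2, 0.8]
grid = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
best_T = max(grid, key=lambda k: log_evidence(data, mu0=0.5, kappa0=k,
                                              alpha0=2.0, beta0=0.5))
print(best_T)  # an interior grid point, not an extreme
```

The trade-off is visible in the formula: a larger `kappa0` raises the `log(kappa0) - log(kappa_n)` term but inflates `beta_n` when the prior mean disagrees with the data, so the evidence peaks at a moderate prior weight.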
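The "robust classification" advantage of the Bayesian framework comes from scoring with the predictive distribution rather than a point estimate: integrating the conjugate posterior over the parameters yields a Student-t density, which penalizes outliers far less than a plug-in Gaussian. Again a 1-D Normal-Gamma sketch with invented numbers, not the paper's multivariate setup:

```python
import math

def posterior_params(xs, mu0, kappa0, alpha0, beta0):
    # Conjugate Normal-Gamma update (scalar analogue of Gauss-Wishart).
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def log_predictive(x, mu_n, kappa_n, alpha_n, beta_n):
    """Posterior predictive log density: a Student-t with 2*alpha_n
    degrees of freedom, heavier-tailed than any plug-in Gaussian."""
    nu = 2.0 * alpha_n
    s2 = beta_n * (kappa_n + 1.0) / (alpha_n * kappa_n)   # squared scale
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi * s2)
            - (nu + 1.0) / 2.0 * math.log(1.0 + (x - mu_n) ** 2 / (nu * s2)))

data = [0.9, 1.1, 1.0, 1.2, 0.8]
post = posterior_params(data, mu0=1.0, kappa0=1.0, alpha0=2.0, beta0=0.5)

# Plug-in Gaussian with the ML mean/variance, for comparison.
m = sum(data) / len(data)
v = sum((x - m) ** 2 for x in data) / len(data)
gauss = -0.5 * math.log(2.0 * math.pi * v) - (10.0 - m) ** 2 / (2.0 * v)

print(log_predictive(10.0, *post), gauss)
```

At the outlier x = 10 the plug-in Gaussian log density collapses (roughly −2000 here) while the Student-t predictive stays on the order of −20, which is why predictive scoring classifies more robustly.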