
Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson jhasegaw@uiuc.edu University of Illinois at Urbana-Champaign, USA. Lecture 9. Learning in Bayesian Networks.

Presentation Transcript


  1. Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson, jhasegaw@uiuc.edu, University of Illinois at Urbana-Champaign, USA

  2. Lecture 9. Learning in Bayesian Networks
  • Learning via Global Optimization of a Criterion
  • Maximum-likelihood learning
  • The Expectation Maximization algorithm
  • Solution for discrete variables using Lagrangian multipliers
  • General solution for continuous variables
  • Example: Gaussian PDF
  • Example: Mixture Gaussian
  • Example: Bourlard-Morgan NN-DBN Hybrid
  • Example: BDFK NN-DBN Hybrid
  • Discriminative learning criteria
  • Maximum Mutual Information
  • Minimum Classification Error

  3. What is Learning? Imagine that you are a student who needs to learn how to propagate belief in a junction tree.
  • Level 1 Learning (Rule-Based): I tell you the algorithm. You memorize it.
  • Level 2 Learning (Category Formation): You observe examples (FHMM). You memorize them. From the examples, you build a cognitive model of each of the steps (moralization, triangulation, cliques, sum-product).
  • Level 3 Learning (Performance): You try a few problems. When you fail, you optimize your understanding of all components of the cognitive model in order to minimize the probability of future failures.

  4. What is Machine Learning?
  • Level 1 Learning (Rule-Based): Programmer tells the computer how to behave. This is not usually called “machine learning.”
  • Level 2 Learning (Category Formation): The program is given a numerical model of each category (e.g., a PDF, or a geometric model). Parameters of the numerical model are adjusted in order to represent the category.
  • Level 3 Learning (Performance): All parameters in a complex system are simultaneously adjusted in order to optimize a global performance metric.

  5. Learning Criteria

  6. Optimization Methods

  7. Maximum Likelihood Learning in a Dynamic Bayesian Network
  • Given: a particular model structure
  • Given: a set of training examples for that model, (b_m, o_m), 1 ≤ m ≤ M
  • Estimate all model parameters (p_λ(b|a), p_λ(c|a), …) in order to maximize Σ_m log p(b_m, o_m | λ)
  • Recognition is Nested within Training: at each step of the training algorithm, we need to compute p(b_m, o_m, a_m, …, q_m) for every training token, using the sum-product algorithm.
  [Figure: example DBN with nodes a, b, c, d, e, f, n, o, q]
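
Restated as equations (an editorial rendering of the slide's notation, with a_m, …, q_m standing for the hidden variables that the sum-product algorithm marginalizes out):

```latex
\[
\hat{\lambda} = \arg\max_{\lambda} \sum_{m=1}^{M} \log p(b_m, o_m \mid \lambda),
\qquad
p(b_m, o_m \mid \lambda) = \sum_{a_m, \ldots, q_m} p(a_m, b_m, \ldots, o_m, q_m \mid \lambda)
\]
```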

  8. Baum’s Theorem (Baum and Eagon, Bull. Am. Math. Soc., 1967)

  9. Expectation Maximization (EM)
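
The body of this slide is not in the transcript. For reference, a standard statement of EM in the notation of slide 7 is sketched below; the superscript (t) for the iteration index is an editorial convention.

```latex
% E-step: expected complete-data log likelihood under the current parameters
\[
Q(\lambda, \lambda^{(t)}) = \sum_{m=1}^{M} \sum_{a_m,\ldots,q_m}
  p(a_m,\ldots,q_m \mid b_m, o_m, \lambda^{(t)})\,
  \log p(a_m, b_m, \ldots, o_m, q_m \mid \lambda)
\]
% M-step: re-estimate the parameters
\[
\lambda^{(t+1)} = \arg\max_{\lambda} Q(\lambda, \lambda^{(t)})
\]
% Baum's theorem (slide 8): increasing Q cannot decrease the likelihood
\[
Q(\lambda^{(t+1)}, \lambda^{(t)}) \ge Q(\lambda^{(t)}, \lambda^{(t)})
\;\Rightarrow\;
\sum_{m} \log p(b_m, o_m \mid \lambda^{(t+1)}) \ge \sum_{m} \log p(b_m, o_m \mid \lambda^{(t)})
\]
```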

  10. EM for a Discrete-Variable Bayesian Network [DBN figure: nodes a, b, c, d, e, f, n, o, q]

  11. EM for a Discrete-Variable Bayesian Network [DBN figure: nodes a, b, c, d, e, f, n, o, q]

  12. Solution: Lagrangian Method
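
The derivation itself is not transcribed. The standard outcome of the Lagrangian step (maximize Q subject to each row of a conditional probability table summing to one) is the expected-count ratio below, written in generic notation with x a child variable and pa(x) its parent; these symbols are editorial, not from the slides.

```latex
\[
\hat{p}\big(x{=}j \mid \mathrm{pa}(x){=}i\big) =
\frac{\sum_{m} P\big(x_m{=}j,\ \mathrm{pa}(x)_m{=}i \mid b_m, o_m, \lambda^{(t)}\big)}
     {\sum_{m} P\big(\mathrm{pa}(x)_m{=}i \mid b_m, o_m, \lambda^{(t)}\big)}
\]
```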

  13. The EM Algorithm for a Large Training Corpus
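
As a concrete illustration, not taken from the lecture, of how the expected-count ratios are accumulated over an entire training corpus before renormalizing, here is a minimal runnable Python sketch of EM for the simplest discrete network, a hidden variable a with one observed child b; all identifiers are illustrative.

```python
import random

K_A, K_B = 2, 3          # sizes of the hidden variable a and the observed variable b

def normalize(row):
    s = sum(row)
    return [v / s for v in row]

def em(corpus, n_iter=20, seed=0):
    rng = random.Random(seed)
    # random initial parameters: p(a) and p(b | a)
    p_a = normalize([rng.random() + 0.5 for _ in range(K_A)])
    p_b_given_a = [normalize([rng.random() + 0.5 for _ in range(K_B)])
                   for _ in range(K_A)]
    for _ in range(n_iter):
        # E-step: accumulate expected counts over every token in the corpus
        count_a = [0.0] * K_A
        count_ab = [[0.0] * K_B for _ in range(K_A)]
        for b in corpus:                 # b is the observed symbol for this token
            # posterior p(a | b); for a two-node network the "sum-product"
            # computation reduces to Bayes' rule
            joint = [p_a[i] * p_b_given_a[i][b] for i in range(K_A)]
            posterior = normalize(joint)
            for i in range(K_A):
                count_a[i] += posterior[i]
                count_ab[i][b] += posterior[i]
        # M-step: expected-count ratios (the Lagrangian solution of slide 12)
        p_a = normalize(count_a)
        p_b_given_a = [normalize(count_ab[i]) for i in range(K_A)]
    return p_a, p_b_given_a

# Toy corpus of observed symbols; EM splits it into two latent classes
corpus = [0, 0, 1, 0, 2, 2, 1, 2, 2, 0, 1, 2]
print(em(corpus))
```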

  14. EM for Continuous Observations (Liporace, IEEE Trans. Inf. Th., 1982)

  15. Solution: Lagrangian Method

  16. Example: Gaussian (Liporace, IEEE Trans. Inf. Th., 1982)
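
The slide body is not transcribed; the Gaussian re-estimation formulas this derivation is known to yield (posterior-weighted mean and covariance) are sketched below, with γ_m(i) denoting the E-step posterior of the state i that owns the Gaussian (notation editorial):

```latex
\[
\gamma_m(i) = P(q_m{=}i \mid o_m, \lambda^{(t)}), \qquad
\hat{\mu}_i = \frac{\sum_m \gamma_m(i)\, o_m}{\sum_m \gamma_m(i)}, \qquad
\hat{\Sigma}_i = \frac{\sum_m \gamma_m(i)\,(o_m - \hat{\mu}_i)(o_m - \hat{\mu}_i)^{\mathsf{T}}}{\sum_m \gamma_m(i)}
\]
```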

  17. Example: Mixture Gaussian (Juang, Levinson, and Sondhi, IEEE Trans. Inf. Th., 1986)
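
Likewise untranscribed; the usual mixture-Gaussian extension treats the component index k as one more hidden variable, so the mixture weights, like everything else, are re-estimated as expected-count ratios (notation editorial):

```latex
\[
\gamma_m(i,k) = P(q_m{=}i,\ k_m{=}k \mid o_m, \lambda^{(t)}), \qquad
\hat{c}_{ik} = \frac{\sum_m \gamma_m(i,k)}{\sum_m \sum_{k'} \gamma_m(i,k')}, \qquad
\hat{\mu}_{ik} = \frac{\sum_m \gamma_m(i,k)\, o_m}{\sum_m \gamma_m(i,k)}
\]
```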

  18. Example: Bourlard-Morgan Hybrid (Morgan and Bourlard, IEEE Sign. Proc. Magazine 1995)
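
The slide content is not in the transcript. The identity underlying the Bourlard-Morgan hybrid, which lets a network's posterior output P(q|o), divided by the class prior P(q), stand in for the HMM observation density up to a factor that does not depend on the state, is Bayes' rule rearranged:

```latex
\[
\frac{p(o_t \mid q_t{=}i)}{p(o_t)} = \frac{P(q_t{=}i \mid o_t)}{P(q_t{=}i)}
\]
```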

  19. Pseudo-Priors and Training Priors

  20. Training the Hybrid Model Using the EM Algorithm

  21. The Solution: Q Back-Propagation

  22. Merging the EM and Gradient Ascent Loops

  23. Example: BDFK Hybrid (Bengio, De Mori, Flammia, and Kompe, Spe. Comm. 1992)

  24. The Q Function for a BDFK Hybrid

  25. The EM Algorithm for a BDFK Hybrid

  26. Discriminative Learning Criteria

  27. Maximum Mutual Information

  28. Maximum Mutual Information

  29. Maximum Mutual Information

  30. Maximum Mutual Information

  31. An EM-Like Algorithm for MMI

  32. An EM-Like Algorithm for MMI

  33. MMI for Databases with Different Kinds of Transcription
  • If every word’s start and end times are labeled, then WT is the true word label, and W* is the label of the false word (or words!) with maximum modeled probability.
  • If the start and end times of individual word strings are not known, then WT is the true word sequence. W* may be computed as the best path (or paths) through a word lattice or N-best list. (Schlüter, Macherey, Müller, and Ney, Spe. Comm. 2001)
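
For reference, the MMI criterion that slides 27–32 build up to (the formula itself is not in the transcript) is the log posterior probability of the correct transcription W_T, with the denominator sum approximated by the competitors W* described above when enumerating all word sequences is impractical:

```latex
\[
F_{\mathrm{MMI}}(\lambda) = \sum_{m} \log
\frac{p_{\lambda}(O_m \mid W_T)\, P(W_T)}
     {\sum_{W} p_{\lambda}(O_m \mid W)\, P(W)}
\]
```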

  34. Minimum Classification Error (McDermott and Katagiri, Comput. Speech Lang. 1994)
  • Define empirical risk as “the number of word tokens for which the wrong HMM has higher log-likelihood than the right HMM”
  • This risk definition has two nonlinearities:
    • Zero-one loss function, u(x). Replace with a differentiable loss function, s(x).
    • Max. Replace with a “softmax” function, log(exp(a)+exp(b)+exp(c)).
  • Differentiate the result; train all HMM parameters using error backpropagation.
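
A minimal runnable Python sketch of the smoothed loss this slide describes, with the max over wrong-class scores replaced by log(Σ exp(·)) and the zero-one loss by a sigmoid; the function name, the α smoothing parameter, and the flat dictionary of per-word log-likelihoods are illustrative assumptions, not from the lecture:

```python
import math

def mce_loss(scores, correct_label, alpha=1.0):
    """Smoothed minimum-classification-error loss for one word token.

    scores: dict mapping each candidate word (HMM) to its log-likelihood.
    """
    g_correct = scores[correct_label]
    competitors = [g for label, g in scores.items() if label != correct_label]
    # "softmax" replacement for the max over wrong-class log-likelihoods:
    # log(exp(a) + exp(b) + exp(c)), as on the slide
    g_wrong = math.log(sum(math.exp(g) for g in competitors))
    # misclassification measure: positive when a wrong HMM outscores the right one
    d = g_wrong - g_correct
    # sigmoid: a differentiable stand-in for the zero-one loss u(d)
    return 1.0 / (1.0 + math.exp(-alpha * d))

# Example: the correct word "cat" narrowly beats its competitors
print(mce_loss({"cat": -10.0, "cab": -12.0, "bat": -11.0}, "cat"))
```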

  35. Summary
  • What is Machine Learning?
    • choose an optimality criterion,
    • find an algorithm that will adjust model parameters to optimize the criterion
  • Maximum Likelihood
    • Baum’s theorem: argmax E[log(p)] = argmax[p]
    • Apply directly to discrete, Gaussian, MG
    • Nest within EBP for BM and BDFK hybrids
  • Discriminative Criteria
    • Maximum Mutual Information (MMI)
    • Minimum Classification Error (MCE)
