
Mixture Models



1. From the EM Algorithm to the CM-EM Algorithm for Global Convergence of Mixture Models
Chenguang Lu, lcguang@foxmail.com, 2018-11-10
Homepage: http://survivor99.com/ and http://www.survivor99.com/lcg/english/
This presentation may be downloaded from http://survivor99.com/lcg/CM/CM4mix.ppt

2. Mixture Models
• Sampling distribution: P(X) = ∑_j P*(y_j) P(X|θ_j*)
• Predicted distribution: P_θ(X), determined by θ = (μ, σ) and P(Y)
• Goal: make the relative entropy (KL divergence) between P(X) and P_θ(X) approach zero by iteration. At the start of the iteration, P_θ(X) ≠ P(X); at the end, P_θ(X) ≈ P(X) (a numeric sketch follows).
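To make these two distributions concrete, here is a minimal Python sketch (not from the slides; the grid and all parameter values are hypothetical) that builds a sampling mixture P(X) and a predicted mixture P_θ(X) on a discretized axis and evaluates the KL divergence H(P||P_θ) that the iterations are meant to drive toward zero.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical discretization of X and mixture parameters (not from the slides).
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def mixture(priors, mus, sigmas):
    """Discretized mixture density P(X) = sum_j P(y_j) P(X|theta_j)."""
    p = sum(w * norm.pdf(x, m, s) for w, m, s in zip(priors, mus, sigmas))
    return p / (p.sum() * dx)          # renormalize on the grid

# Sampling distribution P(X) and predicted distribution P_theta(X).
P_true = mixture([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
P_pred = mixture([0.7, 0.3], [-1.0, 3.0], [2.0, 1.5])

def kl(p, q):
    """KL divergence H(p||q) approximated on the grid, in bits."""
    return float(np.sum(p * np.log2(p / q)) * dx)

print("H(P||P_theta) =", kl(P_true, P_pred))   # should shrink toward 0 as P_theta -> P
```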

3. The EM Algorithm for Mixture Models
• The popular EM algorithm and its convergence proof.
• The log-likelihood is a negative general entropy; Q is a negative general joint entropy (in short, a negative entropy): Q = -N·H(X,Y|θ).
• E-step: compute P(y_j|x_i, θ) and put it into Q.
• M-step: maximize Q (a minimal implementation sketch follows the list).
• The popular convergence proof, based on Jensen's inequality:
1) increasing Q can maximize log P(X|θ);
2) Q increases in every M-step and does not decrease in every E-step.
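For reference, the sketch below implements the textbook E-step and M-step for a two-component one-dimensional Gaussian mixture; the sample and the starting values are hypothetical, and this is plain EM, not the author's modified algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample drawn from a true two-component mixture.
X = np.concatenate([rng.normal(-2, 1, 600), rng.normal(2, 1, 400)])

# Hypothetical starting parameters.
w  = np.array([0.5, 0.5])        # P(y_j)
mu = np.array([-1.0, 1.0])
sd = np.array([2.0, 2.0])

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for step in range(100):
    # E-step: responsibilities P(y_j | x_i, theta).
    lik = np.stack([w[j] * gauss(X, mu[j], sd[j]) for j in range(2)], axis=1)
    resp = lik / lik.sum(axis=1, keepdims=True)

    # M-step: maximize Q over w, mu, sd.
    Nj = resp.sum(axis=0)
    w  = Nj / len(X)
    mu = (resp * X[:, None]).sum(axis=0) / Nj
    sd = np.sqrt((resp * (X[:, None] - mu) ** 2).sum(axis=0) / Nj)

print("P(Y):", w, "mu:", mu, "sigma:", sd)
```

Note that plain EM updates both P(Y) and the Gaussian parameters inside the M-step; this is the behaviour that the CM-EM algorithm later splits into its E2-step and MG-step.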

4. The First Problem with the Convergence Proof of the EM Algorithm: Q May Be Greater than Q*
• Assume P(y_1) = P(y_2) = 0.5; μ_1 = μ_1*, μ_2 = μ_2*; σ_1 = σ_2 = σ.
• (Figure: log P(X^N, Y|θ) curves, showing Q = -6.75N above the target Q* = -6.89N.)
• [1] Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977).
• [2] Wu, C. F. J.: On the Convergence Properties of the EM Algorithm. Annals of Statistics 11, 95–103 (1983).
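To show how such a comparison can be checked numerically, here is a hedged sketch that evaluates Q/N = ∑_i ∑_j P(x_i) P(y_j|x_i,θ) log[P(y_j) P(x_i|θ_j)] on a discretized grid for the true parameters and for a shared-σ parameter set; the grid and parameter values are hypothetical, and the slide's specific numbers (-6.75N and -6.89N) are not reproduced.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical discretization; the slide's exact example is not reproduced here.
x = np.linspace(-15, 15, 301)

def discretize(mu, sigma):
    p = norm.pdf(x, mu, sigma)
    return p / p.sum()                      # P(x_i | theta_j) as probabilities

def Q_per_N(P_x, priors, mus, sigmas):
    """Q/N = sum_i sum_j P(x_i) P(y_j|x_i,theta) log[P(y_j) P(x_i|theta_j)]."""
    comp = np.stack([w * discretize(m, s) for w, m, s in zip(priors, mus, sigmas)])
    post = comp / comp.sum(axis=0, keepdims=True)       # E-step posterior
    return float(np.sum(P_x * np.sum(post * np.log(comp), axis=0)))

# True parameters define the sampling distribution P(X) (hypothetical values).
true_w, true_mu, true_sd = [0.5, 0.5], [-3.0, 3.0], [1.0, 4.0]
P_x = sum(w * discretize(m, s) for w, m, s in zip(true_w, true_mu, true_sd))

Q_star = Q_per_N(P_x, true_w, true_mu, true_sd)             # target Q*/N
Q_try  = Q_per_N(P_x, [0.5, 0.5], [-3.0, 3.0], [2.0, 2.0])  # same means, shared sigma
print("Q*/N =", Q_star, "  Q/N =", Q_try, "  Q > Q*:", Q_try > Q_star)
```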

5. The Second Problem with the EM Algorithm's Convergence Proof
• P(Y|X) from the E-step is not a proper Shannon channel, because the new component distributions P(X|θ_j^{+1}) = P(X)P(X|θ_j)/P_θ(X) are not normalized.
• For example, it is possible that ∑_i P(x_i|θ_1^{+1}) > 1.6 while ∑_i P(x_i|θ_0^{+1}) < 0.4.
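The claim can be checked directly with the slide's own expression; in the sketch below (grid and mixture ratios hypothetical) the quantities P(x_i|θ_j^{+1}) = P(x_i)P(x_i|θ_j)/P_θ(x_i) are formed explicitly and their sums over i are printed, showing that they need not equal 1.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-15, 15, 301)   # hypothetical discretization of X

def discretize(mu, sigma):
    p = norm.pdf(x, mu, sigma)
    return p / p.sum()

# Sampling distribution P(X) and a guessed model P_theta(X) (hypothetical values).
P_x     = 0.5 * discretize(-3, 1) + 0.5 * discretize(3, 1)
w       = np.array([0.8, 0.2])                                 # guessed P(Y)
comp    = np.stack([discretize(-3, 1), discretize(3, 1)])      # P(x_i|theta_j)
P_theta = (w[:, None] * comp).sum(axis=0)                      # P_theta(x_i)

# "New" component distributions P(x_i|theta_j^{+1}) = P(x_i) P(x_i|theta_j) / P_theta(x_i).
new_cond = P_x * comp / P_theta
print("sum_i P(x_i|theta_0^{+1}) =", new_cond[0].sum())   # below 1 for these values
print("sum_i P(x_i|theta_1^{+1}) =", new_cond[1].sum())   # above 1 for these values
```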

6. The CM-EM Algorithm for Mixture Models
• Basic idea: minimize R - G = I(X;Y) - I(X;θ), since minimizing R - G is equivalent to minimizing H(P||P_θ).
• E1-step: the same as the E-step of EM.
• E2-step: modify P(Y) by replacing it with P^{+1}(Y), where P^{+1}(y_j) = ∑_i P(x_i)P(y_j|x_i); repeat until H(Y^{+1}||Y) ≈ 0, i.e., P^{+1}(Y) ≈ P(Y).
• MG-step: maximize the semantic mutual information G = I(X;θ); for Gaussian distributions there are closed-form parameter updates (a sketch follows below).
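The sketch promised above is a loose reconstruction from these bullets, not the author's code: the E1-step computes the usual posteriors, the E2-step repeatedly replaces P(Y) with P^{+1}(y_j) = ∑_i P(x_i)P(y_j|x_i) until it stops changing, and the MG-step re-estimates the Gaussian parameters with responsibility-weighted means and variances (an assumed closed form); the grid, starting values, and thresholds are hypothetical.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-15, 15, 601)   # hypothetical discretization of X

def discretize(mu, sigma):
    p = norm.pdf(x, mu, sigma)
    return p / p.sum()

# Sampling distribution P(X) from hypothetical true parameters.
P_x = 0.5 * discretize(-3, 1) + 0.5 * discretize(3, 2)

# Hypothetical starting parameters.
Py = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sd = np.array([2.0, 2.0])

for it in range(50):
    comp = np.stack([discretize(m, s) for m, s in zip(mu, sd)])   # P(x_i|theta_j)

    # E1-step (= E-step of EM): posteriors P(y_j|x_i).
    joint = Py[:, None] * comp
    post = joint / joint.sum(axis=0, keepdims=True)

    # E2-step: replace P(Y) with P^{+1}(y_j) = sum_i P(x_i) P(y_j|x_i),
    # repeating until P(Y) stops changing (H(Y^{+1}||Y) ~ 0).
    for _ in range(20):
        Py_new = (post * P_x).sum(axis=1)
        done = np.abs(Py_new - Py).max() < 1e-8
        Py = Py_new
        joint = Py[:, None] * comp
        post = joint / joint.sum(axis=0, keepdims=True)
        if done:
            break

    # MG-step: maximize G over mu, sd with P(Y) fixed
    # (assumed closed form: responsibility-weighted means and variances).
    wgt = post * P_x                      # P(x_i) P(y_j|x_i)
    Nj = wgt.sum(axis=1)
    mu = (wgt * x).sum(axis=1) / Nj
    sd = np.sqrt((wgt * (x - mu[:, None]) ** 2).sum(axis=1) / Nj)

print("P(Y):", Py, "mu:", mu, "sigma:", sd)
```

The division of labor mirrors the comparison on the next slide: the E2-step handles P(Y) while the MG-step handles the Gaussian parameters, whereas plain EM updates both inside its M-step.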

7. Comparing the CM-EM with the EM Algorithm for Mixture Models
• Writing the Q of EM in cross-entropies: maximizing Q = minimizing H(X|θ) and H_θ(Y).
• The CM-EM does not minimize H_θ(Y); it modifies P(Y) so that P^{+1}(Y) = P(Y), or H_θ(Y) = 0.
• Relationship: the E-step of EM = the E1-step of CM-EM; the M-step of EM corresponds to the (E2-step + MG-step) of CM-EM.

8. Comparing the CM-EM and MM Algorithms
• Neal and Hinton define F = Q + N·H(Y) = -N·H(X,Y|θ) + N·H(Y) ≈ -N·H(X|θ), then maximize F in both the M-step and the E-step.
• The CM-EM maximizes G = H(X) - H(X|θ) in the MG-step, so the MG-step is similar to the M-step of the MM algorithm.
• Maximizing F is similar to minimizing H(X|θ) or maximizing G.
• If we replace H(Y) with H_θ(Y) in F, then the M-step of MM is the same as the MG-step.
• However, the E2-step does not maximize G; it minimizes H(Y^{+1}||Y).
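As a small numeric check of these identities (all distributions hypothetical, and H(X|θ) taken as the cross conditional entropy -∑_i ∑_j P(x_i)P(y_j|x_i) log P(x_i|θ_j), which is an assumption consistent with the slides), the sketch below computes Q/N, F/N = Q/N + H(Y), -H(X|θ), and G = H(X) - H(X|θ) so that the approximation F ≈ -N·H(X|θ) can be inspected; N only scales the quantities, so per-sample values are used.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-15, 15, 301)   # hypothetical grid for X

def discretize(mu, sigma):
    p = norm.pdf(x, mu, sigma)
    return p / p.sum()

# Hypothetical sampling distribution and model parameters.
P_x  = 0.5 * discretize(-3, 1) + 0.5 * discretize(3, 2)
Py   = np.array([0.6, 0.4])
comp = np.stack([discretize(-2, 1.5), discretize(2, 1.5)])   # P(x_i|theta_j)

joint = Py[:, None] * comp
post  = joint / joint.sum(axis=0, keepdims=True)             # P(y_j|x_i)
w     = post * P_x                                           # P(x_i) P(y_j|x_i)

Q_per_N   = np.sum(w * np.log(joint))                        # Q/N = -H(X,Y|theta)
H_Y       = -np.sum(Py * np.log(Py))                         # H(Y)
H_X_theta = -np.sum(w * np.log(comp))                        # H(X|theta), cross conditional entropy
H_X       = -np.sum(P_x * np.log(P_x))                       # H(X)

print("F/N = Q/N + H(Y)      =", Q_per_N + H_Y)
print("-H(X|theta)           =", -H_X_theta)                 # F/N should be close to this
print("G = H(X) - H(X|theta) =", H_X - H_X_theta)
```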

9. An Iterative Example of Mixture Models with R < R* or Q < Q*
• The number of iterations is 5.
• Both the CM's G (CM_G) and the EM's Q (EM_Q) increase monotonically; H(Q||P) = R(G) - G → 0.

10. A Counterexample with R > R* or Q > Q* against the EM Convergence Proof
• True, starting, and ending parameters are given in the Excel demo files, which can be downloaded from http://survivor99.com/lcg/cc-iteration.zip.
• The number of iterations is 5.

11. Illustrating the Convergence of the CM-EM Algorithm for R > R* and R < R*
• A counterexample against the EM proof: Q is decreasing.
• The central idea of the CM is:
1) finding the point G ≈ R on the two-dimensional R-G plane, while also looking for R → R* (the EM algorithm neglects R → R*);
2) minimizing H(Q||P) = R(G) - G (similar to a min-max method).
• Two examples: one starting with R < R* (or Q < Q*) and one starting with R > R* (or Q > Q*); both converge to the target.

12. Comparing the Iteration Numbers of the CM-EM, EM, and MM Algorithms
• For the same example used by Neal and Hinton:
• the EM algorithm needs 36 iterations;
• the MM algorithm (Neal and Hinton) needs 18 iterations;
• the CM-EM algorithm needs only 9 iterations.
• References:
1. Lu, Chenguang: From the EM Algorithm to the CM-EM Algorithm for Global Convergence of Mixture Models, http://arxiv.org/a/lu_c_3.
2. Neal, Radford; Hinton, Geoffrey: ftp://ftp.cs.toronto.edu/pub/radford/emk.pdf

13. Fundamentals for the Convergence Proof 1: Semantic Information Is Defined with Log-normalized-likelihood
• Semantic information conveyed by y_j about x_i: I(x_i;θ_j).
• Averaging I(x_i;θ_j) gives the semantic Kullback-Leibler information I(X;θ_j).
• Averaging I(X;θ_j) gives the semantic mutual information I(X;θ) = G.
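Because the formulas themselves were on images, the sketch below assumes the log-normalized-likelihood form suggested by the slide title, I(x_i;θ_j) = log[P(x_i|θ_j)/P(x_i)], averages it over P(x_i|y_j) to get a semantic Kullback-Leibler information I(X;θ_j), and averages again over Y to get G = I(X;θ); both the form and the averaging weights are reconstructions, not quotations, and the distributions are hypothetical.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-15, 15, 301)   # hypothetical grid for X

def discretize(mu, sigma):
    p = norm.pdf(x, mu, sigma)
    return p / p.sum()

P_x  = 0.5 * discretize(-3, 1) + 0.5 * discretize(3, 2)      # P(X), hypothetical
Py   = np.array([0.5, 0.5])
comp = np.stack([discretize(-3, 1.2), discretize(3, 1.8)])   # P(x_i|theta_j)

joint = Py[:, None] * comp
post  = joint / joint.sum(axis=0, keepdims=True)             # P(y_j|x_i)

# Assumed log-normalized-likelihood: I(x_i;theta_j) = log[P(x_i|theta_j)/P(x_i)].
I_xy = np.log(comp / P_x)

# Semantic KL information I(X;theta_j): average over P(x_i|y_j).
P_x_given_y = post * P_x / (post * P_x).sum(axis=1, keepdims=True)
I_X_theta_j = (P_x_given_y * I_xy).sum(axis=1)

# Semantic mutual information G = I(X;theta): average I(X;theta_j) over Y.
Py_post = (post * P_x).sum(axis=1)
G = float((Py_post * I_X_theta_j).sum())

print("I(X;theta_j) =", I_X_theta_j, "  G =", G)
```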

14. From Shannon's Channel to the Semantic Channel
• The Shannon channel consists of transition probability functions P(y_j|X) (y_j fixed, X varying).
• The semantic channel consists of truth functions T(θ_j|X).
• The semantic mutual information formula: G = I(X;θ), as defined on slide 13.
• We may fix one channel and optimize the other, alternately.

15. Fundamentals for the Convergence Proof 2: From the R(D) Function to the R(G) Function
• Shannon's information rate-distortion function R(D): the minimum rate R for a given distortion D.
• Replacing D with G, we obtain the R(G) function.
• All R(G) functions are bowl-shaped.
• (Figure: an R(G) curve with its matching point.)

16. Fundamentals for the Convergence Proof 2: Two Kinds of Mutual Matching
• 1. For maximum mutual information classifications: matching for maximum R and G.
• 2. For mixture models: matching for minimum R - G.
• (Figure: the matching point on the R(G) curve.)

17. Semantic Channel Matches Shannon's Channel
• Optimize the truth function and the semantic channel to match the Shannon channel.
• When the sample is large enough, the optimized truth function is proportional to the transition probability function: T*(θ_j|X) ∝ P(y_j|X).
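A hedged sketch of this matching step: it assumes, as one natural convention not stated on this slide, that truth functions are scaled to a maximum of 1, so the optimized truth function is taken as T*(θ_j|X) = P(y_j|X)/max_X P(y_j|X), which is proportional to the transition probability function as the bullet says; the channel itself is hypothetical.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-15, 15, 301)   # hypothetical grid for X

def discretize(mu, sigma):
    p = norm.pdf(x, mu, sigma)
    return p / p.sum()

# A hypothetical Shannon channel P(y_j|x_i) induced by a two-component mixture.
Py   = np.array([0.5, 0.5])
comp = np.stack([discretize(-3, 1), discretize(3, 2)])
joint = Py[:, None] * comp
P_y_given_x = joint / joint.sum(axis=0, keepdims=True)

# Matching: take the optimized truth function proportional to the transition
# probability function, scaled so that its maximum is 1 (an assumed convention).
T_star = P_y_given_x / P_y_given_x.max(axis=1, keepdims=True)

print("max of each truth function:", T_star.max(axis=1))     # all 1.0
print("T*(theta_0|x) at a few x:", T_star[0, ::75])
```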

18. Shannon's Channel Matches the Semantic Channel
• For maximum mutual information classifications: use the classifier.
• For mixture models: use the E1-step and E2-step of CM-EM; repeat until H(Y^{+1}||Y) ≈ 0, i.e., P^{+1}(Y) ≈ P(Y).

19. The Convergence Proof of CM-EM I: Basic Formulas
• Semantic mutual information: G = I(X;θ).
• Shannon mutual information: R = I(X;Y).
• Main formula for mixture models: P^{+1}(y_j) = ∑_i P(x_i)P(y_j)P(x_i|θ_j)/P_θ(x_i) = ∑_i P(x_i)P(y_j|x_i).

20. The Convergence Proof of CM-EM II: Using the Variational Method
• The convergence proof: proving that P_θ(X) converges to P(X) is equivalent to proving that H(P||P_θ) converges to 0. Since the E2-step makes R = R'' and H(Y^{+1}||Y) = 0, we only need to prove that every step after the start step minimizes R - G.
• Because the MG-step maximizes G without changing R, the remaining work is to prove that the E1-step and E2-step minimize R - G.
• Fortunately, we can strictly prove this by the variational and iterative methods that Shannon (1959) and others (Berger, 1971; Zhou, 1983) used for analyzing the rate-distortion function R(D).

21. The CM Algorithm: Using Optimized Mixture Models for Maximum Mutual Information Classifications
• Goal: find the best dividing points. First assume a set of dividing points z' to obtain P(z_j|Y).
• Matching I: obtain T*(θ_{z_j}|Y) and the information lines I(Y;θ_{z_j}|X).
• Matching II: use the classifier.
• If H(P||P_θ) < 0.001, then end; else go to Matching I.

22. Illustrating the Convergence of the CM Algorithm for Maximum Mutual Information Classifications with the R(G) Function
Iterative steps and convergence reasons:
• 1) For each Shannon channel, there is a matched semantic channel that maximizes the average log-likelihood.
• 2) For given P(X) and a given semantic channel, we can find a better Shannon channel.
• 3) Repeating the two steps yields the Shannon channel that maximizes the Shannon mutual information and the average log-likelihood.
• An R(G) function serves as a ladder letting R climb up and find a better semantic channel, and hence a better ladder.

23. An Example Showing the Reliability of the CM Algorithm
• A 3×3 Shannon channel is used to show reliable convergence.
• Even if a pair of bad starting points is used, the convergence is still reliable.
• Using good starting points, the number of iterations is 4.
• Using very bad starting points, the number of iterations is 11.
• (Figures: the channel at the beginning and after convergence.)

24. Summary
The CM algorithm is a new tool for statistical learning. To show its power, we use the CM-EM algorithm to resolve the problems with mixture models. In real applications, X may be multi-dimensional; however, the convergence reasons and reliability should be the same.
End. Thank you for listening! Welcome to criticize!
2017-08-26: reported at ICIS 2017 (the 2nd International Conference on Intelligence Science, Shanghai).
2018-11-09: revised for a better convergence proof.
More papers about the author's semantic information theory: http://survivor99.com/lcg/books/GIT/index.htm
