
Channels' Matching Algorithm for Mixture Models - A Challenge to the EM Algorithm

This presentation explores the challenges of using the EM algorithm for mixture models and proposes a channels' matching (CM) algorithm as an alternative. It discusses problems with the convergence proof of the EM algorithm and introduces a semantic information measure and semantic channel matching. The algorithm is shown to be compatible with the MLE and MAP estimation methods, making it suitable for cases with variable source probabilities.



Presentation Transcript


  1. Channels' Matching Algorithm for Mixture Models - A Challenge to the EM Algorithm. 鲁晨光 Chenguang Lu, lcguang@foxmail.com. Homepage: http://survivor99.com/lcg/; http://www.survivor99.com/lcg/english/. This ppt may be downloaded from http://survivor99.com/lcg/CM/CM4mix.ppt

  2. 1. Mixture Models: Guessing Parameters • There are about 70 thousand papers with EM in their titles; see http://www.sciencedirect.com/ • True model: P*(Y) and P*(X|Y) produce P(X) = P*(y1)P*(X|y1) + P*(y2)P*(X|y2) + … • Predictive model: P(Y) and {θj} produce Q(X) = P(y1)P(X|θ1) + P(y2)P(X|θ2) + … • Gaussian distribution: P(X|θj) = K·exp[−(X − cj)²/(2dj²)] • Iterative algorithm: start from a guess of P(Y) and {(cj, dj)} and iterate until the Kullback-Leibler divergence (relative entropy) between Q(X) and P(X) approaches zero, i.e., Q(X) ≈ P(X).
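As a concrete illustration, the sketch below builds a discretized true mixture P(X) and a guessed mixture Q(X) and measures their Kullback-Leibler divergence; all component parameters and the grid are invented for illustration, not taken from the slides.

```python
# Minimal sketch: compare the true mixture P(X) with a guessed mixture Q(X)
# by their Kullback-Leibler divergence on a discretized X.
import numpy as np

x = np.linspace(-10, 10, 401)                      # discretized X

def gaussian(x, c, d):
    return np.exp(-(x - c) ** 2 / (2 * d ** 2)) / (d * np.sqrt(2 * np.pi))

def mixture(x, priors, centers, stds):
    q = sum(p * gaussian(x, c, d) for p, c, d in zip(priors, centers, stds))
    return q / q.sum()                             # normalize on the grid

P = mixture(x, [0.7, 0.3], [-2.0, 3.0], [1.0, 1.5])   # true P*(Y), P*(X|Y)
Q = mixture(x, [0.5, 0.5], [-1.0, 2.0], [1.0, 1.0])   # guessed P(Y), {theta_j}

kl = np.sum(P * np.log(P / Q))                     # KL divergence H(P||Q)
print(f"KL(P||Q) = {kl:.4f}")                      # iterate parameters until KL ~ 0
```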

  3. 2. The EM Algorithm for Mixture Models • The popular EM algorithm and its convergence proof • The log-likelihood is a negative general entropy (times N); Q = log P(X^N, Y|θ) is a negative general joint entropy (times N) • E-step: put P(yj|xi, θ) into Q • M-step: maximize Q • Convergence proof: 1) Q's increasing makes H(Q||P) → 0; 2) Q is increasing in every M-step and non-decreasing in every E-step.
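For reference, here is a minimal textbook EM loop for a one-dimensional two-component Gaussian mixture. This is the standard EM being challenged, not the author's CM algorithm; the sample data and starting values are invented for illustration.

```python
# Minimal EM sketch for a 1-D two-component Gaussian mixture (textbook EM).
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 700), rng.normal(3, 1.5, 300)])  # sample

def normal_pdf(x, c, d):
    return np.exp(-(x - c) ** 2 / (2 * d ** 2)) / (d * np.sqrt(2 * np.pi))

p = np.array([0.5, 0.5]); c = np.array([-1.0, 2.0]); d = np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibilities P(y_j | x_i, theta)
    w = p * normal_pdf(X[:, None], c, d)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: maximize Q over P(Y) and (c_j, d_j)
    p = w.mean(axis=0)
    c = (w * X[:, None]).sum(axis=0) / w.sum(axis=0)
    d = np.sqrt((w * (X[:, None] - c) ** 2).sum(axis=0) / w.sum(axis=0))

print(p, c, d)
```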

  4. 3. Problems with the Convergence Proof of the EM Algorithm • 1) There is a counterexample against the convergence proof [1, 2] (real and guessed model parameters and iterative results): for the true model, Q* = log P(X^N, Y|θ*) = −6.031N; after the first M-step, Q = log P(X^N, Y|θ) = −6.011N, which is larger than Q*. • 2) The E-step might decrease Q, as in the above example (discussed later). • [1] Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977). • [2] Wu, C. F. J.: On the Convergence Properties of the EM Algorithm. Annals of Statistics 11, 95–103 (1983). [Figure: the curve of log P(X^N, Y|θ) over iterations, with the target value −6.031N and the reached value −6.011N marked.]
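The comparison behind the counterexample can be checked numerically. Below is a hedged sketch that evaluates Q/N = Σi Σj P(xi) P(yj|xi, θ) log[P(yj) P(xi|θj)], i.e. the negative general joint entropy per sample used on these slides, for a true model and for a guessed model; the parameter values are placeholders, not the slide's actual counterexample.

```python
# Hedged sketch: evaluate Q/N = -H(X,Y|theta) for a discrete mixture model,
# so Q for a true model and a fitted model can be compared.
# Parameter values below are placeholders, not the slide's counterexample.
import numpy as np

x = np.linspace(-10, 10, 401)

def gauss(x, c, d):
    return np.exp(-(x - c) ** 2 / (2 * d ** 2)) / (d * np.sqrt(2 * np.pi))

def q_per_sample(px, priors, centers, stds):
    """Q/N = sum_ij P(x_i) P(y_j|x_i) log[P(y_j) P(x_i|theta_j)]."""
    comp = np.array([p * gauss(x, c, d) for p, c, d in zip(priors, centers, stds)]).T
    post = comp / comp.sum(axis=1, keepdims=True)        # P(y_j | x_i, theta)
    return np.sum(px[:, None] * post * np.log(comp))

# source P(X) produced by the true model
true = ([0.7, 0.3], [-2.0, 3.0], [1.0, 1.0])
px = sum(p * gauss(x, c, d) for p, c, d in zip(*true)); px /= px.sum()

print("Q*/N (true params):", q_per_sample(px, *true))
print("Q/N  (guessed)    :", q_per_sample(px, [0.5, 0.5], [-1.0, 2.0], [1.0, 1.5]))
```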

  5. 4. Channels' Matching Algorithm • The Shannon channel • The semantic channel • The semantic mutual information formula
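The formulas on this slide are not reproduced in the transcript. As a hedged illustration of how the two channels can be represented numerically, following my reading of the later slides: a Shannon channel is a row-stochastic matrix of transition probabilities P(yj|xi), while a semantic channel is a matrix of truth values T(θj|xi) in [0, 1] whose columns (truth functions) peak at 1 rather than summing to 1. The values below are invented; the column normalization anticipates the matching rule of slide 10.

```python
# Illustrative representation of a Shannon channel vs. a semantic channel
# (invented values; my reading of the slides, not a verbatim formula).
import numpy as np

px = np.array([0.2, 0.3, 0.3, 0.2])              # source P(X) over 4 values of X

# Shannon channel: rows indexed by x_i, columns by y_j; each row sums to 1.
shannon = np.array([[0.9, 0.1],
                    [0.6, 0.4],
                    [0.3, 0.7],
                    [0.1, 0.9]])

# Semantic channel: columns are truth functions T(theta_j | X), each with max 1.
semantic = shannon / shannon.max(axis=0, keepdims=True)

print(shannon.sum(axis=1))        # rows of the Shannon channel sum to 1
print(semantic.max(axis=0))       # each truth function peaks at 1
```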

  6. 5. Research History • 1989: 色觉的译码模型及其验证 (The decoding model of color vision and its verification), 光学学报 (Acta Optica Sinica), 9(2), 158–163. • 1993: 《广义信息论》 (A Generalized Information Theory), University of Science and Technology of China Press. • 1994: 《广义熵和广义互信息的编码意义》 (The coding meaning of generalized entropy and generalized mutual information), 通信学报 (Journal on Communications), 5(6), 37–44. • 1997: 《投资组合的熵理论和信息价值》 (The entropy theory of portfolios and information value), University of Science and Technology of China Press. • 1999: A generalization of Shannon's information theory (a short version of the book), Int. J. of General Systems, 28(6), 453–490. • Recently, I found this theory could be used to improve statistical learning in many aspects. See http://www.survivor99.com/lcg/books/GIT/ and http://www.survivor99.com/lcg/CM.html • Home page: http://survivor99.com/lcg/ • Blog: http://blog.sciencenet.cn/?2056

  7. 6. Truth Function and Semantic Likelihood Function • Using the membership function m_Aj(X) as the truth function of a hypothesis yj = "X is in Aj": T(θj|X) = m_Aj(X), where θj = Aj (a fuzzy set) serves as a sub-model • Using the truth function T(θj|X) and the source P(X) to produce the semantic likelihood function • Viewing the semantic likelihood function through two GPS examples (the most probable position).
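A minimal sketch of how a truth function and the source P(X) can produce a semantic likelihood function, assuming the semantic Bayes formula P(X|θj) = P(X)T(θj|X) / Σi P(xi)T(θj|xi) that the later slides rely on; the GPS-style grid and parameters are invented for illustration.

```python
# Hedged sketch: from a fuzzy truth function T(theta_j|X) and a source P(X),
# produce the semantic likelihood P(X|theta_j) via a Bayes-like normalization.
import numpy as np

x = np.linspace(0, 100, 1001)                     # e.g. positions along a road

# source P(X): where the receiver may actually be (illustrative)
prior = np.exp(-(x - 40.0) ** 2 / (2 * 20.0 ** 2)); prior /= prior.sum()

# truth function of y_j = "X is near 55", a fuzzy set A_j (illustrative width)
truth = np.exp(-(x - 55.0) ** 2 / (2 * 5.0 ** 2))            # maximum value is 1

logical_prob = np.sum(prior * truth)              # T(theta_j) = sum_i P(x_i) T(theta_j|x_i)
semantic_likelihood = prior * truth / logical_prob

print("most probable position:", x[np.argmax(semantic_likelihood)])
```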

  8. 7. Semantic Information Measure Compatible with Shannon's, Popper's, Fisher's, and Zadeh's Thoughts • If T(θj|X) = exp[−|X − xj|²/(2d²)], j = 1, 2, …, n, then the semantic information equals Bar-Hillel and Carnap's information minus a deviation term • This information measure reflects Popper's thought well: • The smaller the logical probability is, the more information there is; • The larger the deviation is, the less information there is; • A wrong estimation conveys negative information.
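The formula itself was an image on the slide. Reading this slide together with the next one, a plausible reconstruction, under the assumption that the semantic information of an individual message is I(xi; θj) = log[T(θj|xi)/T(θj)], is:

```latex
% Reconstruction (assumed): semantic information with a Gaussian truth function
I(x_i;\theta_j) = \log\frac{T(\theta_j|x_i)}{T(\theta_j)}
  = \underbrace{\log\frac{1}{T(\theta_j)}}_{\text{Bar-Hillel--Carnap information}}
    - \frac{(x_i-x_j)^2}{2d^2},
\qquad T(\theta_j)=\sum_i P(x_i)\,T(\theta_j|x_i).
```

Under this reading the three bullets become explicit: a smaller logical probability T(θj) increases the information, a larger deviation |xi − xj| decreases it, and once the deviation term exceeds log[1/T(θj)] the information becomes negative.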

  9. 8. Semantic Kullback-Leibler Information and Semantic Mutual Information • Averaging I(xi; θj) over the sampling distribution gives the semantic Kullback-Leibler information I(X; θj) • Relationship between the normalized log-likelihood and I(X; θj) • Averaging I(X; θj) gives the semantic mutual information.
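A hedged numeric sketch of the two averages, assuming that the first average is taken with the sampling distribution P(X|yj) and the second is weighted by P(yj), and that I(xi; θj) = log[T(θj|xi)/T(θj)] as reconstructed above; the channel values are invented, and the truth functions are taken as column-normalized transition probabilities for concreteness (anticipating slide 10).

```python
# Hedged sketch: semantic KL information I(X; theta_j) and semantic
# mutual information I(X; Theta).
import numpy as np

px = np.array([0.2, 0.3, 0.3, 0.2])                        # source P(X)
shannon = np.array([[0.9, 0.1],                             # P(y_j | x_i)
                    [0.6, 0.4],
                    [0.3, 0.7],
                    [0.1, 0.9]])
truth = shannon / shannon.max(axis=0, keepdims=True)        # T(theta_j | x_i), max 1

py = px @ shannon                                           # P(y_j)
pxy = (px[:, None] * shannon) / py                          # sampling distribution P(x_i | y_j)
logical = px @ truth                                        # T(theta_j)

info = np.log(truth / logical)                              # I(x_i; theta_j)
semantic_kl = (pxy * info).sum(axis=0)                      # I(X; theta_j), one value per j
semantic_mi = float(py @ semantic_kl)                       # I(X; Theta) = sum_j P(y_j) I(X; theta_j)

print(semantic_kl, semantic_mi)
```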

  10. 9. The Semantic Channel Matches Shannon's Channel • Optimize the truth function and the semantic channel: when the sample is large enough, the optimized truth function is proportional to the transition probability function P(yj|X) • xj* is the point where P(yj|X) reaches its maximum, i.e., P(yj|xj*) = max of P(yj|X) over X. If P(yj|X) or P(yj) is hard to obtain, we may use an equivalent form (see the sketch below) • With T*(θj|X), the semantic Bayesian prediction is equivalent to the traditional Bayesian prediction: P*(X|θj) = P(X|yj).
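A sketch of the matching rule as stated on this slide: normalize each transition probability function to a maximum of 1 to get T*(θj|X), then check that the semantic Bayesian prediction reproduces P(X|yj). The alternative form mentioned for when P(yj|X) or P(yj) is hard to obtain is assumed here to be the ratio P(X|yj)/P(X), normalized the same way.

```python
# Sketch: optimize the semantic channel so it matches the Shannon channel,
# then verify P*(X|theta_j) = P(X|y_j)  (semantic Bayes == traditional Bayes).
import numpy as np

px = np.array([0.2, 0.3, 0.3, 0.2])
shannon = np.array([[0.9, 0.1],
                    [0.6, 0.4],
                    [0.3, 0.7],
                    [0.1, 0.9]])                            # P(y_j | x_i)

t_star = shannon / shannon.max(axis=0, keepdims=True)       # T*(theta_j|X) ∝ P(y_j|X)

# semantic Bayesian prediction with T*
logical = px @ t_star                                       # T*(theta_j)
semantic_pred = (px[:, None] * t_star) / logical            # P*(X | theta_j)

# traditional Bayesian prediction from the Shannon channel
py = px @ shannon
traditional = (px[:, None] * shannon) / py                  # P(X | y_j)

# assumed alternative when P(y_j|X) or P(y_j) is hard to obtain:
# t_alt = traditional / px[:, None]; t_alt /= t_alt.max(axis=0, keepdims=True)

print(np.allclose(semantic_pred, traditional))              # True
```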

  11. 10. MSI in Comparison with MLE and MAP • MSI: Maximum Semantic Information estimation. MLE maximizes the likelihood; MAP maximizes the posterior of the parameters; MSI maximizes the semantic mutual information. • MSI has these features: 1) it is compatible with MLE, but also suitable for cases with a variable source P(X); 2) it is compatible with traditional Bayesian predictions; 3) it uses truth functions as predictive models, so that the models reflect the features of communication channels.

  12. 11. The Matching Function between the Shannon Mutual Information R and the Average Log-normalized-likelihood G • Starting from Shannon's rate-distortion function R(D) and replacing the distortion measure, we obtain the information rate - semantic information function R(G) • All R(G) functions are bowl-like.

  13. 12. The CM Algorithm for Mixture Models • The main formula for mixture models relates the semantic mutual information G to the Shannon mutual information R (without using Jensen's inequality) • Three steps: 1) Left-step-a; 2) Left-step-b: update P(Y) by P+1(yj) = ∑i P(xi)P(yj|xi), using an inner iteration, until H(Y||Y+1) → 0; 3) Right-step: adjust the Shannon channel to maximize G • The CM vs. the EM: Left-step-a ≈ E-step; Left-step-b + Right-step ≈ M-step. A runnable sketch follows below.
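The following is a rough, runnable sketch of the three steps as I read them from this and the next two slides; it is my interpretation, not the author's reference implementation, and the data, thresholds, and the way the Right-step re-fits the parameters are assumptions. Left-step-a computes the Shannon channel P(yj|xi) from the current mixture (as the E-step does); Left-step-b iterates P(yj) ← Σi P(xi)P(yj|xi) until it stabilizes; the Right-step then re-fits the component parameters.

```python
# Hedged sketch of the CM iteration for a 1-D Gaussian mixture,
# as interpreted from the slides.
import numpy as np

x = np.linspace(-10, 10, 401)

def gauss(x, c, d):
    return np.exp(-(x - c) ** 2 / (2 * d ** 2)) / (d * np.sqrt(2 * np.pi))

# observed P(X) produced by a true mixture unknown to the algorithm
px = 0.7 * gauss(x, -2.0, 1.0) + 0.3 * gauss(x, 3.0, 1.0); px /= px.sum()

py = np.array([0.5, 0.5]); c = np.array([-1.0, 2.0]); d = np.array([1.0, 1.0])
for step in range(50):
    # Left-step-a: Shannon channel P(y_j|x_i) from the current predictive model
    comp = py * np.array([gauss(x, ci, di) for ci, di in zip(c, d)]).T
    channel = comp / comp.sum(axis=1, keepdims=True)
    # Left-step-b: inner iteration P_{+1}(y_j) = sum_i P(x_i) P(y_j|x_i) until stable
    for _ in range(20):
        py_new = px @ channel
        if np.max(np.abs(py_new - py)) < 1e-8:
            break
        py = py_new
        comp = py * np.array([gauss(x, ci, di) for ci, di in zip(c, d)]).T
        channel = comp / comp.sum(axis=1, keepdims=True)
    # Right-step: re-fit component parameters (weighted means / deviations here)
    w = px[:, None] * channel
    c = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    d = np.sqrt((w * (x[:, None] - c) ** 2).sum(axis=0) / w.sum(axis=0))

print("P(Y) =", py, " centers =", c, " stds =", d)
```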

  14. 13. Illustrating the Convergence of the CM Algorithm • The central idea of the CM is: finding the point where G ≈ R on the two-dimensional R-G plane, while also driving R → R* (the EM algorithm neglects R → R*) • Using the formula H(Q||P) = R(G) − G (similar to a min-max method) • Two examples: one starting with R < R* (or Q < Q*) and one starting with R > R* (or Q > Q*).

  15. 14. An Iterative Example of Mixture Models with R < R* or Q < Q* • The E-step in the EM equals Left-step-a in the CM • The M-step in the EM ≈ Left-step-b + Right-step in the CM • In the M-step of the EM, if we first optimize P(Y) so that P+1(yj) = ∑i P(xi)P(yj|xi, θ) = P(yj) (the inner iteration in the sketch above), and then optimize the parameters to maximize Q, the EM becomes equal to the CM • The number of iterations is 5.

  16. 15. A Counterexample with R > R* or Q > Q* against the EM • True, starting, and ending parameters (table) • Q of the EM = −N·H(X, Y|θ) = −N·H(X, Θ); that is, Q of the EM is not monotonically increasing • Excel demo files can be downloaded from http://survivor99.com/lcg/cc-iteration.zip • The number of iterations is 5.

  17. 16. Illustrating Maximum Likelihood Classification for Mixture Models • After we obtain the optimized P(X|Θ), we need to select Y (to make a decision or classification) according to X • The parameter s in the R(G) function reminds us that we may use a Shannon channel P(yj|X) parameterized by s, j = 1, 2, …, n (see the sketch below) • When s → ∞, P(yj|X) = 0 or 1.
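The exact formula for this channel was an image on the slide. Below is a hedged sketch that assumes a tempered channel of the form Ps(yj|X) ∝ [P(X|θj)]^s; as s → ∞ this hardens into picking the component with the largest P(X|θj), i.e., maximum likelihood classification. The component parameters are invented.

```python
# Hedged sketch: a tempered Shannon channel that hardens into a
# maximum-likelihood classifier as s grows (the slide's exact formula is assumed).
import numpy as np

def gauss(x, c, d):
    return np.exp(-(x - c) ** 2 / (2 * d ** 2)) / (d * np.sqrt(2 * np.pi))

centers, stds = np.array([-2.0, 3.0]), np.array([1.0, 1.5])   # optimized components

def channel(x, s):
    """P_s(y_j | x) proportional to [P(x | theta_j)]^s; rows over j sum to 1."""
    lik = gauss(np.atleast_1d(x)[:, None], centers, stds) ** s
    return lik / lik.sum(axis=1, keepdims=True)

x = np.array([-1.0, 0.4, 2.0])
print(channel(x, s=1))           # soft assignment
print(channel(x, s=50))          # nearly 0/1: maximum-likelihood classification
print(np.argmax(gauss(x[:, None], centers, stds), axis=1))    # hard argmax limit
```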

  18. 17. The Numbers of Iterations for Convergence • For Gaussian mixture models with component number n = 2.

  19. 18. The CM Algorithm for Tests and Estimations Is Simpler than the EM Algorithm • For tests and estimations with given P(X) and P(Z|X), the CM algorithm can be used to find the best boundaries that achieve the maximum Shannon mutual information and the maximum average log-likelihood.

  20. 19. Illustrating the CM Algorithm for Tests and Estimations with the R(G) Function • Iterative steps and convergence reasons: 1) For each Shannon channel, there is a matched semantic channel that maximizes the average log-likelihood; 2) For a given P(X) and semantic channel, we can find a better Shannon channel; 3) Repeating these two steps yields the Shannon channel that maximizes the Shannon mutual information and the average log-likelihood.

  21. 20. An Example for Estimations • A 3×3 Shannon channel is used to show reliable convergence • Even if a pair of bad starting points is used, convergence is still reliable • Using good starting points, the number of iterations is 4; using bad starting points, the number of iterations is 11. [Figures: the beginning and the convergent states.]

  22. 21. Summary • The channels' matching algorithm is a new tool for statistical learning. To show its power, this presentation uses it to resolve the problems with mixture models. In real applications, X may be multi-dimensional; however, the convergence reasons and the reliability should be the same. ——End—— Thank you for listening! Criticism is welcome! • Reported at ICIS 2017 (the 2nd International Conference on Intelligence Science, Shanghai) on 2017-08-26 • More papers about the author's semantic information theory: http://survivor99.com/lcg/books/GIT/index.htm
