
Research Experience



1. From Bayesian Inference to Logical Bayesian Inference: A New Mathematical Frame for Semantic Communication and Machine Learning
Chenguang Lu (lcguang@foxmail.com), College of Intelligence Engineering and Mathematics, Liaoning Engineering and Technology University, Fuxin, Liaoning, China
A more detailed Chinese version is available at http://survivor99.com/lcg/recent

2. Research Experience
• In the 1990s, studied semantic information theory, color vision, and portfolio theory
• Recently combined the semantic information method and the likelihood method for machine learning:
• Maximum mutual information classification
• Mixture models and multi-label learning
• Improved Bayesian inference to Logical Bayesian Inference (group A1)
• I first studied philosophical problems such as color vision and the sense of beauty. Because the color-vision model involved fuzzy mathematics, I became a visiting scholar of Professor Peizhuang Wang and completed A Generalized Information Theory. Later I studied portfolio theory and went into investment. Recently, encouraged by Professor Wang, I resumed research, combining the semantic information method and the likelihood method to study machine learning: maximum mutual information classification, mixture models, Bayesian inference, and multi-label classification (also presented in session B1 of this conference).

3. Bayes' Reasoning and Bayesian Inference
• Predictions by the different approaches should be compatible for huge samples.
• My understanding: Bayes' reasoning divides into probability reasoning without θ (including classical Bayes' prediction) and inference using θ.
• Likelihood inference: tool P(X|θj); maximizes log P(X|θj).
• Bayesian inference: tools P(θ) and P(X|θ) -> P(θ|X) = P(θ)P(X|θ)/Pθ(X); maximizes log P(θ|X) for MAP.
• Logical Bayesian Inference: tool is the truth or membership function T(θj|X); maximizes log[T(θj|X)/T(θj)] = log[P(X|θj)/P(X)].

4. Classical Bayes' Prediction
• Tool: transition probability function P(yj|X), i.e., Shannon's channel P(Y|X): P(yj|X), j = 1, 2, … (note: P(yj|X) is not normalized over X)
• Two steps:
• Step 1: Obtain the prediction tool P(yj|X) from a sample or sampling distribution P(X, Y).
• Step 2: For given P(X) or P'(X) and yj, make the probability prediction P(X|yj) = P(yj|X)P(X)/P(yj), where P(yj) = Σi P(xi)P(yj|xi).
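A minimal sketch of the two steps above, using a small discrete example with made-up numbers (X = three source values, Y = two labels):

```python
# Classical Bayes' prediction: learn the channel P(y_j|x), then predict P(x|y_j)
# under a changed source prior P'(x). All numbers here are illustrative.
import numpy as np

# Step 1: estimate the channel P(y_j|x) from a joint sampling distribution P(X, Y)
P_xy = np.array([[0.30, 0.00],   # rows: x values, columns: y values
                 [0.25, 0.05],
                 [0.10, 0.30]])
P_x = P_xy.sum(axis=1)                     # prior P(X) in the training data
P_y_given_x = P_xy / P_x[:, None]          # channel P(Y|X); not normalized over X

# Step 2: for a new source prior P'(X), predict P(X|y_j) by Bayes' theorem
P_x_new = np.array([0.2, 0.3, 0.5])        # changed prior P'(X)
j = 1                                      # predict for label y_1
P_yj = P_x_new @ P_y_given_x[:, j]         # P'(y_j) = sum_i P'(x_i) P(y_j|x_i)
P_x_given_yj = P_y_given_x[:, j] * P_x_new / P_yj
print(P_x_given_yj)                        # posterior prediction under the new prior
```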

5. From Classical Bayes' Prediction to Likelihood Prediction
• Advantage: when P(X) becomes P'(X), the tool P(yj|X) still works.
• Disadvantage: if samples are small, we cannot obtain a continuous P(yj|X).
• So Fisher developed the likelihood method.
• Tool: likelihood function P(X|θj).
• Step 1: For a sample sequence x(1), x(2), …, x(n) under the IID assumption, we maximize the likelihood to get the optimized θj*.
• Step 2: Use P(X|θj*) to make the probability prediction.
• Disadvantage: when P(X) becomes P'(X), P(X|θj*) becomes invalid.
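A minimal sketch of likelihood prediction under an assumed Gaussian model; the data and parameters below are invented for illustration:

```python
# Fit theta_j by maximum likelihood for a Gaussian model, then use P(X|theta_j*)
# as the prediction tool (slide 5).
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=30.0, scale=5.0, size=50)    # x(1), ..., x(n), IID

# Step 1: maximum likelihood estimates (closed form for a Gaussian)
mu_star = sample.mean()
sigma_star = sample.std()                             # ML estimate uses 1/n

# Step 2: the fitted likelihood function P(x|theta_j*) used for prediction
def likelihood(x, mu=mu_star, sigma=sigma_star):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print(mu_star, sigma_star, likelihood(28.0))
# Limitation noted on the slide: if the source prior P(X) changes to P'(X),
# this P(X|theta_j*) no longer reflects the new population.
```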

6. Maximum Likelihood Criterion = Maximum Generalized KL Information Criterion
• Likelihood = negative cross-entropy: assume Nj -> ∞ and that the IID assumption is tenable; then (1/Nj) log P(x(1), …, x(Nj)|θj) = Σi P(xi|yj) log P(xi|θj), where P(xi|yj) is the conditional sampling distribution and P(X|θj) is the likelihood function.
• Generalized KL information: I(X; θj) = Σi P(xi|yj) log[P(xi|θj)/P(xi)].
• Since P(X) does not depend on θj, maximizing the likelihood is equivalent to maximizing the generalized KL information.
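A small numeric check of this equivalence with toy numbers: the θ that maximizes the average log-likelihood also maximizes the generalized KL information, because the two objectives differ only by the θ-independent term Σi P(xi|yj) log P(xi).

```python
import numpy as np

x = np.arange(0, 50, dtype=float)
P_x = np.full_like(x, 1 / len(x))                                  # prior P(x)
P_x_yj = np.exp(-0.5 * ((x - 20) / 5) ** 2); P_x_yj /= P_x_yj.sum()  # sampling dist. P(x|y_j)

def model(mu, sigma=5.0):                                          # likelihood family P(x|theta)
    p = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return p / p.sum()

mus = np.linspace(0, 49, 491)
avg_loglik = [P_x_yj @ np.log(model(m)) for m in mus]        # sum_i P(x_i|y_j) log P(x_i|theta)
gkl_info = [P_x_yj @ np.log(model(m) / P_x) for m in mus]    # sum_i P(x_i|y_j) log[P(x_i|theta)/P(x_i)]
print(mus[np.argmax(avg_loglik)], mus[np.argmax(gkl_info)])  # same maximizer (about 20)
```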

7. Bayesian Inference: Advantages and Disadvantages
• Tool: Bayesian posterior P(θ|X) = P(θ)P(X|θ)/Pθ(X)
• Advantages: 1) it considers the prior knowledge P(θ); 2) as the sample size increases, the distribution P(θ|X) shrinks to the MAP estimate θj*; 3) …
• Disadvantages: 1) it does not use P(X), the prior of X; 2) its probability prediction is not compatible with classical Bayes' prediction; 3) …

8. Two Reasons for Logical Bayesian Inference
• Reason 1: We need an inference tool P(θj|X) (a reverse likelihood function) so that when P(X) becomes P'(X), the prediction remains compatible with classical probability prediction.
• Reason 2: We need to obtain the denotation or semantic meaning of yj from the sampling distribution P(X|yj) and P(X).

9. Using the Truth Function or Membership Function T(θj|X) as the Inference Tool
• Given the age prior distribution P(x) and the posterior distribution P(x|"adult" is true), we wish to:
• 1) get the denotation, i.e., the truth function or membership function T(θ1|X), of "adult";
• 2) get a new probability prediction or new likelihood function when P(x) becomes P'(x);
• 3) obtain its membership function if the set {Adult} is fuzzy.
• [Figure: prior and posterior age distributions and the denotation of "adult"]
• Existing mathematical methods cannot obtain this denotation; nevertheless, our brains can.
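A minimal sketch, with made-up age bins and probabilities, of how the denotation of "adult" can be obtained from P(x) and P(x|"adult") and then used to re-predict under a changed prior P'(x) (the formulas are the ones given on slides 10, 11, and 15):

```python
import numpy as np

ages = np.arange(0, 80, 10)                                     # age bins (toy example)
P_x = np.array([.15, .15, .15, .15, .12, .12, .08, .08])        # prior P(x)
P_x_adult = np.array([0, .02, .18, .20, .17, .17, .13, .13])    # posterior P(x|"adult")

# Truth (membership) function of "adult": likelihood ratio, longitudinally normalized
ratio = P_x_adult / P_x
T_adult = ratio / ratio.max()             # T*(theta|x) = [P(x|"adult")/P(x)] / max[...]

# Prediction under a new prior P'(x): P'(x|theta) = T(theta|x) P'(x) / T(theta),
# where T(theta) = sum_i P'(x_i) T(theta|x_i) is the logical probability.
P_x_new = np.array([.10, .10, .12, .14, .14, .14, .13, .13])
T_theta = P_x_new @ T_adult
P_x_adult_new = T_adult * P_x_new / T_theta
print(np.round(T_adult, 3))
print(np.round(P_x_adult_new, 3))
```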

10. Why Do We Use the Truth Function T(θj|X) Instead of the Reverse Likelihood Function P(θj|X)?
• Reason 1: T(θj|X) indicates the denotation or semantic meaning of yj.
• Reason 2: It is difficult to obtain P(θj|X), because Σj P(θj|X) = 1 whereas max[T(θj|X)] = 1.
• P(θj|X) = P(θj)P(X|θj)/P(X) needs three items.
• T(θj|X) = [P(X|θj)/P(X)] / max[P(X|θj)/P(X)] needs only two items, without the Bayesian prior P(θ); it comes from the third kind of Bayes' theorem.

11. The Third Kind of Bayes' Theorem I Proposed
• Bayes' Theorem I, proposed by Bayes (for logical probabilities of sets): T(B|A) = T(A|B)T(B)/T(A)
• Bayes' Theorem II, used by Shannon: P(xi|yj) = P(yj|xi)P(xi)/P(yj)
• Bayes' Theorem III (my discovery) consists of two asymmetrical formulas linking the truth function, the logical probability T(θj), and the semantic likelihood function P(X|θj):
P(X|θj) = T(θj|X)P(X)/T(θj), with T(θj) = Σi P(xi)T(θj|xi);
T(θj|X) = [P(X|θj)/P(X)] / max[P(X|θj)/P(X)].
• When samples are huge, the optimized truth function is T*(θj|X) = [P(X|yj)/P(X)] / max[P(X|yj)/P(X)] = P(yj|X)/max[P(yj|X)].
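A small numeric check, with arbitrary numbers, that the two asymmetrical formulas of Bayes' Theorem III invert each other:

```python
import numpy as np

P_x = np.array([0.2, 0.3, 0.5])
T = np.array([0.1, 0.6, 1.0])            # any truth function with max = 1

T_theta = P_x @ T                         # logical probability T(theta)
P_x_theta = T * P_x / T_theta             # semantic likelihood P(X|theta)

ratio = P_x_theta / P_x
T_back = ratio / ratio.max()              # recover the truth function
assert np.allclose(T_back, T)
print(T_theta, P_x_theta)
```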

12. Illustrating Bayes' Theorem III
• Assume θj = Aj, i.e., the set is crisp.
• For a fuzzy set θj and discontinuous sampling distributions, we need the semantic information method to optimize the truth function.

13. Semantic Information Measures: Logical Probability as Regularizer
• Semantic information of yj about xi, defined with the log-normalized likelihood: I(xi; θj) = log[T(θj|xi)/T(θj)] = log[P(xi|θj)/P(xi)]
• Generalized Kullback-Leibler (KL) information: I(X; θj) = Σi P(xi|yj) log[T(θj|xi)/T(θj)]
• Semantic mutual information: I(X; θ) = Σj P(yj) Σi P(xi|yj) log[T(θj|xi)/T(θj)]

14. Optimizing Truth Functions with the Maximum Semantic Information Criterion
• When sampling distributions are discontinuous, the truth function is optimized by maximizing the generalized KL information: T*(θj|X) = argmax Σi P(xi|yj) log[T(θj|xi)/T(θj)].
• If T(θj|X) = exp[-k(yj - xi)²], then maximizing the semantic information means minimizing k Σi P(xi|yj)(yj - xi)² + log T(θj).
• So the semantic information criterion is a special Regularized Least Squares (RLS) criterion, and log T(θj) is the regularizer.
• With the above formula, for a continuous sampling distribution, the optimized truth function is again T*(θj|X) = [P(X|yj)/P(X)] / max[P(X|yj)/P(X)].
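A hedged sketch of this optimization: a Gaussian-shaped truth function T(θj|x) = exp[-k(x - c)²] is fitted by the maximum generalized KL information criterion, using a toy grid, a uniform prior, a sparse "discontinuous" sample, and a simple grid search (all of which are my assumptions, not the author's experiment):

```python
import numpy as np

x = np.arange(0, 100, 5, dtype=float)                 # discrete x values
P_x = np.full_like(x, 1 / len(x))                     # prior P(x), uniform toy prior
P_x_yj = np.zeros_like(x)                             # "discontinuous" sample of y_j
P_x_yj[[4, 5, 7]] = [0.3, 0.5, 0.2]                   # mass only on a few bins

def gkl_info(c, k):
    T = np.exp(-k * (x - c) ** 2)                     # candidate truth function
    T_theta = P_x @ T                                 # logical probability T(theta_j)
    return P_x_yj @ (np.log(T) - np.log(T_theta))     # sum_i P(x_i|y_j) log[T/T(theta)]

centers = np.arange(0, 100, 1.0)
ks = np.array([0.001, 0.003, 0.01, 0.03, 0.1])
best = max(((gkl_info(c, k), c, k) for c in centers for k in ks))
print("best center, k, info (nats):", best[1], best[2], best[0])
```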

15. Comparing BI and LBI
• Basic formula of Bayesian Inference (BI): P(θ|X) = P(θ)P(X|θ)/Pθ(X), where Pθ(X) is the horizontal normalization coefficient.
• Its probability prediction is incompatible with classical Bayes' prediction.
• Basic formula of Logical Bayesian Inference (LBI): T(θj|X) = [P(X|θj)/P(X)] / max[P(X|θj)/P(X)], where T(θj) = Σi P(xi)T(θj|xi) is the longitudinal normalization coefficient.
• Its probability prediction, P(X|θj) = T(θj|X)P(X)/T(θj), is compatible with classical Bayes' prediction.

16. Application 1: Multi-label Learning and Classification
• Multi-label learning (training): obtain the optimized semantic channel T(θj|X), j = 1, 2, …, from Shannon's channel P(yj|X) by T*(θj|X) = P(yj|X)/max[P(yj|X)], or equivalently by T*(θj|X) = [P(X|yj)/P(X)] / max[P(X|yj)/P(X)].
• Multi-label classification (reasoning): the classifier selects the label with maximum semantic information, yj* = argmax_j log[T(θj|X)/T(θj)].
• If classes are clear, this encourages us to select a compound label with the least denotation.
• Compared with One-vs-Rest and Binary Relevance, it is much simpler.
• For details, see "Semantic channel and Shannon's channel mutually match for multi-label classification" in session B1 of the same conference.
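A toy sketch of these two steps, with invented numbers: learn semantic channels T(θj|x) from a Shannon channel, then classify by maximum semantic information log[T(θj|x)/T(θj)].

```python
import numpy as np

# Shannon channel P(y_j|x): rows = x values, columns = labels y_j
P_y_given_x = np.array([[0.9, 0.4, 0.1],
                        [0.5, 0.8, 0.3],
                        [0.1, 0.6, 0.9]])
P_x = np.array([0.3, 0.4, 0.3])

# Training: semantic channel by longitudinal normalization over x
T = P_y_given_x / P_y_given_x.max(axis=0)    # T*(theta_j|x) = P(y_j|x)/max_x P(y_j|x)
T_theta = P_x @ T                            # logical probabilities T(theta_j)

# Reasoning: for an observed x, pick the label with maximum semantic information
def classify(i_x):
    info = np.log(T[i_x] / T_theta)          # log[T(theta_j|x)/T(theta_j)]
    return int(np.argmax(info))

print([classify(i) for i in range(3)])
```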

17. Application 2: Maximum Mutual Information Classification for Unseen Instances
• Channels' Matching (CM) iteration algorithm, e.g., for medical tests, watermelon classification, spam filtering, …
• [Figure: optimizing the dividing point z' for a medical test (positive/negative vs. diseased/not diseased) using the information lines I(X;θ0|Z) and I(X;θ1|Z)]
• Given P(X), P(Z|X), and a starting dividing point z', repeat two steps:
• Matching I: T(θj|X) matches P(yj|X). Given z' and P(X, Z), we get P(yj|X) and T(θj|X).
• Matching II: P(yj|X) matches T(θj|X). For given Z there are information lines I(X;θj|Z), j = 1, 2, …; the classifier for the new z' picks, for each z, the label with maximum I(X;θj|Z = z).
• If z' is unchanged, end; otherwise go to Matching I.
• Fast convergence: usually 3-5 iterations are needed. For the convergence proof, see http://survivor99.com/lcg/CM/CM4tests.pdf
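A hedged sketch of the CM iteration for a two-class medical test with a one-dimensional score Z; the score distributions, prior, and starting partition below are invented:

```python
import numpy as np

z = np.arange(0, 40)                                   # test-score bins
P_x = np.array([0.8, 0.2])                             # P(x0 = healthy), P(x1 = sick)
P_z_given_x = np.vstack([                              # P(Z|X): two bell curves on the grid
    np.exp(-0.5 * ((z - 15) / 4) ** 2),
    np.exp(-0.5 * ((z - 25) / 4) ** 2)])
P_z_given_x /= P_z_given_x.sum(axis=1, keepdims=True)
P_xz = P_x[:, None] * P_z_given_x                      # joint P(X, Z)
P_x_given_z = P_xz / P_xz.sum(axis=0, keepdims=True)

labels = (z >= 30).astype(int)                         # bad starting classifier y(z)
for _ in range(10):
    # Matching I: from the current partition, get P(y_j|x) and T(theta_j|x)
    P_y_given_x = np.stack([P_z_given_x[:, labels == j].sum(axis=1) for j in (0, 1)], axis=1)
    T = P_y_given_x / P_y_given_x.max(axis=0)          # semantic channel
    T_theta = P_x @ T                                  # logical probabilities
    # Matching II: reclassify each z by maximum conditional semantic information
    info = P_x_given_z.T @ np.log(T / T_theta)         # I(X;theta_j|z) for each z, j
    new_labels = info.argmax(axis=1)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels
print("dividing point z' ~", z[labels.argmax()])       # first z classified as positive
```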

18. Channels' Matching (CM) Iterative Algorithm: An Example Shows Its Reliability
• [Figure: information lines for classes x0, x1, x2 and labels y0, y1, y2 at the beginning and at the end of the iteration, with dividing points z1' and z2']
• Two dividing points for three classes
• Bad start: z1' = 11 and z2' = 21
• The iteration converges after 11 iterations

19. Application 3: Mixture Models
• Sampling distribution: P(X) = Σj P*(yj)P(X|θj*)
• Predicted distribution: Pθ(X) = Σj P(yj)P(X|θj), determined by θ = (μ, σ) and P(Y)
• The iterations start with Pθ(X) ≠ P(X) and end with Pθ(X) ≈ P(X), making the relative entropy (KL divergence) between P(X) and Pθ(X) approach zero.

20. CM-EM Algorithm for Mixture Models
• EM algorithm: the basic idea is to maximize Q repeatedly.
• E-step: construct Shannon's channel from P(yj) and θj by P(yj|X) = P(yj)P(X|θj)/Pθ(X).
• M-step: maximize Q = Σj Σi P(xi)P(yj|xi) log P(xi|θj); the new mixture ratio is P(yj) = Σi P(xi)P(yj|xi).
• Problem: Q does not always increase (see the example on the next slide).
• New CM-EM algorithm: the basic idea is to minimize R - G = I(X;Y) - I(X;θ).
• E1-step = E-step.
• E2-step: modify P(Y) by repeating P(yj) = Σi P(xi)P(yj|xi) until P(Y) converges.
• MG-step: maximize the semantic mutual information G = I(X;θ).
• I have found a strict convergence proof; see http://survivor99.com/lcg/CM/Emwayout.pdf
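A hedged sketch of the CM-EM idea for a one-dimensional two-component Gaussian mixture on a discretized grid. The data, initial values, and the reading of the E2-step as "repeat the mixture-ratio update until P(Y) stabilizes" are my assumptions for illustration, not the author's exact implementation:

```python
import numpy as np

x = np.linspace(-10, 10, 201)
def gauss(mu, sigma):
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return g / g.sum()                                  # discrete P(x|theta_j) on the grid

P_x = 0.3 * gauss(-2, 1.0) + 0.7 * gauss(3, 2.0)        # "sampling distribution" P(X)

P_y = np.array([0.5, 0.5])                               # initial mixture ratios
mu, sigma = np.array([-5.0, 5.0]), np.array([3.0, 3.0])  # initial theta

for _ in range(50):
    comp = np.stack([gauss(mu[j], sigma[j]) for j in (0, 1)])   # P(X|theta_j)
    # E1-step: Shannon channel P(y_j|x) from P(Y) and theta
    # E2-step: repeat the mixture-ratio update until P(Y) stops changing
    for _ in range(20):
        P_y_given_x = (P_y[:, None] * comp) / (P_y @ comp)
        P_y_new = P_y_given_x @ P_x
        if np.allclose(P_y_new, P_y):
            break
        P_y = P_y_new
    # MG-step: maximize G = I(X; theta) over mu, sigma (weighted mean / std updates)
    w = P_y_given_x * P_x                                # joint weights P(x) P(y_j|x)
    w /= w.sum(axis=1, keepdims=True)
    mu = w @ x
    sigma = np.sqrt(np.array([w[j] @ (x - mu[j]) ** 2 for j in (0, 1)]))

print(np.round(P_y, 3), np.round(mu, 2), np.round(sigma, 2))
```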

21. An Example Against the EM Convergence Proof
• Q may and should decrease. Only 5 iterations are needed.

22. Comparing CM-EM with the EM and MM Algorithms
• For the same example used by Neal and Hinton: the EM algorithm needs 36 iterations; the MM algorithm (Neal and Hinton) needs 18 iterations; the CM-EM algorithm needs only 9 iterations.
• References:
1. Lu, Chenguang: From the EM Algorithm to the CM-EM Algorithm for Global Convergence of Mixture Models, http://arxiv.org/a/lu_c_3.
2. Neal, Radford; Hinton, Geoffrey: ftp://ftp.cs.toronto.edu/pub/radford/emk.pdf

23. Application 4: Confirmation Measure or Optimized Degree of Belief for Induction
• |Degree of belief b*| = 1 - truth value of the counterexample b'*
• In medical tests, from the sensitivity P(y1|x1) and the specificity P(y0|x0), we can obtain P(Y|X) and T*(θ|X).
• References:
Tentori, K. et al.: Comparison of confirmation measures. Cognition 103(1), 107–119 (2007).
Lu, C.: Semantic Information Measure with Two Types of Probability for Falsification and Confirmation, https://arxiv.org/abs/1609.07827
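A hedged sketch of computing the degree of belief from a test's sensitivity and specificity, following the summary formula b* = 1 - counterexample-ratio / positive-example-ratio on slide 25; the symmetric negative branch (when counterexamples dominate) is my reading of slide 24, and the numbers are invented:

```python
def degree_of_belief(positive_rate, counter_rate):
    """positive_rate: e.g. P(y1|x1) = sensitivity; counter_rate: e.g. P(y1|x0)."""
    if positive_rate >= counter_rate:
        return 1.0 - counter_rate / positive_rate
    return -(1.0 - positive_rate / counter_rate)         # b* in [-1, 0)

sensitivity, specificity = 0.9, 0.95
b1 = degree_of_belief(sensitivity, 1 - specificity)      # belief in "positive -> sick"
b0 = degree_of_belief(specificity, 1 - sensitivity)      # belief in "negative -> healthy"
print(round(b1, 3), round(b0, 3))
# As slide 24 notes, b1 depends mainly on the correct rate of "negative"
# (the specificity), and b0 depends mainly on the sensitivity.
```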

24. Optimized Degree of Belief b* vs. Confidence Level CL
• Confidence level of y1 = "positive": CL1 = P(y1|x1) / [P(y1|x0) + P(y1|x1)]
• Relationship: since b1* = 1 - P(y1|x0)/P(y1|x1), we have CL1 = 1/(2 - b1*) for b1* ≥ 0.
• "All ravens are white" has b* ≈ -1, because all the evidence supports the negation of the predicate; "All ravens are fat" has b* = 0. This ensures that y1 with b* = -1 ≡ y1' with b* = 1.
• Important conclusions: 1) The b* of "positive" depends mainly on the correct rate of "negative", and vice versa, which is compatible with the consensus in the medical community. 2) Fewer counterexamples matter more than more positive examples, so b* is compatible with Popper's falsification theory.

25. Summary
• Use the truth function or membership function T(θj|X) as the inference tool, so that predictions are compatible with classical Bayes' prediction.
• Use the prior P(X) or P'(X) instead of P(θ).
• Label learning or training: T(θj|X) matches P(yj|X) to maximize I(X;θ).
• Label selecting or reasoning: P(yj|X) matches T(θj|X) to maximize I(X;θ).
• Maximum mutual information classification: repeat the two matches.
• Mixture models: minimize I(X;Y) - I(X;θ) repeatedly.
• The confirmation measure is compatible with Popper's falsification theory: b* = 1 - counterexample-ratio / positive-example-ratio.
Thank you for listening! You are welcome to exchange ideas. For more papers on semantic information theory and machine learning, see http://survivor99.com/lcg/books/GIT/index.htm or http://arxiv.org/a/lu_c_3
