

  1. 2D1431 Machine Learning Bayesian Learning

  2. Outline • Bayes theorem • Maximum likelihood (ML) hypothesis • Maximum a posteriori (MAP) hypothesis • Naïve Bayes classifier • Bayes optimal classifier • Bayesian belief networks • Expectation maximization (EM) algorithm

  3. Handwritten character classification

  4. Gray level pictures: object classification

  5. Gray level pictures: human action classification

  6. Literature & Software • T. Mitchell: chapter 6 • S. Russell & P. Norvig, “Artificial Intelligence – A Modern Approach” : chapters 14+15 • R.O. Duda, P.E. Hart, D.G. Stork, “Pattern Classification 2nd ed.” : chapters 2+3 • David Heckerman: “A Tutorial on Learning with Bayesian Belief Networks” http://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf • Bayes Net Toolbox for Matlab (free), Kevin Murphy http://www.cs.berkeley.edu/~murphyk/Bayes/bnt.html

  7. Bayes Theorem P(h|D) = P(D|h) P(h) / P(D) • P(D) : prior probability of the data D, evidence • P(h) : prior probability of the hypothesis h, prior • P(h|D) : posterior probability of the hypothesis given the data D, posterior • P(D|h) : probability of the data D given the hypothesis h, likelihood of the data

  8. Bayes Theorem P(h|D) = P(D|h) P(h) / P(D) posterior = likelihood x prior / evidence • By observing the data D we can convert the prior probability P(h) into the a posteriori probability (posterior) P(h|D) • The posterior is the probability that h holds after the data D has been observed • The evidence P(D) can be viewed merely as a scaling factor that guarantees that the posterior probabilities sum to one

  9. Choosing Hypotheses P(h|D) = P(D|h) P(h) / P(D) • Generally we want the most probable hypothesis given the training data • Maximum a posteriori hypothesis hMAP: hMAP = argmax h∈H P(h|D) = argmax h∈H P(D|h) P(h) / P(D) = argmax h∈H P(D|h) P(h) • If all hypothesis priors are equal, P(hi) = P(hj), one can instead choose the maximum likelihood (ML) hypothesis hML = argmax h∈H P(D|h)

  10. Bayes Theorem Example A patient takes a lab test and the result is positive. The test returns a correct positive (⊕) result in 98% of the cases in which the disease is actually present, and a correct negative (⊖) result in 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have the disease. Hypotheses: disease, ¬disease. Priors P(h): P(disease) = 0.008, P(¬disease) = 0.992. Likelihoods P(D|h): P(⊕|disease) = 0.98, P(⊖|disease) = 0.02, P(⊕|¬disease) = 0.03, P(⊖|¬disease) = 0.97. Maximum posterior argmax P(h|D): P(disease|⊕) ∝ P(⊕|disease) P(disease) = 0.0078, P(¬disease|⊕) ∝ P(⊕|¬disease) P(¬disease) = 0.0298. Normalizing: P(disease|⊕) = 0.0078/(0.0078+0.0298) = 0.21, P(¬disease|⊕) = 0.0298/(0.0078+0.0298) = 0.79
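
A minimal sketch (not part of the original slides) that reproduces the disease-test calculation above in Python; all numbers are taken from the slide.

```python
# Priors and likelihoods from the slide
p_disease, p_no_disease = 0.008, 0.992
p_pos_given_disease = 0.98       # correct positive rate
p_pos_given_no_disease = 0.03    # false positive rate (1 - 0.97)

# Unnormalized posteriors: P(h|+) ~ P(+|h) P(h)
u_disease = p_pos_given_disease * p_disease            # ~0.0078
u_no_disease = p_pos_given_no_disease * p_no_disease   # ~0.0298

evidence = u_disease + u_no_disease                    # P(+), the scaling factor
print(u_disease / evidence)      # ~0.21 = P(disease | +)
print(u_no_disease / evidence)   # ~0.79 = P(~disease | +)
```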

  11. Basic Formulas for Probabilities • Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A) • Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B) • Theorem of total probability: if A1, A2, …, An are mutually exclusive events with Σ_i P(Ai) = 1, then P(B) = Σ_i P(B|Ai) P(Ai)

  12. Bayes Theorem Example P(x1,x2|μ1,μ2,σ) = 1/(2πσ²) exp(−Σ_i (xi−μi)²/2σ²), hypothesis h = {μ1,μ2,σ}, data D = {x_1,…,x_M}

  13. Gaussian Probability Function • P(D|μ1,μ2,σ) = Π_m P(x_m|μ1,μ2,σ) • Maximum likelihood hypothesis hML: hML = argmax μ1,μ2,σ P(D|μ1,μ2,σ) • Trick: maximize the log-likelihood log P(D|μ1,μ2,σ) = Σ_m log P(x_m|μ1,μ2,σ) = Σ_m log [1/(2πσ²) exp(−Σ_i (x_mi−μi)²/2σ²)] = −M log(2πσ²) − Σ_m Σ_i (x_mi−μi)²/2σ²

  14. Gaussian Probability Function ∂ log P(D|μ1,μ2,σ)/∂μi = 0 → Σ_m (x_mi − μi) = 0 → μi,ML = 1/M Σ_m x_mi = E[x_mi] ∂ log P(D|μ1,μ2,σ)/∂σ = 0 → σ²ML = Σ_m Σ_i (x_mi − μi,ML)²/2M = E[Σ_i (x_mi − μi,ML)²]/2 Maximum likelihood hypothesis hML = {μi,ML, σML}
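
A short sketch (synthetic data, not from the slides) checking the closed-form ML estimates above for an isotropic 2D Gaussian; the sample size and true parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = np.array([0.2, -0.1]), 1.5
X = rng.normal(true_mu, true_sigma, size=(500, 2))     # M = 500 samples x_m = (x_m1, x_m2)

mu_ml = X.mean(axis=0)                                  # mu_i,ML = (1/M) sum_m x_mi
sigma2_ml = ((X - mu_ml) ** 2).sum(axis=1).mean() / 2   # sigma^2_ML = sum_m sum_i (x_mi - mu_i)^2 / (2M)
print(mu_ml, np.sqrt(sigma2_ml))                        # should be close to true_mu and true_sigma
```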

  15. Maximum Likelihood Hypothesis • μML = (0.20, −0.14), σML = 1.42

  16. Bayes Decision Rule (figure: scatter plot) • x = examples of class c1, modeled by a Gaussian with parameters {μ1,σ1} • o = examples of class c2, modeled by a Gaussian with parameters {μ2,σ2}

  17. Bayes Decision Rule • Assume two Gaussian distributions associated with two separate classes c1, c2: P(x|ci) = P(x|μi,σi) = 1/(2πσi²) exp(−Σ_j (xj−μij)²/2σi²) • Bayes decision rule (maximum posterior probability): decide c1 if P(c1|x) > P(c2|x), otherwise decide c2 • If P(c1) = P(c2), use the maximum likelihood P(x|ci); otherwise use the maximum posterior P(ci|x) ∝ P(x|ci) P(ci)
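
A small sketch (illustrative parameters, not from the slides) of this decision rule for two isotropic Gaussian class models; the means, widths and priors are assumptions for the example.

```python
import numpy as np

def gaussian_likelihood(x, mu, sigma):
    # P(x | mu, sigma) for an isotropic 2D Gaussian
    d = x - mu
    return np.exp(-(d @ d) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

def decide(x, mu1, s1, p1, mu2, s2, p2):
    # Decide c1 if P(c1|x) > P(c2|x); posteriors are compared via likelihood * prior
    return "c1" if gaussian_likelihood(x, mu1, s1) * p1 > gaussian_likelihood(x, mu2, s2) * p2 else "c2"

print(decide(np.array([0.0, 0.0]),
             mu1=np.array([-1.0, 0.0]), s1=1.0, p1=0.5,
             mu2=np.array([2.0, 0.0]), s2=1.0, p2=0.5))   # -> "c1" (x is closer to mu1)
```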

  18. Bayes Decision Rule (figure: the two class densities and the resulting decision regions for c1 and c2)

  19. Two-Category Case • Discriminant function: if g(x) > 0 then c1 else c2 • g(x) = P(c1|x) − P(c2|x) ∝ P(x|c1) P(c1) − P(x|c2) P(c2) • Equivalently g(x) = log P(c1|x) − log P(c2|x) = log [P(x|c1)/P(x|c2)] + log [P(c1)/P(c2)] • For Gaussian probability functions with identical σ: g(x) = (x−μ2)²/2σ² − (x−μ1)²/2σ² + log P(c1) − log P(c2), so the decision surface is a line/hyperplane
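
A brief sketch (our own toy numbers) of the equal-variance discriminant above; the quadratic terms cancel, which is why the boundary g(x) = 0 is a hyperplane.

```python
import numpy as np

def g(x, mu1, mu2, sigma, p1, p2):
    # g(x) = (x-mu2)^2/2s^2 - (x-mu1)^2/2s^2 + log P(c1) - log P(c2)
    x, mu1, mu2 = (np.asarray(v, dtype=float) for v in (x, mu1, mu2))
    return ((x - mu2) @ (x - mu2) - (x - mu1) @ (x - mu1)) / (2 * sigma**2) + np.log(p1) - np.log(p2)

# g(x) > 0 -> choose c1, else c2
print(g([0.4, 0.0], mu1=[1.0, 0.0], mu2=[-1.0, 0.0], sigma=1.0, p1=0.5, p2=0.5) > 0)   # True: x lies on the c1 side
```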

  20. Learning a Real-Valued Function (figure: target f, hypothesis hML and error e) • Consider a real-valued target function f • Noisy training examples <xi, di> with di = f(xi) + ei, where ei is a random variable drawn from a Gaussian distribution with zero mean • The maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors: hML = argmin h∈H Σ_i (di − h(xi))²

  21. Learning a Real-Valued Function hML = argmax h∈H P(D|h) = argmax h∈H Π_i p(di|h) = argmax h∈H Π_i (2πσ²)^(−1/2) exp(−(di−h(xi))²/2σ²) • Maximizing the logarithm log P(D|h) instead: hML = argmax h∈H Σ_i [−½ log(2πσ²) − (di−h(xi))²/2σ²] = argmax h∈H Σ_i −(di−h(xi))² = argmin h∈H Σ_i (di−h(xi))²
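
A minimal sketch (synthetic data, assumed linear hypothesis class) of the conclusion above: under zero-mean Gaussian noise on the targets, the maximum likelihood hypothesis is the least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
d = 2.0 * x + 0.5 + rng.normal(0.0, 0.1, size=x.shape)   # d_i = f(x_i) + e_i, e_i ~ N(0, 0.1^2)

# argmin_h sum_i (d_i - h(x_i))^2 over linear hypotheses h(x) = w1*x + w0
w1, w0 = np.polyfit(x, d, deg=1)
print(w1, w0)   # close to the true slope 2.0 and intercept 0.5
```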

  22. Learning to Predict Probabilities • Predicting the survival probability of a patient • Training examples <xi, di> where di is 0 or 1 • Objective: train a neural network to output the probability h(xi) = p(di=1) given xi • Maximum likelihood hypothesis: hML = argmax h∈H Σ_i [di ln h(xi) + (1−di) ln (1−h(xi))], i.e. maximize the log-likelihood (equivalently, minimize the cross entropy between di and h(xi)) • Weight update rule for the synapses wk of the output neuron h(xi): wk ← wk + η Σ_i (di − h(xi)) x_ik • Compare to the standard BP weight update rule for squared error: wk ← wk + η Σ_i h(xi)(1 − h(xi))(di − h(xi)) x_ik
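
A sketch (a single hypothetical sigmoid unit, not the slides' network; the learning rate eta is an added assumption) showing that gradient ascent on the objective above yields exactly the update wk ← wk + η Σ_i (di − h(xi)) x_ik.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                              # inputs x_i
d = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)     # binary targets d_i in {0, 1}

w, eta = np.zeros(3), 0.01
for _ in range(2000):
    h = sigmoid(X @ w)                                     # h(x_i) = p(d_i = 1)
    w += eta * (X.T @ (d - h))                             # w_k <- w_k + eta * sum_i (d_i - h(x_i)) x_ik
print(w)                                                   # roughly aligned with the true weights (1.5, -2.0, 0.5)
```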

  23. Most Probable Classification • So far we have sought the most probable hypothesis hMAP • But what is the most probable classification of a new instance x given the data D? hMAP(x) is not necessarily the most probable classification, although it is often a sufficiently good approximation of it • Consider three possible hypotheses: P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3 • Given a new instance x: h1(x) = +, h2(x) = −, h3(x) = −, so hMAP(x) = h1(x) = + • Most probable classification: P(+) = P(h1|D) = 0.4, P(−) = P(h2|D) + P(h3|D) = 0.6

  24. Bayes Optimal Classifier • cmax = argmax cj∈C Σ hi∈H P(cj|hi) P(hi|D) • Example: P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3, with P(+|h1) = 1, P(−|h1) = 0, P(+|h2) = 0, P(−|h2) = 1, P(+|h3) = 0, P(−|h3) = 1 • Therefore Σ hi∈H P(+|hi) P(hi|D) = 0.4 and Σ hi∈H P(−|hi) P(hi|D) = 0.6, so argmax cj∈C Σ hi∈H P(cj|hi) P(hi|D) = −
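
A tiny sketch (the numbers are those of the example above) of the Bayes optimal vote, weighting each hypothesis' prediction by its posterior P(hi|D).

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}     # h_i(x) for the new instance x

vote = {}
for h, p in posteriors.items():
    c = predictions[h]
    vote[c] = vote.get(c, 0.0) + p                  # sum_i P(c|h_i) P(h_i|D), with P(c|h_i) either 0 or 1 here
print(max(vote, key=vote.get), vote)                # -> '-' with total weight 0.6
```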

  25. MAP vs. Bayes Method • The maximum a posteriori hypothesis is a point estimate hMAP in the hypothesis space H • The Bayes method instead estimates and uses the complete posterior distribution P(h|D) • The difference shows when MAP or the Bayes method is used for inference on unseen instances and one compares the predictive distributions P(x|D): MAP: P(x|D) ≈ P(x|hMAP) with hMAP = argmax h∈H P(h|D) Bayes: P(x|D) = Σ hi∈H P(x|hi) P(hi|D) • For reasonable prior distributions P(h), the MAP and Bayes solutions are equivalent in the asymptotic limit of infinite training data D

  26. Naïve Bayes Classifier • Popular, simple learning algorithm • Use when a moderate or large training set is available • Assumption: the attributes that describe instances are conditionally independent given the classification (in practice it works surprisingly well even if the assumption is violated) • Applications: • diagnosis • text classification (newsgroup articles: 20 newsgroups, 1000 documents per newsgroup, classification accuracy 89%)

  27. Naïve Bayes Classifier • Assume a discrete target function f: X → C, where each instance x is described by attributes <a1,a2,…,an> • The most probable value of f(x) is: cMAP = argmax cj∈C P(cj|<a1,a2,…,an>) = argmax cj∈C P(<a1,a2,…,an>|cj) P(cj) / P(<a1,a2,…,an>) = argmax cj∈C P(<a1,a2,…,an>|cj) P(cj) • Naïve Bayes assumption: P(<a1,a2,…,an>|cj) = Π_i P(ai|cj), which gives cNB = argmax cj∈C P(cj) Π_i P(ai|cj)

  28. Naïve Bayes Learning Algorithm Naive_Bayes_Learn(examples): for each target value cj, estimate P(cj); for each value ai of each attribute a, estimate P(ai|cj). Classify_New_Instance(x): cNB = argmax cj∈C P(cj) Π ai∈x P(ai|cj)

  29. Naïve Bayes Example • Consider PlayTennis and the new instance <Outlook=sunny, Temp=cool, Humidity=high, Wind=strong> • Compute cNB = argmax cj∈C P(cj) Π ai∈x P(ai|cj) • PlayTennis (9+, 5−): P(yes) = 9/14, P(no) = 5/14; Wind=strong (3+, 3−): P(strong|yes) = 3/9, P(strong|no) = 3/5, … • P(yes) P(sun|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.005 • P(no) P(sun|no) P(cool|no) P(high|no) P(strong|no) = 0.021, so cNB = no
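
A sketch multiplying out the example above; only two of the conditional probabilities appear explicitly on the slide, so the remaining counts are stated here as an assumption consistent with the usual 14-example PlayTennis table and with the 0.005 and 0.021 totals.

```python
# P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes)
p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9
# P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no)
p_no = 5/14 * 3/5 * 1/5 * 4/5 * 3/5

print(round(p_yes, 3), round(p_no, 3))   # 0.005 0.021 -> c_NB = no
print(round(p_no / (p_yes + p_no), 2))   # ~0.8 after normalizing the two scores
```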

  30. Estimating Probabilities • What if none (nc = 0) of the training instances with target value cj have attribute value ai? Then P(ai|cj) = nc/n = 0 and P(cj) Π ai∈x P(ai|cj) = 0 • Solution: Bayesian estimate (m-estimate) for P(ai|cj): P(ai|cj) = (nc + mp)/(n + m) • n : number of training examples for which c=cj • nc : number of examples for which c=cj and a=ai • p : prior estimate of P(ai|cj) • m : weight given to the prior (number of “virtual” examples)
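
The m-estimate above as a small helper (the function name and the example numbers are ours, not from the slide).

```python
def m_estimate(n_c, n, p, m):
    """Bayesian estimate of P(a_i|c_j): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# e.g. no observed matches (n_c = 0) among n = 5 examples, uniform prior p = 1/3, m = 3 virtual examples
print(m_estimate(n_c=0, n=5, p=1/3, m=3))   # 0.125 instead of a hard 0
```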

  31. Bayesian Belief Networks • The naïve assumption of conditional independence is often too restrictive • Estimating the full joint probability distribution is intractable due to lack of data • Bayesian belief networks describe conditional independence among subsets of variables • This allows combining prior knowledge about causal relationships among variables with observed data

  32. Conditional Independence Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z, that is, if ∀ xi, yj, zk: P(X=xi|Y=yj, Z=zk) = P(X=xi|Z=zk), or more compactly P(X|Y,Z) = P(X|Z) Example: Thunder is conditionally independent of Rain given Lightning: P(Thunder|Rain, Lightning) = P(Thunder|Lightning) Notice: P(Thunder|Rain) ≠ P(Thunder) Naïve Bayes uses conditional independence to justify: P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)

  33. Bayesian Belief Network (figure: directed acyclic graph over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire) The network represents a set of conditional independence assertions: • each node is conditionally independent of its non-descendants, given its immediate predecessors

  34. Bayesian Belief Network The network represents the joint probability distribution over all variables • P(Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire) • P(y1,…,yn) = Π_{i=1..n} P(yi|Parents(Yi)) • The joint distribution is fully defined by the graph plus the conditional probability tables P(yi|Parents(Yi)), e.g. P(Campfire|Storm, BusTourGroup)
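
A sketch (made-up probability numbers, not from the slides) of this factorization for a fragment of the network with only Storm, BusTourGroup and Campfire, where Campfire's table plays the role of P(Campfire|Storm, BusTourGroup).

```python
# Illustrative (invented) probabilities
p_storm = {True: 0.2, False: 0.8}
p_bus = {True: 0.1, False: 0.9}
p_campfire_given = {            # P(Campfire=True | Storm, BusTourGroup)
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    # P(S, B, C) = P(S) P(B) P(C | S, B): each node conditioned only on its parents
    p_c = p_campfire_given[(storm, bus)]
    return p_storm[storm] * p_bus[bus] * (p_c if campfire else 1.0 - p_c)

print(joint(storm=True, bus=False, campfire=False))   # 0.2 * 0.9 * 0.9 = 0.162
```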

  35. Expectation Maximization (EM) When to use: • data is only partially observable • unsupervised clustering: target value unobservable • supervised learning: some instance attributes unobservable Applications: • training Bayesian belief networks • unsupervised clustering • learning hidden Markov models

  36. Generating Data from a Mixture of Gaussians Each instance x is generated by • choosing one of the k Gaussians at random • generating an instance according to that Gaussian

  37. EM for Estimating k Means Given: • instances from X generated by a mixture of k Gaussians • unknown means <μ1,…,μk> of the k Gaussians • it is not known which instance xi was generated by which Gaussian Determine: • maximum likelihood estimates of <μ1,…,μk> Think of the full description of each instance as yi = <xi, zi1, zi2> (here k = 2) • zij is 1 if xi was generated by the j-th Gaussian • xi is observable • zij is unobservable

  38. EM for Estimating k Means EM algorithm: pick a random initial h = <μ1,μ2>, then iterate • E step: calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <μ1,μ2> holds: E[zij] = p(x=xi|μ=μj) / Σ_{n=1..2} p(x=xi|μ=μn) = exp(−(xi−μj)²/2σ²) / Σ_{n=1..2} exp(−(xi−μn)²/2σ²) • M step: calculate a new maximum likelihood hypothesis h' = <μ1',μ2'>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in the E step, and replace h = <μ1,μ2> by h' = <μ1',μ2'>: μj ← Σ_{i=1..m} E[zij] xi / Σ_{i=1..m} E[zij]
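
A compact sketch (synthetic 1D data, with a known common σ assumed, as in the derivation above) of these two steps for k = 2 means.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0
x = np.concatenate([rng.normal(-2.0, sigma, 300), rng.normal(3.0, sigma, 300)])   # mixture samples

mu = rng.normal(size=2)                                        # random initial h = <mu1, mu2>
for _ in range(50):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / 2 sigma^2)
    w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma**2))
    w /= w.sum(axis=1, keepdims=True)
    # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
print(mu)   # close to the true means -2.0 and 3.0 (possibly in swapped order)
```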

  39. EM Algorithm Converges to a local maximum of the likelihood and provides estimates of the hidden variables zij. In fact it finds a local maximum of E[ln P(Y|h)] • Y is the complete data (observable plus unobservable variables) • the expected value is taken over the possible values of the unobserved variables in Y

  40. General EM Problem Given: • observed data X = {x1,…,xm} • unobserved data Z = {z1,…,zm} • parameterized probability distribution P(Y|h) where • Y = {y1,…,ym} is the full data yi=<xi,zi> • h are the parameters Determine: • h that (locally) maximizes E[ln P(Y|h)] Applications: • train Bayesian Belief Networks • unsupervised clustering • hidden Markov models

  41. General EM Method Define a function Q(h'|h) over the full data Y = X ∪ Z, using the observed X and the current parameters h to estimate Z: Q(h'|h) = E[ln P(Y|h') | h, X] EM algorithm: Estimation (E) step: calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y: Q(h'|h) = E[ln P(Y|h') | h, X] Maximization (M) step: replace hypothesis h by the hypothesis h' that maximizes this Q function: h ← argmax h'∈H Q(h'|h)
