
Directed Graphical Probabilistic Models:


Presentation Transcript


  1. Directed Graphical Probabilistic Models: the sequel. William W. Cohen, Machine Learning 10-601, Feb 22 2008

  2. Directed Graphical Probabilistic Models: the son of the child of the bride of the sequel. William W. Cohen, Machine Learning 10-601, Feb 27 2008

  3. Outline • Quick recap • An example of learning • Given structure, find CPTs from “fully observed” data • Some interesting special cases of this • Learning with hidden variables • Expectation-maximization • Handwave argument for why EM works

  4. The story so far: Bayes nets • Many problems can be solved using the joint probability P(X1,…,Xn). • Bayes nets describe a way to compactly write the joint. • For a Bayes net, the joint factors as P(X1,…,Xn) = ∏i P(Xi | parents(Xi)). • Conditional independence: each Xi is conditionally independent of its non-descendants given its parents. [figure: an example network over nodes A, B, C, D, E, labeled with the Monty Hall story: first guess, the money, stick or swap?, the goat, second guess]
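As a concrete illustration of the factored joint (a minimal sketch; the five-node network and all numbers below are made up, not the slide's Monty Hall example):

    # Each CPT is stored as a function from (value, parent values) to a probability.
    parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}
    cpts = {
        "A": lambda v, u: 0.3 if v else 0.7,
        "B": lambda v, u: 0.6 if v else 0.4,
        "C": lambda v, u: (0.9 if (u["A"] or u["B"]) else 0.1) if v else
                          (0.1 if (u["A"] or u["B"]) else 0.9),
        "D": lambda v, u: 0.8 if v == u["C"] else 0.2,
        "E": lambda v, u: 0.5,
    }

    def joint(assignment):
        # P(X1, ..., Xn) = product over i of P(Xi | parents(Xi))
        p = 1.0
        for var, ps in parents.items():
            u = {q: assignment[q] for q in ps}
            p *= cpts[var](assignment[var], u)
        return p

    print(joint({"A": True, "B": False, "C": True, "D": True, "E": False}))  # 0.0432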

  5. The story so far: d-separation • There are three ways a path from X to Y can be blocked by evidence E, one for each way the path passes through an intermediate node Z: a chain X → Z → Y with Z observed, a common cause X ← Z → Y with Z observed, or a collider X → Z ← Y with neither Z nor any of its descendants observed. • X is d-separated from Y given E iff all paths from X to Y are blocked given E…see there? • If X is d-separated from Y given E, then I<X,E,Y>: X is conditionally independent of Y given E.
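For reference, a small d-separation checker (a sketch, not the course's code). It uses the standard equivalence that X and Y are d-separated given E iff they are separated in the moralized graph of the ancestral subgraph of {X, Y} ∪ E; the parents-dict encoding of the network is my own convention.

    from collections import deque

    def ancestors(node, parents):
        seen, stack = set(), [node]
        while stack:
            n = stack.pop()
            for p in parents.get(n, []):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    def d_separated(x, y, evidence, parents):
        # 1. Restrict to X, Y, E and all of their ancestors.
        keep = {x, y, *evidence}
        for n in list(keep):
            keep |= ancestors(n, parents)
        # 2. Moralize: connect co-parents, then drop edge directions.
        adj = {n: set() for n in keep}
        for n in keep:
            ps = [p for p in parents.get(n, []) if p in keep]
            for p in ps:
                adj[n].add(p); adj[p].add(n)
            for i in range(len(ps)):
                for j in range(i + 1, len(ps)):
                    adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
        # 3. X and Y are d-separated iff removing E disconnects them.
        blocked = set(evidence)
        frontier = deque([x] if x not in blocked else [])
        reached = {x}
        while frontier:
            n = frontier.popleft()
            if n == y:
                return False
            for m in adj[n] - reached - blocked:
                reached.add(m)
                frontier.append(m)
        return True

    # Example: the classic collider A -> C <- B.
    parents = {"A": [], "B": [], "C": ["A", "B"]}
    print(d_separated("A", "B", set(), parents))   # True: the collider blocks the path
    print(d_separated("A", "B", {"C"}, parents))   # False: conditioning on C opens it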

  6. The story so far: "Explaining away" [figure: a collider X → E ← Y; once E is observed, X and Y become dependent, so evidence for one cause "explains away" the other]

  7. Recap: Inference in linear chain networks • Instead of recursion you can use "message passing" (forward-backward, as in Baum-Welch)…. [figure: a chain X1 … Xj … Xn with evidence E on either side of Xj; "forward" messages flow from X1 toward Xj and "backward" messages from Xn toward Xj]
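A minimal forward-backward sketch for such a chain, HMM-style, with one observation attached to each Xi (shapes and names are illustrative; no scaling is done, so it only suits short chains):

    import numpy as np

    def forward_backward(init, trans, emit, obs):
        # init:  (K,)    P(X1)
        # trans: (K, K)  trans[i, j] = P(X_{t+1} = j | X_t = i)
        # emit:  (K, M)  emit[i, o]  = P(observation o | X_t = i)
        # obs:   length-n list of observation indices
        n, K = len(obs), len(init)
        alpha = np.zeros((n, K))                    # "forward" messages
        beta = np.ones((n, K))                      # "backward" messages
        alpha[0] = init * emit[:, obs[0]]
        for t in range(1, n):
            alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        for t in range(n - 2, -1, -1):
            beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
        post = alpha * beta
        return post / post.sum(axis=1, keepdims=True)   # row t is P(X_t | all evidence)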

  8. Recap: Inference in polytrees • Reduce P(X|E) to the product of two recursively calculated parts: • P(X=x|E+) • i.e., CPT for X and product of "forward" messages from parents • P(E-|X=x) • i.e., combination of "backward" messages from children, CPTs, and P(Z|EZ\Yk), a simpler instance of P(X|E) • This can also be implemented by message-passing (belief propagation)

  9. Recap: Learning for Bayes nets • Input: • Sample of the joint: a dataset of complete assignments to X1,…,Xn • Graph structure of the variables: for i=1,…,n, you know Xi and parents(Xi) • Output: • Estimated CPTs • Method (discrete variables): • Estimate each CPT independently • Use an MLE or MAP estimate [figure: example network A, B, C, D, E]

  10. Recap: Learning for Bayes nets • Method (discrete variables): • Estimate each CPT independently • Use an MLE or MAP estimate • MAP: smooth the observed counts with Dirichlet pseudo-counts before normalizing (a sketch of both estimates follows below) [figure: example network A, B, C, D, E]
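A sketch of the per-CPT counting estimate, with a Dirichlet pseudo-count alpha so that alpha = 0 gives the plain MLE and alpha > 0 a simple MAP estimate (function and variable names are mine):

    from collections import Counter, defaultdict

    def estimate_cpt(data, child, parents, child_values, alpha=1.0):
        # data: list of dicts mapping variable names to observed values
        counts = defaultdict(Counter)
        for row in data:
            u = tuple(row[p] for p in parents)      # this row's parent configuration
            counts[u][row[child]] += 1
        cpt = {}
        for u, c in counts.items():                 # only configurations seen in data
            total = sum(c.values()) + alpha * len(child_values)
            cpt[u] = {v: (c[v] + alpha) / total for v in child_values}
        return cpt                                  # cpt[parent_values][child_value]

    # Example: estimate P(C | A, B) from three observations.
    data = [{"A": 1, "B": 0, "C": 1}, {"A": 1, "B": 0, "C": 0}, {"A": 0, "B": 1, "C": 1}]
    print(estimate_cpt(data, "C", ["A", "B"], child_values=[0, 1]))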

  11. Recap: A detailed example [figure: the network Z → X, Y and a fully observed dataset D]

  12. A detailed example [figure: worked example, continued]

  13. A detailed example [figure: worked example, continued]

  14. A detailed example • Now we're done learning: what can we do with this? • guess what your favorite professor is doing now? • given a new x,y, compute P(prof|x,y), P(grad|x,y), P(ugrad|x,y) … using Bayes net inference • given a new x,y, predict the most likely "label" • Of course we need to implement our Bayes net inference method first… [figure: the network Z → X, Y]
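A sketch of that prediction step, assuming the example's structure is Z → X and Z → Y with Z the "label" (prof / grad / ugrad); the CPT numbers and feature values below are made up, not the ones from the slides:

    p_z = {"prof": 0.2, "grad": 0.5, "ugrad": 0.3}
    p_x_given_z = {"prof": {"email": 0.6, "code": 0.4},
                   "grad": {"email": 0.3, "code": 0.7},
                   "ugrad": {"email": 0.5, "code": 0.5}}
    p_y_given_z = {"prof": {"day": 0.8, "night": 0.2},
                   "grad": {"day": 0.4, "night": 0.6},
                   "ugrad": {"day": 0.3, "night": 0.7}}

    def posterior_z(x, y):
        # P(z | x, y) is proportional to P(z) P(x | z) P(y | z), because X and Y
        # are conditionally independent given Z in this structure.
        scores = {z: p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y] for z in p_z}
        total = sum(scores.values())
        return {z: s / total for z, s in scores.items()}

    post = posterior_z("code", "night")
    print(post)                        # full posterior over the label
    print(max(post, key=post.get))     # most likely label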

  15. A more interesting example • A class node C with children W1, W2, …, WN whose CPTs are "shared" or "tied" • Equivalently drawn with a "plate": a box around a single node Wi annotated with N, meaning the node is repeated N times

  16. Some special cases of Bayes net learning • Naïve Bayes • HMMs for biology and information extraction • Tree-augmented Naïve Bayes

  17. Another interesting example • A phylogenomic analysis of the Actinomycetales mce operons

  18. Another interesting example [figure: a chain of hidden "position" states Z1, Z2, Z3, Z4, … taking values p1, p2, p3, p4, each Zi emitting an observed symbol Xi]

  19. Another interesting example • Emission parameters are tied across time steps: P(X2|Z2=pos4) = P(X4|Z4=pos4), and in general P(Xi|Zi=pos4) = P(Xj|Zj=pos4) for all i, j [figure: the same chain of states Z1…Z4 emitting X1…X4, with positions p1, p2, p3, an "optional" position, and the motif G(T|A|G)]

  20. Another interesting example • Emission parameters are tied: P(X2|Z2=pos4) = P(X4|Z4=pos4), and in general P(Xi|Zi=pos4) = P(Xj|Zj=pos4) • Three tables define the model: • P(posj|posi) for all i,j … aka transition probabilities • P(x|posi) for all x,i … aka emission probabilities • P(Z1=posi) … the distribution over the starting position [figure: state diagram over p1, p2, p3, p4 with transition probabilities (e.g. 0.5) and per-position emission tables P(X|p1), P(X|p2), …; the G(T|A|G) position is "optional"]
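A sketch of how those three tables might be laid out in code, plus sampling a sequence from them (the positions, symbols, and probabilities are illustrative placeholders, not the slide's values):

    import numpy as np

    positions = ["p1", "p2", "p3", "p4"]
    symbols = ["A", "C", "G", "T"]
    init = np.array([0.5, 0.5, 0.0, 0.0])            # P(Z1 = pos_i)
    trans = np.array([[0.0, 0.5, 0.5, 0.0],          # P(pos_j | pos_i), rows sum to 1
                      [0.0, 0.0, 1.0, 0.0],
                      [0.0, 0.0, 0.0, 1.0],
                      [0.0, 0.0, 0.0, 1.0]])
    emit = np.array([[0.7, 0.1, 0.1, 0.1],           # P(x | pos_i), tied across all steps
                     [0.1, 0.7, 0.1, 0.1],
                     [0.1, 0.1, 0.7, 0.1],
                     [0.1, 0.1, 0.1, 0.7]])

    def sample_sequence(length, seed=0):
        rng = np.random.default_rng(seed)
        z = rng.choice(len(positions), p=init)       # draw the starting position
        out = []
        for _ in range(length):
            out.append(symbols[rng.choice(len(symbols), p=emit[z])])   # emit a symbol
            z = rng.choice(len(positions), p=trans[z])                 # move to next position
        return "".join(out)

    print(sample_sequence(6))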

  21. Another interesting example

  22. IE by text segmentation • Example: addresses and bibliography records, segmented into fields • Address fields (House number, Building, Road, City, State, Zip), e.g.: 4089 Whispering Pines Nobel Drive San Diego CA 92122 • Citation fields (Author, Title, Journal, Year, Volume, Page), e.g.: P.P.Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilising BPN' in Nearly Anhydrous Organic Media J.Amer. Chem. Soc. 115, 12231-12237. • Author, title, year, … are like "positions" in the previous example

  23. IE with Hidden Markov Models • HMMs for IE: hidden states correspond to fields (Title, Author, Journal, Year, …), with a table of transition probabilities between states and an emission-probability table per state • Note: we know how to train this model from segmented citations [figure: a small example with states X, Y, Z, an emission table over symbols A, B, C (rows 0.1/0.1/0.8, 0.4/0.2/0.4, 0.6/0.3/0.1), and a table of transition probabilities between the states]
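A minimal Viterbi sketch for the decoding step (assigning the most likely field to each token); the states, vocabulary, and probabilities are placeholders, not the model from the slides or from Borkar et al.:

    import numpy as np

    def viterbi(init, trans, emit, obs):
        # init: (K,), trans: (K, K), emit: (K, M), obs: token indices.
        # Assumes strictly positive probabilities so the logs are finite.
        n, K = len(obs), len(init)
        score = np.log(init) + np.log(emit[:, obs[0]])
        back = np.zeros((n, K), dtype=int)
        for t in range(1, n):
            cand = score[:, None] + np.log(trans)   # cand[i, j]: best path ending i -> j
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0) + np.log(emit[:, obs[t]])
        path = [int(score.argmax())]
        for t in range(n - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]                           # most likely state (field) per token

    # Toy usage (two fields, three token types):
    init = np.array([0.9, 0.1])
    trans = np.array([[0.8, 0.2], [0.1, 0.9]])
    emit = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
    print(viterbi(init, trans, emit, [0, 0, 2, 2]))   # [0, 0, 1, 1]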

  24. Results: Comparative Evaluation The Nested model does best in all three cases (from Borkar et al., 2001)

  25. Learning with hidden variables • Hidden variables: what if some of your data is not completely observed? • Method: • Estimate parameters somehow or other. • Predict unknown values from your estimate. • Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness. • Re-estimate parameters (MLE/MAP) using the extended dataset (real + pseudo-data). • Repeat starting at step 2…. • This is expectation-maximization, aka EM (see the sketch below). [figure: the network Z → X, Y with Z hidden]
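A sketch of this recipe for the Z → X, Z → Y network with Z entirely hidden (names, toy data, and initialization are illustrative):

    from collections import defaultdict
    import random

    def em(data, z_vals, x_vals, y_vals, iters=20, seed=0):
        rng = random.Random(seed)

        def rand_dist(vals):                        # step 1: estimate somehow or other
            w = [rng.random() + 0.1 for _ in vals]
            s = sum(w)
            return {v: wi / s for v, wi in zip(vals, w)}

        p_z = rand_dist(z_vals)
        p_x = {z: rand_dist(x_vals) for z in z_vals}
        p_y = {z: rand_dist(y_vals) for z in z_vals}
        for _ in range(iters):
            nz, nx, ny = defaultdict(float), defaultdict(float), defaultdict(float)
            for x, y in data:                       # steps 2-3: soft predictions of Z
                w = {z: p_z[z] * p_x[z][x] * p_y[z][y] for z in z_vals}
                tot = sum(w.values())
                for z in z_vals:
                    q = w[z] / tot                  # confidence that Z = z for this example
                    nz[z] += q
                    nx[z, x] += q
                    ny[z, y] += q
            n = len(data)                           # step 4: MLE on the weighted pseudo-data
            p_z = {z: nz[z] / n for z in z_vals}
            p_x = {z: {x: nx[z, x] / nz[z] for x in x_vals} for z in z_vals}
            p_y = {z: {y: ny[z, y] / nz[z] for y in y_vals} for z in z_vals}
        return p_z, p_x, p_y

    # Toy usage with two hidden "labels" and made-up observations:
    data = [("code", "night")] * 6 + [("email", "day")] * 4
    print(em(data, ["grad", "ugrad"], ["email", "code"], ["day", "night"]))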

  26. Learning with Hidden Variables: Example [figure: the network Z → X, Y with Z hidden, and the worked example's data and current estimates]

  27. Learning with Hidden Variables: Example [figure: worked example, continued]

  28. Learning with Hidden Variables: Example [figure: worked example, continued]

  29. Learning with Hidden Variables: Example [figure: worked example, continued]

  30. Learning with Hidden Variables: Example [figure: worked example, continued; values .38, .35, .27]

  31. Learning with Hidden Variables: Example [figure: worked example, continued; values .24, .32, .54]

  32. Learning with hidden variables • Hidden variables: what if some of your data is not completely observed? • Method: • Estimate parameters somehow or other. • Predict unknown values from your estimate. • Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness. • Re-estimate parameters using the extended dataset (real + pseudo-data). • Repeat starting at step 2…. [figure: the network Z → X, Y with Z hidden]

  33. Why does this work? Ignore the prior and work with the MLE objective. For any distribution Q(z) with Q(z) > 0, the log-likelihood can be rewritten as log P(X|θ) = log Σz P(X,z|θ) = log Σz Q(z) · P(X,z|θ)/Q(z).

  34. Why does this work? Same setup: ignore the prior; Q(z) is a pdf with Q(z) > 0; now take Q to be built from an initial estimate of θ.

  35. Jensen's inequality Claim: log(q1x1 + q2x2) ≥ q1·log(x1) + q2·log(x2), where q1 + q2 = 1. Holds for any downward-concave function, not just log(x). Further: log(EQ[X]) ≥ EQ[log(X)]. [figure: the chord from (x1, log x1) to (x2, log x2) lies below the curve, so log evaluated at the average q1x1 + q2x2 is above the average of the logs]
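A quick numeric spot-check of the claim, with arbitrary made-up values for q and x:

    import numpy as np

    q = np.array([0.3, 0.7])                 # q1 + q2 = 1
    x = np.array([0.5, 4.0])
    print(np.log(q @ x), q @ np.log(x))      # about 1.08 vs 0.76
    print(np.log(q @ x) >= q @ np.log(x))    # True: log(E_Q[X]) >= E_Q[log X]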

  36. Why does this work? Ignore the prior; Q(z) is a pdf with Q(z) > 0. Since log(EQ[X]) ≥ EQ[log(X)], the rewritten log-likelihood is bounded below by an expectation under Q. Q can be built from any estimate of θ, say θ0. The new parameter θ' depends on X, Z but not directly on Q, so P(X,Z,θ'|Q) = P(θ'|X,Z,Q)·P(X,Z|Q). So, plugging in pseudo-data weighted by Q and finding the MLE optimizes a lower bound on the log-likelihood.
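Written out, the bound the last few slides sketch is the standard EM lower bound (the LaTeX and exact notation are mine; θ0 is the current estimate and θ' the one being chosen):

    \log P(X \mid \theta')
      \;=\; \log \sum_z P(X, z \mid \theta')
      \;=\; \log \sum_z Q(z)\,\frac{P(X, z \mid \theta')}{Q(z)}
      \;\ge\; \sum_z Q(z)\,\log \frac{P(X, z \mid \theta')}{Q(z)}
      \qquad \text{(Jensen: } \log \mathbb{E}_Q[\cdot] \ge \mathbb{E}_Q[\log(\cdot)] \text{)}

Maximizing the right-hand side over θ' leaves the −Σz Q(z)·log Q(z) term untouched, so it is the same as maximizing Σz Q(z)·log P(X,z|θ'): a weighted MLE on pseudo-data, with the weights Q(z) = P(z|X,θ0) coming from the current estimate θ0.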
