
Graphical Models


Presentation Transcript


  1. Graphical Models David Heckerman Microsoft Research

  2. Overview • Intro to graphical models • Application: Data exploration • Dependency networks → undirected graphs • Directed acyclic graphs (“Bayes nets”) • Applications • Clustering • Evolutionary history/phylogeny

  3. Using classification/regression for data exploration
     Decision tree: [tree splitting on gender (male/female) and age (young/old), with leaf probabilities p(cust)=0.8, p(cust)=0.7, p(cust)=0.2]
     Logistic regression: log[p(cust)/(1-p(cust))] = 0.2 - 0.3*male + 0.4*old
     Neural network: [figure]

  4. Using classification/regression for data exploration: p(target|inputs)
     Decision tree: [same tree as slide 3]
     Logistic regression: log[p(cust)/(1-p(cust))] = 0.2 - 0.3*male + 0.4*old
     Neural network: [figure]
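The logistic regression on slides 3-4 maps the binary indicators male and old to a customer probability through the log-odds. A minimal sketch of that calculation (the coefficients come from the slide; the function name is illustrative):

```python
import math

def p_cust(male: int, old: int) -> float:
    """Customer probability from the slide's logistic regression:
    log-odds = 0.2 - 0.3*male + 0.4*old (male, old are 0/1 indicators)."""
    log_odds = 0.2 - 0.3 * male + 0.4 * old
    return 1.0 / (1.0 + math.exp(-log_odds))

# Example: an old female viewer
print(p_cust(male=0, old=1))  # sigmoid(0.6) ~= 0.646
```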

  5. Conditional independence
     Decision tree: [same tree as slide 3]
     p(cust | gender, age, month born) = p(cust | gender, age)
     p(target | all inputs) = p(target | some inputs)

  6. Learning conditional independence from data: Model Selection • Cross validation • Bayesian methods • Penalized likelihood • Minimum description length
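Any of the listed model-selection methods can decide how much structure (for example, tree depth) the data supports. A small cross-validation sketch using scikit-learn, with made-up data and an assumed depth grid, just to illustrate the idea:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data: binary inputs (e.g., gender, age band) and a binary target (cust).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3))
y = rng.integers(0, 2, size=500)

# Pick the tree depth with the best 5-fold cross-validated accuracy.
best_depth = max(
    range(1, 6),
    key=lambda d: cross_val_score(
        DecisionTreeClassifier(max_depth=d), X, y, cv=5).mean())
print("selected max_depth:", best_depth)
```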

  7. Using classification/regression for data exploration • Suppose you have thousands of variables and you’re not sure about the interactions among those variables • Build a classification/regression model for each variable, using the rest of the variables as inputs

  8. Example with three variables X, Y, and Z
     Target: X, Inputs: Y,Z [tree splits on Y; leaves p(x|y=0), p(x|y=1)]
     Target: Y, Inputs: X,Z [tree splits on X; the X=0 branch splits on Z, giving leaves p(y|x=0,z=0), p(y|x=0,z=1); the X=1 branch gives leaf p(y|x=1)]
     Target: Z, Inputs: X,Y [tree splits on Y; leaves p(z|y=0), p(z|y=1)]

  9. Summarize the trees with a single graph
     [Same three trees as slide 8, summarized by a graph over X, Y, Z with arcs between X and Y and between Y and Z]

  10. Dependency Network • Build a classification/regression model for every variable given the other variables as inputs • Construct a graph where • Nodes correspond to variables • There is an arc from X to Y if X helps to predict Y • The graph along with the individual classification/regression models is a “dependency network” (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
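A rough sketch of the recipe on slide 10, using scikit-learn decision trees in place of the probabilistic trees in the paper; the toy data, the importance threshold, and the function name are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def learn_dependency_network(df: pd.DataFrame, max_depth: int = 3):
    """Return {target: list of parents} by fitting one tree per column."""
    arcs = {}
    for target in df.columns:
        inputs = [c for c in df.columns if c != target]
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(df[inputs], df[target])
        # Arc from X to target if the tree actually splits on X.
        arcs[target] = [x for x, imp in zip(inputs, tree.feature_importances_)
                        if imp > 0]
    return arcs

# Toy binary data: Y depends on X, Z depends on Y.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 1000)
y = np.where(rng.random(1000) < 0.8, x, 1 - x)
z = np.where(rng.random(1000) < 0.8, y, 1 - y)
df = pd.DataFrame({"X": x, "Y": y, "Z": z})
print(learn_dependency_network(df))
```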

  11. Example: TV viewing
      Nielsen data: 2/6/95-2/19/95; ~400 shows, ~3000 viewers
      Goal: exploratory data analysis (acausal)
                  Age  Show1  Show2  Show3
      viewer 1     73    y      n      n
      viewer 2     16    n      y      y
      viewer 3     35    n      n      n
      etc.

  12. A bit of history • Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s • But they didn’t like them, because they could be inconsistent

  13. A consistent dependency network
      [Same three trees and graph as slide 9; here the conditional distributions are consistent with a single joint distribution over X, Y, Z]

  14. An inconsistent dependency network
      [Same structure as slide 13, but with conditional distributions that are not consistent with any single joint distribution over X, Y, Z]

  15. A bit of history • Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s • But they didn’t like them, because they could be inconsistent • So they used a property of consistent dependency networks to develop a new characterization of them

  16. Conditional independence
      [Same three trees and graph as slide 9]
      X ⊥ Z | Y

  17. Conditional independence in a dependency network Each variable is independent of all other variables given its immediate neighbors

  18. Hammersley-Clifford Theorem (Besag 1974) • Given a set of variables which has a positive joint distribution • Where each variable is independent of all other variables given its immediate neighbors in some graph G • It follows that p(x1, ..., xn) ∝ f1(c1) · f2(c2) · ... · fn(cn), where c1, c2, ..., cn are the maximal cliques in the graph G and the fi are “clique potentials”

  19. Example
      [Graph X - Y - Z; its maximal cliques are {X,Y} and {Y,Z}, so p(x,y,z) ∝ f1(x,y) · f2(y,z)]
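A small numerical sketch of slides 18-19 for the graph X - Y - Z: multiply arbitrary clique potentials f1(x,y) and f2(y,z), normalize, and confirm that the resulting joint satisfies X ⊥ Z | Y. The potential values below are made up:

```python
import itertools
import numpy as np

# Hypothetical clique potentials for the maximal cliques {X,Y} and {Y,Z}.
f1 = np.array([[4.0, 1.0],   # f1[x, y]
               [1.0, 4.0]])
f2 = np.array([[3.0, 1.0],   # f2[y, z]
               [1.0, 3.0]])

# Joint: p(x,y,z) proportional to f1(x,y) * f2(y,z).
joint = np.zeros((2, 2, 2))
for x, y, z in itertools.product(range(2), repeat=3):
    joint[x, y, z] = f1[x, y] * f2[y, z]
joint /= joint.sum()  # normalization (the 1/Z in the theorem)

# Check X ⊥ Z | Y at y = 0: p(x,z|y) should factor into p(x|y) * p(z|y).
p_y = joint.sum(axis=(0, 2))
p_xz_given_y0 = joint[:, 0, :] / p_y[0]
p_x_given_y0 = p_xz_given_y0.sum(axis=1)
p_z_given_y0 = p_xz_given_y0.sum(axis=0)
print(np.allclose(p_xz_given_y0, np.outer(p_x_given_y0, p_z_given_y0)))  # True
```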

  20. Consistent dependency networks: Directed arcs not needed
      [The dependency-network graph over X, Y, Z, with arcs in both directions, can be replaced by an undirected graph over X, Y, Z]

  21. A bit of history • Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s • But they didn’t like them, because they could be inconsistent • So they used a property of consistent dependency networks to develop a new characterization of them • “Markov Random Fields” aka “undirected graphs” were born

  22. Inconsistent dependency networks aren’t that bad • They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized) • They are easy to learn from data (build separate classification/regression model for each variable) • Conditional distributions (e.g., trees) are easier to understand than clique potentials

  23. Inconsistent dependency networks aren’t that bad • They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized) • They are easy to learn from data (build separate classification/regression model for each variable) • Conditional distributions (e.g., trees) are easier to understand than clique potentials • Over the last decade, they have proven to be a very useful tool for data exploration

  24. Shortcomings of undirected graphs • Lack a generative story (e.g., Latent Dirichlet Allocation) • Lack a causal story [example causal graph with nodes: cold, lung cancer, sore throat, weight loss, cough]

  25. Solution: Build trees in some order 1. Target: X, Inputs: none [single leaf p(x)] 2. Target: Y, Inputs: X [tree splits on X; leaves p(y|x=0), p(y|x=1)] 3. Target: Z, Inputs: X,Y [tree splits on Y only; leaves p(z|y=0), p(z|y=1)]

  26. Solution: Build trees in some order
      [Same three trees as slide 25, summarized by the directed graph with arcs X → Y and Y → Z]

  27. Some orders are better than others • Random orders • Greedy search • Monte-Carlo methods [two example graphs over X, Y, Z produced by different orderings]

  28. Joint distribution is easy to obtain
      [Trees and graph from slide 26]
      p(x,y,z) = p(x) · p(y|x) · p(z|y)
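A sketch of why the joint is easy to obtain when the trees are built in order: it is just the product of the learned conditionals. The probability values below are placeholders:

```python
import itertools

# Hypothetical conditional distributions read off the trees
# (the tree for Z uses only Y, so p(z|x,y) = p(z|y)).
p_x = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(x, y, z):
    """p(x, y, z) = p(x) * p(y|x) * p(z|y)."""
    return p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]

# The product of conditionals already sums to 1; no extra normalization needed.
print(sum(joint(x, y, z) for x, y, z in itertools.product(range(2), repeat=3)))
```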

  29. Directed Acyclic Graphs (aka Bayes Nets) Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982

  30. The power of graphical models • Easy to understand • Useful for adding prior knowledge to an analysis (e.g., causal knowledge) • The conditional independencies they express make inference more computationally efficient

  31. Inference [Trees and graph from slide 26] What is p(z|x=1)?

  32. Inference: Example [graph over X, Y, Z, W]

  33. Inference: Example (“Elimination Algorithm”) [graph over X, Y, Z, W]

  34. Inference: Example (“Elimination Algorithm”) [graph over X, Y, Z, W]
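The transcript does not spell out the X, Y, Z, W network used on slides 32-34, so here is the elimination idea applied to the earlier chain X → Y → Z and the query p(z|x=1) from slide 31: sum out Y from the product of factors. The probability tables are placeholders:

```python
# Minimal elimination sketch for p(z | x=1) on the chain X -> Y -> Z,
# using placeholder conditional probability tables.
p_y_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def p_z_given_x(x_obs):
    # Eliminate Y: p(z | x) = sum_y p(y | x) * p(z | y).
    result = {}
    for z in (0, 1):
        result[z] = sum(p_y_given_x[x_obs][y] * p_z_given_y[y][z] for y in (0, 1))
    return result  # already normalized, since each factor is a conditional

print(p_z_given_x(1))  # e.g. p(z=0 | x=1) = 0.2*0.9 + 0.8*0.5 = 0.58
```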

  35. Inference • Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables) • Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter, Lauritzen & Spiegelhalter, Dechter) • Exact methods fail when there are many cycles in the graph • MCMC (e.g., Geman and Geman 1984) • Loopy propagation (e.g., Murphy et al. 1999) • Variational methods (e.g., Jordan et al. 1999)

  36. Applications of Graphical Models DAGs and UGs: • Data exploration • Density estimation • Clustering UGs: • Spatial processes DAGs: • Expert systems • Causal discovery

  37. Applications • Clustering • Evolutionary history/phylogeny

  38. Clustering Example: msnbc.com
      Millions of users per day
      Goal: understand what is and isn’t working on the site
      User  Sequence
      1     frontpage news travel travel
      2     news news news news news
      3     frontpage news frontpage news frontpage
      4     news news
      5     frontpage news news travel travel travel
      6     news weather weather weather weather weather
      7     news health health business business business
      8     frontpage sports sports sports weather
      Etc.

  39. Solution [diagram: data → clustering → user clusters] • Cluster users based on their behavior on the site • Display clusters somehow

  40. Generative model for clustering (e.g., AutoClass, Cheeseman & Stutz 1995) [graphical model: a discrete, hidden Cluster node with arcs to the observed 1st page, 2nd page, 3rd page, … nodes]

  41. Sequence Clustering (Cadez, Heckerman, Meek, & Smyth, 2000) [graphical model: a discrete, hidden Cluster node with arcs to the observed page nodes, which form a chain: 1st page → 2nd page → 3rd page → …]
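A hedged sketch of the kind of model behind the sequence-clustering slide: each cluster has its own initial-page and page-transition probabilities, and a user's page sequence is scored under each cluster. The page set and all parameter values below are made up for illustration:

```python
import numpy as np

pages = ["frontpage", "news", "travel", "weather"]
idx = {p: i for i, p in enumerate(pages)}

# Hypothetical parameters for a 2-cluster mixture of first-order Markov chains.
mix = np.array([0.6, 0.4])                    # p(cluster)
init = np.array([[0.5, 0.4, 0.05, 0.05],      # p(first page | cluster)
                 [0.1, 0.6, 0.2, 0.1]])
trans = np.array([np.full((4, 4), 0.25),      # p(next page | page, cluster)
                  np.full((4, 4), 0.25)])

def cluster_posterior(seq):
    """p(cluster | sequence) for one user's page sequence."""
    s = [idx[p] for p in seq]
    lik = np.array([
        init[k, s[0]] * np.prod([trans[k, a, b] for a, b in zip(s, s[1:])])
        for k in range(len(mix))
    ])
    post = mix * lik
    return post / post.sum()

print(cluster_posterior(["frontpage", "news", "travel", "travel"]))
```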

  42. Learning parameters (with missing data) Principles: • Find the parameters that maximize the (log) likelihood of the data • Find the parameters whose posterior probability is a maximum • Find distributions for quantities of interest by averaging over the unknown parameters Gradient methods or the EM algorithm are typically used for the first two

  43. Expectation-Maximization (EM) algorithm (Dempster, Laird, Rubin 1977)
      Initialize parameters (e.g., at random)
      Expectation step: compute probabilities for values of the unobserved variable using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret the data as a set of fractional cases based on these probabilities
      Maximization step: choose parameters so as to maximize the log likelihood of the fractional data
      Parameters will converge to a local maximum of log p(data)

  44. E-step
      Suppose the cluster model has 2 clusters, and that
      p(cluster=1 | case, current params) = 0.7
      p(cluster=2 | case, current params) = 0.3
      Then write q(case) = 0.7 log p(case, cluster=1 | params) + 0.3 log p(case, cluster=2 | params)
      Do this for each case and then find the parameters that maximize Q = Σ_case q(case). These parameters are guaranteed not to decrease the log likelihood.
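A compact sketch of the EM loop on slides 43-44 for a two-cluster model of binary data: the E-step computes fractional cluster memberships (the 0.7/0.3 of the example), and the M-step re-estimates parameters from those fractional counts. The naive-Bayes-style cluster model and the toy data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3)).astype(float)  # toy binary data

K, D = 2, X.shape[1]
mix = np.full(K, 1.0 / K)               # p(cluster)
theta = rng.uniform(0.3, 0.7, (K, D))   # p(feature=1 | cluster)

for _ in range(50):
    # E-step (inference): fractional cluster memberships per case.
    like = np.prod(theta[None] ** X[:, None] *
                   (1 - theta[None]) ** (1 - X[:, None]), axis=2)
    resp = mix * like
    resp /= resp.sum(axis=1, keepdims=True)   # e.g. [0.7, 0.3] for one case

    # M-step: re-estimate parameters from the fractional data.
    nk = resp.sum(axis=0)
    mix = nk / len(X)
    theta = (resp.T @ X) / nk[:, None]

print(mix)  # parameters converge to a local maximum of log p(data)
```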

  45. Demo: SQL Server 2005
      Example: msnbc.com
      User  Sequence
      1     frontpage news travel travel
      2     news news news news news
      3     frontpage news frontpage news frontpage
      4     news news
      5     frontpage news news travel travel travel
      6     news weather weather weather weather weather
      7     news health health business business business
      8     frontpage sports sports sports weather
      Etc.

  46. Sequence clustering Other applications at Microsoft: • Analyze how people use programs (e.g. Office) • Analyze web traffic for intruders (anomaly detection)

  47. Computational biology applications • Evolutionary history/phylogeny • Vaccine for AIDS

  48. Evolutionary History/Phylogeny (Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004)
      [Figure: phylogenetic tree over mammal species, grouped into Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, and Hedgehogs, with bandicoot, wallaroo, opossum, and platypus as outgroups]
