
Graphical Models


Presentation Transcript


  1. Graphical Models David Heckerman Microsoft Research

  2. Overview • Intro to graphical models • Application: Data exploration • Dependency networks → undirected graphs • Directed acyclic graphs (“Bayes nets”) • Applications • Clustering • Evolutionary history/phylogeny

  3. Using classification/regression for data exploration
     Decision tree: [tree splitting on gender (male/female) and age (young/old), with leaf probabilities p(cust)=0.8, p(cust)=0.7, p(cust)=0.2]
     Logistic regression: log[p(cust)/(1-p(cust))] = 0.2 - 0.3*male + 0.4*old
     Neural network: [figure]

  4. Using classification/regression for data exploration: p(target|inputs)
     Decision tree: [same tree as slide 3]
     Logistic regression: log[p(cust)/(1-p(cust))] = 0.2 - 0.3*male + 0.4*old
     Neural network: [figure]
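The logistic regression on slides 3-4 maps the binary indicators male and old to a customer probability through the log-odds. A minimal sketch of that calculation (the coefficients come from the slide; the function name is illustrative):

```python
import math

def p_cust(male: int, old: int) -> float:
    """Customer probability from the slide's logistic regression:
    log-odds = 0.2 - 0.3*male + 0.4*old (male, old are 0/1 indicators)."""
    log_odds = 0.2 - 0.3 * male + 0.4 * old
    return 1.0 / (1.0 + math.exp(-log_odds))

# Example: an old female viewer
print(p_cust(male=0, old=1))  # sigmoid(0.6) ~= 0.646
```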

  5. Conditional independence
     Decision tree: [same tree as slide 3]
     p(cust | gender, age, month born) = p(cust | gender, age)
     p(target | all inputs) = p(target | some inputs)

  6. Learning conditional independence from data: Model Selection • Cross validation • Bayesian methods • Penalized likelihood • Minimum description length
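Any of the listed model-selection methods can decide how much structure (for example, tree depth) the data supports. A small cross-validation sketch using scikit-learn, with made-up data and an assumed depth grid, just to illustrate the idea:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data: binary inputs (e.g., gender, age band) and a binary target (cust).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3))
y = rng.integers(0, 2, size=500)

# Pick the tree depth with the best 5-fold cross-validated accuracy.
best_depth = max(
    range(1, 6),
    key=lambda d: cross_val_score(
        DecisionTreeClassifier(max_depth=d), X, y, cv=5).mean())
print("selected max_depth:", best_depth)
```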

  7. Using classification/regression for data exploration • Suppose you have thousands of variables and you’re not sure about the interactions among those variables • Build a classification/regression model for each variable, using the rest of the variables as inputs

  8. Example with three variables X, Y, and Z
     Target: X, Inputs: Y,Z [tree splits on Y; leaves p(x|y=0), p(x|y=1)]
     Target: Y, Inputs: X,Z [tree splits on X; the X=0 branch splits on Z, giving leaves p(y|x=0,z=0), p(y|x=0,z=1); the X=1 branch gives leaf p(y|x=1)]
     Target: Z, Inputs: X,Y [tree splits on Y; leaves p(z|y=0), p(z|y=1)]

  9. Summarize the trees with a single graph
     [Same three trees as slide 8, summarized by a graph over X, Y, Z with arcs between X and Y and between Y and Z]

  10. Dependency Network • Build a classification/regression model for every variable given the other variables as inputs • Construct a graph where • Nodes correspond to variables • There is an arc from X to Y if X helps to predict Y • The graph along with the individual classification/regression models is a “dependency network” (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
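A rough sketch of the recipe on slide 10, using scikit-learn decision trees in place of the probabilistic trees in the paper; the toy data, the importance threshold, and the function name are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def learn_dependency_network(df: pd.DataFrame, max_depth: int = 3):
    """Return {target: list of parents} by fitting one tree per column."""
    arcs = {}
    for target in df.columns:
        inputs = [c for c in df.columns if c != target]
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(df[inputs], df[target])
        # Arc from X to target if the tree actually splits on X.
        arcs[target] = [x for x, imp in zip(inputs, tree.feature_importances_)
                        if imp > 0]
    return arcs

# Toy binary data: Y depends on X, Z depends on Y.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 1000)
y = np.where(rng.random(1000) < 0.8, x, 1 - x)
z = np.where(rng.random(1000) < 0.8, y, 1 - y)
df = pd.DataFrame({"X": x, "Y": y, "Z": z})
print(learn_dependency_network(df))
```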

  11. Example: TV viewing
      Nielsen data: 2/6/95-2/19/95; ~400 shows, ~3000 viewers
      Goal: exploratory data analysis (acausal)
                  Age  Show1  Show2  Show3
      viewer 1     73    y      n      n
      viewer 2     16    n      y      y
      viewer 3     35    n      n      n
      etc.

  12. A bit of history • Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s • But they didn’t like them, because they could be inconsistent

  13. A consistent dependency network
      [Same three trees and graph as slide 9; here the conditional distributions are consistent with a single joint distribution over X, Y, Z]

  14. An inconsistent dependency network
      [Same structure as slide 13, but with conditional distributions that are not consistent with any single joint distribution over X, Y, Z]

  15. A bit of history • Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s • But they didn’t like them, because they could be inconsistent • So they used a property of consistent dependency networks to develop a new characterization of them

  16. Conditional independence
      [Same three trees and graph as slide 9]
      X ⊥ Z | Y

  17. Conditional independence in a dependency network Each variable is independent of all other variables given its immediate neighbors

  18. Hammersley-Clifford Theorem (Besag 1974) • Given a set of variables which has a positive joint distribution • Where each variable is independent of all other variables given its immediate neighbors in some graph G • It follows that p(x1, ..., xn) ∝ f1(c1) · f2(c2) · ... · fn(cn), where c1, c2, ..., cn are the maximal cliques in the graph G and the fi are “clique potentials”

  19. Example
      [Graph X - Y - Z; its maximal cliques are {X,Y} and {Y,Z}, so p(x,y,z) ∝ f1(x,y) · f2(y,z)]
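A small numerical sketch of slides 18-19 for the graph X - Y - Z: multiply arbitrary clique potentials f1(x,y) and f2(y,z), normalize, and confirm that the resulting joint satisfies X ⊥ Z | Y. The potential values below are made up:

```python
import itertools
import numpy as np

# Hypothetical clique potentials for the maximal cliques {X,Y} and {Y,Z}.
f1 = np.array([[4.0, 1.0],   # f1[x, y]
               [1.0, 4.0]])
f2 = np.array([[3.0, 1.0],   # f2[y, z]
               [1.0, 3.0]])

# Joint: p(x,y,z) proportional to f1(x,y) * f2(y,z).
joint = np.zeros((2, 2, 2))
for x, y, z in itertools.product(range(2), repeat=3):
    joint[x, y, z] = f1[x, y] * f2[y, z]
joint /= joint.sum()  # normalization (the 1/Z in the theorem)

# Check X ⊥ Z | Y at y = 0: p(x,z|y) should factor into p(x|y) * p(z|y).
p_y = joint.sum(axis=(0, 2))
p_xz_given_y0 = joint[:, 0, :] / p_y[0]
p_x_given_y0 = p_xz_given_y0.sum(axis=1)
p_z_given_y0 = p_xz_given_y0.sum(axis=0)
print(np.allclose(p_xz_given_y0, np.outer(p_x_given_y0, p_z_given_y0)))  # True
```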

  20. Consistent dependency networks: Directed arcs not needed
      [The dependency-network graph over X, Y, Z, with arcs in both directions, can be replaced by an undirected graph over X, Y, Z]

  21. A bit of history • Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s • But they didn’t like them, because they could be inconsistent • So they used a property of consistent dependency networks to develop a new characterization of them • “Markov Random Fields” aka “undirected graphs” were born

  22. Inconsistent dependency networks aren’t that bad • They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized) • They are easy to learn from data (build separate classification/regression model for each variable) • Conditional distributions (e.g., trees) are easier to understand than clique potentials

  23. Inconsistent dependency networks aren’t that bad • They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized) • They are easy to learn from data (build separate classification/regression model for each variable) • Conditional distributions (e.g., trees) are easier to understand than clique potentials • Over the last decade, they have proven to be a very useful tool for data exploration

  24. Shortcomings of undirected graphs • Lack a generative story (e.g., Latent Dirichlet Allocation) • Lack a causal story [example causal graph with nodes: cold, lung cancer, sore throat, weight loss, cough]

  25. Solution: Build trees in some order 1. Target: X, Inputs: none [single leaf p(x)] 2. Target: Y, Inputs: X [tree splits on X; leaves p(y|x=0), p(y|x=1)] 3. Target: Z, Inputs: X,Y [tree splits on Y only; leaves p(z|y=0), p(z|y=1)]

  26. Solution: Build trees in some order
      [Same three trees as slide 25, summarized by the directed graph with arcs X → Y and Y → Z]

  27. Some orders are better than others • Random orders • Greedy search • Monte-Carlo methods [two example graphs over X, Y, Z produced by different orderings]

  28. Joint distribution is easy to obtain
      [Trees and graph from slide 26]
      p(x,y,z) = p(x) · p(y|x) · p(z|y)
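A sketch of why the joint is easy to obtain when the trees are built in order: it is just the product of the learned conditionals. The probability values below are placeholders:

```python
import itertools

# Hypothetical conditional distributions read off the trees
# (the tree for Z uses only Y, so p(z|x,y) = p(z|y)).
p_x = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(x, y, z):
    """p(x, y, z) = p(x) * p(y|x) * p(z|y)."""
    return p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]

# The product of conditionals already sums to 1; no extra normalization needed.
print(sum(joint(x, y, z) for x, y, z in itertools.product(range(2), repeat=3)))
```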

  29. Directed Acyclic Graphs (aka Bayes Nets) Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982

  30. The power of graphical models • Easy to understand • Useful for adding prior knowledge to an analysis (e.g., causal knowledge) • The conditional independencies they express make inference more computationally efficient

  31. Inference [Trees and graph from slide 26] What is p(z|x=1)?

  32. Inference: Example [graph over X, Y, Z, W]

  33. Inference: Example (“Elimination Algorithm”) [graph over X, Y, Z, W]

  34. Inference: Example (“Elimination Algorithm”) [graph over X, Y, Z, W]
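The transcript does not spell out the X, Y, Z, W network used on slides 32-34, so here is the elimination idea applied to the earlier chain X → Y → Z and the query p(z|x=1) from slide 31: sum out Y from the product of factors. The probability tables are placeholders:

```python
# Minimal elimination sketch for p(z | x=1) on the chain X -> Y -> Z,
# using placeholder conditional probability tables.
p_y_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def p_z_given_x(x_obs):
    # Eliminate Y: p(z | x) = sum_y p(y | x) * p(z | y).
    result = {}
    for z in (0, 1):
        result[z] = sum(p_y_given_x[x_obs][y] * p_z_given_y[y][z] for y in (0, 1))
    return result  # already normalized, since each factor is a conditional

print(p_z_given_x(1))  # e.g. p(z=0 | x=1) = 0.2*0.9 + 0.8*0.5 = 0.58
```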

  35. Inference • Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables) • Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter, Lauritzen & Spiegelhalter, Dechter) • Exact methods fail when there are many cycles in the graph • MCMC (e.g., Geman and Geman 1984) • Loopy propagation (e.g., Murphy et al. 1999) • Variational methods (e.g., Jordan et al. 1999)

  36. Applications of Graphical Models DAGs and UGs: • Data exploration • Density estimation • Clustering UGs: • Spatial processes DAGs: • Expert systems • Causal discovery

  37. Applications • Clustering • Evolutionary history/phylogeny

  38. Clustering Example: msnbc.com
      Millions of users per day
      Goal: understand what is and isn’t working on the site
      User  Sequence
      1     frontpage news travel travel
      2     news news news news news
      3     frontpage news frontpage news frontpage
      4     news news
      5     frontpage news news travel travel travel
      6     news weather weather weather weather weather
      7     news health health business business business
      8     frontpage sports sports sports weather
      Etc.

  39. Solution [diagram: data → clustering → user clusters] • Cluster users based on their behavior on the site • Display clusters somehow

  40. Generative model for clustering (e.g., AutoClass, Cheeseman & Stutz 1995) [graphical model: a discrete, hidden Cluster node with arcs to the observed 1st page, 2nd page, 3rd page, … nodes]

  41. Sequence Clustering (Cadez, Heckerman, Meek, & Smyth, 2000) [graphical model: a discrete, hidden Cluster node with arcs to the observed page nodes, which form a chain: 1st page → 2nd page → 3rd page → …]
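A hedged sketch of the kind of model behind the sequence-clustering slide: each cluster has its own initial-page and page-transition probabilities, and a user's page sequence is scored under each cluster. The page set and all parameter values below are made up for illustration:

```python
import numpy as np

pages = ["frontpage", "news", "travel", "weather"]
idx = {p: i for i, p in enumerate(pages)}

# Hypothetical parameters for a 2-cluster mixture of first-order Markov chains.
mix = np.array([0.6, 0.4])                    # p(cluster)
init = np.array([[0.5, 0.4, 0.05, 0.05],      # p(first page | cluster)
                 [0.1, 0.6, 0.2, 0.1]])
trans = np.array([np.full((4, 4), 0.25),      # p(next page | page, cluster)
                  np.full((4, 4), 0.25)])

def cluster_posterior(seq):
    """p(cluster | sequence) for one user's page sequence."""
    s = [idx[p] for p in seq]
    lik = np.array([
        init[k, s[0]] * np.prod([trans[k, a, b] for a, b in zip(s, s[1:])])
        for k in range(len(mix))
    ])
    post = mix * lik
    return post / post.sum()

print(cluster_posterior(["frontpage", "news", "travel", "travel"]))
```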

  42. Learning parameters (with missing data) Principles: • Find the parameters that maximize the (log) likelihood of the data • Find the parameters whose posterior probability is a maximum • Find distributions for quantities of interest by averaging over the unknown parameters Gradient methods or the EM algorithm are typically used for the first two

  43. Expectation-Maximization (EM) algorithm (Dempster, Laird, Rubin 1977)
      Initialize parameters (e.g., at random)
      Expectation step: compute probabilities for values of the unobserved variable using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret the data as a set of fractional cases based on these probabilities
      Maximization step: choose parameters so as to maximize the log likelihood of the fractional data
      Parameters will converge to a local maximum of log p(data)

  44. E-step
      Suppose the cluster model has 2 clusters, and that
      p(cluster=1 | case, current params) = 0.7
      p(cluster=2 | case, current params) = 0.3
      Then write q(case) = 0.7 log p(case, cluster=1 | params) + 0.3 log p(case, cluster=2 | params)
      Do this for each case and then find the parameters that maximize Q = Σ_case q(case). These parameters are guaranteed not to decrease the log likelihood.
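A compact sketch of the EM loop on slides 43-44 for a two-cluster model of binary data: the E-step computes fractional cluster memberships (the 0.7/0.3 of the example), and the M-step re-estimates parameters from those fractional counts. The naive-Bayes-style cluster model and the toy data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3)).astype(float)  # toy binary data

K, D = 2, X.shape[1]
mix = np.full(K, 1.0 / K)               # p(cluster)
theta = rng.uniform(0.3, 0.7, (K, D))   # p(feature=1 | cluster)

for _ in range(50):
    # E-step (inference): fractional cluster memberships per case.
    like = np.prod(theta[None] ** X[:, None] *
                   (1 - theta[None]) ** (1 - X[:, None]), axis=2)
    resp = mix * like
    resp /= resp.sum(axis=1, keepdims=True)   # e.g. [0.7, 0.3] for one case

    # M-step: re-estimate parameters from the fractional data.
    nk = resp.sum(axis=0)
    mix = nk / len(X)
    theta = (resp.T @ X) / nk[:, None]

print(mix)  # parameters converge to a local maximum of log p(data)
```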

  45. Demo: SQL Server 2005
      Example: msnbc.com
      User  Sequence
      1     frontpage news travel travel
      2     news news news news news
      3     frontpage news frontpage news frontpage
      4     news news
      5     frontpage news news travel travel travel
      6     news weather weather weather weather weather
      7     news health health business business business
      8     frontpage sports sports sports weather
      Etc.

  46. Sequence clustering Other applications at Microsoft: • Analyze how people use programs (e.g. Office) • Analyze web traffic for intruders (anomaly detection)

  47. Computational biology applications • Evolutionary history/phylogeny • Vaccine for AIDS

  48. Evolutionary History/Phylogeny (Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004)
      [Figure: phylogenetic tree over mammal species, grouped into Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, and Hedgehogs, with bandicoot, wallaroo, opossum, and platypus as outgroups]
