# Graphical Models

Graphical Models. David Heckerman, Microsoft Research. Overview: intro to graphical models; application: data exploration; dependency networks → undirected graphs; directed acyclic graphs (“Bayes nets”); applications: clustering, evolutionary history/phylogeny.


David Heckerman

Microsoft Research

• Intro to graphical models

• Application: Data exploration

• Dependency networks → undirected graphs

• Directed acyclic graphs (“Bayes nets”)

• Applications

• Clustering

• Evolutionary history/phylogeny

Using classification/regression for data exploration

p(target|inputs)

Decision tree:

[Figure: decision tree with leaves female: p(cust)=0.8; young: p(cust)=0.7; old: p(cust)=0.2]

Logistic regression:

log[p(cust)/(1-p(cust))] = 0.2 - 0.3*male + 0.4*old

Neural network:

Conditional independence

Decision tree:

[Figure: decision tree with leaves female: p(cust)=0.8; young: p(cust)=0.7; old: p(cust)=0.2]

p(cust | gender, age, month born) = p(cust | gender, age)

p(target | all inputs) = p(target | some inputs)
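The conditional independence statement can be checked numerically on the logistic regression from the slides. The sketch below is only an illustration: the function name `p_cust` and the `month_born` argument are hypothetical, and only the coefficients 0.2, -0.3, and 0.4 come from the transcript.

```python
import math


def p_cust(male: int, old: int, month_born: int = 1) -> float:
    """p(cust) under the logistic regression quoted in the slides.

    month_born is accepted but never used (a hypothetical extra input):
    the model has no term for it, so p(cust | gender, age, month born)
    equals p(cust | gender, age).
    """
    log_odds = 0.2 - 0.3 * male + 0.4 * old
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    for male in (0, 1):
        for old in (0, 1):
            print(f"male={male} old={old} p(cust)={p_cust(male, old):.3f}")
```

Changing `month_born` leaves the result unchanged, which is the sense in which the target depends on only some of the inputs.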

• Cross validation

• Bayesian methods

• Penalized likelihood

• Minimum description length

• Suppose you have thousands of variables and you’re not sure about the interactions among those variables

• Build a classification/regression model for each variable, using the rest of the variables as inputs

Example with three variables X, Y, and Z

Target: X, Inputs: Y,Z
[Tree: splits on Y; leaves p(x|y=0) and p(x|y=1)]

Target: Y, Inputs: X,Z
[Tree: splits on X, then on Z when X=0; leaves p(y|x=0,z=0), p(y|x=0,z=1), and p(y|x=1)]

Target: Z, Inputs: X,Y
[Tree: splits on Y; leaves p(z|y=0) and p(z|y=1)]

Summarize the trees with a single graph

[Figure: the three trees for targets X, Y, and Z, together with a single graph over the nodes X, Y, and Z]

• Build a classification/regression model for every variable given the other variables as inputs

• Construct a graph where

• Nodes correspond to variables

• There is an arc from X to Y if X helps to predict Y

• The graph along with the individual classification/regression models is a “dependency network”

(Heckerman, Chickering, Meek, Rounthwaite, Cadie 2000)
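The construction above can be sketched in a few lines. This is a minimal illustration, not the method of the cited paper: it assumes binary variables and replaces learned trees with exhaustive empirical conditional tables, adding an arc from A to B whenever p(B | A, rest) changes with A for some setting of the remaining variables. All function names and the toy data are hypothetical.

```python
from itertools import product


def cond_prob(data, target, given):
    """Empirical p(target=1 | given), where given maps variable -> value."""
    rows = [r for r in data if all(r[v] == val for v, val in given.items())]
    if not rows:
        return None  # this configuration was never observed
    return sum(r[target] for r in rows) / len(rows)


def dependency_network(data, variables, tol=1e-9):
    """Return the set of arcs (a, b), read as 'a helps to predict b'."""
    arcs = set()
    for target in variables:
        for inp in variables:
            if inp == target:
                continue
            rest = [v for v in variables if v not in (target, inp)]
            for values in product((0, 1), repeat=len(rest)):
                ctx = dict(zip(rest, values))
                p0 = cond_prob(data, target, {**ctx, inp: 0})
                p1 = cond_prob(data, target, {**ctx, inp: 1})
                if p0 is not None and p1 is not None and abs(p0 - p1) > tol:
                    arcs.add((inp, target))
                    break
    return arcs


def chain_data():
    """Toy cases with counts proportional to p(x) p(y|x) p(z|y)."""
    data = []
    for x, y, z in product((0, 1), repeat=3):
        py = 3 if y == x else 1  # p(y|x) in quarters
        pz = 3 if z == y else 1  # p(z|y) in quarters
        data += [{"X": x, "Y": y, "Z": z}] * (py * pz)
    return data


if __name__ == "__main__":
    print(sorted(dependency_network(chain_data(), ["X", "Y", "Z"])))
```

On the toy data, Z depends on X only through Y, so no arc is drawn between X and Z in either direction. With real data one would fit a regularized model per variable rather than compare raw frequencies.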

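Example: TV viewing

Nielsen data: 2/6/95-2/19/95

| | Age | Show1 | Show2 | Show3 | ... |
|---|---|---|---|---|---|
| viewer 1 | 73 | y | n | n | ... |
| viewer 2 | 16 | n | y | y | ... |
| viewer 3 | 35 | n | n | n | ... |
| etc. | | | | | |

~400 shows, ~3000 viewers

Goal: exploratory data analysis (acausal)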

• Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s

• But they didn’t like them, because they could be inconsistent

A consistent dependency network

[Figure: the three trees for targets X, Y, and Z, together with the graph over the nodes X, Y, and Z]

An inconsistent dependency network

[Figure: the three trees for targets X, Y, and Z, together with the graph over the nodes X, Y, and Z]

• Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s

• But they didn’t like them, because they could be inconsistent

• So they used a property of consistent dependency networks to develop a new characterization of them

Conditional independence

[Figure: the three trees for targets X, Y, and Z, together with the graph over the nodes X, Y, and Z]

X ⊥ Z | Y

Each variable is independent of all other variables given its immediate neighbors

Hammersley-Clifford Theorem (Besag 1974)

• Given a set of variables which has a positive joint distribution

• Where each variable is independent of all other variables given its immediate neighbors in some graph G

• It follows that

$$p(\mathbf{x}) \propto \prod_{i=1}^{n} f_i(c_i)$$

where $c_1, c_2, \ldots, c_n$ are the maximal cliques in the graph G and the factors $f_i$ are the “clique potentials”.

[Figure: graph over the nodes X, Y, and Z]
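As a worked instance, assume the graph is the chain X – Y – Z, which is what the conditional independence of X and Z given Y stated earlier suggests. Its maximal cliques are {X, Y} and {Y, Z}, so the theorem gives a product of two potentials. The particular choice of potentials shown is one valid option among many, not something taken from the slides.

```latex
p(x, y, z) \propto f_1(x, y)\, f_2(y, z),
\qquad \text{for example}\quad
f_1(x, y) = p(x, y), \quad f_2(y, z) = p(z \mid y).
```

With this choice the product equals p(x, y) p(z | y), which is exactly p(x, y, z) whenever X and Z are independent given Y.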

• “Markov Random Fields” aka “undirected graphs” were born

• They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized)

• They are easy to learn from data (build separate classification/regression model for each variable)

• Conditional distributions (e.g., trees) are easier to understand than clique potentials

• Over the last decade, they have proven to be a very useful tool for data exploration

• Lack a generative story (e.g., Latent Dirichlet Allocation)

• Lack a causal story

[Figure: graph over the variables cold, lung cancer, sore throat, weight loss, and cough]

Solution: Build trees in some order

1. Target: X, Inputs: none
[Model: p(x)]

2. Target: Y, Inputs: X
[Tree: splits on X; leaves p(y|x=0) and p(y|x=1)]

3. Target: Z, Inputs: X,Y
[Tree: splits on Y; leaves p(z|y=0) and p(z|y=1)]

[Figure: the same three ordered trees, summarized as a graph over the nodes X, Y, and Z]

• Random orders

• Greedy search

• Monte-Carlo methods

[Figure: two graphs over the nodes X, Y, and Z]

Joint distribution is easy to obtain

[Figure: the three ordered trees (p(x); p(y|x); p(z|y)) and the graph over the nodes X, Y, and Z]
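The joint distribution itself did not survive in the transcript. Under the ordering X, Y, Z used in the trees, it would follow from the chain rule, with the last factor simplified because the tree for Z splits only on Y. The derivation below is a reconstruction under that reading, not a quote from the slide.

```latex
p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid x, y)
           = p(x)\, p(y \mid x)\, p(z \mid y)
```

Each factor is one of the ordered models, so the product is always a consistent joint distribution.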

Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982

• Easy to understand

• Useful for adding prior knowledge to an analysis (e.g., causal knowledge)

• The conditional independencies they express make inference more computationally efficient

Inference

[Figure: the three ordered trees (p(x); p(y|x); p(z|y)) and the graph over the nodes X, Y, and Z]

What is p(z|x=1)?
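A small numeric sketch of this query, assuming the model p(x) p(y|x) p(z|y) given by the ordered trees. All probability values below are made up for illustration. The answer comes from summing out Y, and a brute-force computation over the full joint is included as a cross-check.

```python
# Hypothetical conditional probability tables for binary X, Y, Z.
p_x = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}    # p_y_given_x[x][y]
p_z_given_y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}  # p_z_given_y[y][z]


def joint(x, y, z):
    """p(x, y, z) = p(x) p(y|x) p(z|y)."""
    return p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]


def p_z_given_x(z, x):
    """Sum out Y using the factorization: p(z|x) = sum_y p(y|x) p(z|y)."""
    return sum(p_y_given_x[x][y] * p_z_given_y[y][z] for y in (0, 1))


def p_z_given_x_brute(z, x):
    """Same query computed from the full joint, for comparison."""
    numerator = sum(joint(x, y, z) for y in (0, 1))
    denominator = sum(joint(x, y, zz) for y in (0, 1) for zz in (0, 1))
    return numerator / denominator


if __name__ == "__main__":
    print(p_z_given_x(1, 1), p_z_given_x_brute(1, 1))
```

The first function never touches p(x) or builds the joint table, which is the kind of saving that conditional independence makes possible in larger graphs.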

Inference: Example (“Elimination Algorithm”)

[Figure: graph over the nodes W, X, Y, and Z]

• Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables)

• Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter, Lauritzen & Spiegelhalter, Dechter)

• Exact methods fail when there are many cycles in the graph

• MCMC (e.g., Geman and Geman 1984)

• Loopy propagation (e.g., Murphy et al. 1999)

• Variational methods (e.g., Jordan et al. 1999)

DAGs and UGs:

• Data exploration

• Density estimation

• Clustering

UGs:

• Spatial processes

DAGs:

• Expert systems

• Causal discovery

• Clustering

• Evolutionary history/phylogeny

Example: msnbc.com

User Sequence

1 frontpage news travel travel

2 news news news news news

3 frontpage news frontpage news frontpage

4 news news

5 frontpage news news travel travel travel

6 news weather weather weather weather weather

8 frontpage sports sports sports weather

Etc.

Millions of users per day

Goal: understand what is and isn’t working on the site

[Figure: data → Cluster → User clusters]

• Cluster users based on their behavior on the site

• Display clusters somehow

Generative model for clustering (e.g., AutoClass, Cheeseman & Stutz 1995)

[Figure: a discrete, hidden Cluster node and the nodes 1st page, 2nd page, 3rd page]

Sequence Clustering (Cadez, Heckerman, Meek, & Smyth, 2000)

[Figure: a discrete, hidden Cluster node and the nodes 1st page, 2nd page, 3rd page]

Principles:

• Find the parameters that maximize the (log) likelihood of the data

• Find the parameters whose posterior probability is a maximum

• Find distributions for quantities of interest by averaging over the unknown parameters

Gradient methods or EM algorithm typically used for first two

Expectation-Maximization (EM) algorithm (Dempster, Laird, Rubin 1977)

Initialize parameters (e.g., at random)

Expectation step: compute probabilities for values of unobserved variable using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret data as set of fractional cases based on these probabilities

Maximization step: choose parameters so as to maximize the log likelihood of the fractional data

Parameters will converge to a local maximum of log p(data)

Suppose cluster model has 2 clusters, and that

p(cluster=1|case,current params) = 0.7

p(cluster=2|case,current params) = 0.3

Then, write

q(case) = 0.7 log p(case,cluster=1|params) + 0.3 log p(case,cluster=2|params)

Do this for each case and then find the parameters that maximize Q = Σ_case q(case). These parameters also maximize the log likelihood of the fractional data.
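The E and M steps can be sketched for a small cluster model. This is a minimal illustration under stated assumptions, not the model used in the talk: each case is a tuple of binary features, features are independent given the cluster, and all names (`em_step`, `bernoulli_lik`, `log_likelihood`) and the toy data are hypothetical.

```python
import math


def bernoulli_lik(case, theta):
    """p(case | cluster) for independent binary features."""
    p = 1.0
    for x, t in zip(case, theta):
        p *= t if x == 1 else 1.0 - t
    return p


def log_likelihood(data, weights, thetas):
    """log p(data) with the hidden cluster summed out."""
    return sum(
        math.log(sum(w * bernoulli_lik(c, t) for w, t in zip(weights, thetas)))
        for c in data
    )


def em_step(data, weights, thetas):
    """One EM iteration; returns new weights, new thetas, responsibilities."""
    k, d = len(weights), len(data[0])
    # E step: p(cluster | case, current params), i.e. fractional cases.
    resp = []
    for case in data:
        joint = [w * bernoulli_lik(case, t) for w, t in zip(weights, thetas)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M step: maximize Q, which here reduces to weighted frequencies.
    new_weights, new_thetas = [], []
    for j in range(k):
        mass = sum(r[j] for r in resp)
        new_weights.append(mass / len(data))
        new_thetas.append(
            [sum(r[j] * c[f] for r, c in zip(resp, data)) / mass for f in range(d)]
        )
    return new_weights, new_thetas, resp


if __name__ == "__main__":
    data = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1), (0, 0, 1)]
    weights, thetas = [0.5, 0.5], [[0.6, 0.6, 0.4], [0.4, 0.4, 0.6]]
    for i in range(10):
        weights, thetas, _ = em_step(data, weights, thetas)
        print(i, round(log_likelihood(data, weights, thetas), 4))
```

Each row of `resp` plays the role of the 0.7 / 0.3 split in the example above. The printed log likelihood should never decrease from one iteration to the next, though it may settle at a local maximum.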

Other applications at Microsoft:

• Analyze how people use programs (e.g. Office)

• Analyze web traffic for intruders (anomaly detection)

• Evolutionary history/phylogeny

• Vaccine for AIDS

Evolutionary History/Phylogeny (Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004)

[Figure: phylogenetic tree of mammals, from donkey and horse through whales, bats, primates, and rodents to opossum and platypus, with groups labeled Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, and Hedgehogs]

[Figure: tree with two hidden nodes and the nodes species 1, species 2, species 3]

• For a given tree, find max likelihood parameters

• Search over structure to find best likelihood (penalized to avoid overfitting)

Strong simplifying assumption

Evolution at each DNA nucleotide is independent

⇒ EM is computationally efficient

[Figure: separate models for nucleotide position 1, nucleotide position 2, …, nucleotide position N]

Relaxing the assumption

• Each substitution depends on the substitution at the previous position

• This structure captures context specific effects during evolution

• EM is computationally intractable

Lower bound good enough for EM-like algorithm

[Figure: two approximations, a product of trees and a product of chains, over nodes labeled h and o]

• Factor graphs, mixed graphs, etc.

• Relational learning: PRMs, Plates, PERs

• Bayesian methods for learning

• Scalability

• Causal modeling

• Variational methods

• Non-parametric distributions

Main conferences:

• Uncertainty in Artificial Intelligence (UAI)

• Neural Information Processing Systems (NIPS)