Graphical Models

David Heckerman

Microsoft Research

Overview
  • Intro to graphical models
    • Application: Data exploration
    • Dependency networks → undirected graphs
    • Directed acyclic graphs (“Bayes nets”)
  • Applications
    • Clustering
    • Evolutionary history/phylogeny
Using classification/regression for data exploration

p(target | inputs)

Decision tree: [Figure: a tree for p(cust) that splits on gender (male / female) and then on age (young / old), with leaf probabilities p(cust)=0.8, 0.7, and 0.2]

Logistic regression:

log p(cust)/(1-p(cust)) = 0.2 – 0.3*male + 0.4*old

Neural network: [figure not reproduced in the transcript]
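
To make the two model families concrete, here is a minimal scikit-learn sketch (my own synthetic example, not from the talk): it generates a toy customer table with gender and age inputs, then fits a decision tree and a logistic regression for p(cust | inputs). The variable names and data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000

# Synthetic inputs: male (0/1) and old (0/1); the target cust is drawn from the
# same logistic form as the slide, log-odds = 0.2 - 0.3*male + 0.4*old.
male = rng.integers(0, 2, size=n)
old = rng.integers(0, 2, size=n)
logit = 0.2 - 0.3 * male + 0.4 * old
cust = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X = np.column_stack([male, old])

# Decision tree: p(cust | gender, age) represented as a small tree of splits.
tree = DecisionTreeClassifier(max_depth=2).fit(X, cust)
print(export_text(tree, feature_names=["male", "old"]))

# Logistic regression: estimates coefficients of the form b0 + b1*male + b2*old.
logreg = LogisticRegression().fit(X, cust)
print("intercept:", logreg.intercept_, "coefficients:", logreg.coef_)
```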

Conditional independence

Decision tree: [Figure: the same p(cust) tree, which splits only on gender and age]

p(cust | gender, age, month born) = p(cust | gender, age)

p(target | all inputs) = p(target | some inputs)

Learning conditional independence from data: Model Selection
  • Cross validation
  • Bayesian methods
  • Penalized likelihood
  • Minimum description length
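
As one illustration of the first option, cross validation can pick how complex a tree to grow: deeper trees encode fewer conditional independencies, so the winning depth tells you which inputs the data actually support. A hedged sketch on made-up data (none of this comes from the talk):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3))        # three binary inputs
y = rng.binomial(1, 0.2 + 0.5 * X[:, 0])     # target depends on the first input only

# Compare tree depths by cross-validated log loss; depth 1 should win here,
# which is evidence that the target is independent of the other two inputs.
for depth in (1, 2, 3):
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y,
                             cv=5, scoring="neg_log_loss")
    print(depth, scores.mean())
```
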
Using classification/regression for data exploration
  • Suppose you have thousands of variables and you’re not sure about the interactions among those variables
  • Build a classification/regression model for each variable, using the rest of the variables as inputs
Example with three variables X, Y, and Z

[Figure: three decision trees, one per target variable]

Target: X, Inputs: Y,Z. The tree splits on Y (Y=0 / Y=1), with leaves p(x|y=0) and p(x|y=1).

Target: Y, Inputs: X,Z. The tree splits on X (X=0 / X=1); the X=1 branch is a leaf p(y|x=1), and the X=0 branch splits on Z (Z=0 / Z=1) with leaves p(y|x=0,z=0) and p(y|x=0,z=1).

Target: Z, Inputs: X,Y. The tree splits on Y (Y=0 / Y=1), with leaves p(z|y=0) and p(z|y=1).

Summarize the trees with a single graph

[Figure: the three trees above, summarized by a graph over nodes X, Y, and Z: arcs run in both directions between X and Y and between Y and Z, and there is no arc between X and Z]

Dependency Network
  • Build a classification/regression model for every variable given the other variables as inputs
  • Construct a graph where
    • Nodes correspond to variables
    • There is an arc from X to Y if X helps to predict Y
  • The graph along with the individual classification/regression models is a “dependency network”

(Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
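
A hedged sketch of that construction (illustrative only; the data, the tree settings, and the use of feature importances as the test for “helps to predict” are my assumptions, not the procedure in the paper):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000

# Made-up binary data with dependence structure X -> Y -> Z.
x = rng.integers(0, 2, n)
y = rng.binomial(1, np.where(x == 1, 0.8, 0.2))
z = rng.binomial(1, np.where(y == 1, 0.7, 0.1))
data = {"X": x, "Y": y, "Z": z}

arcs = []
for target, t_values in data.items():
    inputs = [v for v in data if v != target]
    X_in = np.column_stack([data[v] for v in inputs])
    tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=50).fit(X_in, t_values)
    for name, importance in zip(inputs, tree.feature_importances_):
        if importance > 0:                  # the tree actually uses this input
            arcs.append((name, target))     # arc: input -> target

print(arcs)   # e.g. arcs between X and Y and between Y and Z, in both directions
```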

Example: TV viewing

Nielsen data: 2/6/95-2/19/95

~400 shows, ~3000 viewers

              Age  Show1  Show2  Show3
  viewer 1     73    y      n      n
  viewer 2     16    n      y      y
  viewer 3     35    n      n      n
  etc.

Goal: exploratory data analysis (acausal)

A bit of history
  • Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s
  • But they didn’t like them, because they could be inconsistent
A consistent dependency network

[Figure: the same three trees and the graph over X, Y, and Z shown earlier; here the conditional distributions are consistent with a single joint distribution]

An inconsistent dependency network

[Figure: the same three trees and graph, but with conditional distributions that no single joint distribution can produce]

A bit of history
  • Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s
  • But they didn’t like them, because they could be inconsistent
  • So they used a property of consistent dependency networks to develop a new characterization of them
Conditional independence

[Figure: the three trees and the graph over X, Y, and Z from before]

X ⊥ Z | Y (X is independent of Z given Y)

Conditional independence in a dependency network

Each variable is independent of all other variables given its immediate neighbors

Hammersley-Clifford Theorem (Besag 1974)
  • Given a set of variables which has a positive joint distribution
  • Where each variable is independent of all other variables given its immediate neighbors in some graph G
  • It follows that the joint factorizes into a product of “clique potentials”:

p(x1, …, xm) = (1/Z) * φ1(c1) * φ2(c2) * … * φn(cn)

where c1, c2, …, cn are the maximal cliques in the graph G and Z is a normalizing constant.
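
To make the factorization concrete, here is a brute-force sketch on a toy example of my own (three binary variables, chain graph X - Y - Z whose maximal cliques are {X,Y} and {Y,Z}): the joint is the normalized product of one potential per clique, and the Markov property the theorem starts from can be checked numerically.

```python
import itertools
import numpy as np

# Arbitrary positive clique potentials for the cliques {X,Y} and {Y,Z}.
phi_xy = np.array([[4.0, 1.0],
                   [1.0, 4.0]])    # phi_xy[x, y]
phi_yz = np.array([[3.0, 1.0],
                   [1.0, 3.0]])    # phi_yz[y, z]

# Unnormalized joint, then the normalizing constant Z as the sum over all states.
unnorm = {(x, y, z): phi_xy[x, y] * phi_yz[y, z]
          for x, y, z in itertools.product([0, 1], repeat=3)}
Z = sum(unnorm.values())
joint = {state: value / Z for state, value in unnorm.items()}

# Check the Markov property: X is independent of Z given Y.
p_y0 = sum(v for (x, y, z), v in joint.items() if y == 0)
p_x1_y0 = sum(v for (x, y, z), v in joint.items() if x == 1 and y == 0)
p_y0_z1 = sum(v for (x, y, z), v in joint.items() if y == 0 and z == 1)
p_x1_y0_z1 = sum(v for (x, y, z), v in joint.items() if x == 1 and y == 0 and z == 1)
print(round(p_x1_y0 / p_y0, 6), round(p_x1_y0_z1 / p_y0_z1, 6))   # equal values
```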

A bit of history
  • Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s
  • But they didn’t like them, because they could be inconsistent
  • So they used a property of consistent dependency networks to develop a new characterization of them
  • “Markov Random Fields” aka “undirected graphs” were born
Inconsistent dependency networks aren’t that bad
  • They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized)
  • They are easy to learn from data (build a separate classification/regression model for each variable)
  • Conditional distributions (e.g., trees) are easier to understand than clique potentials
  • Over the last decade, they have proven to be a very useful tool for data exploration

Shortcomings of undirected graphs
  • Lack a generative story (e.g., Latent Dirichlet Allocation)
  • Lack a causal story

[Figure: a directed causal graph in which cold and lung cancer are parents of symptoms such as sore throat, cough, and weight loss]

Solution: Build trees in some order

1. Target: X, Inputs: none. Model: p(x)
2. Target: Y, Inputs: X. The tree splits on X (X=0 / X=1), with leaves p(y|x=0) and p(y|x=1)
3. Target: Z, Inputs: X,Y. The tree splits on Y (Y=0 / Y=1), with leaves p(z|y=0) and p(z|y=1)

[Figure: the resulting directed graph X → Y → Z]

Some orders are better than others
  • Random orders
  • Greedy search
  • Monte-Carlo methods

[Figure: two different orderings of X, Y, and Z, each yielding a different directed graph]

Joint distribution is easy to obtain

1. Target: X, Inputs: none. Model: p(x)
2. Target: Y, Inputs: X. Tree with leaves p(y|x=0) and p(y|x=1)
3. Target: Z, Inputs: X,Y. Tree with leaves p(z|y=0) and p(z|y=1)

The joint is the product of the conditionals: p(x,y,z) = p(x) * p(y|x) * p(z|y)

[Figure: the directed graph X → Y → Z]
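
As a quick check of that chain-rule factorization, here is a small sketch with toy numbers of my own choosing: it builds the joint p(x,y,z) = p(x) p(y|x) p(z|y) from the three conditionals and verifies that it sums to one.

```python
import itertools

# Toy conditional distributions read off the three trees (made-up values).
p_x = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # p_y_given_x[x][y]
p_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # p_z_given_y[y][z]

joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x, y, z in itertools.product([0, 1], repeat=3)}

print(sum(joint.values()))   # 1.0: a valid joint, no extra normalization needed
```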

Directed Acyclic Graphs (aka Bayes Nets)

Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982

The power of graphical models
  • Easy to understand
  • Useful for adding prior knowledge to an analysis (e.g., causal knowledge)
  • The conditional independencies they express make inference more computationally efficient
Inference

[Figure: the directed model over X, Y, and Z from before: p(x), a tree with leaves p(y|x=0) and p(y|x=1), and a tree with leaves p(z|y=0) and p(z|y=1), summarized by the graph X → Y → Z]

What is p(z|x=1)?
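
Answering that query only requires summing out the unobserved Y: p(z|x=1) = Σ_y p(y|x=1) p(z|y). A minimal sketch, reusing the made-up conditionals from the earlier example (the numbers are mine, not the slide's):

```python
# Toy conditionals (same made-up values as the previous sketch).
p_y_given_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
p_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

# p(z | x=1) = sum over y of p(y | x=1) * p(z | y).
p_z_given_x1 = {z: sum(p_y_given_x[1][y] * p_z_given_y[y][z] for y in (0, 1))
                for z in (0, 1)}
print(p_z_given_x1)   # {0: 0.55, 1: 0.45} with these numbers
```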

Inference
  • Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables)
  • Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter, Lauritzen & Spiegelhalter, Dechter)
  • Exact methods fail when there are many cycles in the graph
    • MCMC (e.g., Geman and Geman 1984)
    • Loopy propagation (e.g., Murphy et al. 1999)
    • Variational methods (e.g., Jordan et al. 1999)
Applications of Graphical Models

DAGs and UGs:

  • Data exploration
  • Density estimation
  • Clustering

UGs:

  • Spatial processes

DAGs:

  • Expert systems
  • Causal discovery
Applications
  • Clustering
  • Evolutionary history/phylogeny
Clustering

Example: msnbc.com

User Sequence

1 frontpage news travel travel

2 news news news news news

3 frontpage news frontpage news frontpage

4 news news

5 frontpage news news travel travel travel

6 news weather weather weather weather weather

7 news health health business business business

8 frontpage sports sports sports weather

Etc.

Millions of users per day

Goal: understand what is and isn’t working on the site

Solution

[Figure: pipeline from raw data, through clustering, to user clusters]

  • Cluster users based on their behavior on the site
  • Display clusters somehow
Generative model for clustering (e.g., AutoClass, Cheeseman & Stutz 1995)

[Figure: a discrete, hidden Cluster node with arcs to the observed 1st page, 2nd page, and 3rd page variables; given the cluster, the pages are independent]
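
A hedged sketch of this kind of generative model (my own toy parameters; two clusters, three page slots): sample a hidden cluster, then sample each page independently given the cluster.

```python
import numpy as np

rng = np.random.default_rng(0)
pages = ["frontpage", "news", "travel", "weather"]

# Toy parameters: p(cluster) and p(page | cluster), shared across the three slots.
p_cluster = np.array([0.6, 0.4])
p_page_given_cluster = np.array([[0.5, 0.4, 0.05, 0.05],   # cluster 0: frontpage/news readers
                                 [0.1, 0.2, 0.4, 0.3]])    # cluster 1: travel/weather readers

def sample_user():
    c = rng.choice(2, p=p_cluster)                                    # hidden cluster
    visits = [pages[rng.choice(4, p=p_page_given_cluster[c])] for _ in range(3)]
    return c, visits

for _ in range(5):
    print(sample_user())
```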

Sequence Clustering (Cadez, Heckerman, Meek, & Smyth, 2000)

[Figure: a discrete, hidden Cluster node with arcs to the 1st page, 2nd page, and 3rd page variables, plus arcs from each page to the next; given the cluster, the page sequence is a first-order Markov chain]
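
A rough sketch of scoring a sequence under such a mixture of Markov chains (two clusters with toy initial-page and transition probabilities that I made up): compute p(sequence | cluster) for each cluster, then the posterior p(cluster | sequence) by Bayes' rule.

```python
import numpy as np

pages = {"frontpage": 0, "news": 1, "travel": 2}
p_cluster = np.array([0.5, 0.5])

# Cluster-specific initial-page and page-to-page transition probabilities (made up).
p_init = np.array([[0.7, 0.2, 0.1],        # cluster 0
                   [0.2, 0.3, 0.5]])       # cluster 1
p_trans = np.array([[[0.2, 0.7, 0.1],      # cluster 0: rows = current page, cols = next page
                     [0.3, 0.6, 0.1],
                     [0.1, 0.4, 0.5]],
                    [[0.1, 0.2, 0.7],      # cluster 1
                     [0.1, 0.3, 0.6],
                     [0.1, 0.1, 0.8]]])

def sequence_likelihood(seq, c):
    idx = [pages[p] for p in seq]
    prob = p_init[c, idx[0]]
    for prev, nxt in zip(idx, idx[1:]):
        prob *= p_trans[c, prev, nxt]
    return prob

seq = ["frontpage", "news", "travel", "travel"]
likes = np.array([sequence_likelihood(seq, c) for c in range(2)])
posterior = p_cluster * likes / np.sum(p_cluster * likes)
print(posterior)   # p(cluster | sequence)
```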

Learning parameters (with missing data)

Principles:

  • Find the parameters that maximize the (log) likelihood of the data
  • Find the parameters whose posterior probability is a maximum
  • Find distributions for quantities of interest by averaging over the unknown parameters

Gradient methods or EM algorithm typically used for first two

Expectation-Maximization (EM) algorithm (Dempster, Laird, Rubin 1977)

Initialize parameters (e.g., at random)

Expectation step: compute probabilities for values of unobserved variable using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret data as set of fractional cases based on these probabilities

Maximization step: choose parameters so as to maximize the log likelihood of the fractional data

Parameters will converge to a local maximum of log p(data)

E-step

Suppose the cluster model has 2 clusters, and that

p(cluster=1 | case, current params) = 0.7
p(cluster=2 | case, current params) = 0.3

Then, write

q(case) = 0.7 log p(case, cluster=1 | params) + 0.3 log p(case, cluster=2 | params)

Do this for each case and then find the parameters that maximize Q = Σ_case q(case). These parameters also increase the log likelihood.
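
A minimal EM sketch in this spirit (a two-cluster mixture over five binary features, with all numbers invented for illustration): the E-step computes the responsibilities that turn each case into fractional cases, and the M-step refits the parameters from those fractional counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 cases, 5 binary features each, from two hidden clusters.
true_cluster = rng.integers(0, 2, size=500)
rates = np.where(true_cluster[:, None] == 1, 0.8, 0.2)
data = rng.binomial(1, np.broadcast_to(rates, (500, 5)))

# Initialize parameters.
pi = 0.5                                   # p(cluster = 1)
theta = np.array([[0.4] * 5, [0.6] * 5])   # theta[c, j] = p(x_j = 1 | cluster c)

for _ in range(100):
    # E-step (this is inference): r[i] = p(cluster=1 | case i, current params).
    log1 = np.log(pi) + (data * np.log(theta[1]) + (1 - data) * np.log(1 - theta[1])).sum(1)
    log0 = np.log(1 - pi) + (data * np.log(theta[0]) + (1 - data) * np.log(1 - theta[0])).sum(1)
    r = 1 / (1 + np.exp(log0 - log1))

    # M-step: maximize the expected log likelihood of the fractional cases.
    pi = r.mean()
    theta[1] = (r[:, None] * data).sum(0) / r.sum()
    theta[0] = ((1 - r)[:, None] * data).sum(0) / (1 - r).sum()

print(round(float(pi), 2))
print(theta.round(2))   # rows approach the generating rates 0.2 and 0.8 (up to label swap)
```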

Demo: SQL Server 2005

Example: msnbc.com (the same user-sequence data shown earlier)

Sequence clustering

Other applications at Microsoft:

  • Analyze how people use programs (e.g. Office)
  • Analyze web traffic for intruders (anomaly detection)
Computational biology applications
  • Evolutionary history/phylogeny
  • Vaccine for AIDS
Evolutionary History/Phylogeny (Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004)

[Figure: an inferred phylogenetic tree of mammals, grouping species into orders such as Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia, and Hedgehogs, with the bandicoot, wallaroo, opossum, and platypus as outgroups]

Probabilistic Model of Evolution

[Figure: a tree-structured graphical model with hidden ancestral nodes and observed leaves for species 1, species 2, and species 3]

Learning phylogeny from data
  • For a given tree, find max likelihood parameters
  • Search over structures to find the best likelihood (penalized to avoid overfitting)
Strong simplifying assumption

Evolution at each DNA nucleotide is independent

→ EM is computationally efficient

[Figure: one copy of the phylogeny model for each nucleotide position 1, 2, …, N, with no arcs between positions]
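
Under this assumption the likelihood of aligned sequences is a product of independent per-site terms. A toy sketch (two observed species joined at a single hidden ancestor, a uniform ancestral distribution, and a made-up per-branch substitution probability; none of this is the model from the paper):

```python
import numpy as np

bases = "ACGT"
sub_prob = 0.1   # made-up probability that a nucleotide changes along one branch

def branch_prob(parent, child):
    # Simple substitution model: keep the base with prob 1 - sub_prob, else move uniformly.
    return 1 - sub_prob if parent == child else sub_prob / 3

def site_likelihood(obs1, obs2):
    # Sum over the hidden ancestral base at this position.
    return sum(0.25 * branch_prob(a, obs1) * branch_prob(a, obs2) for a in bases)

seq1 = "ACGTTACA"
seq2 = "ACGTTGCA"

# Independence across sites: the total log likelihood is a sum of per-site terms.
log_likelihood = sum(np.log(site_likelihood(b1, b2)) for b1, b2 in zip(seq1, seq2))
print(log_likelihood)
```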


Relaxing the assumption
  • Each substitution depends on the substitution at the previous position
  • This structure captures context specific effects during evolution
  • EM is computationally intractable
Variational approximation for inference

Lower bound good enough for EM-like algorithm
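
The bound itself is not reproduced in the transcript; for reference, the standard variational lower bound such an approach optimizes is, for any distribution q over the hidden variables h,

$$\log p(\mathrm{data}) \;\ge\; \sum_h q(h)\,\log \frac{p(\mathrm{data},\,h)}{q(h)},$$

with equality when q(h) = p(h | data). Restricting q to a tractable family (next slide) yields a bound that an EM-like algorithm can maximize instead of the exact likelihood.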

Two simple q distributions

[Figure: two factored families for q over the hidden (h) and observed (o) nodes: a product of trees and a product of chains]

Things I didn’t have time to talk about
  • Factor graphs, mixed graphs, etc.
  • Relational learning: PRMs, Plates, PERs
  • Bayesian methods for learning
  • Scalability
  • Causal modeling
  • Variational methods
  • Non-parametric distributions
To learn more

Main conferences:

  • Uncertainty in Artificial Intelligence (UAI)
  • Neural Information Processing Systems (NIPS)