Graphical Models

David Heckerman

Microsoft Research


Overview

  • Intro to graphical models

    • Application: Data exploration

    • Dependency networks → undirected graphs

    • Directed acyclic graphs (“Bayes nets”)

  • Applications

    • Clustering

    • Evolutionary history/phylogeny


Using classification/regression for data exploration

[Figure: a decision tree for p(cust) that splits on gender (male/female) and age (young/old), with leaf probabilities p(cust)=0.8, p(cust)=0.7, and p(cust)=0.2]

Decision tree: (see the figure above)

Logistic regression:

log p(cust)/(1-p(cust)) = 0.2 - 0.3*male + 0.4*old

Neural network:


Using classification/regression for data exploration

[Figure: the same decision tree for p(cust) over gender and age]

p(target|inputs)

Decision tree: (see the figure above)

Logistic regression:

log p(cust)/(1-p(cust)) = 0.2 - 0.3*male + 0.4*old

Neural network:
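For concreteness, the logistic model above can be evaluated directly; here is a minimal Python sketch (treating male and old as 0/1 indicators, with the slide's illustrative coefficients):

```python
import math

def p_cust(male: bool, old: bool) -> float:
    """Logistic model from the slide: log-odds = 0.2 - 0.3*male + 0.4*old."""
    log_odds = 0.2 - 0.3 * int(male) + 0.4 * int(old)
    return 1.0 / (1.0 + math.exp(-log_odds))

print(p_cust(male=True, old=False))   # ~0.48
print(p_cust(male=False, old=True))   # ~0.65
```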


Conditional independence

[Figure: the same decision tree for p(cust) over gender and age]

Decision tree:

p(cust | gender, age, month born) = p(cust | gender, age)

p(target | all inputs) = p(target | some inputs)


Learning conditional independence from data: Model Selection

  • Cross validation

  • Bayesian methods

  • Penalized likelihood

  • Minimum description length
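As an illustration of the first criterion, here is a minimal scikit-learn sketch that selects a tree depth by cross validation (the data and features are synthetic, not from the talk; any of the other criteria could replace the scoring rule):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary inputs; the target depends only on the second input,
# so deeper trees should not help much.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3))
y = (rng.random(500) < 0.2 + 0.4 * X[:, 1]).astype(int)

for depth in (1, 2, 3, 5):
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```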


Using classification/regression for data exploration

  • Suppose you have thousands of variables and you’re not sure about the interactions among those variables

  • Build a classification/regression model for each variable, using the rest of the variables as inputs


Example with three variables X, Y, and Z

[Figure: three learned trees]

Target: X, Inputs: Y,Z. The tree splits on Y, with leaves p(x|y=0) and p(x|y=1).

Target: Y, Inputs: X,Z. The tree splits on X; the X=1 branch is a leaf p(y|x=1), and the X=0 branch splits on Z, with leaves p(y|x=0,z=0) and p(y|x=0,z=1).

Target: Z, Inputs: X,Y. The tree splits on Y, with leaves p(z|y=0) and p(z|y=1).


Summarize the trees with a single graph

[Figure: the three trees from the previous slide (for targets X, Y, and Z), plus a single graph over nodes X, Y, and Z with an arc into each variable from every variable its tree actually uses: Y into X, X and Z into Y, and Y into Z]


Dependency Network

  • Build a classification/regression model for every variable given the other variables as inputs

  • Construct a graph where

    • Nodes correspond to variables

    • There is an arc from X to Y if X helps to predict Y

  • The graph along with the individual classification/regression models is a “dependency network”

    (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
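A minimal sketch of this construction in Python (the data, column names, and the use of scikit-learn trees and feature importances are illustrative assumptions, not the paper's implementation):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data in which Y depends on X and Z depends on Y.
rng = np.random.default_rng(0)
n = 1000
x = rng.integers(0, 2, n)
y = (rng.random(n) < 0.2 + 0.6 * x).astype(int)
z = (rng.random(n) < 0.2 + 0.6 * y).astype(int)
data = pd.DataFrame({"X": x, "Y": y, "Z": z})

arcs = []
for target in data.columns:
    inputs = [c for c in data.columns if c != target]
    tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=50)
    tree.fit(data[inputs], data[target])
    # Add an arc from each input the tree actually uses to the target.
    for col, importance in zip(inputs, tree.feature_importances_):
        if importance > 0:
            arcs.append((col, target))

print(arcs)   # the arcs of the dependency network
```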


Example: TV viewing

Nielsen data: 2/6/95-2/19/95

~400 shows, ~3000 viewers

          Age  Show1  Show2  Show3 ...
viewer 1   73    y      n      n
viewer 2   16    n      y      y
viewer 3   35    n      n      n
etc.

Goal: exploratory data analysis (acausal)


A bit of history

  • Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s

  • But they didn’t like them, because they could be inconsistent


A consistent dependency network

[Figure: the three trees for targets X (inputs Y,Z), Y (inputs X,Z), and Z (inputs X,Y), with leaf distributions p(x|y=0), p(x|y=1), p(y|x=1), p(y|x=0,z=0), p(y|x=0,z=1), p(z|y=0), p(z|y=1), together with the summary graph over X, Y, and Z; here the conditional distributions are consistent with a single joint distribution]


An inconsistent dependency network

[Figure: the same three trees and summary graph over X, Y, and Z, but with leaf probabilities that cannot all arise from any single joint distribution]


A bit of history

  • Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s

  • But they didn’t like them, because they could be inconsistent

  • So they used a property of consistent dependency networks to develop a new characterization of them


Conditional independence

[Figure: the three trees and the summary graph over X, Y, and Z; neither the tree for X nor the tree for Z uses the other variable once Y is known]

X ⊥ Z | Y


Conditional independence in a dependency network

Each variable is independent of all other variables given its immediate neighbors


Hammersley-Clifford Theorem (Besag 1974)

  • Given a set of variables which has a positive joint distribution

  • Where each variable is independent of all other variables given its immediate neighbors in some graph G

  • It follows that

    p(x1, ..., xm) = (1/Z) * f1(c1) * f2(c2) * ... * fn(cn)

    where c1, c2, ..., cn are the maximal cliques in the graph G, the factors f1, ..., fn are “clique potentials”, and Z is a normalizing constant.


Example

[Figure: undirected graph X - Y - Z]

The maximal cliques are {X,Y} and {Y,Z}, so p(x,y,z) = (1/Z) * f1(x,y) * f2(y,z).
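A small numerical sketch of this factorization (the potentials below are made up; the point is that any such product yields X independent of Z given Y):

```python
import numpy as np

# Clique potentials for the chain X - Y - Z (maximal cliques {X,Y} and {Y,Z}).
f1 = np.array([[2.0, 1.0], [0.5, 3.0]])   # f1(x, y)
f2 = np.array([[1.0, 4.0], [2.0, 1.0]])   # f2(y, z)

unnorm = np.einsum('xy,yz->xyz', f1, f2)  # f1(x,y) * f2(y,z)
joint = unnorm / unnorm.sum()             # divide by the normalizer Z

# Check the implied conditional independence X ⊥ Z | Y at y = 0.
p_xz = joint[:, 0, :] / joint[:, 0, :].sum()
print(np.allclose(p_xz, np.outer(p_xz.sum(axis=1), p_xz.sum(axis=0))))  # True
```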



A bit of history

  • Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s

  • But they didn’t like them, because they could be inconsistent

  • So they used a property of consistent dependency networks to develop a new characterization of them

  • “Markov Random Fields” aka “undirected graphs” were born


Inconsistent dependency networks aren’t that bad

  • They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized)

  • They are easy to learn from data (build separate classification/regression model for each variable)

  • Conditional distributions (e.g., trees) are easier to understand than clique potentials


Inconsistent dependency networks aren’t that bad

  • They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized)

  • They are easy to learn from data (build separate classification/regression model for each variable)

  • Conditional distributions (e.g., trees) are easier to understand than clique potentials

  • Over the last decade, dependency networks have proven to be a very useful tool for data exploration


Shortcomings of undirected graphs

  • Lack a generative story (e.g., Latent Dirichlet Allocation)

  • Lack a causal story

[Figure: a directed, causal example relating cold and lung cancer to the symptoms sore throat, cough, and weight loss]


Solution: Build trees in some order

1. Target: X, Inputs: none. A single leaf: p(x).

2. Target: Y, Inputs: X. The tree splits on X, with leaves p(y|x=0) and p(y|x=1).

3. Target: Z, Inputs: X,Y. The tree splits on Y, with leaves p(z|y=0) and p(z|y=1).


Solution: Build trees in some order

1. Target: X, Inputs: none. A single leaf: p(x).

2. Target: Y, Inputs: X. The tree splits on X, with leaves p(y|x=0) and p(y|x=1).

3. Target: Z, Inputs: X,Y. The tree splits on Y, with leaves p(z|y=0) and p(z|y=1).

[Figure: the resulting directed acyclic graph X → Y → Z]
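A minimal sketch of this ordered construction (synthetic data and scikit-learn trees, used here only for illustration): fitting each variable on only the variables that precede it in the order can never create a directed cycle.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 1000
x = rng.integers(0, 2, n)
y = (rng.random(n) < 0.2 + 0.6 * x).astype(int)
z = (rng.random(n) < 0.2 + 0.6 * y).astype(int)
data = pd.DataFrame({"X": x, "Y": y, "Z": z})

order = ["X", "Y", "Z"]
arcs = []
for i, target in enumerate(order):
    inputs = order[:i]            # only earlier variables may be parents
    if not inputs:
        continue                  # the first variable just gets a marginal p(X)
    tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=50)
    tree.fit(data[inputs], data[target])
    arcs += [(c, target) for c, imp in zip(inputs, tree.feature_importances_) if imp > 0]

print(arcs)   # typically [('X', 'Y'), ('Y', 'Z')]: a DAG by construction
```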


Some orders are better than others

  • Random orders

  • Greedy search

  • Monte-Carlo methods

[Figure: two different DAGs over X, Y, and Z obtained from two different variable orderings]


Joint distribution is easy to obtain

1. Target: X, Inputs: none. Leaf: p(x).

2. Target: Y, Inputs: X. Leaves: p(y|x=0), p(y|x=1).

3. Target: Z, Inputs: X,Y. Leaves: p(z|y=0), p(z|y=1).

[Figure: the DAG X → Y → Z]

The joint distribution is just the product of the conditionals defined by the trees: p(x,y,z) = p(x) p(y|x) p(z|y).
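A tiny numerical sketch (with made-up leaf probabilities) of how the trees define the full joint via the chain rule:

```python
import numpy as np

p_x = np.array([0.3, 0.7])                    # p(x)
p_y_x = np.array([[0.8, 0.2], [0.4, 0.6]])    # p(y|x), rows indexed by x
p_z_y = np.array([[0.9, 0.1], [0.25, 0.75]])  # p(z|y), rows indexed by y

# p(x,y,z) = p(x) p(y|x) p(z|y)
joint = np.einsum('x,xy,yz->xyz', p_x, p_y_x, p_z_y)
print(joint.sum())   # 1.0: no extra normalization is needed
```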


Directed Acyclic Graphs (aka Bayes Nets)

Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982


The power of graphical models

  • Easy to understand

  • Useful for adding prior knowledge to an analysis (e.g., causal knowledge)

  • The conditional independencies they express make inference more computationally efficient


Inference

1. Target: X, Inputs: none. Leaf: p(x).

2. Target: Y, Inputs: X. Leaves: p(y|x=0), p(y|x=1).

3. Target: Z, Inputs: X,Y. Leaves: p(z|y=0), p(z|y=1).

[Figure: the DAG X → Y → Z]

What is p(z|x=1)?
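A worked answer with made-up leaf probabilities: since Z depends only on Y in this model, p(z|x=1) is the sum over y of p(y|x=1) p(z|y).

```python
import numpy as np

p_y_given_x1 = np.array([0.3, 0.7])               # [p(y=0|x=1), p(y=1|x=1)]
p_z_given_y = np.array([[0.9, 0.1], [0.2, 0.8]])  # rows indexed by y

p_z_given_x1 = p_y_given_x1 @ p_z_given_y         # sum_y p(y|x=1) p(z|y)
print(p_z_given_x1)                               # [0.41, 0.59], sums to 1
```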



Inference: Example (“Elimination Algorithm”)

[Figure: a network over the variables Z, X, Y, and W, with the elimination steps worked through on the slide]

Inference: Example (“Elimination Algorithm”)

[Figure: the same network over Z, X, Y, and W, with the remaining elimination steps]
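The slide's worked equations are not in the transcript; below is a generic sketch of sum-product elimination, assuming (for illustration only) the chain X → Y → Z → W with binary variables and made-up conditional probability tables.

```python
import numpy as np

p_x = np.array([0.6, 0.4])                      # p(x)
p_y_x = np.array([[0.7, 0.3], [0.2, 0.8]])      # p(y|x), rows indexed by x
p_z_y = np.array([[0.9, 0.1], [0.4, 0.6]])      # p(z|y)
p_w_z = np.array([[0.5, 0.5], [0.1, 0.9]])      # p(w|z)

# Compute p(w) by eliminating one variable at a time: x, then y, then z.
p_y = p_x @ p_y_x          # sum_x p(x) p(y|x)
p_z = p_y @ p_z_y          # sum_y p(y) p(z|y)
p_w = p_z @ p_w_z          # sum_z p(z) p(w|z)
print(p_w, p_w.sum())      # a proper distribution over w (sums to 1)
```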


Inference

  • Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables)

  • Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter, Lauritzen & Spiegelhalter, Dechter)

  • Exact methods fail when there are many cycles in the graph

    • MCMC (e.g., Geman and Geman 1984)

    • Loopy propagation (e.g., Murphy et al. 1999)

    • Variational methods (e.g., Jordan et al. 1999)


Applications of Graphical Models

DAGs and UGs:

  • Data exploration

  • Density estimation

  • Clustering

UGs:

  • Spatial processes

DAGs:

  • Expert systems

  • Causal discovery


Applications

  • Clustering

  • Evolutionary history/phylogeny


Clustering

Example: msnbc.com

User Sequence

1 frontpage news travel travel

2 news news news news news

3 frontpage news frontpage news frontpage

4 news news

5 frontpage news news travel travel travel

6 news weather weather weather weather weather

7 news health health business business business

8 frontpage sports sports sports weather

Etc.

Millions of users per day

Goal: understand what is and isn’t working on the site


Solution

[Figure: pipeline from raw data, through a clustering model, to user clusters]

  • Cluster users based on their behavior on the site

  • Display clusters somehow


Generative model for clustering (e.g., AutoClass, Cheeseman & Stutz 1995)

[Figure: a discrete, hidden Cluster node with an arc to each observed node: 1st page, 2nd page, and 3rd page]


Sequence Clustering (Cadez, Heckerman, Meek, & Smyth, 2000)

[Figure: a discrete, hidden Cluster node with arcs to the 1st page, 2nd page, and 3rd page nodes, which are now also chained together so that each page depends on the previous page]
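A minimal sketch of the likelihood this model assigns to one sequence: a mixture over clusters of first-order Markov chains (the page categories and parameter values below are placeholders; in practice the parameters are learned with EM, as described next).

```python
import numpy as np

pages = ["frontpage", "news", "travel", "weather"]
idx = {p: i for i, p in enumerate(pages)}

K = 2                                               # number of clusters
pi = np.array([0.6, 0.4])                           # p(cluster)
init = np.full((K, len(pages)), 0.25)               # p(first page | cluster)
trans = np.full((K, len(pages), len(pages)), 0.25)  # p(page_t | page_{t-1}, cluster)

def log_p_sequence(seq):
    """log p(seq) = log sum_k p(k) p(page_1|k) prod_t p(page_t | page_{t-1}, k)."""
    s = [idx[p] for p in seq]
    log_pk = np.log(pi) + np.log(init[:, s[0]])
    for prev, cur in zip(s, s[1:]):
        log_pk += np.log(trans[:, prev, cur])
    return np.logaddexp.reduce(log_pk)

print(log_p_sequence(["frontpage", "news", "travel", "travel"]))
```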


Learning parameters (with missing data)

Principles:

  • Find the parameters that maximize the (log) likelihood of the data

  • Find the parameters whose posterior probability is a maximum

  • Find distributions for quantities of interest by averaging over the unknown parameters

    Gradient methods or the EM algorithm are typically used for the first two


Expectation-Maximization (EM) algorithm (Dempster, Laird, & Rubin 1977)

Initialize parameters (e.g., at random)

Expectation step: compute probabilities for values of unobserved variable using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret data as set of fractional cases based on these probabilities

Maximization step: choose parameters so as to maximize the log likelihood of the fractional data

Parameters will converge to a local maximum of log p(data)


E-step

Suppose cluster model has 2 clusters, and that

p(cluster=1|case,current params) = 0.7

p(cluster=2|case,current params) = 0.3

Then, write

q(case) = 0.7 log p(case,cluster=1|params) +

0.3 log p(case,cluster=2|params)

Do this for each case and then find the parameters that maximize Q = Σ_case q(case). Maximizing Q is guaranteed to increase (or leave unchanged) the log likelihood.
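A compact sketch of the full EM loop for a two-cluster mixture (here a mixture of independent Bernoulli features over synthetic binary data, standing in for the clustering models above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data: N cases, D binary features (e.g., "did the user visit page d").
N, D, K = 200, 5, 2
true_theta = np.array([[0.9, 0.8, 0.1, 0.1, 0.5],
                       [0.1, 0.2, 0.9, 0.8, 0.5]])
z = rng.integers(K, size=N)
X = (rng.random((N, D)) < true_theta[z]).astype(float)

# Random initialization of mixture weights pi and per-cluster Bernoulli params.
pi = np.full(K, 1.0 / K)
theta = rng.uniform(0.25, 0.75, size=(K, D))

for _ in range(50):
    # E step (inference): responsibilities r[n, k] = p(cluster=k | case n, params).
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T   # (N, K)
    log_post = np.log(pi) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)
    r = np.exp(log_post)
    r /= r.sum(axis=1, keepdims=True)

    # M step: maximize the expected complete-data log likelihood (the Q above).
    Nk = r.sum(axis=0)
    pi = Nk / N
    theta = np.clip((r.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)

print("mixture weights:", pi.round(2))
print("cluster parameters:\n", theta.round(2))
```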


Demo: SQL Server 2005

Example: msnbc.com

User Sequence

1 frontpage news travel travel

2 news news news news news

3 frontpage news frontpage news frontpage

4 news news

5 frontpage news news travel travel travel

6 news weather weather weather weather weather

7 news health health business business business

8 frontpage sports sports sports weather

Etc.


Sequence clustering

Other applications at Microsoft:

  • Analyze how people use programs (e.g. Office)

  • Analyze web traffic for intruders (anomaly detection)


Computational biology applications

  • Evolutionary history/phylogeny

  • Vaccine for AIDS


Evolutionary History/Phylogeny (Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004)

[Figure: an inferred phylogenetic tree over several dozen mammal species (horse, rhinos, seals, whales, bats, shrews, elephant, rabbit, primates, rodents, hedgehogs, marsupials, platypus, and others), with the species grouped into orders such as Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, and Rodentia]


Probabilistic Model of Evolution

[Figure: a tree-structured graphical model in which hidden nodes (ancestral sequences) are parents of the observed nodes for species 1, species 2, and species 3]


Learning phylogeny from data

  • For a given tree, find max likelihood parameters

  • Search over structures to find the best likelihood (penalized to avoid overfitting)


Strong simplifying assumption

Evolution at each DNA nucleotide is independent

⇒ EM is computationally efficient

[Figure: one copy of the phylogeny model per nucleotide position (position 1, position 2, ..., position N), with no arcs between positions]



Relaxing the assumption

  • Each substitution depends on the substitution at the previous position

  • This structure captures context-specific effects during evolution

  • EM is computationally intractable


Variational approximation for inference

A lower bound on the likelihood is good enough for an EM-like algorithm
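The bound in question is presumably the standard variational (Jensen) lower bound; in generic form, for observed data o, hidden variables h, and any distribution q,

```latex
\log p(o) \;=\; \log \sum_h p(o, h) \;\ge\; \sum_h q(h)\,\log \frac{p(o, h)}{q(h)}
```

with equality when q(h) = p(h | o); maximizing the right-hand side, with q restricted to a tractable family, gives the EM-like algorithm mentioned here.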


Two simple q distributions

[Figure: two factored approximations over the hidden (h) and observed (o) nodes of the model: a product of trees, and a product of chains]


Things I didn’t have time to talk about

  • Factor graphs, mixed graphs, etc.

  • Relational learning: PRMs, Plates, PERs

  • Bayesian methods for learning

  • Scalability

  • Causal modeling

  • Variational methods

  • Non-parametric distributions


To learn more

Main conferences:

  • Uncertainty in Artificial Intelligence (UAI)

  • Neural Information Processing Systems (NIPS)

