
Graphical Models


David Heckerman

Microsoft Research

- Intro to graphical models
- Application: Data exploration
- Dependency networks and undirected graphs
- Directed acyclic graphs (“Bayes nets”)

- Applications
- Clustering
- Evolutionary history/phylogeny

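Decision tree:

[Figure: decision tree for p(cust) that splits on gender (male/female) and age (young/old), with leaf probabilities p(cust)=0.8, p(cust)=0.7, and p(cust)=0.2]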

Logistic regression:

log [p(cust) / (1 - p(cust))] = 0.2 - 0.3*male + 0.4*old
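The logistic regression above gives the log-odds of being a customer; inverting the logit turns it into a probability. A minimal sketch, assuming `male` and `old` are 0/1 indicators (the function name `p_cust` is only for illustration):

```python
import math

def p_cust(male: int, old: int) -> float:
    """Probability of being a customer under the logistic regression above."""
    log_odds = 0.2 - 0.3 * male + 0.4 * old    # coefficients taken from the slide
    return 1.0 / (1.0 + math.exp(-log_odds))   # inverse of the logit function

# Evaluate the model for every combination of the two indicators.
for male in (0, 1):
    for old in (0, 1):
        print(f"male={male} old={old} p(cust)={p_cust(male, old):.3f}")
```

With these coefficients, being male lowers the predicted probability and being old raises it.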

Neural network:


All three models represent p(target | inputs)


p(cust | gender, age, month born)=p(cust | gender, age)

p(target | all inputs) = p(target | some inputs)

- Cross validation
- Bayesian methods
- Penalized likelihood
- Minimum description length

- Suppose you have thousands of variables and you’re not sure about the interactions among those variables
- Build a classification/regression model for each variable, using the rest of the variables as inputs

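[Figure: one decision tree per variable. Target X (inputs Y, Z): splits on Y, with leaves p(x|y=0) and p(x|y=1). Target Y (inputs X, Z): splits on X, then on Z when X=0, with leaves p(y|x=1), p(y|x=0,z=0), and p(y|x=0,z=1). Target Z (inputs X, Y): splits on Y, with leaves p(z|y=0) and p(z|y=1).]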

[Figure: the same three trees, shown with the corresponding graph over the nodes X, Y, and Z]

- Build a classification/regression model for every variable given the other variables as inputs
- Construct a graph where
  - Nodes correspond to variables
  - There is an arc from X to Y if X helps to predict Y

- The graph along with the individual classification/regression models is a “dependency network” (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
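The construction above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: variables are binary, each per-variable model is a full conditional probability table rather than a tree, "helps to predict" is decided by greedy forward selection on a BIC-penalized log likelihood, and the data are made up. The names `bic_score`, `select_inputs`, and `dependency_network` are hypothetical.

```python
import math
import random
from collections import Counter

def bic_score(data, target, inputs):
    """BIC-penalized log likelihood of a full table p(target | inputs)."""
    joint = Counter((tuple(row[i] for i in inputs), row[target]) for row in data)
    context = Counter(tuple(row[i] for i in inputs) for row in data)
    loglik = sum(n * math.log(n / context[ctx]) for (ctx, _), n in joint.items())
    n_params = 2 ** len(inputs)            # one free parameter per binary context
    return loglik - 0.5 * n_params * math.log(len(data))

def select_inputs(data, target, n_vars):
    """Greedily add the input that most improves the score; stop when none does."""
    chosen, best = [], bic_score(data, target, [])
    while True:
        scored = [(bic_score(data, target, chosen + [v]), v)
                  for v in range(n_vars) if v != target and v not in chosen]
        if not scored or max(scored)[0] <= best:
            return sorted(chosen)
        best, pick = max(scored)
        chosen.append(pick)

def dependency_network(data, names):
    """Map each variable to the variables that have an arc into it."""
    return {names[t]: [names[v] for v in select_inputs(data, t, len(names))]
            for t in range(len(names))}

# Made-up data: Y is a noisy copy of X, and Z is a noisy copy of Y.
rng = random.Random(0)

def noisy_copy(bit, flip=0.1):
    return bit ^ (rng.random() < flip)

data = []
for _ in range(2000):
    x = int(rng.random() < 0.5)
    y = noisy_copy(x)
    z = noisy_copy(y)
    data.append((x, y, z))

network = dependency_network(data, ["X", "Y", "Z"])
print(network)
```

On data like this, Y should be selected as an input for both X and Z, and both X and Z as inputs for Y. The penalty term is what usually keeps X out of the model for Z once Y is already included.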

          Age  Show1  Show2  Show3
viewer 1  73   y      n      n
viewer 2  16   n      y      y      ...
viewer 3  35   n      n      n
etc.

Nielsen data: 2/6/95-2/19/95

Goal: exploratory data analysis (acausal)

~400 shows, ~3000 viewers

- Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s
- But they didn’t like them, because they could be inconsistent


- So Besag and others (who called these models “Markov graphs”) used a property of consistent dependency networks to develop a new characterization of them

[Figure: the same three trees and the graph over X, Y, and Z, annotated with the conditional independence X ⊥ Z | Y]

Each variable is independent of all other variables given its immediate neighbors

- Given a set of variables that has a positive joint distribution
- Where each variable is independent of all other variables given its immediate neighbors in some graph G
- It follows that

  p(x) = (1/Z) f1(c1) f2(c2) … fn(cn)

  where c1, c2, …, cn are the maximal cliques in the graph G, the positive functions f1, …, fn are the “clique potentials”, and Z is a normalizing constant.
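As a small numerical check of how clique potentials and conditional independence fit together, the sketch below builds a joint distribution for the graph X - Y - Z from two made-up positive potentials and confirms that p(x | y, z) does not depend on z. The potential values and the names `f_xy`, `f_yz`, and `norm_const` are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical positive clique potentials for the graph X - Y - Z,
# whose maximal cliques are {X, Y} and {Y, Z}.
f_xy = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
f_yz = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}

# Joint distribution: the normalized product of the clique potentials.
unnorm = {(x, y, z): f_xy[x, y] * f_yz[y, z]
          for x, y, z in product((0, 1), repeat=3)}
norm_const = sum(unnorm.values())
joint = {xyz: value / norm_const for xyz, value in unnorm.items()}

def p_x_given(x, y, z=None):
    """p(x | y, z), or p(x | y) when z is None, computed from the joint."""
    zs = (0, 1) if z is None else (z,)
    numerator = sum(joint[x, y, zz] for zz in zs)
    denominator = sum(joint[xx, y, zz] for xx in (0, 1) for zz in zs)
    return numerator / denominator

# X is independent of Z given its only neighbor Y.
for y, z in product((0, 1), repeat=2):
    print(f"y={y} z={z} p(x=1|y,z)={p_x_given(1, y, z):.4f} "
          f"p(x=1|y)={p_x_given(1, y):.4f}")
```

The two printed columns agree for every setting of y and z because the potential on {Y, Z} cancels when conditioning on Y.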

[Figure: graphs over the variables X, Y, and Z]

- “Markov Random Fields” aka “undirected graphs” were born

- Dependency networks are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized)
- They are easy to learn from data (build a separate classification/regression model for each variable)
- Conditional distributions (e.g., trees) are easier to understand than clique potentials
- Over the last decade, they have proven to be a very useful tool for data exploration

- Dependency networks lack a generative story (of the kind found in, e.g., latent Dirichlet allocation)
- They also lack a causal story

[Figure: example directed graph over the variables cold, lung cancer, sore throat, weight loss, and cough]

[Figure: one model per variable, built in a fixed order. 1. Target X, no inputs: p(x). 2. Target Y, input X: splits on X, with leaves p(y|x=0) and p(y|x=1). 3. Target Z, inputs X, Y: splits on Y, with leaves p(z|y=0) and p(z|y=1).]

[Figure: the same three models, shown with the resulting directed graph over X, Y, and Z]

- Random orders
- Greedy search
- Monte-Carlo methods

[Figure: two graphs over the variables X, Y, and Z]


Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982

- Easy to understand
- Useful for adding prior knowledge to an analysis (e.g., causal knowledge)
- The conditional independencies they express make inference more computationally efficient


What is p(z|x=1)?
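For the three-variable example, the model for Z splits only on Y, so Z depends on X only through Y and the query reduces to a sum over Y: p(z | x=1) = Σ_y p(z | y) p(y | x=1). A minimal sketch with made-up probability tables (the numbers are not from the slides):

```python
# Hypothetical conditional probability tables for binary X, Y, and Z.
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}   # p(y | x), indexed [x][y]
p_z_given_y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}   # p(z | y), indexed [y][z]

def p_z_given_x(x):
    """p(z | x) = sum over y of p(z | y) p(y | x)."""
    return {z: sum(p_z_given_y[y][z] * p_y_given_x[x][y] for y in (0, 1))
            for z in (0, 1)}

answer = p_z_given_x(1)
print(answer)
```

With these tables, p(z=1 | x=1) = 0.2 * 0.3 + 0.6 * 0.7 = 0.48. Summing out one variable at a time in this way is the kind of saving that conditional independence buys during inference.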

[Figure: graph over the variables W, X, Y, and Z, shown in three steps]

- Inference is also important because it is the E step of the EM algorithm (used when learning with missing data and/or hidden variables)
- Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter, Lauritzen & Spiegelhalter, Dechter)
- Exact methods fail when there are many cycles in the graph; approximate alternatives include:
  - MCMC (e.g., Geman and Geman 1984)
  - Loopy propagation (e.g., Murphy et al. 1999)
  - Variational methods (e.g., Jordan et al. 1999)

DAGs (directed acyclic graphs) and UGs (undirected graphs):

- Data exploration
- Density estimation
- Clustering

UGs:

- Spatial processes

DAGs:

- Expert systems
- Causal discovery

- Clustering
- Evolutionary history/phylogeny

Example: msnbc.com

User  Sequence
1     frontpage news travel travel
2     news news news news news
3     frontpage news frontpage news frontpage
4     news news
5     frontpage news news travel travel travel
6     news weather weather weather weather weather
7     news health health business business business
8     frontpage sports sports sports weather
Etc.

Millions of users per day

Goal: understand what is and isn’t working on the site

[Figure: pipeline from data, through clustering, to user clusters]

- Cluster users based on their behavior on the site
- Display clusters somehow

[Figure: graphical model with a discrete, hidden Cluster node and nodes for the 1st page, 2nd page, 3rd page, …]


Principles:

- Find the parameters that maximize the (log) likelihood of the data
- Find the parameters whose posterior probability is a maximum
- Find distributions for quantities of interest by averaging over the unknown parameters

Gradient methods or the EM algorithm are typically used for the first two.

Initialize parameters (e.g., at random)

Expectation step: compute probabilities for values of unobserved variable using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret data as set of fractional cases based on these probabilities

Maximization step: choose parameters so as to maximize the log likelihood of the fractional data

Parameters will converge to a local maximum of log p(data)

Suppose cluster model has 2 clusters, and that

p(cluster=1|case,current params) = 0.7

p(cluster=2|case,current params) = 0.3

Then, write

q(case) = 0.7 log p(case, cluster=1 | params) + 0.3 log p(case, cluster=2 | params)

Do this for each case and then find the parameters that maximize Q = Σ_case q(case). These parameters also maximize the log likelihood of the fractional data.
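The E and M steps described above can be sketched for a two-cluster model of page sequences. This is an illustration, not the model used on msnbc.com: it assumes that, given the cluster, every page in a sequence is drawn independently from a cluster-specific distribution (so page order is ignored), and it uses the eight example sequences as a toy data set. The function name `em_mixture` and its parameters are hypothetical.

```python
import math
import random

def em_mixture(cases, pages, k=2, iters=30, seed=0):
    """EM for a mixture in which each cluster has its own page distribution."""
    rng = random.Random(seed)
    prior = [1.0 / k] * k
    emit = []
    for _ in range(k):                       # initialize parameters at random
        weights = [rng.random() + 0.5 for _ in pages]
        total = sum(weights)
        emit.append({p: w / total for p, w in zip(pages, weights)})
    history = []                             # log p(data) before each M step
    for _ in range(iters):
        # E step (inference): p(cluster | case, current params) for each case.
        resp, loglik = [], 0.0
        for case in cases:
            joint = [prior[c] * math.prod(emit[c][p] for p in case)
                     for c in range(k)]
            evidence = sum(joint)
            loglik += math.log(evidence)
            resp.append([j / evidence for j in joint])
        history.append(loglik)
        # M step: maximize the log likelihood of the fractional cases.
        prior = [sum(r[c] for r in resp) / len(cases) for c in range(k)]
        for c in range(k):
            counts = dict.fromkeys(pages, 0.0)
            for case, r in zip(cases, resp):
                for p in case:
                    counts[p] += r[c]        # each case counts fractionally
            norm = sum(counts.values()) or 1.0   # guard against an empty cluster
            emit[c] = {p: n / norm for p, n in counts.items()}
    return prior, emit, resp, history

cases = [
    ["frontpage", "news", "travel", "travel"],
    ["news", "news", "news", "news", "news"],
    ["frontpage", "news", "frontpage", "news", "frontpage"],
    ["news", "news"],
    ["frontpage", "news", "news", "travel", "travel", "travel"],
    ["news", "weather", "weather", "weather", "weather", "weather"],
    ["news", "health", "health", "business", "business", "business"],
    ["frontpage", "sports", "sports", "sports", "weather"],
]
pages = sorted({p for case in cases for p in case})
prior, emit, resp, history = em_mixture(cases, pages)
for user, r in enumerate(resp, start=1):
    print(user, [round(value, 3) for value in r])
```

The returned `resp` holds the fractional cluster memberships from the final E step, and `history` records log p(data) at each iteration, which should never decrease. Different seeds can end in different local maxima.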


Other applications at Microsoft:

- Analyze how people use programs (e.g. Office)
- Analyze web traffic for intruders (anomaly detection)

- Evolutionary history/phylogeny
- Vaccine for AIDS

[Figure: mammalian phylogenetic tree. Labeled groups: Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, Hedgehogs. Species shown: donkey, horse, Indian rhino, white rhino, grey seal, harbor seal, dog, cat, blue whale, fin whale, sperm whale, hippopotamus, sheep, cow, alpaca, pig, little red flying fox, Ryukyu flying fox, horseshoe bat, Japanese pipistrelle, long-tailed bat, Jamaican fruit-eating bat, Asiatic shrew, long-clawed shrew, mole, small Madagascar hedgehog, aardvark, elephant, armadillo, rabbit, pika, tree shrew, bonobo, chimpanzee, man, gorilla, Sumatran orangutan, Bornean orangutan, common gibbon, Barbary ape, baboon, white-fronted capuchin, slow loris, squirrel, dormouse, cane-rat, guinea pig, mouse, rat, vole, hedgehog, gymnure, bandicoot, wallaroo, opossum, platypus.]

[Figure: tree-structured model with hidden internal nodes and observed nodes for species 1, species 2, species 3, …]

- For a given tree, find the maximum-likelihood parameters
- Search over tree structures to find the best likelihood (penalized to avoid overfitting)

Evolution at each DNA nucleotide is independent

EM is computationally efficient

[Figure: one copy of the tree model per nucleotide position: position 1, position 2, …, position N]
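When each nucleotide position evolves independently, the log likelihood of an alignment is a sum of per-position terms, each obtained by summing over the hidden ancestral nucleotides. The sketch below shows this for the simplest possible case, and every modeling choice in it is an assumption for illustration: a single hidden ancestor with three observed species, a uniform prior over the ancestor, a symmetric substitution model with one `change` parameter, and made-up sequences.

```python
import math

NUCLEOTIDES = "ACGT"

def transition(parent, child, change=0.1):
    """Hypothetical substitution model: each specific change has probability `change`."""
    return 1.0 - 3.0 * change if parent == child else change

def site_likelihood(column, change=0.1):
    """p(observed column), summing over the hidden ancestor's nucleotide."""
    return sum(0.25 * math.prod(transition(h, obs, change) for obs in column)
               for h in NUCLEOTIDES)

def log_likelihood(alignment, change=0.1):
    """Positions are independent, so the log likelihood is a sum over positions."""
    columns = zip(*alignment)    # one tuple of observed nucleotides per position
    return sum(math.log(site_likelihood(col, change)) for col in columns)

# Made-up aligned sequences for three species.
alignment = ["ACGTAC", "ACGTAA", "ACCTAC"]
print(log_likelihood(alignment))
```

Because the computation factors by position, its cost grows linearly with the length of the alignment.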

- Each substitution depends on the substitution at the previous position
- This structure captures context specific effects during evolution
- EM is computationally intractable

Lower bound good enough for EM-like algorithm

[Figure: two approximating structures over nodes labeled h and o: a product of trees and a product of chains]

- Factor graphs, mixed graphs, etc.
- Relational learning: PRMs, Plates, PERs
- Bayesian methods for learning
- Scalability
- Causal modeling
- Variational methods
- Non-parametric distributions

Main conferences:

- Uncertainty in Artificial Intelligence (UAI)
- Neural Information Processing Systems (NIPS)