## PowerPoint Slideshow about 'Graphical Models' - Jimmy


Presentation Transcript

Overview

- Intro to graphical models
- Application: Data exploration
- Dependency networks and undirected graphs
- Directed acyclic graphs (“Bayes nets”)
- Applications
- Clustering
- Evolutionary history/phylogeny

Using classification/regression for data exploration

p(target|inputs)

Decision tree:

[Figure: decision tree with the branches male/female and young/old and the leaves p(cust)=0.8, p(cust)=0.7, and p(cust)=0.2]

Logistic regression:

log [p(cust) / (1 - p(cust))] = 0.2 - 0.3*male + 0.4*old
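To read the logistic regression as probabilities, the log-odds can be inverted with the logistic function. The sketch below assumes `male` and `old` are 0/1 indicators (the slide does not state the coding) and uses the coefficients from the slide; the function name `p_cust` is only illustrative.

```python
import math

# Coefficients copied from the logistic regression on the slide.
INTERCEPT = 0.2
COEF_MALE = -0.3
COEF_OLD = 0.4


def p_cust(male: int, old: int) -> float:
    """Return p(cust) for 0/1 indicator inputs (the 0/1 coding is an assumption)."""
    log_odds = INTERCEPT + COEF_MALE * male + COEF_OLD * old
    # Invert log(p / (1 - p)) = log_odds.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    for male in (0, 1):
        for old in (0, 1):
            print(f"male={male} old={old} p(cust)={p_cust(male, old):.3f}")
```

Under this coding, being male lowers the predicted probability and being old raises it, which is the kind of reading that makes such a model useful for exploration.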

Neural network:

Conditional independence

Decision tree:

[Figure: decision tree with the branches male/female and young/old and the leaves p(cust)=0.8, p(cust)=0.7, and p(cust)=0.2]

p(cust | gender, age, month born)=p(cust | gender, age)

p(target | all inputs) = p(target | some inputs)

Learning conditional independence from data: Model Selection

- Cross validation
- Bayesian methods
- Penalized likelihood
- Minimum description length
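As one illustration of penalized likelihood, the sketch below scores three candidate input sets for the target `cust` with a BIC-style penalty and keeps the best one. Everything concrete here is an assumption made for the example: the counts in `CUSTOMERS_PER_100`, the binary stand-in `born_first_half` for month born, and the use of a full conditional table instead of a decision tree.

```python
import math
from collections import Counter

# Hypothetical data: customers per 100 people in each (male, old) cell.
CUSTOMERS_PER_100 = {(0, 0): 80, (0, 1): 80, (1, 0): 70, (1, 1): 20}


def make_rows():
    """Build rows in which birth half-year is irrelevant by construction."""
    rows = []
    for (male, old), n_cust in CUSTOMERS_PER_100.items():
        for born_first_half in (0, 1):
            for i in range(100):
                rows.append({"male": male, "old": old,
                             "born_first_half": born_first_half,
                             "cust": 1 if i < n_cust else 0})
    return rows


def bic_score(rows, input_names, target="cust"):
    """Log likelihood of a table p(target | inputs) minus a BIC penalty."""
    counts = Counter()
    for row in rows:
        config = tuple(row[name] for name in input_names)
        counts[(config, row[target])] += 1
    configs = {config for config, _ in counts}
    loglik = 0.0
    for config in configs:
        n = counts[(config, 0)] + counts[(config, 1)]
        for value in (0, 1):
            c = counts[(config, value)]
            if c > 0:
                loglik += c * math.log(c / n)
    # One free parameter per input configuration (binary target).
    return loglik - 0.5 * len(configs) * math.log(len(rows))


CANDIDATES = [("male",), ("male", "old"), ("male", "old", "born_first_half")]


def best_inputs(rows):
    """Return the candidate input set with the highest penalized score."""
    return max(CANDIDATES, key=lambda names: bic_score(rows, names))


if __name__ == "__main__":
    print(best_inputs(make_rows()))
```

Adding `born_first_half` leaves the fit unchanged but doubles the number of parameters, so the penalty rejects it; dropping `old` costs far more likelihood than the penalty saves.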

Using classification/regression for data exploration

- Suppose you have thousands of variables and you’re not sure about the interactions among those variables
- Build a classification/regression model for each variable, using the rest of the variables as inputs

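Example with three variables X, Y, and Z

- Target: X, inputs: Y, Z. The tree splits on Y, with leaves p(x|y=0) and p(x|y=1).
- Target: Y, inputs: X, Z. The tree splits on X; the X=1 branch is the leaf p(y|x=1), and the X=0 branch splits on Z, with leaves p(y|x=0,z=0) and p(y|x=0,z=1).
- Target: Z, inputs: X, Y. The tree splits on Y, with leaves p(z|y=0) and p(z|y=1).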

Summarize the trees with a single graph

[Figure: the three trees for the targets X, Y, and Z, summarized by a graph over the nodes X, Y, and Z. The tree for X uses Y, the tree for Y uses X and Z, and the tree for Z uses Y.]

Dependency Network

- Build a classification/regression model for every variable given the other variables as inputs
- Construct a graph where
- Nodes correspond to variables
- There is an arc from X to Y if X helps to predict Y
- The graph along with the individual classification/regression model is a “dependency network”

(Heckerman, Chickering, Meek, Rounthwaite, Cadie 2000)
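The construction above can be sketched in a few lines. This is a minimal illustration, not the method of the cited paper: it replaces the decision trees with full conditional tables, decides whether a variable "helps to predict" by greedy forward selection under a BIC-style penalty, and runs on hypothetical counts in `COUNTS` chosen so that X and Z are independent given Y.

```python
import math
from collections import Counter

# Hypothetical counts for (x, y, z), built so that X and Z are independent given Y.
COUNTS = {(0, 0, 0): 360, (0, 0, 1): 40, (0, 1, 0): 10, (0, 1, 1): 90,
          (1, 0, 0): 90, (1, 0, 1): 10, (1, 1, 0): 40, (1, 1, 1): 360}
ROWS = [{"X": x, "Y": y, "Z": z}
        for (x, y, z), n in COUNTS.items() for _ in range(n)]


def bic_score(rows, target, inputs):
    """Penalized log likelihood of a table p(target | inputs), binary target."""
    counts = Counter((tuple(r[i] for i in inputs), r[target]) for r in rows)
    configs = {config for config, _ in counts}
    loglik = 0.0
    for config in configs:
        n = counts[(config, 0)] + counts[(config, 1)]
        for value in (0, 1):
            c = counts[(config, value)]
            if c > 0:
                loglik += c * math.log(c / n)
    return loglik - 0.5 * len(configs) * math.log(len(rows))


def select_inputs(rows, target, candidates):
    """Greedily add the input that most improves the score, until none does."""
    chosen = []
    best = bic_score(rows, target, chosen)
    improved = True
    while improved:
        improved = False
        for name in candidates:
            if name in chosen:
                continue
            score = bic_score(rows, target, chosen + [name])
            if score > best:
                best, best_name, improved = score, name, True
        if improved:
            chosen.append(best_name)
    return chosen


def dependency_network(rows, variables):
    """Return arcs (input, target): one local model per variable."""
    arcs = set()
    for target in variables:
        others = [v for v in variables if v != target]
        for name in select_inputs(rows, target, others):
            arcs.add((name, target))
    return arcs


if __name__ == "__main__":
    print(sorted(dependency_network(ROWS, ["X", "Y", "Z"])))
```

On these counts the sketch finds arcs in both directions between X and Y and between Y and Z, and none between X and Z, matching the three-variable example.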

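Example: TV viewing

Nielsen data: 2/6/95-2/19/95

Goal: exploratory data analysis (acausal)

~400 shows, ~3000 viewers

| | Age | Show1 | Show2 | Show3 | ... |
|---|---|---|---|---|---|
| viewer 1 | 73 | y | n | n | |
| viewer 2 | 16 | n | y | y | ... |
| viewer 3 | 35 | n | n | n | |
| etc. | | | | | |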

A bit of history

- Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s
- But they didn’t like them, because they could be inconsistent

A consistent dependency network

[Figure: trees for the targets X (inputs Y, Z), Y (inputs X, Z), and Z (inputs X, Y), together with the graph over the nodes X, Y, and Z]

An inconsistent dependency network

[Figure: trees for the targets X (inputs Y, Z), Y (inputs X, Z), and Z (inputs X, Y), together with the graph over the nodes X, Y, and Z]

A bit of history

- Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s
- But they didn’t like them, because they could be inconsistent
- So they used a property of consistent dependency networks to develop a new characterization of them

Conditional independence

[Figure: trees for the targets X (inputs Y, Z), Y (inputs X, Z), and Z (inputs X, Y), together with the graph over the nodes X, Y, and Z]

X ⊥ Z | Y (X is independent of Z given Y)

Conditional independence in a dependency network

Each variable is independent of all other variables given its immediate neighbors

Hammersley-Clifford Theorem (Besag 1974)

- Given a set of variables which has a positive joint distribution
- Where each variable is independent of all other variables given its immediate neighbors in some graph G
- It follows that

$$p(x) = \frac{1}{Z} \prod_{i=1}^{n} f_i(c_i)$$

where c1, c2, …, cn are the maximal cliques in the graph G, the functions f_i are the “clique potentials”, and Z is a normalizing constant.

A bit of history

- Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s
- But they didn’t like them, because they could be inconsistent
- So they used a property of consistent dependency networks to develop a new characterization of them
- “Markov Random Fields” aka “undirected graphs” were born

Inconsistent dependency networks aren’t that bad

- They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized)
- They are easy to learn from data (build separate classification/regression model for each variable)
- Conditional distributions (e.g., trees) are easier to understand than clique potentials
- Over the last decade, they have proven to be a very useful tool for data exploration

Shortcomings of undirected graphs

- Lack a generative story (e.g., Latent Dirichlet Allocation)
- Lack a causal story

[Figure: example graph over the variables cold, lung cancer, sore throat, weight loss, and cough]

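Solution: Build trees in some order

1. Target: X, inputs: none. The model is the marginal p(x).
2. Target: Y, inputs: X. The tree splits on X, with leaves p(y|x=0) and p(y|x=1).
3. Target: Z, inputs: X, Y. The tree splits on Y, with leaves p(z|y=0) and p(z|y=1).

[Figure: the corresponding graph over the nodes X, Y, and Z]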

Joint distribution is easy to obtain

[Figure: the ordered trees for X (inputs: none), Y (inputs: X), and Z (inputs: X, Y), together with the graph over the nodes X, Y, and Z]
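Because the trees are built in order, the chain rule gives the joint directly: p(x, y, z) = p(x) p(y|x) p(z|x,y), and since the tree for Z only splits on Y this reduces to p(x) p(y|x) p(z|y). The sketch below multiplies the three local models together; the probabilities are hypothetical, chosen only to make the example concrete.

```python
from itertools import product

# Hypothetical parameters for the three ordered models.
P_X1 = 0.5                       # p(x=1)
P_Y1_GIVEN_X = {0: 0.2, 1: 0.8}  # p(y=1 | x)
P_Z1_GIVEN_Y = {0: 0.1, 1: 0.9}  # p(z=1 | y); the tree for Z ignores X


def bernoulli(p_one: float, value: int) -> float:
    """Probability of `value` for a binary variable with p(value=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(x: int, y: int, z: int) -> float:
    """p(x, y, z) = p(x) * p(y | x) * p(z | y)."""
    return (bernoulli(P_X1, x)
            * bernoulli(P_Y1_GIVEN_X[x], y)
            * bernoulli(P_Z1_GIVEN_Y[y], z))


def p_x_given_yz(x: int, y: int, z: int) -> float:
    """p(x | y, z), obtained by normalizing the joint over x."""
    return joint(x, y, z) / sum(joint(v, y, z) for v in (0, 1))


if __name__ == "__main__":
    total = sum(joint(x, y, z) for x, y, z in product((0, 1), repeat=3))
    print("sum of joint:", total)
    for y, z in product((0, 1), repeat=2):
        print(f"p(x=1 | y={y}, z={z}) = {p_x_given_yz(1, y, z):.3f}")
```

The printed conditionals do not change with z once y is fixed, which is the conditional independence of X and Z given Y that this ordering encodes.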

Directed Acyclic Graphs (aka Bayes Nets)

Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982

The power of graphical models

- Easy to understand
- Useful for adding prior knowledge to an analysis (e.g., causal knowledge)
- The conditional independencies they express make inference more computationally efficient

Inference

- Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables)
- Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter, Lauritzen & Spiegelhalter, Dechter)
- Exact methods fail when there are many cycles in the graph
- MCMC (e.g., Geman and Geman 1984)
- Loopy propagation (e.g., Murphy et al. 1999)
- Variational methods (e.g., Jordan et al. 1999)
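As a small illustration of the MCMC option, the sketch below runs a Gibbs sampler on a three-variable chain X → Y → Z with Z observed and compares the estimate of p(x=1 | z=1) with the exact answer from enumeration. The model and its probabilities are hypothetical, and a model this small does not need approximation; the point is only to show the resampling step.

```python
import random

# Hypothetical chain X -> Y -> Z with binary variables.
P_X1 = 0.5
P_Y1_GIVEN_X = {0: 0.2, 1: 0.8}
P_Z1_GIVEN_Y = {0: 0.1, 1: 0.9}


def bernoulli(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


def joint(x, y, z):
    """p(x, y, z) = p(x) * p(y | x) * p(z | y)."""
    return (bernoulli(P_X1, x)
            * bernoulli(P_Y1_GIVEN_X[x], y)
            * bernoulli(P_Z1_GIVEN_Y[y], z))


def exact_p_x1_given_z(z_obs):
    """Exact p(x=1 | z) by summing the joint over the hidden variables."""
    num = sum(joint(1, y, z_obs) for y in (0, 1))
    den = sum(joint(x, y, z_obs) for x in (0, 1) for y in (0, 1))
    return num / den


def gibbs_p_x1_given_z(z_obs, n_sweeps=20000, burn_in=1000, seed=0):
    """Estimate p(x=1 | z) by resampling x and y in turn with z clamped."""
    rng = random.Random(seed)
    x, y = 0, 0
    hits = 0
    for sweep in range(n_sweeps + burn_in):
        # Resample x from p(x | y, z), which is proportional to the joint.
        p1, p0 = joint(1, y, z_obs), joint(0, y, z_obs)
        x = 1 if rng.random() < p1 / (p0 + p1) else 0
        # Resample y from p(y | x, z) in the same way.
        p1, p0 = joint(x, 1, z_obs), joint(x, 0, z_obs)
        y = 1 if rng.random() < p1 / (p0 + p1) else 0
        if sweep >= burn_in:
            hits += x
    return hits / n_sweeps


if __name__ == "__main__":
    print("exact:", exact_p_x1_given_z(1))
    print("gibbs:", gibbs_p_x1_given_z(1))
```

Each update needs only the factors that mention the variable being resampled, which is how the conditional independencies in the graph keep the sampler cheap on larger models.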

Applications of Graphical Models

DAGs and UGs:

- Data exploration
- Density estimation
- Clustering

UGs:

- Spatial processes

DAGs:

- Expert systems
- Causal discovery

Applications

- Clustering
- Evolutionary history/phylogeny

Clustering

Example: msnbc.com

| User | Sequence |
|---|---|
| 1 | frontpage news travel travel |
| 2 | news news news news news |
| 3 | frontpage news frontpage news frontpage |
| 4 | news news |
| 5 | frontpage news news travel travel travel |
| 6 | news weather weather weather weather weather |
| 7 | news health health business business business |
| 8 | frontpage sports sports sports weather |

Etc.

Millions of users per day

Goal: understand what is and isn’t working on the site

Solution

[Figure: data → Cluster → User clusters]

- Cluster users based on their behavior on the site
- Display clusters somehow

Generative model for clustering (e.g., AutoClass, Cheeseman & Stutz 1995)

[Figure: a discrete, hidden Cluster node and the observed nodes 1st page, 2nd page, 3rd page, …]

Sequence Clustering (Cadez, Heckerman, Meek, & Smyth, 2000)

[Figure: a discrete, hidden Cluster node and the observed nodes 1st page, 2nd page, 3rd page, …]

Learning parameters (with missing data)

Principles:

- Find the parameters that maximize the (log) likelihood of the data
- Find the parameters whose posterior probability is a maximum
- Find distributions for quantities of interest by averaging over the unknown parameters

Gradient methods or the EM algorithm are typically used for the first two

Expectation-Maximization (EM) algorithm

Dempster, Laird, Rubin 1977

Initialize parameters (e.g., at random)

Expectation step: compute probabilities for values of unobserved variable using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret data as set of fractional cases based on these probabilities

Maximization step: choose parameters so as to maximize the log likelihood of the fractional data

Parameters will converge to a local maximum of log p(data)
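The steps above can be sketched for a small clustering model. The example is an assumption-laden illustration, not the model in the cited papers: each cluster is a single distribution over page categories, so page order is ignored; the initialization is deterministic rather than random so that runs are reproducible; the data are the eight example sequences from the msnbc.com slide.

```python
import math

SEQUENCES = [
    "frontpage news travel travel",
    "news news news news news",
    "frontpage news frontpage news frontpage",
    "news news",
    "frontpage news news travel travel travel",
    "news weather weather weather weather weather",
    "news health health business business business",
    "frontpage sports sports sports weather",
]
DATA = [s.split() for s in SEQUENCES]
PAGES = sorted({p for seq in DATA for p in seq})


def safe_log(p):
    return math.log(p) if p > 0.0 else float("-inf")


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def em(data, pages, n_clusters=2, n_iter=25):
    """EM for a mixture of page distributions; returns parameters and history."""
    weights = [1.0 / n_clusters] * n_clusters
    probs = []
    for k in range(n_clusters):
        # Slightly different starting point per cluster to break symmetry.
        raw = [1.0 + 0.1 * ((i + k) % n_clusters) for i in range(len(pages))]
        probs.append({p: r / sum(raw) for p, r in zip(pages, raw)})
    history = []
    for _ in range(n_iter):
        # E step: p(cluster | case, current params) for every case.
        resp, loglik = [], 0.0
        for seq in data:
            logs = [safe_log(weights[k]) + sum(safe_log(probs[k][p]) for p in seq)
                    for k in range(n_clusters)]
            norm = log_sum_exp(logs)
            loglik += norm
            resp.append([math.exp(v - norm) for v in logs])
        history.append(loglik)
        # M step: maximum likelihood estimates from the fractional cases.
        for k in range(n_clusters):
            weights[k] = sum(r[k] for r in resp) / len(data)
            counts = {p: 0.0 for p in pages}
            for seq, r in zip(data, resp):
                for p in seq:
                    counts[p] += r[k]
            total = sum(counts.values())
            probs[k] = {p: c / total for p, c in counts.items()}
    return weights, probs, resp, history


if __name__ == "__main__":
    weights, probs, resp, history = em(DATA, PAGES)
    print("log likelihood per iteration:", [round(v, 3) for v in history])
    for user, r in enumerate(resp, start=1):
        print(user, [round(v, 3) for v in r])
```

The recorded log likelihood should never go down from one iteration to the next, which is a quick sanity check on any EM implementation.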

E-step

Suppose cluster model has 2 clusters, and that

p(cluster=1|case,current params) = 0.7

p(cluster=2|case,current params) = 0.3

Then, write

q(case) = 0.7 log p(case,cluster=1|params) + 0.3 log p(case,cluster=2|params)

Do this for each case and then find the parameters that maximize Q = Σ_case q(case). Maximizing Q never decreases the log likelihood, so repeating the E and M steps improves it until convergence.
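Written out in general form, with θ for the parameters being optimized and θ^old for the current parameters (this notation is introduced here and is not on the slides), the quantity Q and its relation to the log likelihood are:

```latex
Q(\theta) = \sum_{\text{case}} \sum_{k}
    p(\text{cluster}=k \mid \text{case}, \theta^{\text{old}})
    \, \log p(\text{case}, \text{cluster}=k \mid \theta)

\log p(\text{data} \mid \theta) - \log p(\text{data} \mid \theta^{\text{old}})
    \;\ge\; Q(\theta) - Q(\theta^{\text{old}})
```

The second line is the standard EM inequality: any θ that increases Q also increases the log likelihood by at least as much, so each M step is an improvement even though it does not by itself reach the maximum of the log likelihood.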

Demo: SQL Server 2005

Sequence clustering

Other applications at Microsoft:

- Analyze how people use programs (e.g. Office)
- Analyze web traffic for intruders (anomaly detection)

Computational biology applications

- Evolutionary history/phylogeny
- Vaccine for AIDS

Evolutionary History/Phylogeny

Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004

[Figure: phylogenetic tree of mammal species, with leaves ranging from Donkey and Horse to Bandicoot, Wallaroo, Opossum, and Platypus, and labeled groups Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, and Hedgehogs]

Learning phylogeny from data

- For a given tree, find max likelihood parameters
- Search over structure to find best likelihood (penalized to avoid overfitting)

Strong simplifying assumption

Evolution at each DNA nucleotide is independent

EM is computationally efficient

[Figure: one copy of the model for each of nucleotide position 1, nucleotide position 2, …, nucleotide position N]

Relaxing the assumption

- Each substitution depends on the substitution at the previous position
- This structure captures context-specific effects during evolution
- EM is computationally intractable

Variational approximation for inference

Lower bound good enough for EM-like algorithm
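The lower bound in question is presumably the standard variational bound, which follows from Jensen's inequality. In the notation below (introduced here, not taken from the slides), h stands for the hidden variables, θ for the parameters, and q for any distribution over h:

```latex
\log p(\text{data} \mid \theta)
    \;\ge\; \sum_{h} q(h) \, \log \frac{p(\text{data}, h \mid \theta)}{q(h)}
```

The bound is tight when q(h) = p(h | data, θ), which is the exact E step. When that posterior is intractable, q is restricted to a simpler family and the algorithm alternates between improving the bound over q and over θ. Which family is used in the cited work is not stated in these slides.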

Things I didn’t have time to talk about

- Factor graphs, mixed graphs, etc.
- Relational learning: PRMs, Plates, PERs
- Bayesian methods for learning
- Scalability
- Causal modeling
- Variational methods
- Non-parametric distributions

To learn more

Main conferences:

- Uncertainty in Artificial Intelligence (UAI)
- Neural Information Processing Systems (NIPS)
