THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY • COMP 5213: Introduction to Bayesian Networks • L12: Latent Tree Models and Multidimensional Clustering • Nevin L. Zhang • Room 3504, phone: 2358-7015, Email: lzhang@cs.ust.hk • Home page
Outline • Latent Tree Models • Definition • Generalizing finite mixture models • Generalizing phylogenetic trees • Attractive representation of joint distributions • Basic Properties • Learning Algorithms • Applications • Latent structure discovery • Multidimensional clustering • Probabilistic inference • Readings: http://www.cse.ust.hk/~lzhang/ltm/index.htm
Latent Tree Models (LTM) • Bayesian networks with • Rooted tree structure • Leaves observed (manifest variables) • Discrete or continuous • Internal nodes latent (latent variables) • Discrete • Also known as hierarchical latent class (HLC) models • Parameters: P(Y1), P(Y2|Y1), P(X1|Y2), P(X2|Y2), … (see the sketch below)
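For concreteness, a minimal numeric sketch (variable names and CPT values are made up for illustration) of how an LTM's joint distribution factorizes into a root marginal plus one conditional table per edge, and how summing out the latent variables gives the distribution over the manifest variables:

```python
import numpy as np

# Toy LTM: latent root Y1 -> latent Y2 -> observed X1, X2 (all binary).
p_y1 = np.array([0.6, 0.4])          # P(Y1)
p_y2_y1 = np.array([[0.7, 0.3],      # P(Y2 | Y1), rows index Y1
                    [0.2, 0.8]])
p_x1_y2 = np.array([[0.9, 0.1],      # P(X1 | Y2), rows index Y2
                    [0.3, 0.7]])
p_x2_y2 = np.array([[0.8, 0.2],      # P(X2 | Y2), rows index Y2
                    [0.1, 0.9]])

# Joint over (Y1, Y2, X1, X2): root marginal times one CPT per edge.
joint = (p_y1[:, None, None, None]
         * p_y2_y1[:, :, None, None]
         * p_x1_y2[None, :, :, None]
         * p_x2_y2[None, :, None, :])

# Distribution over the manifest variables: sum out the latent Y1, Y2.
p_x = joint.sum(axis=(0, 1))
print(p_x, p_x.sum())   # a proper distribution over (X1, X2); sums to 1
```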
Example with Continuous Leaves • A leaf node can contain • One discrete observed variable, • One continuous observed variable, or • Multiple continuous observed variables.
Example • Manifest variables • Math Grade, Science Grade, Literature Grade, History Grade • Latent variables • Analytic Skill, Literal Skill, Intelligence
More General Tree Models • Some internal nodes can be observed • Internal nodes can be continuous • … • We do not consider such models
Outline • Latent Tree Models • Definition • Generalizing finite mixture models • Generalizing phylogenetic trees • Attractive representation of joint distributions • Basic Properties • Learning Algorithms • Applications • Latent structure discovery • Multidimensional clustering • Probabilistic inference • Readings: http://www.cse.ust.hk/~lzhang/ltm/index.htm
Finite Mixture Models • Gaussian mixture models and latent class models • Each contains one latent variable • Each produces one partition of the data (see the sketch below)
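A quick illustration of "one latent variable, one partition", as a minimal sketch using scikit-learn's GaussianMixture (the synthetic data and component count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: two Gaussian blobs in 2-D.
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(4, 1, (100, 2))])

# A finite mixture model has ONE latent variable (the component index),
# so fitting it yields ONE partition: one cluster label per data point.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
```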
How to Cluster Those? • [Slides show a collection of pictures] • One possible partition: by style of picture • Another possible partition: by type of object in picture
How to Cluster Those? • Need multiple partitions • Complex data usually • Have multiple facets • Can be meaningfully clustered in multiple ways
LTMs and Multidimensional Clustering • An LTM contains multiple latent variables • Each represents a partition of the data • Hence, LTMs can be used to produce multiple partitions of data • This is called multidimensional clustering, with each latent variable being one dimension.
From FMMs to LTMs • Start with several FMMs • Each based on a distinct subset of attributes • Each giving a partition from a certain perspective • The different partitions are independent of each other • Link them up to form a tree model • The result is an LTM • It considers the different perspectives in a single model (see the sketch below)
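A minimal sketch of the starting point described above: two independent finite mixture models, each fit to a hypothetical subset of attributes, each yielding its own partition. The attribute split and component counts are assumptions for illustration; an LTM would go further and link the latent variables into a tree.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))   # toy data: 200 records, 6 attributes
A, B = X[:, :3], X[:, 3:]       # two hypothetical attribute subsets

# One finite mixture model per subset, each producing its own partition
# of the same 200 records (component counts are arbitrary here).
labels_a = GaussianMixture(n_components=2, random_state=0).fit_predict(A)
labels_b = GaussianMixture(n_components=3, random_state=0).fit_predict(B)

# Each record now carries one label per perspective: a multidimensional
# clustering. An LTM models both perspectives jointly instead of
# treating the two partitions as independent.
multi_labels = list(zip(labels_a, labels_b))
```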
Outline • Latent Tree Models • Definition • Generalizing finite mixture models • Generalizing phylogenetic trees • Attractive representation of joint distributions • Basic Properties • Learning Algorithms • Applications • Latent structure discovery • Multidimensional clustering • Probabilistic inference • Readings: http://www.cse.ust.hk/~lzhang/ltm/index.htm
Phylogeny • Assumption • All organisms on Earth have a common ancestor • This implies that any set of species is related • Phylogeny • The evolutionary relationship among any set of species • Phylogenetic tree • Usually, the relationship can be represented by a tree, called a phylogenetic (evolution) tree • This is not always true
Phylogenetic Trees • TAXA (sequences) identify species • Edge lengths represent evolution time • Assumption: bifurcating tree topology
Probabilistic Models of Evolution • Characterize the relationship between taxa using substitution probabilities • P(x | y, t): probability that ancestral sequence y evolves into sequence x along an edge of length t • P(X7), P(X5|X7, t5), P(X6|X7, t6), P(S1|X5, t1), P(S2|X5, t2), …
Probabilistic Models of Evolution • What should P(x|y, t) be? • Two assumptions of commonly used models • There are only substitutions, no insertions/deletions (sequences are aligned) • One-to-one correspondence between sites in different sequences • Each site evolves independently and identically • P(x|y, t) = ∏_{i=1..m} P(x(i) | y(i), t) • m is the sequence length
Probabilistic Models of Evolution • What should P(x(i)|y(i), t) be? • Jukes-Cantor (character evolution) model [1969], with substitution rate α: • P(x(i) | y(i), t) = 1/4 + (3/4) e^{-4αt} if x(i) = y(i) • P(x(i) | y(i), t) = 1/4 − (1/4) e^{-4αt} if x(i) ≠ y(i) • Is the rate α a constant or a free parameter? (See the sketch below.)
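A small sketch of the two slides above, assuming one common parameterization of the Jukes-Cantor model with substitution rate α: the per-site transition probability, and the sequence-level probability as a product over aligned sites.

```python
import numpy as np

def jc69(x_i, y_i, t, alpha=1.0):
    """Jukes-Cantor (1969) substitution probability P(x(i) | y(i), t).

    alpha: per-site substitution rate; t: edge length (evolution time).
    """
    same = 0.25 + 0.75 * np.exp(-4.0 * alpha * t)
    diff = 0.25 - 0.25 * np.exp(-4.0 * alpha * t)
    return same if x_i == y_i else diff

def seq_prob(x, y, t, alpha=1.0):
    """P(x | y, t) under the site-independence assumption:
    a product over the m aligned sites."""
    p = 1.0
    for x_i, y_i in zip(x, y):
        p *= jc69(x_i, y_i, t, alpha)
    return p

# At t = 0 identical sequences have probability 1; as t grows,
# every site tends to the uniform 1/4.
print(seq_prob("ACGT", "ACGA", t=0.1))
```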
Phylogenetic Trees are Special LTMs • The structure is a binary tree • All the variables share the same state space • The conditional probabilities come from the character evolution model, parameterized by edge lengths instead of the usual free parameterization
Outline • Latent Tree Models • Definition • Generalizing finite mixture models • Generalizing phylogenetic trees • Attractive representation of joint distributions • Basic Properties • Learning Algorithms • Applications • Latent structure discovery • Multidimensional clustering • Probabilistic inference • Readings: http://www.cse.ust.hk/~lzhang/ltm/index.htm
Attractive Representation of Joint Distributions • Characteristics of LTMs • Computationally very simple to work with • Can represent complex relationships among manifest variables • Hence a useful tool for density estimation
What can LTMs be Used for? • Generalizing finite mixture models • Tool for multidimensional clustering • Generalizing phylogenetic trees • Tool for latent structure discovery • Attractive representation of joint distributions • Tool for density estimation (general probabilistic modeling)
Outline • Latent Tree Models • Definition • Generalizing finite mixture models • Generalizing phylogenetic trees • Attractive representation of joint distributions • Basic Properties • Learning Algorithms • Applications • Latent structure discovery • Multidimensional clustering • Probabilistic inference • Readings: http://www.cse.ust.hk/~lzhang/ltm/index.htm
Root Walking and Model Equivalence • M1: root walks to X2; M2: root walks to X3 • Root walking leads to models that are equivalent over the manifest variables (see the numeric sketch below) • Implications: • Cannot determine edge orientations from data • Can only learn unrooted models
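A numeric sketch of why root walking preserves the distribution: rerooting an edge is just Bayes' rule, so the joint is unchanged. (Two-variable toy example with made-up numbers; the real operation walks the root along one edge of a larger tree in exactly this way.)

```python
import numpy as np

# Model M1: root Y1 with child Y2 (made-up CPTs).
p_y1 = np.array([0.3, 0.7])
p_y2_y1 = np.array([[0.9, 0.1],
                    [0.4, 0.6]])
joint_m1 = p_y1[:, None] * p_y2_y1      # P(Y1, Y2), indexed [y1, y2]

# "Walk" the root from Y1 to Y2: reparameterize via Bayes' rule.
p_y2 = joint_m1.sum(axis=0)             # new root marginal P(Y2)
p_y1_y2 = joint_m1 / p_y2               # new edge CPT P(Y1 | Y2)
joint_m2 = p_y2[None, :] * p_y1_y2      # P(Y1, Y2) under the new root

print(np.allclose(joint_m1, joint_m2))  # True: same joint distribution,
                                        # so data cannot orient the edge
```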
Regularity • We can focus on regular models only • Irregular models can be made regular • The regularized models are better than the irregular originals • Theorem: The set of all regular models over a given set of manifest variables is finite.
Outline • Latent Tree Models • Definition • Generalizing finite mixture models • Generalizing phylogenetic trees • Attractive representation of joint distributions • Basic Properties • Learning Algorithms • Applications • Latent structure discovery • Multidimensional clustering • Probabilistic inference • Readings: http://www.cse.ust.hk/~lzhang/ltm/index.htm
Learning Latent Tree Models • Determine: • Number of latent variables • Cardinality of each latent variable • Model structure • Conditional probability distributions
Different Types of Algorithms • Search algorithms • Clustering of manifest variables • Generalizations of phylogenetic tree reconstruction algorithms, particularly Neighbor-Joining
Model Selection • Bayesian score: posterior probability P(m|D) • P(m|D) = P(m) ∫ P(D|m, θ) P(θ|m) dθ / P(D) • BIC score: a large-sample approximation (see the sketch below) • BIC(m|D) = log P(D|m, θ*) − (d/2) log N • d: standard dimension, the number of free parameters • BICe score: • BICe(m|D) = log P(D|m, θ*) − (de/2) log N • de: effective dimension • Effective dimensions are difficult to compute, so BICe is not practical
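A sketch of the BIC score exactly as defined above (the log-likelihoods, dimensions, and sample size are illustrative numbers only):

```python
import numpy as np

def bic(loglik, d, n):
    """BIC(m|D) = log P(D|m, theta*) - (d/2) * log N.

    loglik: maximized log-likelihood log P(D|m, theta*)
    d: standard dimension (number of free parameters)
    n: sample size N
    """
    return loglik - 0.5 * d * np.log(n)

# A bigger model with a slightly better fit can still lose on BIC
# because of the penalty term.
print(bic(loglik=-1200.0, d=20, n=500))   # about -1262.1
print(bic(loglik=-1195.0, d=45, n=500))   # about -1334.8
```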
Effective Dimension • Standard dimension: • Number of free parameters • Effective dimension: • X1, X2, …, Xn: observed variables • For each value of the parameters, P(X1, X2, …, Xn) is a point in a high-dimensional space • As the parameter values vary, these points span a manifold • Effective dimension: the dimension of that manifold (see the sketch below) • Parsimonious model: • Standard dimension = effective dimension • Open question: How to test parsimony?
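One concrete way to see the effective dimension: it is the generic rank of the Jacobian of the map from parameters to the joint distribution over the observed variables. Below is a finite-difference sketch for a tiny latent class model chosen for illustration; its standard dimension is 5 but its effective dimension is 3, so it is not parsimonious.

```python
import numpy as np

# Latent class model: binary latent Y, two binary observed X1, X2.
# theta = (P(Y=1), P(X1=1|Y=0), P(X1=1|Y=1), P(X2=1|Y=0), P(X2=1|Y=1)):
# standard dimension 5.
def joint(theta):
    w, a0, a1, b0, b1 = theta
    p_y = np.array([1 - w, w])
    p_x1 = np.array([[1 - a0, a0], [1 - a1, a1]])   # rows index Y
    p_x2 = np.array([[1 - b0, b0], [1 - b1, b1]])
    # P(X1, X2) = sum_y P(y) P(x1|y) P(x2|y)
    return np.einsum('y,yi,yj->ij', p_y, p_x1, p_x2).ravel()

# Effective dimension = rank of the Jacobian of theta -> P(X1, X2)
# at a generic parameter point (central finite differences).
theta = np.array([0.3, 0.2, 0.7, 0.6, 0.1])
eps = 1e-6
jac = np.array([(joint(theta + eps * e) - joint(theta - eps * e)) / (2 * eps)
                for e in np.eye(5)]).T
print(np.linalg.matrix_rank(jac, tol=1e-8))  # 3 < 5: not parsimonious
```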
Effective Dimension • Paper: • N. L. Zhang and T. Kocka (2004). Effective dimensions of hierarchical latent class models. Journal of Artificial Intelligence Research, 21: 1-17. • Open question: the effective dimension of an LTM with one latent variable
Model Selection • Other choices • Cheeseman-Stutz (CS): reduces the impact of the approximation error in BIC • AIC • Holdout likelihood • (Cross-validation: too expensive) • Simulation studies indicate that • BIC and CS result in good models • AIC and holdout likelihood do not • Therefore, we chose to work with BIC.
Model Optimization • Double hill climbing (DHC), 2002 • 7 manifest variables • Single hill climbing (SHC), 2004 • 12 manifest variables • Heuristic SHC (HSHC), 2004 • 50 manifest variables • EAST, 2012 • As efficient as HSHC, and more principled • 100+ manifest variables • Reference: T. Chen, N. L. Zhang, T. F. Liu, Y. Wang, L. K. M. Poon (2012). Model-based multidimensional clustering of categorical data. Artificial Intelligence, 176(1), 2246-2269. doi:10.1016/j.artint.2011.09.003
EAST Algorithm: 5 Search Operators • EAST: Expansion, Adjustment, Simplification until Termination • Expansion operators: • Node introduction (NI): M1 => M2; |X1| = |X| • Constraint: the new latent node mediates an existing latent node and only two of its neighbors • State introduction (SI): adds a new state to a latent variable • Adjustment operator: node relocation (NR), M2 => M3 • Simplification operators: node deletion (ND), state deletion (SD)
Search Operators and Model Inclusion • M → M' by NI or SI: • M' includes M • M → M' by ND or SD: • M' is included in M • M → M' by NR: • No inclusion property in general
Naïve Search • Start with an initial model • At each step: • Construct all possible candidate models • Evaluate them one by one • Pick the best one • Inefficient: • Too many candidate models • Too expensive to run EM on all of them • Structural EM assumes a fixed set of variables, so it does not work here: the latent variables in models produced by NI, SI, and SD differ from those in the current model
Reducing the Number of Candidate Models • Do not use ALL the operators at once • How? • BIC: BIC(m|D) = log P(D|m, θ*) − (d/2) log N • Improve the two terms alternately • SD and ND reduce the penalty term • Which operators improve the likelihood term?
Improving the Likelihood Term • Let m' be obtained from m using NI or SI • Then log P(D|m', θ'*) >= log P(D|m, θ*) • That is, NI and SI improve the likelihood term. [Homework: Prove this.] • Follow each NI operation with NR operations • This overcomes the constraint on NI and allows the transition from M1 to M3
Choosing between Models by SI and NI • Operation granularity (illustration with p = 100): • SI: 101 additional parameters • NI: 2 additional parameters • Like comparing shovels with a bulldozer: SI would always be preferred initially • Cost-effectiveness principle: • Select the candidate model with the highest improvement ratio (see the sketch below)
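A sketch of one natural reading of the improvement ratio: BIC improvement per additional parameter, so that a cheap NI step can beat an SI step that has a larger absolute gain (the scores and dimensions below are made-up numbers):

```python
def improvement_ratio(bic_new, bic_old, d_new, d_old):
    """Cost-effectiveness of a candidate operation: BIC improvement
    per extra parameter (one reading of the principle on the slide)."""
    return (bic_new - bic_old) / (d_new - d_old)

# Illustrative numbers: a cheap NI step vs. an expensive SI step.
ni = improvement_ratio(bic_new=-1250.0, bic_old=-1262.0, d_new=22, d_old=20)
si = improvement_ratio(bic_new=-1240.0, bic_old=-1262.0, d_new=121, d_old=20)
print(ni, si)   # 6.0 vs about 0.22: NI wins despite its smaller absolute gain
```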
Variable Complexity vs Structure Complexity • NI increases structure complexity • SI increases variable complexity • The cost-effectiveness principle achieves a good balance between the two kinds of complexity