**THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY, COMP 5213: Introduction to Bayesian Networks**

**L12: Latent Tree Models and Multidimensional Clustering**

Nevin L. Zhang, Room 3504, phone: 2358-7015, Email: lzhang@cs.ust.hk, Home page

**Outline**

- Latent Tree Models
  - Definition
  - Generalizing finite mixture models
  - Generalizing phylogenetic trees
  - Attractive representation of joint distributions
- Basic Properties
- Learning Algorithms
- Applications
  - Latent structure discovery
  - Multidimensional clustering
  - Probabilistic inference
- Readings: http://www.cse.ust.hk/~lzhang/ltm/index.htm

**Latent Tree Models (LTMs)**

- Bayesian networks with:
  - a rooted tree structure;
  - observed leaves (manifest variables), discrete or continuous;
  - latent internal nodes (latent variables), discrete.
- Also known as hierarchical latent class (HLC) models.
- Parameters: P(Y1), P(Y2|Y1), P(X1|Y2), P(X2|Y2), ...

**Example with Continuous Leaves**

- A leaf node can contain:
  - one discrete observed variable,
  - one continuous observed variable, or
  - multiple continuous observed variables.

**Example**

- Manifest variables: Math Grade, Science Grade, Literature Grade, History Grade.
- Latent variables: Analytic Skill, Literal Skill, Intelligence.

**More General Tree Models**

- Some internal nodes can be observed.
- Internal nodes can be continuous.
- ...
- We do not consider such models.

**Finite Mixture Models**

- Gaussian mixture models and latent class models.
- They contain one latent variable and produce one partition of the data.

**How to Cluster Those?**

- The same pictures can be clustered by style of picture or by type of object in the picture.
- We need multiple partitions. In general, complex data usually
  - have multiple facets, and
  - can be meaningfully clustered in multiple ways.

**LTMs and Multidimensional Clustering**

- An LTM contains multiple latent variables, each representing a partition of the data.
- Hence, LTMs can be used to produce multiple partitions of the data.
- This is called multidimensional clustering, with each latent variable being a dimension.

**From FMMs to LTMs**

- Start with several FMMs:
  - Each is based on a distinct subset of attributes.
  - Each partitions the data from a certain perspective.
  - The different partitions are independent of each other.
- Link them up to form a tree model:
  - The result is an LTM.
  - It considers the different perspectives in a single model.

**Phylogeny**

- Assumption: all organisms on Earth have a common ancestor.
  - This implies that any set of species is related.
- Phylogeny: the relationship between any set of species.
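The LTM factorization from the definition slides, P(Y1) P(Y2|Y1) P(X1|Y2) P(X2|Y2), can be sketched in a few lines of code. This is a minimal illustration with made-up binary variables and CPT values (not from the slides): marginalizing out the latent variables Y1, Y2 yields the joint distribution over the manifest variables X1, X2.

```python
# Illustrative sketch of an LTM joint: Y1 -> Y2 -> {X1, X2}.
# All variables are binary and all CPT numbers are made-up examples.
from itertools import product

P_Y1 = [0.6, 0.4]                       # P(Y1)
P_Y2_Y1 = [[0.9, 0.1], [0.2, 0.8]]      # P(Y2 | Y1), rows indexed by Y1
P_X1_Y2 = [[0.8, 0.2], [0.3, 0.7]]      # P(X1 | Y2), rows indexed by Y2
P_X2_Y2 = [[0.7, 0.3], [0.1, 0.9]]      # P(X2 | Y2), rows indexed by Y2

def joint_manifest(x1, x2):
    """P(X1=x1, X2=x2) = sum over y1, y2 of P(y1) P(y2|y1) P(x1|y2) P(x2|y2)."""
    return sum(P_Y1[y1] * P_Y2_Y1[y1][y2] * P_X1_Y2[y2][x1] * P_X2_Y2[y2][x2]
               for y1, y2 in product(range(2), range(2)))

# Sanity check: the marginal over all manifest configurations sums to 1.
total = sum(joint_manifest(x1, x2) for x1, x2 in product(range(2), range(2)))
```

Note how this reduces to a finite mixture model when there is a single latent variable: each latent state indexes one mixture component, which is why FMMs are the one-latent-variable special case of LTMs.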
- Phylogenetic tree: usually the relationship can be represented by a tree, called a phylogenetic (evolution) tree.
  - This is not always true.

**Phylogenetic Trees**

- Taxa (sequences) identify species.
- Edge lengths represent evolution time.
- Assumption: a bifurcating tree topology.

**Probabilistic Models of Evolution**

- Characterize the relationship between taxa using substitution probabilities:
  - P(x | y, t): the probability that ancestral sequence y evolves into sequence x along an edge of length t.
  - Parameters: P(X7), P(X5|X7, t5), P(X6|X7, t6), P(S1|X5, t1), P(S2|X5, t2), ...
- What should P(x | y, t) be? Commonly used models make two assumptions:
  - There are only substitutions, no insertions/deletions (the sequences are aligned), so there is a one-to-one correspondence between sites in different sequences.
  - Each site evolves independently and identically:
    P(x | y, t) = ∏_{i=1..m} P(x(i) | y(i), t), where m is the sequence length.
- What should P(x(i) | y(i), t) be?
  - The Jukes-Cantor (character evolution) model [1969].
  - Rate of substitution α (constant or parameter?).

**Phylogenetic Trees are Special LTMs**

- The structure is a binary tree.
- The variables share the same state space.
- The conditional probabilities come from the character evolution model, parameterized by edge lengths instead of the usual parameterization.

**Attractive Representation of Joint Distributions**

- Characteristics of LTMs:
  - They are computationally very simple to work with.
  - They can represent complex relationships among manifest variables.
  - They are a useful tool for density estimation.

**What Can LTMs Be Used For?**

- Generalizing finite mixture models: a tool for multidimensional clustering.
- Generalizing phylogenetic trees: a tool for latent structure discovery.
- An attractive representation of joint distributions: a tool for density estimation (general probabilistic modeling).

**Root Walking and Model Equivalence**

- M1: the root walks to X2; M2: the root walks to X3.
- Root walking leads to equivalent models on the manifest variables.
- Implications:
  - Edge orientations cannot be determined from data.
  - We can only learn unrooted models.

**Regularity**

- We can focus on regular models only:
  - Irregular models can be made regular.
  - Regularized models are better than irregular models.
- Theorem: the set of all regular models over a given set of manifest variables is finite.

**Learning Latent Tree Models**

We must determine:

- the number of latent variables,
- the cardinality of each latent variable,
- the model structure, and
- the conditional probability distributions.

**Different Types of Algorithms**

- Search algorithms.
- Clustering of manifest variables.
- Generalizations of phylogenetic tree reconstruction algorithms, particularly Neighbor-Joining.

**Model Selection**

- Bayesian score: the posterior probability of the model,
  P(m|D) = P(m) ∫ P(D|m, θ) P(θ|m) dθ / P(D).
- BIC score: a large-sample approximation,
  BIC(m|D) = log P(D|m, θ*) − (d/2) log N,
  where d is the standard dimension (the number of free parameters) and N is the sample size.
- BICe score: BICe(m|D) = log P(D|m, θ*) − (de/2) log N, where de is the effective dimension.
  - Effective dimensions are difficult to compute, so BICe is not realistic.

**Effective Dimension**

- Standard dimension: the number of free parameters.
- Effective dimension:
  - X1, X2, ..., Xn are the observed variables.
  - P(X1, X2, ..., Xn) is a point in a high-dimensional space for each value of the parameters; it spans a manifold as the parameter values vary.
  - The effective dimension is the dimension of this manifold.
- Parsimonious model: standard dimension = effective dimension.
  - Open question: how to test parsimony?
- Reference: N. L. Zhang and T. Kocka (2004). Effective dimensions of hierarchical latent class models. Journal of Artificial Intelligence Research, 21: 1-17.
  - Open question: the effective dimension of an LTM with one latent variable.

**Model Selection: Other Choices**

- Cheeseman-Stutz (CS): reduces the impact of the approximation error in BIC.
- AIC.
- Holdout likelihood.
- (Cross-validation: too expensive.)
- Simulation studies indicate that BIC and CS result in good models, while AIC and holdout likelihood do not.
- Therefore, we chose to work with BIC.

**Model Optimization**

- Double hill climbing (DHC), 2002: 7 manifest variables.
- Single hill climbing (SHC), 2004: 12 manifest variables.
- Heuristic SHC (HSHC), 2004: 50 manifest variables.
- EAST, 2012: as efficient as HSHC and more principled; 100+ manifest variables.
- Reference: T. Chen, N. L. Zhang, T. F. Liu, Y. Wang, L. K. M. Poon (2011). Model-based multidimensional clustering of categorical data. Artificial Intelligence, 176(1), 2246-2269. doi:10.1016/j.artint.2011.09.003.

**EAST Algorithm: 5 Search Operators**

- EAST: Expansion, Adjustment, Simplification until Termination.
- Expansion operators:
  - Node introduction (NI): M1 => M2, with |X1| = |X|.
    - Constraint: the new node mediates a latent node and only two of its neighbors.
  - State introduction (SI): adds a new state to a latent variable.
- Adjustment operator: node relocation (NR), M2 => M3.
- Simplification operators: node deletion (ND), state deletion (SD).

**Search Operators and Model Inclusion**

- M => M' by NI or SI: M' includes M.
- M => M' by ND or SD: M' is included in M.
- M => M' by NR: no inclusion property in general.

**Naïve Search**

- Start with an initial model. At each step:
  - Construct all possible candidate models.
  - Evaluate them one by one.
  - Pick the best one.
- This is inefficient:
  - There are too many candidate models.
  - It is too expensive to run EM on all of them.
- Structural EM assumes a fixed set of variables, so it does not work here: the latent variables in models produced by NI, SI, and SD differ from those in the current model.

**Reducing the Number of Candidate Models**

- Do not use all the operators at once. How?
- BIC: BIC(m|D) = log P(D|m, θ*) − (d/2) log N.
- Improve the two terms alternately:
  - SD and ND reduce the penalty term.
  - Which operators improve the likelihood term?

**Improving the Likelihood Term**

- Let m' be obtained from m using NI or SI. Then
  log P(D|m', θ'*) >= log P(D|m, θ*),
  so NI and SI improve the likelihood term. [Homework: prove this.]
- Follow each NI operation with NR operations: this overcomes the constraint on NI and allows the transition from M1 to M3.

**Choosing between Models by SI and NI**

- Operation granularity: with p = 100,
  - SI adds 101 additional parameters, while
  - NI adds 2 additional parameters.
- This is like comparing shovels with a bulldozer: SI is always preferred initially.
- Cost-effectiveness principle: select the candidate model with the highest improvement ratio.

**Variable Complexity vs. Structure Complexity**

- NI increases structure complexity.
- SI increases variable complexity.
- The cost-effectiveness principle achieves a good balance between the two kinds of complexity.
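The BIC score and the cost-effectiveness principle above can be sketched numerically. This is an illustrative example with made-up log-likelihoods, dimensions, and sample size, mirroring the p = 100 case where SI adds 101 parameters and NI adds 2: even if SI gains more raw likelihood, NI can win on BIC improvement per extra parameter.

```python
# Sketch of the cost-effectiveness principle: score candidate models by
# BIC and prefer the one with the highest improvement per added parameter.
# All numbers below are hypothetical, chosen only to illustrate the idea.
import math

def bic(loglik, d, n):
    """BIC(m|D) = log P(D|m, theta*) - (d/2) log N."""
    return loglik - 0.5 * d * math.log(n)

N = 1000                                   # sample size (made up)
current = {"loglik": -5200.0, "d": 40}     # current model m (made up)
candidates = {                             # candidates from SI and NI (made up)
    "SI": {"loglik": -5100.0, "d": 141},   # 101 extra parameters
    "NI": {"loglik": -5180.0, "d": 42},    # 2 extra parameters
}

base = bic(current["loglik"], current["d"], N)

def improvement_ratio(c):
    # BIC gain divided by the number of extra parameters the operator adds.
    return (bic(c["loglik"], c["d"], N) - base) / (c["d"] - current["d"])

# Pick the operator with the best improvement ratio.
best = max(candidates, key=lambda k: improvement_ratio(candidates[k]))
```

With these numbers, SI's large likelihood gain is swamped by its 101-parameter penalty, so NI has the higher improvement ratio; this is the balance between variable complexity and structure complexity the principle is meant to achieve.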