
Unsupervised Learning of Finite Mixture Models


Presentation Transcript


  1. Unsupervised Learning of Finite Mixture Models. Mário A. T. Figueiredo, mtf@lx.it.pt, http://red.lx.it.pt/~mtf, Institute of Telecommunications and Instituto Superior Técnico, Technical University of Lisbon, PORTUGAL. This work was done jointly with Anil K. Jain, Michigan State University.

  2. Some of this work (and earlier versions of it) is reported in:
• M. Figueiredo and A. K. Jain, "Unsupervised Learning of Finite Mixture Models", to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
• M. Figueiredo and A. K. Jain, "Unsupervised Selection and Estimation of Finite Mixture Models", in Proc. of the Intern. Conf. on Pattern Recognition (ICPR'2000), vol. 2, pp. 87-90, Barcelona, 2000.
• M. Figueiredo, J. Leitão, and A. K. Jain, "On Fitting Mixture Models", in E. Hancock and M. Pelillo (Editors), Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 54-69, Springer Verlag, 1999.
Outline:
- Review of finite mixtures
- Estimating mixtures: the expectation-maximization (EM) algorithm
- Research issues: order selection and initialization of EM
- A new order selection criterion
- A new algorithm
- Experimental results
- Concluding remarks

  3. Finite mixtures. There are k random sources, with probability density functions f_i(x), i = 1, …, k. To generate the random variable X, one of the sources is chosen at random and X is drawn from that source's density.

  4. Finite mixtures Example: 3 species (Iris)

  5. Finite mixtures. Choose a source at random, with Prob.(source i) = a_i, then draw the random variable X from that source's density.
Conditional: f(x | source i) = f_i(x)
Joint: f(x and source i) = a_i f_i(x)
Unconditional: f(x) = \sum_{i=1}^{k} f(x and source i) = \sum_{i=1}^{k} a_i f_i(x)

  6. Finite mixtures. Component densities f_i(x) and mixing probabilities a_i:
f(x) = \sum_{i=1}^{k} a_i f_i(x), with a_i \geq 0 and \sum_{i=1}^{k} a_i = 1.
Parameterized components (e.g., Gaussian): f_i(x) = f(x | \theta_i), so that
f(x | \Theta) = \sum_{i=1}^{k} a_i f(x | \theta_i), where \Theta = \{\theta_1, …, \theta_k, a_1, …, a_k\}.

  7. Gaussian mixtures. Gaussian components: f(x | \theta_i) = N(x | \mu_i, C_i).
Arbitrary covariances: \theta_i = \{\mu_i, C_i\}, one covariance matrix per component.
Common covariance: \theta_i = \{\mu_i\}, with C_i = C for all components.
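To make the parameterization concrete, here is a minimal Python/NumPy sketch (my illustration, not code from the talk) that evaluates a Gaussian mixture density; the common-covariance case simply reuses one matrix C for every component.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_mixture_pdf(x, weights, means, covariances):
    """Evaluate f(x) = sum_i a_i N(x | mu_i, C_i) at the points in x."""
    x = np.atleast_2d(x)
    density = np.zeros(x.shape[0])
    for a_i, mu_i, C_i in zip(weights, means, covariances):
        density += a_i * multivariate_normal.pdf(x, mean=mu_i, cov=C_i)
    return density

# Example: a 3-component mixture in R^2 with arbitrary covariances.
weights = np.array([0.5, 0.3, 0.2])                      # a_i >= 0, summing to 1
means = [np.zeros(2), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])]
print(gaussian_mixture_pdf([[0.0, 0.0], [3.0, 0.0]], weights, means, covs))

# Common-covariance case: the same C for every component.
common_C = np.eye(2)
print(gaussian_mixture_pdf([[0.0, 0.0]], weights, means, [common_C] * 3))
```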

  8. Mixture fitting / estimation. Data: n independent observations, x = \{x^{(1)}, …, x^{(n)}\}. Goals: estimate the parameter set \Theta and maybe "classify the observations". Only the unclassified (observed) data are available; the classes are unknown.
Example (Iris): How many species? What is the mean of each species? Which points belong to each species?

  9. Gaussian mixtures (d=1), an example

  10. Gaussian mixtures, an R² example (1500 points), k = 3.

  11. Uses of mixtures in pattern recognition. Unsupervised learning (model-based clustering):
- each component models one cluster;
- clustering = mixture fitting.
Observations: unclassified points. Goals: find the classes, and classify the points.

  12. Uses of mixtures in pattern recognition. Mixtures are also good for representing class-conditional densities in supervised learning, since they can approximate arbitrary densities. Example: two strongly non-Gaussian classes; use a mixture to model each class-conditional density.

  13. Fitting mixtures. Given n independent observations x = \{x^{(1)}, …, x^{(n)}\}, the maximum (log)likelihood (ML) estimate of \Theta is
\hat{\Theta}_{ML} = \arg\max_{\Theta} \log f(x | \Theta) = \arg\max_{\Theta} \sum_{j=1}^{n} \log \sum_{i=1}^{k} a_i f(x^{(j)} | \theta_i).
Because the sum over mixture components sits inside the logarithm, the ML estimate has no closed-form solution.
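The log-likelihood above is easy to evaluate even though it has no closed-form maximizer. A hedged Python sketch (using SciPy's logsumexp for numerical stability; an illustration, not code from the original slides):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, weights, means, covariances):
    """log f(X | Theta) = sum_j log sum_i a_i f(x(j) | theta_i)."""
    n = X.shape[0]
    k = len(weights)
    log_terms = np.empty((n, k))
    for i in range(k):
        log_terms[:, i] = np.log(weights[i]) + \
            multivariate_normal.logpdf(X, mean=means[i], cov=covariances[i])
    # The sum over components is inside the log: no closed-form ML solution.
    return logsumexp(log_terms, axis=1).sum()
```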

  14. Gaussian mixtures: a peculiar type of ML problem. Maximum (log)likelihood estimate of \Theta, subject to a_i \geq 0 and \sum_i a_i = 1. Problem: the likelihood function is unbounded (e.g., as a component mean coincides with a data point and its covariance shrinks to zero), so there is no global maximum. Unusual goal: a "good" local maximum.
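This unboundedness is easy to check numerically: pin one component's mean to a data point and shrink its variance, and the log-likelihood grows without bound. A small illustrative sketch (assumed 1-D, 2-component setting, with my own example data):

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.0, 1.1, 2.3, 3.1, 4.8])   # some 1-D data points

def loglik(x, a, mu1, s1, mu2, s2):
    """Log-likelihood of a 2-component 1-D Gaussian mixture."""
    return np.log(a * norm.pdf(x, mu1, s1) + (1 - a) * norm.pdf(x, mu2, s2)).sum()

# Fix component 1 on the first data point and let its std. deviation shrink:
for s1 in [1.0, 0.1, 0.01, 1e-4]:
    print(s1, loglik(x, 0.5, mu1=x[0], s1=s1, mu2=2.0, s2=1.0))
# The log-likelihood grows without bound as s1 -> 0: there is no global maximum.
```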

  15. A peculiar type of ML problem. Example: a 2-component Gaussian mixture fitted to some data points.

  16. Fitting mixtures: a missing data problem. The ML estimate has no closed-form solution; the standard alternative is the expectation-maximization (EM) algorithm, which treats mixture fitting as a missing data problem.
Observed data: x = \{x^{(1)}, …, x^{(n)}\}.
Missing data: the labels ("colors") z^{(1)}, …, z^{(n)}, where z^{(j)} has a "1" at position i if x^{(j)} was generated by component i.

  17. Fitting mixtures: a missing data problem.
Observed data: x = \{x^{(1)}, …, x^{(n)}\}. Missing data: z = \{z^{(1)}, …, z^{(n)}\}, each z^{(j)} a binary vector with k-1 zeros and one "1".
Complete log-likelihood function:
\log f(x, z | \Theta) = \sum_{j=1}^{n} \sum_{i=1}^{k} z_i^{(j)} \log\left[ a_i f(x^{(j)} | \theta_i) \right].
In the presence of both x and z, \Theta would be easy to estimate, …but z is missing.

  18. The EM algorithm. Iterative procedure:
The E-step: compute the expected value of the complete log-likelihood, given x and the current estimate \hat{\Theta}(t).
The M-step: update the parameter estimates by maximizing that expected complete log-likelihood.
Under mild conditions, the sequence of estimates converges to a local maximum of the log-likelihood.

  19. The EM algorithm: the Gaussian case.
The E-step: because the complete log-likelihood is linear in the binary variables z, the E-step reduces to computing, by Bayes law,
w_i^{(j)}(t) = E[z_i^{(j)} | x, \hat{\Theta}(t)] = \frac{\hat{a}_i(t) f(x^{(j)} | \hat{\theta}_i(t))}{\sum_{m=1}^{k} \hat{a}_m(t) f(x^{(j)} | \hat{\theta}_m(t))},
the estimate, at iteration t, of the probability that x^{(j)} was produced by component i: a "soft" probabilistic assignment.

  20. The EM algorithm: the Gaussian case.
Result of the E-step: the w_i^{(j)}(t), the estimates, at iteration t, of the probability that x^{(j)} was produced by component i.
The M-step (arbitrary covariances):
\hat{a}_i(t+1) = \frac{1}{n} \sum_j w_i^{(j)}(t),
\hat{\mu}_i(t+1) = \frac{\sum_j w_i^{(j)}(t) \, x^{(j)}}{\sum_j w_i^{(j)}(t)},
\hat{C}_i(t+1) = \frac{\sum_j w_i^{(j)}(t) \, (x^{(j)} - \hat{\mu}_i(t+1))(x^{(j)} - \hat{\mu}_i(t+1))^T}{\sum_j w_i^{(j)}(t)}.
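Together, the last two slides define one EM iteration. The NumPy sketch below implements that iteration for the arbitrary-covariance Gaussian case (the small ridge added to each covariance is my own numerical safeguard, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covariances, reg=1e-6):
    """One EM iteration for a Gaussian mixture with arbitrary covariances."""
    n, d = X.shape
    k = len(weights)

    # E-step: w[j, i] = posterior probability that x(j) came from component i.
    w = np.empty((n, k))
    for i in range(k):
        w[:, i] = weights[i] * multivariate_normal.pdf(X, means[i], covariances[i])
    w /= w.sum(axis=1, keepdims=True)         # soft probabilistic assignment

    # M-step: re-estimate mixing probabilities, means, and covariances.
    n_i = w.sum(axis=0)                       # effective number of points per component
    new_weights = n_i / n
    new_means = (w.T @ X) / n_i[:, None]
    new_covs = []
    for i in range(k):
        diff = X - new_means[i]
        C = (w[:, i, None] * diff).T @ diff / n_i[i]
        new_covs.append(C + reg * np.eye(d))  # small ridge for numerical stability
    return new_weights, new_means, new_covs
```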

  21. Difficulties with EM. EM is a local (greedy) algorithm (the likelihood never decreases), so it is initialization dependent. Example (figures): two runs from different initializations, taking 74 iterations and 270 iterations.

  22. Model selection: how many components? ML cannot be used to estimate k, because the maximized likelihood never decreases when k increases: the parameter spaces are "nested". For any \Theta_{(k)} there exists a \Theta_{(k+1)} such that f(x | \Theta_{(k+1)}) = f(x | \Theta_{(k)}); example: set a_{k+1} = 0. ...the classical over/under-fitting issue.

  23. Estimating the number of components (EM-based). Usually, a penalized log-likelihood criterion of the form
\hat{k} = \arg\min_k \left\{ -\log f(x | \hat{\Theta}_{(k)}) + \mathcal{P}(k) \right\},
where \hat{\Theta}_{(k)} is obtained, e.g., via EM, and \mathcal{P}(k) is a penalty term that grows with k. Criteria in this category:
- Bezdek's partition coefficient (PC), Bezdek, 1981 (in a clustering framework).
- Minimum description length (MDL), Rissanen and Ristad, 1992.
- Akaike's information criterion (AIC), Windham and Cutler, 1992.
- Approximate weight of evidence (AWE), Banfield and Raftery, 1993.
- Evidence Bayesian criterion (EBC), Roberts, Husmeier, Rezek, and Penny, 1998.
- Schwarz's Bayesian inference criterion (BIC), Fraley and Raftery, 1998.
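All criteria in this family share one recipe: fit a mixture for each candidate k (e.g., with EM) and keep the k with the smallest penalized negative log-likelihood. A generic sketch of that recipe, using scikit-learn's GaussianMixture purely for convenience and BIC as the example penalty (an illustration, not the method proposed in this talk):

```python
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_max=10, random_state=0):
    """Fit a GMM for each k = 1..k_max and return the k with the lowest BIC."""
    scores = {}
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, n_init=5, random_state=random_state)
        gmm.fit(X)
        scores[k] = gmm.bic(X)   # -2 log-likelihood + (number of params) * log n
    return min(scores, key=scores.get), scores
```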

  24. Estimating the number of components: other approaches.
Resampling-based techniques (comment: computationally very heavy):
- Bootstrap for clustering, Jain and Moreau, 1987.
- Bootstrap for Gaussian mixtures, McLachlan, 1987.
- Cross-validation, Smyth, 1998.
Stochastic techniques (comment: computationally extremely heavy):
- Estimating k via Markov Chain Monte Carlo (MCMC), Mengersen and Robert, 1996; Bensmail, Celeux, Raftery, and Robert, 1997; Roeder and Wasserman, 1997.
- Sampling the full posterior via MCMC, Neal, 1992; Richardson and Green, 1997.

  25. STOP! Summary so far:
- Review of basic concepts concerning finite mixtures.
- Review of the EM algorithm for Gaussian mixtures: the E-step computes the probability that x^{(j)} was produced by component i; the M-step, given these probabilities, updates the parameter estimates.
- Difficulty: how to initialize EM?
- Difficulty: how to estimate the number of components, k?
- Difficulty: how to avoid the boundary of the parameter space?

  26. A New Model Selection Criterion. Introduction to minimum encoding length criteria. Rationale: the data are compressed by a coder and recovered by a decoder; a good model yields a short code, a bad model yields a long code, so code length measures model adequacy. Several flavors: Rissanen (MDL), 1978, 1987; Rissanen (NML), 1996; Wallace and Freeman (MML), 1987.

  27. MDL criterion: formalization. Family of models: \{f(x | \Theta_{(k)}), k = 1, 2, …\}, known to transmitter and receiver; the parameters \Theta_{(k)} are unknown to both. Given \Theta_{(k)}, the shortest code-length for x (Shannon's) is
L(x | \Theta_{(k)}) = -\log f(x | \Theta_{(k)}).
Not enough, because... the decoder needs to know which code was chosen, i.e., \Theta_{(k)}, which requires a parameter code-length L(\Theta_{(k)}). Total code-length (two-part code):
L(x, \Theta_{(k)}) = -\log f(x | \Theta_{(k)}) + L(\Theta_{(k)}).

  28. MDL criterion: formalization. The MDL criterion:
\hat{k} = \arg\min_k \left\{ -\log f(x | \hat{\Theta}_{(k)}) + L(\hat{\Theta}_{(k)}) \right\},
where L(\hat{\Theta}_{(k)}) grows with k. It can be seen as an order-penalized ML criterion. Remaining question: how to choose L(\Theta_{(k)})?

  29. MDL criterion: parameter code length. Real-valued parameters must be truncated to a finite precision to obtain a finite code-length. High precision: small loss in the data code-length, but the parameter code may be very long; low precision: short parameter code, but the likelihood loss may be large. The optimal compromise (under certain conditions, and asymptotically) is
L(each component of \Theta_{(k)}) = \frac{1}{2} \log(\text{amount of data from which the parameter is estimated}).

  30. MDL criterion: parameter code length. L[each component of \Theta_{(k)}] = \frac{1}{2} \log n', where n' is the amount of data from which the parameter is estimated. Classical MDL: n' = n. Not true for mixtures! Why? Not all data points have equal weight in each parameter estimate.

  31. MDL for mixtures. Consider any parameter \theta_m of the m-th component (e.g., a component of \mu_m). Its Fisher information based on the whole sample is
I(\theta_m) = n \, a_m \, I^{(1)}(\theta_m),
where I^{(1)}(\theta_m) is the Fisher information for one observation from component m. Conclusion: the sample size "seen" by \theta_m is n a_m, so
L(each component of \theta_m) = \frac{1}{2} \log(n a_m).

  32. MDL for mixtures. Recall that the sample size "seen" by \theta_m is n a_m. What about the a_m's? They are estimated from all n points. Let N_p be the number of parameters of each component; then
L(\Theta_{(k)}) = \frac{N_p}{2} \sum_{m=1}^{k} \log(n a_m) + \frac{k}{2} \log n.
Examples: Gaussian with arbitrary covariances, N_p = d + d(d+1)/2; Gaussian with common covariance, N_p = d.

  33. MDL for mixtures: the mixture-MDL (MMDL) criterion:
\hat{k} = \arg\min_k \left\{ -\log f(x | \hat{\Theta}_{(k)}) + \frac{N_p}{2} \sum_{m=1}^{k} \log(n \hat{a}_m) + \frac{k}{2} \log n \right\}.
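Assuming the reconstruction of the penalty above, the MMDL cost is cheap to evaluate once EM has produced \hat{\Theta}_{(k)}; the helper below (my own Python, with hypothetical argument names) combines the log-likelihood with the two penalty terms, counting only surviving components.

```python
import numpy as np

def mmdl_cost(log_likelihood, weights, n_params_per_component, n):
    """MMDL cost: -log f(x|Theta) + (Np/2) sum_m log(n a_m) + (k/2) log n."""
    weights = np.asarray(weights)
    nonzero = weights[weights > 0]            # annihilated components do not contribute
    k = nonzero.size
    penalty = 0.5 * n_params_per_component * np.log(n * nonzero).sum() \
              + 0.5 * k * np.log(n)
    return -log_likelihood + penalty
```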

  34. The MMDL criterion. Key observation: MMDL is not just a function of k. For fixed k, the minimizing \hat{\Theta} is not an ML estimate, because the penalty depends on the a_m's: this is not a simple order penalty (like in standard MDL). For fixed k, MMDL has a simple Bayesian interpretation:
p(a_1, …, a_k) \propto \exp\left\{ -\frac{N_p}{2} \sum_{m} \log a_m \right\} = \prod_{m} a_m^{-N_p/2}.
This is a Dirichlet-type (improper) prior.

  35. EM for MMDL. Using EM requires redefining the M-step (there is now a prior on the a_m's); this is simple, because the Dirichlet-type prior is conjugate. With (y)_+ = \max\{0, y\} enforcing the constraints a_m \geq 0, \sum_m a_m = 1, the new update is
\hat{a}_m(t+1) = \frac{\left( \sum_j w_m^{(j)}(t) - \frac{N_p}{2} \right)_+}{\sum_{i=1}^{k} \left( \sum_j w_i^{(j)}(t) - \frac{N_p}{2} \right)_+}.
Remarkable fact: this M-step may annihilate components (set some \hat{a}_m exactly to zero).
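A sketch of this modified weight update as I read the slide (plain NumPy; the means and covariances are still updated by the standard M-step). The max{0, ·} truncation is what lets a component be driven exactly to zero:

```python
import numpy as np

def mmdl_weight_update(w, n_params_per_component):
    """M-step for the mixing probabilities under the Dirichlet-type MMDL prior.

    w is the n-by-k matrix of E-step posteriors; components whose total
    support falls below Np/2 are driven exactly to zero (annihilated).
    """
    support = w.sum(axis=0)                               # sum_j w_i(j), per component
    trimmed = np.maximum(0.0, support - n_params_per_component / 2.0)
    total = trimmed.sum()
    if total == 0.0:
        raise RuntimeError("all components annihilated; see component-wise EM below")
    return trimmed / total                                # renormalize: weights sum to 1
```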

  36. EM for MMDL. The M-step for MMDL is able to annihilate components: MMDL promotes sparseness. Interesting interpretation: the penalty on the a_m's can be written in terms of the Kullback-Leibler divergence between the uniform distribution and (a_1, …, a_k), pushing the mixing probabilities away from uniform, i.e., MMDL favors lower-entropy distributions (Zhu, Wu, & Mumford, 1997; Brand, 1999). This suggests: start with k much higher than the true value, and EM for MMDL will set some of the a_m's to zero.

  37. The MMDL criterion: an example. The MMDL penalty term, plotted for N_p = 2 and k = 2: it promotes instability and competition between components.

  38. The initialization problem. EM is a local (greedy) algorithm, so mixture estimates depend on good initialization. Many approaches:
- Multiple random starts, McLachlan and Peel, 1998; Roberts, Husmeier, Rezek, and Penny, 1998; and many others.
- Initialization by a clustering algorithm (e.g., k-means), McLachlan and Peel, 1998, and others.
- Deterministic annealing, Yuille, Stolorz, and Utans, 1994; Kloppenburg and Tavan, 1997; Ueda and Nakano, Neural Networks, 1998.
- The split-and-merge EM algorithm, Ueda, Nakano, Ghahramani, and Hinton, 2000.
Our approach: start with too many components, and prune with the new M-step.

  39. Possible problem with the new M-step: with k too high, it may happen that all of the component supports \sum_j w_i^{(j)} (the total probability that the data were produced by component i) fall below N_p/2, annihilating all components at once. Solution: use "component-wise EM" [Celeux, Chrétien, Forbes, and Mkhadri, 1999]. Convergence is shown using the approach in [Chrétien and Hero, 2000].

  40. Component-wise EM [Celeux, Chrétien, Forbes, and Mkhadri, 1999]:
- Update \theta_1; recompute all the w_i^{(j)}.
- Update \theta_2; recompute all the w_i^{(j)}.
- Update \theta_3; recompute all the w_i^{(j)}.
- ....
- Repeat until convergence.
Key fact: if one component "dies", its probability mass is immediately re-distributed.
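In code, the component-wise schedule differs from standard EM only in where the responsibilities are recomputed: after every single-component update rather than after a full sweep. A self-contained sketch for the Gaussian case (helper and variable names are mine, and the weight update follows the MMDL M-step sketched earlier):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covs):
    """E-step posteriors w[j, i] under the current parameters."""
    w = np.column_stack([
        a * multivariate_normal.pdf(X, m, C) for a, m, C in zip(weights, means, covs)
    ])
    return w / w.sum(axis=1, keepdims=True)

def component_wise_em_sweep(X, weights, means, covs, n_params_per_component):
    """One CEM^2 sweep: update one component at a time, recomputing the
    responsibilities immediately afterwards, so a dead component's mass is
    redistributed right away."""
    n, d = X.shape
    for i in range(len(weights)):
        if weights[i] == 0.0:
            continue                                   # already annihilated
        w = responsibilities(X, weights, means, covs)
        # MMDL weight update (see the modified M-step above).
        trimmed = np.maximum(0.0, w.sum(axis=0) - n_params_per_component / 2.0)
        if trimmed.sum() == 0.0:
            raise RuntimeError("all components annihilated in this sweep")
        weights = trimmed / trimmed.sum()
        if weights[i] == 0.0:
            continue                                   # component i just died
        # Standard M-step update of component i's mean and covariance.
        means[i] = w[:, i] @ X / w[:, i].sum()
        diff = X - means[i]
        covs[i] = (w[:, i, None] * diff).T @ diff / w[:, i].sum() + 1e-6 * np.eye(d)
    return weights, means, covs
```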

  41. The complete algorithm:
- Start with a large number of components (k_max).
- While the number of "surviving" components > k_min:
  - run component-wise EM (CEM²), using the new "killer" M-step;
  - after convergence, store the final value of the MMDL cost function;
  - kill the weakest component, and restart CEM².
- Select the model with the minimum value of the MMDL cost.
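The complete procedure then nests CEM² inside a loop over model sizes. The outline below is deliberately high-level Python pseudocode: init_params, run_cem2, kill_weakest, and mmdl_cost are hypothetical placeholders for the pieces sketched above, and params is assumed to carry a "weights" vector.

```python
def fit_mixture_mmdl(X, k_max, k_min, init_params, run_cem2, kill_weakest, mmdl_cost):
    """High-level outline of the complete algorithm on this slide."""
    params = init_params(X, k_max)                 # start with a large number of components
    best_cost, best_params = float("inf"), None
    while sum(1 for a in params["weights"] if a > 0) > k_min:
        params = run_cem2(X, params)               # component-wise EM with the "killer" M-step
        cost = mmdl_cost(X, params)                # MMDL cost after convergence
        if cost < best_cost:                       # keep the best model seen so far
            best_cost, best_params = cost, params
        params = kill_weakest(params)              # annihilate the weakest, restart CEM^2
    return best_params                             # model with the minimum MMDL cost
```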

  42. Example Same as in [Ueda and Nakano, 1998].

  43. Example: C = I, k = 4, n = 1200, k_max = 10.

  44. Example Same as in [Ueda, Nakano, Ghahramani and Hinton, 2000].

  45. An example with overlapping components.

  46. Evolution of the MMDL cost function

  47. Evolution of the MMDL cost function

  48. Comparison with other methods: an easy mixture (d = 5), a not-so-easy mixture (d = 2), and performance evaluation as a function of the separation.

  49. 2-dimensional data, separation = d

  50. 10-dimensional data, separation =
