
Probabilistic PCA: Maximum Likelihood Solution for Linear-Gaussian Latent Variable Model

This article summarizes the concept of Probabilistic PCA, which is a maximum likelihood solution for a linear-Gaussian latent variable model. It explains the advantages of this probabilistic reformulation, such as the use of EM for parameter estimation, extensions to mixture of PCA models, and Bayesian formulations for automatic determination of the number of principal components. The article also discusses the steps involved in determining the parameter values and the benefits of using the EM algorithm for PCA, including computational efficiency and dealing with missing data. Additionally, it explores model selection and the use of cross-validation to determine the dimensionality of the principal subspace.


Presentation Transcript


  1. Ch 12. Continuous Latent Variables, 12.2 ~ 12.4. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized by Rhee, Je-Keun, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

  2. Contents • 12.2. Probabilistic PCA • 12.3. Kernel PCA • 12.4. Nonlinear Latent Variable Models (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  3. Probabilistic PCA • PCA can be expressed naturally as the maximum likelihood solution to a particular form of linear-Gaussian latent variable model. • This probabilistic reformulation brings many advantages: • the use of EM for parameter estimation • principled extensions to mixtures of PCA models • Bayesian formulations that allow the number of principal components to be determined automatically from the data • Etc. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  4. Probabilistic PCA is a simple example of the linear-Gaussian framework. • It can be formulated by first introducing an explicit latent variable z corresponding to the principal-component subspace. • Prior distribution over z • The conditional distribution of the observed variable x, conditioned on the value of the latent variable z (both given below) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
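
For reference, the two distributions referred to above take the standard linear-Gaussian forms (Bishop, Section 12.2):

p(z) = N(z | 0, I)
p(x|z) = N(x | Wz + μ, σ2 I)

where W is a D x M matrix, μ is a D-dimensional mean vector, and σ2 is the noise variance.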

  5. The D-dimensional observed variable x is defined by a linear transformation of the M-dimensional latent variable z plus additive Gaussian ‘noise’. • This framework is based on a mapping from latent space to data space, in contrast to the more conventional view of PCA. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
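
Equivalently, the generative view writes each observation as

x = Wz + μ + ε,   ε ~ N(0, σ2 I),

where ε is a D-dimensional zero-mean Gaussian noise variable.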

  6. Determining the values of the parameters W, μ and σ2 using maximum likelihood • The marginal distribution p(x) is again Gaussian • its mean & covariance are given below (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
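
The marginal distribution referred to above is

p(x) = N(x | μ, C),   C = W W^T + σ2 I,

so the mean is μ and the covariance is the D x D matrix C.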

  7. When we evaluate the predictive distribution, we require C-1, which involves the inversion of a D x D matrix. • Because we invert the M x M matrix M (defined below) rather than inverting C directly, the cost of evaluating C-1 is reduced from O(D3) to O(M3). • As well as the predictive distribution p(x), we will also require the posterior distribution p(z|x). (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
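
For reference, the relevant quantities are (Bishop, Section 12.2):

M = W^T W + σ2 I   (an M x M matrix)
C^-1 = σ^-2 I − σ^-2 W M^-1 W^T
p(z|x) = N(z | M^-1 W^T (x − μ), σ2 M^-1)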

  8. Maximum likelihood PCA • The log-likelihood function is given below. • Because the log-likelihood is a quadratic function of μ, its solution μ = x̄ (the sample mean) represents the unique maximum, as can be confirmed by computing second derivatives. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
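
Written out, the log-likelihood of the data set X = {xn} and the maximizing mean are (Bishop, Section 12.2.1):

ln p(X | μ, W, σ2) = −(ND/2) ln(2π) − (N/2) ln|C| − (1/2) Σn (xn − μ)^T C^-1 (xn − μ)
μ_ML = x̄ = (1/N) Σn xn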

  9. Maximization with respect to W and σ2 is more complex but nonetheless has an exact closed-form solution. • All of the stationary points of the log likelihood function can be written in the form given below, where • UM is a D x M matrix whose columns are given by any subset (of size M) of the eigenvectors of the data covariance matrix S • LM is an M x M diagonal matrix whose elements are given by the corresponding eigenvalues λi • R is an arbitrary M x M orthogonal matrix. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
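
The stationary points referred to above take the form (Bishop, Section 12.2.1):

W_ML = UM (LM − σ2 I)^{1/2} R
σ2_ML = (1 / (D − M)) Σ_{i=M+1..D} λi

The maximum of the likelihood is obtained when UM contains the M principal (largest-eigenvalue) eigenvectors, in which case σ2_ML is the average variance of the discarded directions.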

  10. Conventional PCA is generally formulated as a projection of points from the D-dimensional data space onto an M-dimensional linear subspace. • Probabilistic PCA, however, is most naturally expressed as a mapping from the latent space into the data space. • We can reverse this mapping using Bayes’ theorem. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  11. EM algorithms for PCA • Using the EM algorithm to find maximum likelihood estimates of the model parameters • This may seem rather pointless because we have already obtained an exact closed-form solution for the maximum likelihood parameter values. • In spaces of high dimensionality, there may be computational advantages in using an iterative EM procedure rather than working directly with the sample covariance matrix. • This EM procedure can also be extended to the factor analysis model, for which there is no closed-form solution. • It allows missing data to be handled in a principled way. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  12. E step • M step (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
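
For reference, the E-step and M-step updates for probabilistic PCA take the standard forms (Bishop, Section 12.2.2):

E step:
E[zn] = M^-1 W^T (xn − x̄)
E[zn zn^T] = σ2 M^-1 + E[zn] E[zn]^T

M step:
W_new = [ Σn (xn − x̄) E[zn]^T ] [ Σn E[zn zn^T] ]^-1
σ2_new = (1/(ND)) Σn { ||xn − x̄||^2 − 2 E[zn]^T W_new^T (xn − x̄) + Tr( E[zn zn^T] W_new^T W_new ) }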

  13. Benefits of the EM algorithm for PCA • Computational efficiency for large-scale applications • Dealing with missing data (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
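
As an illustration of these updates, the following is a minimal NumPy sketch of the EM iteration for probabilistic PCA. The function name ppca_em, the random initialization, and the fixed iteration count are assumptions made for this example, not part of the original slides.

```python
import numpy as np

def ppca_em(X, M, n_iter=100, seed=0):
    """A sketch of EM for probabilistic PCA (Bishop, Section 12.2.2)."""
    N, D = X.shape
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                    # centered data, N x D
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((D, M))                   # initial D x M loading matrix
    sigma2 = 1.0                                      # initial noise variance

    for _ in range(n_iter):
        # E step: posterior moments of the latent variables z_n
        Mmat = W.T @ W + sigma2 * np.eye(M)           # the M x M matrix M
        Minv = np.linalg.inv(Mmat)
        Ez = Xc @ W @ Minv                            # rows are E[z_n]
        Ezz_sum = N * sigma2 * Minv + Ez.T @ Ez       # sum_n E[z_n z_n^T]

        # M step: re-estimate W and sigma^2
        W_new = (Xc.T @ Ez) @ np.linalg.inv(Ezz_sum)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(Ezz_sum @ W_new.T @ W_new)) / (N * D)
        W = W_new

    return W, sigma2, x_bar
```

Because the updates never form the D x D sample covariance matrix, each iteration costs roughly O(NDM) rather than O(ND2), which is the source of the computational saving when D is large.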

  14. Benefits of the EM algorithm for PCA • Another elegant feature of the EM approach is that we can take the limit σ2 → 0, corresponding to standard PCA, and still obtain a valid EM-like algorithm. • E step • M step (both given below) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
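
Written out (taking X̃ here to be the D x N matrix whose columns are the centered data vectors and Ω to be the corresponding M x N matrix of latent projections, which is one common convention), the EM-like updates in the σ2 → 0 limit are:

E step: Ω = (W_old^T W_old)^-1 W_old^T X̃
M step: W_new = X̃ Ω^T (Ω Ω^T)^-1

i.e. an alternation between two ordinary least-squares problems.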

  15. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  16. Model selection • In practice, we must choose a value, M, for the dimensionality of the principal subspace according to the application. • Because the probabilistic PCA model has a well-defined likelihood function, we can employ cross-validation to determine the dimensionality by selecting the value that gives the largest log likelihood on a validation data set. • However, this approach can become computationally costly. • Given that we have a probabilistic formulation of PCA, it seems natural to seek a Bayesian approach to model selection. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
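
As a concrete illustration of likelihood-based cross-validation over M, here is a short sketch using scikit-learn, whose PCA estimator exposes the probabilistic-PCA average log-likelihood through its score method. The function name, the candidate range of dimensionalities, and the number of folds are placeholders chosen for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

def select_dimensionality(X, max_M=10, cv=5):
    """Choose M by cross-validated probabilistic-PCA log-likelihood."""
    scores = []
    for m in range(1, max_M + 1):
        # PCA.score returns the average log-likelihood of the data under
        # the probabilistic PCA model, so cross_val_score can use it directly.
        scores.append(cross_val_score(PCA(n_components=m), X, cv=cv).mean())
    best_M = int(np.argmax(scores)) + 1
    return best_M, scores
```

As the slide notes, repeating the fit for every candidate M (and every fold) is what makes this approach computationally costly compared with a Bayesian treatment.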

  17. Bayesian PCA • Specifically, we define an independent Gaussian prior over each column of W; these columns represent the vectors defining the principal subspace. • Each Gaussian has an independent variance governed by a precision hyperparameter αi. • The effective dimensionality of the principal subspace is determined by the number of finite αi values, and the corresponding vectors wi can be thought of as ‘relevant’ for modeling the data distribution. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
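
The prior referred to above has the form (Bishop, Section 12.2.3):

p(W | α) = Π_{i=1..M} (αi / 2π)^{D/2} exp{ −(1/2) αi wi^T wi }

where wi is the i-th column of W, so driving αi to infinity forces wi toward zero and effectively switches that direction off.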

  18. Bayesian PCA • E step (12.54) (12.55) • M step (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
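
The E step is unchanged from probabilistic PCA (the equations cited above), while the M step acquires a regularizing term from the prior, and the hyperparameters are re-estimated as (Bishop, Section 12.2.3):

αi_new = D / (wi^T wi)
W_new = [ Σn (xn − x̄) E[zn]^T ] [ Σn E[zn zn^T] + σ2 A ]^-1,   A = diag(αi)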

  19. Factor analysis • Factor analysis is a linear-Gaussian latent variable model that is closely related to probabilistic PCA. • Its definition differs from that of probabilistic PCA only in that the conditional distribution of the observed variable x given the latent variable z is taken to have a diagonal rather than an isotropic covariance Ψ, where Ψ is a D x D diagonal matrix. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
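
Explicitly, the factor analysis model takes

p(x|z) = N(x | Wz + μ, Ψ)

with marginal distribution p(x) = N(x | μ, W W^T + Ψ); the columns of W are known as factor loadings and the diagonal elements of Ψ as uniquenesses.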

  20. Links between FA and PCA • At the stationary points of the likelihood function, for a factor analysis model with Ψ = σ2I, the columns of W are scaled eigenvectors of the sample covariance matrix, and σ2 is the average of the discarded eigenvalues. • The maximum of the log likelihood function occurs when the eigenvectors comprising W are chosen to be the principal eigenvectors. • Unlike probabilistic PCA, there is no longer a closed-form maximum likelihood solution for W, which must therefore be found iteratively. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  21. Factor analysis • E step • M step (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
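
For reference, the EM updates for factor analysis take the standard forms (Bishop, Section 12.2.4), with G = (I + W^T Ψ^-1 W)^-1:

E step:
E[zn] = G W^T Ψ^-1 (xn − x̄)
E[zn zn^T] = G + E[zn] E[zn]^T

M step:
W_new = [ Σn (xn − x̄) E[zn]^T ] [ Σn E[zn zn^T] ]^-1
Ψ_new = diag{ S − W_new (1/N) Σn E[zn] (xn − x̄)^T }

where S is the sample covariance matrix and diag{·} sets all off-diagonal elements to zero.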

  22. Kernel PCA • Applying the technique of kernel substitution to principal component analysis, to obtain a nonlinear generalization (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  23. Conventional PCA vs. Kernel PCA • Conventional PCA • (Assume that the mean of vectors xn is zero) • a nonlinear transformation φ(x) into an M-dimensional feature space • (Assume that the projected data set also has zero mean) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
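
In symbols, under the zero-mean assumptions stated above, the two eigenvector equations are:

Conventional PCA: S ui = λi ui, with S = (1/N) Σn xn xn^T and ui^T ui = 1
Feature space: C vi = λi vi, with C = (1/N) Σn φ(xn) φ(xn)^T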

  24. Solving the eigenvalue problem • Expressed in terms of the kernel function k(xn, xm) • in matrix notation (given below) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
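
Writing each feature-space eigenvector as vi = Σn ain φ(xn) and substituting into C vi = λi vi gives, in terms of the kernel k(xn, xm) = φ(xn)^T φ(xm),

K^2 ai = λi N K ai

whose non-trivial solutions can be found from

K ai = λi N ai

where K is the N x N Gram matrix with elements Knm = k(xn, xm) and ai is the column vector with elements ain.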

  25. The normalization condition for the coefficient ai is obtained by requiring that the eigenvectors in feature space be normalized. • Having solved the eigenvector problem, the resulting principal component projections can then also be cast in terms of the kernel function so that the projection of a point x onto eigenvector i is given by (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
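
Written out, the normalization condition and the projection are

1 = vi^T vi = ai^T K ai = λi N ai^T ai
yi(x) = φ(x)^T vi = Σn ain k(x, xn)

so both can be evaluated purely in terms of the kernel function.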

  26. In the original D-dimensional x space • D orthogonal eigenvectors • at most D linear principal components. • In the M-dimensional feature space • We can find a number of nonlinear principal components that can exceed D. • However, the number of nonzero eigenvalues cannot exceed the number N of data points, because (even if M > N) the covariance matrix in feature space has rank at most equal to N. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  27. In the general case (without assuming zero-mean features), 1N denotes the N x N matrix in which every element takes the value 1/N (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
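
With this definition, the centered Gram matrix that replaces K is

K̃ = K − 1N K − K 1N + 1N K 1N

and the eigenvalue problem K̃ ai = λi N ai is then solved exactly as before.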

  28. Nonlinear Latent Variable Models • Some generalizations of continuous latent variables framework to models that are either nonlinear or non-Gaussian, or both. • Independent component analysis (ICA) • Autoassociative neural networks • Etc. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  29. Independent component analysis • E.g., blind source separation • The observed variables are related linearly to the latent variables, but the latent distribution is non-Gaussian (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  30. Autoassociative neural networks • Neural networks have been applied to unsupervised learning where they have been used for dimensionality reduction. • This is achieved by using a network having the same number of outputs as inputs, and optimizing the weights so as to minimize some measure of the reconstruction error between inputs and outputs with respect to a set of training data • The vectors of weights which lead into the hidden units form a basis set which spans the principal subspace. • These vectors need not be orthogonal or normalized. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
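
As an illustration of this architecture, here is a minimal PyTorch sketch of an autoassociative network with a single hidden layer of M linear units; the class and function names, the optimizer, and the training settings are assumptions made for this example.

```python
import torch
import torch.nn as nn

class LinearAutoencoder(nn.Module):
    """D inputs -> M hidden units -> D outputs, trained to reproduce its input."""
    def __init__(self, D, M):
        super().__init__()
        self.encode = nn.Linear(D, M)   # weight vectors leading into the hidden units
        self.decode = nn.Linear(M, D)   # mapping back to the data space

    def forward(self, x):
        return self.decode(self.encode(x))

def train(model, X, epochs=500, lr=1e-2):
    """Minimize the squared reconstruction error between inputs and outputs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), X)
        loss.backward()
        opt.step()
    return model
```

With purely linear units, the trained hidden-unit weight vectors span the same subspace as the first M principal components although, as noted above, they need not be orthogonal or normalized.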

  31. Addition of extra hidden layer (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  32. Modeling nonlinear manifolds • One way to model the nonlinear structure is through a combination of linear models, so that we make a piece-wise linear approximation to the manifold. • An alternative to considering a mixture of linear models is to consider a single nonlinear model. • Conventional PCA finds a linear subspace that passes close to the data in a least-squares sense. • This concept can be extended to one-dimensional nonlinear surfaces in the form of principal curves. • Principal curves can be generalized to multidimensional manifolds called principal surfaces. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  33. Multidimensional scaling (MDS) • Locally linear embedding (LLE) • Isometric feature mapping (isomap) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  34. The models having continuous latent variables together with discrete observed variables • Latent trait models • Density network (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  35. Generative topological mapping (GTM) • It uses a latent distribution that is defined by a finite regular grid of delta functions over the (typically two-dimensional) latent space. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  36. Self-organizing map (SOM) • The GTM can be seen as a probabilistic version of an earlier model, SOM. • The SOM represents a two-dimensional nonlinear manifold as a regular array of discrete points. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
