Principal Component Analysis

Presentation Transcript


  1. Principal Component Analysis Machine Learning

  2. Last Time • Expectation Maximization in Graphical Models • Baum-Welch

  3. Now • Unsupervised Dimensionality Reduction

  4. Curse of Dimensionality • In (nearly) all modeling approaches, more features (dimensions) require (a lot) more data • Typically exponential in the number of features • This is clearly seen by counting the entries needed to fill a full probability table. • Geometric arguments are also made: compare the volume of an inscribed hypersphere to the volume of the enclosing hypercube (see the sketch below).
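
A minimal sketch of that hypersphere-versus-hypercube comparison (an illustration, not from the slides; the dimensions printed are arbitrary): the fraction of a unit hypercube occupied by its inscribed hypersphere collapses toward zero as the dimension grows, so most of the cube's volume ends up in its corners.

import math

def inscribed_sphere_fraction(d):
    """Fraction of the unit d-cube filled by its inscribed ball (radius 1/2)."""
    radius = 0.5
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d

for d in (1, 2, 3, 5, 10, 20):
    print(f"d = {d:2d}: inscribed sphere fills {inscribed_sphere_fraction(d):.6f} of the cube")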

  5. Dimensionality Reduction • We’ve already seen some of this. • Regularization attempts to reduce the number of effective features used in linear and logistic regression models.

  6. Linear Models • When we regularize, we optimize an objective that pushes as many feature weights as possible toward (or exactly to) zero. • The “effective” number of dimensions is then much smaller than D (see the sketch below).
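
A hedged sketch of that effect using scikit-learn on synthetic data (the data, the penalty strength C, and the feature counts are all illustrative assumptions): with an L1 penalty, most of the fitted weights come out exactly zero, so the effective dimensionality is far below D = 20.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                   # 20 features...
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # ...but only 2 are informative

# L1 (lasso-style) penalty encourages exact zeros in the weight vector.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("nonzero weights:", np.count_nonzero(clf.coef_), "of", X.shape[1])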

  7. Support Vector Machines • In exemplar approaches (SVM, k-NN) each training point can be considered to describe a dimension. • By keeping only the instances that define the margin (the support vectors) and setting α to zero for all other instances, SVMs use only a subset of the available dimensions in their decision making.
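
A sketch of that sparsity, again with scikit-learn on made-up two-class data (all numbers here are illustrative assumptions): only the training points listed in support_ carry nonzero dual coefficients α and contribute to the decision function.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(100, 2)),   # class 0 cluster
               rng.normal(+2, 1, size=(100, 2))])  # class 1 cluster
y = np.array([0] * 100 + [1] * 100)

svm = SVC(kernel="linear").fit(X, y)
print("support vectors:", len(svm.support_), "of", len(X), "training points")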

  8. Decision Trees • Decision Trees explicitly select split points based on features that improve Information Gain or Accuracy • Features that don’t contribute sufficiently to the classification are never used. • [Figure: a small example tree splitting on weight < 165 and then height < 68, with leaves labeled 5M, 5F, and 1F / 1M]
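
A sketch of the same idea with scikit-learn on synthetic data (the features, labels, and tree depth are assumptions): features that are never chosen for a split receive zero importance, so the uninformative ones here should come out at or near zero.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)            # only feature 0 determines the label

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("feature importances:", tree.feature_importances_)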

  9. Feature Spaces • Even though a data point is described in terms of D features, this may not be the most compact representation of the feature space • Even classifiers that try to use a smaller effective feature space can suffer from the curse of dimensionality • If a feature has some discriminative power, the dimension may remain in the effective set.

  10. 1-d data in a 2-d world

  11. Dimensions of high variance

  12. Identifying dimensions of variance • Assumption: directions that show high variance are the appropriate/useful dimensions for representing the feature set.

  13. Aside: Normalization • Assume 2 features: • Percentile GPA • Height in cm. • Which dimension shows greater variability?

  14. Aside: Normalization • Assume 2 features: • Percentile GPA • Height in cm. • Which dimension shows greater variability?

  15. Aside: Normalization • Assume 2 features: • Percentile GPA • Height in m. • Which dimension shows greater variability? • The answer depends on the choice of units, so features should be normalized (e.g., z-scored) before their variances are compared (see the sketch below).
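
A small sketch of the aside (the distributions and sample size are invented for illustration): converting height from centimetres to metres changes its raw variance by a factor of 10,000 without changing the data, while the z-scored variance of every feature is 1 regardless of units.

import numpy as np

rng = np.random.default_rng(3)
gpa_percentile = rng.uniform(0, 100, size=500)    # 0-100 scale
height_cm = rng.normal(170, 10, size=500)         # centimetres

for name, x in [("GPA percentile", gpa_percentile),
                ("height (cm)", height_cm),
                ("height (m)", height_cm / 100.0)]:
    z = (x - x.mean()) / x.std()                  # z-score: zero mean, unit variance
    print(f"{name:15s} raw variance = {x.var():10.4f}   z-scored variance = {z.var():.2f}")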

  16. Principal Component Analysis • Principal Component Analysis (PCA) identifies the dimensions of greatest variance of a set of data.

  17. Eigenvectors • Eigenvectors are orthogonal vectors that define a space, the eigenspace. • Any data point can be described as a linear combination of eigenvectors. • An eigenvector v of a square matrix A has the property A v = λ v. • The associated scalar λ is the eigenvalue.
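
A small numeric check of that property (an illustration, not from the slides; the matrix is arbitrary): for a symmetric matrix, each eigenvector satisfies A v = λ v and the eigenvectors are orthonormal.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eigh(A)     # columns are eigenvectors

for lam, v in zip(eigenvalues, eigenvectors.T):
    print("A v == lambda v:", np.allclose(A @ v, lam * v))
print("orthonormal:", np.allclose(eigenvectors.T @ eigenvectors, np.eye(2)))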

  18. PCA • Write each data point in this new space: x ≈ μ + Σᵢ cᵢ eᵢ, a weighted sum of eigenvectors eᵢ plus the mean. • To do the dimensionality reduction, keep only C < D dimensions (eigenvectors). • Each data point is now represented by its vector of coefficients cᵢ.

  19. Identifying Eigenvectors • PCA is easy once we have the eigenvectors and the mean. • Identifying the mean is easy. • Eigenvectors of the covariance matrix represent a set of directions of variance. • Eigenvalues represent the amount of variance along each of those directions.
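
A sketch of that recipe on synthetic 2-D data (the data and its covariance are assumptions): compute the mean, the covariance matrix, and the eigenvectors/eigenvalues of the covariance matrix, sorted so the highest-variance direction comes first.

import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=500)

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

order = np.argsort(eigenvalues)[::-1]             # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
print("variance along each principal direction:", eigenvalues)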

  20. Eigenvectors of the Covariance Matrix • Eigenvectors are orthonormal • In the eigenspace, the Gaussian is diagonal: zero covariance between dimensions. • All eigenvalues are non-negative. • Eigenvalues are sorted in decreasing order. • Larger eigenvalues correspond to higher variance.

  21. Dimensionality reduction with PCA • To convert an original data point to its PCA coefficients: c = E^T (x - μ), where the columns of E are the top C eigenvectors. • To reconstruct a point: x ≈ μ + E c (see the sketch below).
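
A self-contained sketch of both directions on synthetic 3-D data, keeping C = 2 of D = 3 dimensions (the data, covariance, and C are illustrative assumptions rather than the slides' own example):

import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 2, 0], [2, 3, 0], [0, 0, 0.1]], size=1000)

mu = X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
E = eigenvectors[:, np.argsort(eigenvalues)[::-1]][:, :2]   # top C = 2 eigenvectors

x = X[0]
c = E.T @ (x - mu)            # convert: coefficients in the eigenspace
x_hat = mu + E @ c            # reconstruct from the coefficients
print("original:      ", x)
print("reconstruction:", x_hat)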

  22. Eigenfaces • Face images are encoded into the eigenspace and then decoded. • The quality of the encoding can be evaluated with absolute or squared reconstruction error (see the sketch below).
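
A sketch of that evaluation on synthetic data standing in for the face images (the dimensions, covariance, and number of kept components are assumptions): every point is encoded with the top C eigenvectors and decoded again, then the mean absolute and mean squared reconstruction errors are reported.

import numpy as np

rng = np.random.default_rng(6)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 2, 0], [2, 3, 0], [0, 0, 0.1]], size=1000)

mu = X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
E = eigenvectors[:, np.argsort(eigenvalues)[::-1]][:, :2]   # keep C = 2

X_hat = mu + (X - mu) @ E @ E.T          # encode then decode every point
print("mean absolute error:", np.abs(X - X_hat).mean())
print("mean squared error: ", ((X - X_hat) ** 2).mean())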

  23. Some other (unsupervised) dimensionality reduction techniques • Kernel PCA • Distance Preserving Dimension Reduction • Maximum Variance Unfolding • Multidimensional Scaling (MDS) • Isomap

  24. Next Time • Model Adaptation and Semi-supervised Techniques • Work on your projects.
