
Principal Component Analysis


Presentation Transcript


  1. Principal Component Analysis. Beatrice M. Ombuki-Berman. See attached Keller slides & paper.

  2. What is PCA? • A standard statistical technique for reducing the dimensionality of data (without using a neural approach). • Purpose: better understanding or communication of data; used widely in the sciences to select the most important features. • In so reducing, we want to lose as little information as possible, given the before and after dimensions. • Also known as the Karhunen-Loève (K-L) transformation (Watanabe, 1969).

  3. Comparison with Linear Regression • Linear regression requires one to pre-identify dependent vs. independent variables. • PCA does not.

  4. Suppose we are given a set of data points: http://144.124.112.51/auj/scattering/demo/page4.active.html. Transform coordinates to get a better understanding.

  5. Shift the set of points so that the average position is at the origin.
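
This centering step is the first computation in PCA and takes one line with NumPy. A minimal sketch in Python (the data values are made up for illustration):

```python
import numpy as np

# Example 2D points: one row per point, one column per variable.
points = np.array([[2.5, 2.4],
                   [0.5, 0.7],
                   [2.2, 2.9],
                   [1.9, 2.2],
                   [3.1, 3.0]])

# Shift so the average position (the column means) sits at the origin.
centered = points - points.mean(axis=0)

print(centered.mean(axis=0))  # approximately [0. 0.]
```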

  6. We may need answers to questions like: • What is the trend (direction) of the set of points? • Which is the direction of maximum variance (dispersion)? • Which is the direction of minimum variance (dispersion)?

  7. We may need answers to questions like (2): • Suppose you are only allowed to use a 1D plot for this set of 2D points. How should the points be represented so that the overall error is minimized? This is a data compression problem. • All of these questions can be answered using Principal Components Analysis (PCA).

  8. What is Principal Component Analysis? • A standard statistical technique for data reduction. • Also known as the Karhunen-Loève (K-L) transformation (in communications theory). • An effective data-reduction technique for representing the most common variations across all the training data.

  9. Principal component analysis • A useful statistical technique in fields such as face recognition and image compression. • A common technique for finding patterns in data of high dimension.

  10. Principal Component Analysis (a linear method) • Compute the covariance matrix. • Determine its principal components. • Project the data onto the plane spanned by the principal components.
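
These three steps map directly onto a few lines of NumPy. A minimal sketch, assuming the data comes as one row per point (the function and variable names are ours, not from the slides):

```python
import numpy as np

def pca(data, n_components):
    """Project `data` (one row per point) onto its top principal components."""
    # Center the data so the average position is at the origin.
    centered = data - data.mean(axis=0)
    # Step 1: compute the covariance matrix (columns are variables).
    cov = np.cov(centered, rowvar=False)
    # Step 2: the principal components are the eigenvectors of the
    # covariance matrix; eigh returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    components = eigenvectors[:, ::-1][:, :n_components]  # largest first
    # Step 3: project the centered data onto the chosen components.
    return centered @ components
```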

  11. What is a “Principal Component”? • The number of principal components depends on the number of dimensions of the data points. • The first principal component is the predominant direction in the data.

  12. Illustrative Example • Project 2-dimensional data down onto a 1-dimensional space, as in the usage sketch below.
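
Using the `pca` sketch from slide 10, the 2D-to-1D projection looks like this (the data is synthetic, made up for illustration):

```python
import numpy as np

# Assumes the pca() sketch from slide 10 is in scope.
rng = np.random.default_rng(0)
# Synthetic 2D points with a strong trend along one direction.
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.3, size=200)])

projected = pca(data, n_components=1)  # one coordinate per point
print(projected.shape)                 # (200, 1)
```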

  13. Main idea • Transform the input data into fewer dimensions • Preserve as much of the variance as possible

  14. Transformation

  15. Predominant direction • Minimizes the reconstruction error • Spanned by the directions of largest variance • Spanned by the principal eigenvectors of the covariance matrix, i.e., the eigenvectors with maximal eigenvalues
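
The link between eigenvalues and variance is easy to check numerically. A small sketch (the covariance values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data drawn with a known covariance.
data = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.5], [1.5, 1.0]], size=5000)
centered = data - data.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
top = eigenvectors[:, -1]              # eigenvector with the maximal eigenvalue

# The variance of the data projected onto `top` equals the top eigenvalue.
print(np.var(centered @ top, ddof=1))  # close to eigenvalues[-1]
print(eigenvalues[-1])
```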

  16. Planets Example (from http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html) • Suppose we have a 3-dimensional data set where the variables are the logarithms of: • distance to the sun • equatorial diameter • density

  17. Data Sets prior to taking logs

  18. Projecting data • It is possible to project a set of data points onto fewer dimensions by ignoring certain columns, as in the sketch below. • Projection is a special case of a linear transformation.
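
Dropping columns is itself a linear map, which a short sketch makes concrete (the data values are placeholders):

```python
import numpy as np

# Three variables per planet: log distance, log diameter, log density.
data = np.arange(12, dtype=float).reshape(4, 3)   # placeholder values

# Projection by ignoring the third column...
by_slicing = data[:, :2]

# ...is the same as multiplying by a selection matrix, i.e. a linear map.
P = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
by_matrix = data @ P

print(np.allclose(by_slicing, by_matrix))  # True
```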

  19. Data projections: In 2D, we can plot any one variable against another

  20. Data projection • Having the 2D data plot gives a nice representation with obvious properties: • close points mean similar planets • far-apart points mean dissimilar planets • convex hull points mean "extreme" planets • Now suppose that there is another planet feature that you find equally important: the density.

  21. Maximizing Variance • Transforming using the first two principal components preserves more of the variance (summing the variances in each dimension) in the projection than does projecting onto any two of the original variables.
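
A quick numerical check of this claim, sketched with synthetic 3-variable data standing in for the log planet features (all values are made up):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
# Correlated 3-variable data with a known covariance.
data = rng.multivariate_normal([0, 0, 0],
                               [[2.0, 0.8, 0.3],
                                [0.8, 1.0, 0.5],
                                [0.3, 0.5, 0.7]], size=2000)
centered = data - data.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
top2 = eigenvectors[:, -2:]                       # two largest eigenvalues
pca_var = np.var(centered @ top2, axis=0, ddof=1).sum()

# Best that projecting onto any pair of the original variables can do.
pair_var = max(np.var(centered[:, list(p)], axis=0, ddof=1).sum()
               for p in combinations(range(3), 2))

print(pca_var, ">=", pair_var)  # the PCA projection retains more variance
```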
