Multivariate Analysis And PCA
Principal Components Analysis (PCA) • Is a Factor Analytic method • Can be used to: • Reduce the number of dimensions in data • Find patterns in high-dimensional data • Visualize data of high dimensionality • Example applications: • Face recognition • Image compression • Gene expression analysis
Curse of Dimensionality • A major problem is the curse of dimensionality. • If the data x lie in a high-dimensional space, an enormous amount of data is required to learn distributions or decision rules. • Example: 50 dimensions, each with 20 levels, gives a total of 20^50 (about 1.1 × 10^65) cells. The number of data samples will be far smaller, so there will not be enough samples to learn from.
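The arithmetic in the example can be checked directly. The sketch below counts the cells for 50 dimensions with 20 levels each; the sample size `n_samples` is a hypothetical value chosen only to show how sparse the grid becomes.

```python
# Number of cells in a grid with 50 dimensions and 20 levels per dimension.
dims = 50
levels = 20
cells = levels ** dims  # exact integer arithmetic

# Hypothetical sample size, chosen only for illustration.
n_samples = 1_000_000
samples_per_cell = n_samples / cells

print(f"cells = {cells:.3e}")
print(f"average samples per cell = {samples_per_cell:.3e}")
```

Even a million samples leave almost every cell empty, which is why reducing the number of dimensions comes before learning.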
What is PCA (Principal Component Analysis)? • Basic idea: given data in M-dimensional space, PCA reduces the dimensionality of a data set containing a large number of interrelated variables while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components, which are uncorrelated and are ordered so that the first few retain most of the variation present in all of the original variables.
Benefits of PCA • Reduction of computation and storage overhead • Reduction of noise • Useful for visualizing the data
Multivariate Analysis: Multiple Regression Ordination
Ordination Goal: to discover and summarize the main patterns of variation in a set of variables measured over a number of sample locations.
Ordination • Ordination techniques may generate useful simplifications of patterns in complex multivariate data sets. • Ordination combines common variation into new variables (multivariate axes) along which samples are ordered.
Dimension Reduction • One way to avoid the curse of dimensionality is by projecting the data onto a lower-dimensional space. • Techniques for dimension reduction: • Principal Component Analysis (PCA) • Fisher’s Linear Discriminant • Multi-dimensional Scaling. • Independent Component Analysis.
A Numerical Example • Original data values & mean centered:
A Numerical Example • Transformed data space:
Ordination “A procedure for adapting a multidimensional swarm of data points in such a way that when it is projected onto a reduced number of dimensions any intrinsic pattern will become apparent”
Ordination Data reduction technique: • To select low-dimensional projections of the data for graphing. • To search for “structure” in the data.
A Numerical Example • Compare original vs. transformed data space:
Ordination methods: • Principal Component Analysis (PCA) • Correspondence Analysis (CA) • Principal Coordinate Analysis (PCoA) • Discriminant Function Analysis (DFA)
PCA: Principal components analysis (PCA) is perhaps the most common technique used to summarize patterns among variables in multivariate datasets.
Principal Component Analysis (PCA): • A geometric interpretation • PCA constructs a new coordinate system (new variables) whose axes are linear combinations of the original axes and are defined to align with the samples' major dimensions, or axes of variation. • PCA finds the coordinate system that best represents the internal variability in the data.
Principal Component Analysis (PCA): A geometric interpretation The technique also compresses the internal variability in the data into a smaller number of important axes, by capturing associations among variables (species, environmental variables).
Basics & Background • Objective: Conceptualize underlying pattern or structure of observed variables yi1, …,yip on p attributes at each of n sites si. • PCA can be viewed as a rotation between data spaces of yi1, …,yip and ui1, …,uip. • Where u1 is measured along the direction of maximum separation (i.e., variance) and u2 along the second in line and so forth …
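The rotation view can be illustrated with a small sketch. The data are synthetic, and the names `Y`, `V` and `U` are assumptions introduced here to mirror the y and u variables on the slide. Note that the eigenvector matrix is orthogonal, so it may include a reflection as well as a rotation.

```python
import numpy as np

# Synthetic, correlated 2-D data (illustration only).
rng = np.random.default_rng(7)
Y = rng.normal(size=(40, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
Yc = Y - Y.mean(axis=0)  # mean-center each variable

# Eigenvectors of the covariance matrix define the new axes.
vals, V = np.linalg.eigh(np.cov(Yc, rowvar=False))  # ascending order
vals, V = vals[::-1], V[:, ::-1]                    # largest variance first

# Coordinates of every site in the rotated space u1, u2.
U = Yc @ V

print("variance along u1, u2:", U.var(axis=0, ddof=1))
print("V is orthogonal:", np.allclose(V.T @ V, np.eye(2)))
```

Because `V` is orthogonal, distances from the centroid and the total variance are unchanged; only the way the variance is split between axes differs.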
Principal components • First principal component (PC1) • the direction along which there is the greatest variation • Second principal component (PC2) • the direction with the maximum variation left in the data, orthogonal to PC1 • In general, principal components are: • linear combinations of the original variables • uncorrelated with each other
Basics & Background • Eigenvalue and Eigenvector: • Eigen originates in the German language and can be loosely translated as “of itself” • Thus an Eigenvalue of a matrix could be conceptualized as a “value of itself” • Eigenvalues and Eigenvectors are utilized in a wide range of applications (PCA, calculating a power of a matrix, finding solutions for a system of differential equations, and growth models)
Background - variance • Standard deviation: • Typical distance from the mean to a point (the root-mean-square deviation) • Variance: • Standard deviation squared • One-dimensional measure
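A minimal sketch of the two quantities, using made-up values. The sample form with n - 1 in the denominator is an assumption here; the slide does not say which form is intended.

```python
import numpy as np

# Illustrative one-dimensional data (not from the slides).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()
# Sample variance: sum of squared deviations divided by n - 1 (assumed form).
variance = np.sum((x - mean) ** 2) / (len(x) - 1)
std = np.sqrt(variance)

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", std)
```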
Principal Component Analysis • PCA is the most commonly used dimension reduction technique. • (Also called the Karhunen-Loeve transform.) • PCA – data samples x_1, …, x_N • Compute the mean: μ = (1/N) Σ_i x_i
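Computing the mean and subtracting it is the first step of PCA. A small sketch with hypothetical samples stored as the rows of `X`:

```python
import numpy as np

# Three hypothetical samples with two variables each (rows are samples).
X = np.array([[2.0, 1.0],
              [4.0, 3.0],
              [6.0, 8.0]])

mu = X.mean(axis=0)  # mean of each variable
Xc = X - mu          # mean-centered data

print("mean:", mu)
print("centered data:")
print(Xc)
```

After centering, every column of `Xc` has mean zero, so the covariance matrix measures variation about the mean.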
Background - covariance • How two dimensions vary from the mean with respect to each other • cov(X,Y)>0: Dimensions increase together • cov(X,Y)<0: One increases as the other decreases • cov(X,Y)=0: Dimensions are uncorrelated (no linear relationship; this does not by itself imply independence)
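The three sign cases can be reproduced with a short sketch. The helper `sample_cov` and the data are hypothetical, written only for this illustration. The last case shows a variable that is fully determined by another yet has zero covariance with it, since covariance only captures linear association.

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance of two equal-length 1-D arrays (n - 1 denominator)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
cov_up = sample_cov(x, 2 * x + 1)   # y rises with x
cov_down = sample_cov(x, 10 - x)    # y falls as x rises

t = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
cov_zero = sample_cov(t, t ** 2)    # y depends on t, but not linearly

print(cov_up, cov_down, cov_zero)
```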
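Background – covariance matrix • Contains the covariance values between all possible pairs of dimensions. • Example for three dimensions (x, y, z):
C = [ cov(x,x)  cov(x,y)  cov(x,z) ]
    [ cov(y,x)  cov(y,y)  cov(y,z) ]
    [ cov(z,x)  cov(z,y)  cov(z,z) ]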
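For three dimensions the covariance matrix is 3 by 3. A sketch on synthetic data, using NumPy's `np.cov` with variables in columns:

```python
import numpy as np

# Synthetic data: 10 observations of three variables x, y, z (illustration only).
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 3))

# rowvar=False tells NumPy that each column is a variable.
C = np.cov(data, rowvar=False)

print(C)
print("symmetric:", np.allclose(C, C.T))
print("diagonal holds the variances:",
      np.allclose(np.diag(C), data.var(axis=0, ddof=1)))
```

The diagonal entries are the variances of x, y and z, and the matrix is symmetric because cov(a,b) = cov(b,a).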
Associations among variables in PCA are measured by: • Correlation matrix (variables have different scales, e.g., environmental variables). • Covariance matrix (variables have the same scale, e.g., morphological variables; it preserves allometric relationships, i.e., parts of the same organism growing at different rates).
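One way to see the difference: the correlation matrix is the covariance matrix of the standardized variables, so differences in scale are removed. A sketch on synthetic data whose columns are deliberately placed on very different scales:

```python
import numpy as np

# Synthetic variables on very different scales (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])

# Standardize: subtract the mean and divide by the standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

R = np.corrcoef(X, rowvar=False)  # correlation matrix of the raw data
C_of_Z = np.cov(Z, rowvar=False)  # covariance matrix of the standardized data

print("covariance diagonal of raw data:", np.diag(np.cov(X, rowvar=False)))
print("R equals cov(Z):", np.allclose(R, C_of_Z))
```

With the covariance matrix, the variable with the largest scale dominates the first component; with the correlation matrix, every variable contributes unit variance.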
Why choose only two axes? • The eigenvalues for the 3 axes are 1.8907, 0.9951 and 0.1142. • Eigenvalues are typically expressed as a percentage of the total: PCA Axis 1: 63% PCA Axis 2: 33% PCA Axis 3: 4% • The first two axes together account for about 96% of the variation.
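The percentages follow from dividing each eigenvalue by the sum of all three. A short check using the eigenvalues quoted on the slide:

```python
# Eigenvalues quoted on the slide.
eigenvalues = [1.8907, 0.9951, 0.1142]

total = sum(eigenvalues)
percent = [100 * v / total for v in eigenvalues]
rounded = [round(p) for p in percent]
first_two = percent[0] + percent[1]

print("percent of total:", rounded)
print(f"first two axes together: {first_two:.1f}%")
```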
Describing Video via PCA • Strategy: condense local spatial information using PCA, and preserve the temporal information by keeping the reduced spatial information for all frames.
The mathematics of Principal Component Analysis (PCA): Eigenanalysis is a mathematical operation on a square, symmetric matrix (e.g., a pairwise correlation matrix). A square matrix has the same number of rows and columns. A symmetric matrix is equal to its own transpose.
The mathematics of Principal Component Analysis (PCA): The result of an eigenanalysis is a series of eigenvalues and eigenvectors. Each eigenvalue has an eigenvector, and there are as many eigenvectors and eigenvalues as there are rows in the initial correlation or covariance matrix. Eigenvalues are usually ranked from the greatest to the least.
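A minimal sketch of an eigenanalysis with NumPy. The 3 by 3 correlation matrix `R` is made up for illustration. `np.linalg.eigh` is intended for symmetric matrices and returns eigenvalues in ascending order, so they are re-sorted here from greatest to least.

```python
import numpy as np

# Hypothetical correlation matrix: square and symmetric.
R = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(R)  # ascending order

# Rank from greatest to least, keeping each eigenvector with its eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # eigenvectors are the columns

print("eigenvalues:", eigenvalues)
print("sum of eigenvalues:", eigenvalues.sum())
```

For a correlation matrix the eigenvalues sum to the number of variables, because each standardized variable contributes a variance of 1.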
Principal component analysis presents three important structures: 1 - Eigenvalues: represent the amount of variation summarized by each principal component. The first principal component (PC-1) presents the largest amount, PC-2 presents the second largest, and so on.
Step 1: Extracting features • Features used in video analysis: color, texture, shape, motion vectors, … • Criterion for choosing features: they should have similar statistical behavior across time • Color histogram: simple and robust • Motion vectors: invariant to color and lighting
Principal component analysis presents three important structures: 2 - Eigenvectors: Each principal component is a linear function with a coefficient for each variable. The eigenvectors contain these coefficients. High values, positive or negative, represent a high association with the component.
Principal component analysis presents three important structures: 3 - Scores: Since each component is a linear function of the variables, multiplying the standardized variables (in the case of correlation matrices) by the eigenvector matrix produces a matrix containing the position of each observation on each principal component. The plot of these scores in the first few dimensions represents the main patterns of variation among the original observations.
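The three structures can be produced together in a few lines. This is a sketch on synthetic data, not the stream data set discussed in the slides; the sizes `n_obs` and `n_vars` are arbitrary assumptions.

```python
import numpy as np

# Synthetic data: 30 observations of 4 variables (illustration only).
rng = np.random.default_rng(42)
n_obs, n_vars = 30, 4
X = rng.normal(size=(n_obs, n_vars))
X[:, 1] += 0.8 * X[:, 0]  # make two variables correlated

# Standardize the variables (the correlation-matrix case).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 1 and 2: eigenvalues and eigenvectors of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# 3: scores, the position of each observation on each principal component.
scores = Z @ vecs

print("variance of the scores on each axis:", scores.var(axis=0, ddof=1))
print("eigenvalues:                        ", vals)
```

The variance of the scores along each axis equals the corresponding eigenvalue, and the score columns are uncorrelated with each other.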
Original data → correlation or covariance matrix → eigenvalues and eigenvectors → scores
Principal component analysis: an example 53 sites 28 sites
Local environment: • Depth • Depth variation • Current velocity • Current variation • Substrate composition: Boulder, rubble, gravel and sand • Substrate variation • Width variation (irregularity) • Area • Altitude Are the two streams different in their environments?
Original data (87 sites by 12 variables) → correlation matrix (12 variables) → 12 eigenvalues, eigenvectors (12 x 12) → scores (87 sites by 12 PC axes)
[Figure: ordination plot of the 12 variables: boulder, depth, current, area, altitude, width variation, current variability, depth variability, rubble, gravel, sediment variability, sand]
[Figure: the same ordination plot with the sites of the two streams, Macae and Macacu, overlaid]
Ordination bi-plots This summary is often a useful end in itself: the analysis discovers the latent structure of the data and how the variables contribute to this structure.
Background – eigenvalues & eigenvectors • Nonzero vectors x having the same direction as Ax are called eigenvectors of A (A is an n by n matrix). • In Ax = λx, λ is called an eigenvalue of A. • Ax = λx ⇔ (A − λI)x = 0
How to calculate x and λ: • Calculate det(A − λI), which yields a polynomial in λ of degree n • Determine the roots of det(A − λI) = 0; the roots are the eigenvalues • Solve (A − λI)x = 0 for each λ to obtain the eigenvectors x
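The three steps can be traced on a small case. The 2 by 2 matrix `A` is a made-up example, for which the determinant expands to λ² − trace(A)·λ + det(A). The null space in the last step is found with an SVD, which is one possible choice rather than the only one.

```python
import numpy as np

# Hypothetical symmetric 2 x 2 matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Step 1: characteristic polynomial of a 2 x 2 matrix,
# lambda^2 - trace(A) * lambda + det(A).
coeffs = [1.0, -np.trace(A), np.linalg.det(A)]

# Step 2: its roots are the eigenvalues (sorted greatest first).
lambdas = np.sort(np.real_if_close(np.roots(coeffs)))[::-1]

# Step 3: for each eigenvalue, solve (A - lambda I) x = 0.
# The last right-singular vector spans the null space.
eigvecs = []
for lam in lambdas:
    _, _, vt = np.linalg.svd(A - lam * np.eye(2))
    eigvecs.append(vt[-1])

for lam, x in zip(lambdas, eigvecs):
    print("lambda =", lam, " x =", x,
          " Ax == lambda x:", np.allclose(A @ x, lam * x))
```

In practice a library routine such as `np.linalg.eigh` would be used instead, since finding polynomial roots becomes numerically fragile as n grows.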