Principal Components Analysis - PowerPoint PPT Presentation

Presentation Transcript

    1. Xuhua Xia Slide 1 Principal Components Analysis. Objectives: understand the principles of principal components analysis (PCA); recognize conditions under which PCA may be useful; use the SAS procedure PRINCOMP to perform a principal components analysis; interpret PRINCOMP output.

    2. Typical Form of Data

    3. What are Principal Components? Principal components are linear combinations of the observed variables. The coefficients of these principal components are chosen to meet three criteria. What are the three criteria?

    4. What are Principal Components? The three criteria: (1) for p observed variables there are exactly p principal components (PCs), each being a linear combination of the observed variables; (2) the PCs are mutually orthogonal (i.e., perpendicular and uncorrelated); (3) the components are extracted in order of decreasing variance.

    5. A Simple Data Set

    6. General Patterns. The total variance is 3 (= 1 + 2). The two variables, X and Y, are perfectly correlated, with all points falling on the regression line. The spatial relationship among the 5 points can therefore be represented by a single dimension. PCA is a dimension-reduction technique. What would happen if we apply PCA to the data?
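
As a rough illustration of what PCA does with perfectly correlated data, the Python sketch below uses a hypothetical data set (the slide's own table is not reproduced in this transcript). The five points are assumed to lie on a straight line and are scaled so that Var(X) = 1 and Var(Y) = 2. The helper names `covariance` and `eigenvalues_2x2` are illustrative, and the closed form used for the eigenvalues only applies to 2 x 2 symmetric matrices.

```python
import math
from statistics import mean, variance

# Hypothetical data (an assumption, not the slide's table): five points on the
# line y = sqrt(2) * x, scaled so that Var(X) = 1 and Var(Y) = 2.
s = math.sqrt(0.4)
x = [-2 * s, -s, 0.0, s, 2 * s]
y = [math.sqrt(2) * v for v in x]

def covariance(u, v):
    """Sample covariance with an n - 1 denominator."""
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / (len(u) - 1)

def eigenvalues_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    half_trace = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return half_trace + radius, half_trace - radius

var_x, var_y, cov_xy = variance(x), variance(y), covariance(x, y)
lam1, lam2 = eigenvalues_2x2(var_x, cov_xy, var_y)

print("total variance:", round(var_x + var_y, 6))
print("eigenvalues:   ", round(lam1, 6), round(abs(lam2), 6))
```

Under these assumptions the first eigenvalue carries essentially the whole total variance and the second is zero up to rounding error, which is the numerical counterpart of saying that a single dimension represents all five points.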

    7. Graphic PCA

    8. SAS Program

    9. A positive definite matrix. When you run the SAS program, the log file will warn that "The Correlation Matrix is not positive definite." What does that mean? A symmetric matrix M (such as a correlation matrix or a covariance matrix) is positive definite if z^T M z > 0 for all non-zero vectors z with real entries, where z^T is the transpose of z. Given our correlation matrix with all entries being 1, it is easy to find a z that leads to z^T M z = 0. So the matrix is not positive definite.
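
A worked check of this claim, written as a sketch: the 2 x 2 all-ones matrix below is assumed to be the correlation matrix of the two perfectly correlated variables, and the example vector z = (1, -1) is one choice among many.

```latex
% Assumption: M is the 2 x 2 correlation matrix with every entry equal to 1.
\[
M = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \qquad
z = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}, \qquad
z^{T} M z = z_1^2 + 2 z_1 z_2 + z_2^2 = (z_1 + z_2)^2 \ge 0
\]
% Any non-zero z with z_2 = -z_1 gives zero, for example:
\[
z = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
\;\Longrightarrow\;
z^{T} M z = (1 - 1)^2 = 0
\]
```

The quadratic form is never negative but does reach zero for a non-zero z, so this matrix is positive semidefinite rather than positive definite.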

    10. SAS Output

    11. SAS Output

    12. SAS Output

    13. Steps in a PCA: (1) have at least two variables; (2) generate a correlation or variance-covariance matrix; (3) obtain eigenvalues and eigenvectors (this is called an eigenvalue problem, and will be illustrated with a simple numerical example); (4) generate principal component (PC) scores; (5) plot the PC scores in the space with reduced dimensions. All these steps can be automated by using SAS. When to use a correlation matrix: 1. When different units are used for different variables. 2. When data are from species of very different mean densities.
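
The steps listed above can be sketched in plain Python for the two-variable case. This is only an illustration under stated assumptions: the data values are made up, the variance-covariance matrix is used rather than the correlation matrix, and the closed-form eigen-solution applies only to a 2 x 2 symmetric matrix with a non-zero off-diagonal entry. It is not a substitute for the SAS PRINCOMP procedure.

```python
import math
from statistics import mean

# Step 1: at least two variables (hypothetical measurements, illustration only).
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

def cov(u, v):
    """Sample covariance with an n - 1 denominator; cov(u, u) is the variance."""
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / (len(u) - 1)

# Step 2: variance-covariance matrix [[a, b], [b, d]].
a, b, d = cov(x, x), cov(x, y), cov(y, y)

# Step 3: eigenvalues (largest first) and unit-length eigenvectors.
# Closed form for a 2 x 2 symmetric matrix; assumes b != 0.
half_trace = (a + d) / 2
radius = math.hypot((a - d) / 2, b)
eigenvalues = [half_trace + radius, half_trace - radius]
eigenvectors = []
for lam in eigenvalues:
    v1, v2 = b, lam - a  # solves (a - lam) * v1 + b * v2 = 0
    norm = math.hypot(v1, v2)
    eigenvectors.append((v1 / norm, v2 / norm))

# Step 4: PC scores = centered data projected onto each eigenvector.
mx, my = mean(x), mean(y)
scores = [
    [(xi - mx) * v1 + (yi - my) * v2 for xi, yi in zip(x, y)]
    for v1, v2 in eigenvectors
]

# Step 5 (plotting) is omitted; scores[0] is the one-dimensional summary.
print("eigenvalues:", [round(lam, 4) for lam in eigenvalues])
print("PC1 scores: ", [round(s, 3) for s in scores[0]])
```

In this sketch the variance of each column of scores equals the corresponding eigenvalue, and the two sets of scores are uncorrelated.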

    14. Covariance or Correlation Matrix? We sample species along a sandy beach. If some species (e.g., Sp2) increase their abundance from lower to higher shore, while other species (e.g., Sp1) maintain their abundance, then clearly it is the variance in the abundance of Sp2 that reflects the tidal gradient. The variance of Sp1 is entirely uninformative with reference to the gradient. We note that the variance in abundance for Sp2 is large and that for Sp1 is small. If we use a correlation matrix in PCA, then the variance of the former and that of the latter are treated with equal weight, which is not a good idea. In such cases, we should use a covariance matrix.

    15. Covariance or Correlation Matrix? Now we are sampling the same sandy beach. We note that the variance in abundance for Sp2 is greater than that for Sp3. However, the abundance of Sp3 is in fact a better predictor of the tidal gradient than that of Sp2. If we use a covariance matrix, then the variance in Sp3, which is smaller, is given less weight in PCA, which is clearly not a good idea. In such cases, we should use a correlation matrix so that the variance of both variables is scaled to be 1, i.e., they will carry equal weight in PCA.

    16. Covariance or Correlation Matrix? What would happen if we have all three types of species in our data? This situation will almost certainly happen when you sample a lot of species. I recommend the use of a correlation matrix in such cases for the following reason. When you have many variables, it is the correlation structure among variables that matters. Sp2 and Sp3 are positively correlated and they should determine the extraction of principal components. Sp1 is not correlated with either and will have little effect on the extraction of the first few (most important) principal components. SAS uses the correlation matrix by default.
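
To make the weighting argument concrete, the following Python sketch compares the first principal component obtained from a covariance matrix with the one obtained from the corresponding correlation matrix. The variances (100 and 1) and the correlation (0.6) are hypothetical numbers chosen for illustration, and the variable names `var_sp2` and `var_sp3` only echo the species labels used above. The angle formula is specific to 2 x 2 symmetric matrices.

```python
import math

def first_pc_direction(a, b, d):
    """Unit vector along the first principal axis of [[a, b], [b, d]].

    For a 2 x 2 symmetric matrix the major axis makes an angle theta with
    the first coordinate axis, where tan(2 * theta) = 2 * b / (a - d).
    """
    theta = 0.5 * math.atan2(2 * b, a - d)
    return math.cos(theta), math.sin(theta)

# Hypothetical values (assumptions): Sp2 varies much more than Sp3.
var_sp2, var_sp3, r = 100.0, 1.0, 0.6
cov_23 = r * math.sqrt(var_sp2 * var_sp3)

# PC1 weights from the covariance matrix and from the correlation matrix.
w_cov = first_pc_direction(var_sp2, cov_23, var_sp3)
w_cor = first_pc_direction(1.0, r, 1.0)

print("covariance PC1 weights: ", [round(w, 3) for w in w_cov])
print("correlation PC1 weights:", [round(w, 3) for w in w_cor])
```

With these numbers the covariance-based PC1 is almost entirely the high-variance variable, while the correlation-based PC1 gives both variables the same weight.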

    17. The Eigenvalue Problem

    18. Get the Eigenvectors. An eigenvector is a vector (x) that satisfies the following condition: A x = λx. In our case A is a variance-covariance matrix of order 2, and x is a vector specified by x1 and x2.

    19. Get the Eigenvectors. We want to find an eigenvector of unit length, i.e., x1^2 + x2^2 = 1. We therefore have
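
The slide's own numbers are not reproduced in this transcript, so the following is only a generic sketch of the eigenvalue problem for a symmetric matrix of order 2. The entries a, b and d are assumed symbols introduced here for illustration, with b taken to be non-zero.

```latex
% Assumed notation: A = [[a, b], [b, d]] is a generic symmetric 2 x 2 matrix, b != 0.
\[
A x = \lambda x
\;\Longleftrightarrow\;
\begin{cases}
(a - \lambda)\,x_1 + b\,x_2 = 0 \\
b\,x_1 + (d - \lambda)\,x_2 = 0
\end{cases}
\]
% A non-zero solution exists only if the determinant vanishes:
\[
(a - \lambda)(d - \lambda) - b^2 = 0
\quad\Longrightarrow\quad
\lambda_{1,2} = \frac{a + d}{2} \pm \sqrt{\left(\frac{a - d}{2}\right)^2 + b^2}
\]
% For each eigenvalue the first equation gives x_2 = (\lambda - a) x_1 / b;
% the unit-length constraint x_1^2 + x_2^2 = 1 then fixes the scale (up to sign):
\[
x_1 = \frac{b}{\sqrt{b^2 + (\lambda - a)^2}}, \qquad
x_2 = \frac{\lambda - a}{\sqrt{b^2 + (\lambda - a)^2}}
\]
```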

    20. Get the PC Scores

    21. What Are Principal Components? Principal components are a new set of variables, which are linear combinations of the observed ones, with these properties: (1) Because of the decreasing variance property, much of the variance (information in the original set of p variables) tends to be concentrated in the first few PCs. This implies that we can drop the last few PCs without losing much information. PCA is therefore considered a dimension-reduction technique. (2) Because PCs are orthogonal, they can be used instead of the original variables in situations where having orthogonal variables is desirable (e.g., regression).
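
One common way to decide how many of the last PCs can be dropped is to look at the cumulative proportion of variance carried by the leading ones. The Python sketch below assumes a hypothetical set of eigenvalues and an arbitrary 80% threshold; the function names `variance_explained` and `components_needed` are illustrative and do not come from the slides or from SAS.

```python
from itertools import accumulate

def variance_explained(eigenvalues):
    """Return (proportion, cumulative proportion) per PC, largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    proportions = [lam / total for lam in ordered]
    return list(zip(proportions, accumulate(proportions)))

def components_needed(eigenvalues, threshold=0.8):
    """Smallest number of leading PCs whose cumulative proportion reaches threshold."""
    for k, (_, cumulative) in enumerate(variance_explained(eigenvalues), start=1):
        if cumulative >= threshold:
            return k
    return len(eigenvalues)  # guard against floating-point shortfall

# Hypothetical eigenvalues of a 4-variable correlation matrix (they sum to p = 4).
eigenvalues = [2.5, 1.0, 0.4, 0.1]

for i, (prop, cum) in enumerate(variance_explained(eigenvalues), start=1):
    print(f"PC{i}: proportion = {prop:.3f}, cumulative = {cum:.3f}")
print("PCs needed for 80% of the variance:", components_needed(eigenvalues))
```

The threshold is a judgment call rather than a rule; the point of the sketch is only that the decreasing-variance property lets a few leading PCs stand in for all p variables.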

    22. Index of hidden variables. The ranking of Asian universities by the Asian Week: HKU is ranked second in financial resources, but seventh in academic research. How did HKU get ranked third? Is there a more objective way of ranking? An illustrative example:

    23. A Simple Data Set. School 5 is clearly the best school. School 1 is clearly the worst school.

    24. Graphic PCA

    25. Crime Data in 50 States

    28. Correlation Matrix

    29. Eigenvalues

    30. Eigenvectors. Do these eigenvectors mean anything? All crimes are positively correlated with the first eigenvector, which is therefore interpreted as a measure of overall crime rate. The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE. It is interpreted as measuring the preponderance of property crime over violent crime...

    31. PC Plot: Crime Data