
Principal Component Analysis



Presentation Transcript


  1. Principal Component Analysis Biosystems Data Analysis UNIVERSITY OF AMSTERDAM

  2. From molecule to networks [Figures: protein network of SRD5A2; yeast metabolic network of glycolysis]

  3. Disease gene network

  4. Biological data [Figure: data matrix with samples as rows and genes, proteins or metabolites as columns] Sources of variation: • Repeatability • Reproducibility • Biological variability

  5. How to explore such networks [Figure: data matrix of samples × genes/proteins/metabolites, and the resulting correlation matrix between the genes, proteins or metabolites] • Results are specific for the selected samples/situation.

  6. Goals • If you measure multiple variables on an object, it can be important to analyze the measurements simultaneously. • Understand the most important tool in multivariate data analysis: Principal Component Analysis.

  7. Multiple measurements • If there is a mutual relationship between two or more measurements, they are correlated. • Correlations can be weak (e.g. capabilities in sports vs. month of birth) or strong (e.g. the mass of an object vs. its weight at the Earth's surface).

  8. Correlation • Correlation occurs everywhere! • Example: mean height vs. age for a group of young children. • A strong linear relationship between height and age is seen: for young children, height and age are correlated. Moore, D.S. and McCabe, G.P., Introduction to the Practice of Statistics (1989).

  9. Correlation in spectroscopy • Example: a pure compound is measured at two wavelengths (230 nm and 265 nm) over a range of concentrations. [Figure: absorbance spectra, 200–300 nm]

  Conc. (mMol)   Intensity at 230 nm   Intensity at 265 nm
   5             0.166                 0.090
  10             0.332                 0.181
  15             0.498                 0.270
  20             0.664                 0.362
  25             0.831                 0.453

  10. Correlation in spectroscopy • The intensities at 230 nm and 265 nm are highly correlated. [Figure: absorbance at 265 nm vs. absorbance at 230 nm, increasing with concentration] • The data is not two-dimensional, but one-dimensional. • There is only one factor underlying the data: concentration.
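The one-factor structure can be checked directly from the numbers in the table above. A minimal NumPy sketch (not part of the original slides), using the tabulated absorbances:

```python
import numpy as np

# Absorbance data from the slide: one pure compound at five concentrations
conc = np.array([5, 10, 15, 20, 25])            # mMol
a230 = np.array([0.166, 0.332, 0.498, 0.664, 0.831])
a265 = np.array([0.090, 0.181, 0.270, 0.362, 0.453])

# The two wavelength channels are almost perfectly correlated ...
r = np.corrcoef(a230, a265)[0, 1]
print(f"correlation: {r:.5f}")

# ... so the 5 x 2 data matrix is effectively rank one:
X = np.column_stack([a230, a265])
s = np.linalg.svd(X, compute_uv=False)
print("singular values:", s)   # the second value is tiny compared with the first
```

The near-zero second singular value is exactly the "one underlying factor" statement above, expressed numerically.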

  11. The data matrix • Information often comes in the form of a matrix: objects × variables. • For example, • Spectroscopy: sample × wavelength • Proteomics: patient × protein

  12. Large amounts of data • In (bio)chemical analysis, the measured data matrices can be very large. • An infrared spectrum measured for 50 samples gives a data matrix of size 50 × 800 = 40,000 numbers! • The metabolome of 100 patients yields a data matrix of size 100 × 1000 = 100,000 numbers. • We need a way of extracting the important information from large data matrices.

  13. Principal Component Analysis • Data reduction: PCA reduces large data matrices into two smaller matrices which can be more easily examined, plotted and interpreted. • Data exploration: PCA extracts the most important factors (principal components) from the data. These factors describe multivariate interactions between the measured variables. • Data understanding: principal components can be used to classify samples, identify compound spectra, determine biomarkers, etc.

  14. Different views of PCA • Statistically, PCA is a multivariate analysis technique closely related to • eigenvector analysis • singular value decomposition (SVD) • In matrix terms, PCA is a decomposition of X into two smaller matrices plus a set of residuals: X = TPᵀ + E • Geometrically, PCA is a projection technique in which X is projected onto a subspace of reduced dimensions.

  15. PCA: mathematics • The basic equation for PCA is written as X = TPᵀ + E, where X (I × J) is the data matrix, T (I × R) are the scores, P (J × R) are the loadings and E (I × J) are the residuals. R is the number of principal components used to describe X.
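The decomposition X = TPᵀ + E can be computed from the singular value decomposition mentioned on the previous slide. A sketch on simulated data (the slides give no code; the toy matrix and the choice of R = 2 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: I = 20 objects, J = 5 variables, built from R = 2 underlying
# factors plus a little noise
I, J, R = 20, 5, 2
X = rng.standard_normal((I, R)) @ rng.standard_normal((R, J)) \
    + 0.01 * rng.standard_normal((I, J))
X = X - X.mean(axis=0)                 # mean-centre before PCA

# PCA via the SVD: X = U S Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :R] * S[:R]                   # scores   (I x R)
P = Vt[:R].T                           # loadings (J x R)
E = X - T @ P.T                        # residuals (I x J)

# With R matching the true number of factors, E contains only noise
print("residual fraction:", np.linalg.norm(E)**2 / np.linalg.norm(X)**2)
```

Note the shapes: the I × J matrix X is replaced by the much smaller T (I × R) and P (J × R), which is the "data reduction" view of PCA.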

  16. Principal components • Principal components describe maximum variance and are calculated in order of importance, e.g.:

  Principal comp.   % X explained   Total % X explained
  1                 45.6            45.6
  2                 23.9            69.5
  3                 18.1            87.6
  4                  1.3            88.9
  ...and so on, up to 100%.

  • A principal component is defined by one pair of loadings and scores, sometimes also known as a latent variable.

  17. PCA: matrices [Diagram: X = TPᵀ + E, i.e. X is written as a sum of rank-one principal components, each the product of a score vector and a loading vector, plus the residuals E]

  18. Scores and loadings • Scores • relationships between objects • orthogonal: TᵀT = diagonal matrix • Loadings • relationships between variables • orthonormal: PᵀP = identity matrix, I • Similarities and differences between objects (or variables) can be seen by plotting scores (or loadings) against each other.
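These orthogonality properties are easy to verify numerically. A small NumPy sketch (illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 6))
X -= X.mean(axis=0)                    # mean-centre

U, S, Vt = np.linalg.svd(X, full_matrices=False)
T = U * S                              # all scores
P = Vt.T                               # all loadings

# Scores are orthogonal: T'T is a diagonal matrix
TtT = T.T @ T
assert np.allclose(TtT, np.diag(np.diag(TtT)))

# Loadings are orthonormal: P'P is the identity matrix
assert np.allclose(P.T @ P, np.eye(P.shape[1]))
```

The diagonal of TᵀT holds the (scaled) eigenvalues, which is why later slides can read the variance per component straight off the scores.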

  19. Numbers example

  20. PCA: simple projection • Simplest case: two correlated variables. [Figure: data cloud with PC1 and PC2 axes, and the corresponding scores plot] • PC1 describes 99.77% of the total variation in X. • PC2 describes the residual variation (0.23%).

  21. PCA: projections • PCA is a projection technique. • Each row of the data matrix X (I × J) can be considered as a point in J-dimensional space. This data is projected orthogonally onto a subspace of lower dimensionality. • In the previous example, we projected the two-dimensional data onto a one-dimensional space, i.e. onto a line. • Now we will project some J-dimensional data onto a two-dimensional space, i.e. onto a plane.

  22. [Diagram: J-dimensional data points projected orthogonally onto a two-dimensional plane]

  23. Example: protein data • Protein consumption across Europe was studied. • 9 variables describe different sources of protein. • 25 objects are the different countries. • The data matrix has dimensions 25 × 9. • Which countries are similar? • Which foods are related to red meat consumption? Weber, A., Agrarpolitik im Spannungsfeld der internationalen Ernaehrungspolitik, Institut fuer Agrarpolitik und Marktlehre, Kiel (1973).

  24. [Table: the 25 × 9 protein consumption data]

  25. PCA on the protein data • The data is mean-centred and each variable is scaled to unit variance. Then a PCA is performed.

  Percent Variance Captured by PCA Model
  Principal     Eigenvalue    % Variance    % Variance
  Component     of Cov(X)     This PC       Total
  ---------     ----------    ----------    ----------
      1         4.01e+000     44.52          44.52
      2         1.63e+000     18.17          62.68
      3         1.13e+000     12.53          75.22
      4         9.55e-001     10.61          85.82
      5         4.64e-001      5.15          90.98
      6         3.25e-001      3.61          94.59
      7         2.72e-001      3.02          97.61
      8         1.16e-001      1.29          98.90
      9         9.91e-002      1.10         100.00

  How many principal components do you want to keep? 4
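The autoscaling-plus-PCA step and the variance-captured table can be reproduced in a few lines. The protein data itself is not included in the transcript, so the sketch below uses a random 25 × 9 matrix purely as a stand-in:

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for the 25 countries x 9 protein variables, on very different scales
X = rng.standard_normal((25, 9)) * rng.uniform(1, 100, size=9)

# Autoscale: mean-centre, then scale each column to unit variance
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigenvalues of Cov(Xs) are the squared singular values divided by (I - 1)
S = np.linalg.svd(Xs, compute_uv=False)
eig = S**2 / (Xs.shape[0] - 1)
pct = 100 * eig / eig.sum()

print("PC  eigenvalue  %var   %cum")
for r, (e, p, c) in enumerate(zip(eig, pct, np.cumsum(pct)), start=1):
    print(f"{r:2d}  {e:9.3f}  {p:5.2f}  {c:6.2f}")
```

After autoscaling, each variable contributes unit variance, so the eigenvalues sum to the number of variables (9 here), matching the structure of the table above.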

  26. Scores: PC1 vs PC2 [Figure: scores plot, PC1 (44.52%) vs PC2 (18.17%), showing the 25 countries; Albania, Bulgaria and Romania score high on PC2, while Spain and especially Portugal score low]

  27. Loadings [Figure: PC1 and PC2 loadings for the nine variables: red meat, white meat, eggs, milk, fish, cereals, starch, beans/nuts/oil, fruit & veg]

  28. Biplot: PC1 vs PC2 [Figure: biplot showing the countries and the nine protein variables together] • SE Europeans eat cereal crops. • PC2 primarily says that the Spanish and Portuguese especially like fruit, vegetables, fish and oils.

  29. Biplot: PC1 vs PC3 [Figure: biplot showing the countries and the nine protein variables together] • The Dutch like 'patat'... with mayonnaise!? • Red meat and milk are correlated. • Scandinavians eat fish!

  30. Residuals • It is also important to look at the model residuals, E. • Ideally, the residuals will not contain any structure, just unsystematic variation (noise).

  31. Residuals • The (squared) model residuals can be summed along the object or variable direction. [Figure: summed residuals per object; country 23 (USSR) fits the model least well]
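Summing the squared residuals along each direction is a one-liner per direction. A sketch (the 25 × 9 shape mirrors the protein example, but the data here is simulated; the choice of 4 retained PCs follows the earlier slide):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((25, 9))
X -= X.mean(axis=0)

R = 4                                     # number of retained PCs
U, S, Vt = np.linalg.svd(X, full_matrices=False)
E = X - (U[:, :R] * S[:R]) @ Vt[:R]       # residual matrix for an R-PC model

q_obj = (E**2).sum(axis=1)   # squared residual per object (row)
q_var = (E**2).sum(axis=0)   # squared residual per variable (column)

# The object with the largest residual fits the model least well
print("worst-fitting object:", q_obj.argmax())
```

Both sums account for the same total residual variance, just attributed to objects in one case and to variables in the other.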

  32. Centering and scaling • We are often interested in the differences between objects, not in their absolute values. • Protein data: differences between countries. • If different variables are measured in different units, some scaling is needed to give each variable an equal chance of contributing to the model.

  33. Mean-centering • Subtract the mean from each column of X. [Figure: column means before centering (e.g. 6.525, 36.75, 10840) all become 0.0 afterwards]

  34. Scaling • Divide each column of X by its standard deviation. [Figure: column standard deviations before scaling (e.g. 0.171, 1.139, 704.8) all become 1.0 afterwards]

  35. How many PC’s to use? • In X = TPᵀ + E, the model TPᵀ should capture the systematic variation and E the noise. • Too few PC’s: • some systematic variation is not described; • the model does not fully summarise the data. • Too many PC’s: • the later PC’s describe noise; • the model is not robust when applied to new data. • How to select the correct number of PC’s?

  36. How many PC’s to use? • Eigenvalue plots: look for a ‘knee’ (here, select 4 PC’s). • Select components where the explained % variance is above the noise level. • Look at PC scores and loadings: do they make sense?! Do the residuals have structure? • Cross-validation.

  37. Cross-validation • Remove a subset of the data: the test set. • Build the model on the remaining data: the training set. • Project the test set onto the model and calculate the residuals. • Repeat for the next test set, then calculate PRESS (the prediction error sum of squares). • Repeat for R = 1, 2, 3, ...
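The steps above can be sketched as row-wise cross-validation. Note that leave-out schemes for PCA differ across the literature; this simplified version (an assumption, not necessarily the slides' exact algorithm) fits loadings on the training rows and projects the held-out rows:

```python
import numpy as np

def press_curve(X, max_R, n_folds=5):
    """Row-wise cross-validation for PCA: for each fold, fit loadings on the
    training rows, project the left-out rows, and accumulate the squared
    residuals into PRESS(R) for R = 1..max_R."""
    I = X.shape[0]
    folds = np.array_split(np.random.default_rng(4).permutation(I), n_folds)
    press = np.zeros(max_R)
    for test in folds:
        train = np.setdiff1d(np.arange(I), test)
        mu = X[train].mean(axis=0)
        _, _, Vt = np.linalg.svd(X[train] - mu, full_matrices=False)
        Xt = X[test] - mu
        for R in range(1, max_R + 1):
            P = Vt[:R].T
            E = Xt - (Xt @ P) @ P.T        # project test rows, take residual
            press[R - 1] += (E**2).sum()
    return press

rng = np.random.default_rng(5)
# Data with 3 real factors plus noise: PRESS should level off around R = 3
X = rng.standard_normal((60, 3)) @ rng.standard_normal((3, 10)) \
    + 0.1 * rng.standard_normal((60, 10))
press = press_curve(X, max_R=6)
print(np.round(press, 2))
```

Because the held-out rows never influence the loadings, PRESS penalises components that only describe noise, which is what makes it useful for choosing R.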

  38. PRESS plot [Figure: eigenvalues of Cov(X) and PRESS(r) vs. number of latent variables; first minimum at 2 PC’s, overall minimum at 4 PC’s, and 8 PC’s gives a very high CV error]

  39. Outliers • Outliers are objects which are very different from the rest of the data. They can have a large effect on the principal component model and should be removed. [Figure: a bad experiment appears as an outlier; remove it]

  40. Outliers • Outliers can also be found in the model space or in the residuals. [Figure: scores plot, PC1 vs PC2, with an outlying object]

  41. Model extrapolation can be dangerous! [Figure: the linear height vs. age model was valid for this age range... but is not valid for 30-year-olds!]

  42. Conclusions • Principal component analysis (PCA) reduces large, collinear matrices into two smaller matrices, scores and loadings: X = TPᵀ + E. • Principal components • describe the important variation in the data; • are calculated in order of importance; • are orthogonal.

  43. Conclusions • Scores plots and biplots can be useful for exploring and understanding the data. • It is often appropriate to mean-center and scale the variables prior to analysis. • It is important to include the correct number of PC’s in the PCA model. One method for determining this is cross-validation.
