
Presentation Transcript


  1. Bioinformatics: Other data reduction techniques. Kristel Van Steen, PhD, ScD (kristel.vansteen@ulg.ac.be), Université de Liege - Institut Montefiore, 2008-2009

  2. Acknowledgements. Material based on: work from Pradeep Mummidi; class notes from Christine Steinhoff

  3. Outline • Intuition behind PCA • Theory behind PCA • Applications of PCA • Extensions of PCA • Multidimensional scaling MDS (not to be confused with MDR)

  4. Intuition behind PCA

  5. Introduction. Most scientific or industrial data are multivariate (and often huge in size). Is all of the data useful? If not, how do we quickly extract only the useful information?

  6. Problem • When we use traditional techniques, it is not easy to extract useful information from multivariate data: • 1) Many bivariate plots are needed • 2) Bivariate plots, however, mainly represent correlations between variables (not samples).

  7. Visualization Problem • Not easy to visualize multivariate data • - 1D: dot • - 2D: bivariate plot (i.e. X-Y plane) • - 3D: X-Y-Z plot • - 4D: ternary plot with a color code / tetrahedron • - 5D, 6D, etc.: ???

  8. Visualization? As the number of variables increases, the data space becomes harder to visualize.

  9. Basics of PCA • PCA is useful when we need to extract useful information from multivariate data sets. • The technique is based on reducing the dimensionality of the data. • As a result, trends in multivariate data are easily visualized.

  10. Variable Reduction Procedure • Principal component analysis is a variable reduction procedure. It is useful when you have obtained data on a number of variables (possibly a large number of variables), and believe that there is some redundancy in those variables • Redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. • Because of this redundancy, you believe that it should be possible to reduce the observed variables into a smaller number of principal components (artificial variables) that will account for most of the variance in the observed variables.

  11. What is a Principal Component? • A principal component can be defined as a linear combination of optimally-weighted observed variables. • This definition is based on how subject scores on a principal component are computed.

  12. A 7-item measure of job satisfaction (observed variables X1 to X7)

  13. General Formula • Below is the general form for the formula to compute scores on the first component extracted (created) in a principal component analysis: C1 = b11(X1) + b12(X2) + ... + b1p(Xp) where • C1 = the subject’s score on principal component 1 (the first component extracted) • b1p = the regression coefficient (or weight) for observed variable p, as used in creating principal component 1 • Xp = the subject’s score on observed variable p.

  14. For example, assume that component 1 in the present study was the “satisfaction with supervision” component. You could determine each subject’s score on principal component 1 by using the following fictitious formula: • C1 = .44 (X1) + .40 (X2) + .47 (X3) + .32 (X4)+ .02 (X5) + .01 (X6) + .03 (X7)

  15. Obviously, a different equation, with different regression weights, would be used to compute subject scores on component 2 (the satisfaction with pay component). Below is a fictitious illustration of this formula: • C2 = .01 (X1) + .04 (X2) + .02 (X3) + .02 (X4)+ .48 (X5) + .31 (X6) + .39 (X7)
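
As a quick illustrative sketch (in Python with numpy; only the weights come from the fictitious formulas above, the subject's item scores x below are hypothetical), the two component scores are just weighted sums:

    import numpy as np

    # Fictitious weights from the slides for the 7 observed variables X1..X7
    b1 = np.array([0.44, 0.40, 0.47, 0.32, 0.02, 0.01, 0.03])  # component 1: "satisfaction with supervision"
    b2 = np.array([0.01, 0.04, 0.02, 0.02, 0.48, 0.31, 0.39])  # component 2: "satisfaction with pay"

    x = np.array([5.0, 4.0, 5.0, 4.0, 2.0, 3.0, 2.0])  # hypothetical subject scores on X1..X7

    C1 = b1 @ x  # subject's score on principal component 1
    C2 = b2 @ x  # subject's score on principal component 2
    print(C1, C2)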

  16. Number of Components Extracted • If a principal component analysis were performed on data from the 7-item job satisfaction questionnaire, it might seem that only two components would be created. However, such an impression would not be entirely correct. • In reality, the number of components extracted in a principal component analysis is equal to the number of observed variables being analyzed. • However, in most analyses, only the first few components account for meaningful amounts of variance, so only these first few components are retained, interpreted, and used in subsequent analyses (such as multiple regression analyses).
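
As a minimal sketch of this point (in Python with numpy, on simulated data that merely mimics a two-factor, 7-item questionnaire; this is not the data behind the slides), the covariance matrix of 7 observed variables always yields 7 eigenvalues, i.e. 7 components, but only the first few account for most of the variance:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    supervision = rng.normal(size=n)   # latent "satisfaction with supervision" factor
    pay = rng.normal(size=n)           # latent "satisfaction with pay" factor
    # Items 1-4 load on the first factor, items 5-7 on the second (plus noise)
    X = np.column_stack([supervision + 0.3 * rng.normal(size=n) for _ in range(4)]
                        + [pay + 0.3 * rng.normal(size=n) for _ in range(3)])

    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # sorted largest first
    print(len(eigvals))             # 7: one component per observed variable
    print(eigvals / eigvals.sum())  # the first two components dominate the variance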

  17. Characteristics of principal components • The first component extracted in a principal component analysis accounts for a maximal amount of total variance in the observed variables. • Under typical conditions, this means that the first component will be correlated with at least some of the observed variables. It may be correlated with many. • The second component extracted will have two important characteristics. First, this component will account for a maximal amount of variance in the data set that was not accounted for by the first component.

  18. Under typical conditions, this means that the second component will be correlated with some of the observed variables that did not display strong correlations with component 1. • The second characteristic of the second component is that it will be uncorrelated with the first component. Literally, if you were to compute the correlation between components 1 and 2, that correlation would be zero. • The remaining components that are extracted in the analysis display the same two characteristics: each component accounts for a maximal amount of variance in the observed variables that was not accounted for by the preceding components, and is uncorrelated with all of the preceding components.

  19. Generalization • A principal component analysis proceeds in this fashion, with each new component accounting for progressively smaller and smaller amounts of variance (this is why only the first few components are usually retained and interpreted). • When the analysis is complete, the resulting components will display varying degrees of correlation with the observed variables, but are completely uncorrelated with one another.

  20. References • http://support.sas.com/publishing/pubcat/chaps/55129.pdf • http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf • http://www.cis.hut.fi/jhollmen/dippa/node30.html

  21. Theory behind PCA

  22. Theory behind PCA Linear Algebra

  23. OUTLINE What do we need from "linear algebra" for understanding principal component analysis? • Standard deviation, Variance, Covariance • The Covariance matrix • Symmetric matrix and orthogonality • Eigenvalues and Eigenvectors • Properties

  24. Motivation

  25. Motivation. [Scatterplot: Proteins 1 and 2 measured for 200 patients; axes: Protein 1 vs Protein 2]

  26. Motivation. [Microarray experiment: data matrix of 22,000 genes by 200 patients] How do we visualize it? Which genes are important? For which subgroup of patients?

  27. Motivation. [Data matrix of genes 1-200 by patients 1-10]

  28. Basics for Principal Component Analysis • Orthogonal/Orthonormal • Some Theorems... • Standard deviation, Variance, Covariance • The Covariance matrix • Eigenvalues and Eigenvectors

  29. Standard Deviation • The standard deviation is (informally) the average distance from the mean of the data set to a point • Mean: mean = (x1 + x2 + ... + xn) / n • SD: s = sqrt( ((x1 - mean)^2 + ... + (xn - mean)^2) / (n - 1) ) • Example: Measurement 1: 0, 8, 12, 20; Measurement 2: 8, 9, 11, 12 • M1: Mean 10, SD 8.33 • M2: Mean 10, SD 1.83

  30. Variance • The variance is the square of the standard deviation: var = s^2 = ((x1 - mean)^2 + ... + (xn - mean)^2) / (n - 1) • Example: Measurement 1: 0, 8, 12, 20; Measurement 2: 8, 9, 11, 12 • M1: Mean 10, SD 8.33, Var 69.33 • M2: Mean 10, SD 1.83, Var 3.33
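
As a quick check of these example numbers, a minimal Python sketch (assuming numpy; ddof=1 gives the n - 1 denominator used above):

    import numpy as np

    m1 = np.array([0, 8, 12, 20])
    m2 = np.array([8, 9, 11, 12])

    print(m1.mean(), m1.std(ddof=1), m1.var(ddof=1))  # 10.0  8.33  69.33
    print(m2.mean(), m2.std(ddof=1), m2.var(ddof=1))  # 10.0  1.83  3.33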

  31. Covariance • Standard deviation and variance are 1-dimensional • How much do the dimensions vary from the mean with respect to each other? • Covariance measures this between 2 dimensions: cov(X, Y) = ((x1 - meanX)(y1 - meanY) + ... + (xn - meanX)(yn - meanY)) / (n - 1) • We easily see that if X = Y we end up with the variance

  32. Covariance Matrix • Let X be a random vector. Then the covariance matrix of X, denoted by Cov(X), is the matrix whose (i, j) entry is cov(Xi, Xj) • The diagonals of Cov(X) are the variances var(Xi) • In matrix notation, Cov(X) = E[ (X - E[X]) (X - E[X])^T ] • The covariance matrix is symmetric
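
A minimal numpy sketch (on arbitrary illustrative data) confirming the two properties just stated, variances on the diagonal and symmetry:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))  # 200 observations of a 3-dimensional random vector

    C = np.cov(X, rowvar=False)    # sample covariance matrix Cov(X)
    print(np.allclose(np.diag(C), X.var(axis=0, ddof=1)))  # diagonal entries are the variances
    print(np.allclose(C, C.T))                              # the covariance matrix is symmetric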

  33. Symmetric Matrix Let A be a square matrix of size n x n. The matrix A is symmetric if Aij = Aji for all i, j (equivalently, A = A^T).

  34. Orthogonality/Orthonormality <v1, v2> = <(1 0), (0 1)> = 0 Two vectors v1 and v2 for which <v1, v2> = 0 holds are said to be orthogonal. Unit vectors which are orthogonal are said to be orthonormal.
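
The same check as a minimal Python sketch (assuming numpy), using the unit vectors from the slide:

    import numpy as np

    v1 = np.array([1.0, 0.0])
    v2 = np.array([0.0, 1.0])

    print(np.dot(v1, v2) == 0)                                  # <v1, v2> = 0, so v1 and v2 are orthogonal
    print(np.linalg.norm(v1) == 1 and np.linalg.norm(v2) == 1)  # both have unit length, so they are orthonormal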

  35. Eigenvalues/Eigenvectors Let A be an n x n square matrix and x an n x 1 column vector. Then a (right) eigenvector of A is a nonzero vector x such that A x = lambda x for some scalar lambda (the eigenvalue). Procedure: find the eigenvalues lambda by solving det(A - lambda I) = 0, then find the corresponding eigenvectors. R: eigen(matrix); Matlab: eig(matrix)
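
A minimal Python sketch (numpy's counterpart of the R/Matlab calls mentioned above; the matrix A is an arbitrary small example) verifying the defining relation A x = lambda x:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])  # small symmetric example matrix

    eigvals, eigvecs = np.linalg.eig(A)     # like R's eigen(matrix) or Matlab's eig(matrix)
    for lam, v in zip(eigvals, eigvecs.T):  # eigenvectors are the columns of eigvecs
        print(np.allclose(A @ v, lam * v))  # A x = lambda x holds for each eigenpair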

  36. Some Remarks If A and B are matrices whose sizes are such that the given operations are defined, and c is any scalar, then:

  37. Now we have enough definitions to go into the procedure for performing Principal Component Analysis

  38. Theory behind PCA Linear algebra applied

  39. OUTLINE What is principal component analysis good for? Principal Component Analysis: PCA • The basic idea of Principal Component Analysis • The idea of transformation • How to get there? The mathematics part • Some remarks • Basic algorithmic procedure

  40. Idea of PCA • Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data in terms of a set of uncorrelated variables • We typically have a data matrix of n observations on p correlated variables x1,x2,…xp • PCA looks for a transformation of the xi into p new variables yi that are uncorrelated

  41. Idea [Data matrix X (p x n): genes x1, …, xp measured for patients 1, …, n] The dimension is high, so how can we reduce it? Simplest way: take the first one, two, or three variables, plot them, and discard the rest. Obviously a very bad idea.

  42. Transformation We want to find a transformation that involves ALL columns, not only the first ones. So find a new basis and order it such that almost ALL the information of the whole dataset lies in the first component. We are looking for a transformation of the data matrix X (p x n) such that Y = a^T X = a1 X1 + a2 X2 + ... + ap Xp

  43. Transformation What is a reasonable choice for the weights a? Remember: we wanted a transformation that maximizes "information", i.e. one that captures the variance in the data. Maximize the variance of the projection of the observations on the Y variables! Find a such that Var(a^T X) is maximal. The matrix C = Var(X) is the covariance matrix of the Xi variables.
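
Putting the last few slides together, here is a minimal Python sketch (with numpy, on illustrative correlated data, not the slides' data) of the usual eigendecomposition route to this variance-maximizing transformation:

    import numpy as np

    rng = np.random.default_rng(0)
    # Illustrative data: n = 200 observations of p = 5 correlated variables
    latent = rng.normal(size=(200, 2))
    X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

    Xc = X - X.mean(axis=0)               # centre each variable
    C = np.cov(Xc, rowvar=False)          # covariance matrix of the Xi variables
    eigvals, eigvecs = np.linalg.eigh(C)  # eigendecomposition of the symmetric matrix C
    order = np.argsort(eigvals)[::-1]     # order components by variance, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    Y = Xc @ eigvecs                      # principal component scores (projection on the new basis)
    print(eigvals / eigvals.sum())        # proportion of variance captured by each component
    print(np.allclose(np.cov(Y, rowvar=False), np.diag(eigvals)))  # the components are uncorrelated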
