
Multivariate Statistical Analysis Methods


Presentation Transcript


  1. Multivariate Statistical Analysis Methods Ahmed Rebaï Centre of Biotechnology of Sfax ahmed.rebai@cbs.rnrt.tn

  2. Basic statistical concepts and tools

  3. Statistics • Statistics is concerned with the ‘optimal’ methods of analyzing data generated from some chance mechanism (random phenomena). • ‘Optimal’ means an appropriate choice of what is to be computed from the data in order to carry out the statistical analysis

  4. Random variables • A random variable is a numerical quantity that, in some experiment involving a degree of randomness, takes one value from a set of possible values • The probability distribution is the set of values that this random variable can take, together with their associated probabilities

  5. The Normal distribution • Proposed by Gauss (1777-1855) for the distribution of errors in astronomical observations (error function) • Arises in many biological processes • Limiting distribution of sums (or averages) of random variables for a large number of observations • Whenever a natural phenomenon is the result of many contributing factors, each having a small contribution, its distribution is approximately Normal
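The last point (the central limit theorem at work) can be illustrated with a short simulation, not part of the original slides: each observation is a sum of many small independent contributions, and the resulting distribution is approximately Normal.

```python
import random
import statistics

random.seed(42)

# Each observation is the sum of 100 small, independent contributions,
# each uniform on [-0.5, 0.5]. By the central limit theorem the sums are
# approximately Normal with mean 0 and variance 100 * (1/12).
def noisy_measurement(n_factors=100):
    return sum(random.uniform(-0.5, 0.5) for _ in range(n_factors))

samples = [noisy_measurement() for _ in range(5000)]

mean = statistics.fmean(samples)
std = statistics.stdev(samples)
print(round(mean, 2), round(std, 2))  # mean near 0, std near sqrt(100/12) = 2.89
```

A histogram of `samples` would show the familiar bell shape, just like the Quincunx on the next slide.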

  6. The Quincunx: a bell-shaped distribution

  7. Distribution function • The distribution function is defined as F(x) = Pr(X ≤ x) • F is called the cumulative distribution function (cdf) and f the probability density function (pdf) of X • μ and σ² are respectively the mean and the variance of the distribution

  8. Moments of a distribution • The kth moment is defined as μ′k = E(Xᵏ) • The first moment is the mean μ = E(X) • The kth moment about the mean is μk = E[(X − μ)ᵏ] • The second moment about the mean, μ2 = E[(X − μ)²], is called the variance σ²

  9. Kurtosis: a useful function of the moments • Kurtosis: κ4 = μ4 − 3μ2² • κ4 = 0 for a Normal distribution, so it is a measure of Normality
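The central moments and the kurtosis κ4 = μ4 − 3μ2² can be computed directly from a sample; this is an illustrative sketch, not from the slides, checking that κ4 is near 0 for Normal data.

```python
import random

random.seed(0)

def central_moment(xs, k):
    # kth sample moment about the mean: average of (x - mean)^k
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

xs = [random.gauss(0, 1) for _ in range(100000)]

mu2 = central_moment(xs, 2)       # second central moment: the variance
mu4 = central_moment(xs, 4)       # fourth central moment
kurtosis = mu4 - 3 * mu2 ** 2     # kappa_4, close to 0 for Normal data
print(round(kurtosis, 1))
```

For a standard Normal, μ4 = 3 and μ2 = 1, so κ4 = 3 − 3·1² = 0 exactly; the sample value fluctuates slightly around 0.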

  10. Observations • Observations xi are realizations of a random variable X • The pdf of X can be visualized by a histogram: a graphic showing the frequency of observations in classes

  11. Estimating moments • The mean of X is estimated from a set of n observations (x1, x2, ..., xn) as x̄ = (1/n) Σ xi • The variance is estimated by Var(X) = (1/(n − 1)) Σ (xi − x̄)²
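These two estimators are a few lines of code; here is a minimal sketch (the data values are made up for illustration).

```python
def sample_mean(xs):
    # estimate of the mean: (1/n) * sum of the observations
    return sum(xs) / len(xs)

def sample_variance(xs):
    # unbiased estimate of the variance: divide by n - 1, not n
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_mean(xs), sample_variance(xs))  # 5.0 and 32/7 = 4.571...
```

Dividing by n − 1 rather than n corrects the small downward bias caused by estimating the mean from the same sample.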

  12. The fundamentals of statistics • Drawing conclusions about a population on the basis of a set of measurements or observations on a sample from that population • Descriptive: draw conclusions from summary measures and graphics (data driven) • Inferential: test hypotheses we had in mind before collecting the data (hypothesis driven)

  13. What about having many variables? • Let X = (X1, X2, ..., Xp) be a set of p variables • What is the marginal distribution of each variable Xi, and what is their joint distribution? • If f(X1, X2, ..., Xp) is the joint pdf, then the marginal pdf of Xi is obtained by integrating the joint pdf over all the other variables: f(Xi) = ∫...∫ f(X1, ..., Xp) dX1...dXi−1 dXi+1...dXp

  14. Independence • Variables are said to be independent if f(X1, X2, ..., Xp) = f(X1) · f(X2) · ... · f(Xp)

  15. Covariance and correlation • Covariance is the joint first moment of two variables about their means: Cov(X,Y) = E[(X − μX)(Y − μY)] = E(XY) − E(X)E(Y) • Correlation: a standardized covariance, ρ = Cov(X,Y) / (σX σY) • ρ is a number between −1 and +1
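Both quantities follow directly from their definitions; this is a small sketch with made-up data, where Y is an exact linear function of X so that ρ comes out as 1.

```python
import math

def covariance(xs, ys):
    # Cov(X, Y) = average of (x - mean_x) * (y - mean_y)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    # standardized covariance: rho = Cov(X, Y) / (sd_X * sd_Y)
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # ys = 2 * xs: perfect positive correlation
print(correlation(xs, ys))  # 1.0 (up to rounding)
```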

  16. For example: a bivariate Normal • Two variables X and Y have a bivariate Normal distribution if their joint pdf is f(x,y) = 1/(2π σX σY √(1 − ρ²)) exp{ −1/(2(1 − ρ²)) [ (x − μX)²/σX² − 2ρ(x − μX)(y − μY)/(σX σY) + (y − μY)²/σY² ] } • ρ is the correlation between X and Y

  17. Uncorrelatedness and independence • If ρ = 0 (Cov(X,Y) = 0) we say that the variables are uncorrelated • Two uncorrelated variables are also independent if their joint distribution is bivariate Normal • Two independent variables are necessarily uncorrelated, but the converse is not true in general

  18. Bivariate Normal • If ρ = 0 then the cross term in the exponent vanishes, and the exponent separates into a term in x and a term in y • So f(x,y) = f(x) · f(y) • The two variables are thus independent

  19. Many variables • We can compute the covariance (or correlation) matrix of (X1, X2, ..., Xp) • C = Var(X) is the matrix whose (i, j) entry is Cov(Xi, Xj) • A square (p×p) and symmetric matrix
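A covariance matrix can be computed with NumPy in one call; this is an illustrative sketch (the simulated variables are my own, not from the slides), with the third variable built from the first so the off-diagonal covariance is clearly nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 observations of p = 3 variables; x2 depends on x0.
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = 2 * x0 + rng.normal(size=200)
X = np.column_stack([x0, x1, x2])  # n x p data matrix

C = np.cov(X, rowvar=False)  # p x p covariance matrix (columns = variables)
print(C.shape)               # (3, 3)
print(np.allclose(C, C.T))   # True: the matrix is symmetric
```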

  20. A Short Excursion into Matrix Algebra

  21. What is a matrix?

  22. Operations on matrices: transpose

  23. Properties

  24. Some important properties

  25. Other particular operations

  26. Eigenvalues and Eigenvectors

  27. Singular value decomposition
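The slides for these two topics were figures; as a minimal NumPy illustration (my own example matrix, not from the deck), here are an eigendecomposition of a symmetric matrix and a singular value decomposition.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigendecomposition of a symmetric matrix: A v = lambda v
eigvals, eigvecs = np.linalg.eigh(A)  # eigh: for symmetric/Hermitian matrices
print(eigvals)                        # [1. 3.]

# Singular value decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: the factorization is exact
```

For a symmetric matrix the singular values coincide with the absolute values of the eigenvalues, which is why SVD underlies principal components analysis later in the deck.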

  28. Multivariate Data

  29. Multivariate Data • Data for which each observation consists of values for more than one variable • For example: each observation is a measure of the expression level of gene i in tissue j • Usually displayed as a data matrix

  30. Biological profile data

  31. The data matrix n observations (rows) for p variables (columns): an n×p matrix

  32. Contingency tables • When observations on two categorical variables are cross-classified • Entries in each cell are the numbers of individuals with the corresponding combination of variable values
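Cross-classifying two categorical variables is simple counting; this is a hypothetical sketch (the tissue/expression data are invented for illustration).

```python
from collections import Counter

# Each observation is a pair of categorical values: (tissue, expressed?)
observations = [
    ("liver", "yes"), ("liver", "yes"), ("liver", "no"),
    ("brain", "yes"), ("brain", "no"), ("brain", "no"), ("brain", "no"),
]

counts = Counter(observations)
rows = sorted({r for r, _ in observations})
cols = sorted({c for _, c in observations})

# Each cell holds the number of individuals with that combination of values
table = [[counts[(r, c)] for c in cols] for r in rows]
print(rows, cols, table)  # ['brain', 'liver'] ['no', 'yes'] [[3, 1], [1, 2]]
```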

  33. Multivariate data analysis

  34. Exploratory Data Analysis • Data analysis that emphasizes the use of informal graphical procedures, not based on prior assumptions about the structure of the data or on formal models for the data • Data = smooth + rough, where the smooth is the underlying regularity or pattern in the data. The objective of EDA is to separate the smooth from the rough with minimal use of formal mathematical or statistical methods

  35. Reduce dimensionality without losing much information

  36. Overview of the techniques • Factor analysis • Principal components analysis • Correspondence analysis • Discriminant analysis • Cluster analysis

  37. Factor analysis A procedure that postulates that the correlations between a set of p observed variables arise from the relationship of these variables to a small number k of underlying, unobservable, latent variables, usually known as common factors where k<p

  38. Principal components analysis • A procedure that transforms a set of variables into new ones that are uncorrelated and account for decreasing proportions of the variance in the data • The new variables, named principal components (PCs), are linear combinations of the original variables

  39. PCA • If the first few PCs account for a large percentage of the variance (say >70%), then we can display the data in a plot that depicts the original observations quite well
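PCA can be carried out with an SVD of the centered data matrix; this is an illustrative sketch with simulated data (my own construction, not from the slides) in which most of the variance lies along one direction, so the first PC dominates.

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 observations in 3 dimensions, mostly varying along one direction.
t = rng.normal(size=100)
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=100),
    0.1 * rng.normal(size=100),
])

Xc = X - X.mean(axis=0)            # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)    # proportion of variance per PC
scores = Xc @ Vt.T                 # observations in PC coordinates

print(explained[0] > 0.95)  # True: PC1 accounts for almost all the variance
```

Plotting the first two columns of `scores` gives exactly the kind of low-dimensional display the slide describes.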

  40. Example

  41. Correspondence Analysis • A method for displaying relationships between categorical variables in a scatter plot • The new factors are combinations of the rows and columns • A small number of these derived coordinate values (usually two) are then used to display the table graphically

  42. Example: analysis of codon usage and gene expression in E. coli (McInerny, 1997) • A gene can be represented by a 59-dimensional vector (universal code) • A genome consists of hundreds (thousands) of these genes • Variation in the variables (RSCU values) might be governed by only a small number of factors • For each gene and each codon i, calculate RSCU = # observed codons / # expected codons
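The RSCU (Relative Synonymous Codon Usage) ratio can be sketched as follows, where the expected count of a codon is the family total divided by the number of synonymous codons; the counts used here are the Phe and Tyr rows from the MS2 A-protein table on slide 44.

```python
# Observed counts per amino-acid family (from the Fiers et al. MS2 table)
families = {
    "Phe": {"UUU": 6, "UUC": 10},
    "Tyr": {"UAU": 4, "UAC": 12},
}

def rscu(counts):
    # expected count of each codon = family total / number of synonymous codons
    expected = sum(counts.values()) / len(counts)
    return {codon: n / expected for codon, n in counts.items()}

for aa, counts in families.items():
    print(aa, rscu(counts))
# Phe: UUU 0.75, UUC 1.25; Tyr: UAU 0.5, UAC 1.5
```

An RSCU of 1 means the codon is used exactly as often as expected under equal usage; values above or below 1 quantify the bias.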

  43. Codon usage in bacterial genomes

  44. Evidence that synonymous codons are not all used with equal frequency: Fiers et al., 1975. A-protein gene of bacteriophage MS2, Nature 256, 273-278
  UUU Phe  6   UCU Ser  5   UAU Tyr  4   UGU Cys  0
  UUC Phe 10   UCC Ser  6   UAC Tyr 12   UGC Cys  3
  UUA Leu  8   UCA Ser  8   UAA Ter  *   UGA Ter  *
  UUG Leu  6   UCG Ser 10   UAG Ter  *   UGG Trp 12
  CUU Leu  6   CCU Pro  5   CAU His  2   CGU Arg  7
  CUC Leu  9   CCC Pro  5   CAC His  3   CGC Arg  6
  CUA Leu  5   CCA Pro  4   CAA Gln  9   CGA Arg  6
  CUG Leu  2   CCG Pro  3   CAG Gln  9   CGG Arg  3
  AUU Ile  1   ACU Thr 11   AAU Asn  2   AGU Ser  4
  AUC Ile  8   ACC Thr  5   AAC Asn 15   AGC Ser  3
  AUA Ile  7   ACA Thr  5   AAA Lys  5   AGA Arg  3
  AUG Met  7   ACG Thr  6   AAG Lys  9   AGG Arg  4
  GUU Val  8   GCU Ala  6   GAU Asp  8   GGU Gly 15
  GUC Val  7   GCC Ala 12   GAC Asp  5   GGC Gly  6
  GUA Val  7   GCA Ala  7   GAA Glu  5   GGA Gly  2
  GUG Val  9   GCG Ala 10   GAG Glu 12   GGG Gly  5

  45. Multivariate reduction • Attempts to reduce a high-dimensional space to a lower-dimensional one; in other words, it tries to simplify the data set • Many of the variables might co-vary, so there might be only one, or a few, sources of variation in the data set • A gene can be represented by a 59-dimensional vector (universal code) • A genome consists of hundreds (thousands) of these genes • Variation in the variables (RSCU values) might be governed by only a small number of factors
