
Statistics

Achim Tresch, Gene Center, LMU Munich


Presentation Transcript


  1. Statistics Achim Tresch Gene Center LMU Munich

  2. Topics: I. Descriptive Statistics, II. Test theory, III. Common tests, IV. Bivariate Analysis, V. Regression

  3. III. Common Tests Two-group comparisons. Which gene is "differentially" expressed? [Figure: gene expression measurements for genes A, ..., B in group 1 vs. group 2]

  4. III. Common Tests Two-group comparisons (group 1 vs. group 2). Question/hypothesis: is the expression of gene g in group 1 less than the expression in group 2? Data: expression of gene g in different samples (absolute scale). Test statistic: e.g., the difference of the group means; decide for "less expressed" if the statistic falls below a critical threshold.

  5. III. Common Tests Two-group comparisons. Bad idea: simply subtract the group means, d = mean(group 1) − mean(group 2). Problem: d is not scale invariant. Solution: divide d by (an estimate of) its standard deviation. This is essentially the two-sample t-statistic (for unpaired samples).

  6. III. Common Tests The t-test. There are variants of the t-test. One group: the one-sample t-test (is the group mean equal to μ?), based on the sample mean and variance. Two-group comparisons: the paired t-test (are the group means equal? -> later), the two-sample t-test assuming equal variances, and the two-sample t-test not assuming equal variances (Welch test).

  7. III. Common Tests The t-test. Requirement: the data is approximately normally distributed in both groups (there are ways to check this, e.g. by the Kolmogorov–Smirnov test). Decision between the unpaired t-test and the Welch test: it depends on whether the group variances can be assumed equal (this can be checked, e.g., with an F-test or Levene's test).
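A minimal sketch of these t-test variants with SciPy; the data are simulated, and the group sizes, means, and μ are made-up illustration values:

```python
# Hypothetical two-group expression data for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(loc=10.0, scale=2.0, size=12)
group2 = rng.normal(loc=12.0, scale=2.0, size=15)

# One-sample t-test: is the mean of group 1 equal to mu = 10?
t1, p1 = stats.ttest_1samp(group1, popmean=10.0)

# Two-sample t-test assuming equal variances.
t2, p2 = stats.ttest_ind(group1, group2, equal_var=True)

# Welch's t-test: no equal-variance assumption.
t3, p3 = stats.ttest_ind(group1, group2, equal_var=False)

# Normality check; Shapiro-Wilk is a common alternative to the
# Kolmogorov-Smirnov test mentioned on the slide.
w, p_norm = stats.shapiro(group1)
```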

  8. III. Common Tests Wilcoxon rank sum test (Mann-Whitney test, U-test). Nonparametric test for the comparison of two groups: is the distribution in group 1 systematically shifted relative to group 2? Original scale: 3 5 6 7 8 9 10 12 15 18; rank scale: 1 2 3 4 5 6 7 8 9 10. Rank sum group 1: 1+2+3+6+10 = 22; rank sum group 2: 4+5+7+8+9 = 33.

  9. III. Common Tests Wilcoxon rank sum test (Mann-Whitney test, U-test). The test statistic is the rank sum W of group 1. The corresponding p-value can be calculated exactly for small group sizes; approximations are available for larger group sizes (N > 20). Here, P(W ≤ 22 | H0) = 0.15. The Wilcoxon test can be carried out as a one-sided or as a two-sided test (default). [Figure: rank sum distribution of W for |group 1| = |group 2| = 5, with the observed W = 22 marked]
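The rank-sum example can be reproduced with SciPy. The group memberships below are reconstructed from the stated rank sums (group 1 holds the values with ranks 1, 2, 3, 6, 10):

```python
import numpy as np
from scipy import stats

group1 = np.array([3, 5, 6, 9, 18])    # ranks 1, 2, 3, 6, 10 -> sum 22
group2 = np.array([7, 8, 10, 12, 15])  # ranks 4, 5, 7, 8, 9  -> sum 33

ranks = stats.rankdata(np.concatenate([group1, group2]))
W1 = ranks[:len(group1)].sum()  # rank sum of group 1: 22.0

# SciPy reports the equivalent Mann-Whitney statistic U1 = W1 - n1(n1+1)/2
# and computes the p-value exactly for small, tie-free samples.
u, p = stats.mannwhitneyu(group1, group2, alternative='less')
print(W1, u, p)  # 22.0, 7.0, p ~ 0.15, matching P(W <= 22 | H0) above
```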

  10. III. Common Tests Tests for paired samples. Reminder, paired data: there are coupled measurements (xi, yi) of the same kind, i.e. data x1, x2, ..., xn and y1, y2, ..., yn. Essential: calculate the differences of the pairs, d1 = x1 − y1, d2 = x2 − y2, ..., dn = xn − yn, and perform a one-sample t-test on the data (dj) with μ = 0. Advantage over an unpaired analysis: removal of the "interindividual" (intra-group) variance. NB: approximately normal distribution of the data in both groups is still a requirement.
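A short sketch of this reduction to a one-sample test, with made-up pulse values for illustration:

```python
import numpy as np
from scipy import stats

x = np.array([68.0, 72.0, 75.0, 70.0, 74.0, 69.0])  # pulse, untrained
y = np.array([62.0, 70.0, 71.0, 66.0, 69.0, 65.0])  # pulse, trained (same subjects)

# One-sample t-test on the pairwise differences, mu = 0.
d = x - y
t_one, p_one = stats.ttest_1samp(d, popmean=0.0)

# The built-in paired t-test gives the identical result.
t_rel, p_rel = stats.ttest_rel(x, y)
assert np.isclose(t_one, t_rel) and np.isclose(p_one, p_rel)
```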

  11. III. Common Tests t-test for paired samples. Graphical description: [Figure: boxplots of pulse for trained vs. untrained, and a boxplot of the pairwise differences]

  12. III. Common Tests Wilcoxon signed rank test. Nonparametric version for paired samples: are the values in group 1 smaller than in group 2? Idea: if the groups are not different, the "mirrored" distribution of the negative differences should be similar to the distribution of the positive differences. Check this similarity with a Wilcoxon rank sum test comparing the sign-flipped negative differences against the positive differences.

  13. III. Common Tests Wilcoxon signed rank test. Example: negative differences −3, −2; positive differences 2, 7, 7. Absolute values: 2, 2, 3, 7, 7; rank scale*: the two values 2 tie at midrank 1.5, the value 3 gets rank 3, the two values 7 tie at midrank 4.5. Rank sums: group 1 (negative): 1.5 + 3 = 4.5; group 2 (positive): 1.5 + 4.5 + 4.5 = 10.5. → Perform a Wilcoxon rank sum test with |group 1| = 2, |group 2| = 3. *In case of k ties (k identical values), replace their ranks j, ..., j+k−1 by the common midrank j + (k−1)/2.
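A sketch of the signed-rank test on the differences reconstructed above; note that in the presence of ties SciPy falls back to a normal approximation for the p-value (and warns for samples this small):

```python
from scipy import stats

# Differences reconstructed from the rank sums on the slide.
d = [-3, -2, 2, 7, 7]

# Two-sided by default; ties are handled with midranks as in the footnote.
w, p = stats.wilcoxon(d)
```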

  14. III. Common Tests Summary: group comparison of a continuous endpoint. Question: are group 1 and group 2 identical with respect to the distribution of the endpoint? Does the data follow a Gaussian distribution? If yes and the data are paired: t-test for paired data; if yes and unpaired: t-test for unpaired data. If no and paired: Wilcoxon signed rank test; if no and unpaired: Wilcoxon rank sum test.

  15. III. Common Tests Comparison of two binary variables, unpaired data: Fisher's exact test. Are there differences between the two groups in the distribution of the binary outcome (2×2 contingency table, with the margins given)?
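A minimal sketch with a hypothetical 2×2 table (rows: group 1 / group 2; columns: outcome yes / no):

```python
from scipy import stats

table = [[8, 2],
         [3, 7]]
odds_ratio, p = stats.fisher_exact(table, alternative='two-sided')
```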

  16. III. Common Tests Comparison of two binary variables, paired data: McNemar test (German mnemonic "Sparsamer Schotte", "thrifty Scotsman"). Are the paired measurements concordant or discordant? Example: clinical trial, placebo vs. verum (each individual receives both treatments on different occasions). Only the discordant pairs carry information about a treatment difference; the concordant pairs do not.

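A sketch of McNemar's test with statsmodels on a hypothetical paired table; only the discordant (off-diagonal) cells drive the test:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Rows: response under placebo (yes/no); columns: response under verum.
table = [[20, 5],
         [15, 10]]
result = mcnemar(table, exact=True)  # exact binomial test on the 5-vs-15 split
print(result.statistic, result.pvalue)
```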

  18. III. Common Tests Comparison of two categorial variables, unpaired samples: chi-squared test (χ²-test). H0: the two variables are independent. Idea: measure the deviation of the observed counts from the counts expected under independence, χ² = Σ (observed − expected)² / expected, summed over all cells. This test statistic asymptotically follows a χ²-distribution. -> Requirement: each cell contains roughly ≥ 5 (expected) counts.
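A minimal sketch with a hypothetical contingency table:

```python
from scipy.stats import chi2_contingency

table = [[30, 10, 20],
         [15, 25, 20]]
chi2, p, dof, expected = chi2_contingency(table)
# 'expected' holds the cell counts under independence; the rule of thumb
# above asks for roughly >= 5 expected counts per cell.
```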

  19. III. Common Tests Summary: comparison of two categorial variables. Question: is there a difference in the frequency distribution of one variable w.r.t. the values of the second variable? Binary data and paired: McNemar test; binary and unpaired: Fisher's exact test; non-binary and unpaired: chi-squared (χ²) test; non-binary and paired: bivariate symmetry tests.

  20.–25. III. Common Tests Summary: description and testing (two-sample comparison). [Summary table built up stepwise across slides 20–25; not preserved in the transcript] *For Gaussian distributions, or at least symmetric distributions (|skewness| < 1).

  26. III. Common Tests Caveat: statistical significance ≠ relevance. For large sample sizes, very small differences may become significant; for small sample sizes, an observed difference may be relevant, yet not statistically significant.

  27. III. Common Tests Multiple testing problems. Examples: simultaneous testing of many endpoints (e.g., genes in a microarray study); simultaneous pairwise comparison of k groups (k(k−1)/2 pairwise tests). Although each individual test keeps the significance level (say α = 5%), the probability of obtaining at least one false positive increases dramatically with the number m of tests: if there are no true positives, P(at least one false positive) = 1 − (1 − α)^m, e.g. 1 − 0.95⁶ ≈ 26% for m = 6 tests.

  28. III. Common Tests Multiple testing problems. One possible solution: p-value correction for multiple testing, e.g. the Bonferroni correction: each single test is performed at the level α/m ("local significance level α/m"), where m is the number of tests. The probability of obtaining at least one false positive is then at most α ("multiple/global significance level α"). Example: m = 6, desired multiple level α = 5% → local level α/m = 5%/6 ≈ 0.83%. Other solutions: Bonferroni-Holm (which also controls the family-wise error rate, FWER), and procedures such as Benjamini-Hochberg and SAM, which control the false discovery rate (FDR) instead of significance at the group level.
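A sketch of both corrections with statsmodels, on made-up p-values:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060])

# Bonferroni: controls the family-wise error rate (FWER) at alpha.
reject_bonf, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')

# Benjamini-Hochberg: controls the false discovery rate (FDR) instead.
reject_bh, p_bh, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')

# Uncorrected family-wise error probability for m = 6 tests, as quoted above:
print(1 - 0.95**6)  # ~ 0.265
```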

  29. IV. Bivariate Analysis (relation of two variables). Example: how to quantify a relation between two continuous variables? From: A. Wakolbinger

  30. IV. Bivariate Analysis Pearson correlation coefficient rxy. Useful for Gaussian variables X, Y (but not only for those); measures the degree of linear dependence: rxy = Σi (xi − x̄)(yi − ȳ) / √(Σi (xi − x̄)² · Σi (yi − ȳ)²). Properties: −1 ≤ rxy ≤ +1; rxy = ±1 means perfect linear dependence, and the sign indicates the direction of the relation (positive/negative dependence). From: A. Wakolbinger

  31.–35. IV. Bivariate Analysis Pearson correlation coefficient rxy. The closer rxy is to 0, the weaker the (linear) dependence. [Series of scatterplots with decreasing |rxy|; figures not preserved] From: A. Wakolbinger

  36. IV. Bivariate Analysis Pearson correlation coefficient rxy. The closer rxy is to 0, the weaker the (linear) dependence; rxy = ryx (symmetry). From: A. Wakolbinger

  37. IV. Bivariate Analysis Pearson correlation coefficient rxy. Example: relation of body height, body weight, and arm length; the two scatterplots give rxy = 0.38 and rxy = 0.84. The closer the data scatter around the regression line (see later), the larger |rxy|. From: A. Wakolbinger

  38. IV. Bivariate Analysis Pearson correlation coefficient rxy. How large is r in these examples? rxy ≈ 0 in all three, even though the variables are clearly dependent: the Pearson correlation coefficient cannot measure non-linear dependence properly. [Figure: three scatterplots with strong non-linear structure]

  39. IV. Bivariate Analysis Spearman correlation sxy. Idea: calculate the Pearson correlation coefficient on rank-transformed data. In the example, sxy = 0.95 on the rank scale (Rank(X), Rank(Y)) versus rxy = 0.88 on the original scale (X, Y). Spearman correlation measures the monotonicity of a dependence.
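A sketch contrasting the two coefficients on simulated monotone but non-linear data (the 0.95 / 0.88 values above come from the slide's own example, not from this simulation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 3.0, size=50)
y = np.exp(x) + rng.normal(scale=1.0, size=50)  # monotone, non-linear relation

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)  # larger than r_pearson here

# Spearman is exactly Pearson on the ranks:
r_check, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
assert np.isclose(r_spearman, r_check)
```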

  40.–43. IV. Bivariate Analysis Pearson vs. Spearman correlation. [Figures: the same data shown on the original scale with the Pearson correlation, and after rank transformation with the Spearman correlation]

  44. IV. Bivariate Analysis Pearson vs. Spearman correlation. Conclusion: Spearman correlation is more robust against outliers and insensitive to monotone changes of scale. In the case of an (expected) linear dependence, however, Pearson correlation is more sensitive.

  45. IV. Bivariate Analysis Pearson vs. Spearman correlation. Summary: Pearson correlation is a measure of linear dependence; Spearman correlation is a measure of monotone dependence. Correlation coefficients do not tell anything about the (non-)existence of a functional dependence; they tell nothing about causal relations of two variables X and Y (on the contrary, they are symmetric in X and Y); and they hardly tell anything about the shape of a scatterplot.

  46. IV. Bivariate Analysis Spurious ("fake") correlation, confounding. Example: income vs. foot size, r = 0.6, but gender drives both variables. A confounder is a variable that "explains" (part of) the dependence of two others.

  47. IV. Bivariate Analysis Spurious correlation, confounding. The difference due to sex is Mean(income ♂) − Mean(income ♀). The partial correlation is the "remaining" correlation after correction for the confounder (here: gender), computed on the slide as the average of the within-group correlations: rXY|gender = (rXY(♂) + rXY(♀)) / 2 ≈ 0.03 for income vs. foot size.
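A sketch of a partial correlation on simulated data. Instead of the slide's within-group averaging, it uses the closely related standard residual-based definition (regress both variables on the confounder and correlate the residuals); all values are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
gender = rng.integers(0, 2, size=500)  # 0 = female, 1 = male (confounder)
foot = 24.0 + 3.0 * gender + rng.normal(scale=1.0, size=500)
income = 2000.0 + 800.0 * gender + rng.normal(scale=400.0, size=500)

r_raw, _ = stats.pearsonr(foot, income)  # spurious correlation via gender

def residuals(y, z):
    """Residuals of y after a least-squares fit on z (with intercept)."""
    Z = np.column_stack([np.ones(len(z)), z])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta

r_partial, _ = stats.pearsonr(residuals(foot, gender), residuals(income, gender))
print(r_raw, r_partial)  # r_partial should be close to 0
```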

  48. V. Regression (approximation of one variable by a function of a second variable). [Figure: population with an unknown functional dependence; a sample drawn from it; the fitted sample regression function]

  49. V. Regression: the method. Choose a family of functions that you think is capable of capturing the functional dependence of the two variables, e.g. the set of linear functions f(x) = ax + b, or the set of quadratic functions f(x) = ax² + bx + c. [Figure: scatterplot with candidate fitted curves]

  50. V. Regression: the method. Choose a loss function, i.e. the quality measure for the approximation; for continuous data this is usually the quadratic loss, the residual sum of squares: RSS = Σj (actual value − predicted value)² = Σj (Yj − f(Xj))². [Figure: scatterplot with fitted curve Y = f(X); for a data point (Xj, Yj), the prediction is f(Xj) and the residual is Yj − f(Xj)]
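A minimal sketch of a least-squares line fit and its residual sum of squares with NumPy, on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 60.0, size=40)
y = 0.8 * x + 5.0 + rng.normal(scale=5.0, size=40)

a, b = np.polyfit(x, y, deg=1)        # minimizes RSS over linear f(x) = a*x + b
rss = np.sum((y - (a * x + b))**2)    # RSS = sum_j (Y_j - f(X_j))^2
```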
