
Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops. www.bioinformatics.ca. Module #: Title of Module. 2. Module 6 Backgrounder in Statistical Methods. David Wishart Informatics and Statistics for Metabolomics June 16-17,2014. Schedule. Learning Objectives. Learn about distributions and significance


Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Module #: Title of Module 2

  3. Module 6 Backgrounder in Statistical Methods David Wishart Informatics and Statistics for Metabolomics June 16-17,2014

  4. Schedule

  5. Learning Objectives • Learn about distributions and significance • Learn about univariate statistics (t-tests and ANOVA) • Learn about correlation and clustering • Learn about multivariate statistics (PCA and PLS-DA)

  6. Statistics • There are three kinds of lies: lies, damned lies, and statistics - Benjamin Disraeli • 98% of all statistics are made up – Unknown • Statistics are like bikinis.  What they reveal is suggestive, but what they conceal is vital  - Aaron Levenstein • Statistics is the mathematics of impressions

  7. Distributions & Significance

  8. Univariate Statistics

  9. Univariate Statistics • Univariate means a single variable • If you measure a population using some single measure such as height, weight, test score, IQ, you are measuring a single variable • If you plot that single variable over the whole population, measuring the frequency that a given value is reached you will get the following:

  10. A Bell Curve • Also called a Gaussian or Normal Distribution [Figure: # of each (y-axis) versus Height (x-axis)]

  11. Features of a Normal Distribution • Symmetric distribution • Has an average or mean value (μ) at the centre • Has a characteristic width called the standard deviation (σ) • Most common type of distribution known

  12. Normal Distribution • Almost any set of biological or physical measurements will display some variation, and these will almost always follow a Normal distribution • The larger the set of measurements, the more “normal” the curve • Minimum set of measurements to get a normal distribution is 30-40

  13. Gaussian Distribution

  14. Some Equations • Mean: $\mu = \frac{1}{N}\sum_i x_i$ • Variance: $\sigma^2 = \frac{1}{N}\sum_i (x_i - \mu)^2$ • Standard Deviation: $\sigma = \sqrt{\frac{1}{N}\sum_i (x_i - \mu)^2}$
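The three equations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the `heights_cm` values are made up, and the variance divides by N (the population form shown on the slide) rather than N - 1.

```python
import math

def mean(values):
    """Mean: sum of the values divided by how many there are (N)."""
    return sum(values) / len(values)

def variance(values):
    """Population variance: average squared distance from the mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))

# Hypothetical measurements, chosen only to make the arithmetic easy to follow.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

print(mean(heights_cm))      # mean
print(variance(heights_cm))  # variance
print(std_dev(heights_cm))   # standard deviation
```

For these five values the mean is 170, the variance is 200 and the standard deviation is about 14.1.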

  15. Standard Deviations (Z-values)

  16. Significance • Based on the Normal Distribution, the probability that something is >1 SD away (larger or smaller) from the mean is 32% • Based on the Normal Distribution, the probability that something is >2 SD away (larger or smaller) from the mean is 5% • Based on the Normal Distribution, the probability that something is >3 SD away (larger or smaller) from the mean is 0.3%
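The rounded percentages above can be checked with the complementary error function from Python's standard library. This is a small sketch; the function name `two_sided_tail` is illustrative and not from the slides.

```python
import math

def two_sided_tail(k):
    """Probability that a normally distributed value lies more than
    k standard deviations from the mean, in either direction."""
    return math.erfc(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"more than {k} SD from the mean: {two_sided_tail(k):.4f}")
```

The exact values are roughly 0.317, 0.046 and 0.0027, which the slide rounds to 32%, 5% and 0.3%.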

  17. Significance • In a test with a class of 400 students, if you score the average you typically receive a “C” • In a test with a class of 400 students, if you score 1 SD above the average you typically receive a “B” • In a test with a class of 400 students, if you score 2 SD above the average you typically receive an “A”

  18. The P-value • The p-value is the probability of obtaining a test statistic (a score, a set of events, a height) at least as extreme as the one that was actually observed, assuming the null hypothesis is true • One "rejects the null hypothesis" when the p-value is less than the significance level α, which is often 0.05 or 0.01 • When the null hypothesis is rejected, the result is said to be statistically significant

  19. P-value • If the average height of an adult (M+F) human is 5’ 7” and the standard deviation is 5”, what is the probability of finding someone who is more than 6’ 10”? • If you choose an α of 0.05, is a 6’ 11” individual a member of the human species? • If you choose an α of 0.01, is a 6’ 11” individual a member of the human species?
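One way to work the height question is sketched below. It assumes heights are normally distributed with the mean and standard deviation given on the slide, and it uses a one-sided (upper tail) probability; the constant and function names are illustrative.

```python
import math

MEAN_IN = 67.0  # 5'7" expressed in inches
SD_IN = 5.0     # standard deviation in inches

def upper_tail(height_in):
    """Probability of a height above height_in under the assumed normal model."""
    z = (height_in - MEAN_IN) / SD_IN  # number of SDs above the mean
    return 0.5 * math.erfc(z / math.sqrt(2))

p_6_10 = upper_tail(82.0)  # 6'10" is exactly 3 SD above the mean
p_6_11 = upper_tail(83.0)  # 6'11" is 3.2 SD above the mean

print(f"P(taller than 6'10\") = {p_6_10:.5f}")
print(f"P(taller than 6'11\") = {p_6_11:.5f}")
```

Under these assumptions the chance of exceeding 6’ 10” is about 0.00135, and the value for 6’ 11” is smaller still, so it falls below both an α of 0.05 and an α of 0.01.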

  20. P-value • If you flip a fair coin 20 times, the probability that it turns up heads 14 or more times is 60,460/1,048,576 = 0.058 • If you choose an α of 0.05, is a coin that gives 14/20 heads a fair coin? • If you choose an α of 0.10, is this coin a fair coin?
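The coin-flip figure can be reproduced by summing binomial probabilities. The sketch below assumes a fair coin (p = 0.5) and a one-sided question, "14 or more heads"; the function name `prob_at_least` is illustrative.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_exactly_14 = comb(20, 14) / 2**20   # exactly 14 heads in 20 flips
p_14_or_more = prob_at_least(14, 20)  # 14, 15, ..., 20 heads

print(f"exactly 14 heads: {p_exactly_14:.4f}")
print(f"14 or more heads: {p_14_or_more:.4f}")
```

The value of about 0.058 corresponds to 14 or more heads; exactly 14 heads on its own has a probability of about 0.037.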

  21. Mean, Median & Mode Mode Median Mean

  22. Mean, Median, Mode • In a Normal Distribution the mean, mode and median are all equal • In skewed distributions they are unequal • Mean - average value, affected by extreme values in the distribution • Median - the “middlemost” value, which usually lies between the mode and the mean in a skewed distribution • Mode - most common value

  23. Different Distributions Unimodal Bimodal

  24. Other Distributions • Binomial Distribution • Poisson Distribution • Extreme Value Distribution • Skewed or Exponential Distribution

  25. Binomial Distribution • Pascal’s triangle (binomial coefficients), one row per n: 1 | 1 1 | 1 2 1 | 1 3 3 1 | 1 4 6 4 1 | 1 5 10 10 5 1 • $P(x) = (p + q)^n$
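The rows of Pascal's triangle are the coefficients you get when expanding (p + q)^n, and each term of that expansion is the probability of a particular number of successes. A short sketch, with illustrative function names and an arbitrary p = 0.3:

```python
from math import comb

def pascal_row(n):
    """Binomial coefficients for n trials: one row of Pascal's triangle."""
    return [comb(n, k) for k in range(n + 1)]

def binomial_pmf(n, p):
    """Terms of the expansion of (p + q)^n with q = 1 - p.
    Term k is the probability of exactly k successes in n trials."""
    q = 1 - p
    return [comb(n, k) * p**k * q ** (n - k) for k in range(n + 1)]

for n in range(6):
    print(pascal_row(n))

# Because p + q = 1, the probabilities over all outcomes sum to 1.
print(sum(binomial_pmf(5, 0.3)))
```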

  26. Poisson Distribution [Figure: P(x), the proportion of samples, plotted against x for μ = 0.1, 1, 2, 3 and 10]

  27. Extreme Value Distribution • Arises from sampling the extreme end of a normal distribution • A distribution which is “skewed” due to its selective sampling • Skew can be either right or left Gaussian Distribution

  28. Skewed Distribution • Resembles an exponential or Poisson-like distribution • Lots of extreme values far from mean or mode • Hard to do useful statistical tests with this type of distribution Outliers

  29. Fixing a Skewed Distribution • A skewed or exponentially decaying distribution can often be transformed into an approximately “normal” or Gaussian distribution by applying a log transformation • Because the log transformation rescales the x-variable, it brings the outliers closer to the mean and makes the distribution much more Gaussian
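The effect of a log transformation can be seen on simulated data. The sketch below is an illustration under stated assumptions: the data are drawn from a log-normal distribution (a convenient stand-in for right-skewed measurements, not the workshop's data), and `skewness` is a hand-written helper where 0 means symmetric and large positive values mean a long right tail.

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed z-score."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]  # right-skewed values
logged = [math.log(x) for x in raw]                        # log-transformed values

print(f"skewness before log transform: {skewness(raw):.2f}")
print(f"skewness after log transform:  {skewness(logged):.2f}")
```

Before the transformation the skewness is strongly positive; afterwards it is close to zero, which is what a symmetric, bell-shaped distribution looks like.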

  30. Log Transformation [Figure: exp’t B on a linear scale (skewed distribution) and exp’t B after log transformation (normal distribution)]

  31. Log Transformation on Real Data

  32. Distinguishing 2 Populations Normals Leprechauns

  33. The Result # of each Height Are they different?

  34. What about these 2 Populations?

  35. The Result # of each Height Are they different?

  36. Student’s t-Test • Also called the t-Test • Used to determine if the means of 2 populations are different • Formally allows you to calculate the probability (p-value) of seeing a difference between 2 sample means at least as large as the one observed if the 2 population means were actually the same • If the t-Test gives you a p=0.4, and the α is 0.05, then there is no evidence that the 2 populations are different • If the t-Test gives you a p=0.04, and the α is 0.05, then the 2 populations are considered significantly different • Paired and unpaired t-Tests are available: paired is used for “before & after” experiments, while unpaired is used for 2 independently chosen samples
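The statistic behind an unpaired t-Test can be sketched as follows. This assumes the classic equal-variance (pooled) form of Student's test; the function name and the two small samples are made up for illustration. Turning the statistic into a p-value needs the t distribution, which is not in Python's standard library, so the sketch stops at the statistic and its degrees of freedom.

```python
import math
import statistics

def student_t(sample_a, sample_b):
    """Unpaired, equal-variance Student's t statistic and degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (divides by n - 1)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    # Pool the two variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df

# Hypothetical measurements for two groups.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, dof = student_t(group_a, group_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

For these samples t is about -1.90 with 8 degrees of freedom. The two-sided critical value at α = 0.05 for 8 degrees of freedom is 2.306, so this difference would not be called significant.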

  37. Student’s t-Test • A t-Test can also be used to determine whether 2 clusters are different if the clusters follow a normal distribution Variable 1 Variable 2

  38. What if the Distributions are not Normal?

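  39. Mann-Whitney U-Test • Also called the Wilcoxon Rank Sum Test • Used to determine if 2 non-normally distributed populations are different • More robust than the t-test to outliers and non-normal data, though somewhat less powerful when the data really are normal • Formally allows you to calculate the probability (p-value) of seeing a difference in ranks at least as large as the one observed if the 2 populations had the same distribution (often read as a comparison of medians) • If the U-Test gives you a p=0.4, and the α is 0.05, then there is no evidence that the 2 populations are different • If the U-Test gives you a p=0.04, and the α is 0.05, then the 2 populations are considered significantly different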
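The U statistic itself is simple to compute: count, over every pair of values (one from each sample), how often the first sample's value is the larger one. The sketch below uses that pairwise definition, with ties counted as half; the function name is illustrative, and the p-value step (which uses tables or a normal approximation) is left out.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b): the number of pairs in which the value from
    sample_a (respectively sample_b) is larger. Ties count as half."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b

# Completely separated samples give the most extreme result possible.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
# Interleaved samples give values near the middle of the range.
print(mann_whitney_u([1, 3, 5], [2, 4, 6]))
```

Because only the ordering of the values matters, a single extreme outlier changes U very little, which is why the test is robust for skewed data.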

  40. Distinguishing 3+ Populations Normals Leprechauns Elves

  41. The Result # of each Height Are they different?

  42. Distinguishing 3+ Populations

  43. The Result # of each Height Are they different?

  44. ANOVA • Also called Analysis of Variance • Used to determine if 3 or more populations are different; it is a generalization of the t-Test • Formally, ANOVA provides a statistical test (by comparing the variance between groups with the variance within groups) of whether or not the means of several groups are all equal • Uses an F-statistic to test for significance • 1-way, 2-way, 3-way and n-way ANOVAs exist; the most common is 1-way, which only tests whether any of the 3+ populations differ, not which pair is different
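The F-statistic of a 1-way ANOVA is the ratio of the variation between group means to the variation within groups. A minimal sketch, with a made-up function name and toy data; converting F into a p-value requires the F distribution and is not shown.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.fmean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (statistics.fmean(group) - grand_mean) ** 2 for group in groups
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        sum((x - statistics.fmean(group)) ** 2 for x in group) for group in groups
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Three hypothetical groups whose means differ by one unit each.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

For this toy data the between-group mean square is 3 and the within-group mean square is 1, giving F = 3 with (2, 6) degrees of freedom.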

  45. ANOVA • ANOVA can also be used to determine whether 3+ clusters are different -- if the clusters follow a normal distribution Variable 1 Variable 2

  46. Distinguishing N Populations (False Discovery Rate) • Suppose you performed 100 different t-tests, and found 20 results with a p value of <0.05 • How many of these findings are likely to be false? • If none of the 100 comparisons reflected a real difference, you would still expect roughly 100 x 0.05 = 5 results with p < 0.05 by chance alone • So as many as 5 of these 20 results could be false positives • To correct for this (Bonferroni correction) you keep only those results with a p value < 0.05/100 or p < 0.0005

  47. Example (Some Weather Predictions) • P = 0.08 It will rain • P = 0.05 It will be sunny • P = 0.06 It will be foggy • P = 0.02 It’ll be cloudy • P = 0.05 It will snow • P = 0.07 It will be windy • P = 0.06 It will be calm • P = 0.09 It will hail • P = 0.02 Lightning • P = 0.16 Thunder • P = 0.001 Eclipse • P = 0.09 Tornado • P = 0.18 Hurricane • P = 0.05 Sleet 100% certainty it will do something tomorrow Only one prediction is significant with FDR or Bonferroni correction (Eclipse)
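The claim that only one prediction survives correction can be checked directly. The sketch below applies a Bonferroni cutoff (α divided by the number of tests) and the Benjamini-Hochberg step-up procedure, a common way of controlling the false discovery rate, to the 14 values on this slide. The slide does not say which FDR procedure it has in mind, so Benjamini-Hochberg is an assumption, as are the function names.

```python
p_values = {
    "Rain": 0.08, "Sunny": 0.05, "Foggy": 0.06, "Cloudy": 0.02,
    "Snow": 0.05, "Windy": 0.07, "Calm": 0.06, "Hail": 0.09,
    "Lightning": 0.02, "Thunder": 0.16, "Eclipse": 0.001,
    "Tornado": 0.09, "Hurricane": 0.18, "Sleet": 0.05,
}

def bonferroni(p_values, alpha=0.05):
    """Keep results whose p-value is below alpha / (number of tests)."""
    cutoff = alpha / len(p_values)
    return [name for name, p in p_values.items() if p < cutoff]

def benjamini_hochberg(p_values, alpha=0.05):
    """Sort p-values, find the largest rank k with p_(k) <= (k / m) * alpha,
    and keep the k smallest p-values."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    largest_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * alpha:
            largest_rank = rank
    return [name for name, _ in ranked[:largest_rank]]

print("Bonferroni:", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

With 14 tests the Bonferroni cutoff is 0.05/14, or about 0.0036, and only the eclipse (p = 0.001) falls below it. The Benjamini-Hochberg procedure reaches the same result for these values.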

  48. Normalization/Scaling

  49. Normalization/Scaling • What if we measured the top population using a ruler that was miscalibrated or biased (inches were short by 10%)? We would get the following result: # of each Height

  50. Normalization • Normalization adjusts for systematic bias in the measurement tool • After normalization we would get: # of each Height
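A minimal sketch of the ruler example follows, assuming the size of the bias is known: if each marked inch is really 0.9 of a true inch, every reading comes out too large by a factor of 1/0.9, and multiplying by 0.9 undoes it. The constant, function name and heights are hypothetical. In practice the correction factor is usually estimated from the data or from reference standards rather than known in advance.

```python
# Assumed calibration: each "inch" on the faulty ruler is 0.9 of a true inch.
RULER_INCH_IN_TRUE_INCHES = 0.9

def normalize(readings, scale=RULER_INCH_IN_TRUE_INCHES):
    """Convert readings from the miscalibrated ruler back to true inches."""
    return [reading * scale for reading in readings]

true_heights = [60.0, 66.0, 72.0]  # hypothetical true heights in inches
biased_readings = [h / RULER_INCH_IN_TRUE_INCHES for h in true_heights]
corrected = normalize(biased_readings)

print([round(r, 1) for r in biased_readings])  # inflated readings
print([round(c, 1) for c in corrected])        # back on the true scale
```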
