
Statistics for Genomicists


Presentation Transcript


  1. Statistics for Genomicists Wolfgang Huber EMBL-EBI Oct 2007

  2. Acknowledgement Many of the slides are from, or based on, an EMBnet course taught in March 2007 by Frédéric Schütz and Darlene Goldstein from the Swiss Institute for Bioinformatics / EPFL

  3. The main tasks of statistics • Descriptive statistics: summarizing datasets by a few numbers • Exploratory data analysis and visualization: find patterns and construct hypotheses • Significance testing: do the data support the existence of a significant trend or is it just noise? • Clustering: finding patterns in the noise • Regression: can you explain the behaviour of a variable as a function of the others? • Classification: putting things into the right drawers

  4. Histogram: summarizes a univariate sample [Figure: histogram; the area of each bar represents the proportion of observations in its bin, e.g. between 16 and 17.] • Main complication: the choice of bin width / number of bins • Most statistical programs do a good job of choosing a reasonable bin width, but manual override is sometimes necessary.

  5. [Figure: three histograms of the same data. Left: R default parameters (here, one bin per 5 units). Middle: user choice (one bin per 0.5 units). Right: user choice (one bin per 0.006 units).]
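A minimal sketch of these choices in R; x is hypothetical data standing in for the slide's sample, and the deliberately narrow 0.05-unit bins merely mimic the effect of the slide's 0.006-unit bins:

    set.seed(1)
    x <- rnorm(200, mean = 15, sd = 2)    # hypothetical sample
    par(mfrow = c(1, 3))
    hist(x, main = "R default")           # R picks the breaks itself
    hist(x, breaks = seq(floor(min(x)), ceiling(max(x)), by = 0.5),
         main = "0.5-unit bins")          # manual override
    hist(x, breaks = seq(floor(min(x)), ceiling(max(x)), by = 0.05),
         main = "0.05-unit bins")         # far too narrow: mostly noise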

  6. Measures of location: mean • “Arithmetic mean” • Sum of the values divided by the number of values • All observations treated equally • Suitable for symmetrical distributions • Sensitive to presence of outliers (“unusual values”) • Trimmed mean: • “Olympic scoring” • Remove extreme values (e.g. 10%) on each side before calculating the mean • In R: > mean(data) > mean(data, trim=0.1)

  7. Mean: not robust against outliers [Figure: data with one outlier; the mean is pulled towards it.]

  8. Trimmed mean [Figure: the same data; the trimmed mean (trim = 0.3) stays with the bulk of the points.]

  9. Side note: removing data • In the past, data was removed if it “looked” incorrect • Gregor Mendel’s peas (results too good to be true) • Albert Michelson’s data on the speed of light • Johannes Kepler on planet orbits • Outliers (unusual observations, far away from the rest of the data) do occur for various good reasons. • Data points can be removed (e.g. trimmed mean) • if the decision is made before looking at the data; or • if the discrepancies can be explained. • Otherwise, this is akin to data snooping. • There are statistical methods (called robust methods) which can handle outliers.

  10. In R: > library(MASS) > data(phones) > ?phones

  11. Least squares regression
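The slide's figure is not reproduced here; a sketch in the same spirit uses the phones data just loaded. Its mid-1960s values were recorded inconsistently and act as outliers, pulling the least-squares line away from the bulk of the data, while rlm from MASS gives a robust fit:

    library(MASS)
    data(phones)
    plot(phones$year, phones$calls, xlab = "Year", ylab = "Calls")
    abline(lm(calls ~ year, data = phones), lty = 2)   # least squares: attracted by the outliers
    abline(rlm(calls ~ year, data = phones,
               maxit = 50))                            # robust M-estimate; extra iterations
                                                       # to ensure convergence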

  12. Measures of location: Median • More appropriate for skewed distributions • Mean = Median if the distribution is symmetrical • Not sensitive to the presence of outliers, since it “ignores” almost all the values [Figure: the median splits the data, with 50% of the data below and 50% above.] In R: > median(data)

  13. Quartiles and percentiles [Figure: 25% of the data lie below the 1st quartile, 50% between the 1st and 3rd quartiles, and 25% above the 3rd quartile; the xth percentile has x% of the data below it.] In R: > quantile(data, 0.25) > quantile(data, 0.5) # same as median(data) > quantile(data, p) # the (100·p)th percentile, for p between 0 and 1

  14. Median: resistance to outliers [Figure: the same data with an outlier; the mean is dragged towards it while the median barely moves.]
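A small illustration, with hypothetical numbers, of how the three location estimates react to a single outlier:

    x <- c(1.8, 2.1, 2.3, 2.4, 2.6, 2.9, 50)   # hypothetical data; 50 is an outlier
    mean(x)               # 9.16 - dragged far from the bulk of the data
    mean(x, trim = 0.3)   # 2.43 - extreme values removed before averaging
    median(x)             # 2.4  - essentially unaffected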

  15. Mode • Most common value in the data [Figure: a symmetric unimodal density, where Mean = Median = Mode.] • Bimodal data (important to recognize) [Figure: a density with two peaks.]

  16. Mode For continuous-valued data, the mode is defined as the location of the maximum of the density. There is no simple finite-sample estimator of the mode; all estimators depend on some sort of smoothing. See the functions shorth and half.range.mode in the Bioconductor package genefilter, and the modeest package on CRAN.
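A sketch of finite-sample mode estimation on hypothetical bimodal data; the genefilter calls are left as comments since that Bioconductor package may not be installed:

    set.seed(1)
    x <- c(rnorm(100, 0), rnorm(50, 4))   # hypothetical bimodal sample
    d <- density(x)                       # kernel density estimate (the smoothing step)
    d$x[which.max(d$y)]                   # location of the density maximum
    # library(genefilter)
    # shorth(x)            # midpoint of the shortest interval holding half the data
    # half.range.mode(x)   # recursive half-range mode estimator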

  17. Spread [Figure: two distributions with the same mean, one with a narrower spread and one with a wider spread.]

  18. Standard Deviation • The standard deviation (SD, σ) of a variable is the square root of the average squared deviation from the mean • Used in conjunction with the mean • Same unit as the data • In R: > sd(data) (note that sd uses the sample version, with denominator n − 1) [Figure: a density with the mean marked.]

  19. Interquartile range (IQR) • IQR = 3rd quartile − 1st quartile • Used in conjunction with the median • In R: > IQR(data) [Figure: 25% of the data below the 1st quartile, 50% between the quartiles, 25% above the 3rd quartile.]

  20. Density • The density describes the theoretical probability distribution of a variable • Conceptually, it is obtained in the limit of infinitely many data points • When we estimate it from a finite set of data, we usually assume that the density is a smooth function • You can think of it as a “smoothed histogram” (but to actually compute it, there are much better methods!)
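One of those better methods is kernel density estimation, available in base R; a minimal sketch on hypothetical data:

    set.seed(1)
    x <- rnorm(500)              # hypothetical sample
    hist(x, freq = FALSE)        # freq = FALSE puts the histogram on the density scale
    lines(density(x), lwd = 2)   # kernel density estimate overlaid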

  21. Density of the normal distribution and the SD [Figure: normal density with the mean marked; the shaded area within 1 SD of the mean indicates the probability that a random observation will fall into this range.]
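The shaded probability can be computed directly from the normal cumulative distribution function:

    pnorm(1) - pnorm(-1)   # P(within 1 SD of the mean): about 0.683
    pnorm(2) - pnorm(-2)   # P(within 2 SD): about 0.954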

  22. Representing data: some bad practices Estimating an illegal phenomenon (unauthorized copying of computer programs) is hard, and the methodology is heavily contested. The estimates probably carry a large uncertainty, which is missing here, making comparisons between percentages very difficult. Calculating actual losses is even more contentious! Third Annual BSA and IDC Global Software Piracy Study, May 2006

  23. Scientists seem to do better: a «random» sample

  24. Representing data: the “bar + error” plot [Figure: bar chart of two groups, A and B; the bar height shows the mean, the error bar extends to mean + SD, and ** marks p < 0.01.] Legend: mean of measurement for groups A (25 subjects) and B (18 subjects); error bars indicate the standard deviation in each group; two-sided two-sample t-test.
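A sketch of the test named in the legend, with hypothetical values standing in for the two groups:

    set.seed(1)
    A <- rnorm(25, mean = 2.4, sd = 1.5)   # hypothetical group A (25 subjects)
    B <- rnorm(18, mean = 1.2, sd = 1.5)   # hypothetical group B (18 subjects)
    t.test(A, B)                           # two-sided two-sample t-test
                                           # (Welch by default; var.equal = TRUE pools the variances)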

  25. Boxplot A boxplot is a graphical representation of the five-number summary: minimum, first quartile, median, third quartile, maximum. [Figure: a boxplot with the median, box, whiskers and outliers labelled.] • 50% of the data are in the box (the “interquartile range”); 25% of the data are below the box and 25% above; 50% are below the median and 50% above • Outliers (unusual values) are those data points whose distance from the box is larger than 1.5 times the interquartile range • The whiskers extend to the last point which is not an outlier
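In R, a minimal sketch on hypothetical data:

    set.seed(1)
    x <- c(rnorm(100), 5, 6)   # hypothetical data with two high outliers
    boxplot(x)
    fivenum(x)                 # Tukey's five-number summary
    boxplot.stats(x)$out       # the points drawn individually as outliers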

  26. Boxplots: example If there are only a few datapoints in the boxplot, it can be “degenerate” (i.e. not all features are present). From Moritz et al., Anal. Chem. 2004 Aug 15; 76(16):4811-24

  27. Boxplot: a different example With the 1.5 × IQR definition, almost all datasets will produce outliers (here, some 20% of all points count as “outliers”). In this case, the plots are made of several thousand data points, so a boxplot drawing the outliers would not be very informative: there would be too many of them.

  28. Comparisons of some graphs • In the next 4 slides, we are going to compare different methods for graphing univariate data • Four methods are shown in each case: • Individual data points on the x-axis; some random displacement (jitter) is added on the y-axis to avoid superimposition of too many points • Histogram with density superimposed • Mean +/- standard deviation • Boxplot • Other examples are given in the exercises.
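Before the examples, a rough R sketch (on one hypothetical dataset) of how these four displays can be produced:

    set.seed(1)
    x <- rnorm(100)   # hypothetical dataset
    par(mfrow = c(2, 2))
    # 1. Individual points, jittered vertically to reduce overplotting
    plot(x, jitter(rep(0, length(x)), amount = 0.2),
         ylab = "", yaxt = "n", main = "Jittered points")
    # 2. Histogram with density superimposed
    hist(x, freq = FALSE, main = "Histogram + density")
    lines(density(x))
    # 3. Mean +/- one standard deviation
    plot(1, mean(x), xaxt = "n", xlab = "", ylab = "",
         ylim = range(x), main = "Mean +/- SD")
    arrows(1, mean(x) - sd(x), 1, mean(x) + sd(x),
           angle = 90, code = 3, length = 0.1)
    # 4. Boxplot
    boxplot(x, main = "Boxplot")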

  29. Dataset 1 (500 points) [Figure: four panels — individual points with jitter on the y-axis; histogram and density; mean +/- SD; boxplot.]

  30. Dataset 2 (37 points) [Figure: the same four panels — individual points with jitter on the y-axis; histogram and density; mean +/- SD; boxplot.]

  31. Dataset 3 (100 points) [Figure: the same four panels — individual points with jitter on the y-axis; histogram and density; mean +/- SD; boxplot.] (courtesy Nadine Zangger)

  32. Dataset 4 (4 points) [Figure: the same four panels — individual points with jitter on the y-axis; histogram and density; mean +/- SD; boxplot.]

  33. Among the many other possible graphs: the pie chart [Figure: pie chart of relative protein abundances in human plasma, dominated by albumin (54.31%) and immunoglobulin G (16.61%), with some two dozen further proteins — transferrin, haptoglobin, immunoglobulins A and M, complement factors, a1-antitrypsin (3.83%), a2-macroglobulin (3.64%), and others — each contributing a few percent down to fractions of a percent.]

  34. Pie charts should be avoided! • Pie charts are used a lot in the business and mass-media world, but (fortunately) rarely in science • It is difficult to compare different sections of a pie chart, and even more difficult to compare the sections of two pie charts • Bar charts or dot charts are almost always better • Main exception: when your data are split at 25%, 50% or 75% (humans are good at judging right angles)

  35. Pie chart vs bar chart See http://en.wikipedia.org/wiki/Pie_chart (revision as of 23 February 2007, 18:25)

  36. Worse than a pie chart: a “3D pie chart”… PME Magazine, April 2006

  37. Bivariate data • Summarising univariate data is important • Most of the time, however, interesting information comes from looking at two variables simultaneously and comparing them, or finding a relation between them. • Scatterplots are the easiest way to display bivariate data.
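A minimal sketch of a scatterplot in R, with two hypothetical related variables:

    set.seed(1)
    x <- rnorm(50)
    y <- 2 * x + rnorm(50)   # y depends on x, plus noise
    plot(x, y)               # the relation is immediately visible
    cor(x, y)                # and can be quantified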

  38. Scatterplots

  39. Example: univariate data

  40. Same data on a scatterplot

  41. Testing: is my coin fair? • Formally: we want to make some inference about P(head) • Try it: toss the coin several times (say 7 times) • Assume that it is fair (P(head) = 0.5), and see if this assumption is compatible with the observations.

  42. [Table: 8 heads observed in 10 tosses. Probability of obtaining exactly 8 heads in 10 tosses: 0.04395. Probability of obtaining at least 8 heads in 10 tosses: 0.05469.] Slide annotation: significant evidence (p < 0.05) that the coin is biased towards head or tail.
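These numbers can be reproduced in R, together with an exact test:

    dbinom(8, size = 10, prob = 0.5)           # P(exactly 8 heads)  = 0.04395
    sum(dbinom(8:10, size = 10, prob = 0.5))   # P(at least 8 heads) = 0.05469
    binom.test(8, 10, p = 0.5)                 # exact two-sided binomial test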

  43. Observations: taking samples • Samples are taken from an underlying real or fictitious population • Samples can be of different sizes • Samples can be overlapping • Samples are assumed to be taken at random and independently from each other • Samples are assumed to be representative of the population (the population is “homogeneous”) [Figure: an underlying population from which samples 1, 2 and 3 are drawn; each yields an estimate: sample 1 => mean 1, sample 2 => mean 2, sample 3 => mean 3.]

  44. Sampling variability • Say we sample from a population in order to estimate the population mean of some (numerical) variable of interest (e.g. weight, height, number of children, etc.) • We would use the sample mean as our guess for the unknown value of the population mean • Our sample mean is very unlikely to be exactly equal to the (unknown) population mean just due to chance variation in sampling • Thus, it is useful to quantify the likely size of this chance variation (also called ‘chance error’ or ‘sampling error’, as distinct from ‘nonsampling errors’ such as bias) • If we estimate the mean multiple times from different samples, we will get a certain distribution.

  45. Central Limit Theorem (CLT) • The CLT says that if we • repeat the sampling process many times • compute the sample mean (or proportion) each time • make a histogram of all the means (or proportions) • then that histogram of sample means should look like the normal distribution • Of course, in practice we only get one sample from the population, i.e. one point of that normal distribution • The CLT provides the basis for making confidence intervals and hypothesis tests for means • This is proven for a large family of distributions for the data that are sampled, but there are also distributions for which the CLT is not applicable
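The theorem is easy to watch in action; a sketch that simulates many samples from a markedly skewed (exponential) population:

    set.seed(1)
    means <- replicate(10000, mean(rexp(30)))   # 10000 sample means, n = 30 each
    hist(means, breaks = 50, freq = FALSE)      # looks approximately normal,
                                                # although rexp itself is very skewed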

  46. Sampling variability of the sample mean • Say the SD in the population for the variable is known to be some number σ • If a sample of n individuals has been chosen ‘at random’ from the population, then the likely size of the chance error of the sample mean (called the standard error) is SE(mean) = σ / √n • This is the typical difference to be expected if the sampling is done twice independently and the averages are compared • Four times more data are required to halve the standard error • If σ is not known, you can substitute an estimate • Some people plot the SE instead of the SD on barplots
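The formula can be checked by simulation; a sketch with hypothetical values σ = 2 and n = 25:

    set.seed(1)
    sigma <- 2; n <- 25
    means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
    sd(means)         # empirical SD of the sample mean, close to...
    sigma / sqrt(n)   # ...the theoretical standard error, 0.4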

  47. Central Limit Theorem (CLT) [Figure: samples 1, 2, 3, … are each reduced to their mean (mean 1, mean 2, mean 3, …); the means follow a normal distribution whose mean is the true mean of the population and whose SD is σ / √n.] Note: this is the SD of the sample mean, also called the standard error; it is not the SD of the original population.

  48. Hypothesis testing • 2 hypotheses in competition: • H0: the NULL hypothesis, usually the most conservative • H1 or HA: the alternative hypothesis, the one we are actually interested in. • Examples of NULL hypothesis: • The coin is fair • This new drug is no better (or worse) than a placebo • There is no difference in weight between two given strains of mice • Examples of Alternative hypothesis: • The coin is biased (either towards tail or head) • The coin is biased towards tail • The coin has probability 0.6 of landing on tail • The drug is better than a placebo

  49. Hypothesis testing • We need something to measure how far my observation is from what I expect to see if H0 is correct: a test statistic • Example: number of heads obtained when tossing my coin a given number of times: low value of the test statistic → more likely not to reject H0; high value of the test statistic → more likely to reject H0 • Finally, we need a formal way to determine if the test statistic is “low” or “high” and actually make a decision. • We can never prove that the alternative hypothesis is true; we can only show evidence for or against the null hypothesis!
