Data Analysis and Statistics in the Lab

1. Data Analysis and Statistics in the Lab. Southern California Bioinformatics Summer Institute

2. Topics: Introduction; Data; Displaying Data; Descriptive Statistics; Inferential Statistics; Q&A; Wrap-up. © Richard Johnston

3. Objectives: Introduce commonly used statistical concepts and measures. Show how to compute statistical measures using Excel and R. Minimize discussion of statistical theory as much as possible. Provide sample calculations in the downloaded materials.

4. Asking questions of the data: Could I have gotten these results by chance? Is there a significant difference between these two samples? How can I describe my results? What can I say about the average of these measurements? What should I do with these outliers? Etc. The answer to the first question is always yes. We need to quantify the likelihood.

5. Tools used: Excel 2007; R (???)

6. Excel 2007: What's new (and different)? Mostly, the look and feel have changed. Up to 1,048,576 rows and 16,384 columns in a single worksheet. Up to 32,767 characters in a single cell. More sorting options. Enhanced data importing. Improved PivotTables.

7. What's new (and different) (2): More options for conditional formatting of cells. Multithreaded calculation of formulae, to speed up large calculations, especially on multi-core/multi-processor systems. Improved filtering. New charting features. Other changes to make Excel and other Office programs more Vista-like.

8. What is R? R is a language and environment for statistical computing and graphics, widely used in research institutions. R provides a wide variety of statistical and graphical techniques, and is highly extensible. One of R's strengths is the ease with which publication-quality plots can be produced. R is available as free software. It runs on Windows, MacOS, and a wide variety of UNIX platforms.

9. Types of data: Primary (collected by you) or Secondary (obtained from another source). Observational or Experimental. Quantitative or Qualitative. Quantitative data uses numerical values to describe something (e.g., weight, temperature). Qualitative data uses descriptive terms to classify something (e.g., gender, color).

10. Types of Measurement Scales: Nominal (Qualitative): examples are gender, color, species name. Ordinal (Qualitative or Quantitative): allows rank ordering of values. Examples: grades A-F; rating levels 1 through 5; "Slow", "Medium", "Fast".

11. Types of Measurement Scales (2): Interval (Quantitative): allows addition and subtraction, but not multiplication and division. No real zero point. Example: temperature measurement in degrees Fahrenheit. 100 degrees F is 50 more than 50 degrees F, but 100 degrees F is not twice as hot as 50 degrees F.

12. Types of Measurement Scales (3): Ratio (Quantitative): allows addition, subtraction, multiplication and division. Has a true zero point: a value of zero means the absence of the measured quantity. Examples: weight, age, or speed. To decide if a measurement is Interval or Ratio, see if the phrase "Twice as..." makes sense, e.g., twice as (heavy, old, fast).
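
The Fahrenheit example and the "twice as" test can be checked numerically. The following is a minimal Python sketch (the deck itself works in Excel and R); the helper name `fahrenheit_to_kelvin` is introduced here for illustration only.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert degrees Fahrenheit to kelvins, a ratio scale with a true zero."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


warm_f, cool_f = 100.0, 50.0

# Differences are meaningful on an interval scale.
difference_f = warm_f - cool_f

# The naive ratio of the Fahrenheit readings suggests "twice as hot" ...
naive_ratio = warm_f / cool_f

# ... but on an absolute temperature scale the ratio is only about 1.1.
kelvin_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cool_f)

print(difference_f, naive_ratio, round(kelvin_ratio, 3))
```

The ratio changes when the zero point changes, which is why ratios of interval-scale values carry no physical meaning.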

13. Displaying Data: Charts (Pie, Column, Line, Scatter); Histograms.

14. Pie Chart

15. Column Chart

16. Line Chart

17. Scatter (X-Y) Chart

18. Histograms

19. Frequency Distribution Histogram Using Excel: Open AspirinStudyData.xlsx or .xls. Create bin values 20, 25, ..., 100. Select Data Analysis. Select Histogram. Click Input Range and select the Age data. Click on Bin Values and select the bin values. Check Labels and Chart Output. Click OK.

20. Frequency Distribution Histogram Using R: Start R. Select File > Change dir... Browse to the default directory. Type the following (including capital letters):

21. Frequency Distribution Histogram Using R
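
The binning step behind a frequency distribution histogram can be sketched in a few lines. This is a Python illustration, not the Excel or R procedure from the slides. The `ages` list is hypothetical data standing in for the Age column of AspirinStudyData, and the upper-inclusive bin rule is an assumption about how Excel's Histogram tool counts values.

```python
from collections import Counter

# Hypothetical ages standing in for the Age column of AspirinStudyData.
ages = [23, 31, 35, 38, 42, 44, 47, 47, 52, 55, 58, 61, 63, 67, 72, 88]

# Bin upper limits 20, 25, ..., 100, as on the Excel slide.
bin_limits = list(range(20, 101, 5))


def bin_for(value, limits):
    """Return the smallest bin limit that is >= value (upper-inclusive bins)."""
    for limit in limits:
        if value <= limit:
            return limit
    return None  # value lies above the last bin limit


# Count how many ages fall into each bin.
counts = Counter(bin_for(age, bin_limits) for age in ages)

# Print a simple text histogram, one row per bin.
for limit in bin_limits:
    print(f"{limit:>3} | {'*' * counts[limit]}")
```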

22. Descriptive Statistics: Measures of Central Tendency (Mean, Median, Mode); Measures of Dispersion (Range, Variance, Standard Deviation, Quartiles, Interquartile Range).

23. Descriptive Statistics using Excel: Use built-in functions to perform basic analyses: Formulas | More Functions | Statistical. Use the Data Analysis Add-in for more complex analyses: open AspirinStudyData.xlsx; select Data | Data Analysis; select Descriptive Statistics; select the Age data; check Labels in first row; check Summary Statistics.

24. Descriptive Statistics using Excel

25. Descriptive Statistics using R: Type the following:

26. Central Tendency: Mean: the arithmetic average of the values. Median: the midpoint of the values (half are higher and half are lower); if there is an even number of values, the median is the average of the two middle values. Mode: the value that occurs most frequently; there may be more than one mode.
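
The three measures of central tendency can be illustrated with Python's standard `statistics` module. This is a sketch on made-up values, not the Excel or R session from the deck; `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

# Made-up sample with an even number of values and two modes.
values = [2, 3, 3, 5, 7, 7, 8, 9]

# Arithmetic average: sum of the values divided by their count.
mean = statistics.mean(values)

# Even count, so the median is the average of the two middle values (5 and 7).
median = statistics.median(values)

# All values tied for the highest frequency (here 3 and 7).
modes = statistics.multimode(values)

print(mean, median, modes)
```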

27. Measures of Dispersion: Range: gives an idea of the spread of values, but depends only on two of them: the largest and the smallest. Variance: averages the squared deviations of each value from the mean. Standard Deviation: calculated by taking the square root of the variance. More useful than the variance since it's in the same measurement units as the data.
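
A short Python sketch of these dispersion measures on made-up values follows. One detail the slide leaves out: a population variance divides the summed squared deviations by n, while a sample variance divides by n - 1. Both are shown, since which one a given tool reports depends on the function used.

```python
import math
import statistics

values = [4, 8, 6, 5, 3, 10]  # made-up measurements, mean = 6

# Range depends only on the largest and smallest values.
value_range = max(values) - min(values)

# Squared deviations from the mean sum to 34 for this data.
population_variance = statistics.pvariance(values)  # divides by n
sample_variance = statistics.variance(values)        # divides by n - 1

# The standard deviation is the square root of the variance,
# so it is expressed in the same units as the data.
sample_sd = statistics.stdev(values)

print(value_range, population_variance, sample_variance, sample_sd)
```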

28. Measures of Dispersion (2): Quartiles: divide the data into four equal segments.

29. Measures of Dispersion (3): Interquartile Range (IQR): measures the spread of the center 50% of the data. IQR = Q3 - Q1. Used to help identify outliers (more on this later).
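
Quartiles and the IQR can be sketched with `statistics.quantiles` (Python 3.8+). Several interpolation conventions exist for quartiles; this example assumes the "inclusive" method, and other tools or settings may give slightly different cut points on the same data.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data

# n=4 requests the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The interquartile range spans the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```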

30. Measures of Dispersion (4): Predicting the distribution of values. Empirical rule for "bell-shaped" curves: approximately 68% of the values will fall within 1 SD of the mean, 95% will fall within 2 SD of the mean, and 99.7% will fall within 3 SD of the mean.

31. Measures of Dispersion (5)
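
The percentages in the empirical rule follow from the normal distribution's cumulative distribution function. This is a minimal Python check using `statistics.NormalDist` (Python 3.8+), offered as a sketch rather than part of the original materials.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of values within k standard deviations of the mean:
# area to the left of +k minus area to the left of -k.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.1%}")
```

The computed shares are roughly 68.3%, 95.4%, and 99.7%, which the rule rounds to 68%, 95%, and 99.7%.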

32. Outliers: Outliers are values that are (or seem to be) out of line with the rest of the observations. Outliers can distort statistical measures. They may be indicative of transient errors in equipment or errors in transcription. They may also indicate a flaw in experimental assumptions. As we've just shown, about 3 observations out of 1000 can be expected to be over 3 SD from the mean. Outliers that can't be readily explained should receive careful attention.

33. Identifying Outliers: Quartiles and boxplots can help identify outliers. In R, type boxplot(Age). The middle box is the IQR. The horizontal line is the median. The whiskers extend at most 1.5 x IQR beyond the box. In this example, four values are outliers.
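
The 1.5 x IQR rule behind the boxplot whiskers can be sketched directly. This is a Python illustration on made-up data, not the Age data from the study; the function name `iqr_outliers` is hypothetical, and the quartile method is an assumption, so R's boxplot may place its box edges slightly differently on some data sets.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return values lying more than factor * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    upper_fence = q3 + factor * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Made-up data with one value far from the rest.
values = [1, 2, 3, 4, 5, 6, 7, 8, 30]
outliers = iqr_outliers(values)

print(outliers)
```

Here Q1 = 3 and Q3 = 7, so the fences sit at -3 and 13 and only the value 30 is flagged.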

34. The Normal Distribution

35. The Normal Distribution (2): The mean, median, and mode are the same value. The distribution is bell shaped and symmetrical around the mean. The total area under the curve is equal to 1. The left and right sides extend indefinitely. Notation: x = the normally distributed random variable of interest; μ = the mean of the normal distribution; σ = the standard deviation of the normal distribution; z = the number of standard deviations between x and μ, otherwise known as the standard z-score.

36. The Normal Distribution (3)

37. The Standard Normal Distribution: The standard normal distribution is a normal distribution with μ = 0 and σ = 1. The total area under the standard normal curve is equal to 1.

38. The Standard Normal Distribution (2)
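
Converting a value to a z-score maps any normal distribution onto the standard normal distribution. The sketch below uses Python's `statistics.NormalDist`; the numbers (mean 100, standard deviation 15, observation 130) are made up for illustration, and `z_score` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations between x and the mean mu."""
    return (x - mu) / sigma


# Made-up example: an observation of 130 from a normal
# distribution with mean 100 and standard deviation 15.
z = z_score(130.0, mu=100.0, sigma=15.0)

# Area under the standard normal curve to the left of z ...
area_left = NormalDist().cdf(z)

# ... equals the area to the left of x under the original curve.
same_area = NormalDist(mu=100.0, sigma=15.0).cdf(130.0)

print(z, round(area_left, 4), round(same_area, 4))
```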

39. Inferential Statistics: Sampling; Sampling Distributions; Confidence Intervals; Hypothesis Testing.

40. Sampling: The term "population" in statistics represents all possible outcomes or measurements of interest in a particular study. A "sample" is a subset of the population that is representative of the whole population. Analysis of a sample allows us to infer characteristics of the entire population with a quantifiable degree of certainty.

41. Sample Examples: In the 1980s, Harvard conducted a study of the effectiveness of aspirin in the prevention of heart attacks. They followed over 22,000 physicians for five years. Half of the physicians were given a daily dose of aspirin, and half were given a placebo. Neither the subjects nor the investigators knew which was being administered. (More on this later.) A coin is flipped 20 times in each of 20 trials to determine whether it is "fair". Seeds are divided randomly into two groups and planted. One group receives fertilizer A and the other group receives fertilizer B. All other factors (light, water, etc.) are kept the same.

42. Sample Examples (2): Patients at 6 US hospitals were randomly assigned to 1 of 3 groups: 604 received intercessory prayer after being informed that they may or may not receive prayer; 597 did not receive intercessory prayer, also after being informed that they may or may not receive prayer; and 601 received intercessory prayer after being informed that they would receive prayer. Intercessory prayer was provided for 14 days, starting the night before coronary artery bypass graft surgery (CABG). The primary outcome was presence of any complication within 30 days of CABG. Secondary outcomes were any major event and mortality. (American Heart Journal, 2006)

43. Sample Size: Several factors contribute to the determination of the sample size needed for a particular study: desired confidence level (95%, 99%); margin of error (5%, 3%); population size (results don't change much for populations of 20,000 or more); expected proportion (p = q = 0.5 is conservative). For example, a 99% confidence level with a margin of error of 6% would require about 450 samples. Formulas for sample size vary, and are not presented here. Several online tools are available for determining sample sizes (e.g., http://www.raosoft.com/samplesize.html).
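
One commonly used formula for the sample size needed to estimate a proportion is n0 = z^2 p(1 - p) / e^2, optionally followed by a finite population correction. The Python sketch below applies it. It is one of several formulas in use and not necessarily the one behind the slide's figure or the online tool, and the function name `sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(confidence, margin, population=None, p=0.5):
    """Sample size for estimating a proportion (normal approximation)."""
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction shrinks n for small populations.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


# 99% confidence, 6% margin of error, conservative p = 0.5.
n_unlimited = sample_size(0.99, 0.06)
n_finite = sample_size(0.99, 0.06, population=20_000)

print(n_unlimited, n_finite)
```

Under these assumptions the result for a population of 20,000 lands close to the "about 450" quoted on the slide.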

44. Aspirin Study Scenario: Your company has just completed a five-year study on the effectiveness of aspirin in preventing heart attacks. Five hundred physician volunteers were divided randomly into two groups. One group received 325 mg of aspirin every other day. The other group received a placebo instead of aspirin. Neither the subjects nor the test administrators knew whether aspirin or placebo was being administered. The subjects were monitored for five years to determine whether or not they experienced a heart attack.

45. Aspirin Study Data

46. Exploring the Data

47. Exploring the Data (2)

48. Confidence Intervals: If we perform an experiment such as flipping a coin, we can count the number of "successes" (i.e., heads) in a number of trials to get an estimate of the underlying probability that a flip of the coin will result in heads. The larger the number of trials, the more confident we are that the true probability is within a certain range, or confidence interval. The number of successes in repeated experiments such as this is approximately normally distributed (the familiar bell-shaped curve) when the number of trials is large. Assuming a normal distribution of results allows us to calculate statistical characteristics for a wide variety of experiments, including clinical trials.
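
The coin-flip reasoning can be sketched as a confidence interval for a proportion, using the normal approximation. The counts below are invented, `proportion_ci` is a hypothetical helper, and the simple interval shown is only one of several in common use.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, trials, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / trials
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - margin, p_hat + margin


# Same observed share of heads (60%), two different numbers of flips.
low_100, high_100 = proportion_ci(60, 100)
low_1000, high_1000 = proportion_ci(600, 1000)

print(f"100 flips:  {low_100:.3f} to {high_100:.3f}")
print(f"1000 flips: {low_1000:.3f} to {high_1000:.3f}")
```

With ten times as many flips the interval is noticeably narrower, which is the sense in which more trials make us more confident about the true probability.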

49. Confidence Intervals (2): The "true" chance of attack in the Placebo Group is referred to as p1. Similarly, the "true" chance of attack in the Aspirin Group is referred to as p2. Our objective is to estimate the true difference of p1 and p2 using the results of this study.

50. Confidence Intervals (3)
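
Estimating the difference between p1 and p2 can be sketched with a normal-approximation interval for the difference of two proportions. The study's actual counts are not reproduced in this transcript, so the numbers below are hypothetical, as is the helper name `difference_ci`. This is not necessarily the calculation shown on the original slides.

```python
import math
from statistics import NormalDist


def difference_ci(x1, n1, x2, n2, confidence=0.95):
    """Estimate p1 - p2 with a normal-approximation confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 26 attacks among 250 placebo subjects,
# 14 attacks among 250 aspirin subjects.
diff, low, high = difference_ci(26, 250, 14, 250)

print(f"p1 - p2 = {diff:.3f}, 95% CI {low:.4f} to {high:.4f}")
```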




54. What you should know about confidence intervals: A confidence interval is a range of values used to estimate a population parameter such as the mean. A confidence level is the probability that the interval estimate will include the population parameter. Increasing the confidence level makes the interval wider (less precise). Increasing the sample size reduces the width of the interval (more precise).

55. Hypothesis Testing

56. Hypothesis Testing (2)









65. Alternate Hypothesis Tests

66. Testing for differences between samples

67. Testing for differences between samples (2)



70. Chi-Square Goodness of Fit: Allows hypothesis testing of nominal and ordinal data. Used to test whether a frequency distribution fits a predicted distribution. Hypotheses: H0: the actual distribution can be described by the expected distribution. Ha: the actual distribution differs from the expected distribution.

71. Chi-Square Goodness of Fit (2)
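
The goodness-of-fit statistic sums (observed - expected)^2 / expected over all categories and compares the total with a chi-square critical value. The Python sketch below uses invented counts for 60 rolls of a die. The critical value 11.070 (5 degrees of freedom, 5% significance level) is hard-coded from a standard chi-square table rather than computed.

```python
# Invented counts of faces 1..6 in 60 rolls of a die.
observed = [8, 12, 9, 11, 6, 14]

# H0: the die is fair, so each face is expected equally often.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# Chi-square statistic: sum of squared differences scaled by expectation.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of categories minus one.
degrees_of_freedom = len(observed) - 1

# Critical value for df = 5 at the 5% level, taken from a chi-square table.
CRITICAL_VALUE_DF5_ALPHA_05 = 11.070

reject_h0 = chi_square > CRITICAL_VALUE_DF5_ALPHA_05

print(chi_square, degrees_of_freedom, reject_h0)
```

For these counts the statistic is 4.2, well below the critical value, so H0 is not rejected.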





76. Where to go from here: Use Excel's Help resources to explore the various types of tests and statistics. A list of useful books and websites is provided with the handouts. Most importantly: have a statistician look at your results before publishing them.

77. Q&A

78. Online Resources

79. References

80. Thanks to everyone who made this presentation possible
