Data Analysis and Statistics in the Lab


    1. Data Analysis and Statistics in the Lab. Southern California Bioinformatics Summer Institute.

    2. Topics: Introduction; Data; Displaying Data; Descriptive Statistics; Inferential Statistics; Q&A; Wrap-up. ©Richard Johnston

    3. Objectives: Introduce commonly used statistical concepts and measures. Show how to compute statistical measures using Excel and R. Keep discussion of statistical theory to a minimum. Provide sample calculations in the downloaded materials.

    4. Asking questions of the data: Could I have gotten these results by chance? Is there a significant difference between these two samples? How can I describe my results? What can I say about the average of these measurements? What should I do with these outliers? … etc. Note: the answer to the first question is always yes. We need to quantify the likelihood.

    5. Tools used: Excel 2007; R (???)

    6. Excel 2007: What’s new (and different)? Mostly, the look and feel have changed. Up to 1,048,576 rows and 16,384 columns in a single worksheet; up to 32,767 characters in a single cell; more sorting options; enhanced data importing; improved PivotTables.

    7. What’s new (and different)? (2) More options for conditional formatting of cells; multithreaded calculation of formulae, to speed up large calculations, especially on multi-core/multi-processor systems; improved filtering; new charting features; … other changes to make Excel and other Office programs more Vista-like.

    8. What is R? R is a language and environment for statistical computing and graphics, widely used in research institutions. R provides a wide variety of statistical and graphical techniques, and is highly extensible. One of R's strengths is the ease with which publication-quality plots can be produced. R is available as free software and runs on Windows, MacOS, and a wide variety of UNIX platforms.

    9. Types of data: Primary (collected by you) or Secondary (obtained from another source); Observational or Experimental; Quantitative or Qualitative. Quantitative data uses numerical values to describe something (e.g., weight, temperature). Qualitative data uses descriptive terms to classify something (e.g., gender, color).

    10. Types of Measurement Scales. Nominal (qualitative): examples are gender, color, species name, … Ordinal (qualitative or quantitative): allows rank ordering of values. Examples: grades A-F; rating levels 1 through 5; “Slow”, “Medium”, “Fast”.

    11. Types of Measurement Scales (2). Interval (quantitative): allows addition and subtraction, but not multiplication and division, because there is no true zero point. Example: temperature measured in degrees Fahrenheit. 100 degrees F is 50 degrees more than 50 degrees F, but 100 degrees F is not twice as hot as 50 degrees F.

    12. Types of Measurement Scales (3). Ratio (quantitative): allows addition, subtraction, multiplication, and division. Has a true zero point: a value of zero means the absence of the measured quantity. Examples: weight, age, or speed. To decide if a measurement is Interval or Ratio, see if the phrase “Twice as…” makes sense, e.g., twice as heavy, old, or fast.
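
The “twice as” test can be checked numerically. The sketch below is illustrative only: it assumes the standard Fahrenheit-to-kelvin conversion (the Kelvin scale has a true zero, so it is a ratio scale), and the function name `f_to_kelvin` is a made-up helper, not something from the slides.

```r
# Convert degrees Fahrenheit (interval scale) to kelvins (ratio scale).
f_to_kelvin <- function(f) (f - 32) * 5 / 9 + 273.15

# Differences are meaningful on an interval scale.
diff_f <- 100 - 50                 # 50 degrees F warmer

# Ratios are only meaningful on a ratio scale, so compare in kelvins.
naive_ratio <- 100 / 50            # 2, an artifact of Fahrenheit's arbitrary zero
true_ratio  <- f_to_kelvin(100) / f_to_kelvin(50)

print(c(naive = naive_ratio, kelvin = true_ratio))
```

The kelvin ratio comes out only slightly above 1, which is why “twice as hot” does not make sense for Fahrenheit readings.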

    13. Displaying Data: Charts (Pie, Column, Line, Scatter); Histograms.

    14. Pie Chart

    15. Column Chart

    16. Line Chart

    17. Scatter (X-Y) Chart

    18. Histograms

    19. Frequency Distribution Histogram Using Excel. Open AspirinStudyData.xlsx or .xls. Create bin values 20, 25, …, 100. Select Data Analysis. Select Histogram. Click Input Range and select the Age data. Click Bin Range and select the bin values. Check Labels and Chart Output. Click OK.

    20. Frequency Distribution Histogram Using R. Start R. Select File > Change dir… and browse to the default directory. Type the following (including capital letters):

    21. Frequency Distribution Histogram Using R
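
The commands shown on the original slides are not preserved in this transcript. As a rough sketch of how such a histogram could be produced in R, the example below uses simulated ages as a stand-in for the Age column of AspirinStudyData (the simulation parameters are assumptions, not the study data) and the same bin values as the Excel example.

```r
# Simulated stand-in for the Age column; NOT the real study data.
set.seed(1)
Age <- pmin(pmax(round(rnorm(500, mean = 55, sd = 12)), 21), 100)

# Bin edges 20, 25, ..., 100, matching the bin values used in Excel.
bins <- seq(20, 100, by = 5)

# Draw the frequency histogram and keep the computed counts.
h <- hist(Age, breaks = bins, main = "Age distribution", xlab = "Age (years)")

# Tabulate the count that falls at or below each upper bin edge.
print(data.frame(upper = bins[-1], count = h$counts))
```

With real data, the simulated vector would be replaced by a column read from a file, for example with `read.csv()` on a CSV export of the workbook.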

    22. Descriptive Statistics. Measures of Central Tendency: Mean, Median, Mode. Measures of Dispersion: Range, Variance, Standard Deviation, Quartiles, Interquartile Range.

    23. Descriptive Statistics using Excel. Use built-in functions to perform basic analyses: Formulas | More Functions | Statistical. Use the Data Analysis Add-in for more complex analyses: Open AspirinStudyData.xlsx. Select Data | Data Analysis. Select Descriptive Statistics. Select the Age data. Check Labels in first row. Check Summary Statistics.

    24. Descriptive Statistics using Excel

    25. Descriptive Statistics using R. Type the following:

    26. Central Tendency. Mean: the arithmetic average of the values. Median: the midpoint of the values (half are higher and half are lower). If there is an even number of values, the median is the average of the two middle values. Mode: the value that occurs most frequently. There may be more than one mode.
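
A small sketch of these three measures in R, using made-up values. Base R has `mean()` and `median()` but no built-in function for the statistical mode, so the tabulation approach below is just one possible way to find it.

```r
x <- c(2, 3, 3, 5, 7, 8, 8, 10)   # illustrative values, even count

mean_x   <- mean(x)     # arithmetic average
median_x <- median(x)   # average of the two middle values (5 and 7)

# Tabulate the values and keep every value that reaches the top frequency.
counts <- table(x)
mode_x <- as.numeric(names(counts)[counts == max(counts)])

print(list(mean = mean_x, median = median_x, mode = mode_x))
```

Here both 3 and 8 occur twice, so the data has two modes.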

    27. Measures of Dispersion. Range: gives an idea of the spread of values, but depends on only two of them, the largest and the smallest. Variance: averages the squared deviations of each value from the mean. Standard Deviation: calculated by taking the square root of the variance. More useful than the variance since it’s in the same measurement units as the data.

    28. Measures of Dispersion (2). Quartiles: divide the ordered data into four segments, each containing a quarter of the values.

    29. Measures of Dispersion (3). Interquartile Range (IQR): measures the spread of the center 50% of the data. IQR = Q3 − Q1. Used to help identify outliers (more on this later).
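
These measures map onto base R functions as sketched below, using made-up values. One caveat worth knowing: `var()` and `sd()` compute the sample versions, dividing by n − 1 rather than n, and `quantile()` uses R's default interpolation method, so results can differ slightly from other tools.

```r
x <- c(4, 7, 9, 10, 12, 15, 20)   # illustrative values

range_x <- diff(range(x))   # largest minus smallest
var_x   <- var(x)           # sample variance (divides by n - 1)
sd_x    <- sd(x)            # square root of the variance
q       <- quantile(x)      # 0%, 25%, 50%, 75%, 100%
iqr_x   <- IQR(x)           # Q3 - Q1

print(list(range = range_x, variance = var_x, sd = sd_x, quartiles = q, IQR = iqr_x))
```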

    30. Measures of Dispersion (4). Predicting the distribution of values. Empirical rule for “bell-shaped” curves: approximately 68% of the values will fall within 1 SD of the mean, 95% within 2 SD of the mean, and 99.7% within 3 SD of the mean.
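
The empirical rule can be checked against the normal distribution directly. This short sketch uses `pnorm()`, the cumulative distribution function of the standard normal, to compute the area within 1, 2, and 3 standard deviations of the mean.

```r
k <- 1:3

# Area under the standard normal curve between -k and +k standard deviations.
within_k_sd <- pnorm(k) - pnorm(-k)

print(round(within_k_sd, 4))
```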

    31. Measures of Dispersion (5)

    32. Outliers. Outliers are values that are (or seem to be) out of line with the rest of the observations. Outliers can distort statistical measures. They may be indicative of transient errors in equipment or errors in transcription. They may also indicate a flaw in experimental assumptions. As we’ve just shown, about 3 observations out of 1,000 can be expected to be more than 3 SD from the mean in normally distributed data. Outliers that can’t be readily explained should receive careful attention.

    33. Identifying Outliers. Quartiles and boxplots can help identify outliers. In R, type boxplot(Age). The middle box is the IQR. The horizontal line inside the box is the median. The whiskers extend to the most extreme values that lie within 1.5 x IQR of the box; points beyond the whiskers are plotted as outliers. In this example, four values are outliers.
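
As a sketch of what the boxplot is doing, the example below uses `boxplot.stats()`, the function `boxplot()` relies on, with made-up values rather than the Age data. The fence calculation is written out by hand for illustration.

```r
x <- c(10, 12, 13, 14, 15, 16, 17, 18, 40)   # illustrative values

# stats$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker.
stats  <- boxplot.stats(x, coef = 1.5)
hinges <- stats$stats[c(2, 4)]

# Values beyond 1.5 box-lengths from the hinges are flagged as outliers.
box_length <- diff(hinges)
fences     <- c(hinges[1] - 1.5 * box_length, hinges[2] + 1.5 * box_length)
outliers   <- stats$out

print(list(hinges = hinges, fences = fences, outliers = outliers))
boxplot(x, main = "Illustrative boxplot")
```

Note that R's boxplot uses hinges, which are close to but not always identical to the quartiles returned by `quantile()`.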

    34. The Normal Distribution

    35. The Normal Distribution (2). The mean, median, and mode are the same value. The distribution is bell shaped and symmetrical around the mean. The total area under the curve is equal to 1. The left and right sides extend indefinitely. Notation: x = the normally distributed random variable of interest; µ = the mean of the normal distribution; σ = the standard deviation of the normal distribution; z = the number of standard deviations between x and µ, that is z = (x − µ)/σ, otherwise known as the standard z-score.

    36. The Normal Distribution (3)

    37. The Standard Normal Distribution. The standard normal distribution is a normal distribution with µ = 0 and σ = 1. The total area under the standard normal curve is equal to 1.
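
A brief sketch of how a z-score links any normal distribution to the standard normal one. The mean, standard deviation, and observation below are made-up numbers chosen for easy arithmetic.

```r
mu    <- 100   # illustrative mean
sigma <- 15    # illustrative standard deviation
x     <- 130   # illustrative observation

# Number of standard deviations between x and the mean.
z <- (x - mu) / sigma

# The area to the left of x is the same whether computed on the
# original scale or on the standard normal scale.
p_direct   <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm(z)

print(c(z = z, p_direct = p_direct, p_standard = p_standard))
```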

    38. The Standard Normal Distribution (2)

    39. Inferential Statistics: Sampling; Sampling Distributions; Confidence Intervals; Hypothesis Testing.

    40. Sampling. The term “population” in statistics represents all possible outcomes or measurements of interest in a particular study. A “sample” is a subset of the population that is representative of the whole population. Analysis of a sample allows us to infer characteristics of the entire population with a quantifiable degree of certainty.

    41. Sample Examples. (a) In the 1980s Harvard did a study of the effectiveness of aspirin in the prevention of heart attacks. They followed over 22,000 physicians for five years. Half of the physicians were given a daily dose of aspirin, and half were given a placebo. Neither the subjects nor the investigators knew which was being administered. (More on this later.) (b) A coin is flipped 20 times in each of 20 trials to determine whether it is “fair”. (c) Seeds are divided randomly into two groups and planted. One group receives fertilizer A and the other group receives fertilizer B. All other factors (light, water, etc.) are kept the same.

    42. Sample Examples (2). Patients at 6 US hospitals were randomly assigned to 1 of 3 groups: 604 received intercessory prayer after being informed that they may or may not receive prayer; 597 did not receive intercessory prayer, also after being informed that they may or may not receive prayer; and 601 received intercessory prayer after being informed they would receive prayer. Intercessory prayer was provided for 14 days, starting the night before coronary artery bypass graft surgery (CABG). The primary outcome was presence of any complication within 30 days of CABG. Secondary outcomes were any major event and mortality. (American Heart Journal, 2006)

    43. Sample Size. Several factors contribute to the determination of the sample size needed for a particular study: desired confidence level (95%, 99%); margin of error (5%, 3%); population size (results don’t change much for populations of 20,000 or more); expected proportion (p = q = 0.5 is conservative). For example, a 99% confidence level with a margin of error of 6% would require about 450 samples. Formulas for sample size vary, and are not presented here. Several online tools are available for determining sample sizes (e.g., http://www.raosoft.com/samplesize.html).
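
Although the slides do not give a formula, one commonly used version is the normal-approximation formula for a proportion with a finite population correction. The sketch below is an assumption about method, not necessarily what the slide or the linked tool uses; `sample_size` and its arguments are illustrative names.

```r
# Sample size for estimating a proportion (normal approximation),
# adjusted for a finite population of size N.
sample_size <- function(conf_level, margin, N, p = 0.5) {
  z  <- qnorm(1 - (1 - conf_level) / 2)    # two-sided critical value
  n0 <- z^2 * p * (1 - p) / margin^2       # size for an infinite population
  n0 / (1 + (n0 - 1) / N)                  # finite population correction
}

# 99% confidence, 6% margin of error, population of 20,000, p = q = 0.5.
n <- sample_size(conf_level = 0.99, margin = 0.06, N = 20000)
print(ceiling(n))
```

Under these assumptions the result lands close to the figure of about 450 quoted in the slide.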

    44. Aspirin Study Scenario. Your company has just completed a five-year study on the effectiveness of aspirin in preventing heart attacks. Five hundred physician volunteers were divided randomly into two groups. One group received 325 mg of aspirin every other day. The other group received a placebo instead of aspirin. Neither the subjects nor the test administrators knew whether aspirin or placebo was being administered. The subjects were monitored for five years to determine whether or not they experienced a heart attack.

    45. Aspirin Study Data

    46. Exploring the Data

    47. Exploring the Data (2)

    48. Confidence Intervals. If we perform an experiment such as flipping a coin, we can count the number of “successes” (i.e., heads) in a number of trials to get an estimate of the underlying probability that a flip of the coin will result in heads. The larger the number of trials, the more confident we are that the true probability is within a certain range, or confidence interval. The number of successes across repeated experiments such as this is approximately normally distributed (the familiar bell-shaped curve) when the number of trials is large. Assuming a normal distribution of results allows us to calculate statistical characteristics for a wide variety of experiments, including clinical trials.
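
A sketch of the coin-flip case, using the normal approximation described above. The counts are hypothetical, and the interval shown is the simple textbook form; other interval methods exist and can give slightly different limits.

```r
heads <- 60    # hypothetical number of heads
flips <- 100   # hypothetical number of flips

p_hat <- heads / flips                      # estimated probability of heads
se    <- sqrt(p_hat * (1 - p_hat) / flips)  # standard error of the estimate
z     <- qnorm(0.975)                       # critical value for 95% confidence

# 95% confidence interval: estimate plus or minus z standard errors.
ci <- p_hat + c(-1, 1) * z * se
print(round(ci, 3))
```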

    49. Confidence Intervals (2). The “true” chance of attack in the Placebo Group is referred to as p1. Similarly, the “true” chance of attack in the Aspirin Group is referred to as p2. Our objective is to estimate the true difference of p1 and p2 using the results of this study.
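
The study's actual counts are not preserved in this transcript, so the sketch below uses hypothetical numbers (two groups of 250, with made-up attack counts) purely to show the shape of the calculation. It computes a normal-approximation interval for p1 − p2 by hand and compares it with R's `prop.test()` without continuity correction.

```r
# Hypothetical counts; NOT the results of the study.
attacks <- c(placebo = 30, aspirin = 15)
n       <- c(placebo = 250, aspirin = 250)

p        <- attacks / n                        # observed proportions
diff_hat <- p[["placebo"]] - p[["aspirin"]]    # estimate of p1 - p2
se       <- sqrt(sum(p * (1 - p) / n))         # standard error of the difference

# 95% confidence interval computed by hand.
ci_manual <- diff_hat + c(-1, 1) * qnorm(0.975) * se

# The same interval from prop.test, with the continuity correction turned off.
ci_prop <- prop.test(attacks, n, correct = FALSE)$conf.int

print(round(ci_manual, 4))
print(round(as.numeric(ci_prop), 4))
```

If the interval excludes zero, the data are inconsistent (at that confidence level) with the two groups having the same chance of attack.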

    50. Confidence Intervals (3)

    51. Confidence Intervals (4)

    52. Confidence Intervals (5)

    53. Confidence Intervals (6)

    54. What you should know about confidence intervals. A confidence interval is a range of values used to estimate a population parameter such as the mean. The confidence level is the probability that the procedure produces an interval that includes the population parameter. Increasing the confidence level makes the interval wider (less precise). Increasing the sample size reduces the width of the interval (more precise).
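
The last two points can be seen numerically. This sketch assumes the normal-approximation interval for a proportion; `ci_width` is an illustrative helper and the sample sizes are arbitrary.

```r
# Width of a normal-approximation confidence interval for a proportion.
ci_width <- function(n, conf_level, p_hat = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  2 * z * sqrt(p_hat * (1 - p_hat) / n)
}

widths <- c(
  n100_95 = ci_width(100, 0.95),   # baseline
  n100_99 = ci_width(100, 0.99),   # higher confidence level: wider
  n400_95 = ci_width(400, 0.95)    # four times the sample: half the width
)
print(round(widths, 3))
```

Because the width shrinks with the square root of the sample size, halving the width takes four times as many observations.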

    55. Hypothesis Testing

    56. Hypothesis Testing (2)

    57. Hypothesis Testing (3)

    58. Hypothesis Testing (4)

    59. Hypothesis Testing (5)

    60. Hypothesis Testing (6)

    61. Hypothesis Testing (7)

    62. Hypothesis Testing (8)

    63. Hypothesis Testing (9)

    64. Hypothesis Testing (10)

    65. Alternate Hypothesis Tests

    66. Testing for differences between samples

    67. Testing for differences between samples (2)

    68. Testing for differences between samples (3)

    69. Testing for differences between samples (4)

    70. Chi-Square Goodness of Fit. Allows hypothesis testing of nominal and ordinal data. Used to test whether a frequency distribution fits a predicted distribution. Hypotheses: H0: the actual distribution can be described by the expected distribution. Ha: the actual distribution differs from the expected distribution.
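
A minimal sketch of a goodness-of-fit test in R using `chisq.test()`. The data are hypothetical counts from 60 rolls of a die, tested against the expectation that each face is equally likely; this is not the example worked on the original slides.

```r
# Hypothetical observed counts for faces 1 to 6 over 60 rolls.
observed   <- c(8, 12, 9, 11, 10, 10)

# Expected distribution under H0: a fair die.
expected_p <- rep(1 / 6, 6)

result <- chisq.test(observed, p = expected_p)

print(result$statistic)   # sum of (observed - expected)^2 / expected
print(result$parameter)   # degrees of freedom: categories minus 1
print(result$p.value)     # a large p-value gives no evidence against H0
```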

    71. Chi-Square Goodness of Fit (2)

    72. Chi-Square Goodness of Fit (3)

    73. Chi-Square Goodness of Fit (4)

    74. Chi-Square Goodness of Fit (5)

    75. Chi-Square Goodness of Fit (6)

    76. Where to go from here. Use Excel’s Help resources to explore the various types of tests and statistics. A list of useful books and websites is provided with the handouts. Most importantly, have a statistician look at your results before publishing them.

    77. Q&A

    78. Online Resources

    79. References

    80. Thanks to everyone who made this presentation possible