1. Data Analysis and Statistics in the Lab
Southern California Bioinformatics Summer Institute
2. Topics
Introduction
Data
Displaying Data
Descriptive Statistics
Inferential Statistics
Q&A
Wrap-up
2 ©Richard Johnston
3. Objectives
Introduce commonly used statistical concepts and measures
Show how to compute statistical measures using Excel and R
Minimize discussion of statistical theory as much as possible.
Provide sample calculations in the downloaded materials
4. Asking questions of the data
Could I have gotten these results by chance?
Is there a significant difference between these two samples?
How can I describe my results?
What can I say about the average of these measurements?
What should I do with these outliers?
… etc.
The answer to the first question is always yes. We need to quantify the likelihood.
5. Tools used
Excel 2007
R (???)
6. Excel 2007: What’s new (and different)?
Mostly, the look and feel have changed.
Up to 1,048,576 rows and 16,384 columns in a single worksheet
Up to 32,767 characters in a single cell
More sorting options
Enhanced data importing
Improved PivotTables
7. What’s new (and different)? (2)
More options for conditional formatting of cells
Multithreaded calculation of formulae, to speed up large calculations, especially on multi-core/multi-processor systems.
Improved filtering
New charting features
… Other changes to make Excel and other Office programs more Vista-like
8. What is R?
R is a language and environment for statistical computing and graphics, widely used in research institutions.
R provides a wide variety of statistical and graphical techniques, and is highly extensible.
One of R's strengths is the ease with which publication-quality plots can be produced.
R is available as free software.
R runs on Windows, MacOS, and a wide variety of UNIX platforms.
9. Types of data
Primary (collected by you) or Secondary (obtained from another source)
Observational or Experimental
Quantitative or Qualitative
Quantitative data uses numerical values to describe something (e.g., weight, temperature)
Qualitative data uses descriptive terms to classify something (e.g., gender, color)
10. Types of Measurement Scales
Nominal (Qualitative)
Examples are gender, color, species name, …
Ordinal (Qualitative or quantitative)
Allows rank ordering of values
Examples:
Grades A-F
Rating Level 1 through 5
“Slow”, “Medium”, “Fast”
11. Types of Measurement Scales (2)
Interval (Quantitative)
Allows addition and subtraction, but not multiplication and division
No real zero point
Example: Temperature measurement in degrees Fahrenheit
100 degrees F is 50 more than 50 degrees F
100 degrees F is not twice as hot as 50 degrees F
12. Types of Measurement Scales (3)
Ratio (Quantitative)
Allows addition, subtraction, multiplication and division
Has a true zero point
A value zero means the absence of the measured quantity
Examples: Weight, age, or speed
To decide if a measurement is Interval or Ratio, see if the phrase “Twice as…” makes sense:
e.g., twice as (heavy, old, fast)
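The “twice as” test can be checked numerically. The sketch below converts the two Fahrenheit readings from the interval-scale example to kelvins, a ratio scale with a true zero. The helper name `f_to_kelvin` is introduced here for illustration and is not part of the workshop materials.

```r
# Convert degrees Fahrenheit (interval scale) to kelvins (ratio scale).
f_to_kelvin <- function(deg_f) (deg_f - 32) * 5 / 9 + 273.15

temps_f <- c(50, 100)
temps_k <- f_to_kelvin(temps_f)

# Differences are meaningful on an interval scale.
diff_f <- temps_f[2] - temps_f[1]        # 50 degrees

# Ratios are only meaningful on a ratio scale.
naive_ratio <- temps_f[2] / temps_f[1]   # 2, but physically meaningless
true_ratio  <- temps_k[2] / temps_k[1]   # about 1.10

print(c(diff_f = diff_f, naive_ratio = naive_ratio, true_ratio = true_ratio))
```

On the kelvin scale, 100 degrees F is only about 10% “hotter” than 50 degrees F, which is why the Fahrenheit ratio should not be interpreted.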
13. Displaying Data
Charts
Pie
Column
Line
Scatter
Histograms
14. Pie Chart
15. Column Chart
16. Line Chart
17. Scatter (X-Y) Chart
18. Histograms
19. Frequency Distribution Histogram Using Excel
Open AspirinStudyData.xlsx or .xls
Create bin values 20, 25, …, 100
Select Data Analysis
Select Histogram
Click Input Range and select the Age data
Click Bin Range and select the bin values.
Check Labels and Chart Output
Click OK
20. Frequency Distribution Histogram Using R
Start R
Select File>Change dir…
Browse to default directory
Type the following (including capital letters):
21. Frequency Distribution Histogram Using R
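The commands shown on the original slides did not survive in this transcript. As a stand-in, here is a minimal sketch of a frequency histogram in R using the same bin edges as the Excel example. The `Age` values are made up for illustration; in the workshop they would come from the aspirin study data file.

```r
# Hypothetical ages for illustration only (not the workshop data).
Age <- c(23, 31, 35, 38, 42, 44, 45, 47, 51, 52, 55, 58, 61, 63, 67, 72, 78, 85)

# Bin edges 20, 25, ..., 100, matching the bins used in the Excel example.
bins <- seq(20, 100, by = 5)

# hist() draws the chart and also returns the bin counts.
h <- hist(Age, breaks = bins, main = "Age distribution", xlab = "Age (years)")

# Frequency table; R's default bins are closed on the right,
# so a value equal to a bin edge is counted in the lower bin.
print(data.frame(upper_edge = bins[-1], count = h$counts))
```

Because R is case-sensitive, `Age` and `age` would be two different objects, which is why the slide asks you to type capital letters exactly as shown.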
22. Descriptive Statistics
Measures of Central Tendency
Mean
Median
Mode
Measures of Dispersion
Range
Variance
Standard Deviation
Quartiles
Interquartile Range
23. Descriptive Statistics using Excel
Use built-in functions to perform basic analyses: Formulas|More Functions|Statistical
Use the Data Analysis Add-in for more complex analyses:
Open AspirinStudyData.xlsx
Select Data|Data Analysis
Select Descriptive Statistics
Select the Age data
Check Labels in first row
Check Summary Statistics
24. Descriptive Statistics using Excel
25. Descriptive Statistics using R
Type the following:
26. Central Tendency
Mean
Arithmetic average of the values
Median
Midpoint of the values (half are higher and half are lower)
If there is an even number of values, the median is the average of the two middle values.
Mode
The value that occurs most frequently.
There may be more than one mode.
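A short sketch of the three measures in R, using made-up values. Base R provides `mean()` and `median()` but no function for the statistical mode, so the mode is derived here from a frequency table; that approach is one of several possible.

```r
# Hypothetical measurements for illustration only.
values <- c(4, 7, 7, 9, 10, 12, 12, 15)

mean_value   <- mean(values)     # arithmetic average
median_value <- median(values)   # average of the two middle values when n is even

# Tabulate the values and keep every value that reaches the highest count.
counts <- table(values)
modes  <- as.numeric(names(counts)[counts == max(counts)])

print(list(mean = mean_value, median = median_value, modes = modes))
```

This data set has an even number of values, so the median falls between 9 and 10, and it has two modes (7 and 12). In Excel, the corresponding worksheet functions are AVERAGE, MEDIAN, and MODE.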
27. Measures of Dispersion
Range
Gives an idea of the spread of values, but depends on only two of them: the largest and the smallest.
Variance
Averages the squared deviations of each value from the mean (for a sample, the sum is divided by n − 1 rather than n).
Standard Deviation
Calculated by taking the square root of the variance
More useful than the variance since it’s in the same measurement units as the data.
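The following sketch computes the range, variance, and standard deviation in R on made-up values, and repeats the variance by hand to show the n − 1 divisor that `var()` and `sd()` use.

```r
# Hypothetical measurements for illustration only.
values <- c(4, 7, 7, 9, 10, 12, 12, 15)

value_range <- max(values) - min(values)   # largest minus smallest
sample_var  <- var(values)                 # sample variance (divides by n - 1)
sample_sd   <- sd(values)                  # square root of the sample variance

# The same variance computed step by step.
n <- length(values)
var_by_hand <- sum((values - mean(values))^2) / (n - 1)

print(c(range = value_range, variance = sample_var, sd = sample_sd))
```

Note that R's `range()` function returns the smallest and largest values rather than their difference, hence the explicit subtraction.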
28. Measures of Dispersion (2)
Quartiles
Divide the ordered data into four equal segments
29. Measures of Dispersion (3)
Interquartile Range (IQR)
Measures the spread of the center 50% of the data
IQR = Q3 − Q1
Used to help identify outliers (more on this later)
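Quartiles and the IQR can be computed in R as sketched below, again on made-up values. Statistical packages use slightly different conventions for interpolating quartiles, so results for small samples may differ a little between tools; the sketch relies on R's default.

```r
# Hypothetical measurements for illustration only.
values <- c(4, 7, 7, 9, 10, 12, 12, 15)

# Q1, median (Q2), and Q3 using R's default quantile method.
q <- quantile(values, probs = c(0.25, 0.5, 0.75))

# IQR() returns Q3 - Q1 directly.
iqr_value <- IQR(values)

print(q)
print(iqr_value)
```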
30. Measures of Dispersion (4)
Predicting the distribution of values
Empirical rule for “Bell Shaped” curves:
Approximately
68% of the values will fall within 1 SD of the mean,
95% will fall within 2 SD of the mean, and
99.7% will fall within 3 SD of the mean
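The percentages in the empirical rule can be reproduced from the normal distribution itself. This sketch uses R's `pnorm()`, the cumulative normal probability, to compute the share of values within 1, 2, and 3 SD of the mean.

```r
# Share of a normal distribution lying within k standard deviations of the mean.
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)   # about 0.6827, 0.9545, 0.9973

# Expected number of observations per 1,000 that fall more than 3 SD from the mean.
expected_beyond_3sd <- (1 - within_k_sd[3]) * 1000   # about 2.7

print(round(within_k_sd, 4))
print(round(expected_beyond_3sd, 1))
```

The last figure is where the rule of thumb “about 3 in 1,000 observations lie beyond 3 SD” comes from; it holds only when the data are approximately normal.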
31. Measures of Dispersion (5)
32. Outliers
Outliers are values that are (or seem to be) out of line with the rest of the observations.
Outliers can distort statistical measures.
They may be indicative of transient errors in equipment or errors in transcription.
They may also indicate a flaw in experimental assumptions.
As we’ve just shown, about 3 observations out of 1,000 can be expected to be more than 3 SD from the mean when the data are normally distributed.
Outliers that can’t be readily explained should receive careful attention.
33. Identifying Outliers
Quartiles and boxplots can help identify outliers.
In R, type
boxplot(Age)
The middle box is the IQR.
The horizontal line is the median.
The whiskers extend to the most extreme values within 1.5 × IQR of the box.
In this example, four values are outliers.
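To see which values a boxplot flags, the numbers behind the plot can be pulled out with `boxplot.stats()`. The sketch below uses a made-up `Age` vector (so it flags a different number of outliers than the slide's data) and recomputes the 1.5 × box-height fences by hand.

```r
# Hypothetical ages for illustration only (not the workshop data).
Age <- c(41, 44, 45, 47, 48, 50, 51, 52, 53, 55, 56, 58, 60, 92)

# boxplot() draws the plot; boxplot.stats() returns the numbers behind it.
boxplot(Age, ylab = "Age (years)")
stats <- boxplot.stats(Age)

# The same fences by hand: 1.5 x the box height beyond each edge of the box.
hinges     <- fivenum(Age)[c(2, 4)]   # lower and upper edges of the box
box_height <- diff(hinges)
fences     <- c(hinges[1] - 1.5 * box_height, hinges[2] + 1.5 * box_height)
flagged    <- Age[Age < fences[1] | Age > fences[2]]

print(stats$out)   # values drawn as individual points on the boxplot
print(flagged)     # the same values found with the hand-computed fences
```

Flagging a value this way only marks it for a closer look; whether it is an error or a genuine observation still has to be decided from the experiment.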
34. The Normal Distribution
35. The Normal Distribution (2)
The mean, median, and mode are the same value
The distribution is bell shaped and symmetrical around the mean
The total area under the curve is equal to 1
The left and right sides extend indefinitely
x = the normally distributed random variable of interest
µ = the mean of the normal distribution
σ = the standard deviation of the normal distribution
z = the number of standard deviations between x and µ, otherwise known as the standard z-score
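From these definitions, the z-score is the distance between x and the mean divided by the standard deviation. A minimal sketch in R, with made-up values for the mean, standard deviation, and observation:

```r
# Hypothetical normal distribution and observation, for illustration only.
mu    <- 100   # mean
sigma <- 15    # standard deviation
x     <- 130   # observed value

# z-score: number of standard deviations between x and the mean.
z <- (x - mu) / sigma               # 2

# Probability of a value at or below x, computed two equivalent ways.
p_below        <- pnorm(z)                          # standard normal, about 0.977
p_below_direct <- pnorm(x, mean = mu, sd = sigma)   # same answer without standardizing

print(c(z = z, p_below = p_below))
```

Standardizing to z lets a single table (or a single call to `pnorm()`) serve every normal distribution, regardless of its mean and standard deviation.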
36. The Normal Distribution (3)
37. The Standard Normal Distribution
The standard normal distribution is a normal distribution with
µ = 0
σ = 1
The total area under the standard normal curve is equal to 1.
38. The Standard Normal Distribution (2)
39. Inferential Statistics
Sampling
Sampling Distributions
Confidence Intervals
Hypothesis Testing
40. Sampling
The term “population” in statistics represents all possible outcomes or measurements of interest in a particular study.
A “sample” is a subset of the population; to support valid conclusions, it should be representative of the whole population.
Analysis of a sample allows us to infer characteristics of the entire population with a quantifiable degree of certainty.
41. Sample Examples
In the 1980s Harvard did a study of the effectiveness of aspirin in the prevention of heart attacks. They followed over 22,000 physicians for five years. Half of the physicians were given a daily dose of aspirin, and half were given a placebo. Neither the subjects nor the investigators knew which was being administered. (More on this later.)
A coin is flipped 20 times in each of 20 trials to determine whether it is “fair”.
Seeds are divided randomly into two groups and planted. One group receives fertilizer A and the other group receives fertilizer B. All other factors (light, water, etc.) are kept the same.
42. Sample Examples (2)
Patients at 6 US hospitals were randomly assigned to 1 of 3 groups: 604 received intercessory prayer after being informed that they may or may not receive prayer; 597 did not receive intercessory prayer, also after being informed that they may or may not receive prayer; and 601 received intercessory prayer after being informed they would receive prayer. Intercessory prayer was provided for 14 days, starting the night before coronary artery bypass graft surgery (CABG). The primary outcome was presence of any complication within 30 days of CABG. Secondary outcomes were any major event and mortality. (American Heart Journal, 2006)
43. Sample Size
Several factors contribute to the determination of the sample size needed for a particular study:
Desired confidence level (95%, 99%)
Margin of error (5%, 3%)
Population size (Results don’t change much for populations of 20,000 or more)
Expected proportion (p = q = 0.5 is conservative)
For example, a 99% confidence level with a margin of error of 6% would require a sample of about 450.
Formulas for sample size vary, and are not presented here.
Several online tools are available for determining sample sizes (e.g., http://www.raosoft.com/samplesize.html)
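For readers who want to see how the four factors combine, here is a sketch of one commonly used formula for estimating a proportion (Cochran's formula with a finite-population correction). Choosing this formula is an assumption on our part, as is the population of 20,000; it is not necessarily what the linked calculator uses. The function name `sample_size` is hypothetical.

```r
# Sample size for estimating a proportion, with finite-population correction.
# NOTE: one common formula, chosen for illustration; other formulas exist.
sample_size <- function(conf_level, margin, population, p = 0.5) {
  z  <- qnorm(1 - (1 - conf_level) / 2)   # two-sided critical value
  n0 <- z^2 * p * (1 - p) / margin^2      # size needed for a very large population
  n0 / (1 + (n0 - 1) / population)        # shrink for a finite population
}

# 99% confidence, 6% margin of error, assumed population of 20,000, p = 0.5.
n <- sample_size(conf_level = 0.99, margin = 0.06, population = 20000)
print(ceiling(n))
```

Under these assumptions the result comes out at roughly 450, in line with the example above. Raising the confidence level or tightening the margin of error increases the required sample.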
44. Aspirin Study Scenario
Your company has just completed a five-year study on the effectiveness of aspirin in preventing heart attacks.
Five hundred physician volunteers were divided randomly into two groups. One group received 325 mg of aspirin every other day. The other group received a placebo instead of aspirin.
Neither the subjects nor the test administrators knew whether aspirin or placebo was being administered.
The subjects were monitored for five years to determine whether or not they experienced a heart attack.
45. Aspirin Study Data
46. Exploring the Data
47. Exploring the Data (2)
48. Confidence Intervals
If we perform an experiment such as flipping a coin, we can count the number of “successes” (i.e., heads) in a number of trials to get an estimate of the underlying probability that a flip of the coin will result in heads.
The larger the number of trials, the more confident we are that the true probability is within a certain range, or confidence interval.
The distribution of the number of successes in repeated experiments such as this approximates the familiar normal (bell-shaped) curve.
Assuming a normal distribution of results allows us to calculate statistical characteristics for a wide variety of experiments, including clinical trials.
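How close the coin-flip counts come to a bell curve can be checked directly. This sketch compares the exact probabilities of each number of heads in 20 fair flips (the binomial distribution) with a normal curve that has the same mean and standard deviation; the choice of 20 flips mirrors the coin example earlier in the deck.

```r
n_flips <- 20
p_heads <- 0.5
heads   <- 0:n_flips

# Exact probability of each possible number of heads.
exact <- dbinom(heads, size = n_flips, prob = p_heads)

# Normal curve with the same mean (n * p) and SD (sqrt(n * p * (1 - p))).
normal_approx <- dnorm(heads,
                       mean = n_flips * p_heads,
                       sd   = sqrt(n_flips * p_heads * (1 - p_heads)))

# Largest disagreement between the two across all outcomes.
max_gap <- max(abs(exact - normal_approx))

print(round(cbind(heads, exact, normal_approx), 4))
print(max_gap)
```

Even with only 20 flips the two columns agree closely, and the agreement improves as the number of trials grows.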
49. Confidence Intervals (2)
The “true” chance of attack in the Placebo Group is referred to as p1. Similarly, the “true” chance of attack in the Aspirin Group is referred to as p2.
Our objective is to estimate the true difference of p1 and p2 using the results of this study.
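The worked calculation on the following slides is not preserved in this transcript. As an illustration only, the sketch below estimates p1 − p2 and a normal-approximation confidence interval for it. The heart-attack counts are invented (they are not the study's data), and the function name `diff_prop_ci` is hypothetical.

```r
# Normal-approximation confidence interval for a difference of two proportions.
diff_prop_ci <- function(x1, n1, x2, n2, conf_level = 0.95) {
  p1 <- x1 / n1                                    # observed proportion, group 1
  p2 <- x2 / n2                                    # observed proportion, group 2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z  <- qnorm(1 - (1 - conf_level) / 2)            # two-sided critical value
  estimate <- p1 - p2
  c(estimate = estimate, lower = estimate - z * se, upper = estimate + z * se)
}

# INVENTED counts for illustration: 26 of 250 attacks on placebo, 14 of 250 on aspirin.
ci <- diff_prop_ci(x1 = 26, n1 = 250, x2 = 14, n2 = 250)
print(round(ci, 4))
```

If the interval excludes zero, the data suggest a real difference between the groups at the chosen confidence level. R's built-in `prop.test()` offers a ready-made version of this kind of comparison.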
50. Confidence Intervals (3)
51. Confidence Intervals (4)
52. Confidence Intervals (5)
53. Confidence Intervals (6)
54. What you should know about confidence intervals
A confidence interval is a range of values used to estimate a population parameter such as the mean.
A confidence level is the probability that the interval estimate will include the population parameter.
Increasing the confidence level makes the interval wider (less precise).
Increasing the sample size reduces the width of the interval (more precise).
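The last two points can be seen numerically. This sketch computes the width of a t-based confidence interval for a mean under three settings; the sample standard deviation of 10 and the sample sizes are made up, and `ci_width` is a hypothetical helper.

```r
# Width of a t-based confidence interval for a mean:
# 2 * critical value * (sample SD / sqrt(n)).
ci_width <- function(s, n, conf_level) {
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  2 * t_crit * s / sqrt(n)
}

s <- 10  # hypothetical sample standard deviation

widths <- c(
  level95_n25  = ci_width(s, n = 25,  conf_level = 0.95),  # baseline
  level99_n25  = ci_width(s, n = 25,  conf_level = 0.99),  # higher confidence level
  level95_n100 = ci_width(s, n = 100, conf_level = 0.95)   # larger sample
)

print(round(widths, 2))
```

Moving from 95% to 99% confidence widens the interval, while quadrupling the sample size roughly halves it.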
55. Hypothesis Testing
56. Hypothesis Testing (2)
57. Hypothesis Testing (3)
58. Hypothesis Testing (4)
59. Hypothesis Testing (5)
60. Hypothesis Testing (6)
61. Hypothesis Testing (7)
62. Hypothesis Testing (8)
63. Hypothesis Testing (9)
64. Hypothesis Testing (10)
65. Alternate Hypothesis Tests
66. Testing for differences between samples
67. Testing for differences between samples (2)
68. Testing for differences between samples (3)
69. Testing for differences between samples (4)
70. Chi-Square Goodness of Fit
Allows hypothesis testing of nominal and ordinal data.
Used to test whether a frequency distribution fits a predicted distribution.
Hypotheses:
H0: The actual distribution can be described by the expected distribution
Ha: The actual distribution differs from the expected distribution
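A goodness-of-fit test of this kind can be run in R with `chisq.test()`. The sketch below tests whether 60 rolls of a die are consistent with a fair die; the observed counts are invented for illustration and are not from the workshop materials.

```r
# INVENTED counts of each face in 60 rolls of a die.
observed <- c(8, 12, 9, 11, 10, 10)

# H0: every face is equally likely (expected proportion 1/6 each).
expected_p <- rep(1 / 6, 6)

result <- chisq.test(observed, p = expected_p)

# The statistic is the sum of (observed - expected)^2 / expected.
by_hand <- sum((observed - 60 * expected_p)^2 / (60 * expected_p))

print(result)
print(by_hand)
```

With these counts the statistic is 1 on 5 degrees of freedom and the p-value is well above 0.05, so H0 is not rejected: the rolls give no evidence that the die is unfair.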
71. Chi-Square Goodness of Fit (2)
72. Chi-Square Goodness of Fit (3)
73. Chi-Square Goodness of Fit (4)
74. Chi-Square Goodness of Fit (5)
75. Chi-Square Goodness of Fit (6)
76. Where to go from here
Use Excel’s Help resources to explore the various types of tests and statistics.
A list of useful books and websites is provided with the handouts.
Most importantly:
Have a statistician look at your results before publishing them.
77. Q&A
78. Online Resources
79. References
80. Thanks to everyone who made this presentation possible