Epidemiology and Applied Statistics Review Module 2 – Data Types & Applied Statistics

Download Presentation

Epidemiology and Applied Statistics Review Module 2 – Data Types & Applied Statistics

Loading in 2 Seconds...

- 144 Views
- Uploaded on
- Presentation posted in: General

Epidemiology and Applied Statistics Review Module 2 – Data Types & Applied Statistics

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Epidemiology and Applied Statistics Review Module 2 – Data Types & Applied Statistics

American College of Veterinary Preventive Medicine Review Course

Katherine Feldman, DVM, MPH, DACVPM

kfeldman@umd.edu

301-314-6820

- Students review modules on their own
- Send questions by email to Katherine Feldman (kfeldman@umd.edu) by Friday March 23 a.m.
- Conference call Friday March 23 2-3 p.m.
- Watch email and Blackboard for conference call details

- Gordis L. Epidemiology, 3rd ed. Elsevier Saunders, Philadelphia, 2004.
- $47.95 from Amazon.com

- Norman GR, Streiner DL. PDQ statistics, 3rd ed. BC Decker Inc., Hamilton, 2003.
- $17.79 from Amazon.com

- Independent variables
- The characteristic being observed or measured that is hypothesized to influence an event of manifestation
- E.g., Risk factors

- Dependent variables
- The value of which is dependent on the effect of other variable(s). A manifestation or outcome whose variation we seek to explain or account for by the influence of independent variables.
- E.g., Disease outcome

- Continuous
- Quantitative with potentially infinite number of values along continuum
- Can be measured to as many decimal places as measuring instrument allows
- E.g., Weight, height

- Discrete
- Count – quantitative data that can be arranged into discrete, naturally occurring or arbitrarily selected groups or sets of values e.g., pulse rate
- Categorical
- Nominal – qualitative, named category; the order of the categories is irrelevant to statistical analyses e.g., gender, reproductive status
- Ordinal – ordered categories, qualitative e.g., disease staging in cancer, education level

Once we take a sample from a population we have two goals:

- Summarize and describe the data
- Test hypotheses and make inferences about the population based on what has been observed in the sample

- Descriptive statistics
- Communicate results without attempting to generalize
- Important first step in epidemiologic studies

- Inferential statistics
- Used to infer the likelihood that the observed results can be generalized to other samples of individuals

- Data is distributed in some manner among various categories or throughout possible values
- When values are plotted, we get a distribution
- Frequency histogram
- Frequently used in epidemiology
- We can see proportion of the total sample in each category
- E.g., If we want to know the probability that entrants are age 78 or 79 years, then add up the probabilities in those categories

- Mean
- The average, determined by adding all values and dividing by total number of subjects

- Mode
- The most common value in the data

- Median
- Value in dataset where ½ subjects are smaller and ½ are larger
- List data in ascending order
- Find the median location as (n+1) / 2

- Value in dataset where ½ subjects are smaller and ½ are larger

- For a symmetrical distribution, the mean, median, and mode all occur at same point

- The median is less sensitive to extreme observations than the mean
- The mean uses all data and has nicer statistical properties than the median
- The mode is mainly useful for nominal variables

Modeapproximately $50,000

Medianapproximately $60,000

Meanapproximately $70,000

Skew

- Need to be able to measure the extent to which individual values differ from mean
- Range
- The difference between the highest and lowest values

- Variance
- Average squared deviation of each value from the mean
Σ(Individual value – mean value)2

Number of values - 1

- Because variance is reported in squared units, take square root of the variance and report standard deviation

- Average squared deviation of each value from the mean
- Standard deviation (SD)
- Average measure of how individual values differ from the mean
- The smaller the SD, the less each score varies from the mean
- The larger the spread of scores, the larger the SD.
___________________________

SD = Σ(Individual value – mean value)2

Number of values - 1

- When reporting estimates of central tendency, report measure of dispersion, e.g., mean ± SD

√

- Many variables tend to follow bell-shaped distribution
- Most values clustered symmetrically near mean
- Few values falling in the tails
- Shape of curve can be expressed in terms of mean and SD
- 68% are in 1 SD of mean
- 95.5% are in 2 SDs
- 2.3% in each tail

- Epidemiologists never achieve degree of control possible in experimental settings.
- While our results may reflect the truth, it is also possible that there are alternative explanations
- Findings are due to random error (chance)
- Findings are due to systematic error (bias)
- Findings are confounded by other variables that were unmeasured or uncontrolled

- A principal assumption underlying use of measures of disease frequency is that we can make inferences to the population based on a sample
- Because of random variation from sample to sample, the observed results will probably reflect the play of chance

- We can quantify the degree to which chance variability may account for the results observed in any individual study
- By performing appropriate test of statistical significance and determining the p-value

- Performing a test of statistical significance to determine likelihood that sampling variability (chance) explains the observed results
- Make explicit statement of hypothesis to be tested
- Null hypothesis (H0)
- Always the hypothesis of no difference
- The assertion that there is no association between exposure and disease, e.g., RR = 1, OR = 1

- Alternative hypothesis (H1 or HA)
- The assertion that there is some association between exposure and disease, e.g., RR ≠ 1, OR ≠ 1

- Null hypothesis (H0)

- Will vary by study design, data type and situation
- Generates a test statistic that is a function of
- The difference between observed values in the study and expected values if null hypothesis were true, and
- The variability in the sample

- Will lead to a probability statement (p-value)

- Probability that an effect at least as extreme as that observed in a particular study could have occurred by chance alone, given H0 is true
- The larger the test statistic, the lower the p-value
- Convention in medical research is when p ≤ 0.05, then association between exposure and disease is statistically significant
- There is no more than a 5% (1 in 20) probability of observing results as extreme as that observed due solely to chance

- If p > 0.05, then chance cannot be excluded as a likely explanation

- Parametric test for differences between means of independent samples
- Continuous data
- H0: mean1 = mean2
HA: mean1 ≠ mean2

- Test whether observed differences in proportions between study groups are statistically significant
- I.e., Whether there is an association between exposure and outcome
- Categorical data
- H0: proportions are equal; no association
HA: proportions are different; there is an association

- O = observed count in a category
- E = Expected count in that category under the null hypothesis
- Expected counts, E, can be determined by
- R = Row total
- C = Column total
- T = Table total

- One of the major determinants of the degree to which chance affects the findings in any particular study is sample size
- In general, the smaller the sample from which our inference is made, the more variability there will be in the estimates and the less likely the findings will reflect the experience of the total population
- Conversely, the larger the sample on which the estimate is based, the less variability and the more reliable the inference

- p-values are composite measures that reflect
- Magnitude of the difference between the groups, AND
- Sample size

- Therefore, if sample size is sufficiently large then small differences may be statistically significant
- Conversely, large effects may not achieve statistical significance if sample size is not sufficient
- It is not possible to determine the contribution of the sample size just by looking at the p-value

- A related but more informative measure of the role of chance is the confidence interval (CI)
- The range within which the true magnitude of effect lies with a certain degree of assurance

- When using 0.05 as your cutoff for statistical significance, then use corresponding 95% CI
- If null value (e.g., OR=1) is included in 95% CI, then the corresponding p-value is greater than 0.05
- E.g., 95% CI = (0.8, 1.7) NOT SIGNIFICANT

- If null value (e.g., OR=1) is not included in CI, then the p-value is less than 0.05 and the association is statistically significant
- E.g., 95% CI = (1.4, 4.8) SIGNIFICANT

- If null value (e.g., OR=1) is included in 95% CI, then the corresponding p-value is greater than 0.05

- Width of CI indicates amount of variability inherent in the estimate and thus the effect of sample size
- The larger the study sample, the more stable the estimate, and the narrower the CI
- The wider the CI, the greater the variability in the estimate and the smaller the sample size