Introduction to Statistics Dr Linda Morgan Clinical Chemistry Division School of Clinical Laboratory Sciences
Outline • Types of data • Descriptive statistics • Estimates and confidence intervals • Hypothesis testing • Comparing groups • Relation between variables • Statistical aspects of study design • Pitfalls
Types of data • Categorical data • Ordered categorical data • Numerical data • Discrete • Continuous
Descriptive statistics: categorical variables • Graphical representation – bar diagram • Numbers and proportions in each category
Descriptive statistics: continuous variables • Distributions • Gaussian • Lognormal • Non-parametric • Central tendency • Mean • Median • Scatter • Standard deviation • Range • Interquartile range
Gaussian (normal) distribution • Central tendency • Mean = Σx / n • Scatter • Variance = Σ(x − mean)² / (n − 1) • Standard deviation = √variance
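As a worked illustration, a minimal Python sketch of these definitions (the values are made up, not taken from the lecture data):

```python
import statistics

x = [6.7, 7.8, 8.1, 5.5, 8.6]           # hypothetical measurements
mean = statistics.mean(x)                # sum(x) / n
variance = statistics.variance(x)        # sum((xi - mean)**2) / (n - 1)
sd = statistics.stdev(x)                 # square root of the variance
print(mean, variance, sd)
```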
Lognormal distribution • Mean = Σ(log x) / n • Geometric mean = antilog of mean (10^mean) • Median • Rank data in order • Median = the (n + 1)/2 th ranked observation
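A short Python sketch of the geometric mean and median, again using made-up positive values:

```python
import math
import statistics

x = [1.2, 3.4, 2.8, 10.5, 5.1]                 # hypothetical positive values
log_x = [math.log10(v) for v in x]             # log-transform the data
geometric_mean = 10 ** statistics.mean(log_x)  # antilog (10^mean) of the mean log
median = statistics.median(x)                  # middle-ranked observation
print(geometric_mean, median)
```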
Variability • Variance = Σ(x − mean)² / (n − 1) • Standard deviation = √variance • Range • Interquartile range
Variability of the sample mean • The sample mean is an estimate of the population mean • The standard error of the mean (SEM) describes the distribution of the sample mean • Estimated SEM = SD / √n • The distribution of the sample mean is Normal providing n is large
Standard error of the difference between two means • SEM = SD / √n • Variance of the mean = SD² / n • Variance of the difference between two sample means = sum of the variances of the two means = (SD²/n)1 + (SD²/n)2 • SE of difference between means = √[(SD²/n)1 + (SD²/n)2]
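A Python sketch of the SEM and the SE of a difference between two independent sample means, using invented data:

```python
import math
import statistics

def sem(sample):
    """Standard error of the mean: SD / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

def se_difference(sample1, sample2):
    """SE of the difference between two means: sqrt(SD1^2/n1 + SD2^2/n2)."""
    v1 = statistics.variance(sample1) / len(sample1)
    v2 = statistics.variance(sample2) / len(sample2)
    return math.sqrt(v1 + v2)

a = [6.7, 7.8, 8.1, 5.5, 8.6]   # hypothetical group 1
b = [4.4, 7.0, 6.0, 5.8, 9.0]   # hypothetical group 2
print(sem(a), se_difference(a, b))
```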
Variability of a sample proportion • Assume a Normal distribution when np and n(1 − p) are both > 5 • SE of a binomial proportion = √(pq/n), where q = 1 − p
Standard error of the difference between two proportions • SE(p1 − p2) = √[variance(p1) + variance(p2)] = √[(p1 q1 / n1) + (p2 q2 / n2)]
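A sketch of both standard errors in Python, applied here to the smoking proportions (70/250 and 30/250) from the case-control example later in the lecture:

```python
import math

def se_proportion(p, n):
    """SE of a binomial proportion: sqrt(p*q/n), where q = 1 - p."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_proportions(p1, n1, p2, n2):
    """SE of (p1 - p2): sqrt(p1*q1/n1 + p2*q2/n2)."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

p1, p2 = 70 / 250, 30 / 250      # smokers among cases and among controls
print(se_proportion(p1, 250), se_diff_proportions(p1, 250, p2, 250))
```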
Confidence intervals of means • 95% CI for the mean = sample mean ± 1.96 × SEM • 95% CI for the difference between 2 means = (mean1 − mean2) ± 1.96 × SE of the difference
Confidence intervals of proportions • 95% CI for a proportion = p ± 1.96 × √(pq/n) • 95% CI for the difference between two proportions = (p1 − p2) ± 1.96 × SE(p1 − p2)
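The same quantities turned into 95% confidence intervals, as a Python sketch; the mean and SEM used below are the paired-difference values from the t test example that follows:

```python
import math

def ci_mean(mean, sem):
    """95% CI for a mean: mean +/- 1.96 * SEM."""
    return mean - 1.96 * sem, mean + 1.96 * sem

def ci_diff_proportions(p1, n1, p2, n2):
    """95% CI for p1 - p2: (p1 - p2) +/- 1.96 * SE(p1 - p2)."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - 1.96 * se, d + 1.96 * se

print(ci_mean(0.62, 0.351))                            # mean paired difference in cholesterol
print(ci_diff_proportions(70 / 250, 250, 30 / 250, 250))
```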
Hypothesis testing • The null hypothesis • The alternative hypothesis • What is a P value?
Comparing 2 groups of continuous data • Normal distribution: paired or unpaired t test • Non-Normal distribution: transform data OR Mann-Whitney-Wilcoxon test
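A hedged sketch of both options using scipy.stats (assumed to be available); the two groups below are invented for illustration:

```python
from scipy import stats

group1 = [6.7, 7.8, 8.1, 5.5, 8.6, 6.7]   # hypothetical values, roughly Normal
group2 = [4.4, 7.0, 6.0, 5.8, 9.0, 6.1]

t_stat, p_t = stats.ttest_ind(group1, group2)      # unpaired t test
u_stat, p_u = stats.mannwhitneyu(group1, group2)   # non-parametric alternative
print(p_t, p_u)
```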
Paired t test We wish to compare the fasting blood cholesterol levels in 10 subjects before and after treatment with a new drug. What is the null hypothesis?
Paired t test – fasting cholesterol before and after treatment (D = predrug − postdrug)
Subject   Predrug   Postdrug   D
01        6.7       4.4        2.3
02        7.8       7.0        0.8
03        8.1       6.0        2.1
04        5.5       5.8       -0.3
05        8.6       9.0       -0.4
06        6.7       6.1        0.6
07        7.1       7.3       -0.2
08        9.9       9.9        0.0
09        8.2       6.3        1.9
10        6.5       7.1       -0.6
Paired t test • Calculate the mean and SEM of D • The null hypothesis is that the mean of D = 0 • The test statistic t = (mean(D) − 0) / SEM(D)
Paired t test • Mean = 0.62 • SEM = 0.351 • t = 1.766 • Degrees of freedom = n - 1 = 9 • From tables of t, 2-tailed probability (P) is between 0.1 and 0.2 • How would you interpret this?
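The same calculation as a Python sketch, using scipy.stats (assumed available) on the table above; the output reproduces the hand calculation to rounding:

```python
from scipy import stats

pre  = [6.7, 7.8, 8.1, 5.5, 8.6, 6.7, 7.1, 9.9, 8.2, 6.5]
post = [4.4, 7.0, 6.0, 5.8, 9.0, 6.1, 7.3, 9.9, 6.3, 7.1]

t_stat, p_value = stats.ttest_rel(pre, post)   # paired t test on the differences
print(t_stat, p_value)                         # roughly t = 1.77, P = 0.11
```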
Comparing 2 groups of categorical data • In a study of the effect of smoking on the risk of developing ischaemic heart disease, 250 men with IHD and 250 age-matched healthy controls were asked about their current smoking habits. • What is the null hypothesis?
Results • 70 of the 250 patients were smokers • 30 of the healthy controls were smokers
Calculate the sum of D²/E, where D = observed − expected count and E = expected count (under the null hypothesis each group of 250 is expected to contain 50 smokers and 200 non-smokers): 8 + 8 + 2 + 2 = 20. This is the test statistic, chi-squared. Compare with tables of chi-squared with (r − 1)(c − 1) degrees of freedom. In this case, chi-squared = 20 with 1 df gives P < 0.001. How do you interpret this?
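A sketch of the same chi-squared test with scipy.stats (assumed available); Yates' continuity correction is switched off so that the result matches the hand calculation:

```python
from scipy import stats

# 2 x 2 table: rows = IHD cases, controls; columns = smokers, non-smokers
observed = [[70, 180],
            [30, 220]]

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, dof, p)     # chi-squared = 20 with 1 df, P < 0.001
print(expected)         # expected counts: 50 smokers and 200 non-smokers per group
```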
Statistical analysis using computer software – SPSS as an example
Planning • Experimental design • Suitable controls • Database design
Statistical power • The power of a study to detect an effect depends on: • The size of the effect • The sample size • The probability of failing to detect an effect where one exists is called β • The power of a study is 100(1 − β)% • Wide confidence intervals indicate low statistical power
Statistical power • The necessary sample size to detect the effect of interest should be calculated in advance • Pilot data are usually required for these calculations
Statistical power - example • 30% of the population are carriers of a genetic variant. You wish to test whether this variant increases the risk of Alzheimer's disease. • For P < 0.05 and 80% power, the number of controls and cases required:
Control carriers   Case carriers   Sample size
30%                50%             100
30%                40%             350
30%                35%             1400
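A hedged sketch of this sample-size calculation with statsmodels (assumed available); it uses Cohen's effect size for two proportions, so the answer is in the same ballpark as the slide's rounded figures rather than identical:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.50, 0.30)       # 50% carriers in cases vs 30% in controls
n_per_group = NormalIndPower().solve_power(effect_size=effect,
                                           alpha=0.05, power=0.80, ratio=1.0)
print(round(n_per_group))                        # roughly 90-100 per group
```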
Multiple testing
Number of tests   Probability of a false positive
1                 0.05
2                 0.10
3                 0.14
4                 0.19
5                 0.23
10                0.40
20                0.64
Bonferroni correction: divide 0.05 by the number of tests to give the required P value for hypothesis testing at the conventional level of statistical significance.
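A short Python sketch reproducing the table and the Bonferroni-corrected threshold:

```python
alpha = 0.05
for k in (1, 2, 3, 4, 5, 10, 20):
    familywise = 1 - (1 - alpha) ** k   # probability of at least one false positive
    bonferroni = alpha / k              # per-test threshold after Bonferroni correction
    print(k, round(familywise, 2), bonferroni)
```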
Data trawling • Decide in advance which statistical tests are to be performed • Post hoc testing of subgroups should be viewed with caution • Multiple correlations should be avoided
HELP! • “In house” support • Cripps Computing Centre • Trent Institute for Health Service Research • Practical Statistics for Medical Research, Douglas G Altman