Introduction

Introduction • Statistics are increasingly prevalent in medical practice, and for those doing research, statistical issues are fundamental. It is therefore extremely important to understand basic statistical ideas relating to research design and data analysis, and to be familiar with the most commonly used methods of analysis.

Although data analysis is certainly an important part of the statistical process, there is an equally vital role to be played in the design of the research project. Without a properly designed study, the subsequent analysis may be unsafe, and/or a complete waste of time and resources.

Types of data • Descriptive statistics • Data distributions • Comparative statistics • Non-parametric tests • Paired data • Comparison of several means • Comparing proportions • Exploring the relationship between 2 variables • Correlation • Linear regression • Survival analysis

[Figure: histogram of platelet count (x-axis) against proportion of total (y-axis)]

Types of Data • Categorical • binary or dichotomous e.g. diabetic/non-diabetic, smoker/non-smoker • nominal e.g. A/B/AB/O, short-sighted/long-sighted/normal • ordered categorical (ordinal) e.g. stage 1/2/3/4, mild/moderate/severe

Discrete numerical - e.g. number of children - 0/1/2/3/4/5+ • Continuous - e.g. Blood pressure, age • Other types of data • ranks, e.g. preference between treatments • percentages, e.g. % oxygen uptake • rates or ratios, e.g. numbers of infant deaths/1000 • scores, e.g. Apgar score for evaluating new-born babies • visual analogue scales, e.g. perception of pain • survival data – two components, outcome and time to outcome

Descriptive Statistics • For continuous variables there are a number of useful descriptive statistics • Mean - equal to the sum of the observations divided by the number of observations, also known as the arithmetic mean • Median - the value that comes half-way when the data are ranked in order • Mode - the most common value observed • Standard Deviation - a measure of the average deviation (or distance) of the observations from the mean • Standard Error of the mean - a measure of the uncertainty of a single sample mean as an estimate of the population mean
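The descriptive statistics listed above can be sketched with Python's standard library. The platelet counts below are hypothetical values invented for illustration, and the standard error is taken as the sample standard deviation divided by the square root of the sample size.

```python
import math
import statistics

# Hypothetical platelet counts (x10^9/L); illustrative values only.
platelets = [150, 210, 180, 240, 210, 300, 195, 210, 265, 170]

n = len(platelets)
mean = statistics.mean(platelets)      # sum of observations / number of observations
median = statistics.median(platelets)  # middle value of the ranked data
mode = statistics.mode(platelets)      # most common value
sd = statistics.stdev(platelets)       # sample standard deviation (n - 1 denominator)
sem = sd / math.sqrt(n)                # standard error of the mean

print(f"n={n} mean={mean} median={median} mode={mode}")
print(f"SD={sd:.2f} SEM={sem:.2f}")
```

Note that `statistics.stdev` uses the n - 1 denominator, which is the usual choice when the data are a sample rather than the whole population.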

Data Distributions • Frequency distribution • If there are more than about 20 observations, a useful first step in summarizing quantitative data is to form a frequency distribution. This is a table showing the number of observations at different values or within certain ranges. If this is then plotted as a bar diagram, a histogram is obtained.

The Normal Distribution • In practice it is found that a reasonable description of many variables is provided by the normal distribution (Gaussian distribution). The curve of the normal distribution is symmetrical about the mean and bell-shaped. The bell is tall and narrow for small standard deviations, and short and wide for large ones.

Descriptives: DOD

                                      Statistic   Std. Error
Mean                                  3.1813      9.515E-02
95% Confidence Interval for Mean
  Lower Bound                         2.9946
  Upper Bound                         3.3679
5% Trimmed Mean                       2.5920
Median                                1.8000
Variance                              15.174
Std. Deviation                        3.8954
Minimum                               .10
Maximum                               33.40
Range                                 33.30
Interquartile Range                   2.3000
Skewness                              3.115       .060
Kurtosis                              12.507      .119

Comparative statistics • When there are two or more sets of observations from a study there are two types of design that must be distinguished: independent or paired. The design will determine the method of statistical analysis • If the observations are from different groups of individuals, e.g. ages of males and females, or spectacle use in diabetics/non-diabetics, then the data is independent. The sample size may vary from group to group

If each set of observations is made on the same group of individuals, e.g. WBC count pre- and post- treatment, then the data is said to be paired. This indicates that the observations are on the same individuals rather than from independent samples, and so we have the same number of observations in each set of data

Independent data • With independent continuous data, we are interested in the mean difference between the groups, but the variability between subjects becomes important. This is because the two sample t test (the most common test used), is based on the assumption that each set of observations is sampled from a population with a Normal Distribution, and that the variances of the two populations are the same.
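As a minimal sketch of the two sample t test described above, the function below computes the t statistic using a pooled variance, which is where the equal-variance assumption enters. The function name and the sample values are hypothetical. The p-value is not computed here, since it requires the t distribution with the returned degrees of freedom.

```python
import math
import statistics

def two_sample_t(x, y):
    """Return (t, df) for the equal-variance two sample t test."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / df
    # Standard error of the difference between the two means.
    se_diff = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    t = (statistics.mean(x) - statistics.mean(y)) / se_diff
    return t, df

# Hypothetical measurements from two independent groups.
group_a = [1, 2, 3, 4, 5]
group_b = [3, 4, 5, 6, 7]
t, df = two_sample_t(group_a, group_b)
print(f"t = {t:.3f} on {df} degrees of freedom")
```

The two groups do not need to be the same size; the pooled variance weights each group by its own degrees of freedom.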

Non-parametric test • If the continuous data is not normally distributed, or the standard deviations are very different, a non-parametric alternative to the t test known as the Mann-Whitney test can be utilised (another derivation of the same test is due to Wilcoxon)
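The Mann-Whitney test works on the ranks of the pooled observations rather than their values. The sketch below is one way to compute the U statistic and the Wilcoxon rank sum, with tied values given the average of their ranks. The function names and the data are hypothetical, and the normal approximation used to obtain a Z value and p-value is left out.

```python
def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U, W) where W is the rank sum of x and U is the smaller U value."""
    ranks = midranks(list(x) + list(y))
    w_x = sum(ranks[:len(x)])
    u_x = w_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return min(u_x, u_y), w_x

# Hypothetical days to engraftment in two independent groups.
u, w = mann_whitney_u([18, 21, 22, 25], [22, 26, 27, 30, 31])
print(f"U = {u}, rank sum of first group = {w}")
```

A small U means the two groups overlap very little in rank order; U equal to zero means every value in one group is below every value in the other.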

T-test

Mann-Whitney Test

Neutrophil engraftment following allogeneic SCT for CML

Cases
                    Valid            Missing          Total
         TBIDOSE    N     Percent    N     Percent    N     Percent
NEUTS    12         141   88.1%      19    11.9%      160   100.0%
         14.4       122   95.3%      6     4.7%       128   100.0%

Descriptives: NEUTS by TBIDOSE

                                      TBIDOSE 12                TBIDOSE 14.4
                                      Statistic   Std. Error    Statistic   Std. Error
Mean                                  22.9787     .5816         26.6148     .5544
95% Confidence Interval for Mean
  Lower Bound                         21.8289                   25.5172
  Upper Bound                         24.1286                   27.7123
5% Trimmed Mean                       22.5816                   26.1184
Median                                22.0000                   26.0000
Variance                              47.692                    37.495
Std. Deviation                        6.9060                    6.1233
Minimum                               11.00                     15.00
Maximum                               56.00                     53.00
Range                                 45.00                     38.00
Interquartile Range                   9.0000                    7.2500
Skewness                              1.162       .204          1.453       .219
Kurtosis                              3.184       .406          4.157       .435

Descriptives: PLATES by TBIDOSE

                                      TBIDOSE 12                TBIDOSE 14.4
                                      Statistic   Std. Error    Statistic   Std. Error
Mean                                  32.7891     1.6694        42.8776     3.9481
95% Confidence Interval for Mean
  Lower Bound                         29.4857                   35.0417
  Upper Bound                         36.0924                   50.7134
5% Trimmed Mean                       30.5556                   37.1973
Median                                29.5000                   27.0000
Variance                              356.703                   1527.572
Std. Deviation                        18.8866                   39.0842
Minimum                               14.00                     14.00
Maximum                               186.00                    185.00
Range                                 172.00                    171.00
Interquartile Range                   11.7500                   18.0000
Skewness                              5.244       .214          2.368       .244
Kurtosis                              37.479      .425          4.780       .483

Test Statistics

                       PLATES       NEUTS
Mann-Whitney U         6172.500     5543.500
Wilcoxon W             11023.500    15554.500
Z                      -.204        -4.977
P-value (2-tailed)     0.83         0.0006

Describing continuous data • If the data is normally distributed • Mean and standard deviation • If the data is skewed or non-normally distributed or is from a small sample (N<20) • Median and range

Comparison of several means • Data sets comprising more than two groups are common, and their analysis often involves the comparison of the means for the component subgroups. It is obviously possible to compare each pair of groups using t tests, but this is not a good approach. It is far better to use a single analysis that enables us to look at all the data in one go, and the method of choice is called analysis of variance • If the data are not normally distributed or have different variances, a non-parametric equivalent to the analysis of variance can be used, and is known as the Kruskal-Wallis test
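A one-way analysis of variance compares the variation between the group means with the variation within the groups. The sketch below computes the F ratio for any number of groups. The function name and the data are hypothetical, and the p-value (which requires the F distribution) is not computed.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way analysis of variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical measurements from three independent groups.
f, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"F = {f:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

With only two groups, this F ratio equals the square of the equal-variance two sample t statistic, which is one way to see that the analysis of variance generalises the t test.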

Paired data • When we have more than one group of observations it is vital to distinguish the case where the data are paired from that where the groups are independent. Paired data arise when the same individuals are studied more than once, usually in different circumstances. Also, when we have two different groups of subjects who have been individually matched, for example in a matched pair case-control study, then we should treat the data as paired.

A one sample t test is used to examine the data. The value t is calculated from • t = (sample mean - hypothesised mean) / standard error of sample mean • In a paired analysis where one set of observations is subtracted from the other set, the hypothesised mean is zero. Thus the calculation of the t statistic reduces to • t = sample mean / standard error of sample mean • The non-parametric equivalent to this test is the Wilcoxon matched pairs signed rank sum test
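The paired calculation can be sketched as follows: take the within-subject differences, then divide their mean by their standard error. The function name and the measurements are hypothetical, and the p-value (from the t distribution with n - 1 degrees of freedom) is not computed.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, df) for a paired t test on two equal-length sets of observations."""
    if len(before) != len(after):
        raise ValueError("paired data need the same number of observations in each set")
    # One difference per individual; the hypothesised mean difference is zero.
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1

# Hypothetical pre- and post-treatment values for three individuals.
t, df = paired_t([5, 6, 7], [6, 8, 10])
print(f"t = {t:.3f} on {df} degrees of freedom")
```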

Wilcoxon Signed Ranks Test

Telomere length in Dyskeratosis Congenita

Comparison of groups : continuous data • Paired or non-paired? • If non-paired and normally distributed with similar variances : T-test • If non-paired and non-normally distributed, or with non-similar variances, or very small numbers : Mann-Whitney test • Paired data – paired t-test or Wilcoxon Signed Ranks Test

Comparing Proportions • Qualitative or categorical data is best presented in the form of a table, such that one variable defines the rows, and the categories for the other variable define the columns. Thus in a European study of ASCT for HD, patient gender was compared between the UK and Europe • The data are arranged in a contingency table • Individuals are assigned to the appropriate cell of the contingency table according to their values for the two variables

COUNTRYG * PSEX Crosstabulation (Count)

                      PSEX
COUNTRYG    (blank)   Female   Male   Total
europe      16        610      828    1454
uk                    100      160    260
Total       16        710      988    1714

COUNTRYG * PSEXG Crosstabulation

                                 PSEXG
COUNTRYG                         1.00      2.00      Total
europe   Count                   828       610       1438
         % within COUNTRYG       57.6%     42.4%     100.0%
         % within PSEXG          83.8%     85.9%     84.7%
uk       Count                   160       100       260
         % within COUNTRYG       61.5%     38.5%     100.0%
         % within PSEXG          16.2%     14.1%     15.3%
Total    Count                   988       710       1698
         % within COUNTRYG       58.2%     41.8%     100.0%
         % within PSEXG          100.0%    100.0%    100.0%

Chi-squared test (χ²) • A chi-squared test (χ²) is used to test whether there is an association between the row variable and the column variable. When the table has only two rows or two columns this is equivalent to the comparison of proportions.

The first step in interpreting contingency table data is to calculate appropriate proportions or percentages. The chi-squared test compares the observed numbers in each of the four categories with the numbers expected if there were no difference between the UK and Europe in the distribution of patient gender • The greater the differences between the observed and expected numbers, the larger the value of χ² and the less likely it is that the difference is due to chance
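The comparison of observed and expected numbers can be sketched for a 2x2 table as below, using the counts from the COUNTRYG by PSEXG crosstabulation. The function name is hypothetical. The p-value relies on the fact that, with one degree of freedom, the upper tail of the chi-squared distribution can be written with the complementary error function, so the sketch applies to 2x2 tables only.

```python
import math

def chi_squared_2x2(table, continuity=False):
    """Return (chi2, p) for a 2x2 table given as [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if continuity:
                diff = max(diff - 0.5, 0.0)  # continuity correction
            chi2 += diff ** 2 / expected
    # Upper tail probability of chi-squared with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Rows: europe, uk. Columns: PSEXG 1.00, PSEXG 2.00.
observed = [[828, 610], [160, 100]]
chi2, p = chi_squared_2x2(observed)
chi2_cc, p_cc = chi_squared_2x2(observed, continuity=True)
print(f"Pearson chi-squared = {chi2:.3f}, p = {p:.3f}")
print(f"With continuity correction = {chi2_cc:.3f}, p = {p_cc:.3f}")
```

Run on these counts, the sketch should agree closely with the Pearson and continuity-corrected values in the SPSS chi-square output for this table.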


Chi-Square Tests

                                 Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                 (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square               1.418(b)   1    .234
Continuity Correction(a)         1.260      1    .262
Likelihood Ratio                 1.428      1    .232
Fisher's Exact Test                                            .246         .131
Linear-by-Linear Association     1.417      1    .234
N of Valid Cases                 1698

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 108.72.

Fisher’s Exact Test • When the overall total of the table is less than 20, or when it is between 20 and 40 and the smallest of the four expected values is less than 5, then Fisher’s Exact Test should be used.
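The rule of thumb above, and the exact test itself, can be sketched as follows. Both function names are hypothetical. The two-sided p-value here sums the probabilities of all tables with the same margins that are no more likely than the observed one, which is one common definition; statistical packages may differ in detail. The small table is hypothetical.

```python
from math import comb

def needs_fisher(a, b, c, d):
    """Apply the rule of thumb: total < 20, or total 20-40 with an expected count < 5."""
    n = a + b + c + d
    min_expected = min(r * k / n for r in (a + b, c + d) for k in (a + c, b + d))
    return n < 20 or (n <= 40 and min_expected < 5)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lowest, highest = max(0, c1 - r2), min(r1, c1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical small table: 8 individuals in total.
print(needs_fisher(3, 1, 1, 3))
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))
```

For the gender by country table (total 1698, smallest expected count 108.72) the rule of thumb does not call for the exact test, so the chi-squared test is adequate there.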
