Inferential Statistics

Inferential Statistics Friday 19th February 2010

Outline: • Inference • Confidence intervals • Sampling distributions • The normal distribution and z-scores • Working out confidence intervals • Hypothesis testing • Types of error • T-tests (and ANOVA) We recommend Statistics for the Terrified ‘Standard Error and Confidence Intervals’

What is inference? • Most of the time we care about the attributes of a population – adults in the UK; women workers; small businesses… • But we usually only study a sample of the population. • Inferential statistics give you the tools to infer population characteristics from the sample. • Inferential statistics usually assume a random sample. This is why it is so important to use methods of random sampling when at all possible. • Instead of, say, reporting that 35% of our sample have some characteristic, using inferential statistics we are able to estimate, or infer, the proportion of the population that is likely to have that characteristic. • In order to do this we use confidence intervals.

What is a ‘Confidence Interval’? • A ‘Confidence Interval’ for a particular sample statistic (e.g. the mean) is a range of values around the statistic that is believed to contain, with a certain level of probability (often 95%), the ‘true’ value of that statistic (i.e. the population value). • For example, suppose we see a report that 37% of people (plus or minus 3%) intend to vote Labour. What is being said is that the pollsters are reasonably confident that the true percentage of people who intend to vote Labour is between 34% and 40%. If they have not said otherwise, it is very likely that this is a 95% confidence interval. • Statistics for the Terrified Ch.4, pp.8-11

How do we arrive at a confidence interval? • How do we judge how big a confidence interval should be (plus or minus 2% or 5% or 15%...)? • What does it mean to be 95% certain that it is the size that we say it is? • And how do we know that the results we got in our sample of the population are not just a quirk of our particular sample (or ‘sampling error’)? • Part of the answer to these questions can be seen in common-sense assessments…

Example: Judging whether differences occur by chance… How do we judge whether it is plausible that two population means are the same and that any difference between them simply reflects sampling error? • Example: Household size of minority ethnic groups (HOH = Head of household; data adapted from 1991 Census) 1. The size of the difference between the two sample means • Mean household size: Indian HOH 3.0, Bangladeshi HOH 5.0 • Mean household size: Indian HOH 3.0, Pakistani HOH 4.0 • The first difference is more ‘convincing’.

Judging whether differences occur by chance… 2. The sample sizes of the two samples
Small samples:
Pakistani HOH: 3 4 5 (mean 4.0)
Bangladeshi HOH: 4 5 6 (mean 5.0)
Larger samples:
Pakistani HOH: 2 2 3 4 4 4 5 5 5 6 (mean 4.0)
Bangladeshi HOH: 2 3 4 4 5 5 6 6 7 8 (mean 5.0)
The second difference is more ‘convincing’.

Judging whether differences occur by chance… 3. The amount of variation in each of the two groups (samples)
More variation:
Pakistani HOH: 2 2 3 4 4 4 5 5 5 6 (mean 4.0)
Bangladeshi HOH: 2 3 4 4 5 5 6 6 7 8 (mean 5.0)
No variation:
Pakistani HOH: 4 4 4 4 4 4 4 4 4 4 (mean 4.0)
Bangladeshi HOH: 5 5 5 5 5 5 5 5 5 5 (mean 5.0)
The second difference is more ‘convincing’.

Example continued: the impact of variability on a difference in means. The three graphs each show two groups with the same mean difference. However the groups in each of the three graphs have different levels of variability. Where there is lower variability there is less cross-over between the groups, and so the difference of the means expresses a more ‘real’ difference (there is almost no one in group A with the same score as anyone in group B).

Judging whether differences occur by chance… As we’ll see these three things – the size of the difference between the means, sample size, and the amount of variation (measured by the standard deviation) within the sample(s) – are critical to our determination of whether a difference we observe in a sample (or between samples) is likely to represent a real difference in the population (or between populations).
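As a rough illustration of these three ingredients, the sketch below summarises the household-size figures from the slides above using only Python's standard library. The function name `summarise` and the dictionary labels are made up for this example; the numbers are the ones shown on the slides.

```python
from statistics import mean, stdev

# Household sizes copied from the slides above (Pakistani HOH, Bangladeshi HOH).
comparisons = {
    "small samples": ([3, 4, 5], [4, 5, 6]),
    "larger samples": ([2, 2, 3, 4, 4, 4, 5, 5, 5, 6], [2, 3, 4, 4, 5, 5, 6, 6, 7, 8]),
    "no variation": ([4] * 10, [5] * 10),
}

def summarise(group_a, group_b):
    """Return the three ingredients: mean difference, sample sizes, standard deviations."""
    return {
        "mean_difference": mean(group_b) - mean(group_a),
        "sizes": (len(group_a), len(group_b)),
        "sds": (stdev(group_a), stdev(group_b)),
    }

summaries = {label: summarise(a, b) for label, (a, b) in comparisons.items()}
for label, summary in summaries.items():
    print(label, summary)
```

In every comparison the difference in means is the same (one person per household); only the sample sizes and the standard deviations change, which is what makes some of the differences more ‘convincing’ than others.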

So, what is the relation of the sample to the population? • If the sample is a random sample of the population, it may sometimes have a large number of extremely high values (for example: very happy people). • And sometimes it may have a large number of extremely low values (for example: very sad people). • But over the long run (if we kept on taking a sample, and then putting it back and taking another one), we would expect that most of the samples would fairly well represent the population (for example: with a mean happiness that corresponds fairly closely to the mean happiness of the population).

Sampling Distributions • The distribution of a statistic (such as the mean) across the different possible samples that could be taken from a population is known as a sampling distribution. • The more we understand about this distribution the better, because it will help us to work out the likely relationship of our particular sample to the population. • What we find is that as more and more samples are taken, the average (i.e. mean) of the sample means tends to equal the mean of the population. • The sampling distribution of means also looks like a normal distribution (Central Limit Theorem). • However the sampling distribution of means is less varied than the population. • See sampling distribution simulation at: http://onlinestatbook.com/stat_sim/ (you can access this via the links page of the module website). • Or Statistics for the Terrified, Chapter 4: Standard error and confidence Intervals, pp.3-5. [Figure: Sampling from a Population: Sample Means (from Field, 2005).]

The formal theorem: “If repeated (simple random) samples of size N are drawn from a normally distributed population, the means of such samples will be normally distributed with mean μ and standard error [i.e. standard deviation] σ/√N... if the N of each sample drawn is large, then regardless of the shape of the population distribution the sample means will tend to distribute themselves normally with mean μ and standard error σ/√N”. Reminder: μ = population mean; σ = population standard deviation; N = number in sample.
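The theorem can be checked informally by simulation. The sketch below is not part of the original slides: it builds a made-up, deliberately skewed population, draws many random samples, and compares the spread of the sample means with the population standard deviation divided by the square root of the sample size. The population, seed, sample size and number of samples are all arbitrary assumptions.

```python
import random
from statistics import mean, pstdev

rng = random.Random(2010)  # fixed seed so the sketch is repeatable

# A deliberately skewed (non-normal), made-up population of 10,000 values.
population = [rng.expovariate(1.0) for _ in range(10_000)]
mu = mean(population)        # population mean
sigma = pstdev(population)   # population standard deviation

sample_size = 50
n_samples = 2_000

# Draw many random samples and record the mean of each one.
sample_means = [mean(rng.sample(population, sample_size)) for _ in range(n_samples)]

mean_of_means = mean(sample_means)
sd_of_means = pstdev(sample_means)          # observed spread of the sample means
predicted_se = sigma / sample_size ** 0.5   # population SD / sqrt(N), as in the theorem

print(f"population mean {mu:.3f}, mean of sample means {mean_of_means:.3f}")
print(f"predicted standard error {predicted_se:.3f}, observed {sd_of_means:.3f}")
```

Even though the population here is far from normal, the mean of the sample means should sit close to the population mean, and the observed spread should sit close to the predicted standard error.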

So where does this get us…??? • Well, we know that over the long run the mean of our sample means is likely to end up as the population mean. • We know that over the long run (when the sample is ‘large’ enough) the distribution of sample means looks normal. [Note: A “large sample” is sometimes considered to be one of size 30+, but a size of 100+ can more ‘safely’ be viewed as adequately large.] • And we know that the variation in the sample means, known as the standard error, is (more or less) σ/√N. • Although we usually only have a single sample, this information means we can work out a fairly reliable estimate of the population mean by combining the sample with what we know about normal distributions.

What’s so special about the Normal Curve? • The normal curve is a symmetrical distribution of scores with an equal number of scores above and below the midpoint of the abscissa (the horizontal axis, or ‘x-axis’, for the curve). • Since the distribution of scores is symmetric, the mean, median, and mode are all at the same point on the abscissa. In other words, the mean = the median = the mode. • If we divide the distribution up into standard deviation units, a known proportion of scores lies within each portion under the curve. For example, about 34% of cases lie between the mean and one standard deviation above it (and likewise below it). • From published or online tables, we can find the proportion of scores above and below any point on the abscissa, expressed in standard deviation units. Scores expressed in standard deviation units are referred to as z-scores.

z-scores • z-scores can be calculated for any value. They are a means of standardizing values that are measured on different scales by showing these values just in terms of the number of standard deviations away from the mean they fall. • z-scores are calculated by subtracting the mean from any value and dividing the result by the standard deviation: z = (x - mean) / s • z-scores will always have a mean of 0 and a standard deviation of 1. We can quickly see that this is true of the mean, since when x = mean, the numerator (top bit!) will equal 0, and therefore z must = 0. It may be a little less clear that it is true of the standard deviation. However, think about the instance when x is one standard deviation bigger than the mean (i.e. x = mean + s): z = ((mean + s) - mean) / s = s / s = 1
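A minimal sketch of the z-score calculation, using made-up weekly working hours (the data and the function name `z_scores` are assumptions for illustration only):

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardise values: subtract the mean, then divide by the standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]

hours = [28, 31, 34, 36, 41]   # made-up weekly working hours
z = z_scores(hours)

print([round(value, 2) for value in z])
print("mean of z-scores:", round(mean(z), 10))
print("standard deviation of z-scores:", round(stdev(z), 10))
```

Whatever scale the original values are on, the standardised values end up with a mean of 0 and a standard deviation of 1 (up to rounding).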

Finding the 95% point on a normal distribution… • From the table we can see that when z = 1.96 (sometimes simplified to z = 2) the p-value, which represents the probability of being in the larger area (to the left), is 0.975. • Therefore the area under one (small) tail of the curve is p = 0.025. • This means that scores greater than z = 1.96 occur just 2.5% of the time. • Further (because the normal curve is symmetric) we can calculate that the area under both tails (beyond z = 1.96 and z = -1.96) is 0.05. • In other words 95% of the area is in the middle, between z = -1.96 and z = 1.96. • And scores further from the mean than 1.96 standard deviations thus only occur 5% of the time. [Figure: normal curve with 97.5% of the area to the left of z = 1.96 and 2.5% to the right; 95% of the area lies between z = -1.96 and z = 1.96.]
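The same areas can be read off without a printed table. This sketch uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

area_left = standard_normal.cdf(1.96)   # proportion of scores below z = 1.96
one_tail = 1 - area_left                # proportion of scores above z = 1.96
both_tails = 2 * one_tail               # beyond z = -1.96 or z = 1.96 (by symmetry)
middle = 1 - both_tails                 # between z = -1.96 and z = 1.96

print(f"left of 1.96: {area_left:.4f}")
print(f"one tail: {one_tail:.4f}, both tails: {both_tails:.4f}, middle: {middle:.4f}")
```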

Note: What happens if the sample size is too small for one to safely assume that the sample mean has a Normal distribution? • When a sample is small (i.e. less than about 25) the assumption that the sample mean is normally distributed is not reasonable. • In fact, regardless of sample size, the sample mean can be assumed to have a t-distribution; the precise shape of a t-distribution depends on the sample size, and for moderate-to-large sample sizes the t-distribution is very similar to the Normal distribution (and, as the sample size approaches infinity, eventually converges with it).

Combining that with what we know about the sampling distribution: • 95% of cases lie within +/- 1.96 standard deviations of the mean in a normal distribution. • The distribution of sample means is normal. • And the standard error of sample means is approximately σ/√n. • Therefore 95% of sample means fall into the range: μ - 1.96(σ/√n) to μ + 1.96(σ/√n). [Figure: sampling distribution of the sample mean (frequency against sample mean), centred on the population mean μ; 95% of sample means lie within 1.96(σ/√n) of μ, with 2.5% of sample means in each tail.]

Example • If we take a sample of 100 people and find that they work a mean of 34 hours per week with a standard deviation of 8 hours, how do we construct a 95% confidence interval for the mean number of hours worked by the population? • We know that 95% of sample means fall in the range: μ - 1.96(σ/√n) to μ + 1.96(σ/√n) • We estimate σ using the sample standard deviation, which is 8. • The sample size (n) is 100. Therefore √n = 10. • Therefore 1.96(σ/√n) = 1.96 x (8 / 10) = 1.96 x 0.8 = 1.568 • Therefore there is a 95% likelihood that the sample mean that we have found is within (about) 1.57 hours of the actual mean. • And so we can say with 95% confidence that the population’s mean weekly hours of work will fall somewhere between 34 minus 1.57 and 34 plus 1.57. • A 95% confidence interval of 32.43 to 35.57 hours per week.
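The arithmetic in this example can be wrapped in a small helper. This is only a sketch: the function name `confidence_interval` and its parameters are made up for illustration, and it uses the normal-distribution multiplier of 1.96 from the slides.

```python
from math import sqrt

def confidence_interval(sample_mean, sample_sd, n, z=1.96):
    """Return (lower, upper) for sample_mean +/- z * (sample_sd / sqrt(n))."""
    standard_error = sample_sd / sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

# Figures from the example: 100 people, mean 34 hours, standard deviation 8 hours.
lower, upper = confidence_interval(sample_mean=34, sample_sd=8, n=100)
print(f"95% confidence interval: {lower:.2f} to {upper:.2f} hours per week")
```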

Why 95%? • A confidence interval need not be 95%. • However this is the generally accepted level for statistical testing. It is considered that errors occurring only 5% (or 1 in 20) of the time are acceptable. Furthermore, a higher value can produce confidence intervals that may be viewed as too wide (producing an unacceptable risk of Type II errors, discussed later). • However for some purposes a more cautious approach may be necessary. • For instance, if you were an antiquarian librarian sampling over time the humidity in your rare book storage facility, you might want to be confident that the average humidity level was neither destructively high nor low at a 99.9% level at least! In this case you would construct a 99.9% confidence interval (where only 0.1% of cases fell outside of the range). You could use the normal distribution to do this, in a similar fashion to the way in which we used it to work out that the 95% confidence level relates to plus or minus 1.96 standard errors.
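To see how the multiplier grows with the confidence level, the following sketch asks the standard normal distribution for the cut-off that leaves half of the remaining probability in each tail. The helper name `z_multiplier` is an assumption made for this illustration.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided multiplier: half of (1 - confidence) is left in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

multipliers = {level: z_multiplier(level) for level in (0.95, 0.99, 0.999)}
for level, z in multipliers.items():
    print(f"{level:.1%} confidence: plus or minus {z:.2f} standard errors")
```

The librarian's 99.9% interval would therefore be noticeably wider than a 95% interval built from the same sample.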

Note: Small samples continued… • The procedure for producing 95% confidence intervals remains very similar to the one for larger sample sizes (i.e. the one using the ‘normal distribution’, which might just as well be referred to as the z-distribution), as does the test to see whether a suggested population mean is plausible. • The only difference is that the ‘magic number’ 1.96 is replaced by a slightly larger number, the magnitude of which gets bigger as the sample size gets smaller. • Thus, for a sample size of 25, 1.96 is replaced by 2.06 and, for a sample size of 15, by 2.14 (in each case using the sample size minus one as the degrees of freedom). (You can sometimes find a table of values for the t-distribution at the back of a statistics textbook.) • However another problem arises with small samples: the distribution of sample means can be asymmetric. In fact, the assumption that the sample mean has a t-distribution is only reasonable for small samples if the distribution of the variable under consideration approximates the normal distribution.

For you to do: 1. If we take a sample of 144 people and find that they eat a mean of 2,450 calories per day with a standard deviation of 840 calories, how do we construct a 95% confidence interval for the mean number of calories eaten by the whole population? 2. If you have time, start thinking about this one: Imagine that we already know that the mean income of university graduates is £16,500. We then do a survey of 64 sociology graduates and find that they earn a mean income of £15,400 with a standard deviation of £4,000. Can we say that this is convincing evidence that sociology graduate students earn less than other graduate students? Why?

Hypothesis testing • Going back to the second question: Imagine that we already know that the mean income of university graduates is £16,500. We then do a survey of 64 sociology graduates and find that they earn a mean income of £15,400 with a standard deviation of £4,000. Can we say that this is convincing evidence that sociology graduate students earn less than other graduate students? • The null hypothesis here is that sociology graduates earn the same as other graduates. This is a hypothesis of no difference. • The alternative hypothesis is that there is a difference. • The null hypothesis (or H0) is usually of no difference. And the alternative hypothesis (or Ha) is usually of difference. • When we carry out statistical tests, we attempt, as here, to reject the null hypothesis at the 5% level of significance, i.e. with 95% confidence (or sometimes at the 1% or 0.1% level, i.e. with 99% or 99.9% confidence).

Hypothesis testing • So to think about the example again: Imagine that we already know that the mean income of university graduates is £16,500. We then do a survey of 64 sociology graduates and find that they earn a mean income of £15,400 with a standard deviation of £4,000. Can we say that this is convincing evidence that sociology graduate students earn less than other graduate students? • If we construct a 95% confidence interval for the population mean income of sociology graduates it will look like this: • 15,400 plus or minus 1.96 x (4,000 / √64) • 15,400 plus or minus 1.96 x (4,000 / 8) • 15,400 plus or minus 980 = £14,420 to £16,380 • The top point of this range is still below the mean income for graduates generally, so there is no overlap. This means that there is less than a 5% chance that a difference as big as £1,100 would have occurred if there is no difference between sociology graduates’ mean income and the mean income for all graduates.
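The same comparison can also be expressed as a z statistic: how many standard errors the sample mean lies from the suggested population mean. The sketch below uses the figures from the example; the variable names are illustrative, and the two-tailed p-value is computed from the normal distribution as an approximation.

```python
from math import sqrt
from statistics import NormalDist

population_mean = 16_500                      # mean income of all graduates (from the example)
sample_mean, sample_sd, n = 15_400, 4_000, 64

standard_error = sample_sd / sqrt(n)                   # 4,000 / 8
z = (sample_mean - population_mean) / standard_error   # distance from H0 in standard errors
p_two_tailed = 2 * NormalDist().cdf(-abs(z))           # chance of a gap at least this big under H0

lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error

print(f"z = {z:.2f}, two-tailed p = {p_two_tailed:.3f}")
print(f"95% confidence interval: {lower:,.0f} to {upper:,.0f}")
```

A z value further from zero than 1.96 and a confidence interval that excludes £16,500 are two views of the same result.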

Hypothesis testing Theory You test out particular hypotheses with reference to your sample statistics. However these hypotheses are about underlying population characteristics (parameters) Procedure • Set up ‘null’ (and ‘alternative’) hypothesis • Note sample size and design • Establish sampling distribution under the assumption that the null hypothesis is true • Identify decision rule (i.e. what constitutes acceptance/rejection of the null hypothesis) • Compute sample statistic(s), and apply the decision rule (N.B. This is where Type I and Type II errors can occur).

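Error Types • A Type I error is rejecting a null hypothesis that is in fact true. • A Type II error is failing to reject a null hypothesis that is in fact false. Note: For a given sample size, reducing the chance of one type of error occurring increases the chance that the other type will!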
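The trade-off in the note can be illustrated for a two-tailed z-test. In this sketch the Type I error rate is the chosen significance level, and the Type II error rate is worked out for an assumed real difference of 2.5 standard errors. That figure, and the function name `type_ii_rate`, are assumptions chosen only for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()

def type_ii_rate(alpha, true_shift):
    """Chance of failing to reject H0 in a two-tailed z-test when the true mean
    lies `true_shift` standard errors away from the value stated in H0."""
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # H0 is not rejected when the test statistic lands between -z_crit and +z_crit.
    return std_normal.cdf(z_crit - true_shift) - std_normal.cdf(-z_crit - true_shift)

true_shift = 2.5   # illustrative assumption: a real difference of 2.5 standard errors
rates = {alpha: type_ii_rate(alpha, true_shift) for alpha in (0.05, 0.01, 0.001)}
for alpha, beta in rates.items():
    print(f"Type I error rate {alpha:.3f} -> Type II error rate {beta:.3f}")
```

Demanding a smaller Type I error rate widens the range of results that count as "no difference", so more real differences go undetected.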

Applying the logic of a statistical test… Today and next week we will look at a number of different statistical tests that use inferential methods to ask: • Is the sample mean sufficiently different from the suggested population mean that it is implausible that the suggested population mean is correct? Testing the plausibility of a suggested population mean (via a z-test). [This is what we’ve just done]. • Are the means from two samples sufficiently different for it to be implausible that the populations from which they come are actually the same? Test via a two-sample t-test, or if comparing more than two (sub-) samples (i.e. more than two groups) testing for differences via Analysis of Variance (usually referred to as ANOVA). • Are the observed frequencies in a cross-tabulation sufficiently different from what one would have expected to have seen if there were no relationship in the population for the idea that there is no relationship in the population to be implausible? Test this via a chi-square test. In each instance we are asking whether the difference between the actual (observed) data and what one would have expected to have seen, given some hypothesis H0, is sufficiently large that the hypothesis is implausible. Thus we are always trying to disprove a hypothesis.

t-tests • Test the null hypothesis, which is: H0: μ1 = μ2 or H0: μ1 - μ2 = 0 (the equality of means) • The alternative hypothesis is: Ha: μ1 ≠ μ2 or Ha: μ1 - μ2 ≠ 0

What does a t-test measure? Note: T = treatment group and C = control group. (The above depicts a comparison in experimental research; in most discussions these will just be shown as groups 1 and 2, indicating different groups.)

Example • We want to compare the average amounts of television watched by Australian and by British children. • We have a sample of Australian and a sample of British children. We could say that what we have and want to do are something like this: [Diagram: we want to compare the population of British children with the population of Australian children; we do so by inference from a sample of British children and a sample of Australian children.]

t distribution critical values Example continued • Here the dependent variable is number of hours of TV watched each night • And the independent variable is nationality (or, perhaps, national context). • When we are comparing means SPSS calls the independent variable the grouping variable and the dependent variable the test variable. For a more detailed view of statistics go all the way to Australia: SurfStat

Example continued • If the null hypothesis, hypothesising no difference between the two groups, was correct (and children thus watch the same average amount of television in Australia as in Britain), we would assume that if we took repeated samples from the two groups the difference in means between them would generally be small or zero. • However it is highly likely that the difference between any two particular samples will not be zero. • Therefore we build up a sampling distribution of the difference between the two sample means. • We use this distribution to determine the probability of getting an observed difference (of a given size) between two sample means from populations with no difference.

If we take a large number of random samples and calculate the difference between each pair of sample means, we will end up with a sampling distribution that has the following properties: • It will be a t-distribution. • The mean of the difference between sample means will be zero if the null hypothesis is correct: Mean (M1 - M2) = 0. • The ‘average’ spread of scores around this mean of zero (the standard error) will be defined by the formula: S_DM = √[((N1 - 1)s1² + (N2 - 1)s2²) / (N1 + N2 - 2)] × √[1/N1 + 1/N2] • This estimate ‘pools’ the variance in the groups; just take it for granted for now!
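A small simulation can make this sampling distribution concrete. In the sketch below both groups are drawn from the same made-up population (a mean of 175 minutes and a standard deviation of 30 are arbitrary assumptions), so the null hypothesis is true by construction; the seed and the number of repetitions are also arbitrary.

```python
import random
from statistics import mean

rng = random.Random(38)  # fixed seed so the sketch is repeatable

def sample_group(n):
    """Draw n viewing times (minutes) from one shared, made-up population."""
    return [rng.gauss(175, 30) for _ in range(n)]

# Both groups come from the SAME population, so the null hypothesis is true here.
differences = [mean(sample_group(20)) - mean(sample_group(20)) for _ in range(5_000)]

average_difference = mean(differences)
share_beyond_21 = sum(abs(d) >= 21 for d in differences) / len(differences)

print(f"average difference between sample means: {average_difference:.2f} minutes")
print(f"share of differences of 21 minutes or more: {share_beyond_21:.3f}")
```

Individual differences are rarely exactly zero, but they average out close to zero, and large differences are uncommon when there is no real difference between the populations.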

Back to the example… • When we are choosing the test of significance it is important to note that: • We are making an inference from TWO samples (of Australian and of British children). And these samples are independent (the number of hours of TV watched by British children doesn’t affect the number of hours watched by Australian children). Therefore we need a two-sample test (what SPSS calls an ‘independent samples’ test). • The two samples are being compared in terms of an interval-ratio variable (hours of TV watched). Therefore the relevant descriptive statistic is the mean. • These facts lead us to select the two-sample t-test for the equality of means as the relevant test of significance. Table 1. Descriptive statistics for the samples

t-test of independent means: formulae
t = (M1 - M2) / S_DM
S_DM = √[((N1 - 1)s1² + (N2 - 1)s2²) / (N1 + N2 - 2)] × √[1/N1 + 1/N2]
df = N1 + N2 - 2
Note: 1/N1 + 1/N2 = (N1 + N2) / (N1 × N2)
Where: M = mean; S_DM = standard error of the difference between means; N = number of subjects in a group; s = sample standard deviation of a group; df = degrees of freedom

Example: Calculating the t-value
S_DM = √[((20 - 1) × 29² + (20 - 1) × 30²) / (20 + 20 - 2)] × √[(20 + 20) / (20 × 20)] = 9.3
t_sample = (166 - 187) / 9.3 = -2.3

Example: Obtaining a p-value for a t-value • To obtain the p-value for this t-value (score) we need to consult the table of critical values for the t-distribution (see handout). • The number of degrees of freedom we refer to in the table is the combined sample size minus two (this is because we already know two values: the two means): df = N1 + N2 - 2 • Here the above gives 20 + 20 - 2 = 38. • The table doesn’t have a row of probabilities for 38. In that case we (to be cautious) refer to the row for the nearest reported number of degrees of freedom below the desired number. Here that is 30. • In the row for 30 degrees of freedom, for a two-tailed test, the size of t_sample (2.3, ignoring the sign) falls between the two stated t-scores of 2.042 and 2.457. • The p-value, which falls between the significance levels for these scores, is therefore between 0.02 and 0.05. • Therefore the difference is statistically significant at the 0.05 level but not at the 0.02 (or 0.01) level.
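The calculation and the table look-up can be sketched together as follows. The function name `pooled_standard_error` is made up for this illustration, and the two critical values are simply the ones quoted above from the row for 30 degrees of freedom. The example does not say which standard deviation (29 or 30) belongs to which group; with equal group sizes the result is the same either way.

```python
from math import sqrt

def pooled_standard_error(s1, n1, s2, n2):
    """Standard error of the difference between two independent sample means,
    pooling the variances of the two groups."""
    pooled_variance = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return sqrt(pooled_variance) * sqrt(1 / n1 + 1 / n2)

# Figures from the worked example: means 166 and 187, SDs 29 and 30, 20 children per group.
se = pooled_standard_error(29, 20, 30, 20)
t = (166 - 187) / se
df = 20 + 20 - 2

# Two-tailed significance level -> critical t, as quoted from the row for 30 df.
critical_values_df30 = {0.05: 2.042, 0.02: 2.457}
significant_at = sorted(
    (level for level, critical in critical_values_df30.items() if abs(t) >= critical),
    reverse=True,
)

print(f"S_DM = {se:.1f}, t = {t:.1f}, df = {df}")
print("significant at levels:", significant_at)
```

The size of t clears the 0.05 cut-off but not the 0.02 one, which matches the conclusion reached from the printed table.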

Example: Reporting the results “The mean number of minutes of TV watched by the sample of 20 British children is 187 minutes, which is 21 minutes higher than the mean of 166 minutes for the sample of 20 Australian children; this difference is statistically significant at the 0.05 level (t(38)= -2.3, p = 0.03, two-tailed test). Based on these results we can reject the hypothesis that British and Australian children watch the same average amount of television every night.”

t-tests and ANOVA • ANOVA (Analysis of Variance) works on broadly similar principles, but is a technique allowing one to look simultaneously at differences between the means of more than two groups. • We will use both t-tests and ANOVA in the computing session this afternoon. • You DO NOT have to remember the equations for any of these tests! • What is important to remember are the principles of hypothesis testing: • That we start with a null hypothesis (of no difference in the population). • That, using our sample, we can test whether this is plausible. • The p-values that we get (and that we report) show the likelihood of results at least as extreme as those observed, given no difference. • Therefore (to simplify), the lower the p-value the stronger the evidence that there is a real difference between the groups. • The three things that affect these test statistics are the sample size (of each group), the size of the differences in the means (between groups) and the variability of scores (within each group).
