Probability & Statistical Inference Lecture 4

MSc in Computing (Data Analytics)

Lecture Outline • Recap of Statistical Inference • Central Limit Theorem • Confidence Intervals • Section Takeaways

Statistical Analysis Process • Describe • Make Inference

Populations vs. Samples • How do Irish voters intend voting in the next election? • The voting population of Ireland: 2,680,000 [1] • A sample of 1,008 adults was taken and surveyed for their voting intention in the next election [2] • Sources: [1] http://www.nationmaster.com/graph/dem_pre_ele_vot_age_pop-presidential-elections-voting-age-population [2] http://redcresearch.ie/wp-content/uploads/2012/01/Report.pdf

Populations vs. Samples • How do Irish voters intend voting in the next election? 1,008 voters were asked how they intended to vote in the next election • Fine Gael: 30% • Labour: 14% • Fianna Fail: 18% • Sinn Fein: 17% • Other: 21%

Populations vs. Samples • The term population is used in statistics to represent all possible measurements or outcomes that are of interest to us in a particular study or piece of analysis • In the example the population of interest was the voting intentions of all voters in Ireland • The term sample refers to a subset of the population that is selected for analysis • In the example the polling company selected a sample of 1,008 voters

Sampling • In choosing a sample it is important that it is representative of the population • No bias should exist in the sample • There are a number of sampling methods available to ensure that your data is representative • A simple random sample is the most straightforward of these methods
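As a rough illustration of a simple random sample, the sketch below draws 1,008 distinct members from a population of 2,680,000 (the figures from the polling example). The integer voter IDs and the seed are assumptions made for illustration only; a real poll would sample from an actual register or use a design such as quota sampling.

```python
import random

POPULATION_SIZE = 2_680_000  # voting population quoted in the lecture
SAMPLE_SIZE = 1_008          # poll sample size quoted in the lecture

# Hypothetical setup: every voter is identified by an integer ID 0..N-1.
rng = random.Random(2012)    # fixed seed so the sketch is reproducible

# random.sample draws without replacement, so every voter has the same
# chance of selection and nobody is selected twice.
sampled_ids = rng.sample(range(POPULATION_SIZE), SAMPLE_SIZE)

print(len(sampled_ids), "voters selected; first five IDs:", sampled_ids[:5])
```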

Statistical Inference • The statistical methods used to draw conclusions about populations, based on the statistics describing a sample, are known as statistical inference • We want to make decisions based on evidence from a sample, i.e. extrapolate from sample evidence to a general population • To make such decisions we need to be able to quantify our (un)certainty about how good or bad our sample information is

Statistical Inference • Statistical Inference is divided into two major areas: • Parameter Estimation: This is where sample statistics are used to estimate population parameters • Hypothesis Testing: A statistical hypothesis is a statement about the parameters of one or more populations. Hypothesis testing tests whether a hypothesis is supported by data collected

Population Statistics – Point Estimation • The population mean is denoted by µ (mu) • In general, given a sufficiently large sample, we use the sample mean x̄ as a point estimate of µ • The population variance is denoted by σ² (sigma-squared) • In general, given a sufficiently large sample, we use the sample variance s² as a point estimate of σ²

Population Statistics – Point Estimation • An estimate of the proportion, p, of items in a population that belong to a class of interest is calculated as: $\hat{p} = c/n$ • where c is the number of items in a random sample of size n that belong to the class of interest • This is known as the sample proportion
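The three point estimates above can be sketched in a few lines. The lifetimes and the defect counts below are made-up illustrative values, not data from the lecture, and `sample_proportion` is a hypothetical helper name.

```python
import statistics

# Hypothetical sample of component lifetimes in hours (illustrative only).
lifetimes = [5200.0, 6100.0, 4800.0, 7000.0, 5600.0, 6300.0]

x_bar = statistics.mean(lifetimes)          # point estimate of mu
s_squared = statistics.variance(lifetimes)  # point estimate of sigma^2 (divides by n - 1)


def sample_proportion(c: int, n: int) -> float:
    """Return p-hat = c / n for c items of interest in a sample of size n."""
    if n <= 0 or not 0 <= c <= n:
        raise ValueError("need n > 0 and 0 <= c <= n")
    return c / n


# Hypothetical counts: 12 defective items in a sample of 200.
p_hat = sample_proportion(12, 200)

print(f"x-bar = {x_bar:.2f}, s^2 = {s_squared:.2f}, p-hat = {p_hat:.3f}")
```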

Central Limit Theorem

Demonstration

Central Limit Theorem Explained by Example • The distribution shown is a Poisson distribution with λ=3 • This could represent the distribution of the number of clicks on a particular link in one second

Draw 200 samples, each with a large sample size • Calculate the mean of each sample

Central Limit Theorem • Explain what has happened? • As the sample sizes increased the shape of the histogram of means tended towards a normal distribution • As the sample sizes increased the spread (standard deviation) between the sample means decreased
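The demonstration can be approximated with a short simulation. This is a sketch under stated assumptions: the sample sizes (5, 30, 200), the seed, and the helper names are choices made here for illustration, and the Poisson draws use Knuth's multiplication method because the standard library has no Poisson generator.

```python
import math
import random
import statistics


def poisson_draw(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_means(n: int, num_samples: int, lam: float, rng: random.Random) -> list:
    """Return the means of num_samples independent samples, each of size n."""
    return [
        statistics.fmean(poisson_draw(lam, rng) for _ in range(n))
        for _ in range(num_samples)
    ]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
lam = 3.0                # Poisson rate from the lecture example
spread_by_n = {}

for n in (5, 30, 200):   # assumed sample sizes, chosen for illustration
    means = sample_means(n, 200, lam, rng)
    spread_by_n[n] = statistics.stdev(means)
    # For a Poisson population sigma^2 = lam, so the theoretical
    # standard deviation of the sample mean is sqrt(lam / n).
    print(
        f"n={n:>3}: mean of means={statistics.fmean(means):.3f}, "
        f"sd of means={spread_by_n[n]:.3f}, sqrt(lam/n)={math.sqrt(lam / n):.3f}"
    )
```

The standard deviation of the 200 sample means should shrink as n grows and stay close to sqrt(λ/n); plotting a histogram of `means` for each n would show the shape becoming more bell-like.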

Central Limit Theorem • These histograms are pictures of the Sampling Distribution of the Mean • This phenomenon occurs for any population with a finite mean and variance, provided the observations are independent • This result is called the Central Limit Theorem (CLT); its proof involves some fairly non-trivial mathematics

Definition: Central Limit Theorem continued… • The sampling distribution of the mean has an average value equal to μ (the population mean). • The sampling distribution of the mean has standard deviation $\sigma/\sqrt{n}$ • where σ is the population standard deviation and n is the sample size taken. • This value is called the standard error of the mean. • The sampling distribution of the mean will be approximately Normal if the sample size is large.

Central Limit Theorem - Definition • If a random sample is taken from a population, where: • The members of the sample can be considered independent of each other • They are all members of the same population • That population has a mean value μ and a standard deviation σ • Then...

Central Limit Theorem - Definition • The central limit theorem states that, given a distribution with mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/n as the sample size n increases • This is a non-mathematical definition of the Central Limit Theorem (CLT)

The Distribution of the Sample Means

Confidence Intervals

How can we use the CLT? • The Central Limit Theorem avoids the necessity of specifying a complete statistical model for all the sampled data. • All we have to do is specify a probability model for the sample mean. • For any sample mean calculated from a large independent random sample, taken from any population with mean μ and standard deviation σ, we know from the CLT that this sample mean is a random variable from an (approximately) Normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$

Practical use for the CLT continued… • Take a single sample and calculate x̄ • This is an estimate of μ – the true (but unknown) population mean. • But, how good is this estimate? • We assume that x̄ is not exactly μ, but that x̄ is somewhere near μ. How near is it likely to be?

Confidence Intervals Introduction • We would like to make probability statements as to how close x̄ is likely to be to μ. • If the sample size is sufficiently large, then the estimate x̄ can be considered as: • a random variable from a Normal distribution, • so probability statements are possible. • This is how we use the CLT in practical data analysis.

Confidence Intervals Introduction • For a Normal distribution, we know that 95% of values will be within 1.96 standard deviations of μ • So, given one estimate x̄, we can say that this estimate is within 1.96 standard errors of the actual population mean μ, with 95% confidence • We can turn this knowledge on its head: given x̄, we can be 95% confident that the true mean μ is within 1.96 standard errors of it. • (Figure: Normal curve with the central 95% shaded)

Confidence Interval • From this we can specify a range of values within which we are 95% confident that the population mean (μ) lies • This is called a confidence interval • 95% Confidence Interval for a population mean (from a large enough sample): $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ • As a rule of thumb, this approximation works well for samples of size 30 or more. So a large sample, in this context, is a sample of 30 or more.

Example • One sample of size 30 from the electronic components yields a sample mean x̄ = 5,873 hours. We know σ = 3,959, so a 95% confidence interval would be: $5873 \pm 1.96 \times 3959/\sqrt{30} \approx 5873 \pm 1417$, i.e. from 4,456 to 7,290 hours • Interpretation: we would say that the average lifetime of all components (μ) is between 4,456 and 7,290 hours with 95% confidence
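The arithmetic in this example can be checked with a small sketch. The function name `z_interval` is an assumption made here; the inputs (x̄ = 5,873, σ = 3,959, n = 30) come from the example, and the z-value is taken from the standard library's `NormalDist` rather than a printed table.

```python
import math
from statistics import NormalDist


def z_interval(x_bar: float, sigma: float, n: int, confidence: float = 0.95):
    """Large-sample confidence interval for a mean with known sigma."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)        # z times the standard error
    return x_bar - margin, x_bar + margin


lower, upper = z_interval(x_bar=5873, sigma=3959, n=30)
print(f"95% CI for the mean lifetime: {lower:.0f} to {upper:.0f} hours")
```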

Confidence Intervals • Why is this any good? • Before: one estimate, x̄ = 5,873, but no idea of how good or bad it was, i.e. how close to μ it was likely to be. • Now: 95% confident that μ is between 4,456 and 7,290 hours. • So, using the CLT leads to confidence intervals that enable us to estimate a parameter with a stated level of confidence. • In other words, it gives us an objective measure of the amount of information contained in our sample about the likely location of μ.

Problem with σ • All of the above assumes that the population standard deviation (i.e. σ) is known. • In practice this is not known (just like μ). • So, we need to estimate σ as well as μ • We get this estimate from the standard deviation of the sample, given that the sample is large enough. • The sample standard deviation is called ‘s’ • Estimate σ by s

General Confidence Interval for μ (Large Samples) • The general formula is: $\bar{x} \pm Z_{1-\alpha/2}\, s/\sqrt{n}$ • Where: • α is a value between 0 and 1, • (1−α)×100% is the confidence level you want • $Z_{1-\alpha/2}$ is a value from the Normal distribution table. • Example: for a 95% CI, α = 0.05 • (1−α)×100% = 95% • $Z_{1-\alpha/2}$ = 1.96

Z-Values • The values of $Z_{1-\alpha/2}$ for other confidence levels are given in standard tables, for example 1.645 for 90% and 2.576 for 99%.
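Instead of a printed table, the Z-values can be looked up in software. A minimal sketch using Python's standard library follows; `z_value` is a hypothetical helper name introduced here.

```python
from statistics import NormalDist


def z_value(confidence: float) -> float:
    """Return Z_{1 - alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} confidence: Z = {z_value(confidence):.3f}")
```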

Example • Using these we get the following results for the electronic component example (x̄ = 5,873, σ = 3,959, n = 30): • 90% CI: 4,684 to 7,062 • 95% CI: 4,456 to 7,290 • 99% CI: 4,011 to 7,735 • Note that as α gets smaller the CI gets wider • Also, as n gets bigger the CI narrows, so big samples lead to more precise estimates (i.e. narrower confidence intervals)
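Both effects, a wider interval for smaller α and a narrower interval for larger n, can be seen numerically. The sketch below reuses x̄ = 5,873 and σ = 3,959 from the electronic component example; the larger sample sizes (100 and 1,000) are hypothetical values added for comparison, and `ci_width` is an illustrative helper name.

```python
import math
from statistics import NormalDist

X_BAR, SIGMA = 5873.0, 3959.0  # values from the electronic component example


def ci_width(sigma: float, n: int, confidence: float) -> float:
    """Full width of the large-sample confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * sigma / math.sqrt(n)


# Effect of the confidence level at the example's sample size n = 30.
for confidence in (0.90, 0.95, 0.99):
    half = ci_width(SIGMA, 30, confidence) / 2
    print(f"{confidence:.0%} CI, n=30: {X_BAR - half:.0f} to {X_BAR + half:.0f}")

# Effect of the sample size at 95% confidence (100 and 1000 are hypothetical).
for n in (30, 100, 1000):
    print(f"95% CI width, n={n}: {ci_width(SIGMA, n, 0.95):.0f} hours")
```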

What CIs and sample sizes should I use? • You can’t control s – it is inherent in the data (population). • You can’t control x̄ either. • You can control $Z_{1-\alpha/2}$, but in practice scientific convention sets this to reflect 90%, 95% or 99% confidence, with 95% being the accepted default. • You can choose n – but resources may limit you. • There is a whole topic called sample size determination which you may want to review before collecting data or starting research

Confidence Interval Assumptions • The sample is large (as a rule of thumb, 30 or more; some texts ask for 40 or more) • Experimental units are independent of each other • Experimental units were randomly sampled • The independence assumption requires that the value of the variable for one experimental unit should not tell us anything about the value of another. • Randomness is required to avoid systematic bias in selection.

Exercise • Complete Exercise 1 & 2

Calculation of CIs for small samples • What about small samples? • In the case of CIs about a mean we can use the Student-t distribution. • The process turns out to be very similar, but we can no longer rely on the CLT

History of the Student t test • William Gosset used the publishing pseudonym ‘Student’. He derived the correct sampling distribution for the mean of small samples, which became known as the ‘t distribution’. • In his honour, it is often called the ‘Student t’ distribution. • Gosset was a chief brewer for Guinness. • The mathematical details are complicated, but it turns out that we perform exactly the same calculations as before, with the one change that the t distribution is used instead of the normal distribution.

Assumptions • Student’s t result applies only to a mean where the population is normally distributed with some mean μ and finite standard deviation σ. • This is in contrast to the CLT for large samples, which requires no such assumption about normality. • The t-test also requires the assumption of independence in the sample.

Statistical Model for mean from small samples • The experimental units are independently sampled from a population with mean=μ and standard deviation = σ • The population is normally distributed (we don’t need this with large samples) • So, to use the t-test for a small sample, you need to establish that data is sampled from a population that is normally distributed – you could look at the histogram of the sample and see if it is symmetric and bell shaped – or use other methods.

The t-Statistic • If the assumptions are met: • The statistic $T = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ • can be shown to be distributed according to a (Student) t-distribution. • The t-distribution has one parameter, called ‘degrees of freedom’ (df).

The t-Distribution • The t-distribution itself is bell shaped and symmetric, just like the normal distribution, but is ‘flatter’: it has a lower peak and heavier tails. • There are many t distributions, one for each value of the degrees of freedom. • The rule used is: for a sample of size n, use the t distribution with degrees of freedom = n−1 • Example: if the sample size is 15, then use a t distribution with degrees of freedom 15 − 1 = 14. • Note that degrees of freedom is often abbreviated to df.

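The t-Distribution • The t probability density function with k degrees of freedom: $f(x) = \dfrac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{\pi k}\,\Gamma\left(\frac{k}{2}\right)} \left(1 + \dfrac{x^2}{k}\right)^{-(k+1)/2}, \quad -\infty < x < \infty$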
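To see what ‘flatter’ means in practice, the standard form of the t density with k degrees of freedom can be evaluated directly and compared with the standard Normal density. This is a sketch: `t_pdf` is a name chosen here, and it relies on `math.gamma`, which overflows for very large k (a few hundred and above).

```python
import math
from statistics import NormalDist


def t_pdf(x: float, k: int) -> float:
    """Density of the t distribution with k degrees of freedom at x."""
    coeff = math.gamma((k + 1) / 2) / (math.sqrt(math.pi * k) * math.gamma(k / 2))
    return coeff * (1 + x * x / k) ** (-(k + 1) / 2)


normal = NormalDist()  # standard Normal, mean 0 and standard deviation 1

for x in (0.0, 1.0, 3.0):
    print(
        f"x={x}: t(4 df)={t_pdf(x, 4):.4f}, "
        f"t(30 df)={t_pdf(x, 30):.4f}, Normal={normal.pdf(x):.4f}"
    )
```

With 4 degrees of freedom the t density is lower than the Normal at the centre and higher in the tails; with 30 degrees of freedom the two are already close.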

General Confidence Interval for μ (Small Samples) • The general formula is: $\bar{x} \pm t_{(n-1,\,1-\alpha/2)}\, s/\sqrt{n}$ • Where (1−α)×100% is the confidence level you want and t(n−1, 1−α/2) is a value from the t distribution with df = n−1 and a specified α level. • What is t(n−1, 1−α/2)? • A value from the t distribution with n−1 df such that 100(1−α)% of values lie between −t(n−1, 1−α/2) and +t(n−1, 1−α/2).

How do you find t(n−1, 1−α/2)? • From a table specifically designed to give it to you, or use a computer • Note: as α gets smaller the CI gets wider • As df gets smaller the CI gets wider
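One way to "use a computer" without a statistics package is to integrate the t density numerically and search for the point that leaves α/2 in the upper tail. The sketch below does this with Simpson's rule and bisection; the function names, step count, and iteration count are assumptions made here, and a library routine would normally be preferred in practice.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of the t distribution with df degrees of freedom at x."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(math.pi * df) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t: float, df: int, steps: int = 2000) -> float:
    """P(T <= t), using Simpson's rule on [0, |t|] and the symmetry about 0."""
    if t < 0:
        return 1 - t_cdf(-t, df, steps)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(df: int, confidence: float = 0.95) -> float:
    """Return t(df, 1 - alpha/2) by bisection on the numerical CDF."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:  # widen the bracket until it contains the answer
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for df in (4, 14, 30):
    print(f"df={df}: t = {t_critical(df):.3f} (95%), {t_critical(df, 0.99):.3f} (99%)")
```

The printed values should show both notes from the slide: the critical value grows as α shrinks and as df shrinks.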

Example • Internal temperature of autoclaved aerated concrete used in building. An engineer recorded the following data: 23.01, 22.22, 22.04, 22.62, 22.59 • 95% CI for the population mean?
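A worked sketch of this example follows, assuming the five readings can be treated as an independent sample from a normally distributed population. The critical value 2.776 is t(4, 0.975) as read from a standard t table; the variable names are chosen here for illustration.

```python
import math
import statistics

# Internal temperature readings from the example.
temps = [23.01, 22.22, 22.04, 22.62, 22.59]

n = len(temps)
x_bar = statistics.mean(temps)  # sample mean
s = statistics.stdev(temps)     # sample standard deviation (divides by n - 1)

# t(n - 1, 1 - alpha/2) for n = 5 and alpha = 0.05, taken from a standard t table.
t_crit = 2.776

margin = t_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"x-bar = {x_bar:.3f}, s = {s:.4f}")
print(f"95% CI for the population mean: {lower:.3f} to {upper:.3f}")
```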

Exercise • Answer Questions 3-6

Confidence Intervals for Proportions (Large Samples) • Proportions (including %) are often a statistic of interest • Think of the proportion of defective items on a production line, the proportion of people who respond favourably to a survey question, or the proportion of successes versus failures in some experiment • Proportions are also covered by the CLT: a proportion is the average of 0/1 values (1 if an item belongs to the class of interest, 0 otherwise), so it is a kind of mean

Confidence Intervals for Proportions (Large Samples) • Take a sample of size n of electronic components coming off a production line, and test each one for defects. The statistic of interest is the proportion of defectives produced by the production process. • The estimated proportion from the sample is $\hat{p} = c/n$, where c is the number of defective components found in the sample • p̂ (p-hat) is the symbol used for the estimated proportion from the sample

Confidence Intervals for Proportions (Large Samples) • If the sample size is sufficiently large and we repeat the experiment a large number of times, then: • The sampling distribution of the proportion will be approximately normally distributed, by the CLT • The mean of this distribution will be p, i.e. the 'true' population proportion • The standard deviation of the sampling distribution of the proportion, called the standard error of the proportion, is estimated by $\sqrt{\hat{p}(1-\hat{p})/n}$
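Combining this standard error with the Normal Z-value gives a large-sample interval of the form p̂ ± Z × SE. The sketch below applies it to the poll figures quoted earlier (30% for Fine Gael in a sample of 1,008); the function name is hypothetical, and treating the poll as a simple random sample is an assumption made for illustration.

```python
import math
from statistics import NormalDist


def proportion_interval(p_hat: float, n: int, confidence: float = 0.95):
    """Large-sample confidence interval for a population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se


# Poll figures from the lecture: 30% support in a sample of 1,008 adults.
lower, upper = proportion_interval(p_hat=0.30, n=1008)
print(f"95% CI for the population proportion: {lower:.3f} to {upper:.3f}")
```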
