Ch 4

Ch 4 Estimating with uncertainty

Recall:
• In chapter 3, we learned about
• mean, median, mode (different measures of population or sample location/central tendency)
• standard deviation, variance, interquartile range (different measures of population or sample spread)

The problem:
• when we wish to learn about a population, we take samples
• but we can't tell how much sampling error is involved
• (recall: sampling error is the error due to chance in sampling a variable population)
• we must estimate how much sampling error is involved
• this will give us an idea of the precision of our estimate

But how do we estimate the sampling error of our estimate?
• (we are going to ignore other sources of error for the moment)
• we can study the magnitude of sampling error by sampling from a population whose parameters are already known
• In fact, many researchers have done this
• so we have a large body of work that helps us to understand, and PREDICT, the magnitude of sampling error

Our motivating example:
• the human genome
• DNA sequences of all 23 human chromosomes
• published in 2005
• (www.ensembl.org for genome information)
• 20290 genes (build 35, they're on 37 now)

[Figure: frequency distribution of gene length (number of nucleotides); y-axis: relative frequency]
Note: this is the population. µ = 2622; σ = 2036.9

To study the magnitude of sampling error, we sample from the population
• Let's choose 100 genes (with associated lengths) at random
• (recall how we would do this if we had the spreadsheet of genes in front of us)

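100 genes chosen at random: a sample
• This sample has Ybar = 2411.8 (compared to µ = 2622) and s = 1463.5 (compared to σ = 2036.9)
• The sample estimates differ from the population parameters (here both are too low) because of sampling error; a different random sample would give different values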
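The sampling step above can be sketched in Python. The real gene-length data are not reproduced here, so this sketch assumes a synthetic right-skewed population (log-normal, with parameters picked only so that its mean and spread are of the same order as the slide's µ and σ). The names `population` and `sample` are illustrative.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# ASSUMPTION: a synthetic, right-skewed stand-in for the 20,290 gene lengths.
# These are NOT the real genome data.
population = [random.lognormvariate(7.6, 0.7) for _ in range(20290)]

mu = statistics.fmean(population)        # population mean
sigma = statistics.pstdev(population)    # population SD (divides by N)

sample = random.sample(population, 100)  # 100 "genes", drawn without replacement
ybar = statistics.fmean(sample)          # sample mean
s = statistics.stdev(sample)             # sample SD (divides by n - 1)

print(f"population: mu = {mu:.1f}, sigma = {sigma:.1f}")
print(f"sample:     Ybar = {ybar:.1f}, s = {s:.1f}")
```

Changing the seed gives a different sample, and Ybar and s will land on either side of µ and σ from run to run.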

What if we were to repeat this sampling process, over and over?
• Every time, we would choose a number of genes at random (could be 100, could be any number), and for that sample, we would calculate a mean (Ybar) and a standard deviation (s).
[Figure: sampling distribution of the sample mean; y-axis: frequency]

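Sampling distribution of the mean
• can be interpreted as a probability distribution: it shows how likely each value of Ybar is when we take one random sample
• where does the population mean fall? (at the center of the sampling distribution)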

Effect of samples of different sizes
• Larger samples yield more precise estimates, with lower spread in the sampling distribution (and lower sampling error)
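A small simulation can illustrate repeated sampling and the effect of sample size. As before, it assumes a synthetic log-normal population in place of the real gene lengths. The helper `sample_means`, the 1000 repeats, and the sample sizes 20, 100 and 500 are all choices made for this sketch.

```python
import random
import statistics

random.seed(2)

# ASSUMPTION: synthetic right-skewed population, not the real genome data.
population = [random.lognormvariate(7.6, 0.7) for _ in range(20290)]

def sample_means(pop, n, repeats=1000):
    """Draw `repeats` random samples of size n and return the mean of each."""
    return [statistics.fmean(random.sample(pop, n)) for _ in range(repeats)]

spread = {}
for n in (20, 100, 500):
    means = sample_means(population, n)
    spread[n] = statistics.stdev(means)  # spread of the sampling distribution
    print(f"n = {n:3d}: mean of Ybar = {statistics.fmean(means):.1f}, "
          f"SD of Ybar = {spread[n]:.1f}")
```

The spread of the simulated sample means should shrink as n grows, while their center stays close to the population mean.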

Standard error
• a measure of sampling error: the standard deviation of the sampling distribution of an estimate
• easy to calculate for the mean: σ_Ybar = σ / √n
• standard error decreases as n increases

Standard error of Ybar
• also easy to calculate from a sample, using s as an estimate of σ: SE = s / √n
• Recall our sample of 100 genes from the human genome:
• This sample has Ybar = 2411.8 and s = 1463.5, so SE = 1463.5/√100 = 1463.5/10 = 146.3
• Report: Ybar = 2411.8 ± 146.3
• this allows reporting an estimate of error (in the Ybar estimate)
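The arithmetic on this slide can be checked with a few lines of Python. The values of s, n and Ybar are the ones quoted above; the variable names are illustrative. The unrounded result is 146.35, which the slide reports to one decimal place as 146.3.

```python
import math

ybar = 2411.8  # sample mean from the slide
s = 1463.5     # sample standard deviation from the slide
n = 100        # sample size

se = s / math.sqrt(n)  # standard error of the mean, SE = s / sqrt(n)
print(f"Ybar = {ybar} ± {se:.2f} (SE)")
```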

Another way to estimate error/precision:
• confidence intervals
• Note that the sampling distribution of the mean was approximately normal, even though the frequency distribution of the human genome data was right-skewed
• Remember also, from chapter 3, the rule of thumb about normally-distributed data: about 95% of data will fall within 2 standard deviations of the mean
• we extend that here to the sampling distribution, using the standard error rather than plain old s:
• about 95% of sample means will fall within 2 standard errors of the population mean

Rule of thumb in practice
• About 95% of sample means will fall within 2 SE of the population mean
• In practice: for any sample you take, the approximate 95% confidence interval is (Ybar - 2SE, Ybar + 2SE)
• For about 95% of confidence intervals calculated this way, the population mean falls inside the confidence interval
• Recall our sample of 100 genes from the human genome project:
• CI: (2411.8 - 2*146.3, 2411.8 + 2*146.3) = (2119.2, 2704.4)
• (here the interval does contain the population mean, µ = 2622)
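The 2SE rule can be sketched in two parts: first the interval from the slide's own numbers, then a rough check of how often such intervals capture the population mean. The coverage check assumes a synthetic log-normal population rather than the real gene lengths. The function `approx_ci95` and the 1000 repeats are choices made for this sketch.

```python
import math
import random
import statistics

def approx_ci95(sample):
    """Rule-of-thumb 95% CI: (Ybar - 2*SE, Ybar + 2*SE)."""
    ybar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return ybar - 2 * se, ybar + 2 * se

# Part 1: the interval from the slide's numbers.
ybar, se = 2411.8, 146.3
lo, hi = ybar - 2 * se, ybar + 2 * se
print(f"CI from the slide's sample: ({lo:.1f}, {hi:.1f})")

# Part 2: coverage check.
# ASSUMPTION: synthetic right-skewed population, not the real genome data.
random.seed(3)
population = [random.lognormvariate(7.6, 0.7) for _ in range(20290)]
mu = statistics.fmean(population)

repeats = 1000
hits = 0
for _ in range(repeats):
    low, high = approx_ci95(random.sample(population, 100))
    if low <= mu <= high:
        hits += 1  # this interval captured the population mean

coverage = hits / repeats
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

With a skewed population and n = 100, the observed fraction is typically close to, but not exactly, 0.95.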

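Pseudoreplication
• from “pseudo” and “replicate”
• A replicate is an additional measurement
• Example: We are interested in average blood pressure in men over age 65
• We find 10 men over age 65, and take their blood pressures, once in the morning, and once in the evening (total of 20 measurements).
• Are these 20 measurements independent of one another?
• No: two readings from the same man are likely to be more alike than readings from different men, so there are 10 independent units, not 20. Treating them as 20 would understate the standard error.
• Sampling design is still really important!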
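One common way to handle the blood pressure example is to average the two readings for each man and then work with the 10 per-man values. The sketch below compares that with (wrongly) treating all 20 readings as independent. All numbers are made up for illustration: the baseline of 135 with SD 15, and the within-man noise of SD 4, are assumptions, not data.

```python
import math
import random
import statistics

random.seed(4)

# ASSUMPTION: simulated readings for 10 hypothetical men; not real data.
men = []
for _ in range(10):
    baseline = random.gauss(135, 15)         # this man's typical pressure
    morning = baseline + random.gauss(0, 4)  # morning reading
    evening = baseline + random.gauss(0, 4)  # evening reading
    men.append((morning, evening))

def standard_error(values):
    """SE of the mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Pseudoreplicated: 20 readings treated as if they were independent.
all_readings = [reading for pair in men for reading in pair]
# One value per independent unit: the mean of each man's two readings.
per_man = [statistics.fmean(pair) for pair in men]

se_pseudo = standard_error(all_readings)
se_correct = standard_error(per_man)
print(f"SE treating n = 20: {se_pseudo:.2f}")
print(f"SE using n = 10:    {se_correct:.2f}")
```

Both approaches give the same mean here, but the pseudoreplicated version reports a smaller standard error than the design supports.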
