
Welcome to the Graduate Workshop in Statistics Session 4

This session provides a brief review of probability and introduces concepts such as random samples, sampling distributions, and confidence intervals for the mean. The session also discusses the t-distribution and previews hypothesis testing, which is covered in the next session. Examples and demonstrations are provided to enhance understanding.


Presentation Transcript


  1. Welcome to the Graduate Workshop in Statistics, Session 4. Instructor: Kam Hamidieh. Monday, August 1, 2005

  2. Today’s Agenda • (Very) Brief Review of Probability • New Stuff today: • Random Samples • Sampling distribution • Confidence intervals for the mean • T-distribution • Next time: Hypothesis Testing

  3. Negative Predictivity Results (from last time) • A blood test is 99% effective in detecting a certain disease when the disease is present. However, the test also yields a false-positive result for 2% of the healthy patients tested. (That is, if a healthy person is tested, then with a probability of 0.02 the test will say that this person has the disease.) Suppose 0.5% (5 out of 1000) of the population has the disease. • What is the negative predictivity of this test? That is P( don’t have disease | test is negative) = ?
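As a quick sketch of that calculation (probabilities taken from the problem statement; the variable names are just illustrative), Bayes' rule gives the answer directly:

# Negative predictivity via Bayes' rule: P(no disease | test negative)
p_disease = 0.005              # 0.5% of the population has the disease
sensitivity = 0.99             # P(test positive | disease)
false_positive_rate = 0.02     # P(test positive | no disease)

p_healthy = 1 - p_disease
p_neg_given_healthy = 1 - false_positive_rate   # 0.98
p_neg_given_disease = 1 - sensitivity           # 0.01

p_negative = p_neg_given_healthy * p_healthy + p_neg_given_disease * p_disease
npv = p_neg_given_healthy * p_healthy / p_negative
print(round(npv, 5))           # about 0.99995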

  4. Brief Review of Probability • A probability is a number between 0 and 1 that is assigned to a possible outcome of a random circumstance. • Random Variable: assigns a number to each outcome of a random circumstance, or, equivalently, to each unit in a population. • Just remember that if X ~ N(μ, σ²) then Z = (X − μ)/σ ~ N(0,1).
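A minimal sketch of that standardization, assuming illustrative values μ = 100 and σ = 15 (not from the slides):

from scipy import stats

mu, sigma = 100, 15
x = 120
z = (x - mu) / sigma                              # standardize: Z ~ N(0, 1)
print(stats.norm.cdf(z))                          # P(X <= 120) via the standardized value
print(stats.norm.cdf(x, loc=mu, scale=sigma))     # same probability, computed directly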

  5. What do we mean by a random sample? • Recall that a random sample of size n is denoted by X1, X2, …, Xn. • X1, X2, …, Xn is a random sample from our population of interest if (for example): • The X’s are independent, so P(X2 > 10 | X1 > 10) = P(X2 > 10) • The X’s have the same distribution/model, so P(X2 > 10) = P(X1 > 10) • Random samples are iid = independent and identically distributed.
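A minimal sketch of drawing such an iid sample in Python; the N(50, 10²) population is an assumption made purely for illustration:

import numpy as np

rng = np.random.default_rng(seed=4)
n = 25
sample = rng.normal(loc=50, scale=10, size=n)   # n independent draws from one distribution
print(sample[:5])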

  6. Our Imagined Population This is what the distribution of the population might look like (shaded histogram). This is how we think about the population (dotted red line).

  7. The Big Picture in Statistics Use a small group of units (the sample) to make some conclusions (inference) about a larger group (the population, whose characteristics are unknown).

  8. Statistics vs. Parameters • Recall… • A statistic is a characteristic of the sample (any function of the sample data). Example: the sample mean x̄. • A parameter is a characteristic (most often unknown) of the population in which we have a particular interest. Example: the population mean µ. • We often use the value of a statistic (known) to estimate the value of a parameter (unknown).

  9. Big Picture in Point Estimation (means) The population mean μ is unknown. Take a sample and compute the sample mean x̄. Now x̄ is our point estimate of μ.

  10. Big Picture in Point Estimation (SD) The population standard deviation σ is unknown. Take a sample and compute the sample standard deviation s. Now s is our point estimate of σ.

  11. Big Picture in Point Estimation (any parameter) A population characteristic (call it θ) is unknown. Take a sample and compute the corresponding sample statistic θ̂. Now θ̂ is our point estimate of θ.

  12. Sample Means as R.V.s The sample mean is a random variable! From the population we draw X1 (1st value, not known ahead of time), X2 (2nd value, not known ahead of time), …, Xn (nth value, not known ahead of time), so the sample mean x̄ = (X1 + X2 + … + Xn)/n is itself not known ahead of time.

  13. Sampling Distribution • Since the sample mean is a random variable, it has a distribution! • We call this distribution the sampling distribution of the sample mean. • In general, the sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take. • What will this distribution look like? What will it be? Could it be useful? • The answers lie in the celebrated Central Limit Theorem!

  14. The CLT • The Central Limit Theorem is often affectionately called the CLT by statisticians. • The CLT states that if n is sufficiently large, the sample mean of a random sample from a population with mean μ and finite standard deviation σ is approximately normally distributed with mean μ and standard deviation σ/√n.

  15. Technical but Important Ideas • The mean and standard deviation given in the CLT hold for any sample size; it is only the “approximately normal” shape that requires n to be sufficiently large. • However, if the population is normal then the sampling distribution of the sample mean will be exactly (not approximately) normal.

  16. CLT Demo • http://www.ruf.rice.edu/%7Elane/stat_sim/sampling_dist/index.html
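Alongside the web demo, here is a minimal simulation sketch of the same idea (the population and sample size are chosen only for illustration): even for a skewed exponential population, the sample means pile up around μ with spread close to σ/√n.

import numpy as np

rng = np.random.default_rng(seed=1)
n, reps = 30, 10_000                 # sample size and number of repeated samples
# Exponential(1) population: mean mu = 1 and standard deviation sigma = 1
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(sample_means.mean())           # close to mu = 1
print(sample_means.std(ddof=1))      # close to sigma / sqrt(n) = 1 / sqrt(30), about 0.18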

  17. Standard Error of the Mean • Note the sample mean is approximately normal with mean μ and standard deviation σ/√n. • Often we do not know the value of σ, so we estimate it with the sample standard deviation s. • The quantity s/√n is called the standard error of the mean. • We can interpret the standard error of the mean as estimating, approximately, the average distance of the possible sample mean values (for repeated samples of the same size n) from the true population mean μ.
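A small sketch of that computation; the data values here are made up for illustration:

import numpy as np

x = np.array([4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.8])   # hypothetical sample
s = x.std(ddof=1)                 # sample standard deviation
se = s / np.sqrt(len(x))          # standard error of the mean, s / sqrt(n)
print(s, se)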

  18. Sampling Distribution for Any Statistic • Every statistic has a sampling distribution. • They may not always be normal, or even approximately normal. • Statisticians try to develop theories about the sampling distributions of statistics. • Sampling distributions provide the key for telling us how close a sample statistic, such as the sample mean, falls to the unknown parameter we’d like to make an inference about.

  19. Recall Inference? • The previous results form the backbone for statistical inference, a set of procedures to make some conclusions about the population. • Two common procedures for statistical inference are: • Hypothesis testing • Estimation: point and confidence intervals • Statistical inference methods use probability calculations that assume the data are gathered with a random sample or a randomized experiment.

  20. Properties of Good Estimators • Remember that the sample mean and the sample standard deviation are (point) estimates of the population mean and the population standard deviation respectively. • Good point estimators have these properties: • Unbiasedness • Small standard error • Consistency

  21. Consistency - SLLN The strong law of large numbers (SLLN) says that the average of a random sample from a large population is likely to be close to the mean of the whole population. We say the sample mean, x̄, is a consistent estimator of the population mean μ.
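A simulation sketch of that idea (the N(3, 2²) population is an illustrative assumption): the running sample mean settles down near the population mean as n grows.

import numpy as np

rng = np.random.default_rng(seed=2)
mu = 3.0
x = rng.normal(loc=mu, scale=2.0, size=100_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(n, running_mean[n - 1])   # drifts toward mu = 3.0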

  22. Unbiased Estimators A good estimator has a sampling distribution that is centered at the parameter, in the sense that the parameter is the mean of the sampling distribution. An estimator with this property is said to be unbiased. (Figure: population curve and the sampling distribution of the estimator, centered at the population mean.)

  23. SE Small • A good estimator has a small standard deviation (standard error) compared to other estimators. • For example, for estimating the center of a normal distribution, it turns out that the sample mean has a smaller standard deviation (standard error) than the sample median.
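A quick simulation sketch of that comparison (a standard normal population and n = 25 are illustrative choices): across many repeated samples, the sample means vary less than the sample medians.

import numpy as np

rng = np.random.default_rng(seed=3)
n, reps = 25, 20_000
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))

print(samples.mean(axis=1).std(ddof=1))        # about 1 / sqrt(25) = 0.20
print(np.median(samples, axis=1).std(ddof=1))  # noticeably larger, about 0.25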

  24. Estimation and Target Shooting Shooting a bullet is like computing a sample statistic; trying to hit the center of the target is like estimating the population parameter. (Bullet = sample statistic; center of target = population parameter.)

  25. Bull’s Eye is our goal! (Four targets: unbiased with small SE; unbiased with big SE; biased with big SE; biased with small SE.)

  26. The T-Distribution • Remember if X ~ N(μ, σ²) then Z = (X − μ)/σ ~ N(0,1). • What if we do not know σ? We estimate σ with s, the sample standard deviation. • The standardized sample mean (x̄ − μ)/(s/√n) is no longer N(0,1). It has a t-distribution! • What is it? • Looks like a normal distribution • Depends on the sample size • For sample size n the t-distribution has n−1 “degrees of freedom” • t(n−1) → N(0,1) as n approaches infinity • Discovered by W. S. Gosset (Guinness Beer)

  27. T-Distribution • http://www.stat.sc.edu/~west/applets/tdemo1.html • http://www-stat.stanford.edu/~naras/jsm/TDensity/TDensity.html
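A minimal sketch of the t(n−1) → N(0,1) behavior, using the 97.5th percentile (the 95% multiplier) as the comparison:

from scipy import stats

for df in (2, 5, 10, 30, 100, 1000):
    print(df, stats.t.ppf(0.975, df))   # 4.30, 2.57, 2.23, 2.04, 1.98, 1.96
print("z:", stats.norm.ppf(0.975))      # 1.96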

  28. Confidence Intervals • A confidence interval is an interval of values computed from sample data that is likely to include the true population value. • The most general formula is: point estimate ± (multiplier × SE). The point estimate is our single best “guess”, the multiplier depends on the distribution of the statistic, and SE is the standard error of the point estimate.

  29. CI for a Population Mean • Consider the construction of an interval C(data) = (lower end, upper end) that we believe is likely to contain the population mean. • We need to specify the lower and the upper limits somehow. • One approach is to specify a probability so that P(lower limit < μ < upper limit) = y, where we refer to y as the confidence level of the interval. • Note that the confidence intervals themselves are really random intervals.

  30. CI for a Population Mean • Can we ever set y = 100%? Sure! Just take the entire range of possible values for your population mean! • We usually set y high, for example y = 0.95. • Very important: before you have gathered data and computed your confidence interval, P(lower limit(random variable) < μ < upper limit(random variable)) = 0.95, but after you have gathered your data and found the confidence interval, P(lower limit < μ < upper limit) is either 1 or 0!

  31. Confidence Intervals for the Mean - I If all necessary assumptions are met, the confidence interval for the population mean when the population standard deviation is NOT known is x̄ ± t*(s/√n), where x̄ is the sample mean, t* is the multiplier based on t(n−1), and s/√n is the standard error of the sample mean.

  32. Confidence Intervals for the Mean - II If all necessary assumptions are met, the confidence interval for the population mean when the population standard deviation is known is x̄ ± z*(σ/√n), where x̄ is the sample mean, z* is the multiplier based on N(0,1), and σ/√n is the standard deviation of the sample mean.
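A sketch of both interval forms, assuming made-up summary statistics (x̄ = 10.2, s = 2.5, n = 40, and σ = 2.5 for the known-σ case):

import numpy as np
from scipy import stats

xbar, s, n = 10.2, 2.5, 40
se = s / np.sqrt(n)

# Case I: sigma unknown -> t* multiplier with n - 1 degrees of freedom
t_star = stats.t.ppf(0.975, df=n - 1)
print(xbar - t_star * se, xbar + t_star * se)

# Case II: sigma known -> z* multiplier from N(0, 1)
sigma = 2.5
z_star = stats.norm.ppf(0.975)
print(xbar - z_star * sigma / np.sqrt(n), xbar + z_star * sigma / np.sqrt(n))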

  33. T-Multiplier

  34. Assumptions Which multiplier applies? • Variance known, population normal → use z* • Variance known, population not normal, n large (CLT) → use z* • Variance known, population not normal, n small → can’t do it • Variance unknown, population normal → use t* • Variance unknown, population not normal, n large (CLT) → use t* • Variance unknown, population not normal, n small → can’t do it

  35. Example 1 – Confidence Interval • The following are the activity values (micromoles per gram of tissue) of a certain enzyme measured in normal gastric tissue of n=31 patients with gastric carcinoma:

  36. Look at the Data First!

  37. Normal Population? - Histograms

  38. Normal Population? – Q-Q Plots

  39. Aside: Example of “Normal” Data Set

  40. How About the Random Sample (iid) Assumption? • If your data have a time ordering, you can create a time/sequence plot to check the identically distributed assumption. • Independence is harder to verify. Look at how the data were gathered.

  41. Example Based on the given activity values (micromoles per gram of tissue) of the enzyme measured in normal gastric tissue of n=31 patients with gastric carcinoma, what is the 95% confidence interval for the mean level of activity in the normal gastric tissue of people with gastric carcinoma? df = 31 − 1 = 30, so from the table use t* = 2.04. Standard error: 0.242412/√31 = 0.04354. Interval: 0.56174 ± (2.04)(0.04354) → (0.4729, 0.6506).
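The same calculation as a short sketch, using the summary statistics stated above (x̄ = 0.56174, s = 0.242412, n = 31):

import numpy as np
from scipy import stats

xbar, s, n = 0.56174, 0.242412, 31
se = s / np.sqrt(n)                     # about 0.04354
t_star = stats.t.ppf(0.975, df=n - 1)   # about 2.04 (the table value)
print(xbar - t_star * se, xbar + t_star * se)   # roughly (0.473, 0.651)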

  42. Example Based on the given activity values (micromoles per gram of tissue) of the enzyme measured in normal gastric tissue of n=31 patients with gastric carcinoma, what is the 95% confidence interval for the mean level of activity in the normal gastric tissue of people with gastric carcinoma? Our calculations yielded (0.4729,0.6506).

  43. Interpretation • If this procedure were repeated many times, we would expect 95% of the confidence intervals to contain the population mean level of activity in the normal gastric tissue of people with gastric carcinoma. • Practical Interpretation: We are 95% confident that the interval (0.4729, 0.6506) contains the population mean level of activity in the normal gastric tissue of people with gastric carcinoma.

  44. More on Interpretation of CI • http://www.ruf.rice.edu/%7Elane/stat_sim/conf_interval/index.html
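Alongside that demo, a short coverage simulation sketch (μ = 5, σ = 2, and n = 20 are illustrative assumptions): building many 95% intervals from repeated samples, about 95% of them cover the true mean.

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
mu, sigma, n, reps = 5.0, 2.0, 20, 10_000
t_star = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    se = x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - t_star * se < mu < x.mean() + t_star * se)

print(covered / reps)   # close to 0.95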

  45. Example 2 – Confidence Intervals Volunteers who had developed a cold within the previous 24 hours were randomized to take either zinc or placebo lozenges every 2 to 3 hours until their colds were gone. 25 took the zinc lozenges, and 23 took the placebo lozenges. The mean overall duration of symptoms for the zinc lozenge group was 4.5 days and the standard deviation of overall duration of symptoms was 1.6 days. For the placebo group, the mean overall duration of symptoms was 8.1 days, and the standard deviation was 1.8 days. What are the 95% confidence intervals for the mean duration of symptoms for the populations of individuals who take zinc lozenges and the placebo? Assume all the necessary assumptions to create the confidence intervals are met. Compute each interval as sample estimate ± multiplier × standard error. For the zinc group, the sample estimate is x̄ = 4.5 days, the standard error is 1.6/√25 = 0.32, and the multiplier is 2.06 (use df = n − 1 = 25 − 1 = 24), giving 4.5 ± (2.06)(0.32), or about 3.84 to 5.16 days. For the placebo group, the sample estimate is x̄ = 8.1 days, the standard error is 1.8/√23 ≈ 0.375, and the multiplier is 2.07 (use df = n − 1 = 23 − 1 = 22), giving 8.1 ± (2.07)(0.375), or about 7.32 to 8.88 days. No overlap!
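A short sketch reproducing both intervals from the stated summary statistics:

import numpy as np
from scipy import stats

for label, n, xbar, s in [("zinc", 25, 4.5, 1.6), ("placebo", 23, 8.1, 1.8)]:
    se = s / np.sqrt(n)
    t_star = stats.t.ppf(0.975, df=n - 1)
    print(label, xbar - t_star * se, xbar + t_star * se)
# zinc:    roughly (3.84, 5.16)
# placebo: roughly (7.32, 8.88)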

  46. Interpretation of Confidence Interval • If this procedure were repeated many times, we would expect 95% of the confidence intervals to contain the population mean overall duration of symptoms for those who take the zinc lozenges. • Practical Interpretation: We are 95% confident that the interval (3.84, 5.16) contains the population mean overall duration of symptoms for those who take the zinc lozenges.

  47. Next Time… • Please keep up! • Finally we will start hypothesis testing! • Try out the exercises!
