Short Course in Statistics

dolan-oneal

Presentation Transcript


  1. Short Course in Statistics • Learning Statistics through Computer • Notice that Microsoft Chinese Windows is needed in some slides

  2. Random Sampling • To obtain information through sampling • Population and Sample • Parameter and Statistic

  3. Population versus Sample • Population: the entire group of individuals about which we want information • Sample: a part of the population from which we actually collect information, used to draw conclusions about the whole population

  4. Example • Population = the weight measurements of all children under 18 • Sample = the weight measurements of students in 20 secondary and primary schools

  5. Parameter versus Statistic • Parameter: a number that describes the population • Statistic: a number that describes a sample

  6. Drawing balls from a box • A box contains 10 balls: 5 red, 5 black • Population: the 10 balls • Parameter: the proportion of red balls in the box (5/10) • Draw a random sample of size 3 • Statistic: the proportion of red balls in the sample, e.g. 2/3
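
The box example can be tried out in a few lines of Python. This is only a sketch: the fixed seed and the choice of five repeated samples are assumptions made for illustration, not part of the slides.

```python
import random
from fractions import Fraction

# Population from slide 6: a box of 10 balls, 5 red and 5 black.
box = ["red"] * 5 + ["black"] * 5

# Parameter: the proportion of red balls in the whole box.
parameter = Fraction(box.count("red"), len(box))

def sample_proportion_red(rng, size=3):
    """Draw `size` balls without replacement; return the proportion of red."""
    drawn = rng.sample(box, size)
    return Fraction(drawn.count("red"), size)

rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable
statistics_seen = [sample_proportion_red(rng) for _ in range(5)]

print("parameter:", parameter)
print("statistics from 5 samples:", [str(s) for s in statistics_seen])
```

The parameter stays fixed at 1/2, while the statistic can only be 0, 1/3, 2/3 or 1 and changes from sample to sample.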

  7. Statistical Science • Statistics provides methodology to estimate the parameter through the (random) sample

  8. How to draw a random sample • Construct a sampling frame: give a number (name) to each individual in the population • Use a “random number table” to draw a random sample of the prescribed size

  9. Random Number Table • Imagine a box containing 10 identical balls numbered 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 • Each time, draw a ball, record its number, and return it to the box before drawing the next ball • The resulting list (record) of digits is the “random number table”

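  10. Example • Objective: draw a sample of size 5 from a class of 30 students • Sampling frame: label the students with the numbers 00, 01, …, 29 • Read the random number table at line 130: 69051 64817 87174 09517 • Split the digits into pairs: 69 05 16 48 17 87 17 40 95 17 • Keep only the pairs between 00 and 29, skipping repeats: 05, 16, 17 (continue reading the table until 5 students are chosen)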

  11. Multiple Labels • Give each student several labels: 00=30=60, 01=31=61, 02=32=62, etc. • Notice that 01 corresponds to the second individual, because the labels start at 00
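
The two labelling schemes from slides 10 and 11 can be sketched as follows. The function names are hypothetical, and skipping the pairs 90 to 99 under multiple labels is an assumption (the slides do not say what to do with them); it keeps every student on exactly three labels.

```python
# Line 130 of the random number table, as quoted on slide 10.
digits = "69051 64817 87174 09517".replace(" ", "")

# Read the digits two at a time: 69, 05, 16, 48, ...
pairs = [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]

def draw_single_label(pairs, population=30, size=5):
    """One label per student: keep pairs 00..29 only, skipping repeats."""
    chosen = []
    for value in pairs:
        if value < population and value not in chosen:
            chosen.append(value)
        if len(chosen) == size:
            break
    return chosen

def draw_multiple_label(pairs, population=30, size=5):
    """Slide 11 scheme: 00=30=60, 01=31=61, ... so 00..89 are usable.

    Pairs 90..99 are skipped (an assumption), because using them
    would give students 00..09 a fourth label.
    """
    usable = (100 // population) * population  # 90 when population is 30
    chosen = []
    for value in pairs:
        if value >= usable:
            continue
        student = value % population
        if student not in chosen:
            chosen.append(student)
        if len(chosen) == size:
            break
    return chosen

print(pairs)
print(draw_single_label(pairs))
print(draw_multiple_label(pairs))
```

With a single label per student this line of the table yields only students 05, 16 and 17, so more of the table would have to be read. With multiple labels the first five pairs already give the students 09, 05, 16, 18 and 17.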

  12. Measurements in the Laboratory • Each measurement in the physics lab or chemistry lab can be regarded as an element in a random sample

  13. http://www.cuhk.edu.hk/webct • User ID & Password =STA2103(Surname)(Initials) • Go to the above website and learn sample survey, design of experiment and regression

  14. Henry, Chau, STA2103chauh • Ka Ho Enoch, Chan, STA2103chankhe • Jane, Tang, STA2103tangj • Vincent, Pong, STA2103pongv • Clara, Yip, STA2103yipc

  15. Why Random Sampling • To be representative • The statistic then obeys known laws: its sampling distribution • Probability (the chance of the occurrence of an event in n independent samplings) can be computed

  16. Not representative • Call-ins • Voluntary responses on the Web • Telephone surveys asking the respondents to reply with the number keys • Readers’ letters to the newspaper

  17. Sampling Distribution • Random sampling → the statistic changes as the sample varies • That is, the conclusion might differ from one sample to another • But if the samples are randomly drawn, we can predict the result with high probability

  18. Example • Population: Hong Kong adult residents • Sample (random): 600 persons • Parameter: proportion of the population supporting one more public holiday • Statistic: proportion in the sample supporting one more public holiday

  19. Consequence of Random Sampling • If we draw 1000 samples (each of size 600) and compute the statistic for each sample, the histogram of these 1000 sample proportions is approximately a bell-shaped curve: the normal density

  20. Normal and Probability • The normal density has 2 parameters: • Mean: the true proportion (p) • Variance: var = p(1-p)/n • Standard deviation: std = sqrt(var) • The sample proportion from the one sample we draw falls in the interval (p - 1.96 std, p + 1.96 std) with probability 0.95
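
As a small worked example of these formulas, the sketch below plugs in the sample size of 600 from slide 18. The true proportion p = 0.5 is a hypothetical value chosen for illustration; the slides do not give one.

```python
import math

n = 600   # sample size from slide 18
p = 0.5   # hypothetical true proportion (an assumption, not from the slides)

var = p * (1 - p) / n          # variance of the sample proportion
std = math.sqrt(var)           # standard deviation
low, high = p - 1.96 * std, p + 1.96 * std

print(f"std = {std:.4f}, 95% range = ({low:.3f}, {high:.3f})")
```

Under this assumption std is about 0.02, so a sample of 600 would typically land within about 4 percentage points of the true proportion.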

  21. Mean of normal=true parameter • If you draw a sample 1000 times, you have 1000 sample proportions. • The average of these 1000 sample proportions would be approximately the true proportion --- sample proportion is an unbiased estimate of the population proportion

  22. Variance=p(1-p)/n • If the sampling is truly random, the variance of these 1000 sample proportions depends only on the parameter p (and the sample size n) • So if I have only one sample, but it gives an accurate estimate of p, the variance of the 1000 sample proportions can be computed without actually drawing the 1000 samples
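
The claims on slides 19 to 22 can be checked by simulation. The following is a minimal sketch, assuming a hypothetical true proportion of 0.4 and a population large enough that each respondent can be treated as an independent draw; neither assumption comes from the slides.

```python
import random
import statistics

n = 600          # sample size (slide 18)
repeats = 1000   # number of samples (slide 19)
p = 0.4          # hypothetical true proportion (an assumption)

rng = random.Random(1)  # fixed seed so the run is repeatable

def sample_proportion(rng, n, p):
    """Simulate one random sample; return the proportion of supporters."""
    return sum(rng.random() < p for _ in range(n)) / n

proportions = [sample_proportion(rng, n, p) for _ in range(repeats)]

mean_hat = statistics.mean(proportions)       # should be close to p
var_hat = statistics.pvariance(proportions)   # should be close to p(1-p)/n
var_formula = p * (1 - p) / n

print(f"average of sample proportions: {mean_hat:.4f} (true p = {p})")
print(f"variance observed: {var_hat:.6f}, formula: {var_formula:.6f}")
```

The average of the simulated proportions sits near p, and their variance sits near p(1-p)/n, which is what lets a single sample stand in for the 1000.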

  23. Intuition behind the formula p(1-p)/n • It is symmetric about ½ • It is maximized at p=1/2 (very uncertain) • When p is closer to 0 or 1, i.e., things are more definite, the variance gets smaller
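
All three properties follow from completing the square, shown here as a short worked identity (not taken from the slides):

```latex
\[
  p(1-p) \;=\; \tfrac14 - \left(p - \tfrac12\right)^2
  \qquad\Longrightarrow\qquad
  \frac{p(1-p)}{n} \;\le\; \frac{1}{4n}.
\]
```

The squared term depends only on the distance of p from 1/2, which gives the symmetry; it is zero at p = 1/2, which gives the maximum; and it grows as p moves toward 0 or 1, which shrinks the variance.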

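  24. Confidence Interval • Conversely, p will be covered by the interval (p_hat - 1.96 std, p_hat + 1.96 std), where p_hat is the sample proportion, about 95 times out of 100 such experiments • Notice std=sqrt(p(1-p)/n); in practice p is replaced by p_hat when computing std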

  25. 95% Confidence Interval • Applying the formula to 100 surveys, we obtain 100 different interval estimates • About 95 out of these 100 intervals would contain the true p
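
The coverage claim can be checked by simulation. In this sketch each interval is centred on the sample proportion and uses the sample proportion to estimate std. The true proportion 0.4 is hypothetical, and 2000 surveys are used instead of 100 only to make the estimated coverage steadier.

```python
import math
import random

def confidence_interval(p_hat, n, z=1.96):
    """Interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    std = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * std, p_hat + z * std

n = 600          # sample size (slide 18)
p_true = 0.4     # hypothetical true proportion (an assumption)
surveys = 2000   # number of simulated surveys (an assumption)

rng = random.Random(2)  # fixed seed so the run is repeatable
covered = 0
for _ in range(surveys):
    p_hat = sum(rng.random() < p_true for _ in range(n)) / n
    low, high = confidence_interval(p_hat, n)
    covered += low <= p_true <= high  # True counts as 1

coverage = covered / surveys
print(f"share of intervals containing the true p: {coverage:.3f}")
```

The share should come out close to 0.95, though not exactly, since the normal curve is only an approximation here.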

  26. Opinion Polls • People may not give the true response: response error • People may not answer the questions: nonresponse error • Unit nonresponse (the person does not respond at all) • Item nonresponse (the person does not respond to some questions)

  27. Response rate • If the response rate is less than 80%, we would doubt the validity of the inference

  28. Election Polls • The respondents may not be voters • A respondent may not vote even if he/she has registered • The respondents may lie (response error)

  29. Questionnaire • The way the questions are worded affects the responses (a well-known effect)

  30. Other Data Collection Methods • Experimental Design • Observational Data (e.g. registry Data)

  31. How to know the effect of vaccine in preventing polio • We cannot apply the vaccine to all children and compare the results with those of the past • We need two groups: • Control group (no “real” treatment) • Treatment group (apply the vaccine)

  32. We should compare the two groups under “equal” conditions • People are different from each other • By random assignment of participants into the two groups, we can make the two groups have almost identical conditions – e.g., around the same on average
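
Random assignment can be sketched in a few lines. The participants here are hypothetical (id, age) pairs generated for illustration, and `randomize` is a name introduced for this example, not something from the slides.

```python
import random

def randomize(participants, rng):
    """Shuffle the participants and split them into two equal-sized groups."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def mean_age(group):
    """Average age of a group of (id, age) pairs."""
    return sum(age for _, age in group) / len(group)

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical participants: 1000 children aged 5 to 10.
participants = [(i, rng.randint(5, 10)) for i in range(1000)]

control, treatment = randomize(participants, rng)

print(f"control:   n={len(control)}, mean age={mean_age(control):.2f}")
print(f"treatment: n={len(treatment)}, mean age={mean_age(treatment):.2f}")
```

Although no attempt is made to match the children, the two groups end up with nearly the same average age. The same holds for characteristics nobody thought to record, which is the point of randomizing.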

  33. Design of an Experiment • To compare one treatment (A) with another treatment (B), we need to assign the patients at random to the two groups, each group receiving one of the treatments

  34. Some possible mistakes • Data---from hospital record • Death rates of surgical patients are different for operations with different anesthetics • Halothane (1.7%), Pentothal (1.7%), Cyclopropane (3.4%), Ether (1.9%) • Can we say that cyclopropane is more dangerous than the other anesthetics?

  35. Answer • No! The patients in the worst condition were the ones receiving cyclopropane.

  36. The vaccine can prevent Polio • 1954, USA: over two million children involved • Should they all receive the vaccine? • Should the boys receive the vaccine while the girls receive a placebo?

  37. Placebo • In this case, the placebo is another kind of liquid, similar to the vaccine in appearance, injected into the children • It is used so that all children receive the “same” treatment, so that a difference in the results cannot be explained as a psychological effect

  38. Data

  39. Analysis • The proportion of the control group having polio after ½ year: a/(a+b)=0.00057 • The proportion of the treatment group having polio after ½ year: c/(c+d)=0.00016 • The effect of treatment: • RD (risk difference) = a/(a+b) - c/(c+d) = 0.00041

  40. Formulation of the Hypotheses • Null Hypothesis: no difference in the proportions • Alternative Hypothesis: the two proportions are different

  41. Analysis • We need to compare RD with its variation • That is, if we repeated the experiment, the results would differ from one experiment to the next. The variation of these results can be measured by their variance • But we have only one experiment

  42. Estimate the variation • If there is no effect of the vaccine, the true risk (probability) of getting polio is estimated by pr=(a+c)/(a+b+c+d)=0.00037 • Under the above hypothesis, the variance of RD is given by pr(1-pr) × (1/(a+b)+1/(c+d)) • The standard deviation is 0.000061.

  43. Contd. • Thus the ratio 0.00041/0.000061=6.76 measures the effect of the vaccine • Does 6.76 indicate a large, a small, or no effect? • We need a yardstick.
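
The calculation on slides 39 to 43 can be sketched as a small function. RD is taken here as the control risk minus the treatment risk, and the variance multiplies pr(1-pr) by (1/(a+b) + 1/(c+d)). The counts below are hypothetical: the data table of slide 38 is not available, so round group sizes of 200,000 were chosen to reproduce the quoted risks of 0.00057 and 0.00016.

```python
import math

def risk_difference_ratio(a, b, c, d):
    """Return (RD, std, RD/std) for a 2x2 table.

    a, b: control group with / without polio
    c, d: treatment group with / without polio
    """
    risk_control = a / (a + b)
    risk_treatment = c / (c + d)
    rd = risk_control - risk_treatment
    pooled = (a + c) / (a + b + c + d)   # common risk if the vaccine has no effect
    var = pooled * (1 - pooled) * (1 / (a + b) + 1 / (c + d))
    std = math.sqrt(var)
    return rd, std, rd / std

# Hypothetical counts (NOT the slide 38 data): 200,000 children per group.
a, b = 114, 199_886   # control: 114 / 200,000 = 0.00057
c, d = 32, 199_968    # treatment: 32 / 200,000 = 0.00016

rd, std, ratio = risk_difference_ratio(a, b, c, d)
print(f"RD = {rd:.5f}, std = {std:.6f}, ratio = {ratio:.2f}")
```

With these assumed counts the ratio comes out between 6.7 and 6.8, the same order as the 6.76 on the slide; the exact value depends on the real counts.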

  44. Intuition • Thus the ratio (RD/std) measures the effect of the vaccine. • That is, if it is large in absolute value, the effect of vaccine is significant • How large is large?

  45. Random assignment of patients to treatments • Suppose we do the experiment 1000 times and each time we calculate the ratio • We also assume that the effect of the vaccine is zero • Then we plot the histogram of the 1000 ratios. We find that the histogram is close to a bell-shaped curve: the normal density curve.
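
This thought experiment can be simulated. The sketch below assumes a no-effect world in which both groups share the same risk. To keep the run fast it uses a hypothetical group size of 500 and a hypothetical risk of 0.1 rather than the scale of the polio trial.

```python
import math
import random
import statistics

def ratio(a, b, c, d):
    """RD / std for a 2x2 table, with std based on the pooled risk."""
    pooled = (a + c) / (a + b + c + d)
    var = pooled * (1 - pooled) * (1 / (a + b) + 1 / (c + d))
    return (a / (a + b) - c / (c + d)) / math.sqrt(var)

group_size = 500     # hypothetical, much smaller than the real trial
risk = 0.1           # hypothetical common risk: the vaccine has no effect
experiments = 1000   # number of repeated experiments (slide 45)

rng = random.Random(4)  # fixed seed so the run is repeatable
ratios = []
for _ in range(experiments):
    a = sum(rng.random() < risk for _ in range(group_size))  # cases, control
    c = sum(rng.random() < risk for _ in range(group_size))  # cases, treatment
    ratios.append(ratio(a, group_size - a, c, group_size - c))

mean_ratio = statistics.mean(ratios)
share_beyond_196 = sum(abs(r) > 1.96 for r in ratios) / experiments

print(f"mean of ratios: {mean_ratio:.3f}")
print(f"share with |ratio| > 1.96: {share_beyond_196:.3f}")
```

If the ratios follow a standard normal curve, their mean should be near 0 and about 5% of them should fall beyond ±1.96.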

  46. Normality • We know that the ratio is approximately normal under the hypothesis of no effect, and we have observed 6.76 • We can compute the area to the right of 6.76: the probability that the ratio is larger than 6.76 under the hypothesis of no effect. We find that the area is very small (6.9 x 10^{-12})
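
The area under the standard normal curve to the right of a value can be computed with the complementary error function from Python's standard library. A minimal sketch (`upper_tail` is a name introduced here):

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p_value = upper_tail(6.76)
print(f"area to the right of 6.76: {p_value:.1e}")
print(f"area to the right of 1.96: {upper_tail(1.96):.4f}")
```

The first value agrees with the 6.9 x 10^{-12} quoted on the slide. The second is about 0.025, which is why ±1.96 leaves 5% in the two tails combined.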

  47. P value • The area corresponds to the probability, under the null hypothesis, of an event at least as extreme as the observed value • The usual rule: if the p-value is < 0.05, reject the null hypothesis • 0.05 can be interpreted as follows: when the null hypothesis is true, about 5 out of 100 experiments would lead to the wrong conclusion

  48. Chi Square Test-Another approach • We can apply the chi square test to the same data set. • The chi square test is used to test whether the proportion of getting polio is the same for the two groups (homogeneity). Equivalently, whether the occurrence of polio is independent of the treatment (group)

  49. Analysis • The chi square test statistic is given by N(ad - bc)**2/((a+b)(a+c)(b+d)(c+d)) • N=a+b+c+d • When the statistic is large, the null hypothesis (equal proportions in the two groups) is likely to be wrong

  50. Statistical Reasoning • The above statistic can be expressed as the sum, over the four cells of the table, of the quantities • (observed count - expected count)**2 divided by the expected count • Here the expected count is the average count we would see in a cell under the hypothesis that the two groups are the same
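
Both forms of the chi square statistic can be computed and compared. This is a sketch with hypothetical counts (the slide 38 data table is not available): 200,000 children per group, chosen to reproduce the quoted risks of 0.00057 and 0.00016. The function names are introduced here for illustration.

```python
def chi_square_shortcut(a, b, c, d):
    """N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)) for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

def chi_square_from_counts(a, b, c, d):
    """Sum over the four cells of (observed - expected)^2 / expected."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]   # control, treatment
    col_totals = [a + c, b + d]   # polio, no polio
    total = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if both groups share the same risk.
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total

# Hypothetical counts (NOT the slide 38 data).
a, b = 114, 199_886   # control: with / without polio
c, d = 32, 199_968    # treatment: with / without polio

shortcut = chi_square_shortcut(a, b, c, d)
summed = chi_square_from_counts(a, b, c, d)
print(f"shortcut formula: {shortcut:.3f}")
print(f"sum over cells:   {summed:.3f}")
```

The two numbers agree up to rounding. For a 2x2 table the statistic is also the square of the ratio RD/std when std is computed from the pooled risk, so the two approaches lead to the same conclusion.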
