Short Course in Statistics • Learning Statistics through Computer • Notice that Microsoft Chinese Windows is needed in some slides
Random Sampling • To obtain information through sampling • Population and Sample • Parameter and Statistic
Population versus Sample • Population: the entire group of individuals about which we want information • Sample: a part of the population from which we actually collect information, used to draw conclusions about the whole population
Example • Population = the weights of all children under 18 • Sample = the weights of students in 20 secondary and primary schools
Parameter versus Statistic • Parameter: a number that describes the population • Statistic: a number that describes a sample
Drawing balls from a box • A box contains 10 balls: 5 red, 5 black • Population: 10 balls • Parameter: proportion of red balls in the box • Draw a random sample of size 3 • Statistic: proportion of red balls in the sample, e.g. 2/3
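The box example can be tried out in a few lines of Python. This is a minimal sketch using the standard `random` module; the seed and the variable names are illustrative choices, not part of the slides.

```python
import random

# Population: 10 balls, 5 red and 5 black (as on the slide).
box = ["red"] * 5 + ["black"] * 5
parameter = box.count("red") / len(box)   # proportion of red balls in the population

rng = random.Random(1)                    # arbitrary fixed seed so the run is repeatable
sample = rng.sample(box, 3)               # random sample of size 3, without replacement
statistic = sample.count("red") / len(sample)   # proportion of red balls in the sample

print("parameter:", parameter)
print("sample:", sample, "statistic:", statistic)
```

Changing the seed changes the sample and therefore the statistic, while the parameter stays at 0.5.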
Statistical Science • Statistics provides methodology to estimate the parameter through the (random) sample
How to draw a random sample • Construct a sampling frame: give a number (name) to each individual in the population • Use a random number table to draw a random sample of the prescribed size
Random Number Table • Imagine a box containing 10 identical balls numbered 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. • Each time, draw a ball, record its number, and return it to the box before drawing the next ball. The resulting list of digits is the “random number table”
Example • Objective: draw a sample of size 5 from a class of 30 students • Sampling frame: label each student with one of the numbers 00, 01, …, 29. • Read the random number table at line 130: 69051 64817 87174 09517 • Split the digits into pairs: 69 05 16 48 17 87 17 40 95 17
Multiple Label • To waste fewer pairs, give each student several labels: 00=30=60, 01=31=61, 02=32=62, etc. • Notice 01 will correspond to the second individual • With these labels, the pairs read from line 130 give the sample 09 (from 69), 05, 16, 18 (from 48) and 17
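The table-reading procedure above can be sketched in Python. The function name `draw_sample` is hypothetical, and two details are assumptions the slides do not state: pairs 90 to 99 are skipped because no student carries those labels, and a student who has already been drawn is skipped.

```python
def draw_sample(digits, population_size, sample_size):
    """Read two-digit labels from a string of random digits and map them to individuals."""
    copies = 100 // population_size      # 3 labels per student when population_size = 30
    usable = copies * population_size    # labels 00..89 are used, 90..99 are skipped (assumption)
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        label = int(digits[i:i + 2])
        if label >= usable:
            continue                     # e.g. 95 has no student attached
        individual = label % population_size   # 69 -> 09, 48 -> 18, 87 -> 27
        if individual not in chosen:     # skip a student already drawn (assumption)
            chosen.append(individual)
        if len(chosen) == sample_size:
            break
    return chosen

line_130 = "69051648178717409517"        # the digits quoted on the slide, spaces removed
sample = draw_sample(line_130, 30, 5)
print(sample)
```

If the digits run out before the sample is complete, the function returns the shorter list; in practice one would keep reading the next line of the table.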
Measurements in the Laboratory • Each measurement in the physics lab or chemistry lab can be regarded as an element in a random sample
http://www.cuhk.edu.hk/webct • User ID & Password = STA2103(Surname)(Initials) • Go to the above website and learn about sample surveys, design of experiments and regression
Henry, Chau, STA2103chauh • Ka Ho Enoch, Chan, STA2103chankhe • Jane, Tang, STA2103tangj • Vincent, Pong, STA2103pongv • Clara, Yip, STA2103yipc
Why Random Sampling • To be representative • The statistic then obeys known laws (its sampling distribution) • Probabilities (the chance that an event occurs in n independent samplings) can be computed
Not representative • Call-in polls • Voluntary response on the Web • Telephone survey asking the respondents to respond with the number keys • Readers’ letters to the newspaper
Sampling Distribution • Under random sampling, the statistic changes as the sample varies • That is, the conclusion might differ from one sample to another • But if the samples are randomly drawn, we can predict the result with high probability
Example • Population: Hong Kong adult residents • Sample (random): 600 persons • Parameter: proportion of the population supporting one more public holiday • Statistic: proportion of the sample supporting it
Consequence of Random Sampling • If we draw 1000 samples (each of size 600) and compute the statistic for each sample, the histogram of these 1000 sample proportions is approximately a bell-shaped curve, the normal density
Normal and Probability • The normal density has 2 parameters: • Mean: the true proportion (p) • Variance: var=p(1-p)/n • Standard deviation (std)=sqrt(var) • The sample proportion from the one sample we draw falls in the interval (p-1.96 std, p+1.96 std) with probability .95
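As a worked example of these formulas, the sketch below computes the standard deviation and the 95% interval for n = 600, the sample size of the Hong Kong example. The value p = 0.5 is a hypothetical true proportion chosen only for illustration.

```python
import math

n = 600   # sample size from the Hong Kong example
p = 0.5   # hypothetical true proportion, for illustration only

var = p * (1 - p) / n                    # variance of the sample proportion
std = math.sqrt(var)                     # standard deviation
lower, upper = p - 1.96 * std, p + 1.96 * std

print(f"std = {std:.4f}, interval = ({lower:.3f}, {upper:.3f})")
```

Under this assumption the sample proportion lands within roughly four percentage points of p in about 95% of samples.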
Mean of normal=true parameter • If you draw a sample 1000 times, you have 1000 sample proportions. • The average of these 1000 sample proportions would be approximately the true proportion --- sample proportion is an unbiased estimate of the population proportion
Variance=p(1-p)/n • If the sampling is truly random, the variance of these 1000 sample proportions depends only on p (the parameter) and the sample size n. • So if I have only one sample that gives an accurate estimate of p, I can compute the variance of the sample proportion without drawing the other 999 samples
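Both claims (the average of the sample proportions is close to p, and their variance is close to p(1-p)/n) can be checked by simulation. This is a minimal sketch; the true proportion 0.5, the seed and the helper name `sample_proportion` are assumptions made for the illustration.

```python
import random
import statistics

def sample_proportion(rng, p, n):
    """Proportion of 'yes' answers in one random sample of size n."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(2103)            # arbitrary seed
p, n, repeats = 0.5, 600, 1000       # p is hypothetical; n and repeats follow the slides
proportions = [sample_proportion(rng, p, n) for _ in range(repeats)]

mean_of_proportions = statistics.mean(proportions)
variance_of_proportions = statistics.variance(proportions)
formula_variance = p * (1 - p) / n

print("mean of 1000 sample proportions:", round(mean_of_proportions, 4))
print("their variance:", variance_of_proportions, "formula:", formula_variance)
```

A histogram of `proportions` would show the bell shape mentioned earlier; plotting is left out to keep the sketch free of extra libraries.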
Intuition behind the formula p(1-p)/n • Symmetric about ½ • It is maximized at p=1/2 (the most uncertain case) • When p is closer to 0 or 1, i.e., things are more definite, the variance gets smaller
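These three properties follow from completing the square, as the short derivation below shows (standard algebra, not taken from the slides):

```latex
\[
p(1-p) \;=\; \frac14 - \Bigl(p - \frac12\Bigr)^{2}
\qquad\Longrightarrow\qquad
\frac{p(1-p)}{n} \;\le\; \frac{1}{4n},
\]
```

with equality only at p = 1/2. The squared term depends only on the distance of p from 1/2, which gives the symmetry, and it grows as p moves toward 0 or 1, which makes the variance shrink.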
Confidence Interval • Conversely, p will be covered by the interval (p̂-1.96 std, p̂+1.96 std), where p̂ is the sample proportion, about 95 times out of 100 such experiments. • Notice std=sqrt(p(1-p)/n); in practice p is unknown and is replaced by p̂
95% Confidence Interval • Applying the formula to 100 surveys, we obtain 100 different interval estimates • About 95 of these 100 intervals would contain the true p
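The coverage claim can be checked by simulating many surveys and counting how often the interval contains the true p. In this sketch the true proportion 0.4 and the seed are hypothetical, the interval is centred at the sample proportion with std estimated from it, and 2000 surveys are used instead of 100 so that the observed fraction is stable.

```python
import math
import random

def confidence_interval(successes, n):
    """95% interval centred at the sample proportion (std estimated from the sample)."""
    p_hat = successes / n
    std = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - 1.96 * std, p_hat + 1.96 * std

rng = random.Random(7)               # arbitrary seed
p, n, surveys = 0.4, 600, 2000       # p is hypothetical

covered = 0
for _ in range(surveys):
    successes = sum(rng.random() < p for _ in range(n))
    low, high = confidence_interval(successes, n)
    covered += low <= p <= high      # True counts as 1

coverage = covered / surveys
print("fraction of intervals containing p:", coverage)
```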
Opinion Polls • People may not give the true response: response error • People may not answer the questions: nonresponse error • Unit nonresponse (the person does not respond at all) • Item nonresponse (the person does not respond to some questions)
Response rate • If the response rate is less than 80%, we would doubt the validity of the inference
Election Polls • The respondents may not be voters • The respondent may not vote even if he/she has registered • The respondent may lie (response error)
Questionnaire • The way the questions are worded affects the responses (a well-known effect)
Other Data Collection Methods • Experimental Design • Observational Data (e.g. registry data)
How to know the effect of vaccine in preventing polio • We cannot give the vaccine to all children and compare the results with those of past years • We need two groups: a control group (no “real” treatment) and a treatment group (given the vaccine)
We should compare the two groups under “equal” conditions • People are different from each other • By randomly assigning participants to the two groups, we can make the two groups almost identical in their conditions, e.g., about the same on average
Design of an Experiment • To compare one treatment (A) with another treatment (B), we need to randomly assign each patient to the group receiving one of the treatments
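Random assignment as described above can be sketched as a shuffle followed by a split. The participants, their ages and the function name `randomize` are all hypothetical; the point is only that a background variable ends up about the same on average in both groups.

```python
import random

def randomize(participants, rng):
    """Randomly split the participants into two groups of (nearly) equal size."""
    shuffled = participants[:]       # copy, so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(11)              # arbitrary seed
# Hypothetical participants: (id, age in years)
participants = [(i, rng.randint(6, 9)) for i in range(1000)]

treatment, control = randomize(participants, rng)
mean_age_treatment = sum(age for _, age in treatment) / len(treatment)
mean_age_control = sum(age for _, age in control) / len(control)

print("mean age, treatment:", mean_age_treatment)
print("mean age, control:  ", mean_age_control)
```

The same balancing happens for variables nobody measured, which is what makes the later comparison fair.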
Some possible mistakes • Data---from hospital record • Death rates of surgical patients are different for operations with different anesthetics • Halothane (1.7%), Pentothal (1.7%), Cyclopropane (3.4%), Ether (1.9%) • Can we say that cyclopropane is more dangerous than the other anesthetics?
Answer • No! The patients in the worst condition were the ones receiving cyclopropane, so the groups are not comparable.
The vaccine can prevent Polio • 1956---USA---over two million children involved • Should they all receive the vaccine? • Should the boys receive the vaccine while the girls receive a placebo?
Placebo • In this case, the placebo is another kind of liquid, similar to the vaccine in appearance, that is injected into the children. • It is used so that all children receive the “same” treatment, so that a difference in the results cannot be explained as a psychological effect
Analysis • Let a and b be the numbers of children in the control group with and without polio, and c and d the same for the treatment group • The proportion of the control group having polio after ½ year: a/(a+b)=0.00057 • The proportion of the treatment group having polio after ½ year: c/(c+d)=0.00016 • The effect of treatment: • RD (risk difference)=a/(a+b) - c/(c+d) =0.00041
Formulation of the Hypotheses • Null Hypothesis: no difference in the proportions • Alternative Hypothesis: the two proportions are different
Analysis • We need to compare RD with its variation • That is, if we repeated the experiment, the results would differ. The variation of these results can be measured by their variance. • But we have only one experiment
Estimate the variation • If there is no effect of the vaccine, the true risk (probability) of getting polio is the same in both groups and is estimated by pr=(a+c)/(a+b+c+d)=0.00037 • Under the above hypothesis, the variance of RD is given by pr(1-pr) (1/(a+b)+1/(c+d)) • The standard deviation is 0.000061.
Contd. • Thus the ratio 0.00041/0.000061=6.76 measures the effect of vaccine. • Does 6.76 indicate a large, a small, or no effect? • We need a yardstick.
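The calculation can be followed step by step in Python. The slides give only the rates, so the four counts below are hypothetical, chosen to reproduce 0.00057 and 0.00016 with two groups of 200,000 children; RD is taken as control minus treatment and the variance uses the usual pooled formula pr(1-pr) times (1/(a+b)+1/(c+d)). With these made-up counts the ratio comes out near, but not exactly at, the slide's 6.76.

```python
import math

# Hypothetical counts, chosen only to reproduce the rates on the slides.
a, b = 114, 199886    # control group: with polio, without polio
c, d = 32, 199968     # treatment group: with polio, without polio

risk_control = a / (a + b)
risk_treatment = c / (c + d)
rd = risk_control - risk_treatment            # risk difference

pr = (a + c) / (a + b + c + d)                # common risk if the vaccine has no effect
std = math.sqrt(pr * (1 - pr) * (1 / (a + b) + 1 / (c + d)))
ratio = rd / std

print(f"RD = {rd:.5f}, pr = {pr:.5f}, std = {std:.6f}, ratio = {ratio:.2f}")
```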
Intuition • Thus the ratio (RD/std) measures the effect of the vaccine. • That is, if it is large in absolute value, the effect of vaccine is significant • How large is large?
Random assignment of patients to treatments • Suppose we do the experiment 1000 times and each time we calculate the ratio • We also assume that the effect of vaccine is zero. • Then we plot the histogram of the 1000 ratios. We find the histogram is close to a bell-shaped curve, the normal density curve.
Normality • We know that the ratio is approximately normal (mean 0, standard deviation 1) under the hypothesis of no effect, and we have observed 6.76. • We can compute the area to the right of 6.76, i.e. the probability that the ratio is larger than 6.76 under the hypothesis of no effect. We find the area is very small (6.9 x 10^{-12})
P value • The area corresponds to the probability of an event at least as extreme as the observed value, computed under the null hypothesis • The usual rule: if p-value <0.05, reject the null hypothesis • 0.05 can be interpreted as follows: when the null hypothesis is true, about 5 out of 100 experiments would wrongly reject it
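The thought experiment of repeating the trial 1000 times under zero effect can be imitated on a small scale. Everything numeric here is hypothetical (a common risk of 0.05 and groups of 1000, far smaller than the polio trial), and as a simplification each group is generated independently with the same risk rather than by re-randomizing one fixed set of patients.

```python
import math
import random
import statistics

def ratio(a, b, c, d):
    """RD divided by its standard deviation under the no-effect hypothesis."""
    n1, n2 = a + b, c + d
    rd = a / n1 - c / n2
    pr = (a + c) / (n1 + n2)
    std = math.sqrt(pr * (1 - pr) * (1 / n1 + 1 / n2))
    return rd / std

rng = random.Random(5)                               # arbitrary seed
risk, group_size, experiments = 0.05, 1000, 1000     # all hypothetical

ratios = []
for _ in range(experiments):
    a = sum(rng.random() < risk for _ in range(group_size))   # cases in the control group
    c = sum(rng.random() < risk for _ in range(group_size))   # cases in the treatment group
    ratios.append(ratio(a, group_size - a, c, group_size - c))

mean_ratio = statistics.mean(ratios)
sd_ratio = statistics.stdev(ratios)
beyond = sum(abs(r) > 1.96 for r in ratios) / experiments

print("mean:", round(mean_ratio, 3), "sd:", round(sd_ratio, 3))
print("fraction with |ratio| > 1.96:", beyond)
```

A mean near 0, a standard deviation near 1 and about 5% of ratios beyond 1.96 are what a standard normal curve predicts.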
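The tail area can be computed with the standard library alone. This sketch assumes the ratio follows a standard normal density under the hypothesis of no effect; `upper_tail` is a hypothetical helper name, and `math.erfc` is used because it stays accurate for very small tail areas.

```python
import math

def upper_tail(z):
    """Area under the standard normal density to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p_value = upper_tail(6.76)
print(f"area to the right of 6.76: {p_value:.1e}")
print(f"area to the right of 1.96: {upper_tail(1.96):.3f}")
```

The second line is a sanity check: the area beyond 1.96 is the familiar 2.5% in one tail.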
Chi Square Test-Another approach • We can apply the chi square test to the same data set. • The chi square test is used to test whether the proportion of getting polio is the same for the two groups (homogeneity). Equivalently, whether the occurrence of polio is independent of the treatment (group)
Analysis • The chi square test statistic is given by N(ad - bc)^2/((a+b)(a+c)(b+d)(c+d)) • N=a+b+c+d • When the statistic is large, the null hypothesis is likely to be wrong
Statistical Reasoning • The above statistic can be expressed as the sum, over the four cells of the table, of the quantities • (observed count - expected count)^2 divided by the expected count • Here the expected count means the average count under the hypothesis that the two groups are the same
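The two forms of the chi square statistic can be compared numerically. The counts below are hypothetical (the slides give only the rates) and the function names are illustrative; expected counts are computed as row total times column total divided by N, the usual choice under the hypothesis that the two groups are the same.

```python
def chi_square_shortcut(a, b, c, d):
    """N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)) for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

def chi_square_sum(a, b, c, d):
    """Sum over the four cells of (observed - expected)^2 / expected."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total

# Hypothetical counts reproducing the rates 0.00057 and 0.00016.
a, b, c, d = 114, 199886, 32, 199968
shortcut = chi_square_shortcut(a, b, c, d)
summed = chi_square_sum(a, b, c, d)
print("shortcut formula:", round(shortcut, 2), "cell-by-cell sum:", round(summed, 2))
```

For a 2x2 table this statistic equals the square of the ratio RD/std computed with the pooled risk, so the chi square test and the normal test lead to the same conclusion.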