STT430/530: Nonparametric Statistics

STT430/530: Nonparametric Statistics
Chapter 4: Other Single-sample Inferences
Dr. Cuixian Chen

Ch4: Other single-sample inferences
• In the previous chapters, we discussed inferences about centrality (median or mean) using a sample from a continuous distribution.
• Questions:
• Are the observations from a simple random sample (SRS)?
• Is there any trend or pattern in the dataset?
• Are the observations from a particular distribution or family of distributions, e.g. Normal or Uniform?
In R: x=c(1,2,50,88,90); plot.ecdf(x)

Ch4: Other single-sample inferences
• Are the observations from a particular distribution or family of distributions, e.g. Normal or Uniform?
• To address the question of distribution, the empirical distribution function (edf) is a widely used approach. Recall: the edf of a sample of size n is Sn(x) = (number of sample values <= x) / n.
• The idea is to compare the edf with the cdf (cumulative distribution function) of the distribution that we have in mind.
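A minimal sketch of the edf in R, using the five-value sample from the earlier slide. The helper name Sn_manual is illustrative only; ecdf() is the base-R function that builds the same step function.

```r
# Sample from the earlier slide
x <- c(1, 2, 50, 88, 90)
n <- length(x)

# edf by hand: proportion of sample values <= t
Sn_manual <- function(t) sum(x <= t) / n

# Built-in equivalent: ecdf() returns the edf as a step function
Sn <- ecdf(x)

Sn_manual(50)   # 3 of the 5 values are <= 50
Sn(50)
```

Both calls return the same proportion, and plot(Sn) draws the same picture as plot.ecdf(x).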

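Ch4: Review: OSI, the cdf
• Recall: a r.v. X follows the uniform distribution on the interval [a, b]: X ~ Unif(a, b).
• Probability density function (pdf) for Unif(a, b): f(x) = 1/(b - a) for a <= x <= b, and f(x) = 0 otherwise.
• Cumulative distribution function (cdf) for Unif(a, b): F(x) = 0 for x < a; F(x) = (x - a)/(b - a) for a <= x <= b; F(x) = 1 for x > b.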

Ch4: OSI, the Kolmogorov’s test
Kolmogorov’s test: tests whether observations are sampled from some specified continuous distribution.
Kolmogorov’s test statistic: the greatest discrepancy between the edf and the hypothesized cdf.
Eg1: The distances from one end at which each of 4 threads 6 cm long break when subjected to strain are given below (rounded to whole numbers and arranged in increasing order): 1, 3, 4, 5. Is it reasonable to suppose breaking points are uniformly distributed over (0, 6)?
Q1: Find Sn(xi) and draw a picture of it.
Q3: What is the largest deviation? Because Sn(x) is a step function, the maximum difference may not be in the set of F(xi)-Sn(xi). The maximum discrepancy may also occur at the previous step, so we also look at F(xi)-Sn(xi-1).
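As a minimal sketch of Eg1 in R (the variable names are illustrative, not from the slides), both sets of differences can be tabulated and the largest absolute discrepancy read off:

```r
# Eg1: four breaking points, hypothesized Unif(0, 6)
x <- sort(c(1, 3, 4, 5))
n <- length(x)

Fx      <- punif(x, min = 0, max = 6)  # hypothesized cdf F(xi)
Sn      <- (1:n) / n                   # edf at each point, Sn(xi)
Sn_prev <- (0:(n - 1)) / n             # edf just before each point, Sn(xi-1)

# Both sets of differences, side by side
tab <- cbind(x, Fx, Sn, d_at = Fx - Sn, d_prev = Fx - Sn_prev)
print(round(tab, 4))

# Kolmogorov statistic: largest absolute discrepancy over both columns
D <- max(abs(tab[, "d_at"]), abs(tab[, "d_prev"]))
D   # 0.25, attained at x = 3 through F(xi) - Sn(xi-1)
```

Here the maximum comes from the "previous step" column, which is why both sets of differences are needed.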

Ch4: OSI, the Kolmogorov’s test
Eg 4.2: The distances from one end at which each of 20 threads 6 cm long break when subjected to strain are given below: 0.6 0.8 1.1 1.2 1.4 1.7 1.8 1.9 2.2 2.4 2.5 2.9 3.1 3.4 3.4 3.9 4.4 4.9 5.2 5.9. Is it reasonable to suppose breaking points are uniformly distributed over (0, 6)?
Comments: If a sample comes from a population with cdf F(x), the step function S(x) should not depart markedly from F(x). The figure shows some large deviations, and intuition may suggest this is NOT a good fit. But from the published table, the exact p-value = 0.458, so there is NO evidence against H0.

Ch4: OSI, the Kolmogorov’s test
Kolmogorov-Smirnov table: one-sided on the left, two-sided on the right.

Ch4: OSI, the Kolmogorov’s test: is it working for Uniform?

Ch4: OSI, the Kolmogorov’s test in R
The test requires a special table or appropriate software for implementation.
In R: to test whether observations are from a specified distribution, use
ks.test(x, y, ..., alternative = c("two.sided", "less", "greater"), exact = NULL)
Eg4.1: The distances from one end at which each of 20 threads 6 cm long break when subjected to strain are given below: 0.6 0.8 1.1 1.2 1.4 1.7 1.8 1.9 2.2 2.4 2.5 2.9 3.1 3.4 3.4 3.9 4.4 4.9 5.2 5.9.
Q1: Is it reasonable to suppose breaking points are Uniform(0, 6)?
Q2: Is it reasonable to suppose breaking points are Normally distributed?
Solution: Which normal distribution? What would be the average and sd?
## Q1: Uniform(0, 6)
x <- c(0.6, 0.8, 1.1, 1.2, 1.4, 1.7, 1.8, 1.9, 2.2, 2.4, 2.5, 2.9, 3.1, 3.4, 3.4, 3.9, 4.4, 4.9, 5.2, 5.9)
ks.test(x, "punif", 0, 6)
## Q2: Normal, with mean and sd estimated from the sample
qqnorm(x); qqline(x)
ks.test(x, "pnorm", mean(x), sd(x))

Ch3: Location inference for single samples, inference about median
/* In SAS, use Proc Univariate */
data example3_1;
  input heartrate;
  cards;
73
82
87
68
106
60
97
;
proc univariate normal data=example3_1 mu0=70;
  var heartrate;
  histogram / normal (mu=est sigma=est);
  qqplot / normal (mu=est sigma=est);
run;
Note: The Normal option is used to provide a test for normality based on the Shapiro-Wilk test.

Ch4: OSI, the Kolmogorov’s test in R
• Eg4.1: The distances from one end at which each of 20 threads 6 cm long break when subjected to strain are given below:
• 0.6 0.8 1.1 1.2 1.4 1.7 1.8 1.9 2.2 2.4 2.5 2.9 3.1 3.4 3.4 3.9 4.4 4.9 5.2 5.9.
• Is it reasonable to suppose breaking points are Uniformly or Normally distributed?
• Solution: Which normal distribution? What would be the average and sd?
• Use Kolmogorov’s test, with sample mean and sample sd:
• ks.test(x,"punif",a,b);
• ks.test(x,"pnorm",mean(x),sd(x));
• Use Lilliefors’ test;
• Use the Shapiro-Wilk test:
• shapiro.test(rnorm(100, mean = 5, sd = 3))
• shapiro.test(runif(100, min = 2, max = 4))
x<-c(0.6, 0.8, 1.1, 1.2, 1.4, 1.7, 1.8, 1.9, 2.2, 2.4, 2.5, 2.9, 3.1, 3.4, 3.4, 3.9, 4.4, 4.9, 5.2, 5.9);
qqnorm(x); qqline(x);
ks.test(x,"pnorm",mean(x),sd(x));
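The slide lists Lilliefors’ test without code. The idea behind it is that, once the mean and sd are estimated from the same sample, the usual Kolmogorov table no longer applies and the null distribution of the statistic has to be re-derived. The following is a base-R Monte Carlo sketch of that idea, not the tabulated Lilliefors procedure itself; the names ks_norm_stat and n_sim, the seed, and the number of simulations are all assumptions made for illustration.

```r
x <- c(0.6, 0.8, 1.1, 1.2, 1.4, 1.7, 1.8, 1.9, 2.2, 2.4,
       2.5, 2.9, 3.1, 3.4, 3.4, 3.9, 4.4, 4.9, 5.2, 5.9)

# KS distance between the edf of v and a Normal cdf fitted to v itself
ks_norm_stat <- function(v) {
  v  <- sort(v)
  n  <- length(v)
  Fv <- pnorm(v, mean = mean(v), sd = sd(v))
  max((1:n) / n - Fv, Fv - (0:(n - 1)) / n)
}

set.seed(430)      # arbitrary seed, for reproducibility only
n_sim <- 2000      # number of simulated Normal samples (assumed, adjustable)

D_obs <- ks_norm_stat(x)
# The statistic does not depend on the true mean and sd, so N(0, 1) samples suffice
D_sim <- replicate(n_sim, ks_norm_stat(rnorm(length(x))))

p_mc <- mean(D_sim >= D_obs)   # Monte Carlo p-value
c(D = D_obs, p_value = p_mc)
```

The simulated p-value is typically smaller than the one from ks.test(x, "pnorm", mean(x), sd(x)), because that call treats the estimated mean and sd as if they had been specified in advance.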

Ch4: OSI, the Kolmogorov’s test in R (Example 4.4)
x<-c(11,13,14,22,29,30,41,41,52,55,56,59,65,65,66,74,74,75,77,81,82,82,82,82,83,85,85,87,87,88);
qqnorm(x); qqline(x);
ks.test(x,"pnorm",mean(x),sd(x));

Ch4: OSI, inferences for Dichotomous data
Def: The binomial distribution is relevant to certain counts associated with only two possible outcomes, often referred to as dichotomous data.
Example 4.6
Q: Find both Exact and Asymptotic p-values for testing a population proportion. H0: p<=0.1 vs. Ha: p>0.1.
binom.test(3,20,p=0.1,alternative="greater"); ## default: p=0.5.
Q: What if "greater" is not specified? Default = "two.sided".
Large sample asymptotic distribution: under H0 with p = p0, Z = (X - n*p0) / sqrt(n*p0*(1 - p0)) is approximately standard Normal.
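A short sketch comparing the exact and asymptotic p-values for Example 4.6, assuming the usual Normal approximation Z = (X - n*p0) / sqrt(n*p0*(1 - p0)). The variable names and the continuity-corrected variant are illustrative additions, not part of the slide.

```r
x  <- 3     # observed count
n  <- 20    # number of trials
p0 <- 0.1   # hypothesized proportion under H0

# Exact p-value: P(X >= 3) when X ~ Bin(20, 0.1)
p_exact  <- binom.test(x, n, p = p0, alternative = "greater")$p.value
p_direct <- pbinom(x - 1, n, p0, lower.tail = FALSE)   # same tail, computed directly

# Asymptotic p-value from the Normal approximation
z      <- (x - n * p0) / sqrt(n * p0 * (1 - p0))
p_asym <- pnorm(z, lower.tail = FALSE)

# Same, with a continuity correction of 0.5 toward the null mean
z_cc     <- (x - 0.5 - n * p0) / sqrt(n * p0 * (1 - p0))
p_asymcc <- pnorm(z_cc, lower.tail = FALSE)

round(c(exact = p_exact, direct = p_direct,
        asymptotic = p_asym, corrected = p_asymcc), 4)
```

With n*p0 = 2 the Normal approximation is rough, so the exact value is the more trustworthy one in this example.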

Ch4: OSI, inferences for Dichotomous data Example 4.11 For Exact p-value: binom.test(3,20)

Ch4: OSI, inferences for Dichotomous data
Example 4.12: A central examining body publishes the information that ‘three-quarters of the candidates taking a mathematics paper achieved a mark of 40 or more’ (i.e. the first population quartile is 40). One school entered 32 candidates for this paper, of whom 13 scored less than 40. The president of the Parents’ Association argues that the school’s performance is below national standards. The headmaster counters by claiming that in a random sample of 32 candidates it is quite likely that 13 would score less than the lower quartile mark, even though 8 out of 32 is the expected proportion. Is his assertion justified?
Q: Find both Exact and Asymptotic p-values for testing the population proportion.
Figure panels: X~Bin(32, 0.5), symmetric; X~Bin(32, 0.25), skewed to the right.
For Exact p-value: binom.test(13,32, p=0.25) #Exact p-value = 0.06291
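Because Bin(32, 0.25) is skewed, the two-sided exact p-value is not simply twice one tail. A sketch of one common convention, which as far as I know is the one binom.test follows: add up the probabilities of all outcomes that are no more likely than the observed one. The variable names and the small tolerance are assumptions for illustration.

```r
n     <- 32     # candidates entered by the school
p0    <- 0.25   # P(score < 40) under H0: first quartile is 40
x_obs <- 13     # candidates scoring below 40

# Two-sided exact p-value: add up all outcomes no more likely than the observed one
probs  <- dbinom(0:n, n, p0)
cutoff <- dbinom(x_obs, n, p0) * (1 + 1e-7)   # small tolerance for floating-point ties
p_two  <- sum(probs[probs <= cutoff])

p_bt <- binom.test(x_obs, n, p = p0)$p.value
c(by_hand = p_two, binom.test = p_bt)
```

Both values agree with the 0.06291 quoted on the slide; the sum picks up 0 to 3 in the lower tail and 13 to 32 in the upper tail.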

Ch4: OSI, inferences for Dichotomous data
Conclusion. If we observe 13 minus signs we would not reject the hypothesis H0: first quartile is 40 at a conventional 5 per cent significance level. Nevertheless there is some evidence against this hypothesis, enough to worry many parents, and some may be reluctant to give the headmaster’s claim the benefit of the doubt until further evidence were available.
• Comments:
• If we use formal significance levels, non-rejection of a hypothesis does not prove it true. It is only a statement that evidence to date is not sufficient to reject it. This may simply be because our sample is too small.
• We used a two-tail test. A one-tail test would not be justified unless we had information indicating the school’s performance could not be better than the national norm. For example, if most schools devoted three periods per week to the subject but the school in question only devoted two, we might argue that lack of tuition could only depress performance.
• The headmaster’s claim said ‘if one took a random sample’. Pupils from a single school are in no sense a random sample from all examination candidates. Our test only establishes that results for this particular school are not too strongly out of line.

Ch4: OSI, §4.5, A run test for randomness
Example 4.14: If the outcomes, in order, of a computer process that purports to simulate 20 tosses of a coin were
(4.5) HHHHHTTTTTTTTTTHHHHH; or
(4.6) HTHTHTHTHTHTHTHTHTHT; or
(4.7) HHTHTTTHTHHTHTHHHTTH.
Q: Do we have reason to doubt that the tosses were random/independent?
Define: A run is a sequence of one or more heads or tails. We consider only a test based on the number of runs, R, in a sequence of N ordered observations.
Define: m are of one kind (e.g. H), and n = N - m are of another kind (e.g. T).
Q: Find the values of N, m, n, and R for Example 4.14 above.
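A quick way to check the answers to the last question in R. This is only a sketch: run_summary is an illustrative helper, and rle() is the base-R run-length encoder.

```r
seqs <- c("4.5" = "HHHHHTTTTTTTTTTHHHHH",
          "4.6" = "HTHTHTHTHTHTHTHTHTHT",
          "4.7" = "HHTHTTTHTHHTHTHHHTTH")

# For one sequence, return N (length), m (# of H), n (# of T), and R (# of runs)
run_summary <- function(s) {
  tosses <- strsplit(s, "")[[1]]
  c(N = length(tosses),
    m = sum(tosses == "H"),
    n = sum(tosses == "T"),
    R = length(rle(tosses)$lengths))   # rle() collapses consecutive repeats
}

res <- t(sapply(seqs, run_summary))
res
```

Sequences (4.5) and (4.6) have m = n = 10 with R = 3 and R = 20, while (4.7) has m = 11, n = 9 and R = 13.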

Ch4: OSI, §4.5, A run test for randomness
If the outcomes, in order, of a computer process that purports to simulate 20 tosses of a coin were
(4.5) HHHHHTTTTTTTTTTHHHHH; or
(4.6) HTHTHTHTHTHTHTHTHTHT; or
(4.7) HHTHTTTHTHHTHTHHHTTH.
Q: Do we have reason to doubt that the tosses were random/independent?
• We consider only a test based on the r.v. R, the number of runs in a sequence of N ordered observations, of which
• m are of one kind (e.g. H) and n = N - m are of another kind (e.g. T).
• We reject the hypothesis that the outcomes are independent or random if we observe too few or too many runs.
• The random variable R specifies the number of runs; r is a realization of R.
• We consider separately the cases r odd and r even.
• For r odd, we set r = 2s + 1, and P(R = 2s + 1) = [C(m-1, s-1) C(n-1, s) + C(m-1, s) C(n-1, s-1)] / C(N, m).
• For r even, we set r = 2s, and P(R = 2s) = 2 C(m-1, s-1) C(n-1, s-1) / C(N, m).
Here C(a, b) denotes the binomial coefficient, choose(a, b) in R.

Ch4: OSI, A run test for randomness, the asymptotics
The test for randomness is based on the relevant tail probabilities associated with small and large numbers of runs.
Example 4.14: If the outcomes, in order, of a computer process that purports to simulate 20 tosses of a coin were (4.5) HHHHHTTTTTTTTTTHHHHH; or (4.6) HTHTHTHTHTHTHTHTHTHT; or (4.7) HHTHTTTHTHHTHTHHHTTH.
Q: Do we have reason to doubt that the tosses were random? [R code on the next slide]
In R: choose(n,k) computes the binomial coefficient.
Asymptotic Run Test: Z = (R - E(R)) / sqrt(Var(R)), with E(R) = 1 + 2mn/N and Var(R) = 2mn(2mn - N) / [N^2 (N - 1)], where Z has a standard Normal distribution. The approximation is improved by the usual numerator continuity correction, i.e. adding 0.5 if R < E(R) and subtracting 0.5 if R > E(R) (the same idea as before: move the statistic closer to the center).
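A sketch of the asymptotic run test with the continuity correction, assuming the usual moments E(R) = 1 + 2mn/N and Var(R) = 2mn(2mn - N) / [N^2 (N - 1)]. The function name run_test_asym is illustrative, and the values of R, m and n are counted directly from the three sequences.

```r
# Asymptotic run test with continuity correction.
# R = observed runs, m and n = counts of the two kinds of outcome.
run_test_asym <- function(R, m, n) {
  N   <- m + n
  ER  <- 1 + 2 * m * n / N
  VR  <- 2 * m * n * (2 * m * n - N) / (N^2 * (N - 1))
  adj <- if (R < ER) 0.5 else if (R > ER) -0.5 else 0   # move R toward E(R)
  z   <- (R + adj - ER) / sqrt(VR)
  c(ER = ER, VR = VR, z = z, p_two_sided = 2 * pnorm(-abs(z)))
}

# Sequences (4.5), (4.6), (4.7): R = 3, 20, 13
rbind("4.5" = run_test_asym(3, 10, 10),
      "4.6" = run_test_asym(20, 10, 10),
      "4.7" = run_test_asym(13, 11, 9))
```

Sequences (4.5) and (4.6) give |z| above 3 (too few and too many runs respectively), while (4.7) gives a z well inside the usual range.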

Ch4: OSI, A run test for randomness, the asymptotic
## Chapter 4: Find the probabilities for Run test in R ##
N=20; m=10; n=10;
PROB=NULL;
for (r in 1:20) {
  if ((r %% 2) == 0) {   ## r is even: r = 2s
    s=r/2;
    PROB= c(PROB, 2*choose(m-1, s-1)*choose(n-1,s-1)/choose(N, m));
  }
  if ((r %% 2) == 1) {   ## r is odd: r = 2s + 1
    s=(r-1)/2;
    PROB= c(PROB, (choose(m-1, s-1)*choose(n-1,s)+choose(m-1, s)*choose(n-1,s-1))/choose(N, m));
  }
}
print(PROB)
RUN=1:20;
rbind(RUN, PROB)
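The probabilities above can be turned into exact tail p-values. A sketch using the same formulas wrapped in a function; the names run_pmf, p_few and p_many are illustrative, and m = n = 10 applies to sequences (4.5) and (4.6).

```r
# P(R = r) for m outcomes of one kind and n of the other (same formulas as the loop above)
run_pmf <- function(r, m, n) {
  N <- m + n
  if (r %% 2 == 0) {
    s <- r / 2
    2 * choose(m - 1, s - 1) * choose(n - 1, s - 1) / choose(N, m)
  } else {
    s <- (r - 1) / 2
    (choose(m - 1, s - 1) * choose(n - 1, s) +
       choose(m - 1, s) * choose(n - 1, s - 1)) / choose(N, m)
  }
}

m <- 10; n <- 10
r_vals <- 2:(m + n)                 # with both kinds present, R is at least 2
pmf <- sapply(r_vals, run_pmf, m = m, n = n)
sum(pmf)                            # sanity check: probabilities add to 1

p_few  <- sum(pmf[r_vals <= 3])     # P(R <= 3), sequence (4.5)
p_many <- sum(pmf[r_vals >= 20])    # P(R >= 20), sequence (4.6)
c(few_runs = p_few, many_runs = p_many)
```

Both tail probabilities are tiny (20 and 2 arrangements out of choose(20, 10) = 184756), so both sequences are strong evidence against randomness.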

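Summary: A run test for randomness, Exact or Asymptotic p-value
• Exact Run Test: with r = 2s + 1 (odd) or r = 2s (even), P(R = 2s + 1) = [C(m-1, s-1) C(n-1, s) + C(m-1, s) C(n-1, s-1)] / C(N, m) and P(R = 2s) = 2 C(m-1, s-1) C(n-1, s-1) / C(N, m); the p-value is the sum of these probabilities over the relevant tail(s).
• Asymptotic Run Test: Z = (R - E(R)) / sqrt(Var(R)), with E(R) = 1 + 2mn/N and Var(R) = 2mn(2mn - N) / [N^2 (N - 1)], where Z has a standard Normal distribution. The approximation is improved by the usual numerator continuity correction, i.e. adding 0.5 if R < E(R) and subtracting 0.5 if R > E(R).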

Ch4: OSI, Angular data, Hodges-Ajne Test
1. In some investigations measurements are made on directions, e.g. wind directions, the bearings at which released pigeons disappear over the horizon, the successive stopping positions of a roulette wheel, the time of day at which babies are born in a large hospital, or the days during the year on which new cases of leukemia are diagnosed in a certain region.
2. These are called angular measurements. The names circular or directional measurements are also used.
For angular data, the Hodges-Ajne test is used to investigate whether a sample of n observations on a circle could arise from a uniformly distributed population.
H0: a sample of n observations on the circle could arise from a uniformly distributed population;
Ha: in the population, observations are more concentrated within a particular arc of the circumference; outliers may nevertheless occur well away from this arc.

Ch4: OSI, Angular data, Hodges-Ajne Test
Example 4.16: A midwife recorded the times of birth for twelve consecutive home deliveries. She was interested in whether births tended to occur at particular times of the day. The times (rearranged in order throughout the day) were 0100, 0300, 0420, 0500, 0540, 0620, 0640, 0700, 0940, 1100, 1200, 1720. Since on a 24-hour circular clock one hour corresponds to 360/24 = 15 degrees, the successive angles on the circle (assuming midnight corresponds to 0°) are 15°, 45°, 65°, 75°, 85°, 95°, 100°, 105°, 145°, 165°, 180°, 260°. We test the hypothesis H0 that the times of birth have a uniform distribution around the circle.
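The conversion from clock times to angles can be sketched in R as follows; the variable names are illustrative.

```r
# Birth times from Example 4.16, as 24-hour clock strings
times <- c("0100", "0300", "0420", "0500", "0540", "0620",
           "0640", "0700", "0940", "1100", "1200", "1720")

hours   <- as.numeric(substr(times, 1, 2))
minutes <- as.numeric(substr(times, 3, 4))

# 24 hours = 360 degrees, so 1 hour = 15 degrees and 1 minute = 0.25 degrees
angles <- hours * 15 + minutes * 0.25
angles
```

This reproduces the twelve angles listed in the example.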

Ch4: OSI, Angular data, Hodges-Ajne Test Example 4.16: 15°, 45°, 65°, 75°, 85°, 95°, 100°, 105°, 145°, 165°, 180°, 260°. We test the hypothesis H0 that the times of birth have a uniform distribution around the circle. To carry out this test, a straight line is drawn through the centre of the circle; this will divide the observations into two groups. The line is rotated about the centre to a position at which there is a minimum possible number of points, m, on one side of this line.
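A sketch of the rotating-line step in R. Finding the minimum count m on one side of a diameter is the same as finding the fullest half-circle, and it is enough to try half-circles that start at each data point. The p-value line assumes the tail formula P = (n - 2m) C(n, m) / 2^(n-1) for m < n/3, which is not stated on the slide and should be checked against the textbook; the function name hodges_ajne is illustrative.

```r
angles <- c(15, 45, 65, 75, 85, 95, 100, 105, 145, 165, 180, 260)

hodges_ajne <- function(deg) {
  n <- length(deg)
  # Count points in the half-circle [start, start + 180) for each observed start;
  # the fullest half-circle can always be taken to begin at a data point
  in_half <- sapply(deg, function(start) sum((deg - start) %% 360 < 180))
  m <- n - max(in_half)              # fewest points on one side of a diameter
  # Assumed tail formula (not on the slide), used only when m < n/3
  p <- if (m < n / 3) (n - 2 * m) * choose(n, m) / 2^(n - 1) else NA_real_
  list(n = n, m = m, p_value = p)
}

hodges_ajne(angles)
```

For these data the half-circle starting at 15° contains every point except 260°, so m = 1.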

Ch4: OSI, Angular data, Hodges-Ajne Test 2nd example
Example 4.17. Smeeton and Wilkinson (1988) give data for a female psychiatric patient who repeatedly attempted to commit suicide. There was evidence to suggest that these attempts occurred during one particular part of the year. Records showed that attempts had occurred on 2 June 1980, 3 June 1980, 8 June 1980, 18 June 1980, 4 July 1980, 5 June 1981, 6 June 1981 and 31 July 1981. The successive angles on the circle (assuming 0° is the start of the year) are 151°, 152°, 157°, 167°, 182°, 154°, 155°, 209°. We test the null hypothesis H0 that the dates of the suicide attempts have a uniform distribution.
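For this example all eight angles lie within an arc of 58° (from 151° to 209°), so a diameter can be placed with no points on one side. A self-contained sketch of that check, again assuming the tail formula P = (n - 2m) C(n, m) / 2^(n-1) for m < n/3, which does not appear on the slide:

```r
angles <- c(151, 152, 157, 167, 182, 154, 155, 209)
n <- length(angles)

# Fewest points on one side of any diameter:
# n minus the largest count found in a half-circle starting at a data point
in_half <- sapply(angles, function(start) sum((angles - start) %% 360 < 180))
m <- n - max(in_half)

# Assumed tail formula (not on the slide); applicable here because m < n/3
p_value <- (n - 2 * m) * choose(n, m) / 2^(n - 1)
c(m = m, p_value = p_value)
```

With m = 0 the formula reduces to n / 2^(n-1) = 8/128.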

Homework problems:
• KS test: 4.3, 4.4
• Dichotomous data: 4.6, 4.7, 4.15
• Test of randomness: the results of a football team in a series of 20 matches are WWLLLLLLLLLLWWWLLLLL; use both the exact test and the asymptotic test to look for evidence of clustering (random or not?)
• Angular data: 4.13, 4.14

Ch4: OSI, A run test for randomness, the asymptotics
• In the lawstat package: runs.test(y, plot.it = FALSE, alternative = c("two.sided", "positive.correlated", "negative.correlated"))
library(lawstat)
x=c(1,1,0,1,0,0,0,1,0,1,1,0,1,0,1,1,1,0,0)
runs.test(x, plot.it = TRUE, alternative = "two.sided")
