Experimental Design & Analysis
Nonparametric Methods
April 17, 2007
DOCTORAL SEMINAR, SPRING SEMESTER 2007

Nonparametric Tests
Occasions for use?
The response variable (residual) cannot logically be assumed to be normally distributed, a key assumption of ANOVA models
Limited data
Treatment Sum of Ranks
No weeds 23
Weeds 13
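$$\mu = \frac{n_1(N+1)}{2} = \frac{4(8+1)}{2} = \frac{4 \cdot 9}{2} = \frac{36}{2} = 18$$
$$\sigma = \sqrt{\frac{n_1 n_2 (N+1)}{12}} = \sqrt{\frac{(4)(4)(8+1)}{12}} = \sqrt{\frac{144}{12}} = \sqrt{12} = 3.464$$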
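As a hedged illustration of how the expected rank sum (18) and its standard deviation (3.464) are typically used, the observed rank sum of 23 for the no-weeds group can be standardized. The symbols W and z are introduced here for illustration and do not appear on the slide. With only four observations per group the normal approximation is rough, so an exact test would usually be preferred.

```latex
z = \frac{W - \mu}{\sigma} = \frac{23 - 18}{3.464} \approx 1.44
```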
Wilcoxon Rank Sum Test
proc univariate data = MYDATA;
var CORN;
run;
data MYDATA; set MYDATA;
diff = STORY1 - STORY2;
run;
proc univariate data = MYDATA;
var diff; run;
proc npar1way data = mydata wilcoxon;
class female;
var write;
run;
proc npar1way data = mydata;
class prog;
var write;
run;
proc freq data = mydata;
tables school*gender / chisq;
run;
proc freq data = mydata;
tables school*race / fisher;
run;
proc logistic data = mydata desc;
class prog schtyp;
model female = prog schtyp prog*schtyp / expb;
run;
proc corr data = mydata spearman;
var read write;
run;
Parametric tests are often preferred because:
PROC STANDARD is used in situations in which you need to adjust and rescale observations to have a different mean and standard deviation.
Example: Midterm test scores are to be rescaled to have a mean of 75 and a standard deviation of 10
data midterm;
input grade @@;
datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
proc univariate data=midterm plot;
var grade;
run;
N 60
Mean 69.06667
Std Dev 11.60489
 Stem Leaf           #  Boxplot
    9 5              1     |
    9 00             2     |
    8 9              1     |
    8 000124444      9     |
    7 7779           4  +-----+
    7 00011123334   11  |     |
    6 6666888899    10  *--+--*
    6 0012222333444 13  +-----+
    5 5999           4     |
    5 2              1     |
    4 9              1     |
    4 033            3     |
      ----+----+----+----+
Multiply Stem.Leaf by 10**+1
proc standard data=midterm out=adjusted mean=75 std=10;
var grade;
run;
The new data set, ADJUSTED, has one variable, also called GRADE, which has a mean of 75 and a standard deviation of 10.
For example, the grade of 95 in the MIDTERM dataset becomes a grade of about 97.3 in the ADJUSTED dataset.
(10/11.60489)(95 - 69.06667) + 75 = 0.8617(25.93333) + 75 ≈ 97.3
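The same rescaling can be checked by hand in a DATA step. The following is only a sketch: the data set names STATS and MANUAL and the variables GMEAN, GSTD, and ADJGRADE are illustrative names that do not appear in the original slides.

```sas
/* Save the sample mean and standard deviation of GRADE */
proc means data=midterm noprint;
  var grade;
  output out=stats mean=gmean std=gstd;
run;

/* Rescale each grade to mean 75 and standard deviation 10 */
data manual;
  if _n_ = 1 then set stats(keep=gmean gstd);
  set midterm;
  adjgrade = 75 + 10 * (grade - gmean) / gstd;
run;

proc print data=manual;
  var grade adjgrade;
run;
```

If the sketch is right, ADJGRADE should agree with the GRADE values that PROC STANDARD writes to ADJUSTED.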
The STDIZE procedure standardizes one or more numeric variables in a SAS data set by subtracting a location measure and dividing by a scale measure. A variety of location and scale measures are provided, including estimates that are resistant to outliers and clustering. Some of the well-known standardization methods, such as mean, median, std, range, Huber's estimate, Tukey's biweight estimate, and Andrews' wave estimate, are available in the STDIZE procedure.
options ls=78 ps=200 nocenter nodate;
data midterm;
input grade @@;
datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
run;
proc stdize data=midterm
out=adjusted method=std add=75 mult=10 ;
var grade;
run;
proc print data=adjusted;
run;
proc univariate data=adjusted;
var grade;
run;
What is the nonparametric equivalent of the one-sample normal test or one-sample t test for the hypothesis that the true mean is equal to a specified value?
Sign Test
Wilcoxon Signed Rank Test
Here η is the (unknown) median of the population.
PROC UNIVARIATE in SAS automatically performs three tests of location but it does so by testing if the "typical" value is zero.
T Test
Sign Test
Wilcoxon Signed Rank Test
data midterm;
input grade @@;
/*
If GRADE is typically 75,
then GRADE-75 should typically be zero. */
diff=grade-75;
label diff='Points above 75';
datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
run;
proc univariate data=midterm;
var diff;
run;
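An alternative to creating a difference variable is to give PROC UNIVARIATE the null value directly. This is a sketch that assumes a SAS release in which PROC UNIVARIATE supports the MU0= option; the test statistics should match those obtained from DIFF.

```sas
/* MU0= sets the null value for the t, sign, and signed rank tests */
proc univariate data=midterm mu0=75;
  var grade;
run;
```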
Variable=DIFF        Points above 75

                 Moments

N                60  Sum Wgts         60
Mean       -5.93333  Sum            -356
Std Dev    11.60489  Variance   134.6734
Skewness   -0.17772  Kurtosis   0.250711
USS           10058  CSS        7945.733
CV         -195.588  Std Mean   1.498185
T:Mean=0   -3.96035  Pr>|T|       0.0002
Num ^= 0         60  Num > 0          17
M(Sign)         -13  Pr>=|M|      0.0011
Sgn Rank     -481.5  Pr>=|S|      0.0002
T:Mean=0 is the t test (μ = 75)
M(Sign) is the sign test (η = 75)
Sgn Rank is the Wilcoxon signed rank test (η = 75)
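Two of the statistics in the PROC UNIVARIATE listing can be reproduced by hand from the other entries, which may help in reading the output. The symbols below are notation introduced here: the mean and standard deviation of DIFF, and the counts of positive and negative differences (17 positive out of 60 nonzero values, so 43 negative).

```latex
t = \frac{\bar{d}}{s/\sqrt{n}} = \frac{-5.93333}{11.60489/\sqrt{60}} = \frac{-5.93333}{1.498185} \approx -3.960

M = \frac{n^{+} - n^{-}}{2} = \frac{17 - 43}{2} = -13
```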
Many nonparametric procedures rely on the relative ordering, or ranking, of the observations.
Suppose a new Florida resident wants to see if the prices of houses in Gainesville are higher than the prices of homes in smaller towns near Gainesville. She collects a random sample of 10 prices for houses on the market in Gainesville and 10 prices of homes in other cities in Alachua County.
data homes;
input location $ price @@;
datalines;
Gville 74500 Gville 269000 Gville 94500
Gville 86900 Gville 99900
Gville 91500 Gville 72000 Gville 78000
Gville 289000 Gville 114000
County 32000 County 125000 County 105900
County 120000 County 139900
County 72000 County 85000 County 74500
County 199500 County 2200000
;
run;
Influential observation: the County home priced at $2,200,000.
Does one location tend to have higher average prices than the other?
Does one location tend to have higher-ranked prices than the other?
More higher-ranking homes in the county than expected at random.
OBS  LOCATION       PRICE  RANKCOST
  1  Gville       $74,500       4.5
  2  Gville      $269,000      18.0
  3  Gville       $94,500      10.0
  4  Gville       $86,900       8.0
  5  Gville       $99,900      11.0
  6  Gville       $91,500       9.0
  7  Gville       $70,000       2.0
  8  Gville       $78,000       6.0
  9  Gville      $289,000      19.0
 10  Gville      $114,000      13.0
 11  County       $32,000       1.0
 12  County      $125,000      15.0
 13  County      $105,900      12.0
 14  County      $120,000      14.0
 15  County      $139,900      16.0
 16  County       $72,000       3.0
 17  County       $85,000       7.0
 18  County       $74,500       4.5
 19  County      $199,500      17.0
 20  County    $2,200,000      20.0
How to get the ranks?
proc rank data=homes
out=rankdata
ties=mean;
var price;
ranks rankcost;
run;
proc print
data=rankdata;
format price dollar10.;
run;
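The TIES=MEAN option assigns tied observations the average of the ranks they would otherwise occupy. As a small worked case from the ranked listing, the two homes priced at $74,500 fall in the 4th and 5th positions of the sorted prices, so each receives the midrank (the symbol r is notation introduced here):

```latex
r = \frac{4 + 5}{2} = 4.5
```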
Another way to rank the data would be to create groups of least expensive, inexpensive, moderate, expensive, and very expensive price ranges. PROC RANK can do this with the GROUPS option.
OBS LOCATION PRICE PRICEGRP
1 Gville $74,500 1
2 Gville $269,000 4
3 Gville $94,500 2
4 Gville $86,900 1
5 Gville $99,900 2
6 Gville $91,500 2
7 Gville $72,000 0
8 Gville $78,000 1
9 Gville $289,000 4
10 Gville $114,000 3
11 County $32,000 0
12 County $125,000 3
13 County $105,900 2
14 County $120,000 3
15 County $139,900 3
16 County $72,000 0
17 County $85,000 1
18 County $74,500 1
19 County $199,500 4
20 County $2,200,000 4
proc rank data=homes
out=rankdata
groups=5;
var price;
ranks pricegrp;
run;
proc print
data=rankdata;
format price dollar10.;
run;
Grouping starts at 0.
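The group numbers can be traced back to the ranks. To the best of my understanding of the SAS documentation for the GROUPS= option, the group value is the integer part of rank times k divided by (n + 1), where k is the number of groups and n the number of nonmissing observations; treat the formula as an assumption to confirm against the documentation. With k = 5 and n = 20, it reproduces the listed groups for the $74,500 homes (rank 4.5) and the $269,000 home (rank 18):

```latex
g = \left\lfloor \frac{r \cdot k}{n + 1} \right\rfloor, \qquad
\left\lfloor \frac{4.5 \cdot 5}{21} \right\rfloor = \lfloor 1.07 \rfloor = 1, \qquad
\left\lfloor \frac{18 \cdot 5}{21} \right\rfloor = \lfloor 4.29 \rfloor = 4
```

Because the smallest ranks give values below 1, the lowest group is numbered 0.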
PROC RANK can be used to produce a better normal probability plot than the one produced by PROC UNIVARIATE.
We use PROC RANK to calculate the normal scores.
If the data are indeed normally distributed, the scores calculated by Blom's method (the NORMAL=BLOM option) should provide the best straight line.
Treat the 20 homes as a random sample of all homes for sale in Alachua County. We want to see whether price or log(price) more closely follows a normal distribution.
data homes;
set homes;
logprice=log(price);
run;
proc rank data=homes out=rankdata normal=blom;
var price logprice;
ranks norm1 norm2;
run;
proc plot data=rankdata;
plot price*norm1 logprice*norm2;
run;
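For reference, the NORMAL=BLOM scores are, as far as I recall from the SAS documentation, computed from the rank of each observation as shown below, where the symbol for the standard normal quantile function is notation introduced here. With n = 20, the largest observation (rank 20) receives a score of roughly 1.87.

```latex
z_i = \Phi^{-1}\!\left( \frac{r_i - 3/8}{n + 1/4} \right), \qquad
\Phi^{-1}\!\left( \frac{20 - 0.375}{20.25} \right) = \Phi^{-1}(0.9691) \approx 1.87
```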
[Plot of PRICE*NORM1. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: PRICE, 0 to 3,000,000. Horizontal axis: RANK FOR VARIABLE PRICE (normal score), -2 to 2. One observation lies just above 2,000,000; the remaining points cluster near the bottom of the plot.]
[Plot of LOGPRICE*NORM2. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: LOGPRICE, 10 to 16. Horizontal axis: RANK FOR VARIABLE LOGPRICE (normal score), -2 to 2.]
PROC NPAR1WAY performs tests for location and scale differences based on the following scores of a response variable: Wilcoxon, median, Van der Waerden, Savage, Siegel-Tukey, Ansari-Bradley, Klotz, and Mood scores. Additionally, PROC NPAR1WAY provides tests using the raw data as scores. When the data are classified into two samples, tests are based on simple linear rank statistics. When the data are classified into more than two samples, tests are based on one-way ANOVA statistics. Both asymptotic and exact p-values are available for these tests.
PROC NPAR1WAY also calculates the following empirical distribution function (EDF) statistics: the Kolmogorov-Smirnov statistic, the Cramer-von Mises statistic, and, when the data are classified into only two samples, the Kuiper statistic. These statistics test whether the distribution of a variable is the same across different groups.
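The exact p-values and EDF statistics described above can be requested roughly as follows. This is a sketch using the HOMES data from the earlier slides; exact computations are feasible here because the samples are small, but they can become slow as sample sizes grow.

```sas
proc npar1way data=homes wilcoxon edf;
  class location;
  var price;
  exact wilcoxon;   /* exact p-value for the Wilcoxon rank sum test */
run;
```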
proc npar1way wilcoxon data=homes;
class location;
var price; run;
          N P A R 1 W A Y   P R O C E D U R E

     Wilcoxon Scores (Rank Sums) for Variable PRICE
            Classified by Variable LOCATION

                  Sum of    Expected     Std Dev       Mean
LOCATION   N      Scores    Under H0    Under H0      Score
Gville    10  100.500000       105.0  13.2237824    10.0500
Other     10  109.500000       105.0  13.2237824    10.9500
       Average Scores Were Used for Ties

Wilcoxon 2-Sample Test (Normal Approximation)
(with Continuity Correction of .5)
S = 100.500   Z = -.302485   Prob > |Z| = 0.7623
T-Test Approx. Significance = 0.7656
Kruskal-Wallis Test (Chi-Square Approximation)
CHISQ = 0.11580   DF = 1   Prob > CHISQ = 0.7336
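The entries in the PROC NPAR1WAY listing can be traced by hand, which may make the output easier to read. The expected rank sum uses the same formula as in the weeds example; the standard deviation is taken from the listing because it includes a small correction for the tied prices. The symbols are notation introduced here.

```latex
\mu = \frac{n_1(N+1)}{2} = \frac{10(20+1)}{2} = 105

z = \frac{S - \mu + 0.5}{\sigma} = \frac{100.5 - 105 + 0.5}{13.2238} = \frac{-4}{13.2238} \approx -0.3025

\chi^2 = \left( \frac{S - \mu}{\sigma} \right)^2 = \left( \frac{-4.5}{13.2238} \right)^2 \approx 0.1158
```

The last line shows that, with two groups, the Kruskal-Wallis chi-square is the square of the rank sum z statistic without the continuity correction.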
When there are two factors with no interaction, as in a randomized complete block design, Friedman's chi-square test is a nonparametric test that can be used to examine treatment differences. Friedman's test is available in SAS under PROC FREQ, but it is fairly complicated to perform. PROC FREQ also offers versions of rank sum tests for ordinal data. See SAS documentation for details.
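As a rough sketch of the PROC FREQ approach, Friedman's test is usually obtained from the Cochran-Mantel-Haenszel statistics computed on within-block rank scores. The data set RCB and the variables BLOCK, TREATMENT, and RESPONSE are hypothetical names, assuming one row per block-treatment combination.

```sas
/* RCB is a hypothetical data set: one row per block-treatment cell */
proc freq data=rcb;
  tables block*treatment*response / cmh2 scores=rank noprint;
run;
```

In the resulting CMH table, the "Row Mean Scores Differ" line is the one generally read as Friedman's chi-square.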
The LIFETEST procedure can be used with data that may be right-censored to compute nonparametric estimates of the survival distribution and to compute rank tests for association of the response variable with other variables. The survival estimates are computed within defined strata levels, and the rank tests are pooled over the strata and are therefore adjusted for strata differences.
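A minimal sketch of such an analysis is shown here. The data set TRIAL and the variables MONTHS, CENSORED, CLINIC, and AGE are hypothetical and serve only to show where each piece of the description fits.

```sas
proc lifetest data=trial;
  time months*censored(1);   /* value 1 marks a right-censored time */
  strata clinic;             /* survival estimates within each stratum */
  test age;                  /* rank tests of association, pooled over strata */
run;
```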
The Pearson Product Moment correlation coefficient measures the strength of the tendency for two variables X and Y to follow a straight line.
Suppose we are more interested in measuring the tendency for X to increase or decrease with Y, without necessarily assuming a strictly linear relationship.
Twelve students take a test, and we want to see if the students who finished the test early had higher scores than those who finished later.
data students;
input order grade @@;
datalines;
1 90 2 74 3 76
4 60 5 68 6 86
7 92 8 60 9 78
10 70 11 78 12 64
;
run;
proc plot data=students;
plot grade*order;
run;
proc corr data=students
spearman kendall;
var grade;
with order;
run;
[Plot of GRADE*ORDER. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: GRADE, 60 to 100. Horizontal axis: ORDER, 1 to 12.]
Spearman Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12
            GRADE
ORDER    -0.17544
           0.5855
Kendall Tau b Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12
            GRADE
ORDER    -0.12309
           0.5815
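Both coefficients can be reproduced by hand from the STUDENTS data, which may clarify what they measure. The Spearman coefficient is the Pearson correlation of the ranks: with ORDER already ranked 1 to 12 and GRADE ranked using midranks for the two tied pairs (60, 60 and 78, 78), the sum of cross-products of centered ranks is -25 and the sums of squares are 143 and 142. For Kendall's tau-b there are 28 concordant and 36 discordant pairs among the 66 pairs, with 2 pairs tied on GRADE. The symbols are notation introduced here.

```latex
r_s = \frac{\sum (u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum (u_i - \bar{u})^2 \, \sum (v_i - \bar{v})^2}}
    = \frac{-25}{\sqrt{143 \cdot 142}} \approx -0.1754

\tau_b = \frac{C - D}{\sqrt{(n_0 - t_x)(n_0 - t_y)}}
       = \frac{28 - 36}{\sqrt{(66 - 0)(66 - 2)}} \approx -0.1231
```

Both are slightly negative, so later finishers tended to score a little lower, but neither is close to significant at N = 12.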