Experimental Design & Analysis

Nonparametric Methods
April 17, 2007
Doctoral Seminar, Spring Semester 2007

Nonparametric Tests
• Occasions for use?
• The response variable (residual) cannot logically be assumed to be normally distributed, a key assumption of ANOVA models
• Limited data
• Counts and ranks are the relevant units instead of means

Wilcoxon Rank Sum Test
• Nonparametric version of an independent samples (two-sample) t-test
• Example of corn yield as a function of weeding

Treatment    Sum of Ranks
No weeds     23
Weeds        13

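Wilcoxon Rank Sum Test
• Calculate the test statistic by calculating the mean μ and the standard deviation σ of the rank sum (here n1 = n2 = 4 and N = 8)

$$\mu = \frac{n_1(N+1)}{2} = \frac{4(8+1)}{2} = \frac{4 \cdot 9}{2} = \frac{36}{2} = 18$$

$$\sigma = \sqrt{\frac{n_1 n_2 (N+1)}{12}} = \sqrt{\frac{(4)(4)(8+1)}{12}} = \sqrt{\frac{144}{12}} = \sqrt{12} = 3.464$$

proc univariate data = MYDATA;
  var CORN;
run;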
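As a rough sketch of how μ and σ become a test statistic, the observed rank sum can be standardized and referred to the standard normal distribution. Using the "No weeds" rank sum (23) as the statistic, omitting a continuity correction, and the variable names below are all assumptions made for illustration. With only four plots per group the normal approximation is crude, so an exact test would usually be preferred.

```sas
/* Illustrative only: standardize the observed rank sum W using the
   null mean and standard deviation computed on the slide. */
data _null_;
  n1 = 4;                               /* plots in the "No weeds" group */
  n2 = 4;                               /* plots in the "Weeds" group    */
  N  = n1 + n2;
  W  = 23;                              /* observed rank sum, "No weeds" */
  mu    = n1 * (N + 1) / 2;             /* 18    */
  sigma = sqrt(n1 * n2 * (N + 1) / 12); /* 3.464 */
  z = (W - mu) / sigma;                 /* no continuity correction (assumption) */
  p = 2 * (1 - probnorm(abs(z)));       /* two-sided normal approximation */
  put mu= sigma= z= p=;
run;
```

Here z works out to (23 - 18)/3.464, or about 1.44.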

Wilcoxon Signed Rank Sum Test
• Nonparametric version of a paired samples t-test
• Study the difference between two variables (Story 1 vs. Story 2)
• A data step is necessary to create the difference of the two scores for each subject

data MYDATA;
  set MYDATA;
  diff = STORY1 - STORY2;
run;

proc univariate data = MYDATA;
  var diff;
run;

Wilcoxon Mann-Whitney Test
• Nonparametric version of the independent samples t-test; can be used when you do not assume that the dependent variable is a normally distributed interval variable
• Assume that the dependent variable is ordinal

proc npar1way data = mydata wilcoxon;
  class female;
  var write;
run;

Kruskal Wallis Test
• Used when you have one independent variable with two or more levels and an ordinal dependent variable
• Nonparametric version of ANOVA
• Generalized form of the Mann-Whitney test method, as it permits two or more groups

proc npar1way data = mydata;
  class prog;
  var write;
run;

Chi-Square Test
• Used when you want to see if there is a relationship between two categorical variables
• Chi-square test assumes that the expected value for each cell is 5 or higher
• If this assumption is not met, use Fisher's exact test
• In SAS, the chisq option is used on the tables statement to obtain the test statistic and p-value

proc freq data = mydata;
  tables school*gender / chisq;
run;
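One way to check the expected-count assumption before trusting the chi-square p-value is to ask PROC FREQ to print the expected cell frequencies. The sketch below uses a small made-up table; the data set name COUNTS, the variable N, and all the cell counts are hypothetical and are not taken from the seminar data.

```sas
/* Hypothetical 2x2 table entered as cell counts. */
data counts;
  input school $ gender $ n;
  datalines;
A F 10
A M 2
B F 3
B M 7
;
run;

proc freq data=counts;
  weight n;                                  /* each record is a cell count */
  tables school*gender / chisq expected;     /* EXPECTED prints expected frequencies */
run;
```

If any expected frequency falls below 5, as it does for some cells in this made-up table, Fisher's exact test is the safer choice.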

Fisher's Exact Test
• Used when you want to conduct a chi-square test, but one or more of your cells has an expected frequency of less than 5
• Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is

proc freq data = mydata;
  tables school*race / fisher;
run;

Factorial Logistic Regression
• Used when you have two or more categorical independent variables and a dichotomous dependent variable
• The desc option on the proc logistic statement is necessary so that SAS models the odds of being female (i.e., female = 1)
• The expb option on the model statement tells SAS to show the exponentiated coefficients (i.e., the odds ratios)

proc logistic data = mydata desc;
  class prog schtyp;
  model female = prog schtyp prog*schtyp / expb;
run;

Nonparametric Correlation
• Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal)
• The values of the variables are converted to ranks and then correlated
• The spearman option on the proc corr statement tells SAS to perform a Spearman rank correlation instead of a Pearson correlation

proc corr data = mydata spearman;
  var read write;
run;

Nonparametric Tests: Advantages & Shortcomings

Common Nonparametric Tests

Why Use Nonparametric Tests?
• When data are not normally distributed and the measurements at best contain rank-order information, computing the standard descriptive statistics (e.g., mean, standard deviation) is sometimes not the most informative way to summarize the data

Advantages of Nonparametrics
• Nonparametric tests make less stringent demands of the data (resistant to outliers, shape of distribution)
• Nonparametric procedures can sometimes be used to get a quick answer with little calculation
• Nonparametric methods provide an air of objectivity when there is no reliable (universally recognized) underlying scale for the original data

Why Not Use All the Time?
Parametric tests are often preferred because:
• They are robust to moderate departures from their assumptions
• They have greater power efficiency (greater power relative to the sample size)
• They provide unique information (e.g., the interaction in a factorial design)
• Parametric and nonparametric tests often address two different types of questions

Different Nonparametric Tests, Same Results?
• Different nonparametric tests may yield different results
• Advisable to run different nonparametric tests

Large Data Sets
• Nonparametric methods are most appropriate when the sample sizes are small
• When the data set is large it often makes little sense to use nonparametric statistics at all
• How large is large enough?

Small Data Sets
• What happens when you use a nonparametric test with data from a normal distribution?
• Greater incidence of Type II error
• The nonparametric tests lack statistical power with small samples

Shortcomings
• Nonparametric tests cannot give very significant results for very small samples, as all the possible rank sums are fairly likely
• They do not, without the addition of extra assumptions, give confidence intervals for the means or medians of the underlying distributions
• They assume that the data can be ordered; the power of the test is diminished if there are lots of ties

Standardization
Used in situations in which you need to adjust and rescale observations to have a different mean and standard deviation.
Example: Midterm test scores are to be rescaled to have a mean of 75 and a standard deviation of 10.

data midterm;
  input grade @@;
  datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;

proc univariate data=midterm plot;
  var grade;
run;

Moments
N          60
Mean       69.06667
Std Dev    11.60489

Stem Leaf             #  Boxplot
   9 5                1     |
   9 00               2     |
   8 9                1     |
   8 000124444        9     |
   7 7779             4  +-----+
   7 00011123334     11  |     |
   6 6666888899      10  *--+--*
   6 0012222333444   13  +-----+
   5 5999             4     |
   5 2                1     |
   4 9                1     |
   4 033              3     |
     ----+----+----+----+
Multiply Stem.Leaf by 10**+1

proc standard data=midterm out=adjusted mean=75 std=10;
  var grade;
run;

The new data set, ADJUSTED, has one variable, also called GRADE, which has a mean of 75 and a standard deviation of 10. For example, the grade of 95 in the MIDTERM data set becomes a grade of about 97.3 in the ADJUSTED data set, because the scale factor is 10/11.6 = 0.86:

0.86(95 - 69.1) + 75 ≈ 97.3

The STDIZE procedure standardizes one or more numeric variables in a SAS data set by subtracting a location measure and dividing by a scale measure. A variety of location and scale measures are provided, including estimates that are resistant to outliers and clustering. Some of the well-known standardization methods, such as mean, median, std, range, Huber's estimate, Tukey's biweight estimate, and Andrews' wave estimate, are available in the STDIZE procedure.
• In addition, you can multiply each standardized value by a constant and add a constant. Thus, the final output value is

  result = add + multiply × [(original - location) / scale]

  where
  result = final output value
  add = constant to add (ADD= option)
  multiply = constant to multiply by (MULT= option)
  original = original input value
  location = location measure
  scale = scale measure
• PROC STDIZE can also find quantiles in one pass of the data, a capability that is especially useful for very large data sets. With such data sets, the UNIVARIATE procedure may have high or excessive memory or time requirements.
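To make the formula concrete, the sketch below applies it by hand to a single midterm grade of 95, using the sample mean and standard deviation reported by PROC UNIVARIATE as the location and scale measures. The data step and its variable names are illustrative assumptions, not part of the original example.

```sas
/* Illustrative only: apply
   result = add + multiply * ((original - location) / scale)
   to one observation by hand. */
data _null_;
  original = 95;          /* grade to be rescaled                    */
  location = 69.06667;    /* sample mean from PROC UNIVARIATE        */
  scale    = 11.60489;    /* sample std dev from PROC UNIVARIATE     */
  add      = 75;          /* target mean (ADD= option)               */
  multiply = 10;          /* target std dev (MULT= option)           */
  result = add + multiply * ((original - location) / scale);
  put result= 6.2;
run;
```

The result is roughly 97.3, which is what METHOD=STD with ADD=75 and MULT=10 would be expected to give for that grade.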

options ls=78 ps=200 nocenter nodate;

data midterm;
  input grade @@;
  datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
run;

proc stdize data=midterm out=adjusted method=std add=75 mult=10;
  var grade;
run;

proc print data=adjusted;
run;

proc univariate data=adjusted;
  var grade;
run;

One-Sample Tests of Location
What is the equivalent of the one-sample normal test or one-sample t test for the hypothesis that the true mean is equal to a specified value?
• Sign test
• Wilcoxon signed rank test
Both test a hypothesized value of η, where η is the (unknown) median of the population.

PROC UNIVARIATE in SAS automatically performs three tests of location, but it does so by testing whether the "typical" value is zero:
• t test
• Sign test
• Wilcoxon signed rank test

data midterm;
  input grade @@;
  /* If GRADE is typically 75, then GRADE-75 should typically be zero. */
  diff=grade-75;
  label diff='Points above 75';
  datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
run;

proc univariate data=midterm;
  var diff;
run;
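More recent SAS releases provide a MU0= option on the PROC UNIVARIATE statement that sets the hypothesized location directly, which may make the separate difference variable unnecessary. The following is a minimal sketch, assuming a SAS release that supports MU0=. It repeats the midterm data so that it can be run on its own.

```sas
data midterm;
  input grade @@;
  datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
run;

/* MU0=75 requests the t, sign, and signed rank tests of location = 75
   on GRADE itself (assumes a release that supports the MU0= option). */
proc univariate data=midterm mu0=75;
  var grade;
run;
```

The three tests should lead to the same conclusions as the analysis of DIFF = GRADE - 75.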

Univariate Procedure
Variable=DIFF    Points above 75

Moments
N            60          Sum Wgts     60
Mean         -5.93333    Sum          -356
Std Dev      11.60489    Variance     134.6734
Skewness     -0.17772    Kurtosis     0.250711
USS          10058       CSS          7945.733
CV           -195.588    Std Mean     1.498185
T:Mean=0     -3.96035    Pr>|T|       0.0002
Num ^= 0     60          Num > 0      17
M(Sign)      -13         Pr>=|M|      0.0011
Sgn Rank     -481.5      Pr>=|S|      0.0002

The last rows give the three tests of location:
• T:Mean=0 and Pr>|T|: t test (μ=75)
• M(Sign) and Pr>=|M|: sign test (η=75)
• Sgn Rank and Pr>=|S|: Wilcoxon signed rank test (η=75)
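The sign test entries can be traced back to simple counts. Of the 60 nonzero differences, 17 are positive, so M = 17 - 60/2 = -13, and the p-value is a two-sided binomial tail probability with success probability 0.5. The data step below is a sketch of that arithmetic under the usual definition of the sign test; the variable names are illustrative assumptions.

```sas
/* Illustrative only: rebuild the sign test from the counts in the output. */
data _null_;
  n_nonzero = 60;                      /* Num ^= 0 */
  n_pos     = 17;                      /* Num > 0  */
  n_neg     = n_nonzero - n_pos;       /* 43 */
  M = n_pos - n_nonzero / 2;           /* sign statistic: -13 */
  /* two-sided p-value: twice P(X <= min(n_pos, n_neg)), X ~ Binomial(60, 0.5) */
  p = min(1, 2 * probbnml(0.5, n_nonzero, min(n_pos, n_neg)));
  put M= p=;
run;
```

The value of p can be compared against the Pr>=|M| entry in the output above.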

Ranking
Many nonparametric procedures rely on the relative ordering, or ranking, of the observations. Suppose a new Florida resident wants to see if the prices of houses in Gainesville are higher than the prices of homes in smaller towns near Gainesville. She collects a random sample of 10 prices for houses on the market in Gainesville and 10 prices of homes in other cities in Alachua County.

data homes;
  input location $ price @@;
  datalines;
Gville 74500 Gville 269000 Gville 94500 Gville 86900 Gville 99900
Gville 91500 Gville 70000 Gville 78000 Gville 289000 Gville 114000
County 32000 County 125000 County 105900 County 120000 County 139900
County 72000 County 85000 County 74500 County 199500 County 2200000
;
run;

Influential observation: the County home priced at 2,200,000.
• Does one location tend to have higher average prices than the other?
• Does one location tend to have higher-ranked prices than the other (that is, more higher-ranking homes in one location than expected at random)?

OBS  LOCATION       PRICE  RANKCOST
  1  Gville       $74,500       4.5
  2  Gville      $269,000      18.0
  3  Gville       $94,500      10.0
  4  Gville       $86,900       8.0
  5  Gville       $99,900      11.0
  6  Gville       $91,500       9.0
  7  Gville       $70,000       2.0
  8  Gville       $78,000       6.0
  9  Gville      $289,000      19.0
 10  Gville      $114,000      13.0
 11  County       $32,000       1.0
 12  County      $125,000      15.0
 13  County      $105,900      12.0
 14  County      $120,000      14.0
 15  County      $139,900      16.0
 16  County       $72,000       3.0
 17  County       $85,000       7.0
 18  County       $74,500       4.5
 19  County      $199,500      17.0
 20  County    $2,200,000      20.0

How to get the ranks?

proc rank data=homes out=rankdata ties=mean;
  var price;
  ranks rankcost;
run;

proc print data=rankdata;
  format price dollar10.;
run;

Another way to rank the data would be to create groups of least expensive, inexpensive, moderate, expensive, and very expensive price ranges. PROC RANK can do this with the GROUPS option.

proc rank data=homes out=rankdata groups=5;
  var price;
  ranks pricegrp;
run;

proc print data=rankdata;
  format price dollar10.;
run;

Grouping starts at 0.

OBS  LOCATION       PRICE  PRICEGRP
  1  Gville       $74,500         1
  2  Gville      $269,000         4
  3  Gville       $94,500         2
  4  Gville       $86,900         1
  5  Gville       $99,900         2
  6  Gville       $91,500         2
  7  Gville       $70,000         0
  8  Gville       $78,000         1
  9  Gville      $289,000         4
 10  Gville      $114,000         3
 11  County       $32,000         0
 12  County      $125,000         3
 13  County      $105,900         2
 14  County      $120,000         3
 15  County      $139,900         3
 16  County       $72,000         0
 17  County       $85,000         1
 18  County       $74,500         1
 19  County      $199,500         4
 20  County    $2,200,000         4

PROC RANK can be used to produce a better normal probability plot than the one produced by PROC UNIVARIATE. We use PROC RANK to calculate the normal scores. If the data are indeed normally distributed, the Blom normal scores (NORMAL=BLOM option) should provide the best straight line. Consider the 20 homes to be a random sample of all homes for sale in Alachua County; we want to see whether price or log(price) more closely follows a normal distribution.

data homes;
  set homes;
  logprice=log(price);
run;

proc rank data=homes out=rankdata normal=blom;
  var price logprice;
  ranks norm1 norm2;
run;

proc plot data=rankdata;
  plot price*norm1 logprice*norm2;
run;
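The Blom normal score for an observation with rank r in a sample of size n is the standard normal quantile of (r - 3/8)/(n + 1/4). The sketch below computes that score by hand for a small made-up sample, next to the value PROC RANK returns, so the two columns can be compared. The data set DEMO and its five values are hypothetical.

```sas
/* Hypothetical five-value sample. */
data demo;
  input x @@;
  datalines;
12 7 31 19 25
;
run;

/* Ordinary ranks, with ties (none here) set to the mean rank. */
proc rank data=demo out=ranked ties=mean;
  var x;
  ranks r;
run;

/* Blom scores as computed by PROC RANK. */
proc rank data=demo out=blomscores normal=blom;
  var x;
  ranks blom_sas;
run;

/* Same score by hand: probit((r - 3/8) / (n + 1/4)). */
data blomcheck;
  merge ranked blomscores;          /* one-to-one merge; both keep the input order */
  if _n_ = 1 then set ranked(keep=x rename=(x=_x)) nobs=n;
  blom_manual = probit((r - 3/8) / (n + 1/4));
  drop _x;
run;

proc print data=blomcheck;
run;
```

The BLOM_SAS and BLOM_MANUAL columns should agree.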

Plot of PRICE*NORM1. Legend: A = 1 obs, B = 2 obs, etc.
[Line-printer plot: PRICE (vertical axis, 0 to 3,000,000) against RANK FOR VARIABLE PRICE (normal scores, horizontal axis, -2 to 2)]

Not very normal looking!

Plot of LOGPRICE*NORM2. Legend: A = 1 obs, B = 2 obs, etc.
[Line-printer plot: LOGPRICE (vertical axis, 10 to 16) against RANK FOR VARIABLE LOGPRICE (normal scores, horizontal axis, -2 to 2)]

Better, but still not very normal!

Comparing Two or More Groups
• The nonparametric version of analysis of variance is based on ranks.
• The Mann-Whitney test and the Wilcoxon rank sum test are equivalent nonparametric techniques to compare two groups, while the Kruskal-Wallis test is ordinarily used to compare three or more groups.
• All of these are available in PROC NPAR1WAY (nonparametric 1-way analysis of variance) in SAS.

NPAR1WAY
PROC NPAR1WAY performs tests for location and scale differences based on the following scores of a response variable: Wilcoxon, median, Van der Waerden, Savage, Siegel-Tukey, Ansari-Bradley, Klotz, and Mood scores. Additionally, PROC NPAR1WAY provides tests using the raw data as scores. When the data are classified into two samples, tests are based on simple linear rank statistics. When the data are classified into more than two samples, tests are based on one-way ANOVA statistics. Both asymptotic and exact p-values are available for these tests.

PROC NPAR1WAY also calculates the following empirical distribution function (EDF) statistics: the Kolmogorov-Smirnov statistic, the Cramer-von Mises statistic, and, when the data are classified into only two samples, the Kuiper statistic. These statistics test whether the distribution of a variable is the same across different groups.
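As a sketch of how the exact p-values and EDF statistics mentioned above might be requested, the example below adds an EXACT statement and the EDF option to a two-sample Wilcoxon analysis. The data set TINY and its eight values are made up for illustration. Exact computations are practical for samples this small but can become slow for larger ones.

```sas
/* Hypothetical two-group sample, four observations per group. */
data tiny;
  input group $ y @@;
  datalines;
A 12 A 15 A 9 A 20 B 22 B 18 B 25 B 30
;
run;

proc npar1way data=tiny wilcoxon edf;   /* EDF adds Kolmogorov-Smirnov etc. */
  class group;
  var y;
  exact wilcoxon;                       /* exact p-value for the Wilcoxon test */
run;
```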

proc npar1way wilcoxon data=homes;
  class location;
  var price;
run;

N P A R 1 W A Y   P R O C E D U R E

Wilcoxon Scores (Rank Sums) for Variable PRICE
Classified by Variable LOCATION

                    Sum of    Expected       Std Dev       Mean
LOCATION    N       Scores    Under H0      Under H0      Score
Gville     10   100.500000       105.0    13.2237824    10.0500
Other      10   109.500000       105.0    13.2237824    10.9500

Average Scores Were Used for Ties

Wilcoxon 2-Sample Test (Normal Approximation)
(with Continuity Correction of .5)
S = 100.500   Z = -.302485   Prob > |Z| = 0.7623
T-Test Approx. Significance = 0.7656

Kruskal-Wallis Test (Chi-Square Approximation)
CHISQ = 0.11580   DF = 1   Prob > CHISQ = 0.7336
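The Z statistic in the Wilcoxon 2-sample test can be reproduced from the rank-sum table. The continuity correction moves the observed sum of scores half a unit toward its expected value before dividing by the standard deviation under H0. The data step below is a sketch of that arithmetic; the variable names are illustrative.

```sas
/* Illustrative only: rebuild Z and its two-sided p-value from the
   Gville row of the rank-sum table. */
data _null_;
  S  = 100.5;         /* sum of scores, Gville */
  E  = 105.0;         /* expected under H0     */
  SD = 13.2237824;    /* std dev under H0      */
  /* continuity correction of 0.5 moves S toward its expectation */
  if S < E then z = (S + 0.5 - E) / SD;
  else if S > E then z = (S - 0.5 - E) / SD;
  else z = 0;
  p = 2 * probnorm(-abs(z));   /* two-sided normal approximation */
  put z= 9.6 p= 6.4;
run;
```

Here z = (100.5 + 0.5 - 105)/13.2237824 = -4/13.2237824, which matches the printed Z of -.302485.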

Other Rank Tests
When there are two factors with no interaction, as in a randomized complete block design, Friedman's chi-square test is a nonparametric test that can be used to examine treatment differences. Friedman's test is available in SAS under PROC FREQ, but it is fairly complicated to perform. PROC FREQ also offers versions of rank sum tests for ordinal data. See SAS documentation for details.
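One commonly used way to obtain Friedman's test from PROC FREQ is to request Cochran-Mantel-Haenszel statistics with rank scores, stratifying on the block. The sketch below assumes one observation per treatment within each block. The data set RCB, its variables, and its values are hypothetical.

```sas
/* Hypothetical randomized complete block design:
   4 blocks, 3 treatments, one response per cell. */
data rcb;
  input block trt $ y @@;
  datalines;
1 A 10 1 B 12 1 C 15
2 A  9 2 B 14 2 C 13
3 A 11 3 B 13 3 C 17
4 A  8 4 B 10 4 C 12
;
run;

proc freq data=rcb;
  /* block is the stratum; SCORES=RANK ranks y within each block */
  tables block*trt*y / cmh2 scores=rank noprint;
run;
```

In the CMH output, the "Row Mean Scores Differ" statistic corresponds to Friedman's chi-square for the treatment effect.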

The LIFETEST procedure can be used with data that may be right-censored to compute nonparametric estimates of the survival distribution and to compute rank tests for association of the response variable with other variables. The survival estimates are computed within defined strata levels, and the rank tests are pooled over the strata and are therefore adjusted for strata differences.
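A minimal sketch of such an analysis is shown below. The data set SURV, the variable names, the censoring code (1 = censored), and all the values are hypothetical and chosen only to illustrate the statements involved.

```sas
/* Hypothetical right-censored survival data from two clinics. */
data surv;
  input clinic $ months censored age @@;
  datalines;
A  5 0 61  A 12 1 54  A 20 0 47  A 31 1 58
B  3 0 66  B  9 0 59  B 17 1 50  B 26 0 44
;
run;

proc lifetest data=surv method=km;
  time months*censored(1);   /* censored=1 marks a right-censored time      */
  strata clinic;             /* survival estimates computed within clinic   */
  test age;                  /* rank tests for association with age,
                                pooled over the strata                      */
run;
```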

Non-Parametric Correlation
The Pearson product-moment correlation coefficient measures the strength of the tendency for two variables X and Y to follow a straight line. Suppose we are more interested in measuring the tendency for X to increase or decrease with Y, without necessarily assuming a strictly linear relationship.
• Spearman correlation coefficient: Pearson correlation between the ranks of X and the ranks of Y.
• Kendall correlation coefficient: based on the probability of observing Y2 > Y1 when X2 > X1 (a concordant pair), minus the probability of observing a discordant pair.

Twelve students take a test, and we want to see if the students who finished the test early had higher scores than those who finished later.
• We don't know the exact time that each student spent on the test, but we do know the order in which the tests were turned in to be graded.
• With only rank data for the time variable, the Pearson linear correlation coefficient would not be appropriate. (The last person to turn in the test could have taken 30 minutes, an hour, or two hours, and the guess of the exact time would greatly influence the resulting Pearson correlation.)
• Both the Spearman and Kendall correlation coefficients could legitimately be used.

data students;
  input order grade @@;
  datalines;
1 90 2 74 3 76 4 60 5 68 6 86 7 92 8 60 9 78 10 70 11 78 12 64
;
run;

proc plot data=students;
  plot grade*order;
run;

proc corr data=students spearman kendall;
  var grade;
  with order;
run;

Plot of GRADE*ORDER. Legend: A = 1 obs, B = 2 obs, etc.
[Line-printer plot: GRADE (vertical axis, 60 to 100) against ORDER (horizontal axis, 1 to 12)]

Correlation Analysis

Spearman Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12

             GRADE
ORDER     -0.17544
            0.5855

Kendall Tau b Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12

             GRADE
ORDER     -0.12309
            0.5815
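Because the Spearman coefficient is the Pearson correlation between ranks, it can also be obtained in two steps: rank both variables with PROC RANK (averaging tied ranks), then run an ordinary Pearson correlation on the ranks. The sketch below repeats the students data so that it runs on its own. The names RANKED, RGRADE, and RORDER are illustrative.

```sas
data students;
  input order grade @@;
  datalines;
1 90 2 74 3 76 4 60 5 68 6 86 7 92 8 60 9 78 10 70 11 78 12 64
;
run;

/* Replace each variable by its rank; tied grades share the mean rank. */
proc rank data=students out=ranked ties=mean;
  var grade order;
  ranks rgrade rorder;
run;

/* Pearson correlation of the ranks. */
proc corr data=ranked pearson;
  var rgrade;
  with rorder;
run;
```

The resulting coefficient should agree with the Spearman value reported by PROC CORR with the SPEARMAN option.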
