Experimental Design & Analysis



## PowerPoint Slideshow about 'Experimental Design & Analysis' - dafydd


### Experimental Design & Analysis

Nonparametric Methods

April 17, 2007

DOCTORAL SEMINAR, SPRING SEMESTER 2007

Nonparametric Tests
• Occasions for use?
• The response variable (residual) cannot logically be assumed to be normally distributed, a key assumption of ANOVA models
• Limited data
• Counts and ranks are the relevant units of analysis, rather than means
Wilcoxon Rank Sum Test
• Nonparametric version of an independent (two-sample) t-test
• Example of corn yield as a function of weeding

    Treatment   Sum of Ranks
    No weeds    23
    Weeds       13

μ = n1(N+1)/2 = 4(8+1)/2 = 36/2 = 18

σ = sqrt(n1 n2 (N+1)/12) = sqrt((4)(4)(8+1)/12) = sqrt(144/12) = sqrt(12) = 3.464

Wilcoxon Rank Sum Test
• Calculate the test statistic by first computing the mean μ and the standard deviation σ

    proc univariate data = MYDATA;
      var CORN;
    run;
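The slide's numbers can be checked directly; the sketch below (in Python, for illustration) uses the rank sum W = 23 and the group sizes n1 = n2 = 4 from the corn example above.

```python
import math

# Corn example: 4 "no weeds" plots and 4 "weeds" plots, ranked together
n1, n2 = 4, 4
N = n1 + n2

W = 23                                     # observed rank sum for "no weeds"
mu = n1 * (N + 1) / 2                      # expected rank sum under H0
sigma = math.sqrt(n1 * n2 * (N + 1) / 12)  # standard deviation under H0
z = (W - mu) / sigma                       # normal-approximation test statistic

print(mu, round(sigma, 3), round(z, 3))
```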

Wilcoxon Signed Rank Sum Test
• Nonparametric version of a paired samples t-test
• Study difference between two variables (Story 1 vs. Story 2)
• Data step necessary to create the difference of the two scores for each subject

    data MYDATA; set MYDATA;
      diff = STORY1 - STORY2;
    run;

    proc univariate data = MYDATA;
      var diff;
    run;
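What the signed rank procedure does with those differences can be sketched by hand (the paired scores below are hypothetical, not from the slides): drop zero differences, rank the rest by absolute value with tied ranks averaged, then sum the ranks of the positive and negative differences separately.

```python
# Hypothetical paired story scores for six subjects
story1 = [85, 78, 92, 60, 74, 88]
story2 = [80, 80, 85, 55, 70, 90]

# keep only the nonzero differences
diffs = [a - b for a, b in zip(story1, story2) if a != b]

# rank the absolute differences, averaging ranks across ties
abs_sorted = sorted(abs(d) for d in diffs)
def avg_rank(v):
    positions = [i + 1 for i, x in enumerate(abs_sorted) if x == v]
    return sum(positions) / len(positions)

w_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)   # rank sum, positive diffs
w_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)  # rank sum, negative diffs
print(w_plus, w_minus)
```

The two rank sums always add up to n(n+1)/2 over the n nonzero differences, which is a handy arithmetic check.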

Wilcoxon Mann-Whitney Test
• Nonparametric version of the independent samples t-test; use it when you cannot assume the dependent variable is a normally distributed interval variable
• Assumes only that the dependent variable is ordinal

    proc npar1way data = mydata wilcoxon;
      class female;
      var write;
    run;
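The Mann-Whitney U statistic and the Wilcoxon rank sum W are equivalent: each is a fixed shift of the other. A minimal sketch, reusing the corn example's numbers from earlier:

```python
n1, n2 = 4, 4   # group sizes (corn example)
W = 23          # Wilcoxon rank sum of the first group

U = W - n1 * (n1 + 1) / 2   # Mann-Whitney U; ranges from 0 to n1*n2
print(U)
```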

Kruskal-Wallis Test
• Used when you have one independent variable with two or more levels and an ordinal dependent variable
• Nonparametric version of one-way ANOVA
• Generalizes the Mann-Whitney test, as it permits two or more groups

    proc npar1way data = mydata;
      class prog;
      var write;
    run;
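The Kruskal-Wallis statistic itself is straightforward to compute by hand: rank all N observations together, then compare each group's rank sum against its expectation. A sketch on small made-up data (three groups of three, all values distinct so there are no ties):

```python
# Hypothetical scores for three groups; all values distinct, so no tied ranks
groups = {"a": [12, 15, 30], "b": [18, 22, 51], "c": [33, 40, 60]}

allvals = sorted(v for g in groups.values() for v in g)
rank = {v: i + 1 for i, v in enumerate(allvals)}   # rank within the pooled sample

N = len(allvals)
# H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1); compare to chi-square with k-1 df
H = 12 / (N * (N + 1)) * sum(
    sum(rank[v] for v in g) ** 2 / len(g) for g in groups.values()
) - 3 * (N + 1)
print(round(H, 4))
```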

Chi-Square Test
• Used when you want to see if there is a relationship between two categorical variables
• Chi-square test assumes that the expected value for each cell is 5 or higher
• If this assumption is not met, use Fisher's exact test
• In SAS, the chisq option is used on the tables statement to obtain test statistic and p-value

    proc freq data = mydata;
      tables school*gender / chisq;
    run;
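The statistic PROC FREQ reports can be sketched by hand; the 2x2 counts below are made up for illustration. The expected count computed for each cell is exactly the quantity the "5 or higher" rule refers to.

```python
# Hypothetical 2x2 table of counts (e.g., school type by gender)
observed = [[30, 20],
            [10, 40]]

rows = [sum(r) for r in observed]
cols = [sum(c) for c in zip(*observed)]
total = sum(rows)

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = rows[i] * cols[j] / total   # expected count under independence
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 3))
```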

Fisher’s Exact Test
• Used when you want to conduct a chi-square test but one or more of your cells has an expected frequency below 5
• Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is

    proc freq data = mydata;
      tables school*race / fisher;
    run;
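Fisher's exact test for a 2x2 table sums hypergeometric probabilities over tables at least as extreme as the observed one, which is why it needs no minimum expected count. A one-sided sketch on a made-up table with small counts:

```python
from math import comb

# Hypothetical 2x2 table with small counts: rows [8, 2] and [1, 5]
a, b, c, d = 8, 2, 1, 5
n = a + b + c + d
r1, c1 = a + b, a + c          # first row total and first column total

def p_table(x):
    # hypergeometric probability that cell (1,1) equals x, margins fixed
    return comb(r1, x) * comb(n - r1, c1 - x) / comb(n, c1)

# one-sided p-value: the observed table plus the more extreme ones (larger a)
p = sum(p_table(x) for x in range(a, min(r1, c1) + 1))
print(round(p, 5))
```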

Factorial Logistic Regression
• Used when you have two or more categorical independent variables but a dichotomous dependent variable
• The desc option on the proc logistic statement is necessary so that SAS models the odds of being female (i.e., female = 1).  The expb option on the model statement tells SAS to show the exponentiated coefficients (i.e., the odds ratios).

    proc logistic data = mydata desc;
      class prog schtyp;
      model female = prog schtyp prog*schtyp / expb;
    run;

Nonparametric Correlation
• Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal)
• The values of the variables are converted into ranks and then correlated
• The spearman option on the proc corr statement tells SAS to perform a Spearman rank correlation instead of a Pearson correlation

    proc corr data = mydata spearman;
    run;

### Nonparametric Tests: Advantages & Shortcomings

Why Use Nonparametric Tests?
• When data are not normally distributed and the measurements at best contain rank-order information, computing the standard descriptive statistics (e.g., mean, standard deviation) is sometimes not the most informative way to summarize the data
• Nonparametric tests make less stringent demands of the data (they are resistant to outliers and to the shape of the distribution)
• Nonparametric procedures can sometimes be used to get a quick answer with little calculation
• Nonparametric methods provide an air of objectivity when there is no reliable (universally recognized) underlying scale for the original data
Why Not Use All the Time?

Parametric tests are often preferred because:

• They are robust
• They have greater power efficiency (greater power relative to the sample size)
• They provide unique information (e.g., the interaction in a factorial design)
• Parametric and nonparametric tests often address two different types of questions
Do Different Nonparametric Tests Give the Same Results?
• Different nonparametric tests may yield different results
• Advisable to run different nonparametric tests
Large Data Sets
• Nonparametric methods are most appropriate when the sample sizes are small
• When data set is large it often makes little sense to use nonparametric statistics at all
• How large is large enough?
Small Data Sets
• What happens when you use a nonparametric test with data from a normal distribution?
• Greater incidence of Type II error
• The nonparametric tests lack statistical power with small samples
Shortcomings
• Nonparametric tests cannot attain very small p-values for very small samples, because every possible rank sum is fairly likely
• They do not, without the addition of extra assumptions, give confidence intervals for the means or medians of the underlying distributions
• They assume that the data can be ordered; the power of the test is diminished if there are many ties
Standardization

Used in situations in which you need to adjust and rescale observations to have a different mean and standard deviation.

Example: Midterm test scores are to be rescaled to have a mean of 75 and a standard deviation of 10.

    data midterm;
      input grade @@;
      datalines;
    64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
    66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
    80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
    81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
    ;
    run;

    proc univariate data=midterm plot;
    run;

    Moments
    N          60
    Mean       69.06667
    Std Dev    11.60489

    Stem Leaf              #  Boxplot
      9 5                  1     |
      9 00                 2     |
      8 9                  1     |
      8 000124444          9     |
      7 7779               4  +-----+
      7 00011123334       11  |     |
      6 6666888899        10  *--+--*
      6 0012222333444     13  +-----+
      5 5999               4     |
      5 2                  1     |
      4 9                  1     |
      4 033                3     |
        ----+----+----+----+
    Multiply Stem.Leaf by 10**+1

    proc standard data=midterm out=adjusted mean=75 std=10;
    run;

The new data set, ADJUSTED, has one variable, also called GRADE, which has a mean of 75 and a standard deviation of 10.

For example, the grade of 95 in the MIDTERM dataset becomes a grade of about 97.3 in the ADJUSTED dataset:

75 + (10/11.60489)(95 - 69.067) ≈ 97.3
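The PROC STANDARD rescaling is just a linear transform of z-scores. The sketch below redoes it in Python on the midterm data and confirms the summary statistics and the rescaled top grade:

```python
import math

grades = [64, 71, 80, 69, 55, 84, 77, 63, 68, 90, 66, 61, 84, 43, 80,
          66, 68, 89, 71, 59, 52, 62, 60, 79, 43, 63, 68, 72, 60, 77,
          80, 73, 40, 74, 63, 68, 95, 66, 59, 70, 73, 62, 64, 62, 77,
          81, 73, 64, 82, 59, 84, 70, 70, 71, 49, 90, 84, 66, 69, 62]

n = len(grades)
mean = sum(grades) / n
sd = math.sqrt(sum((g - mean) ** 2 for g in grades) / (n - 1))  # sample SD, as SAS uses

# rescale to mean 75, SD 10 (the PROC STANDARD mean=75 std=10 transform)
adjusted = [75 + 10 * (g - mean) / sd for g in grades]

print(round(mean, 5), round(sd, 5), round(max(adjusted), 2))
```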

The STDIZE procedure standardizes one or more numeric variables in a SAS data set by subtracting a location measure and dividing by a scale measure. A variety of location and scale measures are provided, including estimates that are resistant to outliers and clustering. Some of the well-known standardization methods such as mean, median, std, range, Huber's estimate, Tukey's biweight estimate, and Andrew's wave estimate are available in the STDIZE procedure.

• In addition, you can multiply each standardized value by a constant and add a constant, so the final output value is
• result = add + multiply × (original - location) / scale
• where
• result = final output value
• add = constant to add (ADD= option)
• multiply = constant to multiply by (MULT= option)
• original = original input value
• location = location measure
• scale = scale measure
• PROC STDIZE can also find quantiles in one pass of the data, a capability that is especially useful for very large data sets. With such data sets, the UNIVARIATE procedure may have high or excessive memory or time requirements.

    options ls=78 ps=200 nocenter nodate;

    data midterm;
      input grade @@;
      datalines;
    64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
    66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
    80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
    81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
    ;
    run;

    proc stdize data=midterm;
    run;

One-Sample Tests of Location

What is the nonparametric equivalent of the one-sample normal test or one-sample t test of the hypothesis that the true mean equals a specified value?

• Sign Test
• Wilcoxon Signed Rank Test

Both test the hypothesis that h, the (unknown) median of the population, equals a specified value.

PROC UNIVARIATE in SAS automatically performs three tests of location (the t test, the sign test, and the Wilcoxon signed rank test), but it does so by testing whether the "typical" value is zero.

    data midterm;
      input grade @@;
      /* If a typical grade is 75,
         then GRADE-75 should typically be zero. */
      diff = grade - 75;
      label diff='Points above 75';
      datalines;
    64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
    66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
    80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
    81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
    ;
    run;

    proc univariate data=midterm;
      var diff;
    run;

    Univariate Procedure

    Variable=DIFF    Points above 75

    Moments
    N              60        Sum Wgts        60
    Mean     -5.93333        Sum           -356
    Std Dev  11.60489        Variance  134.6734
    Skewness -0.17772        Kurtosis  0.250711
    USS         10058        CSS       7945.733
    CV       -195.588        Std Mean  1.498185
    T:Mean=0 -3.96035        Pr>|T|      0.0002
    Num ^= 0       60        Num > 0         17
    M(Sign)       -13        Pr>=|M|     0.0011
    Sgn Rank   -481.5        Pr>=|S|     0.0002

These three statistics correspond to the t test (m=75), the sign test (h=75), and the Wilcoxon signed rank test (h=75), expressed on the original grade scale.
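The t and sign test statistics can be reproduced from the summary numbers in that output (a sketch; the values are taken directly from the PROC UNIVARIATE listing):

```python
import math

# From the output: n = 60, mean diff = -5.93333, std dev = 11.60489
n, mean, sd = 60, -5.93333, 11.60489

t = mean / (sd / math.sqrt(n))   # the T:Mean=0 statistic

# Sign test: 17 diffs above zero, 43 below (none equal zero here)
n_pos, n_neg = 17, 60 - 17
m_sign = (n_pos - n_neg) / 2     # the M(Sign) statistic

print(round(t, 4), m_sign)
```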

Ranking

Many nonparametric procedures rely on the relative ordering, or ranking, of the observations.

Suppose a new Florida resident wants to see if the prices of houses in Gainesville are higher than the prices of homes in smaller towns near Gainesville. She collects a random sample of 10 prices for houses on the market in Gainesville and 10 prices of homes in other cities in Alachua County.

    data homes;
      input location $ price @@;
      datalines;
    Gville 74500   Gville 269000  Gville 94500
    Gville 86900   Gville 99900
    Gville 91500   Gville 72000   Gville 78000
    Gville 289000  Gville 114000
    County 32000   County 125000  County 105900
    County 120000  County 139900
    County 72000   County 85000   County 74500
    County 199500  County 2200000
    ;
    run;

Note the influential observation: the $2,200,000 county home.

Does one location tend to have higher average prices than the other?

Does one location tend to have higher-ranked prices than the other?

There are more high-ranking homes in the county than expected at random.

    OBS  LOCATION       PRICE  RANKCOST
      1  Gville       $74,500       4.5
      2  Gville      $269,000      18.0
      3  Gville       $94,500      10.0
      4  Gville       $86,900       8.0
      5  Gville       $99,900      11.0
      6  Gville       $91,500       9.0
      7  Gville       $70,000       2.0
      8  Gville       $78,000       6.0
      9  Gville      $289,000      19.0
     10  Gville      $114,000      13.0
     11  County       $32,000       1.0
     12  County      $125,000      15.0
     13  County      $105,900      12.0
     14  County      $120,000      14.0
     15  County      $139,900      16.0
     16  County       $72,000       3.0
     17  County       $85,000       7.0
     18  County       $74,500       4.5
     19  County      $199,500      17.0
     20  County    $2,200,000      20.0

How to get the ranks?

    proc rank data=homes out=rankdata ties=mean;
      var price;
      ranks rankcost;
    run;

    proc print data=rankdata;
      format price dollar10.;
    run;
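TIES=MEAN assigns each tied value the average of the rank positions it occupies. A Python sketch on the 20 prices from the printed output (note the two $74,500 homes sharing rank 4.5):

```python
# Prices as printed in the PROC PRINT output above
prices = [74500, 269000, 94500, 86900, 99900, 91500, 70000, 78000, 289000,
          114000, 32000, 125000, 105900, 120000, 139900, 72000, 85000,
          74500, 199500, 2200000]

ordered = sorted(prices)
def mean_rank(v):
    # average of the 1-based sorted positions that value v occupies
    positions = [i + 1 for i, x in enumerate(ordered) if x == v]
    return sum(positions) / len(positions)

ranks = [mean_rank(p) for p in prices]
print(ranks[0], ranks[17], ranks[19])
```

Averaging tied ranks preserves the total: the 20 ranks still sum to 20·21/2 = 210.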

Another way to rank the data would be to create groups of least expensive, inexpensive, moderate, expensive, and very expensive price ranges. PROC RANK can do this with the GROUPS option.

    OBS  LOCATION       PRICE  PRICEGRP
      1  Gville       $74,500         1
      2  Gville      $269,000         4
      3  Gville       $94,500         2
      4  Gville       $86,900         1
      5  Gville       $99,900         2
      6  Gville       $91,500         2
      7  Gville       $72,000         0
      8  Gville       $78,000         1
      9  Gville      $289,000         4
     10  Gville      $114,000         3
     11  County       $32,000         0
     12  County      $125,000         3
     13  County      $105,900         2
     14  County      $120,000         3
     15  County      $139,900         3
     16  County       $72,000         0
     17  County       $85,000         1
     18  County       $74,500         1
     19  County      $199,500         4
     20  County    $2,200,000         4

    proc rank data=homes out=rankdata groups=5;
      var price;
      ranks pricegrp;
    run;

    proc print data=rankdata;
      format price dollar10.;
    run;

Grouping starts at 0.

PROC RANK can be used to produce a better normal probability plot than the one produced by PROC UNIVARIATE.

We use PROC RANK to calculate the normal scores.

If the data are indeed normally distributed, the Blom normal scores (NORMAL=BLOM option) should give the best straight line.

Consider the 20 homes to be a random sample of all homes for sale in Alachua County, and suppose we want to see whether price or log(price) more closely follows a normal distribution.
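For reference, the Blom score for the observation with rank r out of n is the standard normal quantile of (r - 3/8)/(n + 1/4). A sketch of that formula using the Python stdlib (my own illustration, ignoring ties):

```python
from statistics import NormalDist

def blom_scores(n):
    # Blom's normal scores: inverse normal CDF of (r - 3/8) / (n + 1/4)
    nd = NormalDist()
    return [nd.inv_cdf((r - 0.375) / (n + 0.25)) for r in range(1, n + 1)]

scores = blom_scores(20)   # one score per rank for the 20 homes
print(round(scores[0], 3), round(scores[-1], 3))
```

The scores are symmetric about zero and span roughly -2 to 2 for n = 20, matching the horizontal axis of the plots below.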

    data homes;
      set homes;
      logprice = log(price);
    run;

    proc rank data=homes out=rankdata normal=blom;
      var price logprice;
      ranks norm1 norm2;
    run;

    proc plot data=rankdata;
      plot price*norm1 logprice*norm2;
    run;

[Plot of PRICE*NORM1 (PRICE, 0 to 3,000,000, against RANK FOR VARIABLE PRICE, about -2 to 2; Legend: A = 1 obs, B = 2 obs, etc.). Nearly all points sit near the bottom of the plot, with the $2.2 million home isolated near 2,000,000, so the points deviate sharply from a straight line.]

[Plot of LOGPRICE*NORM2 (LOGPRICE, about 10 to 16, against RANK FOR VARIABLE LOGPRICE, about -2 to 2; Legend: A = 1 obs, B = 2 obs, etc.). The points rise steadily and fall much closer to a straight line, suggesting log(price) is closer to normally distributed.]

Comparing Two or More Groups
• The nonparametric version of analysis of variance is based on ranks.
• The Mann-Whitney test and the Wilcoxon rank sum test are equivalent nonparametric techniques to compare two groups, while the Kruskal-Wallis test is ordinarily used to compare three or more groups.
• All of these are available in PROC NPAR1WAY (nonparametric 1-way analysis of variance) in SAS.
NPAR1WAY

PROC NPAR1WAY performs tests for location and scale differences based on the following scores of a response variable: Wilcoxon, median, Van der Waerden, Savage, Siegel-Tukey, Ansari-Bradley, Klotz, and Mood Scores. Additionally, PROC NPAR1WAY provides tests using the raw data as scores. When the data are classified into two samples, tests are based on simple linear rank statistics. When the data are classified into more than two samples, tests are based on one-way ANOVA statistics. Both asymptotic and exact p-values are available for these tests.

PROC NPAR1WAY also calculates the following empirical distribution function (EDF) statistics: the Kolmogorov-Smirnov statistic, the Cramer-von Mises statistic, and, when the data are classified into only two samples, the Kuiper statistic. These statistics test whether the distribution of a variable is the same across different groups.

    proc npar1way wilcoxon data=homes;
      class location;
      var price;
    run;

    N P A R 1 W A Y   P R O C E D U R E

    Wilcoxon Scores (Rank Sums) for Variable PRICE
    Classified by Variable LOCATION

                     Sum of   Expected     Std Dev     Mean
    LOCATION   N     Scores   Under H0    Under H0    Score
    Gville    10  100.500000     105.0  13.2237824  10.0500
    Other     10  109.500000     105.0  13.2237824  10.9500

    Average Scores Were Used for Ties

    Wilcoxon 2-Sample Test (Normal Approximation)
    (with Continuity Correction of .5)
    S = 100.500   Z = -.302485   Prob > |Z| = 0.7623
    T-Test Approx. Significance = 0.7656

    Kruskal-Wallis Test (Chi-Square Approximation)
    CHISQ = 0.11580   DF = 1   Prob > CHISQ = 0.7336
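The continuity-corrected Z in this output can be reproduced from the rank sums (a sketch; the Gville ranks and the tie-adjusted standard deviation are taken from the listings above):

```python
# Gville ranks from the PROC RANK output earlier
gville_ranks = [4.5, 18, 10, 8, 11, 9, 2, 6, 19, 13]
n1 = n2 = 10
N = n1 + n2

S = sum(gville_ranks)            # Sum of Scores
mu = n1 * (N + 1) / 2            # Expected Under H0
sigma = 13.2237824               # Std Dev Under H0 (tie-adjusted, from the output)

z = (S - mu + 0.5) / sigma       # continuity correction of .5 toward zero
print(S, mu, round(z, 6))
```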

Other Rank Tests

When there are two factors with no interaction, as in a randomized complete block design, Friedman's chi-square test is a non-parametric test that can be used to examine treatment differences. Friedman's test is available in SAS under PROC FREQ, but it is fairly complicated to perform. PROC FREQ also offers versions of rank sum tests for ordinal data. See SAS documentation for details.

The LIFETEST procedure can be used with data that may be right-censored to compute nonparametric estimates of the survival distribution and to compute rank tests for association of the response variable with other variables. The survival estimates are computed within defined strata levels, and the rank tests are pooled over the strata and are therefore adjusted for strata differences.

Non-Parametric Correlation

The Pearson Product Moment correlation coefficient measures the strength of the tendency for two variables X and Y to follow a straight line.

Suppose we are more interested in measuring the tendency for X to increase or decrease with Y, without necessarily assuming a strictly linear relationship.

• Spearman correlation coefficient - the Pearson correlation between the ranks of X and the ranks of Y.
• Kendall correlation coefficient - based on the difference between the probabilities that pairs of observations are concordant (Y2 > Y1 when X2 > X1) and discordant.

Twelve students take a test, and we want to see if the students who finished the test early had higher scores than those who finished later.

• We don't know the exact time that each student spent on the test, but we do know the order in which the tests were turned in to be graded.
• With only rank data for the time variable, the Pearson linear correlation coefficient would not be appropriate. (The last person to turn in the test could have taken 30 minutes, an hour, or two hours, and the guess of the exact time would greatly influence the resulting Pearson correlation.)
• Both the Spearman and Kendall correlation coefficients could legitimately be used.

    data students;
      input order grade @@;
      datalines;
     1 90   2 74   3 76
     4 60   5 68   6 86
     7 92   8 60   9 78
    10 70  11 78  12 64
    ;
    run;

    proc plot data=students;
      plot grade*order;
    run;

    proc corr data=students spearman kendall;
      with order;
    run;

[Plot of GRADE*ORDER (GRADE, 60 to about 92, against ORDER, 1 to 12; Legend: A = 1 obs, B = 2 obs, etc.). The grades are scattered across the finishing order with no obvious trend.]

    Correlation Analysis

    Spearman Correlation Coefficients / Prob > |R| under H0: Rho=0 / N = 12

    ORDER   -0.17544
             0.5855

    Kendall Tau b Correlation Coefficients / Prob > |R| under H0: Rho=0 / N = 12
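The Spearman value above can be reproduced by ranking the grades (ORDER is already a rank) and computing an ordinary Pearson correlation on the ranks, a sketch:

```python
import math

order = list(range(1, 13))
grades = [90, 74, 76, 60, 68, 86, 92, 60, 78, 70, 78, 64]

# rank the grades, averaging ties (60 and 78 each occur twice)
s = sorted(grades)
grade_ranks = [sum(i + 1 for i, x in enumerate(s) if x == g) / s.count(g)
               for g in grades]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rho = pearson(order, grade_ranks)   # Spearman = Pearson on ranks
print(round(rho, 5))
```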