DATA ANALYSIS

DATA ANALYSIS Module Code: CA660 Lecture Block 7: Non-parametrics

WHAT ABOUT NON-PARAMETRICS? How Useful are ‘Small Data’?
• General points
  - No clear theoretical probability distribution, so empirical distributions are needed
  - So, less knowledge of the form of the data*, e.g. ranks instead of values
  - Quick and dirty
  - Need not focus on parameter estimation or testing; when they do, they are frequently based on “less-good” parameters/estimators, e.g. medians; otherwise they test “properties”, e.g. randomness, symmetry, quality etc.
  - Weaker assumptions, implicit in *
  - Smaller sample sizes ‘typical’
  - Different data, implicit from other points. Levels of measurement: Nominal and Ordinal are typical for non-parametric/distribution-free methods

ADVANTAGES/DISADVANTAGES
Advantages
  - Power may be better using N-P, if assumptions are weaker
  - Smaller samples and less work etc., as stated
Disadvantages (also implicit from earlier points), specifically:
  - Loss of information/power etc. when we do know more about the data, or when the assumptions do apply
  - Separate tables for each test
General bases/principles: Binomial - cumulative tables, Ordinal data; Normal - large samples; Kolmogorov-Smirnov for empirical distributions - shift in median/shape.
Confidence intervals: more work to establish. Use confidence regions and tolerance intervals.
Errors: Type I, Type II. Power as usual.
Relative efficiency: asymptotic, e.g. look at the ratio of sample sizes needed to achieve the same power.

STARTING SIMPLY: THE ‘SIGN TEST’
• Example. Suppose we want to test whether weights of a certain item are likely to be more or less than 220 g.
• From 12 measurements, selected at random, count how many are above and how many below. Obtain 9 (+), 3 (-).
• Null hypothesis: H0: Median = 220. “Test” on the basis of counts of signs.
• Binomial situation, n = 12, p = 0.5.
• For this distribution, P{3 ≤ X ≤ 9} = 0.961 while P{X ≤ 2 or X ≥ 10} = 1 - 0.961 = 0.039
• Result not strongly significant.
• Notes: need not be the median as “location of test”. (Describe distributions by location, dispersion, shape.) Location = median, “quartile” or other percentile.
• Many variants of the sign test, including e.g. runs of + and - signs for “randomness”.
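The binomial calculation behind the sign test can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names `binom_pmf` and `sign_test_two_sided` are illustrative choices, not part of the lecture material.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """P{X = k} for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def sign_test_two_sided(n_plus, n_minus):
    """Two-sided sign-test p-value: twice the smaller binomial tail (p = 0.5)."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    tail = sum(binom_pmf(i, n) for i in range(k + 1))
    return min(1.0, 2 * tail)

n = 12
central = sum(binom_pmf(k, n) for k in range(3, 10))  # P{3 <= X <= 9}
tails = 1 - central                                   # P{X <= 2 or X >= 10}
p_value = sign_test_two_sided(9, 3)                   # observed 9 (+), 3 (-)

print(f"P(3 <= X <= 9)        = {central:.4f}")
print(f"P(X <= 2 or X >= 10)  = {tails:.4f}")
print(f"two-sided p for 9/3   = {p_value:.4f}")
```

The observed split of 9 and 3 lies inside the central region, which is why the result is not strongly significant.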

PERMUTATION/RANDOMIZATION TESTS
• Example: Suppose we have 8 subjects, 4 to be selected at random for new training. All 8 are ranked in order of level of ability after a given period, ranking from 1 (best) to 8 (worst).
• P{subjects ranked 1, 2, 3, 4 took new training} = ??
• Clearly any 4 subjects could be chosen. Selecting r = 4 units from n = 8 gives C(8,4) = 70 possible selections.
• If the new scheme is ineffective, all sets of ranks are equally likely: P{1, 2, 3, 4} = 1/70
• More formally, sum the ranks in each grouping. Low sums indicate that the training is effective, high sums that it is not.
  Sums  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
  No.    1  1  2  3  5  5  7  7  8  7  7  5  5  3  2  1  1
• Critical region of size 2/70 is given by rank sums 10 and 11, while size 4/70 comes from rank sums 10, 11, 12 (both “nominal” 5%)
• Testing H0: new training scheme gives no improvement vs H1: some improvement
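The rank-sum distribution for this example is small enough to enumerate completely. The sketch below lists all 70 selections of 4 ranks from 8 and counts how often each sum occurs; variable names are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Every possible set of 4 ranks out of 1..8, equally likely under H0
counts = Counter(sum(group) for group in combinations(range(1, 9), 4))
total = sum(counts.values())

for rank_sum in sorted(counts):
    print(rank_sum, counts[rank_sum])

# Sizes of the two candidate critical regions
size_10_11 = sum(c for s, c in counts.items() if s <= 11) / total
size_10_12 = sum(c for s, c in counts.items() if s <= 12) / total
print(f"P(sum <= 11) = {size_10_11:.4f}, P(sum <= 12) = {size_10_12:.4f}")
```

The distribution is symmetric about 18, so the count for a sum of 23 must equal the count for 13.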

MORE INFORMATION: WILCOXON ‘SIGNED RANK’
• Direction and magnitude: H0: median = 220? Assumes symmetry.
• Arrange all sample deviations from the median in order of magnitude and replace by ranks (1 = smallest deviation, n = largest). A high sum for positive (or negative) ranks, relative to the other, implies H0 unlikely.
  Weights       126  142  156  228  245  246  370  419  433  454  478  503
  Diffs.        -94  -78  -64    8   25   26  150  199  213  234  258  283
  Rearranged      8   25   26  -64  -78  -94  150  199  213  234  258  283
  Signed ranks    1    2    3   -4   -5   -6    7    8    9   10   11   12
• Clearly S_negative = 15 < S_positive = 63
• Tables are of the form: reject H0 if the lower of S_negative, S_positive ≤ tabled value
• e.g. here, n = 12 at α = 5% level, tabled value = 13, so do not reject H0
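The signed-rank sums for the weights example can be reproduced as follows. This is a sketch that assumes no tied magnitudes and no zero differences, which holds for this data; ties would need mid-ranks.

```python
weights = [126, 142, 156, 228, 245, 246, 370, 419, 433, 454, 478, 503]
hypothesised_median = 220

diffs = [w - hypothesised_median for w in weights]
ordered = sorted(diffs, key=abs)  # smallest magnitude first

# With no ties, the rank of each deviation is its 1-based position
s_positive = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
s_negative = sum(rank for rank, d in enumerate(ordered, start=1) if d < 0)

print("ordered differences:", ordered)
print("S_positive =", s_positive, " S_negative =", s_negative)
```

The smaller of the two sums is the value compared with the tabled critical value.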

LARGE SAMPLES and C.I.
• Normal approximation for S, the smaller in magnitude of the rank sums, so C.I. as usual.
• General approach for C.I.: the basic idea is to take pairs of observations, calculate the mean of each pair, and omit the largest/smallest of the (1/2)(n)(n+1) pair means. Usually computer-based: re-sampling or graphical techniques.
• Alternative forms are common for non-parametrics, e.g. for Wilcoxon signed ranks, use the magnitude of the difference between the positive and negative rank sums. Different table.
• Ties complicate distributions and significance. Assign mid-ranks.
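The slide does not spell out the normal approximation, so the sketch below assumes the usual textbook form for a signed-rank sum under H0 with no ties: mean n(n+1)/4 and variance n(n+1)(2n+1)/24, without a continuity correction. The function names are illustrative, and n = 12 is used only to show the arithmetic; it is small for a large-sample approximation.

```python
from math import erf, sqrt

def signed_rank_z(s, n):
    """Standardised deviate for a signed-rank sum s from n non-zero differences."""
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24
    return (s - mean) / sqrt(variance)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = signed_rank_z(15, 12)            # S_negative = 15 from the weights example
p_two_sided = 2 * normal_cdf(-abs(z))
print(f"z = {z:.4f}, two-sided p = {p_two_sided:.4f}")
```

The approximate p-value is a little above 5%, which agrees with the table-based decision not to reject H0.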

KOLMOGOROV-SMIRNOV and EMPIRICAL DISTRIBUTIONS
• Purpose: to compare sets of measurements (two groups with each other, or one group with expected) and to analyse differences.
• Cannot assume Normality of the underlying distribution (usual shape), so need enough sample values to base the comparison on (e.g. ≥ 4, 2 groups)
• Major features: sensitivity to differences in both shape and location of medians (does not distinguish which is different)
• Uses the empirical c.d.f., not the p.d.f.: looks for consistency by comparing the population curve (expected case) with the empirical curve (sample values)
• Step function S(x) = (number of sample values ≤ x)/n, i.e. value at each step from data
• S(x) should never be too far from F(x) = “expected” form
• Test basis is the maximum magnitude of the difference, max |S(x) - F(x)|
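A one-sample version of the statistic can be sketched as below, assuming the test basis is the largest absolute gap between the empirical step function S(x) and the expected F(x). Because S(x) jumps at each data value, the gap is checked just before and just after each jump. The sample and the uniform F(x) are made-up illustrations, not lecture data.

```python
def ks_statistic(sample, cdf):
    """Largest |S(x) - F(x)|, checking both sides of each step of S(x)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # S(x) equals (i - 1)/n just below x and i/n at x
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def uniform_cdf(x):
    """F(x) for a Uniform(0, 1) 'expected' distribution (illustrative choice)."""
    return min(1.0, max(0.0, x))

d = ks_statistic([0.1, 0.4, 0.7], uniform_cdf)
print(f"D = {d:.4f}")
```

The value of D is then compared with a tabled critical value for the sample size.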

Criticisms/Comparison of K-S with other (χ²) Goodness of Fit Tests for distributions
Main criticism of Kolmogorov-Smirnov:
  - wastes information in using only differences of greatest magnitude (in cumulative form)
General advantages/disadvantages of K-S:
  - easy to apply
  - relatively easy to obtain C.I.
  - generally deals well with continuous data. Discrete data also possible, but test criteria not exact, so can be inefficient.
  - for two groups, need same number of observations
  - distinction between location/shape differences not established
Note: χ² applies to both discrete and continuous data, and to grouped data, but “arbitrary” grouping can be a problem. It affects the sensitivity of H0 rejection.

COMPARISON OF 2 INDEPENDENT SAMPLES: WILCOXON-MANN-WHITNEY
• Parallel with parametric (classical) again. H0: samples from same population (medians same) vs H1: medians not the same
• For two samples, sizes m, n, calculate the joint ranking and the rank sum for each sample, giving Sm and Sn. These should be similar if the populations sampled are also similar.
• Sm + Sn = sum of all ranks = (1/2)(m+n)(m+n+1), and the result is tabulated for Um = Sm - (1/2)m(m+1), Un = Sn - (1/2)n(n+1)
• Clearly Um + Un = mn, so need only calculate one from 1st principles
• Tables typically give, for various m, n, the value which the smallest U must not exceed in order to reject H0. 1-tailed/2-tailed.
• Easier: use the sum of the smaller ranks or of the fewer values.
• Example in brief: if sum of ranks = 12, say, the probability is based on the number of possible ways of obtaining a 12 out of the total number of possible sums

Example - W-M-W
• For the example on weights earlier. Assume we now have a 2nd sample set also:
  29 39 60 78 82 112 125 170 192 224 263 275 276 286 369 756
• Combined ranks for the two samples are:
  Value   29  39  60  78  82 112 125 126 142 156 170 192 224 228 245
  Rank     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  Value  246 263 275 276 286 369 370 419 433 454 478 503 756
  Rank    16  17  18  19  20  21  22  23  24  25  26  27  28
• Here m = 16, n = 12 and Sm = 1 + 2 + 3 + … + 21 + 28 = 187
• So Um = 51 and Un = 141. (Clearly, can check by calculating Un directly also)
• For a 2-tailed test at the 5% level, the tabled value is 53 and our Um = 51 is less, i.e. more extreme, so reject H0. Medians are different here.
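The rank sums and U values in this example can be reproduced directly. The sketch assumes U is obtained from a rank sum by subtracting the smallest possible rank sum for that sample size, which is the convention that matches the figures quoted above; there are no tied values in the combined data.

```python
first = [126, 142, 156, 228, 245, 246, 370, 419, 433, 454, 478, 503]      # n = 12
second = [29, 39, 60, 78, 82, 112, 125, 170, 192, 224,
          263, 275, 276, 286, 369, 756]                                   # m = 16

combined = sorted(first + second)
rank_of = {value: rank for rank, value in enumerate(combined, start=1)}  # no ties

m, n = len(second), len(first)
s_m = sum(rank_of[v] for v in second)
s_n = sum(rank_of[v] for v in first)

u_m = s_m - m * (m + 1) // 2
u_n = s_n - n * (n + 1) // 2

print(f"Sm = {s_m}, Sn = {s_n}")
print(f"Um = {u_m}, Un = {u_n}, Um + Un = {u_m + u_n} (mn = {m * n})")
```

The smaller of the two U values is the one compared with the tabled critical value.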

MANY SAMPLES - Kruskal-Wallis
• Direct extension of W-M-W. Tests H0: medians are the same.
• Rank the total number of observations for all samples from smallest (rank 1) to highest (rank N) for N values. Ties are given the mid-rank.
• r_ij is the rank of observation x_ij and s_i = sum of ranks in the ith sample (group)
• Compute the uncorrected treatment and total SSQ of ranks
• For no ties, this simplifies
• Subtract off the correction for the average from each
• The test statistic is approx. χ² for moderate/large N. It simplifies further if there are no ties.
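The steps above can be sketched in code. The formulas here are an assumption, since the slide's own expressions are not shown: treatment SSQ S_t = Σ s_i²/n_i, total SSQ S_r = Σ r_ij², correction C = N(N+1)²/4, and statistic T = (N-1)(S_t - C)/(S_r - C). The symbols S_t, S_r, C and T, and the small data set, are labels and values chosen for this sketch.

```python
def mid_ranks(values):
    """Ranks 1..N with tied values sharing the average of their positions."""
    s = sorted(values)
    return [(s.index(v) + 1 + len(s) - s[::-1].index(v)) / 2 for v in values]

def kruskal_wallis(groups):
    """Kruskal-Wallis statistic from rank sums of squares (handles ties)."""
    pooled = [x for g in groups for x in g]
    big_n = len(pooled)
    ranks = mid_ranks(pooled)

    s_r = sum(r * r for r in ranks)           # uncorrected total SSQ of ranks
    s_t, start = 0.0, 0
    for g in groups:
        s_i = sum(ranks[start:start + len(g)])  # rank sum for this group
        s_t += s_i ** 2 / len(g)                # uncorrected treatment SSQ
        start += len(g)

    c = big_n * (big_n + 1) ** 2 / 4          # correction for the average
    return (big_n - 1) * (s_t - c) / (s_r - c)

groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]    # made-up data, no ties
t_stat = kruskal_wallis(groups)
print(f"T = {t_stat:.3f} on {len(groups) - 1} d.f.")
```

With no ties this reduces to 12 S_t / (N(N+1)) - 3(N+1), which gives the same value for the data above.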

PAIRING/RANDOMIZED BLOCKS - Friedman
• Blocks of units, so e.g. two treatments allocated at random within a block = matched pairs; can use a variant of the sign test (on differences)
• Many samples or units = Friedman (simplest case of R.B. design)
• Recall that comparisons within pairs/blocks are more precise than between, so including a Blocks term “removes” the block effect as a source of variation.
• Friedman’s test replaces observations by ranks (within blocks) to achieve this. (Thus, ranked data can also be used directly.)
• Have x_ij = response for treatment i (i = 1, 2, ..., t) in each block j (j = 1, 2, ..., b)
• Ranked within blocks
• Sum of ranks obtained for each treatment: s_i, i = 1, …, t
• For rank r_ij (or mid-rank if tied), compute the raw (uncorrected) rank SSQ

Friedman contd.
• With no ties, the analysis simplifies
• Need also the treatment SSQ (all treatments appear in each block)
• Again, the correction factor is analogous to that for Kruskal-Wallis
• The common form of the Friedman test statistic follows from these quantities
• t, b should not be very small; otherwise exact tables are needed.
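A sketch of the Friedman calculation follows. The formulas are assumed rather than taken from the slides: S_r = Σ r_ij², S_t = Σ s_i²/b, C = bt(t+1)²/4 and T = b(t-1)(S_t - C)/(S_r - C), referred to χ² on t-1 d.f. The symbols and the small data set are illustrative only.

```python
def mid_ranks(values):
    """Ranks within one block, tied values sharing the average position."""
    s = sorted(values)
    return [(s.index(v) + 1 + len(s) - s[::-1].index(v)) / 2 for v in values]

def friedman(blocks):
    """Friedman statistic; `blocks` is a list of b rows, one response per treatment."""
    b, t = len(blocks), len(blocks[0])
    ranked = [mid_ranks(row) for row in blocks]            # rank within each block

    s = [sum(row[i] for row in ranked) for i in range(t)]  # rank sum per treatment
    s_r = sum(r * r for row in ranked for r in row)        # raw rank SSQ
    s_t = sum(s_i ** 2 for s_i in s) / b                   # treatment SSQ
    c = b * t * (t + 1) ** 2 / 4                           # correction factor
    return b * (t - 1) * (s_t - c) / (s_r - c)

# Made-up responses: 4 blocks (rows) by 3 treatments (columns)
blocks = [[1, 2, 3], [1, 2, 3], [1, 3, 2], [2, 1, 3]]
t_stat = friedman(blocks)
print(f"T = {t_stat:.3f} on {len(blocks[0]) - 1} d.f.")
```

With no ties the same value comes from 12 Σ s_i² / (bt(t+1)) - 3b(t+1).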

Other Parallels with Parametric cases
• Correlation: Spearman’s rho (Pearson’s product-moment correlation calculated using ranks or mid-ranks), used to compare e.g. ranks on two assessments/tests.
• Regression: LSE robust in general. Some use of “median methods”, such as Theil’s (not dealt with here, so assume usual least squares form).
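Spearman's rho as described here, Pearson's correlation applied to ranks or mid-ranks, can be sketched as follows. The two score lists are made-up assessments for five subjects.

```python
from math import sqrt

def mid_ranks(values):
    """Ranks 1..n with tied values sharing the average of their positions."""
    s = sorted(values)
    return [(s.index(v) + 1 + len(s) - s[::-1].index(v)) / 2 for v in values]

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the (mid-)ranks."""
    return pearson(mid_ranks(x), mid_ranks(y))

test_a = [1, 2, 3, 4, 5]   # hypothetical ranks on a first assessment
test_b = [2, 1, 4, 3, 5]   # hypothetical ranks on a second assessment
rho = spearman_rho(test_a, test_b)
print(f"rho = {rho:.3f}")
```

With no ties this agrees with the shortcut 1 - 6 Σ d² / (n(n² - 1)), where d is the difference in ranks for each subject.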

NON-PARAMETRIC C.I. in Informatics: BOOTSTRAP
• Bootstrapping = re-sampling technique used to obtain the empirical distribution for an estimator in the construction of a non-parametric C.I.
  - Effective when the distribution is unknown or complex
  - More computation than parametric approaches, and may fail when the sample size of the original experiment is small
  - Re-sampling implies sampling from a sample, usually to estimate empirical properties (such as variance, distribution, C.I. of an estimator) and to obtain the EDF of a test statistic. Common methods are Bootstrap, Jackknife, shuffling
  - Aim = approximate numerical solutions (like confidence regions). Can handle bias in this way, e.g. to find the MLE of variance σ², mean unknown
  - Both Bootstrap and Jackknife are used, Bootstrap more often for C.I.

Bootstrap/Non-parametric C.I. contd.
Basis: both Bootstrap and others rely on the fact that the sample cumulative distribution function (CDF or just DF) = MLE of a population distribution function F(x)
Define a Bootstrap sample as a random sample, size n, drawn with replacement from a sample of n objects
For S the original sample, P{drawing each item, object or group} = 1/n
Bootstrap sample S^B is obtained from the original by sampling n times with replacement
Power relies on the fact that a large number of resampling samples can be obtained from a single original sample, so if we repeat the process b times, we obtain S_j^B, j = 1, 2, …, b, with each of these being a bootstrap replication

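Contd.
• Estimator: obtained from each sample. Given the estimate for the jth replication, the bootstrap mean and variance are the mean and variance of the b replicate estimates,
• while Bias_B = bootstrap mean minus the estimate from the original sample
• CDF of estimator = proportion of the b replications with estimate ≤ x
• so the C.I. with a given confidence coefficient is then obtained from the corresponding percentiles
• Normal approx. for mean: large b (t_{b-1} distribution if no. of bootstrap replications is small). Standardised Normal deviate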
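The bootstrap quantities described above can be sketched with the standard library. Several details are assumptions made for this illustration: the estimator is the sample mean, the variance uses a b-1 divisor, the percentile interval is read off the sorted replicates with a simple index rule, and the seed, the name `bootstrap` and the choice of the earlier weights as data are arbitrary.

```python
import random
from statistics import mean, variance

def bootstrap(sample, estimator, b=1000, seed=0):
    """Bootstrap mean, variance, bias and a 95% percentile interval."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    n = len(sample)
    # Each replication: draw n items with replacement, each with probability 1/n
    reps = [estimator([rng.choice(sample) for _ in range(n)]) for _ in range(b)]

    boot_mean = mean(reps)
    boot_var = variance(reps)                  # b - 1 divisor
    bias = boot_mean - estimator(sample)       # bootstrap mean minus original estimate

    ordered = sorted(reps)                     # empirical CDF of the estimator
    lower = ordered[int(0.025 * b)]            # simple percentile rule (assumed)
    upper = ordered[int(0.975 * b) - 1]
    return {"mean": boot_mean, "variance": boot_var, "bias": bias,
            "ci95": (lower, upper), "replicates": reps}

weights = [126, 142, 156, 228, 245, 246, 370, 419, 433, 454, 478, 503]
result = bootstrap(weights, mean)
print("bootstrap mean:", round(result["mean"], 2))
print("bootstrap variance:", round(result["variance"], 2))
print("bias:", round(result["bias"], 2))
print("95% percentile C.I.:", result["ci95"])
```

A Normal-approximation interval would instead take the original estimate plus or minus a standardised Normal deviate times the square root of the bootstrap variance.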

Example
• Recall (gene and marker) or (sex and purchasing) example
• MLE, 1000 bootstrapping replications might give results:
  Parametric
    Variance                    0.0001357       0.00099
    95% C.I.                    (0, 0.0455)     (0.162, 0.286)
    95% Interval (Likelihood)   (0.06, 0.056)   (0.17, 0.288)
  Bootstrap
    Variance                    0.0001666       0.0009025
    Bias                        0.0000800       0.0020600
    95% C.I. Normal             (0, 0.048)      (0.1675, 0.2853)
    95% C.I. (Percentile)       (0, 0.054)      (0.1815, 0.2826)

SUMMARY: Non-Parametric Use
Simple property: Sign Tests - large number of variants. Simple basis.
Paired data: Wilcoxon Signed Rank - compare medians. Conditions/Assumptions: no. pairs ≥ 6; distributions same shape.
Independent data: Mann-Whitney U - compare medians, 2 independent groups. Conditions/Assumptions: (N ≥ 4); distributions same shape.
Correlation: as before. Parallels parametric case.
Distributions: Kolmogorov-Smirnov - compare either location or shape, but the two are not separately distinguished. Conditions/Assumptions: (N ≥ 4). If 2 groups are compared, need equal no. of observations.
Many Group/Sample Comparisons:
  Friedman - compare medians. Conditions: data in Randomised Block design. Distributions same shape.
  Kruskal-Wallis - independent groups. Conditions: completely randomised. Groups can be unequal nos. Distributions same shape.
Regression: robust, as noted, so use parametric form.
