Statistical Methods for Comparing Data Distributions

Application of statistical methods for the comparison of data distributions
Susanna Guatelli, Barbara Mascialino, Andreas Pfeiffer, Maria Grazia Pia, Alberto Ribon, Paolo Viarengo

Outline
• The comparison of two data distributions is fundamental in experimental practice
• Typical use cases: detector monitoring (current versus reference data); simulation validation (experiment versus simulation); reconstruction versus expectation; regression testing (two versions of the same software); physics analysis (measurement versus theory, experiment A versus experiment B)
• Many algorithms are available for the comparison of two data distributions (the two-sample problem)
• Approaches: parametric statistics; non-parametric statistics (goodness-of-fit testing)
Aim of this study: compare the algorithms available in the statistics literature to select the most appropriate one in every specific case

The two-sample problem
• EXAMPLE 1: binned data (X-ray fluorescence spectrum)
• EXAMPLE 2: unbinned data (dosimetric distribution from a medical LINAC)
Which is the most suitable goodness-of-fit test?

Chi-squared test
• Applies to binned distributions
• It can also be useful for unbinned distributions, but the data must first be grouped into classes
• Cannot be applied if the theoretical frequency in any class is < 5
• When a class falls below this threshold, one can merge contiguous classes until the minimum theoretical frequency is reached
• Otherwise one can use Yates’ formula
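The class-merging rule above can be sketched in a few lines. This is a minimal illustration, not the GoF Toolkit implementation: the function names, the left-to-right merging order and the choice to fold a leftover tail into the last class are all assumptions made for the example.

```python
def merge_small_classes(observed, expected, min_expected=5.0):
    """Merge contiguous classes until each has expected frequency >= min_expected.

    Hypothetical helper: classes are accumulated from left to right.
    """
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    # A leftover tail below the threshold is folded into the last merged class.
    if acc_exp > 0.0 or acc_obs > 0.0:
        if not merged_exp:
            raise ValueError("total expected frequency is below the minimum")
        merged_obs[-1] += acc_obs
        merged_exp[-1] += acc_exp
    return merged_obs, merged_exp


def chi_squared(observed, expected, min_expected=5.0):
    """Return the chi-squared statistic and the number of classes after merging."""
    obs, exp = merge_small_classes(observed, expected, min_expected)
    stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    return stat, len(obs)


if __name__ == "__main__":
    # Five classes, three of them with expected frequency below 5.
    print(chi_squared([3, 7, 10, 4, 1], [2, 8, 9, 4, 2]))
```

The number of degrees of freedom depends on how the expected frequencies were obtained, so the sketch returns only the number of classes left after merging.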

Tests based on the supremum statistics
• Applicable to unbinned distributions
• Kolmogorov-Smirnov test
• Goodman approximation of the KS test
• Kuiper test
[Figure: original distributions and their empirical distribution functions, with the supremum statistic Dmn marked]
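As a rough illustration of the supremum statistics, the sketch below computes the two-sample Kolmogorov-Smirnov distance, the Kuiper statistic and Goodman's chi-squared approximation from the two empirical distribution functions. The function names are hypothetical, and the Goodman form used here (chi-squared = 4 D² mn/(m+n) with 2 degrees of freedom) is the textbook version, assumed rather than taken from the slides.

```python
import math
from bisect import bisect_right


def edf_differences(x, y):
    """Differences F_m(v) - G_n(v) of the two EDFs at every distinct pooled value."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    points = sorted(set(xs) | set(ys))
    return [bisect_right(xs, v) / m - bisect_right(ys, v) / n for v in points]


def ks_statistic(x, y):
    """Kolmogorov-Smirnov distance: largest absolute difference between the EDFs."""
    return max(abs(d) for d in edf_differences(x, y))


def kuiper_statistic(x, y):
    """Kuiper statistic: largest positive plus largest negative EDF difference."""
    diffs = edf_differences(x, y)
    d_plus = max(0.0, max(diffs))
    d_minus = max(0.0, -min(diffs))
    return d_plus + d_minus


def goodman_chi2(x, y):
    """Goodman approximation (assumed textbook form) and its 2-dof tail probability."""
    m, n = len(x), len(y)
    d = ks_statistic(x, y)
    chi2 = 4.0 * d * d * m * n / (m + n)
    p_value = math.exp(-chi2 / 2.0)  # upper tail of a chi-squared with 2 dof
    return chi2, p_value


if __name__ == "__main__":
    a, b = [1, 2, 3, 4], [3, 4, 5, 6]
    print(ks_statistic(a, b), kuiper_statistic(a, b), goodman_chi2(a, b))
```

Evaluating the EDFs only at distinct pooled values keeps tied observations from producing spurious intermediate differences.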

Tests containing a weighting function
• Applicable to binned and unbinned distributions
• Quadratic statistics plus a weighting function: sum/integral of all the distances
• Fisz-Cramér-von Mises test
• k-sample Anderson-Darling test
[Figure: original distributions and their empirical distribution functions]
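A minimal sketch of the two quadratic statistics for the two-sample case follows. It assumes the standard textbook forms (the Fisz-Cramér-von Mises sum over the pooled sample, and Pettitt's two-sample Anderson-Darling formula for data without ties); the slides do not give the formulas, the function names are hypothetical, and no p-value is computed.

```python
from bisect import bisect_right


def cramer_von_mises(x, y):
    """Two-sample Cramér-von Mises statistic.

    T = mn/(m+n)^2 * sum over pooled observations of (F_m(z) - G_n(z))^2.
    """
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    pooled = xs + ys
    total = sum(
        (bisect_right(xs, z) / m - bisect_right(ys, z) / n) ** 2 for z in pooled
    )
    return m * n / (m + n) ** 2 * total


def anderson_darling(x, y):
    """Two-sample Anderson-Darling statistic (assumed form, no ties allowed).

    A^2 = 1/(mn) * sum_{i=1}^{N-1} (M_i N - m i)^2 / (i (N - i)),
    where M_i counts the x-observations among the i smallest pooled values.
    """
    xs = sorted(x)
    m, n = len(x), len(y)
    big_n = m + n
    pooled = sorted(list(x) + list(y))
    if len(set(pooled)) != big_n:
        raise ValueError("this sketch does not handle tied observations")
    total = 0.0
    for i in range(1, big_n):
        m_i = bisect_right(xs, pooled[i - 1])
        # The 1/(i (N - i)) weight emphasises differences in the tails.
        total += (m_i * big_n - m * i) ** 2 / (i * (big_n - i))
    return total / (m * n)


if __name__ == "__main__":
    print(cramer_von_mises([1, 3, 5], [2, 4, 6]))
    print(anderson_darling([1, 3, 5], [2, 4, 6]))
```

The only structural difference between the two statistics is the weight applied to each squared distance, which is what makes the Anderson-Darling form more sensitive to the tails.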

G.A.P. Cirrone, S. Donadio, S. Guatelli, A. Mantero, B. Mascialino, S. Parlati, M.G. Pia, A. Pfeiffer, A. Ribon, P. Viarengo, “A Goodness-of-Fit Statistical Toolkit”, IEEE Transactions on Nuclear Science (2004), 51 (5): October issue. http://www.ge.infn.it/geant4/analysis/HEPstatistics/

Power evaluation
The power of a test is the probability of rejecting the null hypothesis when it is false.
• Pseudoexperiment: a random drawing of two samples (Sample 1 of size n, Sample 2 of size m) from two parent distributions (Parent distribution 1, Parent distribution 2), which are then compared with a GoF test
• N = 1000 Monte Carlo replications
• Confidence Level = 0.05
• Power = (# pseudoexperiments with p-value < (1-CL)) / (# pseudoexperiments)
For each test, the p-value computed by the GoF Toolkit derives from an analytical calculation of the asymptotic distribution, often depending on the sample sizes.
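The pseudoexperiment loop can be sketched as below, using the Kolmogorov-Smirnov test as the GoF test. Several things are assumptions made for the illustration: the function names, the use of the asymptotic Kolmogorov series for the p-value, the fixed random seed, and the reading of the 0.05 quoted on the slide as the p-value threshold below which the null hypothesis is rejected.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(x, y):
    """Largest absolute difference between the two empirical distribution functions."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    points = sorted(set(xs) | set(ys))
    return max(abs(bisect_right(xs, v) / m - bisect_right(ys, v) / n) for v in points)


def ks_asymptotic_p_value(d, m, n):
    """Asymptotic p-value: 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 j^2 lambda^2)."""
    lam = d * math.sqrt(m * n / (m + n))
    if lam < 0.2:
        return 1.0  # the series is numerically equal to 1 for such small lambda
    total = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam) for j in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * total))


def estimate_power(draw1, draw2, m, n, replications=1000, threshold=0.05, seed=1):
    """Fraction of pseudoexperiments whose p-value falls below the threshold.

    draw1 and draw2 are hypothetical callables that take a random.Random
    instance and return one value from parent distribution 1 or 2.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        sample1 = [draw1(rng) for _ in range(m)]
        sample2 = [draw2(rng) for _ in range(n)]
        p_value = ks_asymptotic_p_value(ks_statistic(sample1, sample2), m, n)
        if p_value < threshold:
            rejections += 1
    return rejections / replications


if __name__ == "__main__":
    gaussian = lambda rng: rng.gauss(0.0, 1.0)
    shifted = lambda rng: rng.gauss(1.0, 1.0)
    print(estimate_power(gaussian, shifted, m=50, n=50))
```

When both samplers draw from the same parent, the same function estimates the false-rejection rate instead of the power, which is a useful sanity check on the p-value calculation.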

Parent distributions
• Gaussian
• Uniform
• Double exponential
• Cauchy
• Exponential
• Contaminated Normal Distribution 1
• Contaminated Normal Distribution 2

Skewness and tailweight
[Figure: axes Skewness and Tailweight]
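The slides do not define skewness S and tailweight T. The sketch below assumes a quantile-based definition, S = (x0.975 - x0.5)/(x0.5 - x0.025) and T = (x0.975 - x0.025)/(x0.875 - x0.125). Under that assumption a symmetric distribution has S = 1 and the double exponential has T = 2.161, which agrees with the values quoted in the slides. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def skewness_tailweight(quantile):
    """Quantile-based skewness S and tailweight T (assumed definitions).

    quantile is a callable mapping a probability p to the quantile x_p.
    """
    x025, x125, x500, x875, x975 = (
        quantile(p) for p in (0.025, 0.125, 0.5, 0.875, 0.975)
    )
    skewness = (x975 - x500) / (x500 - x025)
    tailweight = (x975 - x025) / (x875 - x125)
    return skewness, tailweight


def laplace_quantile(p):
    """Quantile function of the standard double exponential (Laplace) distribution."""
    return math.log(2.0 * p) if p < 0.5 else -math.log(2.0 * (1.0 - p))


def empirical_quantile(sample):
    """Return a quantile function for a data sample, using linear interpolation."""
    xs = sorted(sample)

    def quantile(p):
        pos = p * (len(xs) - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    return quantile


if __name__ == "__main__":
    print(skewness_tailweight(laplace_quantile))
    print(skewness_tailweight(NormalDist().inv_cdf))
    print(skewness_tailweight(empirical_quantile(range(1001))))
```

Passing `empirical_quantile(data)` gives estimates of S and T for a measured sample, which is the classification step that the test selection relies on.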

Case Parent 1 = Parent 2: the “location-scale problem”
• Kolmogorov-Smirnov test, CL = 0.05
• Power increases as a function of the sample size (analytical calculation of the asymptotic distribution)
[Figure: power versus sample size N, for small-sized and moderate-sized samples; curves for Uniform, Normal, Exponential, Double Exponential, Contaminated Normal 1, Contaminated Normal 2, Cauchy]

Case Parent 1 ≠ Parent 2: the “general shape problem” (CL = 0.05)
A) Symmetric versus symmetric (S1 = S2 = 1): Distribution 1 is double exponential (T1 = 2.161)
B) Skewed versus symmetric
[Figure: power versus tailweight T2 of Distribution 2 for the Kolmogorov-Smirnov (KS), Cramér-von Mises (CVM) and Anderson-Darling (AD) tests, with the ordering of the three tests given separately for short-medium tailed and for very long tailed distributions]

Comparative evaluation of tests (by skewness and tailweight)
χ² < Supremum statistics tests < Tests containing a weight function

Results for the data examples
• EXAMPLE 1: binned data. X-variable: Ŝ = 4, T̂ = 1.43; Y-variable: Ŝ = 4, T̂ = 1.50. Extremely skewed, medium tail: ANDERSON-DARLING TEST, A² = 0.085, p > 0.05
• EXAMPLE 2: unbinned data. X-variable: Ŝ = 1.53, T̂ = 1.36; Y-variable: Ŝ = 1.27, T̂ = 1.34. Moderately skewed, medium tail: KOLMOGOROV-SMIRNOV TEST, D = 0.27, p > 0.05

Conclusions
• Studied several goodness-of-fit tests for location-scale alternatives and general alternatives
• There is no clear winner for all the considered distributions in general
• To select one test in practice:
1. first classify the type of the distributions in terms of skewness S and tailweight T
2. choose the most appropriate test for the classified type of distribution
Topic still subject to research activity in the domain of statistics
