
Statistical Methods for Data Analysis: parameter estimate



1. Statistical Methods for Data Analysis: parameter estimate
Luca Lista, INFN Napoli

2. Contents
• Parameter estimates
• Likelihood function
• Maximum Likelihood method
• Problems with asymmetric errors

3. Meaning of parameter estimate
• We are interested in some unknown physical parameters
• Experiments provide samplings of some PDF which has among its parameters the physical unknowns we are interested in
• The experiment's results are statistically "related" to the unknown PDF
• PDF parameters can be determined from the sample within some approximation or uncertainty
• Knowing a parameter within some error may mean different things:
  • Frequentist: in the limit of a large number of experiments, a large fraction (68% or 95%, usually) of the experiments will contain the (fixed) unknown true value within the quoted confidence interval, usually [estimate − σ, estimate + σ] ("coverage")
  • Bayesian: we determine a degree of belief that the unknown parameter is contained in a specified interval, which can be quantified as 68% or 95%
• We will see that there is still some more degree of arbitrariness in the definition of confidence intervals…

4. Statistical inference
• Probability: Theory Model → Data; the data fluctuate according to the randomness of the process
• Inference: Data → Theory Model; the model has an uncertainty due to fluctuations of the data sample

5. Hypothesis tests
• Theory Model 1 vs Theory Model 2, confronted with the Data
• Which hypothesis is the most consistent with the experimental data?

6. Parameter estimators
• An estimator is a function of a given sample whose statistical properties are known and related to some PDF parameters ("best fit")
• Simplest example:
  • Assume we have a Gaussian PDF with a known σ and an unknown μ
  • A single experiment will provide a measurement x
  • We estimate μ as μest = x
  • The distribution of μest (repeating the experiment many times) is the original Gaussian
  • 68.27%, on average, of the experiments will provide an estimate within: μ − σ < μest < μ + σ
  • We can determine: μ = μest ± σ
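The coverage statement above is easy to verify with a quick toy study. The following Python sketch (illustrative, not part of the original slides; all names and numbers are made up) repeats the single-measurement Gaussian experiment many times and counts how often μest ± σ covers the true μ:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
mu_true, sigma = 5.0, 2.0      # true mean (unknown in practice) and known sigma
n_experiments = 100_000

# Each experiment yields a single measurement x; the estimate is mu_est = x
x = rng.normal(mu_true, sigma, size=n_experiments)

# Fraction of experiments where [mu_est - sigma, mu_est + sigma] covers mu_true
covered = np.abs(x - mu_true) < sigma
print(f"coverage = {covered.mean():.4f}  (expected 0.6827)")
```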

7. Likelihood function
• Given a sample of N events, each with variables (x1, …, xn), the likelihood function expresses the probability density of the sample as a function of the unknown parameters:
  L(x1, …, xn; θ1, …, θm) = ∏i=1..N f(x1(i), …, xn(i); θ1, …, θm)
• Sometimes the notation used for the parameters is the same as for conditional probability: L(x1, …, xn | θ1, …, θm)
• If the size N of the sample is also a random variable, the extended likelihood function is also used:
  L = P(N; θ1, …, θm) ∏i=1..N f(x1(i), …, xn(i); θ1, …, θm)
  where P is most of the time a Poisson distribution whose average is a function of the unknown parameters
• In many cases it is convenient to use −ln L or −2 ln L

8. Maximum likelihood estimates
• ML is the most widely used parameter estimator
• The "best fit" parameters are the set that maximizes the likelihood function
• "Very good" statistical properties, as will be seen in the following
• The maximization can be performed analytically for the simplest cases, and numerically for most cases
• Minuit is historically the most used minimization engine in High Energy Physics (F. James, 1970s; recently rewritten in C++)
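As a generic illustration of numerical ML maximization (a sketch using scipy.optimize as a stand-in for Minuit, which is an assumption of this note, not the slides' tool), the code below minimizes −2 ln L for a toy Gaussian sample with unknown μ and σ:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=2)
data = rng.normal(loc=10.0, scale=2.0, size=1000)  # toy Gaussian sample

# -2 ln L for a Gaussian model with unknown mu and sigma
def nll2(params):
    mu, sigma = params
    return np.sum((data - mu) ** 2 / sigma ** 2 + np.log(2 * np.pi * sigma ** 2))

# Start from the moment estimates; any reasonable starting point works here
res = minimize(nll2, x0=[data.mean(), data.std()],
               bounds=[(None, None), (1e-6, None)])
print("mu_hat, sigma_hat =", res.x)  # close to (10, 2)
```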

9. Extended likelihood function
• Given a sample of N measurements of the variables (x1, …, xn), the likelihood function expresses the probability density of the sample as a function of the unknown parameters:
  L = ∏i=1..N f(x1(i), …, xn(i); θ1, …, θm)
• If the size N of the sample is also a random variable, the extended likelihood function is usually used:
  L = P(N; θ1, …, θm) ∏i=1..N f(x1(i), …, xn(i); θ1, …, θm)
  where P(N; θ1, …, θm) is in practice always a Poisson distribution whose expected rate is a function of the unknown parameters
• In many cases it is convenient to use −ln L or −2 ln L

10. Extended likelihood function
• For Poissonian signal and background processes:
  L(x; s, b, θ) = [e−(s+b) (s + b)^N / N!] ∏i=1..N [s Ps(xi; θ) + b Pb(xi; θ)] / (s + b)
• We can fit s, b and θ simultaneously by minimizing:
  −2 ln L = 2(s + b) − 2 Σi=1..N ln[s Ps(xi; θ) + b Pb(xi; θ)] + constant
  (the ln N! term is a constant and can be dropped in the minimization)
• Sometimes s is replaced by μs0, where s0 is the theory estimate and μ is called the signal strength

11. Example of ML fit
• The exponential decay parameter λ, Gaussian mean μ and standard deviation σ can be fit together with the signal and background yields s and b
• Ps(m): Gaussian peak; Pb(m): exponential shape
• The additional parameters, beyond the parameter(s) of interest (s in this case), used to model background, resolution, etc., are examples of nuisance parameters
• In the plot, data are accumulated into bins of a given width; error bars usually represent the uncertainty on each bin count (in this case: Poissonian)
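To make the structure of such a fit concrete, here is a minimal self-contained sketch (not the slides' code; all numbers and names are illustrative): a toy Gaussian peak over an exponential background is fit by minimizing the extended −2 ln L of slide 10. The Gaussian is assumed to be fully contained in the fit range, so its truncation is neglected:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, expon

rng = np.random.default_rng(seed=3)
lo, hi = 0.0, 10.0  # fit range for the mass-like variable m

# Toy data: Gaussian signal peak over an exponential background
sig = rng.normal(5.0, 0.5, size=200)
bkg = rng.exponential(scale=4.0, size=800)
bkg = bkg[(bkg > lo) & (bkg < hi)]
data = np.concatenate([sig, bkg])

def nll2_ext(params):
    s, b, mu, sigma, lam = params
    p_s = norm.pdf(data, mu, sigma)
    # Exponential PDF normalized over the truncated range [lo, hi]
    norm_b = expon.cdf(hi, scale=1 / lam) - expon.cdf(lo, scale=1 / lam)
    p_b = expon.pdf(data, scale=1 / lam) / norm_b
    # Extended likelihood: Poisson(N; s+b) times the mixture density product
    return 2 * (s + b) - 2 * np.sum(np.log(s * p_s + b * p_b))

res = minimize(nll2_ext, x0=[150, 700, 5.0, 0.4, 0.3],
               bounds=[(0, None), (0, None), (lo, hi), (0.01, 5), (0.01, 5)])
print("s, b, mu, sigma, lambda =", res.x)
```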

12. Gaussian case (example of a χ² variable)
• If we have n independent measurements all modeled with (or approximated by) the same Gaussian PDF, we have:
  −2 ln L = Σi=1..n (xi − μ)²/σ² + n ln(2πσ²)
• An analytical minimization of −2 ln L w.r.t. μ (assuming σ² is known) gives the arithmetic mean as the ML estimate of μ:
  μ̂ = (1/n) Σi=1..n xi
• If σ² is also unknown, the ML estimate of σ² is:
  σ̂² = (1/n) Σi=1..n (xi − μ̂)²
• The above estimate can be demonstrated to have an unpleasant feature, called bias (→ next slide)

13. Estimator properties
• Consistency
• Bias
• Efficiency
• Robustness

14. Estimator consistency
• The estimator converges to the true value (in probability): limn→∞ P(|θ̂n − θ| > ε) = 0 for every ε > 0
• ML estimators are consistent

15. Efficiency of an estimator
• The variance of any consistent estimator is subject to a lower bound (Cramér-Rao bound):
  Var[θ̂] ≥ (1 + ∂b(θ)/∂θ)² / I(θ)
  where b(θ) is the bias of θ̂ and I(θ) = E[(∂ ln L/∂θ)²] is the Fisher information
• Efficiency can be defined as the ratio of the Cramér-Rao bound to the estimator's variance
• The efficiency of ML estimators tends to 1 for a large number of measurements
• I.e.: ML estimates have, asymptotically, the smallest possible variance

16. Bias of an estimator
• The bias of an estimator is the average value of its deviation from the true value: b(θ) = E[θ̂] − θ
• ML estimators may have a bias, but the bias decreases with a large number of measurements (if the fit model is correct…!)
• E.g.: the ML method underestimates the variance σ² of a Gaussian; the unbiased estimate is the well-known:
  s² = (1/(n − 1)) Σi=1..n (xi − μ̂)²

17. Robustness
• If the sample distribution has (slight?) deviations from the theoretical PDF model, some estimators may deviate more than others from the true value
  • E.g.: unexpected tails ("outliers")
• The median is a robust estimate of a distribution's average, while the mean is not
• Trimmed estimators: remove the n most extreme values
• Evaluation of estimator robustness:
  • Breakdown point: the maximum fraction of incorrect measurements above which the estimate may become arbitrarily large
    • Observations trimmed at x% have a breakdown point of x
    • The median has a breakdown point of 0.5
  • Influence function: the deviation of the estimator if one measurement is replaced by an arbitrary (incorrect) measurement
• Details are beyond the purpose of this course…

18. Neyman's confidence intervals
• Procedure to determine frequentist confidence intervals (plot from the PDG statistics review; α = significance level):
  • Scan the allowed range of an unknown parameter θ
  • Given a value of θ, compute the interval [x1, x2] that contains x with a probability 1 − α equal to 68% (or 90%, 95%)
  • A choice of interval is needed!
  • Invert the confidence belt: for an observed value of x, find the interval [θ1, θ2]
• A fraction of the experiments equal to 1 − α will measure x such that the corresponding [θ1, θ2] contains ("covers") the true value of θ ("coverage")
• Note: the random variables are [θ1, θ2], not θ!

19. Simplest example: Gaussian case
• Assume a Gaussian distribution with unknown average μ and known σ = 1
• The belt inversion is trivial and gives the expected result: central value = x, [μ1, μ2] = [x − σ, x + σ] (1 − α = 68%)
• So we can quote: μ = x ± σ

20. Binomial intervals
• The Neyman belt construction may only guarantee approximate coverage in the case of discrete variables
• For a Binomial distribution: find the interval {nmin, …, nmax} such that:
  P(nmin ≤ n ≤ nmax | N, p) ≥ 1 − α
• Clopper and Pearson (1934) solved the belt inversion problem for central intervals: for an observed n = k, find the lowest plo and highest pup such that:
  P(n ≥ k | N, plo) = α/2, P(n ≤ k | N, pup) = α/2
• E.g.: for n = N = 10, P(n ≥ N | N, plo) = plo^N = α/2, hence: plo = (α/2)^(1/N) = 0.83 (68% CL), 0.74 (90% CL)
• A frequently used approximation, which fails for n = 0 and n = N, is:
  p = p̂ ± √(p̂(1 − p̂)/N), with p̂ = n/N
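Clopper-Pearson bounds can be computed directly from beta-distribution quantiles (a standard equivalence; the sketch below is illustrative and the helper name is made up):

```python
from scipy.stats import beta

def clopper_pearson(k, N, cl=0.68):
    """Central Clopper-Pearson interval for k successes out of N trials."""
    alpha = 1.0 - cl
    p_lo = beta.ppf(alpha / 2, k, N - k + 1) if k > 0 else 0.0
    p_up = beta.ppf(1 - alpha / 2, k + 1, N - k) if k < N else 1.0
    return p_lo, p_up

print(clopper_pearson(10, 10, cl=0.68))  # lower limit ~0.83, as on the slide
print(clopper_pearson(10, 10, cl=0.90))  # lower limit ~0.74
```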

21. Clopper-Pearson coverage (I)
• [Plot: coverage probability P vs p, for N = 10 and 1 − α = 68%]
• CP intervals are often defined as "exact" in the literature
• Exact coverage is often impossible to achieve for discrete variables

22. Clopper-Pearson coverage (II)
• [Plot: coverage probability P vs p, for N = 100 and 1 − α = 68%]
• For larger N, the coverage "ripple" gets closer to the nominal 68%

23. Approximate maximum likelihood errors
• A parabolic approximation of −2 ln L around the minimum is equivalent to a Gaussian approximation
  • Sufficiently accurate in many, but not all, cases
• The covariance matrix is estimated from the 2nd-order partial derivatives w.r.t. the fit parameters at the minimum:
  (C⁻¹)ij = −∂² ln L / ∂θi ∂θj
• Implemented in Minuit as the MIGRAD/HESSE functions
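As an illustrative sketch (not the slides' code), a HESSE-style error for the Gaussian mean can be reproduced with a finite-difference second derivative of −2 ln L; for −2 ln L ≈ (μ − μ̂)²/σμ², the second derivative at the minimum equals 2/σμ²:

```python
import numpy as np

rng = np.random.default_rng(seed=4)
sigma_true = 2.0
data = rng.normal(10.0, sigma_true, size=400)

def nll2(mu):
    return np.sum((data - mu) ** 2) / sigma_true ** 2

mu_hat = data.mean()   # analytic ML estimate
h = 1e-3               # finite-difference step

# Second derivative of -2 ln L at the minimum; the variance is 2 / d2
d2 = (nll2(mu_hat + h) - 2 * nll2(mu_hat) + nll2(mu_hat - h)) / h ** 2
err = np.sqrt(2.0 / d2)
print(err, sigma_true / np.sqrt(len(data)))  # both ~ sigma / sqrt(n)
```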

24. Asymmetric errors
• Another approximation, alternative to the parabolic one, is to evaluate the excursion range of −2 ln L
• The error (nσ) is determined by the range around the likelihood maximum for which −2 ln L increases by +1 (+n² for nσ intervals): θ̂ with errors −δ−, +δ+
• Errors can be asymmetric
• For a Gaussian PDF the result is identical to the one from the 2nd-order derivative matrix
• Implemented in Minuit as the MINOS function
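A MINOS-style scan can be sketched as follows (illustrative Python, not Minuit itself; it assumes an exponential lifetime fit, where the ML estimate is the sample mean): find the two points where −2 ln L crosses its minimum plus one.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(seed=5)
data = rng.exponential(scale=3.0, size=50)   # small sample: asymmetry visible

def nll2(tau):
    return 2.0 * (len(data) * np.log(tau) + data.sum() / tau)

tau_hat = data.mean()                # analytic ML estimate
target = nll2(tau_hat) + 1.0         # -2 ln L_max + 1

# Find the two crossings of -2 ln L with the target level
lo = brentq(lambda t: nll2(t) - target, 1e-3, tau_hat)
hi = brentq(lambda t: nll2(t) - target, tau_hat, 100.0)
print(f"tau = {tau_hat:.2f} -{tau_hat - lo:.2f} +{hi - tau_hat:.2f}")
```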

25. Error of the (Gaussian) average
• We have the previous log-likelihood function:
  −2 ln L = Σi=1..n (xi − μ)²/σ² + constant
• The error on μ is given by the second derivative at the minimum:
  1/σμ² = −∂² ln L/∂μ² = n/σ²
• I.e.: the error on the average is:
  σμ = σ/√n

26. Exercise
• Assume we have n independent measurements t1, …, tn from an exponential PDF:
  f(t; λ) = λ e^(−λt)
• How can we estimate λ and its error by ML? (A possible solution sketch follows below)
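One possible solution sketch (not from the original slides): minimizing −2 ln L = 2λ Σ ti − 2n ln λ gives λ̂ = n/Σ ti = 1/t̄, and the second derivative at the minimum gives σλ̂ ≈ λ̂/√n. The toy study below checks this numerically; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=6)
lam_true, n = 0.5, 1000

# Repeat the experiment many times; the ML estimate is lambda_hat = 1 / mean(t)
lam_hats = np.array([1.0 / rng.exponential(1.0 / lam_true, n).mean()
                     for _ in range(2000)])
print("mean(lambda_hat) :", lam_hats.mean())   # ~ lam_true (small residual bias)
print("std(lambda_hat)  :", lam_hats.std())    # ~ lam_true / sqrt(n)
print("lam_true/sqrt(n) :", lam_true / np.sqrt(n))
```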

27. 2D intervals
• In more dimensions one can determine 1σ and 2σ contours (in the plane of two parameters x, y)
• Note: the probability content in 2D is different from the one-dimensional case
• 68% and 95% contours are usually preferable

28. Example of 2D contour
• From the previous fit example: Ps(m): Gaussian peak; Pb(m): exponential shape
• The exponential decay parameter, Gaussian mean and standard deviation are fit together with the s and b yields
• The 1σ contour (39.4% CL) shows, for this case, a mild correlation between s and b

29. Error propagation
• Assume we estimate from a fit the parameter set θ = (θ1, …, θn) and we know their covariance matrix Θij
• We want to determine a new set of parameters that are functions of θ: η = (η1, …, ηm)
• For small uncertainties, a linear approximation may be sufficient
• A Taylor expansion around the central values of θ gives, using the error matrix Θij:
  Hkl = Σij (∂ηk/∂θi)(∂ηl/∂θj) Θij
• A few examples in case of no correlation:
  σ(x + y)² = σx² + σy²;  (σ(xy)/xy)² = (σx/x)² + (σy/y)²
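The propagation formula above can be sketched generically with a numerical Jacobian (illustrative Python; `propagate` is a made-up helper, not a library function):

```python
import numpy as np

def propagate(f, theta, cov, eps=1e-6):
    """Linear error propagation: returns f(theta) and A cov A^T,
    with the Jacobian A estimated by central finite differences."""
    theta = np.asarray(theta, dtype=float)
    f0 = np.atleast_1d(f(theta))
    A = np.zeros((f0.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        A[:, j] = (np.atleast_1d(f(theta + step)) -
                   np.atleast_1d(f(theta - step))) / (2 * eps)
    return f0, A @ cov @ A.T

# Example: eta = (x + y, x * y) with uncorrelated x, y
theta = [2.0, 3.0]
cov = np.diag([0.1 ** 2, 0.2 ** 2])
eta, cov_eta = propagate(lambda t: np.array([t[0] + t[1], t[0] * t[1]]),
                         theta, cov)
print(eta, np.sqrt(np.diag(cov_eta)))
# sigma(x+y) = sqrt(0.1^2 + 0.2^2); sigma(xy) = sqrt((3*0.1)^2 + (2*0.2)^2)
```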

30. Care with asymmetric errors
• Be careful about:
  • Asymmetric error propagation
  • Combining measurements with asymmetric errors
  • The difference between the "most likely value" and the "average value"
• A naïve quadrature sum of σ+ and σ− leads to a wrong answer
  • It violates the central limit theorem: the combined result should be more symmetric than the original sources!
• A model of the non-linear dependence may be needed for quantitative calculations
• Biases are very easy to introduce (depending on σ+ − σ− and on the non-linear model)
• It is much better to know the original PDF and propagate/combine the information properly!
• Be careful about interpreting the meaning of the result
  • The average value and the variance propagate linearly, while the most probable value (mode) does not add linearly
• Whenever possible, use a single fit rather than multiple cascaded fits, and quote the final asymmetric errors only

31. Non-linear models
• Mean, variance and skewness add linearly under convolution; the most probable values (fit) do not!
• The central value is shifted!
• See: R. Barlow, PHYSTAT2003
• Online calculator (R. Barlow): http://www.slac.stanford.edu/~barlow/java/statistics1.html

32. Binned likelihood
• Sometimes data are available as a binned histogram
• Most often each bin obeys Poissonian statistics (event counting)
• The likelihood function is the product of the Poisson PDFs corresponding to each bin having entries ni
• The expected number of entries depends on some unknown parameters: μi = μi(θ1, …, θm)
• The function to minimize is the following −2 ln L:
  −2 ln L = −2 Σi ln Pois(ni; μi(θ1, …, θm)) = 2 Σi [μi − ni ln μi + ln ni!]
• The expected number of entries μi is often approximated by a continuous function μ(x) evaluated at the center xi of the bin
• Alternatively, μi can be a combination of other histograms ("templates")
  • E.g.: the sum of different simulated processes with floating yields as fit parameters
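An illustrative binned Poisson fit (not the slides' code; names and numbers are made up), approximating μi by the model density at the bin center times the bin width:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(seed=7)
data = rng.normal(5.0, 1.0, size=5000)
counts, edges = np.histogram(data, bins=40, range=(0, 10))
centers = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]

def nll2_binned(params):
    n_tot, mu, sigma = params
    mu_i = n_tot * norm.pdf(centers, mu, sigma) * width  # expected bin contents
    # Poisson -2 ln L, dropping the constant ln(n_i!) term
    return 2.0 * np.sum(mu_i - counts * np.log(np.clip(mu_i, 1e-300, None)))

res = minimize(nll2_binned, x0=[4000, 4.0, 2.0],
               bounds=[(1, None), (0, 10), (0.05, 5)])
print("yield, mu, sigma =", res.x)
```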

33. Binned fits: minimum χ²
• Bin entries can be approximated by Gaussian variables for a sufficiently large number of entries, with standard deviation equal to √ni (Neyman's χ²)
• Maximizing L is then equivalent to minimizing:
  χ² = Σi (ni − μ(xi; θ1, …, θm))² / ni
• Sometimes the denominator ni is replaced (Pearson's χ²) by μi = μ(xi; θ1, …, θm), in order to avoid cases with zero or small ni
• An analytic solution exists for linear and other simple problems
  • E.g.: a linear fit model
• Most cases are treated numerically, as for unbinned ML fits

34. Binned fit example
• [Plot: Gaussian fit determining the yield, μ and σ; some bins have a small number of entries!]
• Binned fits are convenient w.r.t. unbinned fits because the number of inputs decreases from the number of entries to the number of bins
  • Usually simpler and faster numerically
  • Unbinned fits become impractical for a very large number of entries
• A fraction of the information is lost, hence a possible loss of precision may occur for a small number of entries
• Treat bins with a small number of entries correctly!

35. Fit quality (χ² test)
• The maximum value of the likelihood function obtained from the fit doesn't usually give information about the goodness of the fit
• The χ² of a fit with a Gaussian underlying model is distributed according to a known PDF, P(χ²; n), where n is the number of degrees of freedom (number of bins − number of parameters)
• The cumulative distribution of P(χ²; n) follows a uniform distribution between 0 and 1 (p-value)
• If the model deviates from the assumed distribution, the distribution of the p-value will be more peaked around zero
• Note! p-values are not the "probability of the fit hypothesis"
  • This would be a Bayesian probability, with a different meaning, and should be computed in a different way
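Computing the p-value from an observed χ² takes one line with scipy (the numbers below are purely illustrative):

```python
from scipy.stats import chi2

# Example: chi2 = 45.2 with 40 bins and 3 fitted parameters
chi2_val, n_bins, n_params = 45.2, 40, 3
ndof = n_bins - n_params
p_value = chi2.sf(chi2_val, ndof)   # survival function: P(chi2 >= observed)
print(f"ndof = {ndof}, p-value = {p_value:.3f}")
```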

36. Binned likelihood ratio
• A better alternative to the (Gaussian-inspired) Neyman and Pearson χ² has been proposed by Baker and Cousins, using the following likelihood ratio (S. Baker, R. Cousins, NIM 221 (1984) 437):
  χ²λ = 2 Σi [μi − ni + ni ln(ni/μi)]
• It has the same minimum as the Poisson likelihood function, since only a constant term has been added to the log-likelihood function
• In addition, it provides goodness-of-fit information, and asymptotically obeys a chi-squared distribution with n − m degrees of freedom (Wilks' theorem, see the following slides)
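The Baker-Cousins χ²λ is straightforward to code; a minimal sketch (the function name is made up), taking care of the ni = 0 limit where the ni ln(ni/μi) term vanishes:

```python
import numpy as np

def chi2_baker_cousins(n, mu):
    """Poisson likelihood-ratio chi2: 2 * sum(mu - n + n * ln(n / mu)).
    Bins with n = 0 contribute 2 * mu."""
    n = np.asarray(n, dtype=float)
    mu = np.asarray(mu, dtype=float)
    term = mu - n
    mask = n > 0
    term[mask] += n[mask] * np.log(n[mask] / mu[mask])
    return 2.0 * term.sum()

print(chi2_baker_cousins([3, 0, 7], [2.5, 0.4, 6.8]))
```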

37. Combining measurements
• Assume two measurements m1 ± σ1 and m2 ± σ2 of the same quantity m, with different uncorrelated (Gaussian) errors
• Build the χ²:
  χ² = (m − m1)²/σ1² + (m − m2)²/σ2²
• Minimizing the χ² gives the weighted average, with weights wi = σi⁻²:
  m̂ = (w1 m1 + w2 m2)/(w1 + w2)
• Error estimate: σm̂⁻² = σ1⁻² + σ2⁻²
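A minimal weighted-average helper (illustrative sketch; the generalization to any number of uncorrelated measurements is immediate):

```python
import numpy as np

def weighted_average(values, errors):
    """chi2-minimizing combination of uncorrelated Gaussian measurements."""
    values = np.asarray(values, dtype=float)
    w = 1.0 / np.asarray(errors, dtype=float) ** 2
    m_hat = np.sum(w * values) / np.sum(w)
    err = 1.0 / np.sqrt(np.sum(w))
    return m_hat, err

print(weighted_average([10.1, 9.7], [0.3, 0.5]))  # pulled toward the precise one
```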

38. Generalization of the χ² to n dimensions
• We have n measurements (m1, …, mn) with an n × n covariance matrix (Cij)
• The expected values M1, …, Mn for m1, …, mn may depend on some theory parameter(s) θ
• The following χ² can be minimized to obtain an estimate of the parameter(s) θ:
  χ² = Σij (mi − Mi(θ)) (C⁻¹)ij (mj − Mj(θ))
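The quadratic form above, sketched with numpy (illustrative numbers; `np.linalg.solve` is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

def chi2_general(m, M, C):
    """chi2 = (m - M)^T C^{-1} (m - M) for measurements m,
    model predictions M and covariance matrix C."""
    r = np.asarray(m, dtype=float) - np.asarray(M, dtype=float)
    return float(r @ np.linalg.solve(C, r))

C = np.array([[0.04, 0.01],
              [0.01, 0.09]])   # correlated uncertainties
print(chi2_general([1.2, 3.4], [1.0, 3.5], C))
```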

39. Concrete examples

40. Global electroweak fit
• A global χ² fit to electroweak measurements predicts the W mass, allowing a comparison with direct measurements
• Details on: http://gfitter.desy.de/Standard_Model/

41. More on the electroweak fit
• W mass vs top-quark mass from the global electroweak fit

  42. Fitting B(BJ/) / B(BJ/K) • Four variables: • m = B reconstructed mass as J/ + charged hadron invariant mass • E = Beam – B energy in the  mass hypothesis • EK = Beam – B energy in the K mass hypothesis • q = B meson charge • Two samples: • J/ , J/ ee • Simultaneous fit of: • Total yield of BJ/, BJ/K and background • Resolutions separately for J/ , J/ ee • Charge asymmetry (direct CP violation) Statistical Methods for Data Analysis

  43. E and EK Depend on charged hardron mass hypothesis! Statistical Methods for Data Analysis

44. Extended likelihood function
• To extract the ratio of branching fractions, the likelihood contains a Poisson term and the B → J/ψ π, B → J/ψ K and background components
• The likelihood can be written separately, or combined, for the ee and μμ events
• The fit contains the parameters of interest (mainly nπ, nK) plus uninteresting nuisance parameters
• Separating q = +1 / −1 can be done by adding ACP as an extra parameter

45. Model for independent PDFs
• [Plots: independent PDF models in ΔE and ΔEK]

46. Signal PDFs in new variables
• (ΔE, ΔEK) → (ΔE, ΔEK − ΔE), (ΔEK, ΔEK − ΔE)

47. Background PDF
• The background shape is taken from events in the mES sideband (mES < 5.27 GeV)

48. Dealing with kinematical pre-selection
• −120 MeV < ΔE, ΔEK < 120 MeV
• [Diagram: regions A, B, C, D before and after the variable transformation]
• The area is preserved after the transformation

49. Signal extraction
• [Likelihood projections for J/ψ → ee and J/ψ → μμ events, showing the B → J/ψ K and B → J/ψ π peaks and the background]

50. A concrete fit example (II): measurement of Δms by CDF
