Measures of Significance Jim Linnemann Michigan State University U. Toronto Dec 4, 2003
A Prickly Problem (not to everyone's taste…) • What is Significance? • Li and Ma Equations • Frequentist Methods • Bayesian Methods • Is Significance well-defined? • To Do • Help Wanted • Summary
A simple problem with lots of answers… • Two Poisson measurements • 1) "background region" • 2) "signal region" • Poisson assumptions usually well satisfied • Is the 2nd measurement predicted by the first? • 1) Measure significance of the failure of the prediction: did I see something potentially interesting, i.e., not background? • 2) If "significant enough", proceed to estimation of the size of the new effect • I'll describe practice in High Energy Physics and Gamma Ray Astronomy (behavioral statistics)
Crossed Disciplines… To physicists, this may sound like a statistics talk; to statisticians, a physics talk. Or both HEP and Astrophysics (or neither…) Apologies, but I'm trying to cross boundaries. That's what a sabbatical is for, no? From HEP (D0 experiment), now working on Gamma Ray Astronomy (Milagro experiment)
Help? • To my statistical colleagues: • Better understand Fraser-Reid with nuisance parameters (the prime motive for this visit) • Geometrical intuition for nuisance parameters • Is the indeterminacy of Z well understood? • Is the Binomial-Bayes identity interesting? • Is the Binomial test really best?
My Original Goal What can HEP and Astrophysics practice teach each other? Astrophysics (especially γ-ray): aims at simple formulae (very fast); calculates σ's directly (asymptotic Normal); hopes it's a good formula. HEP (especially Fermilab practice): calculates probabilities by MC (general; slow); translates into σ's for communication; loses track of the analytic structure. Statistics: what can I learn? This is a progress report…
D0 (a tiny slice) and Milagro • D0: searches for new particles (Top Quark; SUSY…) • Many input events • Many selection criteria • ~10 events in signal region, maybe all background • Milagro: searches for new or weak sources • Many input events (mostly hadron showers) • Few photons among this background • Few selection criteria • Cases: • A) ~10^6 background events, possibly a 10^-3 photon excess • B) many places to look in the sky • 10^4 or more trials (multiple comparisons) • C) many time scales for variable phenomena • 1 ms to 1 year, so up to 10^14 trials • ~10-100 counts in search region, maybe all background
DØ is a large scale experiment… …with some complexities that come with scale many different kinds of analyses: I’m talking about searches only!
Experimenters have little choice but to love their apparatus—we spend years building it!
Milagro is a smaller experiment • “Only” 30 people • ~ 100 man-years, ~5M$ construction cost • ~ 5 year data taking lifetime
Los Alamos, NM 2600 m
[Milagro schematic: e, μ, γ shower particles; dimension labels 8 meters, 50 meters, 80 meters] Use water instead of scintillators to detect EAS particles • 100% of the area is sensitive (instead of 1% with scintillator) • Low energy threshold (100 GeV); median energy ~4 TeV • High duty cycle (~95%) (10% for air Cherenkov) • Large field of view (~2 sr) (.003 sr) • Good background rejection (~90%), but it needs work! • Trigger rate 1.7 kHz
Milagro pointing: About 1 degree
Moon Shadow: proton energy scale and angular resolution. Significance maps! E = 640 ± 70 GeV (MC: 690 GeV); σθ = 0.9°
The Northern Hemisphere in TeV Gamma Rays: 12/00-12/01 [Sky map, Right Ascension vs. Declination, color scale in σ from background; Crab and the Mrk 421 hot spot labeled] One-year survey detected the Crab and the active galactic nucleus Mrk 421, with the rest of the sky consistent with no detections, giving upper limits for the predicted TeV sources indicated by circles.
The Crab Nebula Cut Data On: 1,952,917 Off: 1,945,109 Excess: 7,808 (~10/day) Significance: 5.4 σ Raw Data On: 16,987,703 Off: 16,981,520 Significance: 1.4 σ
Significance: Signal Search with Poisson Background • Z value: ~ Normal(0,1) (Astro. usage; Milagro; Li Ma) • The art is to pick a good variable for this. More generally: • P(more extreme "signal" | background) (HEP usage) • P-value (similar to Chi-squared probability) • Assume Null Hypothesis = background only • P(data worse | bkg), not P(not bkg | data) • Translate probability p into a Z value by Z = Φ⁻¹(1 - p)
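The p ↔ Z translation just described can be sketched in a few lines of Python (a minimal illustration, not code from the talk; the function names are mine):

```python
from statistics import NormalDist

def z_to_p(z):
    """Upper-tail Gaussian probability for a given Z value."""
    return 1.0 - NormalDist().cdf(z)

def p_to_z(p):
    """Translate a one-sided p-value into a Z value: Z = Phi^-1(1 - p)."""
    return NormalDist().inv_cdf(1.0 - p)
```

For example, p_to_z(0.00135) is about 3, the familiar "3 sigma".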
Observed vs. Prospective Significance • This discussion: Observed Significance (of my data) • Post-hoc (after data) • Need a definition of Z • Choice of Zmin for an observational claim = max P(observed | background) • Prospective Observability (before data, to optimize the expt.) • Involves more: both signal and background • Naive calculation: Z = S / √B (ignores fluctuations: significance for expected conditions) • Optimistic: crudely, half the time less signal, or half the time more background! • Should consider Pr(Z > Zmin) (making an observational claim) • Source strength for 50% probability of observation? 90%? (power)
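A prospective Pr(Z > Zmin) of the kind suggested above can be estimated with a toy Monte Carlo. This sketch (my own illustration; the names `power` and `poisson` are hypothetical) uses the naive Z = S/√B with b taken as known exactly; any of the Z definitions discussed later could be substituted:

```python
import math
import random

def poisson(rng, mean):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def power(mu_s, b, z_min, n_trials=20000, seed=1):
    """Fraction of experiments with observed Z above z_min, for a true signal
    mean mu_s on top of a background mean b (b assumed known exactly)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        n_on = poisson(rng, mu_s + b)
        z = (n_on - b) / math.sqrt(b)
        if z > z_min:
            hits += 1
    return hits / n_trials
```

With mu_s = 60 and b = 100 the expected Z is 6, yet only roughly three quarters of such experiments fluctuate above Z = 5, which is the "optimistic" point made above.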
Backgrounds in Astro and HEP • Astrophysics: On-source vs. Off-source • side observation with α = Ton/Toff (sensitivity ratio) • b = α Noff; δb = α √Noff • α = (δb)²/b (deduced from above) • HEP: estimate background in a defined signal region • Sometimes a sideband measurement, like Astrophysics • Often a MC estimate, rescaled to signal sensitivity • More often a sum of terms of both types • b ± δb, with the δb's (uncertainties) added in quadrature • α = (δb)²/b: I'll use this as a definition of effective α, and apply the astrophysics formulae (with non-integer N)
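The effective-α bookkeeping above is simple arithmetic; a sketch (illustrative function names are mine):

```python
import math

def combine_backgrounds(terms):
    """Combine background terms (b_i, db_i): means add, uncertainties in quadrature."""
    b = sum(bi for bi, _ in terms)
    db = math.sqrt(sum(dbi ** 2 for _, dbi in terms))
    return b, db

def effective_alpha(b, db):
    """alpha = (db)^2 / b: the sensitivity ratio of an equivalent off-source
    measurement that would reproduce this b and db."""
    return db * db / b

def effective_noff(b, db):
    """The equivalent (generally non-integer) off-source count, Noff = b / alpha."""
    return b / effective_alpha(b, db)
```

As a check, a genuine sideband with Noff = 400 and α = 0.25 gives b = 100 and δb = α√Noff = 5, and the definitions recover α = 0.25 and Noff = 400.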
Li and Ma Equations Z = S/σ(S); S = Non - b; b = α Noff (N is the observation; b is the background estimate) Eq 5: Var(S) = Var(Non) + Var(b) = Non + α² Noff • Ignores the key null-hypothesis constraint μon = α μoff (anti-signal bias!) Eq 9: Var(S) = α (Non + Noff) • Obeys the constraint; uses both Non and Noff to estimate μoff (Eqs 5 and 9 use Poisson → Normal) Li and Ma don't recommend Eq 5 or 9, but instead: Eq 17: log likelihood ratio L(μs, μb)/L(μb) • Generic test for a composite hypothesis + Wilks' theorem (x = Non; y = Noff)
A Li and Ma Variant Eq 5c: Var(S) = α(1 + α) Noff • Use only Noff to estimate Var(Non) and Var(b) • Respects the null hypothesis; not biased • Obviously, poor sensitivity if α > 1 • But that's always true: you need a decent estimate of the background • We get Eq 9 again if we use both Non and Noff in the M.L. estimate of μon = α μoff
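For concreteness, here is a sketch of Eqs 5, 9, and 17 as code (my transcription, using the standard closed form of Li & Ma Eq 17):

```python
import math

def z_eq5(n_on, n_off, alpha):
    """Li & Ma Eq 5: S / sqrt(Non + alpha^2 Noff); ignores the null constraint."""
    return (n_on - alpha * n_off) / math.sqrt(n_on + alpha ** 2 * n_off)

def z_eq9(n_on, n_off, alpha):
    """Li & Ma Eq 9: S / sqrt(alpha (Non + Noff)); respects mu_on = alpha mu_off."""
    return (n_on - alpha * n_off) / math.sqrt(alpha * (n_on + n_off))

def z_eq17(n_on, n_off, alpha):
    """Li & Ma Eq 17: sqrt(2 log likelihood ratio), signed by the excess."""
    t = n_on + n_off
    term_on = n_on * math.log((1 + alpha) / alpha * n_on / t) if n_on > 0 else 0.0
    term_off = n_off * math.log((1 + alpha) * n_off / t) if n_off > 0 else 0.0
    z = math.sqrt(2.0 * (term_on + term_off))
    return math.copysign(z, n_on - alpha * n_off)
```

For Non = 120, Noff = 1000, α = 0.1 (an excess of 20 over b = 100) these give roughly 1.75, 1.89, and 1.84 respectively; the spread is one illustration of why the choice of variable matters.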
Other Frequentist Methods, Ignoring Uncertainty in b: • S/√B (Li Ma Eq 10a): not recommended • Poisson(Non | b) (often much better) • Feldman & Cousins? Those are confidence limits! • For significance, just Poisson(Non | b), I believe
Other Frequentist Methods Using Uncertainty in b: • b + δb instead of b in above (I’ve seen it!) • Near-Constant Variance (Zhang and Ramsden) • Fraser Reid Likelihood to Significance Asymptotics • Binomial Test
Fraser and Reid: Approximate Significance • Interesting approximate method (last 15 yr) • Significance from the likelihood curve • Can view as an asymptotic expansion of the pdf for large n • Combine two first-order, O(n^-1/2), estimates of significance: • Zr from the Likelihood Ratio + Wilks' theorem (usually the better estimate) • Zt = D/σ, from D = θ* - θ; σ from the Fisher information ∂²L/∂θ² • Combine to give an O(n^-3/2) estimate (1st order → 3rd order). One version: Z3 = Zr + (1/Zr) Ln(Zt/Zr) • Fast and simple numerically to apply the final formula • But must redo the algebra for each new kind of problem • I'm still working to apply it fully to Non, Noff • nuisance parameter; reduction to canonical exponential parameters
[Plot: two first-order estimates of significance, p = Φ(Z(x)), vs. x; the corrected curve is closer to the Likelihood Ratio estimate]
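For the one-parameter case Poisson(x | μ) with H0: μ = b (no nuisance parameter), the combination can be sketched directly. This is my own illustration of the Z3 formula above, not the full Fraser-Reid machinery:

```python
import math

def z_lr(x, b):
    """Likelihood-ratio (Wilks) significance for Poisson(x | mu), H0: mu = b; x > 0."""
    return math.copysign(math.sqrt(2.0 * (x * math.log(x / b) - (x - b))), x - b)

def z_wald(x, b):
    """Wald-type significance D / sigma, with sigma^2 = x from the observed information."""
    return (x - b) / math.sqrt(x)

def z_fraser_reid(x, b):
    """Combine the two first-order estimates: Z3 = Zr + (1/Zr) ln(Zt/Zr).
    (Unstable when Zr is near zero, i.e., when x is near b.)"""
    zr, zt = z_lr(x, b), z_wald(x, b)
    return zr + math.log(zt / zr) / zr
```

For x = 20, b = 10: Zr ≈ 2.78 and Zt ≈ 2.24, while Z3 ≈ 2.70, very close to the exact Poisson tail probability expressed as a Z value.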
Binomial Proportion Test: Ratio of Poisson Means P-value = Pr Binomial(≥ Non | ω, k), where ω = α/(1 + α) • Conditional: holds k = Non + Noff fixed (the 2nd Poisson mean is a nuisance parameter) • UMPU (Uniformly Most Powerful Unbiased) for the composite hypothesis test μon/(α μoff) ≤ 1 • If ~optimal, probably using the best variable! • Very stable!
Binomial Proportion Test: Ratio of Poisson Means • Not in common use; probably should be • Known in HEP and Astrophysics, but not as optimal, nor as a standard procedure • Zhang and Ramsden claim it is too conservative • for small Z? Even if true, we want Z > 4 • Closed form in terms of special functions, or sums • Applying it for large N requires some delicacy; slower than Eq 17 • Gaussian limit: Z = (Non/k - ω)/√[ω(1 - ω)/k] = Eq 9
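A direct transcription of the binomial tail (a naive sketch; fine for moderate k but, as noted above, large N needs more delicacy):

```python
from math import comb

def binomial_p_value(n_on, n_off, alpha):
    """Conditional binomial test: given k = Non + Noff, under the null each
    count lands on-source with probability omega = alpha / (1 + alpha).
    Returns p = P(X >= Non) for X ~ Binomial(k, omega)."""
    k = n_on + n_off
    omega = alpha / (1.0 + alpha)
    return sum(comb(k, i) * omega ** i * (1.0 - omega) ** (k - i)
               for i in range(n_on, k + 1))
```

For example, binomial_p_value(10, 10, 1.0) is about 0.588: with equal sensitivity, a 10/10 split is entirely unremarkable.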
Bayesian Methods • In common use in HEP • Cousins & Highland "smeared likelihood" efficiency • Allows correlation among backgrounds (MC integration) • Natural extension to efficiency, upper limits • Predictive Posterior (after background measurement) • P(x | y) (integrate the posterior over the theoretical mean) • Natural avenue for connection with p-values for Non • But: the typical Bayes analysis isn't significance, but an odds ratio • A flat prior for the background gives a Gamma dist. for p(b|y) • P-value calc using the Gamma (also Alexandreas et al., Astro): numerically identical to the frequentist Binomial Test! • A truncated Gaussian is often used in HEP to represent p(b|y) • Less tail (high b) than the Gamma: higher reported significance
Predictive Posterior Bayes P-value (HEP) In words: the Poisson tail sum, averaged over the Bayes posterior for the mean: p = Σ (x ≥ Non) ∫ P(x | μ) p(μ | y) dμ (or: integrate before summing). With a flat prior, the posterior p(μ | y) given y = Noff is a Gamma distribution.
Two ways to write Bayesian p-value: Bayesian p-value can be thought of as Poisson p-value weighted by posterior for μ Compute by sum, by numerical or MC integral or, as it turns out, by an equivalent formula…
Bayes Gamma Posterior p-value Large x, y: approximate sum by integral… = Binomial p-value Surprised? I was! Proof: H. Kim, MSU
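The identity can be checked numerically. With a flat prior, the posterior for the off-source mean is Gamma(y + 1), and integrating the Poisson tail over it gives a negative-binomial predictive distribution for Non, whose tail matches the binomial p-value. A sketch (function names mine):

```python
from math import comb

def binomial_tail(x, y, alpha):
    """Frequentist binomial p-value: P(X >= x) for X ~ Binomial(x + y, omega)."""
    k, w = x + y, alpha / (1.0 + alpha)
    return sum(comb(k, i) * w ** i * (1.0 - w) ** (k - i) for i in range(x, k + 1))

def bayes_gamma_tail(x, y, alpha):
    """Predictive-posterior p-value: a flat prior on the off-source mean gives a
    Gamma(y + 1) posterior, hence a negative-binomial predictive distribution
    P(pred = i) = C(i + y, i) w^i (1 - w)^(y + 1).  Returns P(pred >= x)."""
    w = alpha / (1.0 + alpha)
    below = sum(comb(i + y, i) * w ** i * (1.0 - w) ** (y + 1) for i in range(x))
    return 1.0 - below
```

For the integer test cases tried here, the two agree to floating-point precision.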
Comparing the Methods • Some test cases from the published literature • And a few artificial cases • Range of Non, Noff values • Different α values (mostly < 1) • Interest: Z > 3, but sometimes >> 3 (many trials…) • Color code accuracy • Assume the frequentist Binomial as the Gold Standard • Zhang and Ramsden found it best supported by MC, at least when calculating Z > 3 or so; at worst ZMC was 3% higher
Poisson(≥ x | b) is very well approximated by FR with the non-canonical parameter (p*); mid-p Poisson is well approximated by FR with the canonical parameter.
Results (comments; use Zbinomial as the standard) • Bayes: ZΓ = Zbinomial, when it converged • Bayes: ZΓ < Znormal • α < 1 and N > 500 is the easiest regime • LR and √ (variance-stabilized) forms not bad • Usually, bad formulae overestimate Z!! • S/√B, for example • But Z5 (Eq 5) is biased against signal • Fraser-Reid vs. Poisson: exact in all but one case (an overestimate) • Very slow calculation for very large N with integer arithmetic; faster with floating point… Mathematica in action
How to Test a Significance Variable Z? Standard method of MC testing a variable: • "self-test": compare Z with the distribution of the statistic for MC assuming background only • i.e., convert back from Z to a probability • Good if PrMC(Z > Zo) = PrGauss(Z > Zo) • Intuition: want fast convergence to Gaussian • Why not just compare with the "right answer"? • The variables are all supposed to give the same Z, right? • Significance is ill-defined in 2-D (Non, Noff)!
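The self-test can be sketched as a toy Monte Carlo: draw background-only (Non, Noff) pairs, compute the candidate Z (Li & Ma Eq 9 here), and compare the observed tail fraction with the Gaussian tail it claims to represent. Illustrative names and parameter values are mine:

```python
import math
import random
from statistics import NormalDist

def poisson(rng, mean):
    """Knuth's multiplication method; fine for the modest means used here."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def z9(n_on, n_off, alpha):
    """Li & Ma Eq 9, the statistic under test."""
    return (n_on - alpha * n_off) / math.sqrt(alpha * (n_on + n_off))

def self_test(mu_off, alpha, z0=2.0, n_trials=50000, seed=7):
    """Return (MC tail fraction Pr(Z > z0) under background only,
    the Gaussian expectation 1 - Phi(z0))."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        n_off = poisson(rng, mu_off)
        n_on = poisson(rng, alpha * mu_off)
        if n_on + n_off > 0 and z9(n_on, n_off, alpha) > z0:
            hits += 1
    return hits / n_trials, 1.0 - NormalDist().cdf(z0)
```

If the MC fraction tracks the Gaussian tail, the variable "self-tests"; the question raised above is whether variables that all self-test can still disagree with each other.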
What is a Bigger Deviation? (Part of the Significance Definition!) Which contour? • Measure Non, Noff = (x, y) • Which values are worse? • Larger S = x - α y? • Or farther from the line x = α y? In angle? Perpendicular to the line? • Trying to order a 2-dim set! • Points on the (x, y) plane • The nuisance parameter bites again • Statistics give different metrics: contours of equal deviation • Convergence (to Gaussian)? For large N: enough peaking so overlapping regions dominate the integrals? [Axes: x → (Non), y ↑ (Noff); more signal toward larger x]
[Contours of equal significance for several statistics: √(x + 3/8), S/√B, Eq 9, Binomial, Likelihood Ratio; compare the line shape near a point, not the spacing]
To Do • Monte Carlo Tests • Fraser-Reid for full problem • Nuisance parameter treatment, geometry • Canonical parameter? • Simpler numerics, if it works! • But: problems for large Z ? • And for large N?
Summary • Probably should use the Binomial Test for small N • The optimal frequentist technique • Numerically, more work than Li Ma Eq 17 (Likelihood Ratio) • Binomial Test and L. Ratio have roots in hypothesis testing • For high and moderate N, the Likelihood Ratio is OK • Anything works for the Crab, but not for short GRBs • Most errors overestimate significance • Variance stabilization is better than Li Ma Eq 9 • S/√B is way too optimistic: it ignores the uncertainty in B • Interesting relations exist among methods • Bayes with Gamma = Binomial • Li Ma Eq 9 = Binomial for large N • Bayes with Gaussian is a bit more optimistic than with Gamma • Fraser-Reid approximation promising but not finished
References Li & Ma, Astrophys. J. 272 (1983) 314-324 Zhang & Ramsden, Experimental Astronomy 1 (1990) 145-163 Fraser, J. Am. Stat. Assoc. 86 (1990) 258-265 Alexandreas et al., Nucl. Instrum. Meth. A328 (1993) 570-577 Gelman et al., Bayesian Data Analysis, Chapman & Hall (1998) (predictive p-value terminology) Talks: http://www-conf.slac.stanford.edu/phystat2003/
Controlling theFalse Discovery Rate: From Astrophysics to Bioinformatics
The Problem: • I: You have a really good idea. You find a positive effect at the .05 significance level. • II: While you are in Stockholm, your students work very hard. They follow up on 100 of your ideas, but have somehow misinterpreted your brilliant notes. Still, they report 5 of them gave significant results. Do you book a return flight?
Significance • Define "wrong" as reporting a false positive: • Apparent signal caused by background • Set a level of potential wrongness α: • 2σ: α = .05; 3σ: α = .003; etc. • Probability of going wrong on one test • Or, error rate per test • "Per Comparison Error Rate" (PCER) • Statisticians say "z value" instead of "z σ's" • Or "t value"
What if you do m tests? • Search m places • Must be able to define “interesting” • e.g. “not background” • Examples from HEP and Astrophysics • Look at m histograms, or bins, for a bump • Look for events in m decay channels • Test standard model with m measurements (not just Rb or g-2) • Look at m phase space regions for a Sleuth search (Knuteson) • Fit data to m models: What’s a bad fit? • Reject bad tracks from m candidates from a fitting routine • Look for sources among m image pixels • Look for “bursts” of signals during m time periods • Which of m fit coefficients are nonzero? • Which (variables, correlations) are worth including in the model? • Which of m systematic effect tests are significant? Rather than testing each independently
“Multiple Comparisons” • Must Control False Positives • How to measure multiple false positives? Default method: • Chance of any false positives in whole set • Jargon: Familywise Error Rate (FWER) • Whole set of tests considered together • Control by Bonferroni, Bonferroni-Holm, or Random Field Method See backup slides for more
Must do something about m! • m is the "trials factor" (only the NE Jour Med demands it!) • Don't want to just report m times as many signals • P(at least one wrong) = 1 - (1 - α)^m ≈ mα • Use α/m for each significance test: the "Bonferroni correction" • This is the main method of control • Keeps α as the probability of reporting 1 or more wrong results on the whole ensemble of m tests • Good: controls publishing rubbish • Bad: lower sensitivity (must have a more obvious signal) • For some purposes, have we given up too much?
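The arithmetic of this slide in code form (a minimal sketch):

```python
def fwer(alpha, m):
    """Familywise error rate: P(at least one false positive) = 1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni(alpha, m):
    """Per-test threshold alpha / m, which keeps the FWER just below alpha."""
    return alpha / m
```

fwer(0.05, 100) is about 0.994, so the Stockholm students' five "significant" results are exactly what chance predicts; testing each idea at bonferroni(0.05, 100) = 0.0005 instead brings the FWER back under 0.05.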
False Discovery Rate (FDR) • Fraction of errors among signal candidates • Proportion of false positives among rejected tests • "False Discovery Fraction" might have been clearer? (Physics: rate = N/time; statistics: rate = fraction) • Use the FDR method to set the threshold
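The threshold-setting step is the Benjamini-Hochberg step-up procedure; a sketch (my own illustration, assuming independent tests):

```python
def benjamini_hochberg(p_values, q):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * q; this controls the expected false discovery
    rate at q for independent tests.  Returns the rejected indices."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    return {order[j] for j in range(k)}
```

For example, with m = 15 p-values (0.0001, 0.0004, 0.0019, 0.0095, 0.0201, and ten larger ones) and q = 0.05, BH rejects the four smallest, while the Bonferroni cut 0.05/15 passes only three: the gain in sensitivity this method is after.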