  1. Interpretation of the Test Statistic, or: basic Hypothesis Testing, with applications, in 15 minutes. Patrick Nolan, Stanford University. GLAST LAT DC2 Kickoff, 2 March 2006

  2. The Likelihood Ratio Likelihood is defined to be the probability of observing the data, assuming that our model is correct. L(θ) ≡ P(x|θ) Here x is the observed data and θ is the parameter(s) of the model. Likelihood is a function of the model parameters (aka the “hypothesis”). Suppose there are two models with parameter(s) θ0 and θ1. Typically θ0 represents a “null” hypothesis (for instance, no point source is present) while θ1 represents an “alternate” hypothesis (for instance, there is a point source). The likelihood ratio is Λ ≡ L(θ0)/L(θ1). If Λ is small, then the alternate hypothesis explains the data better than the null hypothesis. This needs to be made quantitative.
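
A minimal numerical sketch of this definition for a single Poisson-counting bin (the counts, background rate, and source rate below are invented for illustration; scipy is assumed to be available):

    from scipy.stats import poisson

    n_obs = 12        # hypothetical observed counts in one bin
    b = 6.0           # expected background counts (null hypothesis, theta_0)
    s = 5.0           # extra counts from a putative point source (alternate, theta_1)

    L0 = poisson.pmf(n_obs, mu=b)       # likelihood of the data under the null
    L1 = poisson.pmf(n_obs, mu=b + s)   # likelihood under background + source
    Lambda = L0 / L1                    # likelihood ratio; small values favor the source
    print(f"Lambda = {Lambda:.3g}")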

  3. The Power of a Statistical Test In hypothesis testing, we decide whether we think θ0 or θ1 is the best explanation for the data. There are two ways we could go wrong: a Type 1 error (rejecting the null hypothesis when it is true, with probability α) and a Type 2 error (accepting the null when the alternate is true, with probability β). This is the notation used in every textbook. We would like to have both α and β be small, but there are tradeoffs. The usual procedure is to design a statistical test so that α is fixed at some value, called the size or significance level of the test. For a single test, a number like 5% might be OK. When looking for point sources in many places, a smaller α is needed because there are many opportunities for a Type 1 error. Once α is fixed, 1-β is called the power of the test. Large power means that real effects are unlikely to be missed. The likelihood ratio is useful in this context because of the Neyman-Pearson lemma, which says that the likelihood ratio is the “best” way to choose between hypotheses. If we choose the alternative hypothesis over the null when Λ < k, where P(Λ < k | θ0) = α, then the results will be unbiased and the test is the most powerful available.
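
As a rough illustration (not part of the talk), size and power can be Monte Carloed for the same single-bin Poisson toy model: fix α, find the threshold k from background-only simulations, and then see how often a real source is caught. All rates and the target α below are made-up numbers:

    import numpy as np
    from scipy.stats import poisson

    rng = np.random.default_rng(0)
    b, s, alpha = 6.0, 5.0, 0.05        # toy background, source, and target size

    def lam(n):
        # likelihood ratio Lambda = L(null)/L(alternate) for observed counts n
        return poisson.pmf(n, mu=b) / poisson.pmf(n, mu=b + s)

    null_lam = lam(rng.poisson(b, size=100_000))      # Lambda under the null
    k = np.quantile(null_lam, alpha)                  # reject the null when Lambda < k
    alt_lam = lam(rng.poisson(b + s, size=100_000))   # Lambda when the source is real

    print("size  ~", np.mean(null_lam < k))   # close to alpha (discreteness blurs it)
    print("power ~", np.mean(alt_lam < k))    # 1 - beta for this toy source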

  4. Making it Quantitative Usually we deal with composite hypotheses. That is, θ isn’t a single point in parameter space, but we allow a range of values. Then we compare the best representatives of the two hypotheses: choose θ0 to maximize L(θ0) and θ1 to maximize L(θ1) in the allowed regions of parameter space. In order to use the likelihood ratio test (LRT) we need to be able to solve the equation P(Λ < k | θ0) = α for k. In general the distribution of Λ is unknown, but Wilks’s Theorem gives a useful asymptotic expression. The alternate model must “include” the null. That is, the set of null parameters {θ0} must be a subset of {θ1}. For instance, θ0 describes N point sources, while θ1 has N+1 sources. When there are many events, -2 ln(Λ) ~ χr², the χ² distribution with r degrees of freedom. This is what we call TS (“test statistic”). Here r is the difference in the number of parameters in the null and alternate sets. This is the basis for the popular χ² and F tests. If r = 1, then √TS follows the unit normal distribution! Thus a 3-sigma result requires ln(Λ) = -4.5. Why doesn’t this work? See next page.
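
Converting a TS value into a chance probability under this asymptotic recipe is a one-liner; a sketch (the TS value is arbitrary, and scipy is assumed):

    import numpy as np
    from scipy.stats import chi2, norm

    TS = 25.0                        # a hypothetical test statistic
    r = 1                            # one extra parameter (the source flux)

    p_value = chi2.sf(TS, df=r)      # chance probability under the null, per Wilks
    sigma = norm.isf(p_value / 2.0)  # equivalent two-sided Gaussian significance

    # For r = 1 the shortcut is just sqrt(TS): TS = 9 gives 3 sigma, i.e. ln(Lambda) = -4.5.
    print(f"p = {p_value:.2e}  ~{sigma:.1f} sigma  sqrt(TS) = {np.sqrt(TS):.1f}")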

  5. Conditions, caveats & quibbles • How many photons do we need to use the asymptotic distribution? I’m not sure. The faintest EGRET detections on a strong background always had at least ~50. That’s certainly enough. Can GLAST detect a source with fewer? • More seriously, Wilks’s Theorem doesn’t work for our most common situation. It is valid only under “regularity” conditions on the likelihood function and the area of parameter space we study. • Example: We want to know if there is a point source at a certain position. The brightness of the source will be the only adjustable parameter in the alternate model. Of course the brightness must be ≥ 0. When the brightness = 0, the alternate and null models are indistinguishable, and the null hypothesis sits on the boundary of the allowed parameter space; this violates one of the regularity conditions. • What are the consequences?

  6. EGRET pathology: not so bad Extensive simulations were done using the EGRET likelihood program and a realistic background model, with no point sources. The histogram of test statistic values doesn’t follow the χ1² distribution. It’s low by a factor of 2. This discrepancy isn’t surprising. Half of the simulations would produce a better fit with negative source brightness. This isn’t allowed, so Λ = 1 (TS = 0) in all these cases. There should be a δ-function at 0 in the graph. Statisticians call the resulting distribution ½ χ0² + ½ χ1².
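
A toy simulation (not the EGRET analysis itself) shows the same ½ δ(0) + ½ χ1² behaviour whenever the fitted parameter is pinned at a boundary under the null; here a Gaussian mean constrained to be ≥ 0 stands in for the source brightness:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(1)
    n_sims, n_data = 20_000, 100

    x = rng.normal(0.0, 1.0, size=(n_sims, n_data))  # data generated under the null (mean 0)
    mu_hat = np.clip(x.mean(axis=1), 0.0, None)      # MLE with the mu >= 0 constraint
    TS = n_data * mu_hat**2                          # -2 ln(Lambda) for known unit variance

    print("fraction with TS = 0:", np.mean(TS == 0.0))   # ~ 1/2, the delta-function at 0
    print("P(TS > 2.71):", np.mean(TS > 2.71),           # ~ half of the naive value
          " vs naive chi2_1:", chi2.sf(2.71, df=1))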

  7. GLAST pathology: ??? We are in the early stages of similar simulations for GLAST. The results are harder to understand. In this example, about ¾ of the cases result in TS = 0, rather than the expected half. About half of the positive TS values are < 0.1. The distribution cuts off at large TS more sharply than a χ² should. If this type of behavior persists, the interpretation of TS values will be more difficult. We will need to use simulations to produce probability tables.
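
In that case the “probability table” is just the empirical tail of the simulated null TS values; a minimal sketch, assuming the background-only TS values have been collected into a text file (the file name and observed TS are placeholders):

    import numpy as np

    null_ts = np.loadtxt("null_ts_simulations.txt")  # hypothetical file of simulated TS values
    ts_obs = 20.0                                     # a placeholder observed TS

    # Empirical p-value; the +1 keeps it from being exactly zero for large TS.
    p_emp = (np.sum(null_ts >= ts_obs) + 1) / (null_ts.size + 1)
    print(f"empirical p-value: {p_emp:.2e}")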

  8. Final Words • This is by no means everything we need to know about statistics. I have said nothing about parameter estimation, upper limits, or comparing models which are not “nested”. • Finding an efficient method to optimize the parameter values is a major effort. • The problem of multiple point sources is an example of a “mixture model”. How do we decide when to stop adding more sources? That’s cutting-edge research in statistics. • I have also skipped over the Bayesian method for dealing with the hypothesis testing problem. That could be a whole other talk. • Some of us have been talking with Ramani Pilla of the Statistics dept. at Case Western. She has a novel method which avoids the use of Wilks’s Theorem. The computation of probabilities is quite involved, but it should be tractable for comparisons with only one additional parameter.

  9. References • The ultimate reference for all things statistical is Kendall & Stuart, “The Advanced Theory of Statistics”. I have consulted the 1979 edition, Volume 2. It is very dense and mathematical. • More accessible versions can be found in Barlow’s “Statistics” and Cowan’s “Statistical Data Analysis”, both written for physicists. These books are a bit expensive, but I like them. They consider both Bayesian and frequentist methods. • A cheaper alternative is Wall & Jenkins “Practical Statistics for Astronomers”. It tends to skimp on the theory, but it could be useful. • The downfall of the LRT was pointed out clearly by Protassov et al. 2002, ApJ 571, 545. • Pilla’s method is described in Pilla et al. 2005, PRL 95, 230202. • The EGRET likelihood method is explained by Mattox et al. 1996, ApJ 461, 396.
