
An Introduction to X-ray Spectral Fitting II: Statistical background


Presentation Transcript


  1. An Introduction to X-ray Spectral Fitting II: Statistical background
Other material:
• XSPEC manual: http://heasarc.gsfc.nasa.gov/docs/xanadu/xspec/
• "X-ray school" lecture notes: http://xrayschool.gsfc.nasa.gov/docs/xrayschool/
• Local pages: http://www.star.le.ac.uk/~sav2/stats

  2. The forward-fitting method (revision)
• Define model
• Calculate model
• Convolve with response
• Change model parameters
• Compare to data
Repeat the loop until the fit converges; a minimal sketch of the loop is given below.
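As a rough illustration only (the power_law model, response matrix and data arrays are hypothetical placeholders, not XSPEC internals), one pass of the loop might look like this:

```python
import numpy as np

def power_law(energies, norm, index):
    """Hypothetical example model: a simple power law."""
    return norm * energies ** (-index)

def forward_fit_step(params, energies, response, counts, errors):
    """One pass of the forward-fitting loop: evaluate the model, fold it
    through the response, and compare to the data via chi-square.
    A minimiser would call this repeatedly, adjusting params each time."""
    model_flux = power_law(energies, *params)             # calculate model
    predicted = response @ model_flux                     # convolve with response
    return np.sum(((counts - predicted) / errors) ** 2)   # compare to data
```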

  3. What next? So you’ve fitted a model M(θ) – with parameters θ = {θ1, θ2, …, θP} – to some data D and found the “best fit” (e.g. minimum χ2). What next…? Hypothesis testing: does the model give a “good” fit? Is the model statistically rejected? Confidence regions: how well determined are the parameters of the model? (“error bars”)

  4. Hypothesis testing Once you have chosen a model and obtained the ‘best fit’ parameters you should ask whether the model is a fair representation of the data: “Is my model statistically acceptable?” For this you need a hypothesis test. If the model is not acceptable, you have shown the model to be wrong (to some specified level of confidence). Use Pearson’s χ2 to test the fit “All models are wrong, but some are useful” - George Box

  5. Testing the fit Assuming (1) the model is reasonable and (2) the uncertainties are Gaussian, Pearson’s χ2 statistic has a known distribution - also called χ2 - for ν “degrees of freedom.” χ2(ν) has a mean (expectation) value of ν and a variance of 2ν. If the fit is good expect χ2 ≈ ν, or “reduced” chi-square χ2/ν ≈ 1. Bad fit: χ2/ν >> 1. “Too good”: χ2/ν << 1.
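A quick numerical check of these expectations, using scipy's χ2 distribution (ν = 50 is just an arbitrary example):

```python
from scipy.stats import chi2

nu = 50
print(chi2.mean(nu), chi2.var(nu))   # 50.0 and 100.0: mean = nu, variance = 2*nu
```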

  6. Theoretical χ2 distribution NB: the χ2 distribution becomes approximately Gaussian as ν increases.

  7. Null hypothesis probability By integrating the probability density function (PDF) for the theoretical χ2 distribution we can estimate the probability of measuring a higher value than χ2obs by chance, if the model is correct. If this probability is small (e.g. Pr = 0.05) then the fit is not good, and we can reject the hypothesis (that the model fits) with 95% confidence. [“confidence” = (1-Pr) × 100%] XSPEC automatically tells you the null hypothesis probability after every “fit”.
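The same number can be computed outside XSPEC; a sketch with a made-up fit result:

```python
from scipy.stats import chi2

chi2_obs, nu = 112.5, 91        # hypothetical best-fit chi-square and degrees of freedom
p_null = chi2.sf(chi2_obs, nu)  # Pr(measuring a higher chi2 by chance | model true)
print(f"null hypothesis probability = {p_null:.3f}")
# A small value (e.g. < 0.05) means the model can be rejected at >= 95% confidence.
```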

  8. Probability density Integrating the PDF of the χ2 distribution from the observed value upwards, we get the ‘probability’ of measuring a higher value than χ2obs by chance (the shaded area in the figure). If this is small then our data are unlikely to have been produced by this model (= a bad fit!).

  9. The two types of error for a hypothesis test Type I error: reject a true hypothesis “found guilty when actually innocent” Type II error: accept a false hypothesis “found innocent when actually guilty”

  10. Outcomes of χ2 fitting
• χ2 » ν = bad fit (Pr « 0.05): not found the global minimum yet; poor choice of model; poor calibration (see next slide); errors too small (software problem)
• χ2 ~ ν = acceptable fit (Pr ~ 0.5): check any “frozen” parameters; is this the only viable model?
• χ2 « ν = fit “too good” (1-Pr « 0.05): errors too large (software problem); too many parameters (over-fitting)
Occasional low/high χ2 results are to be expected, but be wary of consistently getting extreme values...

  11. Systematic errors When you test a model in X-ray astronomy, what you are really doing is testing how well the data fit the folded model. If the fit is to be good you need:
• A good astrophysical model
• A good detector response model
The second condition depends on the quality of the instrumental calibration. Here we must remember: no calibration is perfect! Imperfect calibration is a cause of systematic errors, and if you have very high S:N data these systematic errors may dominate. One often sees working astronomers adding e.g. 2% “systematic” errors in quadrature – an unsound procedure (systematic errors are correlated). We recommend eye-ball examination of the fit residuals. Are they systematic? Try ignoring the ‘dodgy’ region…?

  12. Warning: Statistics! Never blindly apply textbook statistics, always think about what you are trying to achieve. A model that is rejected at 95% confidence may seem like a bad fit, and you might be thinking of rejecting it. But if you found one spectrum like this out of ~20 spectra, it is not an interesting result! We should expect a false rejection – type I error – 5% of the time when performing a 95% confidence test! This is the expected rate of “false alarms”. (Solution: increase your confidence threshold) If a model gives a good fit to your data, it does not mean the model is correct. There may be other models that give a good fit. All you can say is that the data are consistent with the model. It is often more interesting to rule out a simple model (with high confidence if χ2 >> ν). This gives you motivation for a more complex model – progress!

  13. What next? So you’ve fitted a model M(θ) – with parameters θ = {θ1, θ2, …, θP} – to some data D and found the “best fit” (e.g. minimum χ2). What next…? Hypothesis testing: does the model give a “good” fit? Is the model statistically rejected? Confidence regions: how well determined are the parameters of the model? (e.g. “error bars”)

  14. Confidence regions (parameter uncertainty) Once you have a “good” model you will want to know the confidence regions for each of the parameters. To do this we observe how χ2 changes (increases) as we vary the parameter from its best (minimum χ2) value.
> error <parameter> <Δχ2>
> error 2 2.706
Gives the 90% confidence region on parameter number 2. No point if the fit is rejected!

  15. Confidence regions: 1-dimension = “error bar” For the confidence region of one parameter we measure χ2 as the parameter is forced away from the best fit (but letting the other parameters find new best-fit values). The value of Δχ2 follows a χ2 distribution with ν = 1. For example: the 90% confidence region (Pr = 0.1) for our one parameter is the value, in either direction, that increases χ2 by 2.706. The value of 2.706 can be found in many good statistics tables.
> steppar <parm> <lo> <hi> <no. steps>
> steppar 2 1.7 3.0 50
> plot contour
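A rough sketch of what this stepping amounts to for one parameter (fit_with_fixed is a hypothetical stand-in for re-fitting with the parameter of interest frozen at each grid value and the others left free):

```python
import numpy as np

def one_d_confidence(best_chi2, param_grid, fit_with_fixed, delta=2.706):
    """Step one parameter over a grid, re-fitting the others each time,
    and return the range where chi-square stays within best_chi2 + delta
    (delta = 2.706 corresponds to the 90% region for a single parameter)."""
    param_grid = np.asarray(param_grid)
    profile = np.array([fit_with_fixed(p) for p in param_grid])
    inside = param_grid[profile <= best_chi2 + delta]
    return inside.min(), inside.max()
```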

  16. Confidence regions: 2-dimensional = “contour plot” The “probability” surface shows probability (grey) as a function of two parameters (here, X and Y). A 90% confidence region in two dimensions is the locus (red) containing 90% of the probability. NB: the 90% region in 1D is always narrower than the projection of the 2D region onto 1D.

  17. Confidence regions: 2D For the joint confidence region of two parameters we measure χ2 as both parameters are forced away from the best fit (but letting the other parameters float). The value of Δχ2 now follows a χ2 distribution with ν = 2. For example: the 90% joint confidence region (Pr = 0.1) for our two parameters is the contour around the best fit that increases χ2 by 4.61.
> steppar 1 0.1 0.4 25 2 1.9 2.7 25
> plot contour

  18. Confidence regions The value of Δχ2 for a ν-dimensional confidence region can be found in various statistics tables (see below). If you want to know the confidence region of one parameter (“error bar”) use ν = 1. For the joint confidence region of 2 parameters (2-dimensional contour plot) use ν = 2. Note: this is not the same as the total number of parameters in the model! Example: for ν = 1 the probability is 1% that Δχ2 > 6.63.
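The usual look-up table can be regenerated directly from the χ2 distribution, for example:

```python
from scipy.stats import chi2

# Delta chi-square thresholds for nu = 1, 2, 3 "interesting" parameters.
for conf in (0.68, 0.90, 0.99):
    thresholds = [chi2.ppf(conf, nu) for nu in (1, 2, 3)]
    print(conf, [f"{t:.2f}" for t in thresholds])
# e.g. 90% with nu = 1 gives 2.71, with nu = 2 gives 4.61; 99% with nu = 1 gives 6.63.
```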

  19. How many parameters? If we have (i) a simple model and (ii) a more complex model, with more free parameters, we will want to compare the two. Does the more complex model give a significantly better fit to the data? If not, then we should stick with the simpler model. Occam’s razor: “don’t make models more complex than strictly needed!” Compare the improvement in fit (Δχ2) with the theoretical chi-square distribution to get a probability. This is related to the F-test that is implemented in XSPEC. (NB: Δχ2 < 1.0 is never significant.)

  20. F-test in XSPEC
> ftest <new χ2> <new ν> <old χ2> <old ν>
> ftest 96.1 89 102.2 91
Low probability means high significance. E.g. Pr = 0.06 → (1-Pr) × 100 = 94% significance. Single power law: χ2 = 102.2, ν = 91. Broken power law: χ2 = 96.1, ν = 89. No point if the simpler model M0 is rejected!
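The same probability can be estimated by hand with one common form of the F statistic for additional terms (a sketch; not guaranteed to match the XSPEC implementation digit for digit):

```python
from scipy.stats import f

chi2_old, nu_old = 102.2, 91   # single power law (simpler model)
chi2_new, nu_new = 96.1, 89    # broken power law (two extra free parameters)

F = ((chi2_old - chi2_new) / (nu_old - nu_new)) / (chi2_new / nu_new)
p = f.sf(F, nu_old - nu_new, nu_new)   # chance probability of this improvement
print(f"F = {F:.2f}, Pr = {p:.3f}")    # Pr ~ 0.06, i.e. ~94% significance
```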

  21. When to use the F-test (or Δχ2 test and its relations) Model 0: with P0 free parameters. Model 1: with P1 free parameters (P1 > P0). Rule 1: the simple model M0 must be a subset of the complex model M1: “nested models”. Rule 2: the null values of the (P1-P0) additional parameters must not lie on the boundary of parameter space. Rule 3: the number of counts must be large.

  22. “Good” example of F-test H0: power law. H1: broken power law. The broken power law reduces to the simple power law for some set of parameter values, so the models are nested.

  23. “Bad” example of F-test H0: power law. H1: blackbody. The blackbody cannot be reduced to the power law for any parameter values, so the models are not nested.

  24. When to use the F-test In a nutshell: when comparing two models with different numbers of free parameters, where the simpler of the two models can be obtained by fixing one or more parameters of the more complex model. The F-test really tells you when you can free up (thaw) a previously fixed (frozen) parameter. BUT: make sure that when you fix the parameters of the more complex model to make it equivalent to the simpler model, the fixed parameters are not at their extreme limits (e.g. flux = 0 or temperature = 0). Even though the F-test is implemented in XSPEC, the simpler Δχ2 test is more powerful (use a look-up table).

  25. Notes on using χ2 fitting The method of chi-square minimisation is a ‘minimum variance’ method. Adjusting the parameters to minimise χ2 is equivalent to finding the maximum likelihood estimate (MLE) of the parameters only if certain conditions are met:
• The model (both physical and instrumental) is a reasonable representation of the observed spectrum.
• The ‘errors’ on the data are Gaussian.
In order for the counts per channel to be Gaussian we need at least N ≥ 20 counts per spectral bin. To bin a spectrum use grppha (prior to XSPEC); a rough illustration of this grouping is sketched below. If the bins contain few counts the distribution (which is really Poisson) will be noticeably asymmetric. Fluctuations below the model then have more weight than those above, causing the fit to place the model below where it should be. This also means your null hypothesis probability is meaningless, since chi-square is not behaving as expected.
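The sketch below illustrates the idea of grouping to a minimum of 20 counts per bin; it is an illustration only, not the grppha FTOOL itself:

```python
import numpy as np

def group_min_counts(counts, min_counts=20):
    """Merge adjacent channels until each group holds at least min_counts
    counts (roughly what a 'group min 20' instruction does in grppha)."""
    groups, running = [], 0
    for c in counts:
        running += c
        if running >= min_counts:
            groups.append(running)
            running = 0
    if running > 0 and groups:
        groups[-1] += running      # fold leftover counts into the last group
    return np.array(groups)
```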

  26. Poisson vs. Gauss Comparison of the Poisson distribution (dotted) with a Gaussian distribution (solid curve) having the same mean and variance, for μ = 5, 10, 20 and 50. Rule of thumb: the Poisson distribution is usually ‘close enough’ to Gaussian when μ > 20. Common practice to group data such that there are ~20 counts/bin, e.g. using the FTOOL “grppha”.
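The rule of thumb is easy to check numerically; a small sketch comparing the two distributions at μ = 20:

```python
import numpy as np
from scipy.stats import poisson, norm

mu = 20
k = np.arange(0, 2 * mu + 1)
p_pois = poisson.pmf(k, mu)
p_gauss = norm.pdf(k, loc=mu, scale=np.sqrt(mu))   # same mean and variance
print(f"largest difference at mu={mu}: {np.abs(p_pois - p_gauss).max():.4f}")
```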

  27. Fitting without grouping There is another statistic, given by Cash (1979), that gives maximum likelihood results for data following the Poisson distribution, with no need to group the data. Recall the Poisson distribution: p(C|μ) = μ^C e^(-μ) / C!. This gives the probability of observing C counts if the true (expected) number is μ. Now, if we have a model, this defines our expected counts in a given bin, μi. So the probability of observing Ci counts in channel i, if the model is true, is p(Ci|μi). If we have many channels i = 1, 2, …, N then the combined (joint) probability for all of them is just the product of all the individual probabilities: L = Πi p(Ci|μi). But it is often simpler to work in the logarithm: ln L = Σi [Ci ln μi - μi - ln(Ci!)].
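Putting the log-likelihood into code, a minimal Cash-type statistic might look like the sketch below (XSPEC's cstat differs by a data-only term, chosen so the statistic behaves like χ2 in the high-count limit):

```python
import numpy as np
from scipy.special import gammaln

def cash_statistic(counts, model):
    """-2 times the Poisson log-likelihood, in the spirit of Cash (1979).
    gammaln(C + 1) = ln(C!), which is constant with respect to the model."""
    counts = np.asarray(counts, dtype=float)
    model = np.asarray(model, dtype=float)
    log_like = np.sum(counts * np.log(model) - model - gammaln(counts + 1))
    return -2.0 * log_like
```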

  28. Fitting without grouping This gives the probability of observing the count spectrum Ci (i = 1, 2, …, N) if the model is true. If we already have an observed spectrum this defines the likelihood of that spectrum, under the hypothesis that a particular model is true. NB: likelihood is similar to probability, but is a function of the parameters given some fixed data (rather than a distribution of data given some fixed model). Define S = -2 ln L (the Cash statistic): minimising S ⇔ maximising the likelihood. So replace χ2 by S and you have fit the data without any binning. The result, the fit giving minimum S, is the maximum likelihood model, i.e. the model for which the likelihood of observing your data is maximised. In the limit of many counts/bin, S becomes the same as χ2 up to a constant (use Stirling’s approximation). A better interpretation of χ2 is therefore as an approximation to maximum likelihood in the high count/bin limit.

  29. Using the C-statistic Q: Do you have very few photons (e.g. N « 1000)? Q: Do you want to use the maximum spectral resolution (minimum binning)? If so, switch statistic with the command: statistic cstat
• The Cash statistic can be minimised just like χ2 to give the ‘best’ fit.
• Confidence regions can be derived in exactly the same way (using e.g. ΔS = 2.706…).
• One can use the likelihood ratio test in a similar fashion to the F-test (they are related).
• Drawback: there is no analytical “goodness of fit” test, i.e. cstat does not instantly give you a “null hypothesis probability” to gauge how good the fit is.
In fact, since binning data generally loses information, this approach can be useful even beyond low count/bin data. Use the command goodness 1000. This will generate 1000 randomised datasets (using the Poisson distribution) and tell you the fraction of simulated datasets that give a better (lower) S-value than the real data. This is a “Monte Carlo” calculation of the rejection probability for your model (i.e. 99% means a bad fit!)
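In the same spirit, a stripped-down Monte Carlo goodness check might look like the sketch below (unlike XSPEC's goodness command, this version does not re-fit each simulated dataset, so it is only an illustration):

```python
import numpy as np

def goodness(model_counts, observed_stat, stat_func, n_sims=1000):
    """Simulate Poisson datasets from the best-fit model and report the
    fraction whose fit statistic is better (lower) than the observed one.
    A fraction near 0.99 would indicate a bad fit."""
    rng = np.random.default_rng()
    sims = rng.poisson(model_counts, size=(n_sims, len(model_counts)))
    sim_stats = np.array([stat_func(s, model_counts) for s in sims])
    return np.mean(sim_stats < observed_stat)
```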

  30. Pros and cons of χ2
Pros:
• Simple to calculate
• Ready-made (analytical) significance test
• Historical use of χ2
Cons:
• Only gives maximum likelihood if data are Gaussian (not Poissonian)
• Therefore need to bin up the data, which loses spectral information
• Can be insensitive to small residuals in large datasets (takes no account of the order of residuals)

  31. The small print: Probability In the previous slides I used the word ‘probability’ in a rather loose fashion. Strictly speaking, when we perform a χ2 test in classical statistics, we do not evaluate the probability that the model is incorrect. What we are doing is calculating the frequency that hypothetical future observations will give higher values of χ2, assuming the current model is correct. Minimizing χ2 is not the same as maximizing the probability of the model. The minimum in χ2 corresponds to the model that makes the observed data most likely. Hence “maximum likelihood.” Likewise with confidence regions. Contrary to popular opinion, these do not really give the probability that the true value of a parameter lies in the given range. Instead they give the frequency of hypothetical future observations giving parameter estimates in that range, assuming the current model is correct. To do more than this requires inference… If you want to perform statistical inference (“what is the probability of the true value of θ lying in a given range?”) you need to delve a little deeper. See Sivia (1996)

  32. General stats references
• Data Reduction and Error Analysis in the Physical Sciences - P. R. Bevington & D. K. Robinson (2003)
• Statistical Methods in Experimental Physics - W. T. Eadie et al. (1971)
• Statistics for Physicists - B. R. Martin (1971)
• Numerical Recipes (ch. 14-15) - W. H. Press et al. (1996)
• Practical Statistics for Astronomers - J. V. Wall & C. R. Jenkins (2003); see also J. V. Wall (1979 & 1996), QJRAS
• Data Analysis: A Bayesian Tutorial - D. S. Sivia (1996)
