1 / 40

Statistics

Statistics is . . . . . making a silk purse out of a pig's ear, i.e. getting reliable results from data which is never quite as good as you hoped it was going to be. Statistics. N.B. The book used for these notes was “Statistics” by R J Barlow, Wiley 1989. How long is a piece of string?.

dusan
Download Presentation

Statistics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Statistics is . . . . . making a silk purse out of a pig's ear, i.e. getting reliable results from data which is never quite as good as you hoped it was going to be. Statistics N.B. The book used for these notes was “Statistics” by R J Barlow, Wiley 1989.

  2. How long is a piece of string? • N measurements x1, . . . . . xNof the length of a piece of string • Moments of distribution: mean and variance V = • We suppose that is an estimator of the string length  and  an estimator of the measurement errors. a good estimator should be unbiased—its expectation value should equal the true value, should be consistent—the limit as N should be the true value and should be efficient—its variance should be small. (It turns out that the mean satisfies these criteria, but the variance does not—it is biased. However the expression is unbiased.) But what do we mean by the “true value”?

  3. The population model • Set up a model of the measurement, where each observation is subject to an error ei which is a random variable of mean 0 and variance V, which added to  gives xi. • This model dataset has an infinite number of randomly-distributed observations from which our real dataset has been sampled. • The expectation value < > is evidently . Because we know the probability distribution of the xi we can answer questions like " if I repeat this experiment 1000 times, what will be the magnitude and distribution of the values of - ?" and so to deduce the accuracy of .

  4. Statistical inference • This is an example of the process of statistical inference, where we set up a model of the measurement and then deduce its properties using suitable estimators. For more complicated models than the one used here we might use a fitting process to deduce the values of several parameters of the model, but the process is essentially the same. Conceptually, we can repeat the measurement as often as we like. This allows us to answer the question, “what is the probability that I will get a value of in the range a to b” by repeating the (thought) experiment many times and finding the fraction which give a result in this range. This is the frequentist approach to defining the probability of an outcome. The obvious problem is that the idea is highly artificial: no one would really try to answer these questions by actually repeating the experiment—sometimes in astronomy that is not even possible in principle. Observations of a classical gamma-ray burst could not be repeated because that star no longer exists! On the other hand it defines clearly what we mean by probability, which otherwise is not easy.

  5. The Calculus of Probabilities • Suppose a certain “event” has two possible outcomes A and B with probabilities P(A ) and P(B). Also let P(A  B) be the probability of A given B. • Then, if A and Bare mutually exclusive P(A) + P(B) = 1 • P(A & B) = 0 • P(A + B) = 1 If A and B are independent P(A B) = P(A) Otherwise, more generally, P(A + B) = P(A) + P(B) – P(A & B) and P(A & B) = P(A) P(BA) = P(B) P(AB) The last expression can be re-arranged to give Bayes’ Theorem P(A B) = P( A) P(B A) / P(B) These axioms are the basis of the mathematical theory of probablility

  6. The Bayesian approach • Suppose P(m d) is the probability of our model, given the observed data (the “posterior” probability), then by Bayes’ Theorem we can write • P(m d) = P( m) P(d m) / P(d) • where P( m) is the probability of the model before we take into account our new measurements (the “prior” probability), • P(d m) is the probability of getting our particular dataset if the model is true, • and P(d) is a normalisation constant which makes the sum of P(m d) =1, taken over all possible models. Now P(m d) is just exactly what we would like to know—going back to our piece of string, P(x) is much better than P( x) which is what we get from the frequentist analysis. But nobody knows what this probability really means, and they attach different names—propensity, plausibility, confidence which make it plain it is somewhat subjective, unlike the frequentist probability which has an objective meaning. In addition it is usually difficult to know what value to give the prior; normally it is given some rather bland value, implying little prior knowledge.

  7. Model space data space Probability theory Statistical inference Probability theory and statistical inference How do we calculate the probability P(d m)of getting a given dataset, from a knowledge of the model? True model Parameter estimation statistical inference-- what can we say about the probability P(m d) of the model or its parameters from the properties of our dataset?

  8. Probability distributions The specification of the model will tell us the distribution function which we must use to calculate the probability P(r) of any particular outcome r for a measurement ( error ei or numbers of photons). For example, if we are observing random events which occur at a mean rate of  per sec, then we must use a Poissonian distribution with that mean value. For example, if we are observing random events which occur at a mean rate of  per sec, then we must use a Poissonian distribution with that mean value. If we are measuring the heights of students in a class, with mean  and variance 2, we must use a Gaussian distribution with those parameters (Central Limit Theorem). Note that r is discrete but x is a continuous variable, so PGis aprobability density: dPG = PG. dx

  9. Poisson distribution The mean <r> = , of course, but in addition the variance  also Gaussian distribution For large , the Poisson distribution tends to a Gaussian of mean and variance 

  10. The 2 distribution Another important distribution is that of 2, the statistic often used in fitting. This is where n=N-m is the number of degrees of freedom—N is the number of data points fitted, and m the number of parameters optimised—and  is the gamma function.

  11. The Central Limit Theorem It is noticeable that Poissonian and 2 distributions tend to become more symmetrical and to look more like a Gaussian at large or n, respectively; they do in fact tend to Gaussians. Another important case is where the distribution of a quantity (like the heights of a population) depends on a number of independent factors; it then always tends to be Gaussian-distributed. The Central Limit Theorem states that any variable which depends on a large number of independent factors will be Gaussian-distributed.

  12. Model-fitting with Maximum Likelihood • Calculate P(d m) for the particular case of models of photon-counting. • Usually called the likelihood of the data—only difference is normalisation • L(d;a) = P(d m) = • so Next we maximise the likelihood by using the values of a for which

  13. Example: • Model: constant source of counting rate a. • Since the statistics are Poissonian, we have • So the Maximum Likelihood value is just the mean.

  14. In reality . . . • Of course the model is likely to be much more complicated, • e.g. a point source of strength a at position () in the field of view, together with a background b, would be modelled by • m(x, y) = b + a. h(x - , y - ) • where h(a, b) is the point spread function of the image. Now lnL has to be minimised with respect to 4 parameters, a, b, and , so it must be done numerically. Note that with ML/Poissonian statistics it is never correct to subtract the background before fitting as it changes the statistics.

  15. Finally, in “real life” (i.e. LAT) • The source model is considered as: • This model is folded with the Instrument Response Functions (IRFs) to obtain the predicted counts in the measured quantity space (E’,p’,t’): • where • is the combined IRF. is the orientation vector of the spacecraft. The integral is performed over the Source Region, i.e. the sky region encompassing all sources contributing to the Region-of – Interest (ROI).

  16. Extended ML fitting • The above formulation of the likelihood is correct only if the total number of photon events is predetermined by stopping the exposure when a desired value has been reached. In practice of course we normally collect events until a desired exposure time is reached in which case the total number of events is a parameter of the model. This changes the form of lnL slightly; the main effect is to increase the estimated errors.

  17. Bayesian Model-fitting • From Bayes’ Theorem, P(m d) = P( m) P(d m) / P(d) • If we have no prior information about the model parameters, P( m) will be uniform over the parameter space and P(d) is only a normalising factor, so the parameter values that which were obtained from the ML analysis which maximise P(d m) will also maximise P(m d) and so are the Bayesian values.

  18. 2 (Least squares) Model-fitting • If our data is binned • and if the number of photon events per bin is large enough (>5, 10, 20, say) • then we can replace the Poisson distribution in the model by a Gaussian distribution • and for a one-parameter model For a multi-parameter model , where mi and i2 are the predicted mean and variance in the ith bin. Maximising lnL is equivalent to minimising , so Least Squares fitting is just ML fitting with large numbers of events. In fact, in place of lnL, the Cash statistic C = -2lnL can be used which makes the correspondence even closer.

  19. Pros and cons of  • Cons: • (i) Data must be binned reducing resolution • (ii) What value to use for the estimate of i2? • Commonly, xi is used but this introduces a bias • Other choices have been proposed. As an example, Keith Arnaud did 1000 simulations of a Chandra power law spectrum of slope 1.7 with 1000 events in the spectrum and fitted this data using different fit statistics to get the following results for the fitted slope: • Standard 2 1.41 (-0.15, +0.16) • Gehrels 2 1.88 (-0.22, +0.25) • Churasov 2 1.69 (-0.15, +0.16) • ML 1.70 (-0.13, +0.14) • The superiority of ML is obvious; it is partly due to this issue of the estimate of the variance, and partly to the use of the approximate distribution function. Pros: 2 has one important strength, however—its distribution is known which allows an estimate to be made of the goodness-of-fit of the model.

  20. Calculating Optimal Parameter Values • The calculation of maximal/minimal statistics is usually done numerically using a standard computer package. • BEWARE!! • Suppose we need to minimise a statistic S using a model with m free parameters. The simplest method is, starting from a suitably-chosen point on the S-m surface, to calculate the direction of steepest descent and take a small step in that direction. Repeating this procedure, one will eventually arrive at a minimum. The problem is that unless m is small, the S-m surface may well have multiple minima and the one found may be only a local minimum. Always repeat the calculation from a large range of initial parameters and make sure the fit always converges on the same solution.

  21. Confidence intervals We now assume that this means that will lie within this range in 90% of experiments.

  22. Reverend Bayes is back!!This is a Bayesian statement based on a uniform prior.Suppose we estimate a parameter m to be x, and that we have an estimate of the error s. We assume that x is Gaussian-distributed and use a prior P(m) = 1Note that this is correctly normalised—x must have a value between - ∞ and ∞ and P integrated over this range is 1.We know m must be positive, so now we use a prior P(m) = 1, m ≥ 0 = 0 otherwiseand now we must re-normalise so thatx > 0

  23. Dealing with negative intensities We can use this result to advantage in those cases where the confidence interval includes physically-nonsensical negative values, e.g. an intensity of 0.06±0.10 units. Using gives the 90% confidence interval as 0.0 to 0.19 instead of -0.04 to 0.16. See Barlow, p130

  24. Confidence Intervals in Model-fitting • The fit statistic gives directly the relative probability attached to any set of parameter values; this is true whether LnL or 2 • If the prior is bland, this is also the model probability given the observed data. • The parameter for which an error estimate is required is varied in steps, whilst at each stage re-optimising the rest, until the change S in the fit statistic S reaches a required value, say 2.71  2 for one parameter is distributed as 2 (1); for two parameters as 2 (2) Confidence,% 68 90 95 99 S (one parameter) 1.0 2.71 3.84 6.63 S (two paras) 2.30 4.61 5.99 9.21 For ML, (lnL) is distributed as - 2 /2 provided the number of data points n is large, so these values can be multiplied by -1/2.. If two parameters are varied at once, the line S =2.3 is an ellipse which traces out the 68% confidence region. If the axes of the ellipse are not parallel to the parameter axes, then the two parameters are correlated.

  25. Correlated variables

  26. Goodness of Fit • It is obvious from the expression • that we would expect the minimum value of 2 to be close to N and this is true—the distribution of 2 is • n = N-m is the number of degrees of freedom where N is the number of data points and m is the number of parameters in the model • This is true for all n, but if n is large enough it approaches a Gaussian of mean n and variance 2n This assumes the model is a good one and so can be used as a test of that hypothesis, by comparing the “reduced” 2mini.e. the best-fit value, with the expected value. If n is large, we compare the “reduced” 2min / n with 1.

  27. A good agreementimplies the model is a good description of the data; if the reduced 2 is too large not only is the model poor, but the difference indicates how likely such a value is by chance. For example, a bremsstrahlung fit to a 20 channel spectrum gives 2min =35 There are 2 parameters so there are 18 dof. 2min is suspiciously high—P ~ 0.01 by chance Alternatively, a low value implies that the errors are overestimated. This is a very valuable property of 2 fitting and, perhaps, the only reason to consider using it. Unfortunately, there is no ML equivalent. The usual solution is to run e.g. 1000 Monte Carlo simulations of the experiment.

  28. How many Parameters? • Adding more adjustable parameters will usually make a model fit better. • Example—try adding a power law component (2 extra parameters) to the bremsstrahlung • 2min drops to 18 with 16 dof—obviously a better fit • How can you know when to stop? The first answer is just to stop when you get an acceptable reduced 2, but the F test gives a more rigorous answer.

  29. The F test • Suppose an initial model with idegrees of freedom gives a 2 of Si and after fitting extra parameters one gets Sf with f dof. • Calculate • and look up the significance of the result in a table of the F-distribution. (It is essential to note that books usually give this test in its simpler, original form and the tables to go with it. You need tables of the distribution of F(as in Bevington, “Data reduction and error analysis for the physical sciences”) For the example, F = (35-18)/18 x 16/2 = 7.6. The F-test table shows the chance of exceeding 6.23 by chance is 0.01 so the power law is significant at 99% confidence.

  30. Conditions for using the F-test • The F-test can only validly be used if two conditions are satisfied • the simpler model is nested within the more complex • the null hypothesis does not lie along a boundary in data space • (added spectral line, point source, power law) • See Protassov et al , Ap J 2002, 571, 545

  31. The Maximum Likelihood “Test Statistic”The F-statistic is closely related to the “test statistic” TS used to determine the significance of a possible point source, (see below)TS = 2 ( lnL1– lnL2)If N is large, TSis approximately distributed as 2 (nf - ni) /2For a point source, this is 2 (4) /2 (position, flux, spectral index)or 2 (3) /2 (position, intensity)TS = 25 for 4 additional parameters corresponds to P = 2.10-5Rule of thumb: TS = s2 (s is actually 4.1 in this case)There are 20000 PSF’s in the LAT sky so 0.4 spurious sources expected.

  32. Hypothesis testing • We started by considering the problem of parameter estimation • We have now moved into hypothesis testing—e.g. that a spectrum includes a significant power law component. • An important tool of hypothesis testing is the Null Hypothesis, useful where a testable form of the hypothesis cannot be set up.

  33. the Null hypothesis Suppose that we count photons from a source in constant intervals, and find the mean number of events in each is . Then in one interval we measure some (greater) number, l. Is this due to a flare in the source? First we formulate the null hypothesis-- there is no flare, the measurement l is simply drawn from the same population as all the others. The probability of obtaining a count of l or greater is which for =10, l=20 is 1% How significant this result is, depends on how many measurements in all we have made. In 1000 intervals, 10 would give this value by chance alone

  34. Source Searching • Searching an image for point sources is another example of hypothesis testing, where one is looking for pixels (or groups of pixels) which contain a significant excess of counts. • One estimates the background b from some selected large region of the image and compares this with the number of counts n in each pixel. • The significance of a hypothetical source is then (n-b)/ √b in standard deviations, and the hypothesis that there is no source present can then be rejected at some level of confidence. A group of pixels would be used to increase the sensitivity of the test when the image point spread function covers several pixels. A maximum probability of ~10-6 ("5") is usually taken as a realistic value when searching for point sources in an image which can easily have 105 pixels.

  35. Timing Analyses • Dave showed data from the Vela Pulsar using gtptest

  36. Testing Pulsation Significance – Output from gtptest on Vela Type of test: Chi-squared Test, 10 phase bins Probability distribution: Chi-squared, 9 degrees of freedom Test Statistic: 824.028880866426 Chance Probability Range: (0, 2.03757046903054e-99) Type of test: Rayleigh Test Probability distribution: Chi-squared, 2 degrees of freedom Test Statistic: 46.2571601550502 Chance Probability Range: (9.02305042259081e-11, 9.02395272685048e-11) Type of test: Z2n Test, 10 harmonics Probability distribution: Chi-squared, 20 degrees of freedom Test Statistic: 1511.03487971911 Chance Probability Range: (0, 2.0785338644267e-99) Type of test: H Test, 10 maximum harmonics Probability distribution: H Test-specific Test Statistic: 1475.03487971911 Chance Probability Range: (0, 4e-08) From Dave's timing talk

  37. Pulsation tests • Chi-squared null hypothesis: all bins equal • Rayleigh : phases form a random walk • Z2n : pulsation is a sine wave with n harmonics • H Test :pulsation is a sine wave with automatic selection of n Why have more than one? Because the waveform can vary from one or two sharp peaks/cycle to a sine wave The tests are most sensitive for waveforms which match their null hypothesis Also, the last three tests do not require the data to be binned

More Related