
Objective Functions for Model Fitting: Sum of Squared Residuals (Method of Least Squares)

This lecture discusses objective functions for model fitting, specifically the sum of squared residuals. Topics include the method of least squares, likelihood, hypothesis testing, and ways to handle homogeneous and non-homogeneous noise. Examples and applications are provided.


Presentation Transcript


  1. Lecture 6 • Objective functions for model fitting: • Sum of squared residuals (=> the ‘method of least squares’). • Likelihood • Hypothesis testing

  2. Model fitting – reminder of the terminology: • We have data yi at samples of some independent variable xi. • The model is our estimate of the parent or truth function. • Let’s express the model m(xi) as a function of a few parameters θ1, θ2 … θM. • Finding the ‘best fit’ model then just means finding the best estimates of the θ. (Bold – shorthand for a list.) • Knowledge of the physics informs the choice of m and θ. • The parent function is what we’d like to find out (but never can, exactly).

  3. Naive best-fit calculation: • The residuals for a particular model are yi − mi. • To ‘thread the model through the middle of the noise’, we want the magnitudes of all the residuals to be small. • A reasonable way (not the only way) to achieve this is to define the sum of squared residuals as our objective function: U = Σi (yi − mi)². • Fitting by minimizing this objective function is called the method of least squares. It is extremely common. • NOTE! This approach IGNORES possible uncertainties in x.
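
A minimal Python sketch of this objective and its minimization; the function names, the straight-line model and the synthetic data are all hypothetical, not from the lecture:

    import numpy as np
    from scipy.optimize import minimize

    def sum_squared_residuals(theta, x, y, model):
        # U(theta) = sum_i (y_i - m(x_i; theta))^2, the least-squares objective
        return np.sum((y - model(x, theta)) ** 2)

    def line(x, theta):
        # hypothetical model: m(x) = theta1 + theta2 * x
        return theta[0] + theta[1] * x

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 10.0, 50)
    y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)   # 'parent' line plus homogeneous noise

    result = minimize(sum_squared_residuals, x0=[0.0, 0.0], args=(x, y, line))
    print("best-fit theta:", result.x)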

  4. But what if the noise is not homogeneous? • Some bits clearly have higher σ than others. • Answer: weight by 1/σi², i.e. U = Σi (yi − mi)² / σi². • This form of U is sometimes called χ² (pronounced ‘chi squared’). • To use it, we need to know the σi.
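
The weighted version changes only the objective; a sketch, assuming the same hypothetical setup as above with the per-point σi known:

    import numpy as np

    def chi_squared(theta, x, y, sigma, model):
        # variance-weighted sum of squared residuals: sum_i ((y_i - m_i) / sigma_i)^2
        return np.sum(((y - model(x, theta)) / sigma) ** 2)

For homogeneous noise (all σi equal) this is just the previous objective divided by a constant, so it gives the same best-fit θ.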

  5. Simple example: mi = θ1 + θ2 si. [Figure: the data yi with their truth values; the model – red is si, green the flat background; a contour map of Uls.]

  6. An even simpler example: • Last lecture, I noted that there do exist cases in which we can directly invert ∇U = 0 for the parameters. • For least squares, this happens if the model is a linear function of the parameters θi (for example, a polynomial in x). • Expanding ∇U in this case gives a set of M linear equations in the M parameters, called the normal equations. • It is easy to solve these to get the θi.
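
A sketch of how this looks in practice for any model linear in its parameters (the design-matrix form and the helper name are mine, not the lecture’s): write mi = Σk A[i,k] θk, so ∇U = 0 becomes (Aᵀ W A) θ = Aᵀ W y with W = diag(1/σi²).

    import numpy as np

    def solve_normal_equations(A, y, sigma=None):
        # A is the N x M design matrix: m_i = sum_k A[i, k] * theta[k]
        # W = diag(1/sigma^2); homogeneous (unit) weights if sigma is None
        if sigma is None:
            sigma = np.ones(len(y))
        w = 1.0 / sigma ** 2
        lhs = A.T @ (w[:, None] * A)   # M x M matrix of the normal equations
        rhs = A.T @ (w * y)            # length-M right-hand side
        return np.linalg.solve(lhs, rhs)

    # e.g. a quadratic in x: design-matrix columns 1, x, x^2
    x = np.linspace(0.0, 1.0, 20)
    A = np.vander(x, 3, increasing=True)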

  7. Simplest example of all: fitting a straight line. • Called linear regression by the statisticians. • There is a huge amount of literature on it. • The normal equations for a line model m = θ1 + θ2 x turn out to be (unweighted case): θ1 N + θ2 Σ xi = Σ yi and θ1 Σ xi + θ2 Σ xi² = Σ xi yi; with non-homogeneous noise each sum acquires a 1/σi² weight. • A polynomial is an easy extension of this.
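
For the unweighted case those two equations can be solved by hand; a sketch (function name is hypothetical):

    import numpy as np

    def fit_line(x, y):
        # solves  N*t1 + (sum x)*t2 = sum y   and   (sum x)*t1 + (sum x^2)*t2 = sum x*y
        N = len(x)
        Sx, Sy = x.sum(), y.sum()
        Sxx, Sxy = (x * x).sum(), (x * y).sum()
        det = N * Sxx - Sx ** 2
        theta2 = (N * Sxy - Sx * Sy) / det
        theta1 = (Sy * Sxx - Sx * Sxy) / det
        return theta1, theta2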

  8. χ² for Poisson data – possible, but problematic. • Choose the data yi as the estimator for σi²? • No – can have zero values in the denominator. • Choose the (evolving) model as the estimator for σi²? • No – gives a biased result. • Better: the Mighell formula. • Unbiased, but no good for goodness-of-fit. • Use Mighell to fit θ, then the standard U for “goodness of fit” (GOF). Mighell K J, ApJ 518, 380 (1999)
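
A sketch of the Mighell statistic in Python; the form used here, Σi (yi + min(yi, 1) − mi)² / (yi + 1), is my reading of Mighell (1999) and should be checked against the paper:

    import numpy as np

    def mighell_chi_squared(y, m):
        # modified chi-squared for Poisson counts: avoids zero denominators
        # (unlike sigma_i^2 = y_i) and the bias of using the model as the variance
        y = np.asarray(y, dtype=float)
        return np.sum((y + np.minimum(y, 1.0) - m) ** 2 / (y + 1.0))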

  9. Another choice of U: likelihood. • Likelihood is best illustrated by Poisson data. • Consider a single Poisson random variable y: its PDF is p(y|m) = m^y e^(−m) / y!, where m here plays the role of the expectation value of y. • We’re used to thinking of this as a function of just one variable, i.e. y; • but it is really a function of both y and m.
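
A small sketch of that PDF as a two-argument function (names are mine), evaluated via logs so that large y and m do not overflow:

    import numpy as np
    from scipy.special import gammaln

    def poisson_pmf(y, m):
        # p(y|m) = m^y * exp(-m) / y!
        # fixed m, varying y  -> the PDF of the data
        # fixed y, varying m  -> the likelihood of the model
        return np.exp(y * np.log(m) - m - gammaln(y + 1))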

  10. Poisson PDF

  11. Poisson PDF

  12. PDF for y vs likelihood for θ. Probability: p(y|θ) = θ^y e^(−θ) / y!  Likelihood: p(y|θ) = θ^y e^(−θ) / y! (the same expression, read as a function of θ rather than of y).

  13. The likelihood function. • Before, we thought “given m, let us apply the PDF to obtain the probability of getting between y and y+dy.” • Now we are saying “well, we know y, we just measured it. We don’t know m. But surely the PDF taken as a function of m indicates the probability density for m.” • Problems with this: • The likelihood function is not necessarily normalized, like a ‘proper’ PDF; • What assurance do we have that the true PDF for m has this shape??

  14. Likelihood continued. • Usually we have many (N) samples yi. Can we arrive at a single likelihood for all samples taken together? Yes – take the product: likelihood = Πi p(yi|θ). • (Note that we’ve stopped talking just about Poisson data now – this expression is valid for any form of p.) • Sometimes it is easier to deal with the log-likelihood L: L = Σi ln p(yi|θ).
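
A sketch of the joint Poisson log-likelihood as a fittable objective; the ‘signal + flat background’ model mirrors the earlier example, and all names here are hypothetical:

    import numpy as np
    from scipy.special import gammaln
    from scipy.optimize import minimize

    def neg_log_likelihood(theta, s, y, model):
        # -L = -sum_i ln p(y_i | m_i); minimizing this maximizes prod_i p(y_i | m_i)
        m = model(s, theta)
        return -np.sum(y * np.log(m) - m - gammaln(y + 1))

    def signal_plus_background(s, theta):
        # m_i = theta1 + theta2 * s_i  (flat background plus scaled signal template)
        return theta[0] + theta[1] * s

    # usage: result = minimize(neg_log_likelihood, x0=[1.0, 1.0],
    #                          args=(s, y, signal_plus_background))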

  15. Likelihood continued • To get the best-fit model m, we need to maximize the likelihood (or equivalently, the log likelihood). • If we want an objective function to minimize, it is convenient to choose –L. • Can show that for Gaussian data, minimizing –L is equivalent to minimizing the variance-weighted sum of squared residuals (=chi squared) given before. • Proof left as an exercise!
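
A sketch of that exercise, assuming independent Gaussian errors with known σi (in LaTeX notation):

    L = \sum_i \ln p(y_i \mid m_i)
      = -\tfrac{1}{2}\sum_i \frac{(y_i - m_i)^2}{\sigma_i^2}
        - \sum_i \ln\sqrt{2\pi\sigma_i^2}

The second term does not depend on the parameters, so minimizing −L is the same as minimizing χ² = Σi (yi − mi)² / σi².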

  16. Poissonian/likelihood version of the simple example from before. [Figure: the data yi; the model – red is si, green the flat background; a map of the joint likelihood L.]

  17. What if there are also errors in the xi? • Tricky… a Bayesian treatment is better in this case.

  18. What next? • In fitting a model, we want (amplifying a bit on lecture 4): • The best fit values of the parameters; • Then we want to know if these values are good enough! • If not: need to go back to the drawing board and choose a new model. • If the model passes, want uncertainties in the best-fit parameters. • (I’ll put this off to a later lecture…) • Number 1 is accomplished. √

  19. How to tell if our model is correct. • Suppose our model is absolutely accurate. • The U value we calculate is, nevertheless, a random variable: each fresh set of data will give rise to a slightly different value of U. • In other words, U, even in the case of a perfectly accurate model, will have some spread – in fact, like any other random variable, it will have a PDF. • This PDF is sometimes calculable from first principles (if not, one can do a Monte Carlo to estimate it).

  20. How to tell if our model is correct. • The procedure is: • First calculate the PDF for U in the ‘perfect fit’ case; • From this curve, obtain the value of the PDF at our best-fit value of U; • If p(U at the best fit) is very small, it is unlikely that our model is correct. • Note that both χ² and –L have the property that they cannot be negative. • A model which is a less-than-ideal match to the truth function will always generate U values with a PDF displaced to higher values of U.
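
When p(U) is not available analytically, a Monte Carlo sketch like the following can stand in for it (the names and the Gaussian-noise assumption are mine; for a strict test each simulated data set should be refitted before computing U, so that the fitted parameters are accounted for):

    import numpy as np

    def monte_carlo_pvalue(u_best, m_best, sigma, n_trials=10000, seed=0):
        # treat the best-fit model as the truth, generate fresh data sets,
        # and count how often the resulting U is >= the measured value
        rng = np.random.default_rng(seed)
        count = 0
        for _ in range(n_trials):
            y_sim = rng.normal(m_best, sigma)
            u_sim = np.sum(((y_sim - m_best) / sigma) ** 2)
            count += u_sim >= u_best
        return count / n_trials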

  21. Perfect vs. imperfect p(U). [Figure: the PDF of U for a perfect model; the PDF for an imperfect model is ALWAYS displaced to higher U.]

  22. Goodness of model continued • Because plausible Us are ≥ 0, and because an imperfect model always gives higher U, we prefer to • generate the survival function for the perfect model; • that tells us the probability of a perfect model giving us the measured value of U or higher. • This procedure is called hypothesis testing, • because we make the hypothesis: • “Suppose our model is correct. What sort of U value should we expect to find?” • We’ll encounter the technique again next lecture, when we turn to enquire whether there is any signal at all buried in the noise.

  23. Perfect-model p(U)s: • If we use the least-squares U (also known as χ²), this is easy, because p(U) is known for this case: p(U) = U^(ν/2 − 1) e^(−U/2) / (2^(ν/2) Γ(ν/2)), where • Γ is the gamma function • and ν is called the number of degrees of freedom. • Note: the PDF has a peak at U ~ ν.
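
This PDF is available directly in Python; a sketch (the numbers are made up):

    import numpy as np
    from scipy.stats import chi2

    nu = 18                         # e.g. N = 20 data points, M = 2 fitted parameters
    u = np.linspace(0.0, 60.0, 601)
    pdf = chi2.pdf(u, df=nu)        # p(U) for the perfect-model case
    print("peak near U =", u[np.argmax(pdf)])   # close to nu (strictly at nu - 2 for nu > 2)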

  24. What are degrees of freedom? • The easiest way to illustrate degrees of freedom is to try fitting a polynomial of higher and higher order to a set of noisy data. • The more orders we include, the more closely the model will fit the data, and the smaller the sum of squared residuals (χ²) will be, until… • when M = N (i.e. the number of parameters, polynomial orders in this case, equals the number of data points), the model goes through every point exactly and χ² equals 0.
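
A quick numerical sketch of that statement (the data here are invented):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0.0, 1.0, 10)                     # N = 10 data points
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.2, x.size)  # noisy straight-line 'truth'

    for M in range(1, 11):                            # M parameters = polynomial of degree M - 1
        coeffs = np.polyfit(x, y, deg=M - 1)
        resid = y - np.polyval(coeffs, x)
        print(f"M = {M:2d}   sum of squared residuals = {np.sum(resid ** 2):.3e}")
    # at M = N = 10 the polynomial passes through every point and the sum drops to ~0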

  25. Degrees of freedom • Defined as ν = N − M: the number of data points minus the number of fitted parameters. • It is sometimes convenient to define a reduced chi squared: χ²_reduced = χ²/ν. • The PDF for χ²_reduced should of course peak at about 1. • There is no advantage in using this for minimization rather than the ‘raw’ χ².

  26. ‘Survival function’ for U. • Remember that the survival function of a PDF is defined as S(U) = ∫_U^∞ p(U′) dU′. • For χ² this is S(U) = Γ(ν/2, U/2) / Γ(ν/2), • where Γ written with 2 arguments like this is called the (upper) incomplete gamma function: Γ(a, x) = ∫_x^∞ t^(a−1) e^(−t) dt.
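
Both forms are available in scipy; a sketch with made-up numbers, just to show that the two expressions agree:

    from scipy.stats import chi2
    from scipy.special import gammaincc

    nu, U = 18, 31.4
    p_tail = chi2.sf(U, df=nu)            # P(U' >= U | perfect model)
    p_gamma = gammaincc(nu / 2, U / 2)    # Gamma(nu/2, U/2) / Gamma(nu/2), the same quantity
    print(p_tail, p_gamma)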
