Lecture 7 Advanced Topics in Least Squares
the multivariate normal distribution for data, d:
p(d) = (2π)^(-N/2) |Cd|^(-1/2) exp{ -½ (d-d̄)^T Cd^(-1) (d-d̄) }
Let's assume that the expectation d̄ is given by a general linear model, d̄ = Gm, and that the covariance Cd is known (prior covariance).
Then we have a distribution p(d; m) with unknown parameters, m:
p(d; m) = (2π)^(-N/2) |Cd|^(-1/2) exp{ -½ (d-Gm)^T Cd^(-1) (d-Gm) }
We can now apply the principle of maximum likelihood to estimate the unknown parameters m.
Principle of Maximum Likelihood
p(d; m) = (2π)^(-N/2) |Cd|^(-1/2) exp{ -½ (d-Gm)^T Cd^(-1) (d-Gm) }
Last lecture we stated this principle as: maximize L(m) = Σi ln p(di; m) with respect to m. But in this distribution the whole data vector d is being treated as a single quantity, so the principle becomes simply: maximize L(m) = ln p(d; m).
L(m) = ln p(d; m) = -½N ln(2π) - ½ ln|Cd| - ½ (d-Gm)^T Cd^(-1) (d-Gm)
The first two terms do not contain m, so the principle of maximum likelihood is:
maximize -½ (d-Gm)^T Cd^(-1) (d-Gm), or equivalently
minimize (d-Gm)^T Cd^(-1) (d-Gm)
Minimize (d-Gm)^T Cd^(-1) (d-Gm)
Special case of uncorrelated data with equal variance: Cd = σd² I
Minimize σd^(-2) (d-Gm)^T (d-Gm) with respect to m,
which is the same as: minimize (d-Gm)^T (d-Gm) with respect to m.
This is the Principle of Least Squares.
This is the Principle of Least Squares: minimize E = e^T e = (d-Gm)^T (d-Gm) with respect to m. It follows from the Principle of Maximum Likelihood in the special case of a multivariate normal distribution, with the data uncorrelated and of equal variance.
Corollary: if your data are NOT NORMALLY DISTRIBUTED, then least squares is not the right method to use!
What if Cd = σd² I but σd is unknown? Note that |Cd| = σd^(2N), so
L(m, σd) = -½N ln(2π) - ½ ln|Cd| - ½ (d-Gm)^T Cd^(-1) (d-Gm)
         = -½N ln(2π) - N ln(σd) - ½ σd^(-2) (d-Gm)^T (d-Gm)
The first two terms do not contain m, so the principle of maximum likelihood still implies: minimize (d-Gm)^T (d-Gm) = e^T e = E.
Then ∂L/∂σd = 0 = -N σd^(-1) + σd^(-3) (d-Gm)^T (d-Gm), or, solving for σd,
σd² = N^(-1) (d-Gm)^T (d-Gm) = N^(-1) e^T e
So the Principle of Maximum Likelihood implies that
σd² = N^(-1) (d-Gm)^T (d-Gm) = N^(-1) e^T e
is a good posterior estimate of the variance of the data, when the data follow a multivariate normal distribution and are uncorrelated with uniform (but unknown) variance, σd².
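In MatLab, this variance estimate drops straight out of an ordinary least-squares fit. A minimal sketch (assuming G and d are already defined; variable names are illustrative):
m = (G'*G)\(G'*d);     % ordinary least-squares estimate of m
e = d - G*m;           % residuals
N = length(d);
sd2 = (e'*e)/N;        % maximum-likelihood (posterior) estimate of the data variance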
But back to the general case … What formula for m does the rule
minimize (d-Gm)^T Cd^(-1) (d-Gm)
imply?
Trick … Minimize (d-Gm)^T (d-Gm) implies m = [G^T G]^(-1) G^T d.
Now write: minimize (d-Gm)^T Cd^(-1) (d-Gm)
= (d-Gm)^T Cd^(-1/2) Cd^(-1/2) (d-Gm)
= (Cd^(-1/2)d - Cd^(-1/2)Gm)^T (Cd^(-1/2)d - Cd^(-1/2)Gm)
= (d'-G'm)^T (d'-G'm)
with d' = Cd^(-1/2) d and G' = Cd^(-1/2) G.
(Cd is symmetric, so its inverse and square root are symmetric, too.)
This is simple least squares, so m = [G'^T G']^(-1) G'^T d', or
m = [G^T Cd^(-1/2) Cd^(-1/2) G]^(-1) G^T Cd^(-1/2) Cd^(-1/2) d = [G^T Cd^(-1) G]^(-1) G^T Cd^(-1) d
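The same transform can be carried out numerically. A minimal MatLab sketch (assuming G, d, and a symmetric positive-definite Cd are already defined):
W  = inv(sqrtm(Cd));       % symmetric Cd^(-1/2); sqrtm is the matrix square root
dp = W*d;                  % d' = Cd^(-1/2) d
Gp = W*G;                  % G' = Cd^(-1/2) G
m  = (Gp'*Gp)\(Gp'*dp);    % ordinary least squares on the transformed problem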
Remember the formula Cm = M Cd M^T.
So, minimize (d-Gm)^T Cd^(-1) (d-Gm) implies
m = [G^T Cd^(-1) G]^(-1) G^T Cd^(-1) d
and
Cm = {[G^T Cd^(-1) G]^(-1) G^T Cd^(-1)} Cd {[G^T Cd^(-1) G]^(-1) G^T Cd^(-1)}^T
   = [G^T Cd^(-1) G]^(-1) G^T Cd^(-1) G [G^T Cd^(-1) G]^(-1)
   = [G^T Cd^(-1) G]^(-1)
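These two formulas translate directly into MatLab. A minimal sketch (assuming G, d, and Cd are already defined):
Cdi = inv(Cd);                 % Cd^(-1)
GTCdiG = G'*Cdi*G;
m  = GTCdiG \ (G'*Cdi*d);      % generalized least-squares estimate of m
Cm = inv(GTCdiG);              % posterior covariance of the estimate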
Example with Correlated Noise
[figures: one noise realization with uncorrelated noise, one with correlated noise]
[scatter plots] di vs. di+1: high correlation; di vs. di+2: some correlation; di vs. di+3: little correlation
data = straight line + correlated noise: d = a + bx + n
Model for Cd: [Cd]ij = σd² exp{ -c|i-j| } with c = 0.25 (exponential falloff from the main diagonal)
MatLab Code:
c = 0.25;
[XX, YY] = meshgrid( [1:N], [1:N] );
Cd = (sd^2)*exp(-c*abs(XX-YY));
Results (d = a + bx + n)
              Intercept        Slope
Correlated    10.96 ± 20.6     1.92 ± 0.35
Uncorrelated   8.42 ± 7.9      1.97 ± 0.14
True            1.0            2.0
Both fits are about the same … but note that the error estimates are larger (more realistic?) for the correlated case.
How to make correlated noise: correlated noise is a weighted average of neighboring uncorrelated noise values.
w = [0.1, 0.3, 0.7, 1.0, 0.7, 0.3, 0.1]'; w = w/sum(w);   % define weighting function
Nw = length(w); Nw2 = (Nw-1)/2;
N = 101; N2 = (N-1)/2;
n1 = random('Normal',0,1,N+Nw,1);                         % start with uncorrelated noise
n = zeros(N,1);
for i = [-Nw2:Nw2]
    n = n + w(i+Nw2+1)*n1(i+Nw-Nw2:i+Nw+N-1-Nw2);         % weighted average of neighbors
end
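The loop above is just a symmetric moving-average filter, so much the same correlated noise can be generated with a convolution. A rough sketch (assuming w, n1, Nw2, and N from above; the edge trimming below is one reasonable choice and may not match the loop sample-for-sample at the ends):
nc = conv(n1, w, 'same');      % filter the uncorrelated noise with the weights
n  = nc(Nw2+1 : Nw2+N);        % keep an interior window of length N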
Let's look at the transformations d' = Cd^(-1/2) d and G' = Cd^(-1/2) G in the special case of uncorrelated data with different variances, Cd = diag( σ1², σ2², …, σN² ):
di'  = σi^(-1) di      multiply each datum by the reciprocal of its error
Gij' = σi^(-1) Gij     multiply each row of the data kernel by the same amount
Then solve by ordinary least squares.
[ σ1^(-1)G11  σ1^(-1)G12  σ1^(-1)G13 … ]        [ σ1^(-1)d1 ]
[ σ2^(-1)G21  σ2^(-1)G22  σ2^(-1)G23 … ]        [ σ2^(-1)d2 ]
[ σ3^(-1)G31  σ3^(-1)G32  σ3^(-1)G33 … ]  m  =  [ σ3^(-1)d3 ]
[     …            …            …      ]        [     …     ]
[ σN^(-1)GN1  σN^(-1)GN2  σN^(-1)GN3 … ]        [ σN^(-1)dN ]
Rows have been weighted by a factor of σi^(-1).
So this special case is often called Weighted Least Squares. Note that the total error is
E = e^T Cd^(-1) e = Σi σi^(-2) ei²
Each individual error is weighted by the reciprocal of its variance, so errors involving data with SMALL variance get MORE weight.
Example: fitting a straight line to 100 data, where the first 50 have a different σd than the last 50.
MatLab Code (note that Cd^(-1) is explicitly defined as a diagonal matrix):
N = 101; N2 = (N-1)/2;
sd(1:N2-1) = 5; sd(N2:N) = 100;
sd2i = sd.^(-2);
Cdi = diag(sd2i);
G(:,1) = ones(N,1); G(:,2) = x;
GTCdiGI = inv(G'*Cdi*G);
m = GTCdiGI*G'*Cdi*d;
d2 = m(1) + m(2) .* x;
[figures: straight-line fits for three cases]
Equal variance: first 50, σd = 5; last 50, σd = 5
Left has smaller variance: first 50, σd = 5; last 50, σd = 100
Right has smaller variance: first 50, σd = 100; last 50, σd = 5
Comment 1: the case of fitting functions to a dataset
di = m1 f1(xi) + m2 f2(xi) + m3 f3(xi) + …
e.g. di = m1 sin(xi) + m2 cos(xi) + m3 sin(2xi) + …
[ f1(x1)  f2(x1)  f3(x1) … ]        [ d1 ]
[ f1(x2)  f2(x2)  f3(x2) … ]        [ d2 ]
[ f1(x3)  f2(x3)  f3(x3) … ]  m  =  [ d3 ]
[   …       …       …      ]        [ …  ]
[ f1(xN)  f2(xN)  f3(xN) … ]        [ dN ]
Note that the matrix G^T G has elements
[G^T G]ij = Σk fi(xk) fj(xk) = fi · fj
and thus is diagonal if the functions are orthogonal.
If the functions are normalized so that fi · fi = 1, then G^T G = I and the least-squares solution is
m = G^T d   and   Cm = σd² I
A super-simple formula! mi = fi · d, with guaranteed uncorrelated errors!
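A minimal MatLab sketch of this idea (the particular Fourier-type basis, N, and noise level are illustrative assumptions):
N = 100;
x = 2*pi*(0:N-1)'/N;                          % equally spaced points over one period
G = sqrt(2/N) * [sin(x), cos(x), sin(2*x)];   % columns are discretely orthonormal, so G'*G is (numerically) I
mtrue = [1; 2; 3];
d = G*mtrue + 0.01*random('Normal',0,1,N,1);  % synthetic data
m = G'*d;                                     % the super-simple least-squares solution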
Example: a straight line yi = a + bxi implies f1(x) = 1 and f2(x) = x, so the condition f1(x) · f2(x) = 0 implies Σi xi = 0, or x̄ = 0; this happens when the x's straddle the origin.
The choice f1(x) = 1 and f2(x) = x - x̄, i.e. y = a' + b'(x - x̄), leads to uncorrelated errors in (a', b').
[figure: the same straight line, with intercept a at x = 0 and intercept a' at x = x̄]
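A quick MatLab sketch of the effect (the particular x values and σd are illustrative assumptions):
N = 11; x = (0:10)'; sd = 1;
G1 = [ones(N,1), x];            % y = a + b x
G2 = [ones(N,1), x-mean(x)];    % y = a' + b'(x - xbar)
Cm1 = sd^2 * inv(G1'*G1)        % off-diagonal (covariance) terms are nonzero
Cm2 = sd^2 * inv(G2'*G2)        % off-diagonal terms are zero: uncorrelated errors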
Example: wavelet functions, a localized oscillation with a characteristic frequency.
[figure] G^T G is "almost" diagonal for these wavelet functions.
Comment 2: sometimes writing least squares as
[G^T G] m = G^T d   or   G^T [Gm] = G^T d
is more useful than m = [G^T G]^(-1) G^T d, since you can use some method other than a matrix inverse for solving the equation.
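For example, in MatLab the normal equations can be solved with the backslash operator, or iteratively for large problems. A minimal sketch (assuming G and d are already defined):
m = (G'*G) \ (G'*d);        % solve [G'G] m = G'd without forming an explicit inverse
% for large, sparse problems an iterative solver could be used instead, e.g.
% m = pcg(G'*G, G'*d);      % preconditioned conjugate gradients (G'G is symmetric positive definite)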