
Chapter 12: Smoothing Data and Least Squares. Motivation: solve Ax = b when A is not square.


Presentation Transcript


  1. Chapter 12: Smoothing Data and Least Squares. Motivation: solve Ax = b when A is not square.

  2. Example: fit y = ax + b to m data points. Find the unknowns a and b that minimize the sum of squared residuals.

  3. If the uncertainty in the data is known, weight the fit so it is best where the uncertainty is least.

  4. Data fitting formulated as an over-determined linear system: given a set of data {(tk, yk), k = 1,...,m} and a set of functions {fj(t), j = 1,...,n}, find the linear combination of the fj(t) functions that best represents the data. Let akj = fj(tk) (jth function evaluated at the kth data point). Let b = [y1, y2,...,ym]T (column vector of the measured values). Let x = [x1, x2,...,xn]T (column vector of unknown parameters). Then solve Ax = b to determine the combination of fj(t) functions that goes through the data points. If n = m, the problem formulated in this way usually has an exact solution.
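  A minimal MATLAB sketch of assembling A this way (the basis {1, t, t2} and the sample points are illustrative assumptions, not part of the slide):

     % Build A(k,j) = f_j(t_k) for an assumed basis {1, t, t^2}
     t = linspace(0, 10, 21)';                     % m = 21 sample points t_k
     f = {@(t) ones(size(t)), @(t) t, @(t) t.^2};  % n = 3 basis functions f_j(t)
     A = zeros(numel(t), numel(f));
     for j = 1:numel(f)
         A(:,j) = f{j}(t);                         % j-th function at every t_k
     end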

  5. This solution is usually not what we are looking for. Usually we seek a model of the data as "noise" superimposed on a smooth variation described by a linear combination of fj(t) functions with n << m. Under the condition n << m, Ax = b usually does not have an exact solution because the system is over-determined. To obtain this "dimensional reduction", find an approximate solution that minimizes a norm of the residual vector r = b − Ax. Choosing the Euclidean norm for minimization gives linear least squares. (In the slide's example, the model is a parabola.)

  6. Let y = Ax (like b, y is an m-vector). Then y ∈ span(A) (y is a linear combination of the columns of A). For a given data set, b is a constant vector and ||b − y||2 = f(y). f(y) is continuous and strictly convex, so f(y) is guaranteed to have a minimum. The vector y ∈ span(A) that is closest to b is unique; however, this unique vector may not correspond to a unique set of unknown parameters x. If Ax1 = Ax2 = y and z = x2 − x1 ≠ o, then Az = o, so the columns of A must be linearly dependent. This condition is called rank deficiency. Unless otherwise stated, assume A has full rank.

  7. Normal equations: let r = b − Ax and define f(x) = ||r||22 = rTr. f(x) = (b − Ax)T(b − Ax) = bTb − 2xTATb + xTATAx. A necessary condition for x0 to be a minimum of f(x) is ∇f(x0) = o, where ∇f is an n-vector with components equal to the partial derivatives of f(x) with respect to the unknown parameters. ∇f(x) = 2ATAx − 2ATb = o ⇒ the optimal set of parameters is a solution of the n×n symmetric system ATAx = ATb, called the "normal equations" of the linear least squares problem.
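  A hedged MATLAB sketch of this solve (assuming A and b have been formed as in slide 4):

     x   = (A'*A) \ (A'*b);   % solve the n-by-n normal equations A'Ax = A'b
     r   = b - A*x;           % residual vector at the optimum
     ssr = r'*r;              % minimized sum of squared residuals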

  8. This is the same set of equations as obtained by the objective-function method.

  9. Fit a polynomial of degree p-1 to m data points (xk, yk)

  10. Matrix formulation of weighted linear least squares. Let Δy be a column vector of the uncertainties in the measured values. w = 1./Δy is a column vector of the square roots of the weights. W = diag(w) is a diagonal matrix. V = coefficient matrix of the un-weighted least-squares problem (a Vandermonde matrix in the case of polynomial fitting). y = column vector of observations. A = W V = weighted coefficient matrix. b = W y = weighted column vector of measured values. Ax = b is the over-determined linear system for weighted linear least squares, with normal equations ATAx = (WV)TWVx = ATb = (WV)TWy (note that w = 1./Δy gets squared at this point).
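  A short MATLAB sketch of this weighting (dy, V, and y are assumed to be the uncertainty vector, coefficient matrix, and observation vector defined above):

     w = 1 ./ dy;             % square roots of the weights
     W = diag(w);             % diagonal weighting matrix
     A = W*V;                 % weighted coefficient matrix
     b = W*y;                 % weighted observations
     x = (A'*A) \ (A'*b);     % weighted normal equations (w is squared here)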

  11. Weighted parabola fit

  12. Weighted quadratic fit to the data on text p. 495. The uncertainties were chosen to show how a small uncertainty draws the fit toward the data point at t = 95.

  13. Part 1 of Assignment 11, due 4/22/14. Use the normal equations to fit a parabola to the data set t = linspace(0,10,21), y = [2.9, 2.7, 4.8, 5.3, 7.1, 7.6, 7.7, 7.6, 9.4, 9, 9.6, 10, 10.2, 9.7, 8.3, 8.4, 9, 8.3, 6.6, 6.7, 4.1] with weights equal to the reciprocal of the square of the uncertainty in y, which is 10% of the value of y. Plot the data with error bars and the fit on the same set of axes. Show the optimum values of the parameters and calculate the sum of squared deviations between fit and data.
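  One possible MATLAB sketch of Part 1 (an illustration, not the official solution):

     t  = linspace(0,10,21)';
     y  = [2.9 2.7 4.8 5.3 7.1 7.6 7.7 7.6 9.4 9 9.6 10 ...
           10.2 9.7 8.3 8.4 9 8.3 6.6 6.7 4.1]';
     dy = 0.1*y;                      % uncertainty = 10% of y
     W  = diag(1./dy);                % square roots of the weights
     V  = [ones(size(t)) t t.^2];     % parabola basis
     A  = W*V;  b = W*y;
     x  = (A'*A) \ (A'*b)             % optimum parameters via normal equations
     fit = V*x;
     errorbar(t, y, dy, 'o'); hold on; plot(t, fit); hold off
     ssd = sum((y - fit).^2)          % sum of squared deviations, fit vs. data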

  14. Linear models with different types of data. Estimate the heights of 3 hills by combining 2 types of measurements: heights relative to a fixed reference and heights relative to each other. Heights of the 3 hills relative to a fixed reference: hill 1 = 1237 m, hill 2 = 1941 m, hill 3 = 2417 m. Relative heights of the hills: hill 2 relative to 1 = 711 m, hill 3 relative to 1 = 1177 m, hill 3 relative to 2 = 475 m. Construct a linear model of the data with the heights of the hills relative to the fixed reference as the parameters.

  15. Ax = b with A = [1 0 0; 0 1 0; 0 0 1; -1 1 0; -1 0 1; 0 -1 1], x = [x1; x2; x3], and b = [1237; 1941; 2417; 711; 1177; 475]. Construct the normal equations ATAx = ATb: ATA = [3 -1 -1; -1 3 -1; -1 -1 3] and ATb = [-651; 2177; 4069]. Solve by Cholesky factorization. Note that the measurements of relative hill heights affect the estimates of the heights relative to the fixed reference.
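  A MATLAB sketch of the Cholesky solve, using A and b as reconstructed above:

     A = [1 0 0; 0 1 0; 0 0 1; -1 1 0; -1 0 1; 0 -1 1];
     b = [1237; 1941; 2417; 711; 1177; 475];
     G = chol(A'*A);          % Cholesky: A'A = G'G with G upper triangular
     x = G \ (G' \ (A'*b))    % two triangular solves give x = [1236; 1943; 2416]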

  16. Orthogonal transformations: an alternative to the normal equations. Using the normal equations ATAx = ATb to solve an over-determined linear system Ax = b can have the following numerical problems: (1) Loss of information in the construction of ATA and ATb. For example, with A = [1 1; e 0; 0 e] and 0 < e < emach½, ATA = [1+e2 1; 1 1+e2] evaluates to [1 1; 1 1] in floating point arithmetic, which is singular.

  17. Orthogonal transformations: an alternative to the normal equations. (2) The "condition-squaring effect": cond(ATA) ≈ [cond(A)]2, so if cond(A) is large, the normal equations will be very ill-conditioned. To avoid these problems, we need a transformation that converts the m×n matrix A into an n×n upper triangular matrix so that the linear least squares problem can be solved by backward substitution. Gauss elimination cannot be used to develop this transformation because it does not preserve the Euclidean norm.

  18. QR factorization. A real square matrix Q is "orthogonal" if QTQ = I. Then ||Qv||22 = (Qv)T(Qv) = vTQTQv = vTv = ||v||22, so Qv is called an orthogonal transformation of v; orthogonal transformations preserve the Euclidean norm. QR factorization in MATLAB: if A is an m×n matrix with m > n, then [Q,R] = qr(A) returns Q, an m×m orthogonal matrix, and the m×n matrix [R; O], where R is an n×n upper triangular matrix and O is the (m−n)×n block of zeros below R.

  19. Solution of the linear least squares problem by QR factorization: given a QR factorization of A, find an approximate solution of the over-determined system Ax = b. Norm of residuals: ||r||22 = ||b − Ax||22 = ||b − Q[R; O]x||22. Since orthogonal transformations preserve the Euclidean norm, ||r||22 = ||QTr||22 = ||QTb − [R; O]x||22, a norm of a difference of m-vectors. Write QTb as [c1; c2], where c1 is an n-vector and c2 is an (m−n)-vector. Then ||r||22 = ||[c1 − Rx; c2]||22 = ||c1 − Rx||22 + ||c2||22. The minimum of ||r||22 is obtained by solving Rx = c1, an n×n upper triangular system, for the parameters of the best fit. The minimum sum of squared residuals is ||c2||22.
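  A MATLAB sketch of this procedure (A and b assumed given):

     [Q, R] = qr(A);          % full QR; Q is m-by-m, R is m-by-n
     n   = size(A, 2);
     c   = Q'*b;              % c = [c1; c2]
     c1  = c(1:n);  c2 = c(n+1:end);
     x   = R(1:n,:) \ c1;     % back substitution on the n-by-n triangle
     ssr = norm(c2)^2;        % minimum sum of squared residuals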

  20. Reduced QR factorization. Q = [Q1 Q2] is an m×m matrix, where Q1 contains the first n columns of Q and Q2 contains the remaining m − n columns. A = Q[R; O] = [Q1 Q2][R; O] = Q1R. A = Q1R is the reduced (economy-size) QR factorization of A, obtained in MATLAB by [Q,R] = qr(A,0). Given A = Q1R, we can solve Ax = b by back substitution because Q1TAx = Q1TQ1Rx = Rx = Q1Tb = c1. The reduced QR factorization does not provide the c2 needed to calculate the minimum sum of squared residuals.

  21. Part 2 of Assignment 11, due 4/22/14. Use QR factorization to fit a parabola to the data set t = linspace(0,10,21), y = [2.9, 2.7, 4.8, 5.3, 7.1, 7.6, 7.7, 7.6, 9.4, 9, 9.6, 10, 10.2, 9.7, 8.3, 8.4, 9, 8.3, 6.6, 6.7, 4.1] (the same data as Part 1, but without weights). Plot the data and the fit on the same set of axes. Show the optimum values of the parameters. Calculate the sum of squared deviations between fit and data directly from the QR factorization.

  22. Orthogonalization methods. Like Gauss elimination, QR factorization introduces zeros into A to produce a triangular form, but it uses orthogonal transformations rather than elementary elimination matrices so that the Euclidean norm is preserved. Three commonly used procedures are (1) Householder transformations (elementary reflectors), (2) Givens transformations (plane rotations), and (3) Gram-Schmidt orthogonalization.

  23. Householder transformations. H = I − 2vvT/(vTv), where v is a non-zero vector chosen so that HA introduces zeros below the diagonal of some column of the matrix A. H is both orthogonal and symmetric (i.e. H = HT = H−1). Given an m-vector a, how do we find v such that Ha = [α, 0, ..., 0]T = αe1? αe1 = Ha = (I − 2vvT/(vTv))a = a − 2v(vTa)/(vTv) ⇒ v = (vTv)/(2(vTa)) (a − αe1) = c(a − αe1). There is no loss of generality in taking c = 1, so v = a − αe1. Ha must have the same 2-norm as a, so α = ±||a||2 and v1 = a1 − α. To avoid loss of significance, choose α = −sign(a1)||a||2 (i.e. if a1 is positive then α is negative).

  24. Example: an m-vector a with ||a||2 = 3 and a1 positive, so α = −3. v = a − αe1 = a + 3e1. vTa = 15, vTv = 30, so 2(vTa)/(vTv) = 1 and Ha = a − 2v(vTa)/(vTv) = a − v = −3e1, the expected result with ||a||2 = 3 and a1 positive. Note: all the work was done with the vectors a and v; there is no need to construct H. Given ||a||2 and sign(a1) we can construct v and Ha.
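  A minimal MATLAB sketch of this computation; the vector a = [2; 1; 2] is an assumption chosen to be consistent with the slide's numbers (||a||2 = 3, vTa = 15, vTv = 30):

     a     = [2; 1; 2];                % assumed example vector with 2-norm 3
     alpha = -sign(a(1))*norm(a);      % alpha = -3 since a(1) > 0
     v     = a;  v(1) = a(1) - alpha;  % v = a - alpha*e1 = [5; 1; 2]
     Ha    = a - 2*v*(v'*a)/(v'*v)     % Ha = [-3; 0; 0]; H is never formed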

  25. A similar approach works for annihilating other components of an m-vector a. Partition a = [a1; a2], where a1 is a (k−1)-vector and 1 < k < m. Take v = [o; a2] − αek, where α = −sign(ak)||a2||2. Then Ha = a − 2v(vTa)/(vTv) is an m-vector that leaves a1 unchanged, has zeros below the kth component, and has the same 2-norm as a. This method can be applied successively to the columns of an m×n matrix A. Again, given ||a2||2 and sign(ak) we can construct v and Ha.

  26. QR factorization applied to the over-determined surveyor's problem Ax = b, with A and b as constructed in slide 15. Begin with the first column a1 of A: its first element is positive, so choose α = −||a1||2.

  27. Apply H1 to 2nd and 3rd columns of A and to b

  28. Transform column 2 of H1A

  29. Transform column 3 of H2H1A, then solve the upper triangular system Rx = c1 by back substitution: x = [1236, 1943, 2416]T. The sum of squared residuals is ||c2||22 = 35.

  30. Rank deficiency. Let Ax = b be an m×n least-squares problem with m > n. In general b ∉ span(A) because the system is over-determined. Consider all y ∈ span(A), y = Ax; r(y) = b − y are the residuals associated with approximate solutions of Ax = b. The ymin for which ||r(ymin)||2 is smallest is unique, but the solution of Ax = ymin may not be unique: if x1 and x2 exist such that Ax1 = Ax2 = ymin, then z = x2 − x1 ≠ o and Az = o, so the columns of A must be linearly dependent. This condition is called rank deficiency.

  31. Consequences of rank deficiency for the linear least squares problem: ATA will be singular and the normal-equation method will fail. QR factorization is possible, but R will be singular (a zero on the diagonal), so solution by back substitution after QR factorization is not possible. In general, this means that the linear least squares problem as formulated does not have a unique solution. The best approach is to reformulate the problem in terms of parameters for which a unique solution does exist.

  32. Clever surveyors get better statistics. Experimental design: estimate the heights of 3 hills by combining 2 types of measurements, heights relative to a fixed reference and heights relative to each other. Heights of the 3 hills relative to a fixed reference: hill 1 = 1237 m, hill 2 = 1941 m, hill 3 = 2417 m. Relative heights of the hills: hill 2 relative to 1 = 711 m, hill 3 relative to 1 = 1177 m, hill 3 relative to 2 = 475 m. Construct a linear model of the data with the heights of the hills relative to the fixed reference as the parameters. All 3 parameters can be determined.

  33. Example of a rank-deficient linear least squares problem. Incompetent surveyors lose the data on the heights of the hills above the fixed reference. With only the data on the relative heights of the hills, the problem of finding heights above the fixed reference becomes Ax = b with A = [-1 1 0; -1 0 1; 0 -1 1], x = [x1; x2; x3], and b = [711; 1177; 475]. Note that this design matrix is consistent with the relative measurements: hill 2 relative to 1 = 711 m, hill 3 relative to 1 = 1177 m, hill 3 relative to 2 = 475 m. A is singular (all row sums are zero), so a unique solution does not exist; the parameters are not all "identifiable". Obviously, the heights of hills relative to a fixed reference cannot be estimated from their relative heights alone.

  34. Rank-deficient linear least squares problem Ax = b (the same A and b as in slide 33). Reformulate the problem with the height of hill 3 as the reference point: the measurements x1 − x3 = −1177 and x2 − x3 = −475 give the heights of hills 1 and 2 relative to hill 3. Toss the measurement of hill 2 relative to hill 1 and write the report? No! This measurement contains useful information, and QR factorization allows it to be used in the parameter estimation.

  35. QR factorization of the rank-deficient linear system Ax = b. The problem is reduced to one where the number of parameters equals rank(A) = 2. The optimal values for the heights of hills 1 and 2 relative to hill 3 (determined by R1z = c1) are z1 = −1180 and z2 = −472, which are slightly different from the values of −1177 and −475 obtained when the data on the height of hill 2 relative to hill 1 is ignored.

  36. Singular Value Decomposition. A = USVT is the singular value decomposition of the m×n matrix A. U is an m×m orthogonal matrix, V is an n×n orthogonal matrix, and S is an m×n diagonal matrix whose diagonal elements si ≥ 0 are called the "singular values" of A. The columns of U are called the "left singular vectors" of A; the columns of V are called the "right singular vectors" of A. The SVD in MATLAB is [U, S, V] = svd(A).

  37. Example: the singular values of A are s1 = 25.5, s2 = 1.29, and s3 = 0. The rank of A = the number of non-zero singular values = 2 in this case. In floating-point arithmetic, small computed singular values may correspond to zeros in exact calculations. The usual procedure is to sort the singular values in decreasing order and regard singular values below some threshold as zero; this defines the "numerical rank" of a matrix.
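  In MATLAB the numerical rank might be computed along these lines (the threshold shown is one common choice, not the only one):

     s   = svd(A);                     % singular values in decreasing order
     tol = max(size(A))*eps(s(1));     % threshold below which s_i is treated as 0
     numrank = sum(s > tol)            % numerical rank of A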

  38. Solution of linear systems by SVD. Ax = b is a linear system with A rectangular or square and possibly rank deficient. S is the diagonal matrix with diagonal elements equal to the singular values of A. S+ is the diagonal matrix with diagonal elements 1/si if si ≠ 0 and 0 if si = 0; S+S acts as the identity on the components associated with non-zero singular values. Ax = USVTx = b ⇒ x = VS+UTb. The rows of UT and the columns of V in the product VS+UTb are restricted by the zeros in S+ to those associated with non-zero singular values, so x = Σ vi(uiTb/si), where ui and vi are the left and right singular vectors, respectively, associated with the non-zero si.
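  A MATLAB sketch of this SVD solution (the tolerance is an assumed choice):

     [U, S, V] = svd(A);
     s   = diag(S);
     tol = max(size(A))*eps(max(s));   % treat tiny singular values as zero
     x   = zeros(size(A,2), 1);
     for i = find(s > tol)'            % sum only over non-zero singular values
         x = x + V(:,i)*(U(:,i)'*b)/s(i);
     end
     % equivalently: x = pinv(A, tol)*b   (Moore-Penrose pseudo-inverse)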

  39. Solution of Ax = b with b = [1 2 3 4]T using exact singular values. Note that the system has 3 unknowns but only 2 non-zero singular values: u1Tb/s1 = 0.2148 and u2Tb/s2 = 0.2155, so x = 0.2148 v1 + 0.2155 v2.

  40. Solution of linear systems when the singular values are not exact. Same linear system as before: display MATLAB's SVD, specify a threshold for non-zero singular values, solve the linear system with this threshold, and solve the linear system using the M-P (Moore-Penrose) pseudo-inverse.

  41. Solution of linear systems when the singular values are not exact. (The slide displays MATLAB's computed U, S, and V; the third singular value would be zero in exact calculations.) With the chosen threshold on non-zero singular values, all 3 singular values exceed the threshold, and the solution differs from the exact result because the threshold is too small. The solution using the M-P pseudo-inverse is the same as the exact result.

  42. Part 3 of Assignment 11, due 4/29/14. Use singular value decomposition to fit a parabola to the data set on page 495 of Cheney & Kincaid, 6th edition (surface tension as a function of temperature): T = 0, 10, 20, 30, 40, 80, 90, 95; S = 68.0, 67.1, 66.4, 65.6, 64.6, 61.8, 61.0, 60.0. Show the optimum values of the parameters and the minimum sum of squared deviations of the fit from the data points, and plot the fit and data (without error bars) on the same set of axes. Use the Moore-Penrose pseudo-inverse to solve for the unknown parameters.

  43. Modify the weighted parabola fit of the surface tension data using the normal equations as indicated on the slide: there are no dy's in Part 3, so there are no weights. Solve Vx = y by singular value decomposition for the optimum choice of parameters, using the Moore-Penrose pseudo-inverse. Use the usual approach to the fit and chisq; with no error bars, use plot.

  44. Lower-rank approximation based on the singular value decomposition: A = USVT = s1E1 + s2E2 + ... + snEn, where Ek = ukvkT is the outer product of the kth left and right singular vectors. A lower-rank approximation to A neglects the terms in the sum with singular values below a specified threshold, producing an approximation to A with rank k < n.
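  A short MATLAB sketch of the rank-k truncation (k = 2 is an arbitrary illustration):

     [U, S, V] = svd(A);
     k  = 2;                                  % number of singular-value terms kept
     Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';      % A_k = s1*E1 + ... + sk*Ek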

  45. The resulting lower-rank approximation is indistinguishable from A to 3 significant figures.

  46. Suggested problems from the text: Chapter 12.1 (pp. 502-505), problems 1, 3, 4, 5, 13, 16, 18, 19, 20, 21, and computer problem 4; Chapter 12.3 (pp. 527-531), problems 8 and 9, and computer problem 1.
