
Application of Matrix Differential Calculus for Optimization in Statistical Algorithm

By Dr. Md. Nurul Haque Mollah, Professor, Dept. of Statistics, University of Rajshahi, Bangladesh

Outline

Introduction to the Optimization Problems in Statistics

Test of Convexity of Objective function for Optimization

Nonlinear Optimization for Regression Analysis

Nonlinear Optimization for Maximum Likelihood Estimation

Nonlinear Optimization for PCA

Nonlinear Objective Function for Information Theory

1. Introduction to the Optimization Problems in Statistics

The main problem in parametric statistics is to find the unconditional or conditional optimizer of a nonlinear objective function.

Two classical nonlinear conditional (constrained) optimization problems are the equality constrained problem

(ECP) optimize f(x)

subject to h(x) = 0

and the inequality constrained problem

(ICP) optimize f(x)

subject to g(x) ≤ 0 (or ≥ 0)

where f : R^n → R, h : R^n → R^m, g : R^n → R^r are given functions.


1. Introduction to the Optimization Problems in Statistics (Cont.)

If the objective function is a profit function, then the optimizers are the maximizers of the objective function.

If the objective function is a loss function, then the optimizers are the minimizers of the objective function.

For example, the likelihood function of a statistical model can be considered a profit function, while error or distance functions, such as the sum of squared errors (SSE) in regression problems or the K-L (or any other) divergence between two distributions, can be considered loss functions.

Matrix derivatives play the key role in detecting the optimizers of the objective function.

1.1 Unconstrained Optimization

  • An overview of analytical and computational methods for the solution of the unconstrained problem (UP)

    optimize f(x)

    where f : R^n → R is a given function and x ∈ R^n.

  • A vector x* is a local minimum for UP if there exists an ε > 0 such that

    f(x*) ≤ f(x), for all x ∈ S(x*; ε)

  • It is a strict local minimum if there exists an ε > 0 such that

    f(x*) < f(x), for all x ∈ S(x*; ε), x ≠ x*

1.1 Unconstrained Optimization (Cont.)

Note: S(x*; ε) is an open sphere centered at x* ∈ R^n with radius ε > 0. That is,

S(x*; ε) = {x : |x − x*| < ε}, where x ∈ R^n.

A vector x* is a local maximum for UP if there exists an ε > 0 such that

f(x) ≤ f(x*), for all x ∈ S(x*; ε)

It is a strict local maximum if there exists an ε > 0 such that

f(x) < f(x*), for all x ∈ S(x*; ε), x ≠ x*

1.1 Unconstrained Optimization (Cont.)

We have the following well-known optimality conditions.

If f : R^n → R is differentiable (i.e., f ∈ C^1 on S(x*; ε) for some ε > 0) and x* is a (local) minimum, then

Df(x*) = 0,

i.e. the Jacobian (gradient) of f vanishes at x*.

If f is twice differentiable (i.e., f ∈ C^2 on S(x*; ε) for some ε > 0) and x* is a (local) minimum, then the Hessian D²f(x*) is positive semidefinite; if, in addition, D²f(x*) is positive definite, then x* is a strict local minimum and f is convex in a neighbourhood of x*.

If x* is a (local) maximizer of f(x), then Df(x*) = 0 and the Hessian D²f(x*) is negative semidefinite; if it is negative definite, then x* is a strict local maximum and f is concave in a neighbourhood of x*.
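As a small added illustration (not part of the original slides), these conditions can be checked numerically with finite differences. The objective f and the candidate point x* below are arbitrary choices made for the example.

```python
import numpy as np

def f(x):
    # Assumed test objective: f(x) = (x1 - 1)^2 + 2*(x2 + 0.5)^2
    return (x[0] - 1.0)**2 + 2.0 * (x[1] + 0.5)**2

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient Df(x)."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def numerical_hessian(f, x, h=1e-4):
    """Finite-difference approximation of the Hessian D^2 f(x)."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

x_star = np.array([1.0, -0.5])              # candidate optimizer
print("gradient at x*:", numerical_gradient(f, x_star))          # ~0: first-order condition
print("Hessian eigenvalues:", np.linalg.eigvalsh(numerical_hessian(f, x_star)))  # all > 0: minimum
```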

1.2 Constrained Optimization

We consider first the following equality constrained optimization problem

(ECP) minimize f(x)

subject to h(x)=0,

where f : R^n → R and h : R^n → R^m are given functions and m ≤ n. The components of h are denoted h1, h2, …, hm.

Definition:

Let x* be a vector such that h(x*) = 0 and, for some ε > 0, h ∈ C^1 on S(x*; ε). We say that x* is a regular point if the gradients Dh1(x*), …, Dhm(x*) are linearly independent.

Consider the Lagrangian function L : R^(n+m) → R defined by

L(x, λ) = f(x) + λ′h(x).

1.2 Constrained Optimization (Cont.)

Let x* be a local minimum for (ECP), and assume that, for some ε > 0, f ∈ C^1, h ∈ C^1 on S(x*; ε), and x* is a regular point. Then there exists a unique vector λ* ∈ R^m such that

D_x L(x*, λ*) = Df(x*) + λ*′Dh(x*) = 0.

If, in addition, f ∈ C^2 and h ∈ C^2 on S(x*; ε), then

y′ D²_xx L(x*, λ*) y ≥ 0 for all y with Dh(x*) y = 0.

1.2 Constrained Optimization (Cont.)

We next consider the following inequality constrained optimization problem

(ICP) minimize f(x)

subject to h(x)=0, g(x)≤0

where f : R^n → R, h : R^n → R^m and g : R^n → R^r are given functions and m ≤ n.

Definition:

Let x* be a vector such that h(x*) = 0, g(x*) ≤ 0 and, for some ε > 0, h ∈ C^1 and g ∈ C^1 on S(x*; ε). We say that x* is a regular point if the gradients Dh1(x*), …, Dhm(x*) and Dgj(x*), j ∈ A(x*), are linearly independent, where A(x*) = {j : gj(x*) = 0} denotes the set of active inequality constraints at x*.

Consider the Lagrangian function L : R^(n+m+r) → R defined by

L(x, λ, µ) = f(x) + λ′h(x) + µ′g(x).

1.2 Constrained Optimization (Cont.)

Let x* be a local minimum for (ICP), and assume that, for some ε > 0, f ∈ C^1, h ∈ C^1, g ∈ C^1 on S(x*; ε), and x* is a regular point. Then there exist unique vectors λ* ∈ R^m, µ* ∈ R^r such that

D_x L(x*, λ*, µ*) = Df(x*) + λ*′Dh(x*) + µ*′Dg(x*) = 0,  µ* ≥ 0,  µj* = 0 for all j ∉ A(x*).

If, in addition, f ∈ C^2, h ∈ C^2 and g ∈ C^2 on S(x*; ε), then

y′ D²_xx L(x*, λ*, µ*) y ≥ 0 for all y with Dh(x*) y = 0 and Dgj(x*) y = 0, j ∈ A(x*).
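As an added numerical illustration of the (ECP)/(ICP) structure (not from the slides), scipy.optimize.minimize with the SLSQP method accepts h(x) = 0 as an equality constraint and g(x) ≤ 0 as an inequality constraint (SciPy's convention is fun(x) ≥ 0, so the sign is flipped). The objective and constraints below are arbitrary examples.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed example problem:
#   minimize f(x) = x1^2 + x2^2
#   subject to h(x) = x1 + x2 - 1 = 0  and  g(x) = x1 - 0.8 <= 0
f = lambda x: x[0]**2 + x[1]**2
h = lambda x: x[0] + x[1] - 1.0          # equality constraint h(x) = 0
g = lambda x: x[0] - 0.8                 # inequality constraint g(x) <= 0

constraints = [
    {"type": "eq",   "fun": h},
    {"type": "ineq", "fun": lambda x: -g(x)},   # SciPy expects fun(x) >= 0
]

res = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print("x* =", res.x)                     # approximately (0.5, 0.5)
print("Df(x*) =", 2 * res.x)             # proportional to Dh(x*) = (1, 1), as the Lagrange condition requires
```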

2. Test of Convexity of the Objective Function for Optimization

For an optimization problem, convexity (or concavity) of the objective function f(x) is what guarantees that a stationary point is a global minimizer (or maximizer).

The Hessian matrix H = D²f(x) can be used to test the convexity of the objective function.

If all principal minors of H are non-negative (respectively, if the principal minors of order k all have sign (−1)^k or vanish), then H is positive (negative) semidefinite and f(x) is convex (concave).

If all leading principal minors of H are positive (respectively, alternate in sign starting with a negative first-order minor), then H is positive (negative) definite and f(x) is strictly convex (strictly concave).

Note: H will always be a square symmetric matrix.
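These sign conditions can be checked mechanically. The sketch below is an addition for illustration (the example Hessian is arbitrary); it computes the leading principal minors and, as an equivalent check, the eigenvalues of H.

```python
import numpy as np

def leading_principal_minors(H):
    """Determinants of the upper-left k x k submatrices, k = 1..n."""
    return [np.linalg.det(H[:k, :k]) for k in range(1, H.shape[0] + 1)]

# Assumed example Hessian of a quadratic objective
H = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])

minors = leading_principal_minors(H)
eigvals = np.linalg.eigvalsh(H)
print("leading principal minors:", minors)   # all > 0 here
print("eigenvalues:", eigvals)               # all > 0  =>  H positive definite
if np.all(eigvals > 0):
    print("f is strictly convex (H positive definite)")
elif np.all(eigvals < 0):
    print("f is strictly concave (H negative definite)")
```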

2. Test of Convexity of the Objective Function for Optimization (Cont.)

Example:

Since the principal minors of the Hessian satisfy the sign conditions for negative semidefiniteness (they are not all positive), f(x) is concave.

3. Nonlinear Optimization for Regression Analysis

Unconditional Optimization

Let us consider the linear regression model Y = Xβ + E,

where,

Y: Vector of responses

X: Design matrix (matrix of explanatory variables)

β: Vector of regression parameters

E: Vector of error terms

The objective is to find the OLSE of β by minimizing the error sum of squares SSE(β) = (Y − Xβ)′(Y − Xβ) with respect to β.

3. Nonlinear Optimization for Regression Analysis (Cont.)

Solution: Setting the derivative D SSE(β) = −2X′(Y − Xβ) equal to zero gives the normal equations X′Xβ = X′Y, so the OLSE is

β̂ = (X′X)⁻¹X′Y,

which is also known as the minimizer of SSE(β) = (Y − Xβ)′(Y − Xβ).

Second-order condition: the Hessian matrix should be positive definite. That is, D²SSE(β) = 2X′X must be positive definite, which holds whenever X has full column rank.

This is an example of unconditional (unconstrained) optimization.
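A short numerical sketch of this unconditional optimization (added here for illustration; the data are simulated): the OLSE is obtained from the normal equations and the Hessian 2X′X is checked for positive definiteness.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# OLSE from the normal equations X'X beta = X'Y
# (np.linalg.solve is preferred to forming an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

hessian = 2 * X.T @ X                       # Hessian of SSE(beta)
print("beta_hat:", beta_hat)
print("Hessian eigenvalues:", np.linalg.eigvalsh(hessian))  # all > 0 => minimum
```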

3. Nonlinear Optimization for Regression Analysis (Cont.)

Conditional Optimization

Recall the previous linear regression model Y = Xβ + E.

The objective is to find the OLSE of β by minimizing the error sum of squares with respect to β subject to an equality restriction on β (of the linear form Rβ = r, say).

3. Nonlinear Optimization for Regression Analysis (Cont.)

Solution: Since the restriction is of the equality form, we can use the method of Lagrange's undetermined multipliers and minimize the Lagrangian function (the error sum of squares plus λ′ times the restriction) with respect to β and λ. The restricted OLSE is then obtained by solving the resulting stationarity equations.

Sufficient condition: the Hessian of the Lagrangian with respect to β, here 2X′X, should be a positive definite matrix.
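Assuming the restriction has the linear form Rβ = r (the slide leaves the exact form unspecified), the stationarity conditions of the Lagrangian give a linear system in (β, λ) that can be solved directly. The sketch below is an illustration under that assumption, with simulated data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Assumed linear restriction R beta = r, e.g. beta_2 + beta_3 = 1
R = np.array([[0.0, 1.0, 1.0]])
r = np.array([1.0])
q = R.shape[0]

# Stationarity of L(beta, lambda) = (Y - X beta)'(Y - X beta) + lambda'(R beta - r)
# gives the block linear system below.
A = np.block([[2 * X.T @ X, R.T],
              [R,           np.zeros((q, q))]])
b = np.concatenate([2 * X.T @ Y, r])
sol = np.linalg.solve(A, b)
beta_restricted, lam = sol[:p], sol[p:]
print("restricted OLSE:", beta_restricted)
print("R @ beta:", R @ beta_restricted)     # satisfies the restriction
```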

4. Nonlinear Optimization for Maximum Likelihood Estimation (MLE)

  • The likelihood function for the parameter vector θ = (θ1, θ2, …, θd) is formed from the observed data X as L(θ; X).

    The problem is to estimate the vector θ by maximizing the likelihood function.

    Ans: The normal (score) equations, obtained by setting ∂ log L(θ)/∂θ equal to zero, produce the MLE of θ.

4. Nonlinear Optimization for MLE (Cont.)

Under regularity conditions, the expected (Fisher) information matrix is given by

I(θ) = −E[∂² log L(θ)/∂θ ∂θ′].

The standard error for the estimate of θi is given by the square root of the i-th diagonal element of the inverse information matrix, I(θ̂)⁻¹.

4. Nonlinear Optimization for MLE (Cont.)

If the normal equations take an implicit form, then the Newton-Raphson method or another iterative procedure is necessary for solving them. In this case the Newton-Raphson iteration is updated as follows:

θ(k+1) = θ(k) − [D² log L(θ(k))]⁻¹ D log L(θ(k)),

for k = 0, 1, 2, 3, …, until convergence. The stopping rule is, for example, |θ(k+1) − θ(k)| < ε (or a correspondingly small change in the log-likelihood) for a small tolerance ε.
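A minimal Newton-Raphson sketch (an added illustration, not from the slides), applied to the log-likelihood of an assumed exponential sample where the score and Hessian are easy to write down; the formulas in the code are for that example only.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=500)   # simulated data, true rate = 0.5

# Log-likelihood of an Exponential(rate) sample: l(t) = n*log(t) - t*sum(x)
n, s = len(x), x.sum()
score   = lambda t: n / t - s              # D log L(t)
hessian = lambda t: -n / t**2              # D^2 log L(t)

theta = 0.3                                # crude starting value (Newton-Raphson needs a reasonable start)
for k in range(100):
    theta_new = theta - score(theta) / hessian(theta)   # Newton-Raphson update
    if abs(theta_new - theta) < 1e-10:                  # stopping rule |theta(k+1) - theta(k)| < eps
        theta = theta_new
        break
    theta = theta_new

fisher_info = -hessian(theta)              # observed information at the MLE
print("MLE of rate:", theta, "(closed form:", n / s, ")")
print("standard error:", np.sqrt(1.0 / fisher_info))
```

The last two lines also illustrate the previous slide: the standard error comes from the inverse of the information evaluated at the MLE.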

4.1 Nonlinear Optimization for MLE using the EM Algorithm

In statistics, an expectation-maximization (EM) algorithm is a method for finding MLE or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.

EM is an iterative method which alternates between performing an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step.

These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

4.1 Nonlinear Optimization for MLE using the EM Algorithm (Cont.)

Given a statistical model consisting of a set x of observed data, a set of unobserved latent data or missing values z, and a vector of unknown parameters θ, along with a likelihood function L(θ; x, z) = p(x, z | θ),

the MLE of the unknown parameters θ is determined by the marginal likelihood of the observed data, L(θ; x) = p(x | θ) = Σ_z p(x, z | θ) (or the corresponding integral).

However, this quantity is often intractable. The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:

4.1 Nonlinear Optimization for MLE using the EM Algorithm (Cont.)

The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:

Expectation step (E-step): Calculate the expected value of the log-likelihood function with respect to the conditional distribution of Z given X under the current estimate of the parameters θ(t):

Q(θ | θ(t)) = E_{Z | X, θ(t)}[log L(θ; X, Z)].

Maximization step (M-step): Find the parameter that maximizes this quantity:

θ(t+1) = arg max_θ Q(θ | θ(t)).

Note that in typical models to which EM is applied, the latent variables Z are discrete (for example, component labels, as in the mixture example below), with one latent value per observed data point.
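The two steps can be written as a generic loop. The sketch below is schematic and added for illustration: e_step and m_step are hypothetical placeholders to be supplied by the specific model (the Gaussian mixture on the following slides is one such model).

```python
def em_algorithm(x, theta0, e_step, m_step, max_iter=200, tol=1e-8):
    """Generic EM loop (schematic).

    e_step(x, theta) -> expected sufficient statistics / responsibilities,
                        which implicitly define Q(theta | theta_t)
    m_step(x, stats) -> theta maximizing the expected complete log-likelihood
    Both functions are model-specific placeholders supplied by the user.
    """
    theta = theta0
    for _ in range(max_iter):
        stats = e_step(x, theta)        # E-step: condition on the current theta(t)
        theta_new = m_step(x, stats)    # M-step: maximize Q(theta | theta(t))
        if all(abs(a - b) < tol for a, b in zip(theta_new, theta)):
            return theta_new
        theta = theta_new
    return theta
```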

4.1 Nonlinear Optimization for MLE using the EM Algorithm (Cont.)

Example: Gaussian mixture

Let x = (x1, x2, …, xn) be a sample of n independent observations from a mixture of two multivariate normal distributions of dimension d, and let z = (z1, z2, …, zn) be the latent variables that determine the component from which each observation originates:

Xi | (Zi = 1) ~ N_d(μ1, Σ1) and Xi | (Zi = 2) ~ N_d(μ2, Σ2), where P(Zi = 1) = τ1 and P(Zi = 2) = τ2 = 1 − τ1, for i = 1, 2, …, n.

4.1 Nonlinear Optimization for MLE using the EM Algorithm (Cont.)

where each component density is the corresponding multivariate normal density.

The aim is to estimate the unknown parameters θ = (τ, μ1, μ2, Σ1, Σ2), representing the "mixing" proportion between the Gaussians and the means and covariances of each,

where the likelihood function is:

4.1 Nonlinear Optimization for MLE using the EM Algorithm (Cont.)

This may be rewritten in exponential family form as

E-step

Given our current estimate of the parameters θ(t), the conditional distribution of the Zi is determined by Bayes' theorem to be the proportional height of the normal density weighted by the current mixing proportion τ, as follows:

Thus, the E-step results in the function:

4.1 Nonlinear Optimization for MLE using the EM Algorithm (Cont.)

M-step

Maximize Q(θ | θ(t)) with respect to θ, which contains (τ, μ1, μ2, Σ1, Σ2).

Note that τ, (μ1, Σ1) and (μ2, Σ2) may all be maximized independently of each other, since they all appear in separate linear terms. First, consider τ, which has the constraint τ1 + τ2 = 1.

This has the same form as the MLE for the binomial distribution, so

4.1 Nonlinear Optimization for MLE using the EM Algorithm (Cont.)

For the next estimates of (μ1,Σ1):

This has the same form as a weighted MLE for a normal distribution, so

4.1 Nonlinear Optimization for MLE using the EM Algorithm (Cont.)

and similarly for the covariance update of Σ1; by symmetry, the updates for μ2 and Σ2 take the same form, using the weights of the second component.
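Putting the E- and M-steps together, a compact numerical sketch for the two-component case is shown below (added for illustration; it uses a univariate mixture, d = 1, for brevity, with simulated data and scipy.stats.norm).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
# Simulated two-component univariate mixture (d = 1 for brevity)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

# Initial guesses for theta = (tau, mu1, mu2, sigma1, sigma2)
tau, mu1, mu2, s1, s2 = 0.5, -1.0, 1.0, 1.0, 1.0

for _ in range(200):
    # E-step: responsibilities T1, T2 via Bayes' theorem
    p1 = tau * norm.pdf(x, mu1, s1)
    p2 = (1 - tau) * norm.pdf(x, mu2, s2)
    T1 = p1 / (p1 + p2)
    T2 = 1.0 - T1

    # M-step: binomial-type update for tau, weighted normal MLEs
    # for the means and variances
    tau = T1.mean()
    mu1, mu2 = np.sum(T1 * x) / T1.sum(), np.sum(T2 * x) / T2.sum()
    s1 = np.sqrt(np.sum(T1 * (x - mu1) ** 2) / T1.sum())
    s2 = np.sqrt(np.sum(T2 * (x - mu2) ** 2) / T2.sum())

print("tau:", tau)
print("means:", mu1, mu2, " sds:", s1, s2)
```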

4.1 Nonlinear Optimization for MLE using the EM Algorithm (Cont.)

Louis (1982) showed that the observed information I_obs is the difference of the complete information I_oc and the missing information I_om. That is,

I_obs(θ*) = I_oc(θ*) − I_om(θ*),

where I_oc is the expected complete-data information, I_om is the information contained in the missing data, and θ* denotes the MLE of θ estimated using EM.

5. Nonlinear Optimization for PCA

Let Cov(X_{p×1}) = Σ, of order p × p.

Let β be a p-component column vector such that β′β = 1.

Then the variance of β′X is

Var(β′X) = β′Σβ.  (1)

To determine the normalized linear combination β′X with maximum variance, we must find a vector β satisfying β′β = 1 which maximizes (1). Let

φ = β′Σβ − λ(β′β − 1),  (2)

where λ is a Lagrange multiplier. The vector of partial derivatives is

∂φ/∂β = 2Σβ − 2λβ.  (3)

5. Nonlinear Optimization for PCA (Cont.)

Since β′Σβ and β′β have derivatives everywhere in a region containing β′β = 1, a vector β maximizing β′Σβ must satisfy the expression (3) set equal to 0; that is,

(Σ − λI)β = 0.  (4)

In order to get a solution of (4) with β′β = 1 we must have Σ − λI singular; in other words, λ must satisfy

|Σ − λI| = 0.  (5)

The function |Σ − λI| is a polynomial in λ of degree p. Therefore (5) has p roots; let these be λ1 ≥ λ2 ≥ … ≥ λp. If we multiply (4) on the left by β′, we obtain

β′Σβ = λβ′β = λ.  (6)

5. Nonlinear Optimization for PCA (Cont.)

This shows that if β satisfies (4) (and β′β = 1), then the variance of β′X [given by (1)] is λ. Thus for the maximum variance we should use in (4) the largest root λ1. Let β(1) be a normalized solution of (Σ − λ1I)β = 0. Then U1 = β(1)′X is a normalized linear combination with maximum variance. [If Σ − λ1I is of rank p − 1, then there is only one solution to (Σ − λ1I)β = 0 with β′β = 1.]

Now let us find a normalized combination β′X that has maximum variance among all linear combinations uncorrelated with U1. Lack of correlation means

0 = Cov(β′X, U1) = β′Σβ(1) = λ1β′β(1),  (7)

since Σβ(1) = λ1β(1). Thus β′X is orthogonal to U1 in both the statistical sense (lack of correlation) and the geometric sense (the inner product of the vectors β and β(1) is zero). (That is, λ1β′β(1) = 0 only if β′β(1) = 0 when λ1 ≠ 0, and λ1 ≠ 0 if Σ ≠ 0; the case Σ = 0 is trivial and is not treated.)

5. Nonlinear Optimization for PCA (Cont.)

We now want to maximize

φ2 = β′Σβ − λ(β′β − 1) − 2ν1 β′Σβ(1),  (8)

where λ and ν1 are Lagrange multipliers. The vector of partial derivatives is

∂φ2/∂β = 2Σβ − 2λβ − 2ν1Σβ(1),  (9)

and we set this equal to 0. From (9), multiplying on the left by β(1)′, we obtain

2β(1)′Σβ − 2λβ(1)′β − 2ν1β(1)′Σβ(1) = −2ν1λ1 = 0  (10)

by (7). Therefore ν1 = 0 and β must satisfy (4), and therefore λ must satisfy (5). Let λ(2) be the maximum of λ1, λ2, …, λp such that there is a vector β satisfying (Σ − λ(2)I)β = 0, β′β = 1 and (7); call this vector β(2) and the corresponding linear combination U2 = β(2)′X. (It will be shown eventually that λ(2) = λ2. We define λ(1) = λ1.)

5. Nonlinear Optimization for PCA (Cont.)

This procedure is continued; at the (r+1)-st step, we want to find a vector β such that β′X has maximum variance among all normalized linear combinations which are uncorrelated with U1, U2, …, Ur, that is, such that

0 = Cov(β′X, Ui) = β′Σβ(i) = λ(i)β′β(i),  i = 1, 2, …, r.  (11)

We want to maximize

φ(r+1) = β′Σβ − λ(β′β − 1) − 2 Σ_{i=1}^{r} νi β′Σβ(i),  (12)

where λ and ν1, ν2, …, νr are Lagrange multipliers. The vector of partial derivatives is

∂φ(r+1)/∂β = 2Σβ − 2λβ − 2 Σ_{i=1}^{r} νi Σβ(i),  (13)

and we set this equal to 0. Multiplying (13) on the left by β(j)′, we obtain

−2νj λ(j) = 0.  (14)

If λ(j) ≠ 0, this gives νj = 0. If λ(j) = 0, then Σβ(j) = λ(j)β(j) = 0 and the j-th term in the sum in (13) vanishes. Thus β must satisfy (4), and therefore λ must satisfy (5). Hence we can compute all principal (and minor) components sequentially by nonlinear optimization using Lagrange multipliers.
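In practice the sequential Lagrangian argument above amounts to an eigen-decomposition of Σ: the roots λ1 ≥ … ≥ λp of (5) are the eigenvalues and β(1), …, β(p) the corresponding eigenvectors. A short numerical sketch (added for illustration, with simulated data) follows.

```python
import numpy as np

rng = np.random.default_rng(4)
# Simulated data with correlated components
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.3]])
X = rng.normal(size=(500, 3)) @ A.T

Sigma = np.cov(X, rowvar=False)                 # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)        # roots of |Sigma - lambda I| = 0
order = np.argsort(eigvals)[::-1]               # sort lambda_1 >= ... >= lambda_p
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

beta1 = eigvecs[:, 0]                           # beta(1): first principal direction
U1 = X @ beta1                                  # first principal component scores
print("lambda:", eigvals)
print("Var(U1) ~ lambda_1:", U1.var(ddof=1), eigvals[0])
print("beta1' beta1 = 1:", beta1 @ beta1)
```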

6. Nonlinear Objective Function for Information Theory

  • Differential Entropy:

  • The differential entropy H of a random vector y with density p(y) is defined as

    H(y) = −∫ p(y) log p(y) dy.

  • A normalized version of entropy is given by the negentropy J, which is defined as follows:

    J(y) = H(y_gauss) − H(y),

    where y_gauss is a Gaussian random vector with the same covariance matrix as y. J(y) ≥ 0, and equality holds only for Gaussian random vectors.
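As a rough numerical illustration of the definition J(y) = H(y_gauss) − H(y) (an addition, not from the slides; it uses a simple histogram entropy estimate, which is only a crude approximation), one can compare a Gaussian and a clearly non-Gaussian sample.

```python
import numpy as np

def entropy_hist(y, bins=60):
    """Crude histogram estimate of the differential entropy H(y)."""
    p, edges = np.histogram(y, bins=bins, density=True)
    w = np.diff(edges)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]) * w[mask])

def negentropy(y):
    """J(y) = H(y_gauss) - H(y), with H of a Gaussian of the same variance."""
    h_gauss = 0.5 * np.log(2 * np.pi * np.e * y.var())
    return h_gauss - entropy_hist(y)

rng = np.random.default_rng(5)
gauss   = rng.normal(size=100_000)
laplace = rng.laplace(size=100_000)     # super-Gaussian, so J should be > 0
print("negentropy (Gaussian), approx 0:", negentropy(gauss))
print("negentropy (Laplace),  > 0     :", negentropy(laplace))
```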

6. Nonlinear Objective Function for Information Theory (Cont.)

Mutual Information:

The mutual information between m random variables yi, i = 1, 2, …, m, is defined as follows:

I(y1, y2, …, ym) = Σi H(yi) − H(y), where y = (y1, …, ym)′.

It is non-negative, and equality (I = 0) holds if the yi are independent.

For a linear transform y = Bx, the mutual information can be written in terms of B plus a constant term that does not depend on B.

6. Nonlinear Objective Function for Information Theory (Cont.)

  • In signal processing, blind source signals (BSS) generally follow non-Gaussian distributions and are independent of each other.

  • So maximization of negentropy, or equivalently minimization of mutual information, are both popular techniques for recovering blind source signals in image processing and audio signal processing.

6.1 An Unconstrained Optimization for Information Theory

Gradient Descent:

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent.

In gradient descent, we minimize a function J(w) iteratively by starting from some initial point w(0), computing the gradient of J(w) at this point, and then moving in the direction of the negative gradient (the direction of steepest descent) by a suitable distance. Once there, we repeat the same procedure at the new point, and so on.

6.1 An Unconstrained Optimization for Information Theory (Cont.)

Gradient Descent (Cont.):

Proceeding as described above, for t = 1, 2, … we have the update rule

w(t) = w(t−1) − α ∂J(w)/∂w |_{w = w(t−1)}.  (4.1)

We can then write the rule (4.1) either as

∆w = −α ∂J(w)/∂w,

or even shorter as

∆w ∝ −∂J(w)/∂w.

6.1 An Unconstrained Optimization for Information Theory (Cont.)

The symbol ∝ is read "is proportional to"; it is then understood that the vector on the left-hand side, ∆w, has the same dimension as the gradient vector on the right-hand side, but there is a positive scalar coefficient by which the length can be adjusted.

In the upper version of the update rule, this coefficient is denoted by α. In many cases, this learning rate can and should in fact be time dependent. Yet a third, very convenient way to write such update rules, in conformity with programming languages, is

w ← w − α ∂J(w)/∂w,

where the symbol ← means substitution, i.e., the value of the right-hand side is computed and substituted in w.
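The substitution rule w ← w − α ∂J(w)/∂w translates directly into code. The sketch below is an added illustration with an arbitrary quadratic cost J; the exact minimizer is computed for comparison so the slow, linear convergence near w* discussed next can be observed.

```python
import numpy as np

# Assumed example cost: J(w) = 0.5 * w' A w - b' w, with gradient A w - b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_J = lambda w: A @ w - b

alpha = 0.1                      # learning rate (could be made time dependent)
w = np.zeros(2)                  # starting point w(0)
for t in range(1, 5001):
    w = w - alpha * grad_J(w)    # w <- w - alpha * dJ/dw
    if np.linalg.norm(grad_J(w)) < 1e-10:
        break

w_star = np.linalg.solve(A, b)   # exact minimizer for comparison
print("iterations:", t, " w:", w, " w*:", w_star)
```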

6.1 An Unconstrained Optimization for Information Theory (Cont.)

Generally, the speed of convergence can be quite low close to the minimum point, because the gradient approaches zero there. The speed can be analyzed as follows. Let us denote by w* the local or global minimum point where the algorithm will eventually converge. From (4.1) we have

Expanding the gradient vector ∂J(w)/∂w element by element as a Taylor series around the point w*, using only the zeroth- and first-order terms, we have for the i-th element

6.1 An Unconstrained Optimization for Information Theory (Cont.)

Because w* is the point of convergence, the partial derivatives of the cost function must be zero at w*. Using this result and compiling the above expansion into vector form yields

∂J(w)/∂w ≈ H(w*)(w − w*),

where H(w*) is the Hessian matrix computed at the point w = w*. Substituting this in the update rule (4.1) gives

w(t) − w* ≈ [I − α H(w*)](w(t−1) − w*).

This kind of convergence, which is essentially equivalent to multiplying a matrix many times by itself, is called linear. The speed of convergence depends on the learning rate and the eigenvalues (the size) of the Hessian matrix.