# Computacion Inteligente

## Computacion Inteligente

Derivative-Based Optimization

### Contents

• Optimization problems

• Mathematical background

• Descent Methods

• The Method of Steepest Descent

• Conjugate Gradient

### Optimization Problems

• Objective function – mathematical function which is optimized by changing the values of the design variables.

• Design Variables – Those variables which we, as designers, can change.

• Constraints – Functions of the design variables which establish limits in individual variables or combinations of design variables.

3 basic ingredients…

• an objective function,

• a set of decision variables,

• a set of equality/inequality constraints.

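The problem is to search for the values of the decision variables that minimize the objective function while satisfying the constraints:

$$\min_{x} f(x) \quad \text{subject to} \quad g(x) \le 0, \quad h(x) = 0, \quad x_L \le x \le x_U$$

Here f is the objective, x is the decision vector, $x_L \le x \le x_U$ are the bounds, and g and h are the constraints.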

• Design Variables: decision and objective vector

• Constraints: equality and inequality

• Bounds: feasible ranges for variables

• Objective Function: maximization can be converted to minimization by the duality principle, since max f(x) = −min(−f(x))
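As a small illustration of that conversion, the sketch below maximizes a made-up objective by minimizing its negative. The function `f`, the interval [0, 4], and the grid search are assumptions chosen for the example; any minimizer could take the place of the grid search.

```python
def f(x):
    # Hypothetical objective with a single maximum at x = 2.
    return -(x - 2.0) ** 2 + 3.0


def neg_f(x):
    # Minimizing -f is the same problem as maximizing f.
    return -f(x)


# Coarse grid over [0, 4]; it stands in for any minimization routine.
grid = [i / 100.0 for i in range(401)]

x_min_of_neg = min(grid, key=neg_f)   # minimizer of -f
x_max_of_f = max(grid, key=f)         # maximizer of f, found directly
max_value = -neg_f(x_min_of_neg)      # max f = -min(-f)

print(x_min_of_neg, x_max_of_f, max_value)
```

Both searches land on the same point, and the maximum value is recovered by flipping the sign of the minimum of −f.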

• Identify the quantity or function, f, to be optimized.

• Identify the design variables: x1, x2, x3, …,xn.

• Identify the constraints if any exist

a. Equalities

b. Inequalities

• Adjust the design variables (x’s) until f is optimized and all of the constraints are satisfied.

• Objective functions may be unimodal or multimodal.

• Unimodal – only one optimum

• Multimodal – more than one optimum

• Most search schemes are based on the assumption of a unimodal surface. The optimum determined in such cases is called a local optimum design.

• The global optimum is the best of all local optimum designs.
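A minimal sketch of the difference between local and global optima, assuming a made-up multimodal function on [−2, 2] and a simple grid scan (neither comes from the slides). Every grid point that is lower than both neighbours is recorded as a local minimum, and the global minimum is the best of those.

```python
def f(x):
    # Hypothetical multimodal objective with two local minima on [-2, 2].
    return x ** 4 - 3.0 * x ** 2 + x


xs = [-2.0 + i * 0.001 for i in range(4001)]
ys = [f(x) for x in xs]

# An interior grid point is a local minimum if it is lower than both neighbours.
local_minima = [
    (xs[i], ys[i])
    for i in range(1, len(xs) - 1)
    if ys[i] < ys[i - 1] and ys[i] < ys[i + 1]
]

# The global optimum is the best of all local optima.
global_x, global_y = min(local_minima, key=lambda point: point[1])

print(local_minima)
print(global_x, global_y)
```

A search scheme that assumes a unimodal surface would stop at whichever of the two minima is nearest to its starting point.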

• Existence of global minimum

• If f(x) is continuous on the feasible set S which is closed and bounded, then f(x) has a global minimum in S

• A set S is closed if it contains all its boundary points.

• A set S is bounded if it is contained in the interior of some circle of finite radius

compact = closed and bounded

[Figure: surface plotted over the x1 and x2 axes, with a local maximum marked]

• Derivative-based optimization

• Capable of determining “search directions” according to an objective function’s derivative information

• steepest descent method;

• Newton’s method; Newton-Raphson method;

• Derivative-free optimization

• random search method;

• genetic algorithm;

• simulated annealing; etc.

### Mathematical Background

The scalar $x^T M x = \sum_{i=1}^{n} \sum_{j=1}^{n} m_{ij} x_i x_j$ is called a quadratic form.

• A square matrix M is positive definite if $x^T M x > 0$ for all x ≠ 0.

• It is positive semidefinite if $x^T M x \ge 0$ for all x.

• A symmetric matrix $M = M^T$ is positive definite if and only if its eigenvalues λi > 0. (semidefinite ↔ λi ≥ 0)

• Proof (→): Let $v_i$ be the eigenvector for the i-th eigenvalue $\lambda_i$.

• Then, $0 < v_i^T M v_i = \lambda_i v_i^T v_i = \lambda_i \|v_i\|^2$,

• which implies $\lambda_i > 0$.

• Exercise (←): prove that positive eigenvalues imply positive definiteness.

• Theorem: If a matrix $M = U^T U$, where $Ux \ne 0$ for every $x \ne 0$, then it is positive definite.

• Proof. Let f be defined as $f = x^T M x$.

• If we can show that f is always positive, then M must be positive definite. We can write this as $f = x^T U^T U x = (Ux)^T (Ux)$.

• Provided that Ux always gives a nonzero vector for all values of x except when x = 0, we can write b = Ux, i.e. $f = b^T b = \sum_i b_i^2$,

• so f must always be positive.
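One practical reading of this result is a test for positive definiteness: try to build an upper-triangular U with M = UᵀU (a Cholesky factorization) and see whether it succeeds. The sketch below is a plain-Python illustration under that assumption; the helper name `cholesky_upper` and the two sample matrices are hypothetical, and the input is assumed to be symmetric.

```python
import math


def cholesky_upper(M):
    """Return upper-triangular U with M = U^T U, or None if no such
    factor with a positive diagonal exists (M is not positive definite)."""
    n = len(M)
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Remove the contribution of the rows of U already computed.
            s = M[i][j] - sum(U[k][i] * U[k][j] for k in range(i))
            if i == j:
                if s <= 0.0:
                    return None  # a non-positive pivot rules out positive definiteness
                U[i][i] = math.sqrt(s)
            else:
                U[i][j] = s / U[i][i]
    return U


positive_definite = [[4.0, 2.0], [2.0, 3.0]]   # assumed sample matrix
indefinite = [[1.0, 2.0], [2.0, 1.0]]          # assumed sample matrix

U = cholesky_upper(positive_definite)
print(U)
print(cholesky_upper(indefinite))
```

For the first matrix the factor has a positive diagonal, so Ux = 0 only when x = 0, which is the proviso used in the proof.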

• f: Rn → R is a quadratic function if $f(x) = \tfrac{1}{2} x^T Q x - b^T x + c$,

• where Q is symmetric.

• Assuming that Q is symmetric loses no generality.

• Suppose the matrix P is non-symmetric. Then $x^T P x = x^T Q x$ with $Q = \tfrac{1}{2}(P + P^T)$, and Q is symmetric.
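A short numerical check of the symmetrization Q = (P + Pᵀ)/2, assuming a made-up non-symmetric matrix P and test vector x (the example on the original slide is not recoverable, so these values are illustrative only).

```python
def quad_form(M, x):
    # Computes x^T M x for a square matrix stored as a list of rows.
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))


def symmetrize(P):
    # Q = (P + P^T) / 2
    n = len(P)
    return [[0.5 * (P[i][j] + P[j][i]) for j in range(n)] for i in range(n)]


P = [[1.0, 4.0], [0.0, 3.0]]   # assumed non-symmetric sample matrix
Q = symmetrize(P)              # [[1.0, 2.0], [2.0, 3.0]]
x = [1.5, -2.0]                # assumed test vector

print(Q)
print(quad_form(P, x), quad_form(Q, x))
```

The two quadratic forms agree for every x, because the antisymmetric part of P contributes nothing to xᵀPx.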

If Q is positive definite, then f is a parabolic “bowl.”

• Two other shapes can result from the quadratic form.

• If Q is negative definite, then f is a parabolic “bowl” turned upside down.

• If Q is indefinite then f describes a saddle.
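The three shapes can be told apart from the signs of the eigenvalues of Q. The sketch below does this for the 2×2 symmetric case using the closed-form eigenvalues; the function names and sample matrices are assumptions made for the illustration.

```python
import math


def eigenvalues_sym2(Q):
    # Closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]].
    (a, b), (_, d) = Q
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return mean - radius, mean + radius


def classify(Q):
    low, high = eigenvalues_sym2(Q)
    if low > 0:
        return "bowl"               # positive definite
    if high < 0:
        return "upside-down bowl"   # negative definite
    if low < 0 < high:
        return "saddle"             # indefinite
    return "degenerate"             # at least one zero eigenvalue


print(classify([[2.0, 0.0], [0.0, 1.0]]))
print(classify([[-2.0, 0.0], [0.0, -1.0]]))
print(classify([[1.0, 2.0], [2.0, 1.0]]))
```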

• Quadratics are useful in the study of optimization.

• Often, objective functions are “close to” quadratic near the solution.

• It is easier to analyze the behavior of algorithms when applied to quadratics.

• Analysis of algorithms for quadratics gives insight into their behavior in general.

• The derivative of f: R → R is a function f ′: R → R given by $f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$

• if the limit exists.

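• Along the axes: the partial derivatives $\frac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \frac{f(x + h e_i) - f(x)}{h}$, where $e_i$ is the i-th coordinate unit vector.

• In a general direction v with $\|v\| = 1$: the directional derivative $\frac{\partial f}{\partial v}(x) = \lim_{h \to 0} \frac{f(x + h v) - f(x)}{h}$.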

• Definition: A real-valued function f: Rn → R is said to be continuously differentiable if the partial derivatives $\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}$

• exist for each x in Rn and are continuous functions of x.

• In this case, we say $f \in C^1$ (a smooth function, C1).

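• Definition: The gradient of f: R2 → R, in the plane, is a function ∇f: R2 → R2 given by $\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)^T$.

• Definition: The gradient of f: Rn → R is a function ∇f: Rn → Rn given by $\nabla f(x_1, \dots, x_n) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)^T$.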

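• The gradient defines the (hyper)plane that approximates the function infinitesimally: $\Delta z = \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y$.

• By the chain rule, for $\|v\| = 1$: $\frac{\partial f}{\partial v}(p) = \langle \nabla f(p), v \rangle$.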

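• Proposition 1: $\frac{\partial f}{\partial v}(p)$ is maximal when choosing $v = \frac{\nabla f(p)}{\|\nabla f(p)\|}$ and minimal when choosing $v = -\frac{\nabla f(p)}{\|\nabla f(p)\|}$.

Intuitive: the gradient points in the direction of greatest change.

• Proof:

• Assign: $v = \frac{\nabla f(p)}{\|\nabla f(p)\|}$

• By the chain rule: $\frac{\partial f}{\partial v}(p) = \left\langle \nabla f(p), \frac{\nabla f(p)}{\|\nabla f(p)\|} \right\rangle = \|\nabla f(p)\|$

• On the other hand, for a general unit vector v, the Cauchy-Schwarz inequality gives $\frac{\partial f}{\partial v}(p) = \langle \nabla f(p), v \rangle \le \|\nabla f(p)\| \, \|v\| = \|\nabla f(p)\|$.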
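The proposition can be checked numerically. The sketch below assumes a made-up function f(x, y) = x² + 3y², the point p = (1, 1), and central finite differences; it compares the directional derivative along the normalized gradient with the directional derivatives along 36 other unit vectors.

```python
import math


def f(x, y):
    # Hypothetical smooth test function.
    return x * x + 3.0 * y * y


def grad(func, x, y, h=1e-6):
    # Central finite-difference approximation of the gradient.
    gx = (func(x + h, y) - func(x - h, y)) / (2.0 * h)
    gy = (func(x, y + h) - func(x, y - h)) / (2.0 * h)
    return gx, gy


def directional_derivative(func, x, y, v, h=1e-6):
    # Central finite-difference derivative of func along the unit vector v.
    forward = func(x + h * v[0], y + h * v[1])
    backward = func(x - h * v[0], y - h * v[1])
    return (forward - backward) / (2.0 * h)


p = (1.0, 1.0)
g = grad(f, *p)
g_norm = math.hypot(*g)
v_star = (g[0] / g_norm, g[1] / g_norm)   # normalized gradient direction

best = directional_derivative(f, *p, v_star)
others = [
    directional_derivative(f, *p, (math.cos(t), math.sin(t)))
    for t in (k * math.pi / 18.0 for k in range(36))
]

print(best, g_norm, max(others))
```

The derivative along the gradient direction matches the gradient norm, and no sampled direction exceeds it.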

• Proposition 2: let f: Rn → R be a smooth function C1 around p,

• if f has a local minimum (maximum) at p, then $\nabla f(p) = 0$.

Intuitive: this condition is necessary for a local min (max).

• Proof: intuitive

• We found the best INFINITESIMAL DIRECTION at each point,

• Looking for minimum: “blind man” procedure

• How can we derive the way to the minimum using this knowledge?

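• The derivative of f: Rn → Rm is a function Df: Rn → Rm×n given by $Df(x) = \left[ \frac{\partial f_i}{\partial x_j}(x) \right]_{i = 1, \dots, m; \; j = 1, \dots, n}$,

called the Jacobian.

Note that for f: Rn → R, we have $\nabla f(x) = Df(x)^T$.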

• If the derivative of ∇f exists, we say that f is twice differentiable.

• Write the second derivative as D2f (or F), and call it the Hessian of f.
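For reference, the Hessian can be written out entry by entry. This is the standard definition for f: Rn → R, with $x_1, \dots, x_n$ naming the coordinates of x; the matrix is symmetric when f is in C2.

```latex
F(x) = D^2 f(x) =
\begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}
```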

• The level set of a function f: Rn→ R at level c is the set of points S = {x: f(x) = c}.

• Fact: ∇f(x0) is orthogonal to the level set at x0

• Proof of fact:

• Imagine a particle traveling along the level set.

• Let g(t) be the position of the particle at time t, with g(0) = x0.

• Note that f(g(t)) = constant for all t.

• Velocity vector g′(t) is tangent to the level set.

• Consider F(t) = f(g(t)). We have F′(0) = 0. By the chain rule, $F'(0) = \nabla f(g(0))^T g'(0) = \nabla f(x_0)^T g'(0) = 0$.

• Hence, ∇f(x0) and g′(0) are orthogonal.
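A quick check of this fact, assuming the made-up function f(x, y) = x² + 4y² and its level set f = 4, which is the ellipse traced by g(t) = (2 cos t, sin t). The particle's velocity and the gradient are evaluated at several times t and their inner product is computed.

```python
import math


def grad_f(x, y):
    # Gradient of the assumed function f(x, y) = x^2 + 4 y^2.
    return 2.0 * x, 8.0 * y


def position(t):
    # Particle moving along the level set f = 4 (an ellipse).
    return 2.0 * math.cos(t), math.sin(t)


def velocity(t):
    # Derivative of position(t); tangent to the level set.
    return -2.0 * math.sin(t), math.cos(t)


inner_products = []
for k in range(12):
    t = k * math.pi / 6.0
    gx, gy = grad_f(*position(t))
    vx, vy = velocity(t)
    inner_products.append(gx * vx + gy * vy)

print(inner_products)
```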

• Suppose f: R → R is in C1. Then, $f(x_0 + h) = f(x_0) + f'(x_0) h + o(h)$.

• o(h) is a term such that o(h)/h → 0 as h → 0.

• At x0, f can be approximated by a linear function, and the approximation gets better the closer we are to x0.

• Suppose f: R → R is in C2. Then, $f(x_0 + h) = f(x_0) + f'(x_0) h + \tfrac{1}{2} f''(x_0) h^2 + o(h^2)$.

• At x0, f can be approximated by a quadratic function.

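• Suppose f: Rn → R.

• If f is in C1, then $f(x_0 + h) = f(x_0) + \nabla f(x_0)^T h + o(\|h\|)$.

• If f is in C2, then $f(x_0 + h) = f(x_0) + \nabla f(x_0)^T h + \tfrac{1}{2} h^T F(x_0) h + o(\|h\|^2)$, where F is the Hessian of f.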
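The behaviour of the remainder terms can be seen numerically in the one-dimensional case. The sketch below assumes f = exp and x0 = 0 (illustrative choices), and measures the error of the linear and quadratic approximations for shrinking h.

```python
import math

x0 = 0.0
f = math.exp     # assumed test function
df = math.exp    # its first derivative
d2f = math.exp   # its second derivative


def linear_error(h):
    # |f(x0 + h) - (f(x0) + f'(x0) h)|, the o(h) remainder
    return abs(f(x0 + h) - (f(x0) + df(x0) * h))


def quadratic_error(h):
    # |f(x0 + h) - (f(x0) + f'(x0) h + f''(x0) h^2 / 2)|, the o(h^2) remainder
    return abs(f(x0 + h) - (f(x0) + df(x0) * h + 0.5 * d2f(x0) * h * h))


steps = [0.1, 0.01, 0.001]
ratios = [linear_error(h) / h for h in steps]   # should shrink toward 0

for h, ratio in zip(steps, ratios):
    print(h, ratio, quadratic_error(h))
```

The ratio of the linear error to h shrinks as h does, which is what o(h)/h → 0 means, and the quadratic approximation is more accurate at every step size.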

• We already know that ∇f(x0) is orthogonal to the level set at x0.

• Suppose ∇f(x0) ≠ 0.

• Fact: ∇f points in the direction of increasing f.

• Consider xα = x0 + α∇f(x0), α > 0.

• By Taylor's formula, $f(x_\alpha) = f(x_0) + \alpha \|\nabla f(x_0)\|^2 + o(\alpha)$.

• Therefore, for sufficiently small α > 0,

f(xα) > f(x0)

### Descent Methods

• This theorem is the link from the previous gradient properties to the constructive algorithm.

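• The problem: $\min_{x \in R^n} f(x)$

• We introduce a model for the algorithm:

Data: $x_0 \in R^n$

Step 0: set i = 0

Step 1: if $\nabla f(x_i) = 0$, stop;

else, compute the search direction $h_i \in R^n$

Step 2: compute the step-size $\lambda_i \in \arg\min_{\lambda \ge 0} f(x_i + \lambda h_i)$

Step 3: set $x_{i+1} = x_i + \lambda_i h_i$, go to step 1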

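• The Theorem:

• Suppose f: Rn → R is C1 smooth, and there exists a continuous function k: Rn → [0,1] such that $\nabla f(x) \ne 0 \Rightarrow k(x) > 0$ for all x,

• and the search vectors constructed by the model algorithm satisfy $\langle h_i, \nabla f(x_i) \rangle \le -k(x_i) \, \|\nabla f(x_i)\| \, \|h_i\|$

• and $\nabla f(x_i) \ne 0 \Rightarrow h_i \ne 0$.

• Then,

• if $\{x_i\}_{i=0}^{\infty}$ is the sequence constructed by the model algorithm,

• then any accumulation point y of this sequence satisfies $\nabla f(y) = 0$.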

• The theorem has very intuitive interpretation:

• Always go in descent direction.

The principal differences between various descent algorithms lie in the first procedure, the one for determining successive directions.

### Steepest Descent

• We now use what we have learned to implement the most basic minimization technique.

• First we introduce the algorithm, which is a version of the model algorithm.

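• The problem: $\min_{x \in R^n} f(x)$

• Steepest descent algorithm:

Data: $x_0 \in R^n$

Step 0: set i = 0

Step 1: if $\nabla f(x_i) = 0$, stop;

else, compute the search direction $h_i = -\nabla f(x_i)$

Step 2: compute the step-size $\lambda_i \in \arg\min_{\lambda \ge 0} f(x_i + \lambda h_i)$

Step 3: set $x_{i+1} = x_i + \lambda_i h_i$, go to step 1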
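The steps above can be sketched in plain Python. This is an illustration under several assumptions that are not in the slides: the test function, the golden-section line search restricted to the bracket [0, 1], the tolerance `eps` used in place of an exact zero-gradient test, and the iteration cap.

```python
import math


def golden_section(phi, a=0.0, b=1.0, tol=1e-10):
    # Minimizes a unimodal function phi on [a, b] by golden-section search.
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)


def steepest_descent(f, grad, x0, eps=1e-6, max_iter=1000):
    x = list(x0)                                      # Step 0
    iterations = 0
    while iterations < max_iter:
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < eps:  # Step 1: stop test
            break
        h = [-gi for gi in g]                         # search direction
        # Step 2: step size by a line search along h.
        lam = golden_section(
            lambda t: f([xi + t * hi for xi, hi in zip(x, h)])
        )
        x = [xi + lam * hi for xi, hi in zip(x, h)]   # Step 3
        iterations += 1
    return x, iterations


def f(x):
    # Hypothetical test function with its minimum at (1, -2).
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x_star, iterations = steepest_descent(f, grad_f, [0.0, 0.0])
print(x_star, iterations)
```

The bracket [0, 1] is wide enough for this particular function; a general implementation would have to bracket the minimum along each search direction first.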

• Theorem:

• If $\{x_i\}_{i=0}^{\infty}$ is a sequence constructed by the SD algorithm, then every accumulation point y of the sequence satisfies $\nabla f(y) = 0$.

• Proof: from Wolfe theorem

Remark: Wolfe theorem gives us numerical stability if the derivatives aren’t given (are calculated numerically).

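• How long a step to take?

Note that the search direction is $-\nabla f(x_i)$.

• We are limited to a line search:

• Choose λ to minimize $f(x_{i+1}) = f(x_i - \lambda \nabla f(x_i))$.

• At the minimizing λ, the directional derivative is equal to zero.

• From the chain rule: $\frac{d}{d\lambda} f(x_{i+1}) = \nabla f(x_{i+1})^T \frac{d x_{i+1}}{d\lambda} = -\nabla f(x_{i+1})^T \nabla f(x_i) = 0$

• Therefore, in the method of steepest descent, each new gradient is orthogonal to the previous one, and the iterates follow a zigzag path.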
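The orthogonality of successive directions can be verified on a quadratic f(x) = ½xᵀAx − bᵀx, for which the exact line-search step along the residual r = b − Ax = −∇f(x) is λ = rᵀr / rᵀAr. The matrix A, the vector b, and the starting point below are sample values assumed for the illustration (they are the small example used in Shewchuk's conjugate gradient notes).

```python
def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


A = [[3.0, 2.0], [2.0, 6.0]]   # assumed symmetric positive definite matrix
b = [2.0, -8.0]                # assumed right-hand side
x = [-2.0, -2.0]               # assumed starting point

residuals = []
for _ in range(5):
    r = [bi - yi for bi, yi in zip(b, mat_vec(A, x))]   # r = -grad f(x)
    residuals.append(r)
    lam = dot(r, r) / dot(r, mat_vec(A, r))             # exact step for a quadratic
    x = [xi + lam * ri for xi, ri in zip(x, r)]

# Inner products of consecutive search directions.
inner_products = [dot(residuals[k], residuals[k + 1]) for k in range(4)]
print(inner_products)
```

Each inner product is zero up to rounding, so every step turns by a right angle relative to the previous one.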

Given:

Find the minimum when x1 is allowed to vary from 0.5 to 1.5 and x2 is allowed to vary from 0 to 2.

λ arbitrary

• From now on we assume that we want to minimize the quadratic function $f(x) = \tfrac{1}{2} x^T A x - b^T x + c$.

• If A is symmetric (and positive definite), this is equivalent to solving the linear problem $Ax = b$.

• The solution is the intersection of the lines.

• Each ellipsoid has constant f(x).

In general, the solution x lies at the intersection point of n hyperplanes, each having dimension n − 1.
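The link between the minimization problem and the linear system can be made explicit. The derivation below assumes the quadratic is written as $f(x) = \tfrac{1}{2} x^T A x - b^T x + c$, the form used in Shewchuk's notes listed under Sources.

```latex
\nabla f(x) = \tfrac{1}{2} A^T x + \tfrac{1}{2} A x - b
\quad\Longrightarrow\quad
\nabla f(x) = A x - b \quad \text{when } A = A^T,
\qquad
\nabla f(x) = 0 \iff A x = b .
```

When A is also positive definite, the stationary point is the unique minimizer, since the Hessian of f is A.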

• What is the problem with steepest descent?

• We can repeat the same directions over and over…

• Wouldn’t it be better if, every time we took a step, we got it right the first time?

• First, let's define the error as $e_i = x_i - x$, where x is the solution.

• $e_i$ is a vector that indicates how far we are from the solution.

• Let's pick a set of orthogonal search directions $d_0, d_1, \dots, d_{n-1}$ (they should span Rn).

• In each search direction, we'll take exactly one step;

that step will be just the right length to line up evenly with x.

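• Using the coordinate axes as search directions…

• We have $x_{i+1} = x_i + \alpha_i d_i$.

• Given $d_i$, how do we calculate $\alpha_i$?

• $e_{i+1}$ should be orthogonal to $d_i$.

• That is, $d_i^T e_{i+1} = 0$, so $d_i^T (e_i + \alpha_i d_i) = 0$ and $\alpha_i = -\frac{d_i^T e_i}{d_i^T d_i}$.

• Unfortunately, this method only works if you already know the answer: computing $\alpha_i$ requires the error $e_i$.

• How do we find search directions that avoid this?

• Since the search vectors form a basis, $e_0 = \sum_{j=0}^{n-1} \delta_j d_j$.

On the other hand, $e_i = e_0 + \sum_{j=0}^{i-1} \alpha_j d_j$.

• We want the error to be 0 after n steps: $e_n = 0$.

• Here is an idea: if $\alpha_j = -\delta_j$, then $e_i = \sum_{j=0}^{n-1} \delta_j d_j - \sum_{j=0}^{i-1} \delta_j d_j = \sum_{j=i}^{n-1} \delta_j d_j$.

So if i = n, then $e_n = 0$.

• So we look for $d_j$ such that $\alpha_j = -\delta_j$.

• A simple calculation shows that this holds if we take the search directions to be A-orthogonal (conjugate): $d_i^T A d_j = 0$ for $i \ne j$.

The correct choice is $\alpha_i = -\frac{d_i^T A e_i}{d_i^T A d_i} = \frac{d_i^T r_i}{d_i^T A d_i}$, where $r_i = b - A x_i = -A e_i$ is the residual.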

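• Conjugate gradient algorithm for minimizing f:

Data: $x_0 \in R^n$

Step 0: $d_0 = r_0 = -\nabla f(x_0) = b - A x_0$

Step 1: $\alpha_i = \frac{r_i^T r_i}{d_i^T A d_i}$

Step 2: $x_{i+1} = x_i + \alpha_i d_i$

Step 3: $r_{i+1} = r_i - \alpha_i A d_i$, $\beta_{i+1} = \frac{r_{i+1}^T r_{i+1}}{r_i^T r_i}$

Step 4: $d_{i+1} = r_{i+1} + \beta_{i+1} d_i$, and repeat n times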
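A minimal plain-Python sketch of the conjugate gradient iteration for the quadratic f(x) = ½xᵀAx − bᵀx, following the standard formulation in Shewchuk's notes (listed under Sources). The helper names, the early-exit tolerance, and the sample A, b, and starting point are assumptions made for the illustration; A is assumed symmetric positive definite.

```python
def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, tol=1e-24):
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, mat_vec(A, x))]   # r_0 = b - A x_0
    d = list(r)                                         # d_0 = r_0
    rr = dot(r, r)
    for _ in range(len(b)):                             # at most n steps
        if rr < tol:
            break                                       # already converged
        Ad = mat_vec(A, d)
        alpha = rr / dot(d, Ad)                         # step length
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        d = [ri + beta * di for ri, di in zip(r, d)]    # next conjugate direction
        rr = rr_new
    return x


A = [[3.0, 2.0], [2.0, 6.0]]   # assumed sample matrix
b = [2.0, -8.0]                # assumed right-hand side
x = conjugate_gradient(A, b, [-2.0, -2.0])
print(x)
```

For this 2×2 system the iteration reaches the solution of Ax = b, which is (2, −2), in two steps, in contrast with the zigzag path of steepest descent.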

### Sources

• Jyh-Shing Roger Jang, Chuen-Tsai Sun and Eiji Mizutani, Slides for Ch. 5 of “Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence”, First Edition, Prentice Hall, 1997.

• Djamel Bouchaffra. Soft Computing. Course materials. Oakland University. Fall 2005

• Lecture slides, Soft Computing. Course materials. Dipartimento di Elettronica e Informazione. Politecnico di Milano. 2004

• Jeen-Shing Wang, Course: Introduction to Neural Networks. Lecture notes. Department of Electrical Engineering. National Cheng Kung University. Fall, 2005

• Carlo Tomasi, Mathematical Methods for Robotics and Vision. Stanford University. Fall 2000

• Petros Ioannou, Jing Sun, Robust Adaptive Control. Prentice-Hall, Inc, Upper Saddle River: NJ, 1996

• Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Edition 11/4. School of Computer Science. Carnegie Mellon University. Pittsburgh. August 4, 1994

• Gordon C. Everstine, Selected Topics in Linear Algebra. The George Washington University. 8 June 2004