Computacion Inteligente

Derivative-Based Optimization

- Optimization problems
- Mathematical background
- Descent Methods
- The Method of Steepest Descent
- Conjugate Gradient

OPTIMIZATION PROBLEMS

- Objective function – mathematical function which is optimized by changing the values of the design variables.
- Design Variables – Those variables which we, as designers, can change.
- Constraints – Functions of the design variables which establish limits in individual variables or combinations of design variables.

3 basic ingredients…

- an objective function,
- a set of decision variables,
- a set of equality/inequality constraints.

The problem is

to search for the values of the decision variables that minimize the objective function while satisfying the constraints…

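Objective: minimize $f(x)$

Decision vector: $x = (x_1, x_2, \dots, x_n)^T$

Bounds: $x_i^{L} \le x_i \le x_i^{U}$, $i = 1, \dots, n$

Constraints: $g_j(x) \le 0$ (inequalities) and $h_k(x) = 0$ (equalities)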

- Design Variables: decision and objective vector
- Constraints: equality and inequality
- Bounds: feasible ranges for variables
- Objective Function: maximization can be converted to minimization, because maximizing $f(x)$ is the same as minimizing $-f(x)$

- Identify the quantity or function, f, to be optimized.
- Identify the design variables: x1, x2, x3, …,xn.
- Identify the constraints, if any exist:
  a. Equalities
  b. Inequalities

- Adjust the design variables (x’s) until f is optimized and all of the constraints are satisfied.
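The ingredients listed above can be sketched in code. This is a minimal illustration, not a prescribed formulation: the `Problem` class, its field names, and the toy objective and constraint are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class Problem:
    """Hypothetical container for the three basic ingredients."""
    objective: Callable[[Vector], float]
    bounds: List[Tuple[float, float]]              # (lower, upper) per variable
    inequalities: List[Callable[[Vector], float]]  # satisfied when g(x) <= 0
    equalities: List[Callable[[Vector], float]]    # satisfied when h(x) == 0

    def is_feasible(self, x: Vector, tol: float = 1e-9) -> bool:
        """Return True when x respects the bounds and every constraint."""
        in_bounds = all(lo - tol <= xi <= hi + tol
                        for xi, (lo, hi) in zip(x, self.bounds))
        ineq_ok = all(g(x) <= tol for g in self.inequalities)
        eq_ok = all(abs(h(x)) <= tol for h in self.equalities)
        return in_bounds and ineq_ok and eq_ok


# Toy problem (an assumption for illustration):
# minimize (x1 - 1)^2 + (x2 - 2)^2 subject to x1 + x2 <= 4, 0 <= x1, x2 <= 3.
toy = Problem(
    objective=lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2,
    bounds=[(0.0, 3.0), (0.0, 3.0)],
    inequalities=[lambda x: x[0] + x[1] - 4.0],
    equalities=[],
)
```

An optimizer would then adjust `x` to lower `toy.objective(x)` while keeping `toy.is_feasible(x)` true.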

- Objective functions may be unimodal or multimodal.
- Unimodal – only one optimum
- Multimodal – more than one optimum

- Most search schemes are based on the assumption of a unimodal surface. The optimum determined in such cases is called a local optimum design.
- The global optimum is the best of all local optimum designs.

- Existence of global minimum
- If f(x) is continuous on the feasible set S which is closed and bounded, then f(x) has a global minimum in S
- A set S is closed if it contains all its boundary points.
- A set S is bounded if it is contained in the interior of some circle (in general, some ball of finite radius).

compact = closed and bounded

[Figure: surface over the (x1, x2) plane showing a saddle point and a local maximum.]

- Derivative-based optimization (gradient based)
  - Capable of determining “search directions” according to an objective function’s derivative information
  - steepest descent method;
  - Newton’s method; Newton-Raphson method;
  - Conjugate gradient, etc.
- Derivative-free optimization
  - random search method;
  - genetic algorithm;
  - simulated annealing; etc.

MATHEMATICAL BACKGROUND

The scalar $x^T M x = \sum_{i}\sum_{j} M_{ij}\, x_i x_j$ is called a quadratic form.

- A square matrix $M$ is positive definite if $x^T M x > 0$ for all $x \ne 0$.
- It is positive semidefinite if $x^T M x \ge 0$ for all $x$.

- A symmetric matrix $M = M^T$ is positive definite if and only if its eigenvalues $\lambda_i > 0$ (semidefinite ↔ $\lambda_i \ge 0$).
- Proof (→): Let $v_i$ be the eigenvector for the i-th eigenvalue $\lambda_i$.
- Then, $0 < v_i^T M v_i = \lambda_i\, v_i^T v_i = \lambda_i \|v_i\|^2$,
- which implies $\lambda_i > 0$.

Exercise (←): prove that positive eigenvalues imply positive definiteness.

- Theorem: If a matrix $M = U^T U$ with $U$ nonsingular, then it is positive definite.

- Proof. Let $f$ be defined as $f = x^T M x$.
- If we can show that $f$ is always positive then $M$ must be positive definite. We can write this as $f = x^T U^T U x = (Ux)^T (Ux)$.
- Provided that $Ux$ always gives a non-zero vector for all values of $x$ except when $x = 0$ (that is, $U$ is nonsingular), we can write $b = Ux$, i.e. $f = b^T b = \sum_i b_i^2$,
- so $f$ must always be positive.
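As a quick numerical check of the identity behind this theorem, $x^T M x = \|Ux\|^2$ when $M = U^T U$, the sketch below uses a small nonsingular matrix `U` chosen arbitrarily for illustration. The helper names are assumptions, and plain Python lists stand in for matrices.

```python
def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Matrix product A B for lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def quadratic_form(M, x):
    """The scalar x^T M x."""
    return sum(xi * yi for xi, yi in zip(x, matvec(M, x)))


# Arbitrary nonsingular U (determinant 6), chosen only for illustration.
U = [[2.0, 1.0],
     [0.0, 3.0]]
M = matmul(transpose(U), U)  # M = U^T U

x = [1.0, -1.0]
b = matvec(U, x)                         # b = U x
squared_norm = sum(bi * bi for bi in b)  # ||U x||^2
```

For this `x`, both `quadratic_form(M, x)` and `squared_norm` come out as 10, which is positive as the theorem requires.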

- $f: \mathbb{R}^n \to \mathbb{R}$ is a quadratic function if $f(x) = \tfrac{1}{2} x^T Q x - b^T x + c$,
- where Q is symmetric.

- Requiring Q to be symmetric involves no loss of generality.
- Suppose the matrix P is non-symmetric. Then $x^T P x = x^T Q x$ with $Q = \tfrac{1}{2}(P + P^T)$, because $x^T P x = x^T P^T x$ (a scalar equals its own transpose).

Q is symmetric.
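A small worked example may make the symmetrization concrete. The matrix P below is an arbitrary choice for illustration and is not taken from the original slides.

```latex
\[
P = \begin{pmatrix} 1 & 4 \\ 0 & 3 \end{pmatrix},
\qquad
Q = \tfrac{1}{2}\,(P + P^T) = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}
\]
\[
x^T P x = x_1^2 + 4 x_1 x_2 + 3 x_2^2 = x^T Q x
\]
```

Both matrices give the same quadratic form, but only Q is symmetric.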

- Given the quadratic function $f(x) = \tfrac{1}{2} x^T Q x - b^T x + c$:

If Q is positive definite, then f is a parabolic “bowl.”

- Two other shapes can result from the quadratic form.
- If Q is negative definite, then f is an upside-down parabolic “bowl.”
- If Q is indefinite then f describes a saddle.
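The three shapes can be told apart from the signs of the eigenvalues of Q. The sketch below handles only the symmetric 2×2 case, where the eigenvalues have a closed form. The function names and the "degenerate" label for a zero eigenvalue are assumptions of this example.

```python
import math


def eigenvalues_sym2(Q):
    """Eigenvalues (smallest first) of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = Q[0][0], Q[0][1], Q[1][1]
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


def classify(Q, tol=1e-12):
    """Name the shape of the quadratic form x^T Q x from the eigenvalue signs."""
    lo, hi = eigenvalues_sym2(Q)
    if lo > tol:
        return "bowl"            # positive definite
    if hi < -tol:
        return "inverted bowl"   # negative definite
    if lo < -tol and hi > tol:
        return "saddle"          # indefinite
    return "degenerate"          # at least one (near-)zero eigenvalue
```

For instance, `classify([[1.0, 2.0], [2.0, 1.0]])` reports a saddle, because the eigenvalues of that matrix are -1 and 3.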

- Quadratics are useful in the study of optimization.
- Often, objective functions are “close to” quadratic near the solution.
- It is easier to analyze the behavior of algorithms when applied to quadratics.
- Analysis of algorithms for quadratics gives insight into their behavior in general.

- The derivative of $f: \mathbb{R} \to \mathbb{R}$ is a function $f': \mathbb{R} \to \mathbb{R}$ given by $f'(x) = \lim_{h \to 0} \dfrac{f(x + h) - f(x)}{h}$,
- if the limit exists.

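- Along the axes (partial derivatives): $\dfrac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \dfrac{f(x + h e_i) - f(x)}{h}$, where $e_i$ is the i-th unit coordinate vector.

- In a general direction $v$ (directional derivative): $\dfrac{\partial f}{\partial v}(x) = \lim_{h \to 0} \dfrac{f(x + h v) - f(x)}{h}$.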

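- Definition: A real-valued function $f: \mathbb{R}^n \to \mathbb{R}$ is said to be continuously differentiable if the partial derivatives $\dfrac{\partial f}{\partial x_1}(x), \dots, \dfrac{\partial f}{\partial x_n}(x)$
- exist for each $x$ in $\mathbb{R}^n$ and are continuous functions of $x$.
- In this case, we say $f \in C^1$ (a smooth function $C^1$).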

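- Definition: The gradient of $f: \mathbb{R}^2 \to \mathbb{R}$ (in the plane) is a function $\nabla f: \mathbb{R}^2 \to \mathbb{R}^2$ given by $\nabla f(x, y) = \left( \dfrac{\partial f}{\partial x}, \dfrac{\partial f}{\partial y} \right)$.

- Definition: The gradient of $f: \mathbb{R}^n \to \mathbb{R}$ is a function $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$ given by $\nabla f(x_1, \dots, x_n) = \left( \dfrac{\partial f}{\partial x_1}, \dots, \dfrac{\partial f}{\partial x_n} \right)$.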
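When an analytic gradient is not at hand, each partial derivative in the definition can be approximated with a central difference. The sketch below is one common way to do this. The function name `numerical_gradient`, the step `h`, and the test function are assumptions for illustration.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x, one coordinate at a time,
    using the central difference (f(x + h e_i) - f(x - h e_i)) / (2 h)."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2.0 * h))
    return grad


def f(x):
    """Illustrative function f(x1, x2) = x1^2 + 3 x1 x2."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def exact_gradient(x):
    """Analytic gradient of f: (2 x1 + 3 x2, 3 x1)."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


point = [1.0, 2.0]
approx = numerical_gradient(f, point)
exact = exact_gradient(point)
```

At the point (1, 2) the analytic gradient is (8, 3), and the central-difference estimate agrees with it up to rounding error.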

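- The gradient defines the (hyper)plane approximating the function infinitesimally: $\Delta z = \dfrac{\partial f}{\partial x} \Delta x + \dfrac{\partial f}{\partial y} \Delta y$

- By the chain rule, the derivative of f at a point p in a unit direction v ($\|v\| = 1$) is $\dfrac{\partial f}{\partial v}(p) = \langle \nabla f(p), v \rangle$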

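- Proposition 1: the directional derivative $\dfrac{\partial f}{\partial v}(p)$ over unit vectors $v$ is maximal choosing $v = \dfrac{\nabla f(p)}{\|\nabla f(p)\|}$.

Intuitive: the gradient points in the direction of greatest change.

Prove it!

- Proof:
- Assign: $v = \dfrac{\nabla f(p)}{\|\nabla f(p)\|}$
- By the chain rule: $\dfrac{\partial f}{\partial v}(p) = \left\langle \nabla f(p), \dfrac{\nabla f(p)}{\|\nabla f(p)\|} \right\rangle = \|\nabla f(p)\|$
- On the other hand, for a general unit vector $v$, the Cauchy-Schwarz inequality gives $\dfrac{\partial f}{\partial v}(p) = \langle \nabla f(p), v \rangle \le \|\nabla f(p)\| \, \|v\| = \|\nabla f(p)\|$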
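The proposition can be checked numerically by sampling unit directions and comparing their directional derivatives with the norm of the gradient. The function $f(x, y) = x^2 + 3y^2$, the point `p`, and the one-degree sampling are assumptions made for this sketch.

```python
import math


def grad_f(p):
    """Gradient of the illustrative function f(x, y) = x^2 + 3 y^2."""
    return [2.0 * p[0], 6.0 * p[1]]


def directional_derivative(p, v):
    """<grad f(p), v> for a unit vector v."""
    g = grad_f(p)
    return g[0] * v[0] + g[1] * v[1]


p = [1.0, 1.0]
g = grad_f(p)
norm_g = math.hypot(g[0], g[1])

# Direction claimed to be optimal: the normalized gradient.
best_direction = [g[0] / norm_g, g[1] / norm_g]

# Unit vectors sampled every degree around the circle.
samples = [[math.cos(math.radians(k)), math.sin(math.radians(k))]
           for k in range(360)]
max_sampled = max(directional_derivative(p, v) for v in samples)
```

No sampled direction exceeds `norm_g`, and the normalized gradient attains it.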

- Proposition 2: let $f: \mathbb{R}^n \to \mathbb{R}$ be a smooth function ($C^1$) around p,
- if f has a local minimum (maximum) at p then $\nabla f(p) = 0$.

Intuitive: this is a necessary condition for a local min (max).

- Proof: intuitive

- We found the best INFINITESIMAL DIRECTION at each point,
- Looking for minimum: “blind man” procedure
- How can we derive the way to the minimum using this knowledge?

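- The derivative of $f: \mathbb{R}^n \to \mathbb{R}^m$ is a function $Df: \mathbb{R}^n \to \mathbb{R}^{m \times n}$ given by $[Df(x)]_{ij} = \dfrac{\partial f_i}{\partial x_j}(x)$,

called the Jacobian.

Note that for $f: \mathbb{R}^n \to \mathbb{R}$, we have $\nabla f(x) = Df(x)^T$.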

- If the derivative of ∇f exists, we say that f is twice differentiable.
- Write the second derivative as $D^2 f$ (or $F$), and call it the Hessian of f.

- The level set of a function $f: \mathbb{R}^n \to \mathbb{R}$ at level c is the set of points $S = \{x : f(x) = c\}$.

- Fact: ∇f(x0) is orthogonal to the level set at x0

- Proof of fact:
- Imagine a particle traveling along the level set.
- Let g(t) be the position of the particle at time t, with g(0) = x0.
- Note that f(g(t)) = constant for all t.
- Velocity vector g′(t) is tangent to the level set.
- Consider $F(t) = f(g(t))$. We have $F'(0) = 0$. By the chain rule, $F'(0) = \nabla f(g(0))^T g'(0) = \nabla f(x_0)^T g'(0) = 0$.
- Hence, ∇f(x0) and g′(0) are orthogonal.
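The particle argument can be reproduced on a concrete level set. For $f(x, y) = x^2 + y^2$ the level sets are circles, and a particle moving on one has position $g(t) = (r\cos t, r\sin t)$. The function, the radius, and the sample times below are assumptions for illustration.

```python
import math


def grad_f(p):
    """Gradient of f(x, y) = x^2 + y^2."""
    return [2.0 * p[0], 2.0 * p[1]]


def position(t, r=2.0):
    """Particle position g(t) on the level set f = r^2."""
    return [r * math.cos(t), r * math.sin(t)]


def velocity(t, r=2.0):
    """Velocity g'(t), tangent to the level set."""
    return [-r * math.sin(t), r * math.cos(t)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


times = [0.0, 0.3, 1.0, 2.5, 4.0]
inner_products = [dot(grad_f(position(t)), velocity(t)) for t in times]
```

Every entry of `inner_products` is zero up to rounding, so the gradient is orthogonal to the tangent at each sampled point.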

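- Suppose $f: \mathbb{R} \to \mathbb{R}$ is in $C^1$. Then, $f(x_0 + h) = f(x_0) + f'(x_0)\,h + o(h)$.

- $o(h)$ is a term such that $o(h)/h \to 0$ as $h \to 0$.
- At $x_0$, f can be approximated by a linear function, and the approximation gets better the closer we are to $x_0$.

- Suppose $f: \mathbb{R} \to \mathbb{R}$ is in $C^2$. Then, $f(x_0 + h) = f(x_0) + f'(x_0)\,h + \tfrac{1}{2} f''(x_0)\,h^2 + o(h^2)$.

- At $x_0$, f can be approximated by a quadratic function.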

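- Suppose $f: \mathbb{R}^n \to \mathbb{R}$.
- If f is in $C^1$, then $f(x_0 + h) = f(x_0) + \nabla f(x_0)^T h + o(\|h\|)$.
- If f is in $C^2$, then $f(x_0 + h) = f(x_0) + \nabla f(x_0)^T h + \tfrac{1}{2} h^T D^2 f(x_0)\, h + o(\|h\|^2)$.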
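The meaning of the remainder terms can be seen numerically in one dimension: for a first-order expansion the error divided by $h$ shrinks as $h \to 0$, and for a second-order expansion the error divided by $h^2$ shrinks. The sketch uses $f(x) = e^x$ around $x_0 = 0$, which is an assumption chosen because all its derivatives at 0 equal 1.

```python
import math


def first_order_error(h):
    """|f(h) - (f(0) + f'(0) h)| for f = exp."""
    return abs(math.exp(h) - (1.0 + h))


def second_order_error(h):
    """|f(h) - (f(0) + f'(0) h + f''(0) h^2 / 2)| for f = exp."""
    return abs(math.exp(h) - (1.0 + h + 0.5 * h * h))


steps = [1e-1, 1e-2, 1e-3]

# o(h)/h should tend to 0 for the linear approximation,
# and o(h^2)/h^2 should tend to 0 for the quadratic one.
linear_ratios = [first_order_error(h) / h for h in steps]
quadratic_ratios = [second_order_error(h) / h ** 2 for h in steps]
```

Both lists decrease as the step gets smaller, which is what the little-o notation expresses.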

- We already know that ∇f(x0) is orthogonal to the level set at x0.
- Suppose ∇f(x0) ≠ 0.

- Fact: ∇f points in the direction of increasing f.

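- Consider $x_\alpha = x_0 + \alpha \nabla f(x_0)$, $\alpha > 0$.
- By Taylor's formula, $f(x_\alpha) = f(x_0) + \alpha \|\nabla f(x_0)\|^2 + o(\alpha)$.

- Therefore, for sufficiently small $\alpha > 0$, $f(x_\alpha) > f(x_0)$.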

DESCENT METHODS

- This theorem is the link from the previous gradient properties to the constructive algorithm.
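- The problem: $\min_{x \in \mathbb{R}^n} f(x)$

- We introduce a model for the algorithm:

Data: $x_0 \in \mathbb{R}^n$

Step 0: set $i = 0$

Step 1: if $\nabla f(x_i) = 0$, stop;

else, compute a search direction $h_i \in \mathbb{R}^n$

Step 2: compute the step-size $\lambda_i \in \arg\min_{\lambda \ge 0} f(x_i + \lambda h_i)$

Step 3: set $x_{i+1} = x_i + \lambda_i h_i$, $i = i + 1$, and go to step 1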
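The model algorithm above can be sketched as a loop with a pluggable search direction and step-size rule. This is a sketch under stated assumptions, not the slides' implementation: the exact test "gradient equals zero" is replaced by a tolerance, and the step size comes from a backtracking (Armijo) rule rather than an exact minimization. All function names and the test function are hypothetical.

```python
def descent(f, grad, x0, direction, step_size, tol=1e-8, max_iter=10_000):
    """Generic descent loop: stop when the gradient is (nearly) zero,
    otherwise pick a direction h, a step lam, and move to x + lam * h."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 <= tol:   # Step 1: stopping test
            break
        h = direction(x, g)                          # Step 1: search direction
        lam = step_size(f, x, g, h)                  # Step 2: step size
        x = [xi + lam * hi for xi, hi in zip(x, h)]  # Step 3: update
    return x


def negative_gradient(x, g):
    """One possible direction rule: h = -grad f(x)."""
    return [-gi for gi in g]


def backtracking(f, x, g, h, lam=1.0, shrink=0.5, c=1e-4):
    """Assumed step rule: halve lam until the Armijo decrease condition holds."""
    fx = f(x)
    slope = sum(gi * hi for gi, hi in zip(g, h))  # negative for a descent direction
    while lam > 1e-16 and f([xi + lam * hi for xi, hi in zip(x, h)]) > fx + c * lam * slope:
        lam *= shrink
    return lam


def f(x):
    """Illustrative objective with minimum at (1, -0.5)."""
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2


def grad(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)]


x_min = descent(f, grad, [0.0, 0.0], negative_gradient, backtracking)
```

Swapping `negative_gradient` for another rule changes the method while the loop stays the same, which is the point of the model.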

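- The Theorem (Wolfe):
- Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is $C^1$ smooth, and there exists a continuous function $k: \mathbb{R}^n \to [0,1]$ such that $\nabla f(x) \ne 0 \Rightarrow k(x) > 0$ for all $x$.
- And the search vectors $h_i$ constructed by the model algorithm at the iterates $x_i$ satisfy $\langle \nabla f(x_i), h_i \rangle \le -k(x_i)\, \|\nabla f(x_i)\|\, \|h_i\|$.

- And $\nabla f(x_i) \ne 0 \Rightarrow h_i \ne 0$.

- If $\{x_i\}_{i=0}^{\infty}$ is the sequence constructed by the model algorithm,
- then any accumulation point $y$ of this sequence satisfies $\nabla f(y) = 0$.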
- The theorem has a very intuitive interpretation:
- Always go in a descent direction.

The principal differences between various descent algorithms lie in the procedure for determining successive search directions.

STEEPEST DESCENT

- We now use what we have learned to implement the most basic minimization technique.
- First we introduce the algorithm, which is a version of the model algorithm.
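- The problem: $\min_{x \in \mathbb{R}^n} f(x)$

- Steepest descent algorithm:

Data: $x_0 \in \mathbb{R}^n$

Step 0: set $i = 0$

Step 1: if $\nabla f(x_i) = 0$, stop;

else, compute the search direction $h_i = -\nabla f(x_i)$

Step 2: compute the step-size $\lambda_i \in \arg\min_{\lambda \ge 0} f(x_i + \lambda h_i)$

Step 3: set $x_{i+1} = x_i + \lambda_i h_i$, $i = i + 1$, and go to step 1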

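- Theorem:
- If $\{x_i\}_{i=0}^{\infty}$ is a sequence constructed by the SD algorithm, then every accumulation point $y$ of the sequence satisfies $\nabla f(y) = 0$.
- Proof: from the Wolfe theorem, since the direction $h_i = -\nabla f(x_i)$ meets its conditions with $k(x) \equiv 1$.

Remark: the Wolfe theorem gives us numerical stability if the derivatives aren’t given (are calculated numerically).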

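- How long a step to take?

Note that the search direction is $h_i = -\nabla f(x_i)$.

- We are limited to a line search: choose $\lambda$ to minimize $f(x_i + \lambda h_i)$. At the minimizing $\lambda$, the directional derivative is equal to zero.

- From the chain rule: $\dfrac{d}{d\lambda} f(x_i + \lambda h_i) = \nabla f(x_{i+1})^T h_i = 0$, where $x_{i+1} = x_i + \lambda h_i$.

- Therefore the method of steepest descent looks like this: the new gradient $\nabla f(x_{i+1})$ and the previous search direction $h_i = -\nabla f(x_i)$ are orthogonal, so consecutive steps are at right angles and the path zigzags towards the minimum.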
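For a quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$ with symmetric positive definite $A$, the line search has a closed form: with gradient $g = Ax - b$, the minimizing step along $-g$ is $\lambda = g^T g / (g^T A g)$. The sketch below uses this to run steepest descent and record the gradients so their orthogonality can be inspected. The matrix, right-hand side, and starting point are assumptions chosen for illustration.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def steepest_descent_quadratic(A, b, x0, tol=1e-10, max_iter=1000):
    """Steepest descent with exact line search on f(x) = 1/2 x^T A x - b^T x.
    Returns the final iterate and the list of gradients visited."""
    x = list(x0)
    gradients = []
    for _ in range(max_iter):
        g = [ax - bi for ax, bi in zip(matvec(A, x), b)]  # gradient A x - b
        gradients.append(g)
        if dot(g, g) ** 0.5 <= tol:
            break
        lam = dot(g, g) / dot(g, matvec(A, g))            # exact minimizer along -g
        x = [xi - lam * gi for xi, gi in zip(x, g)]
    return x, gradients


# Illustrative symmetric positive definite system; its solution is (2, -2).
A = [[3.0, 2.0],
     [2.0, 6.0]]
b = [2.0, -8.0]
x, gradients = steepest_descent_quadratic(A, b, [-2.0, -2.0])

# Inner products of consecutive gradients; each should vanish up to rounding.
consecutive = [dot(g0, g1) for g0, g1 in zip(gradients, gradients[1:])]
```

The run needs many more than two iterations for this two-dimensional problem, which illustrates the zigzag behaviour caused by the right-angle turns.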

Given:

Find the minimum when x1 is allowed to vary from 0.5 to 1.5 and x2 is allowed to vary from 0 to 2.

CONJUGATE GRADIENT

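- From now on we assume we want to minimize the quadratic function $f(x) = \tfrac{1}{2} x^T A x - b^T x + c$.
- This is equivalent to solving the linear problem $A x = b$, because $\nabla f(x) = \tfrac{1}{2} A^T x + \tfrac{1}{2} A x - b$, which equals $A x - b$ if A is symmetric (A must also be positive definite for the stationary point to be a minimum).

- The solution is the intersection of the lines.

- Each ellipsoid has constant f(x).

In general, the solution x lies at the intersection point of n hyperplanes, each having dimension n − 1.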

- What is the problem with steepest descent?
- We can repeat the same directions over and over…

- Wouldn’t it be better if, every time we took a step, we got it right the first time?

- For an n-dimensional quadratic, conjugate gradient requires at most n gradient evaluations and n line searches to reach the solution.

- First, let’s define the error as $e_i = x_i - x$, where $x$ is the solution.

- $e_i$ is a vector that indicates how far we are from the solution.

- Let’s pick a set of orthogonal search directions $d_0, d_1, \dots, d_{n-1}$ (they should span $\mathbb{R}^n$).

- In each search direction, we’ll take exactly one step, and that step will be just the right length to line up evenly with $x$.

[Figure: using the coordinate axes as search directions, from the start point to the solution.]

- Unfortunately, this method only works if you already know the answer.

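- We have $x_{i+1} = x_i + \alpha_i d_i$, and therefore $e_{i+1} = e_i + \alpha_i d_i$.

- Given $d_i$, how do we calculate $\alpha_i$?

- $e_{i+1}$ should be orthogonal to $d_i$.
- That is, $d_i^T e_{i+1} = 0$, so $d_i^T (e_i + \alpha_i d_i) = 0$ and $\alpha_i = -\dfrac{d_i^T e_i}{d_i^T d_i}$.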
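- How do we find the directions $d_j$?
- Since the search vectors form a basis, $e_0 = \sum_{i=0}^{n-1} \delta_i d_i$.

On the other hand, $e_j = e_0 + \sum_{i=0}^{j-1} \alpha_i d_i$.

- We want that after n steps the error will be 0.
- Here is an idea: if $\alpha_i = -\delta_i$ then $e_j = \sum_{i=0}^{n-1} \delta_i d_i - \sum_{i=0}^{j-1} \delta_i d_i = \sum_{i=j}^{n-1} \delta_i d_i$.

So if $j = n$, then $e_n = 0$.

- So we look for $d_j$ such that $\alpha_j = -\delta_j$.
- A simple calculation shows that this holds if we take the directions to be A-conjugate, $d_j^T A d_k = 0$ for $j \ne k$, and require $e_{j+1}$ to be A-orthogonal to $d_j$.

The correct choice is $\alpha_j = -\dfrac{d_j^T A e_j}{d_j^T A d_j} = \dfrac{d_j^T r_j}{d_j^T A d_j}$, where $r_j = b - A x_j = -A e_j$ is the residual, which can be computed without knowing the solution.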

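- Conjugate gradient algorithm for minimizing f:

Data: $x_0 \in \mathbb{R}^n$

Step 0: $d_0 = r_0 = -\nabla f(x_0) = b - A x_0$

Step 1: $\alpha_i = \dfrac{r_i^T r_i}{d_i^T A d_i}$

Step 2: $x_{i+1} = x_i + \alpha_i d_i$, with new residual $r_{i+1} = -\nabla f(x_{i+1}) = b - A x_{i+1}$

Step 3: $\beta_{i+1} = \dfrac{r_{i+1}^T r_{i+1}}{r_i^T r_i}$

Step 4: $d_{i+1} = r_{i+1} + \beta_{i+1} d_i$, and repeat n times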
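A minimal sketch of the linear conjugate gradient iteration for $f(x) = \tfrac{1}{2} x^T A x - b^T x$, assuming A is symmetric positive definite. It uses the standard residual form, with $r = b - Ax = -\nabla f(x)$, step $\alpha = r^T r / (d^T A d)$ and $\beta = r_{new}^T r_{new} / (r^T r)$. The helper names and the 2×2 test system are assumptions for illustration.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(A, b, x0):
    """Run at most n conjugate gradient steps on A x = b, n = len(b)."""
    x = list(x0)
    r = [bi - ax for bi, ax in zip(b, matvec(A, x))]  # residual = -gradient
    d = list(r)                                       # first direction
    for _ in range(len(b)):
        rr = dot(r, r)
        if rr == 0.0:                                 # already at the solution
            break
        Ad = matvec(A, d)
        alpha = rr / dot(d, Ad)                       # exact step along d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        # Updated residual; equals b - A x for the new x.
        r_new = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        beta = dot(r_new, r_new) / rr
        d = [rn + beta * di for rn, di in zip(r_new, d)]  # next A-conjugate direction
        r = r_new
    return x


# Illustrative symmetric positive definite system; its solution is (2, -2).
A = [[3.0, 2.0],
     [2.0, 6.0]]
b = [2.0, -8.0]
solution = conjugate_gradient(A, b, [-2.0, -2.0])
```

For this 2×2 system the loop performs two steps and lands on the solution up to rounding, in line with the claim that n steps suffice for an n-dimensional quadratic.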
