
### Computacion Inteligente

Derivative-Based Optimization

Contents

- Optimization problems
- Mathematical background
- Descent Methods
- The Method of Steepest Descent
- Conjugate Gradient

Objective function – mathematical function which is optimized by changing the values of the design variables.

- Design Variables – Those variables which we, as designers, can change.
- Constraints – Functions of the design variables which establish limits in individual variables or combinations of design variables.

3 basic ingredients…

- an objective function,
- a set of decision variables,
- a set of equality/inequality constraints.

The problem is

to search for the values of the decision variables that minimize the objective function while satisfying the constraints…

- Decision vector: x = (x1, x2, …, xn)T
- Bounds: xiL ≤ xi ≤ xiU
- Constraints: gj(x) ≤ 0 (inequality), hk(x) = 0 (equality)

- Design Variables: the decision (and objective) vector
- Constraints: equality and inequality
- Bounds: feasible ranges for the variables
- Objective Function: maximization can be converted to minimization, since maximizing f(x) is the same as minimizing −f(x)

Identify the quantity or function, f, to be optimized.

- Identify the design variables: x1, x2, x3, …, xn.
- Identify the constraints, if any exist:

a. Equalities

b. Inequalities

- Adjust the design variables (the xi) until f is optimized and all of the constraints are satisfied.
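As a concrete (hypothetical) instance of these steps, the sketch below minimizes f(x1, x2) = (x1 − 1)² + (x2 − 2)² under the inequality constraint x1 + x2 ≤ 2 by brute-force search over a grid; the function, constraint, and grid are illustrative choices, not from the slides.

```python
# Step 1: the quantity f to be optimized (a hypothetical example function)
def f(x1, x2):
    return (x1 - 1.0)**2 + (x2 - 2.0)**2

# Step 3: the constraints (one inequality constraint here)
def feasible(x1, x2):
    return x1 + x2 <= 2.0

# Step 4: adjust the design variables (step 2) until f is minimized
# over the feasible set -- here by an exhaustive grid search
best = None
for i in range(201):
    for j in range(201):
        x1, x2 = -2 + 0.02 * i, -2 + 0.02 * j
        if feasible(x1, x2) and (best is None or f(x1, x2) < best[0]):
            best = (f(x1, x2), x1, x2)

print(best)   # minimum near (x1, x2) = (0.5, 1.5), on the constraint boundary
```

The unconstrained minimum (1, 2) violates x1 + x2 ≤ 2, so the constrained optimum lands on the boundary of the feasible set.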

Objective functions may be unimodal or multimodal.

- Unimodal – only one optimum
- Multimodal – more than one optimum
- Most search schemes are based on the assumption of a unimodal surface. The optimum determined in such cases is called a local optimum design.
- The global optimum is the best of all local optimum designs.
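A minimal sketch of the unimodal/multimodal distinction: the (arbitrarily chosen) function f(x) = x⁴ − 3x² + x has two local minima, and plain gradient descent finds a different one depending on the start point; the global optimum is the better of the two.

```python
# A multimodal objective: two local minima, one of them global.
def f(x):
    return x**4 - 3*x**2 + x

def df(x):
    return 4*x**3 - 6*x + 1

def descend(x, lr=0.01, steps=2000):
    # plain gradient descent with a fixed step size
    for _ in range(steps):
        x -= lr * df(x)
    return x

a = descend(2.0)     # converges to the local minimum near x ~  1.13
b = descend(-2.0)    # converges to the local minimum near x ~ -1.30
print(a, f(a))
print(b, f(b))       # f(b) < f(a): b is the global optimum of the two
```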

Existence of global minimum

- If f(x) is continuous on the feasible set S which is closed and bounded, then f(x) has a global minimum in S
- A set S is closed if it contains all its boundary points.
- A set S is bounded if it is contained in the interior of some ball of finite radius.

compact = closed and bounded


Derivative-based optimization (gradient based)

- Capable of determining “search directions” according to an objective function’s derivative information
- steepest descent method;
- Newton’s method; Newton-Raphson method;
- Conjugate gradient, etc.
- Derivative-free optimization
- random search method;
- genetic algorithm;
- simulated annealing; etc.

The scalar xTMx is called a quadratic form.

- A square matrix M is positive definite if xTMx > 0 for all x ≠ 0.
- It is positive semidefinite if xTMx ≥ 0 for all x.

A symmetric matrix M = MT is positive definite if and only if its eigenvalues λi > 0. (Semidefinite ↔ λi ≥ 0.)

- Proof (→): Let vi be the eigenvector for the i-th eigenvalue λi, so Mvi = λivi.
- Then, viTMvi = λiviTvi = λi‖vi‖2 > 0,
- which implies λi > 0.

For (←) we must prove that positive eigenvalues imply positive definiteness: expand x in the orthonormal eigenvector basis, x = Σi civi; then xTMx = Σi λici2 > 0 for every x ≠ 0.

- Theorem: If a matrix M = UTU, with Ux ≠ 0 for every x ≠ 0, then M is positive definite.
- Proof: If we can show that f = xTMx is always positive, then M must be positive definite. We can write f = xTUTUx. Provided that Ux gives a nonzero vector for all values of x except x = 0, we can write b = Ux, i.e. f = bTb = ‖b‖2 > 0,
- so f must always be positive.
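Both facts are easy to check numerically; the sketch below (assuming NumPy is available) verifies the eigenvalue criterion on an example matrix and confirms that M = UᵀU is positive semidefinite for a random U.

```python
import numpy as np

# A symmetric matrix with positive eigenvalues is positive definite.
M = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
eigs = np.linalg.eigvalsh(M)   # eigenvalues of a symmetric matrix, ascending
print(eigs)                    # [1. 3.] -> both positive

# The quadratic form x^T M x is positive for random nonzero x:
rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.standard_normal(2)
    assert x @ M @ x > 0

# M = U^T U is positive semidefinite for any U:
U = rng.standard_normal((3, 3))
M2 = U.T @ U
assert np.all(np.linalg.eigvalsh(M2) >= -1e-12)
```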

f: Rn→ R is a quadratic function if

f(x) = ½xTQx − bTx + c,

- where Q is symmetric.

Suppose the matrix P is non-symmetric. Since xTPTx = xTPx (a scalar equals its transpose), we have xTPx = xTQx with Q = (P + PT)/2, and this Q is symmetric.
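A quick numerical check (assuming NumPy) that replacing a non-symmetric P by its symmetric part Q = (P + Pᵀ)/2 leaves the quadratic form unchanged:

```python
import numpy as np

P = np.array([[1.0, 4.0],
              [0.0, 3.0]])   # non-symmetric example matrix
Q = (P + P.T) / 2            # symmetric part of P

# x^T P x == x^T Q x for every x, since x^T P^T x = x^T P x:
rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.standard_normal(2)
    assert np.isclose(x @ P @ x, x @ Q @ x)

print(Q)   # [[1., 2.], [2., 3.]]
```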

Given the quadratic function f(x) = ½xTQx − bTx + c:

If Q is positive definite, then f is a parabolic “bowl.”

Two other shapes can result from the quadratic form.

- If Q is negative definite, then f is a parabolic “bowl” upside down.
- If Q is indefinite, then f describes a saddle.

Quadratics are useful in the study of optimization.

- Often, objective functions are “close to” quadratic near the solution.
- It is easier to analyze the behavior of algorithms when applied to quadratics.
- Analysis of algorithms for quadratics gives insight into their behavior in general.

The derivative of f: R → R is a function f ′: R → R given by

f ′(x) = limh→0 [f(x + h) − f(x)] / h,

- if the limit exists.

Definition: A real-valued function f: Rn→ R is said to be continuously differentiable if the partial derivatives ∂f/∂x1, …, ∂f/∂xn

- exist for each x in Rn and are continuous functions of x.
- In this case, we say f ∈ C1 (a smooth function of class C1).

Definition: The gradient of f: Rn→ R is a function ∇f: Rn→ Rn given by

∇f(x) = [∂f/∂x1, …, ∂f/∂xn]T.

The gradient defines the (hyper)plane approximating the function infinitesimally.
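The gradient can be sanity-checked by central finite differences. The sketch below uses the arbitrary example f(x, y) = x² + 3xy; both the analytic gradient and the difference quotient are illustrative choices, not from the slides.

```python
def f(v):
    x, y = v
    return x**2 + 3*x*y

def grad_f(v):
    # analytic gradient: [df/dx, df/dy]
    x, y = v
    return [2*x + 3*y, 3*x]

def num_grad(f, v, h=1e-6):
    # central differences: (f(x + h e_i) - f(x - h e_i)) / (2h)
    g = []
    for i in range(len(v)):
        vp, vm = list(v), list(v)
        vp[i] += h
        vm[i] -= h
        g.append((f(vp) - f(vm)) / (2*h))
    return g

v = [1.0, 2.0]
print(grad_f(v))       # [8.0, 3.0]
print(num_grad(f, v))  # numerically the same
```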

Proposition 1: Over unit vectors v, the directional derivative d/dα f(x + αv)|α=0 = ∇f(x)Tv

is maximal choosing v = ∇f(x)/‖∇f(x)‖.

Intuitive: the gradient points in the direction of greatest change.

Prove it!

Proof:

- Assign g(α) = f(x + αv);
- by the chain rule, g′(0) = ∇f(x)Tv.
- For v = ∇f(x)/‖∇f(x)‖ this gives g′(0) = ‖∇f(x)‖.
- On the other hand, for a general unit vector v, the Cauchy–Schwarz inequality gives ∇f(x)Tv ≤ ‖∇f(x)‖‖v‖ = ‖∇f(x)‖.

Proposition 2: Let f: Rn→ R be a smooth function (C1) around p.

- If f has a local minimum (maximum) at p, then ∇f(p) = 0.

Intuitive: a necessary condition for a local min (max).

We found the best INFINITESIMAL DIRECTION at each point.

- Looking for the minimum is then a “blind man” procedure: feeling the local slope and stepping downhill.
- How can we derive the way to the minimum using this knowledge?

The derivative of f: Rn→ Rm is a function Df: Rn→ Rm×n given by

Df(x) = [∂fi/∂xj],  i = 1, …, m,  j = 1, …, n,

called the Jacobian.

Note that for f: Rn→ R, we have ∇f(x) = Df(x)T.

If the derivative of ∇f exists, we say that f is twice differentiable.

- Write the second derivative as D2f (or F), and call it the Hessian of f: D2f(x) = [∂2f/∂xi∂xj].

The level set of a function f: Rn→ R at level c is the set of points S = {x: f(x) = c}.

Fact: ∇f(x0) is orthogonal to the level set at x0.

Proof of fact:

- Imagine a particle traveling along the level set.
- Let g(t) be the position of the particle at time t, with g(0) = x0.
- Note that f(g(t)) = constant for all t.
- The velocity vector g′(t) is tangent to the level set.
- Consider F(t) = f(g(t)). We have F′(0) = 0. By the chain rule, F′(0) = ∇f(g(0))Tg′(0) = ∇f(x0)Tg′(0).
- Hence, ∇f(x0) and g′(0) are orthogonal.

Suppose f: R → R is in C1. Then,

f(x0 + h) = f(x0) + f ′(x0)h + o(h),

- where o(h) is a term such that o(h)/h → 0 as h → 0.
- At x0, f can be approximated by a linear function, and the approximation gets better the closer we are to x0.

Suppose f: R → R is in C2. Then,

f(x0 + h) = f(x0) + f ′(x0)h + ½f ″(x0)h2 + o(h2).

- At x0, f can be approximated by a quadratic function.

Suppose f: Rn→ R.

- If f is in C1, then f(x0 + h) = f(x0) + ∇f(x0)Th + o(‖h‖).
- If f is in C2, then f(x0 + h) = f(x0) + ∇f(x0)Th + ½hTD2f(x0)h + o(‖h‖2).
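A small numerical illustration of these expansions, using f(x) = eˣ around x0 = 0 (so f(x0) = f′(x0) = f″(x0) = 1): the linear approximation error shrinks like h², the quadratic one like h³.

```python
import math

x0 = 0.0
f = math.exp   # f(x0) = f'(x0) = f''(x0) = 1 at x0 = 0

for h in (0.1, 0.01):
    linear    = f(x0) + h              # f(x0) + f'(x0) h
    quadratic = linear + 0.5 * h**2    # ... + (1/2) f''(x0) h^2
    # the o(h) and o(h^2) remainders:
    print(h, abs(f(x0 + h) - linear), abs(f(x0 + h) - quadratic))
```

Shrinking h by a factor of 10 shrinks the linear error by roughly 100 and the quadratic error by roughly 1000.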

We already know that ∇f(x0) is orthogonal to the level set at x0.

- Suppose ∇f(x0) ≠ 0.
- Fact: ∇f points in the direction of increasing f.

Consider xα = x0 + α∇f(x0), α > 0.

- By Taylor's formula, f(xα) = f(x0) + α‖∇f(x0)‖2 + o(α).
- Therefore, for sufficiently small α > 0,

f(xα) > f(x0)

This theorem is the link from the previous gradient properties to the constructive algorithm.

- The problem: minimize f(x), f: Rn→ R, C1 smooth.

We introduce a model for the algorithm:

Data: an initial point x0.

Step 0: set i = 0.

Step 1: if ∇f(xi) = 0, stop;

else, compute a search direction hi.

Step 2: compute the step-size λi, e.g. λi = argminλ≥0 f(xi + λhi).

Step 3: set xi+1 = xi + λihi, i = i + 1; go to Step 1.

The Theorem (Wolfe):

- Suppose f: Rn→ R is C1 smooth, and there exists a continuous function k: Rn→ [0,1] with k(x) > 0 whenever ∇f(x) ≠ 0,
- And the search vectors constructed by the model algorithm satisfy:

∇f(xi)Thi ≤ −k(xi)‖∇f(xi)‖‖hi‖

And hi = 0 only if ∇f(xi) = 0.

- Then,
- if {xi} is the sequence constructed by the algorithm model,
- any accumulation point y of this sequence satisfies ∇f(y) = 0.

The theorem has a very intuitive interpretation:

- Always go in a descent direction (one making a strictly negative inner product with the gradient).

The principal differences between the various descent algorithms lie in the procedure for determining successive search directions.

We now use what we have learned to implement the most basic minimization technique: the method of Steepest Descent.

- First we introduce the algorithm, which is a version of the model algorithm.
- The problem: minimize f(x), f: Rn→ R, C1 smooth.

Data: an initial point x0.

Step 0: set i = 0.

Step 1: if ∇f(xi) = 0, stop;

else, compute the search direction hi = −∇f(xi).

Step 2: compute the step-size λi = argminλ≥0 f(xi + λhi).

Step 3: set xi+1 = xi + λihi, i = i + 1; go to Step 1.
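A minimal sketch of this algorithm (assuming NumPy) on the quadratic f(x) = ½xᵀAx − bᵀx, for which the exact line search has the closed form λ = gᵀg / gᵀAg with g = ∇f(x); the matrix A and vector b are arbitrary examples.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # symmetric positive definite
b = np.array([1.0, 1.0])

x = np.zeros(2)                  # Step 0: starting point, i = 0
for i in range(100):
    g = A @ x - b                # gradient of f at x
    if np.linalg.norm(g) < 1e-10:
        break                    # Step 1: stop when the gradient vanishes
    h = -g                       # Step 1: steepest-descent direction
    lam = (g @ g) / (g @ A @ g)  # Step 2: exact line search step size
    x = x + lam * h              # Step 3: update and repeat

print(x)                         # converges to the solution of A x = b
print(np.linalg.solve(A, b))     # the same point
```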

Theorem:

- If {xi} is a sequence constructed by the SD algorithm, then every accumulation point y of the sequence satisfies ∇f(y) = 0.
- Proof: follows from Wolfe's theorem, since hi = −∇f(xi) satisfies the descent condition.

Remark: Wolfe's theorem gives us numerical stability even if the derivatives aren't given analytically (i.e., are calculated numerically).

How long a step to take?

Note the search direction is hi = −∇f(xi).

- We are limited to a line search along this direction.
- Choose λ to minimize f(xi + λhi): the directional derivative must equal zero.
- From the chain rule: d/dλ f(xi + λhi) = ∇f(xi + λhi)Thi = −∇f(xi+1)T∇f(xi) = 0.
- Therefore the method of steepest descent zigzags: each new gradient is orthogonal to the previous one.

They are orthogonal!
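This orthogonality is easy to verify numerically (assuming NumPy): with exact line search on a quadratic, the inner product of consecutive gradients is zero to machine precision.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # symmetric positive definite example
b = np.array([1.0, 1.0])

x = np.array([2.0, -1.0])
g = A @ x - b                   # initial gradient
dots = []
for _ in range(5):
    lam = (g @ g) / (g @ A @ g) # exact line search along -g
    x = x + lam * (-g)
    g_next = A @ x - b          # gradient after the step
    dots.append(g @ g_next)     # inner product of consecutive gradients
    g = g_next

print(dots)                     # all numerically zero
```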

Example: Find the minimum when x1 is allowed to vary from 0.5 to 1.5 and x2 is allowed to vary from 0 to 2.

From now on we assume we want to minimize the quadratic function

f(x) = ½xTAx − bTx + c.

- This is equivalent to solving the linear problem Ax = b,

since ∇f(x) = Ax − b if A is symmetric.

Each ellipsoid is a level set on which f(x) is constant.

In general, the solution x lies at the intersection point of n hyperplanes, each having dimension n − 1.
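A quick check (assuming NumPy) that the minimizer of the quadratic is exactly the solution of Ax = b, on an arbitrary symmetric positive definite example:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # symmetric positive definite
b = np.array([1.0, 2.0])

def f(x):
    # the quadratic 1/2 x^T A x - b^T x (constant c omitted)
    return 0.5 * x @ A @ x - b @ x

x_star = np.linalg.solve(A, b)  # solves A x = b

# f is no smaller at any perturbed point, so x_star is the minimizer:
rng = np.random.default_rng(2)
for _ in range(100):
    assert f(x_star + 0.1 * rng.standard_normal(2)) >= f(x_star)

print(x_star)
```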

What is the problem with steepest descent?

- We can repeat the same directions over and over…
- Wouldn’t it be better if, every time we took a step, we got it right the first time?
- Conjugate gradient does: it requires at most n gradient evaluations and n line searches.

- First, let’s define the error as ei = xi − x*, where x* is the exact solution.
- ei is a vector that indicates how far we are from the solution.

The idea: from a start point x0,

- let’s pick a set of orthogonal search directions d0, d1, …, dn−1;
- in each search direction, we’ll take exactly one step,

and that step will be just the right length to line up evenly with x*.

Using the coordinate axes as search directions…

- Unfortunately, this method only works if you already know the answer.

Given di, how do we calculate the step-size αi?

- ei+1 should be orthogonal to di: diTei+1 = diT(ei + αidi) = 0, which gives αi = −diTei/diTdi — but ei cannot be computed without knowing x*.

- Conjugate gradient algorithm for minimizing f:

Step 0: d0 = r0 = b − Ax0

Step 1: αi = riTri / diTAdi

Step 2: xi+1 = xi + αidi

Step 3: ri+1 = ri − αiAdi, βi+1 = ri+1Tri+1 / riTri

Step 4: di+1 = ri+1 + βi+1di, and repeat n times.
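A minimal sketch of these steps (assuming NumPy), in the residual form used in Shewchuk's tutorial cited in the Sources; the matrix A and vector b are arbitrary symmetric positive definite examples.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-12):
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                        # Step 0: initial residual
    d = r.copy()                         # Step 0: first search direction
    for _ in range(n):                   # at most n steps in exact arithmetic
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (d @ A @ d)    # Step 1: step size
        x = x + alpha * d                # Step 2: take the step
        r_new = r - alpha * (A @ d)      # Step 3: update the residual
        beta = (r_new @ r_new) / (r @ r) # Step 3: Gram-Schmidt coefficient
        d = r_new + beta * d             # Step 4: next A-conjugate direction
        r = r_new
    return x

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = conjugate_gradient(A, b)
print(x)                                 # matches np.linalg.solve(A, b)
```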

Sources

- Jyh-Shing Roger Jang, Chuen-Tsai Sun and Eiji Mizutani, Slides for Ch. 5 of “Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence”, First Edition, Prentice Hall, 1997.
- Djamel Bouchaffra, Soft Computing. Course materials. Oakland University. Fall 2005.
- Lecture slides, Soft Computing. Course materials. Dipartimento di Elettronica e Informazione, Politecnico di Milano. 2004.
- Jeen-Shing Wang, Course: Introduction to Neural Networks. Lecture notes. Department of Electrical Engineering, National Cheng Kung University. Fall 2005.
- Carlo Tomasi, Mathematical Methods for Robotics and Vision. Stanford University. Fall 2000.
- Petros Ioannou, Jing Sun, Robust Adaptive Control. Prentice-Hall, Inc, Upper Saddle River, NJ, 1996.
- Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Edition 1¼. School of Computer Science, Carnegie Mellon University, Pittsburgh. August 4, 1994.
- Gordon C. Everstine, Selected Topics in Linear Algebra. The George Washington University. 8 June 2004.
