### CHAPTER 10 EVOLUTIONARY COMPUTATION II: GENERAL METHODS AND THEORY

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

Organization of chapter in ISSO

- Introduction
- Evolution strategy and evolutionary programming; comparisons with GAs
- Schema theory for GAs
- What makes a problem hard?
- Convergence theory
- No free lunch theorems

Methods of EC

- Genetic algorithms (GAs), evolution strategy (ES), and evolutionary programming (EP) are most common EC methods
- Many modern EC implementations borrow aspects from one or more EC methods
- Generally, ES is used for function optimization and EP for AI applications such as automatic programming

ES Algorithm with Noise-Free Loss Measurements

Step 0 (initialization) Randomly or deterministically generate initial population of N values of $\theta$ and evaluate L for each of the values.

Step 1 (offspring) Generate $\lambda$ offspring from current population of N candidate $\theta$ values such that all $\lambda$ values satisfy direct or indirect constraints on $\theta$.

Step 2 (selection) For $(N+\lambda)$-ES, select N best values from combined population of N original values plus $\lambda$ offspring; for $(N,\lambda)$-ES, select N best values from population of $\lambda > N$ offspring only.

Step 3 (repeat or terminate) Repeat steps 1 and 2 or terminate.
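The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from ISSO: the scalar search interval, the Gaussian mutation with fixed standard deviation `sigma`, the random choice of parent, and the fixed generation budget are all choices made here for illustration.

```python
import random

def evolution_strategy(loss, lower, upper, n_pop=5, n_offspring=20,
                       sigma=0.5, plus=True, generations=50, seed=0):
    """Minimize a scalar loss on [lower, upper] with (N+lambda)- or (N,lambda)-ES."""
    if not plus and n_offspring <= n_pop:
        raise ValueError("(N,lambda)-ES needs lambda > N")
    rng = random.Random(seed)
    # Step 0: random initial population of N candidate theta values.
    population = [rng.uniform(lower, upper) for _ in range(n_pop)]
    for _ in range(generations):
        # Step 1: each offspring is a Gaussian perturbation of a randomly chosen
        # parent (an assumed mutation rule), clipped to satisfy the constraint.
        offspring = []
        for _ in range(n_offspring):
            child = rng.choice(population) + rng.gauss(0.0, sigma)
            offspring.append(min(max(child, lower), upper))
        # Step 2: pick the selection pool, then keep the N lowest-loss values.
        pool = population + offspring if plus else offspring
        population = sorted(pool, key=loss)[:n_pop]
    # Step 3: here termination is simply a fixed number of generations.
    return population[0]

def quadratic(theta):
    """Hypothetical test loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

best_plus = evolution_strategy(quadratic, 0.0, 10.0, plus=True)
best_comma = evolution_strategy(quadratic, 0.0, 10.0, plus=False)
print(best_plus, best_comma)
```

The only difference between the two variants is the selection pool in Step 2: the plus form can never lose its best value, while the comma form discards all parents each generation.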

Schema Theory for GAs

- Key innovation in Holland (1975) is a form of theoretical foundation for GAs based on schemas
- Represents first attempt at serious theoretical analysis
- But not entirely successful, as “leap of faith” required to relate schema theory to actual convergence of GA

- “GAs work by discovering, emphasizing, and recombining good ‘building blocks’ of solutions in a highly parallel fashion.” (Melanie Mitchell, An Introduction to Genetic Algorithms [p. 27], 1996, paraphrasing John Holland)
- Statement above more intuitive than formal
- Notion of building block is characterized via schemas
- Schemas are propagated or destroyed according to the laws of probability

Schema Theory for GAs (cont’d)

- Schema is template for chromosomes in GAs
- Example: [* 1 0 * * * * 1], where the * symbol represents a don’t care (or free) element
- [11001101] is specific instance of this schema

- Schemas sometimes called building blocks of GAs
- Two fundamental results: Schema theorem and implicit parallelism
- Schema theorem says that better templates dominate the population as generations proceed
- Implicit parallelism says that GA processes many more than N schemas ($\gg N$) at each iteration
- Schema theory is controversial
- Not connected to algorithm performance in same direct way as usual convergence theory for iterates of algorithm
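The template idea can be made concrete with a short Python sketch. It encodes schemas and chromosomes as strings, which is an assumption made here for convenience. The helper names are hypothetical, and "order" (the number of fixed positions) is a standard schema-theory term that the slides themselves do not define.

```python
def matches(schema, chromosome):
    """True if the bit string is an instance of the schema ('*' = don't care)."""
    return len(schema) == len(chromosome) and all(
        s == "*" or s == c for s, c in zip(schema, chromosome))

def order(schema):
    """Number of fixed (non-*) positions in the schema."""
    return sum(s != "*" for s in schema)

def schemas_containing(chromosome):
    """Count of schemas the chromosome is an instance of.

    Each position is either kept as its own bit or replaced by '*',
    giving 2**B schemas for a chromosome of B bits.
    """
    return 2 ** len(chromosome)

schema = "*10****1"          # the example schema [* 1 0 * * * * 1]
chromosome = "11001101"

print(matches(schema, chromosome))     # True
print(order(schema))                   # 3
print(schemas_containing(chromosome))  # 256
```

Because every chromosome is simultaneously an instance of 2^B schemas, a population of N chromosomes carries information about far more than N schemas, which is the intuition behind implicit parallelism.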

Convergence Theory via Markov Chains

- Schema theory inadequate
- Mathematics behind schema theory not fully rigorous
- Unjustified claims about implications of schema theory

- More rigorous convergence theory exists
- Pertains to noise-free loss (fitness) measurements
- Pertains to finite representation (e.g., bit coding or floating point representation on digital computer)

- Convergence theory relies on Markov chains
- Each state in chain represents possible population
- Markov transition matrix P contains all information for Markov chain analysis

GA Markov Chain Model

- GAs with binary bit coding can be modeled as (discrete state) Markov chains
- Recall states in chain represent possible populations
- ith element of probability vector $p_k$ represents probability of achieving ith population at iteration k
- Transition matrix: The i, j element of P represents the probability of population i producing population j through the selection, crossover and mutation operations
- Depends on loss (fitness) function, selection method, and reproduction and mutation parameters

- Given transition matrix P, it is known that $p_{k+1}^T = p_k^T P$, and hence $p_k^T = p_0^T P^k$
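As a rough illustration of how the probability vector evolves, the sketch below propagates a distribution through a small row-stochastic matrix, where row i holds the probabilities of moving from state i to each state j. The three-state matrix is made up for this example and does not come from any GA; a real GA transition matrix has one state per possible population.

```python
def step(p, P):
    """One iteration of the chain: the row vector p^T P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

def distribution_after(p0, P, k):
    """Distribution over states after k iterations, starting from p0."""
    p = list(p0)
    for _ in range(k):
        p = step(p, P)
    return p

# Hypothetical 3-state chain; each row sums to 1.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
p0 = [1.0, 0.0, 0.0]   # start in state 0 with certainty

for k in (1, 5, 50):
    print(k, distribution_after(p0, P, k))
```

For this particular matrix the distribution settles at (0.25, 0.5, 0.25) whatever the starting vector, which is the kind of limiting behavior the convergence results rely on.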

Rudolph (1994) and Markov Chain Analysis for Canonical GA

- Rudolph (1994, IEEE Trans. Neural Nets.) uses Markov chain analysis to study “canonical GA” (CGA)
- CGA includes binary bit coding, crossover, mutation, and “roulette wheel” selection
- CGA is focus of seminal book, Holland (1975)

- CGA does not include elitism; lack of elitism is a critical aspect of the theoretical analysis
- CGA assumes mutation probability $0 < P_m < 1$ and single-point crossover probability $0 \le P_c \le 1$
- Key preliminary result: CGA is ergodic Markov chain:
- There exists a unique limiting distribution for the states of chain
- Nonzero probability of being in any state regardless of initial condition

Rudolph (1994) and Markov Chain Analysis for CGA (cont’d)

- Ergodicity for CGA provides a negative result on convergence in Rudolph (1994)
- Let $\hat{L}_{\min,k}$ denote lowest of N (= population size) loss values within population at iteration k
- $\hat{L}_{\min,k}$ represents loss value for $\theta$ in population k that has maximum fitness value

- Main theorem: CGA satisfies
$$\lim_{k \to \infty} P\bigl(\hat{L}_{\min,k} = L(\theta^*)\bigr) < 1$$
(above limit on left-hand side exists by ergodicity)

- Implies CGA does not converge to the global optimum

Rudolph (1994) and Markov Chain Analysis for CGA (cont’d)

- Fundamental problem with CGA is that optimal solutions are found but then lost
- CGA has no mechanism for retaining optimal solution
- Rudolph discusses modification to CGA yielding positive convergence results
- Appends “super individual” to each population
- Super individual represents best chromosome so far
- Not eligible for GA operations (selection, crossover, mutation)
- Not same as elitism

- CGA with added super individual converges in probability
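The "found but then lost" behavior can be observed in a small simulation. The sketch below is only an illustration in the spirit of the CGA, not Rudolph's construction: the fitness function (one plus the number of 1 bits), the parameter values, and the function names are all assumptions. The best-so-far chromosome is stored outside the population and never takes part in selection, crossover, or mutation.

```python
import random

def fitness(bits):
    """Hypothetical fitness: 1 + number of ones (positive, as roulette selection needs)."""
    return 1 + sum(bits)

def next_generation(pop, pc, pm, rng):
    """Roulette-wheel selection, single-point crossover, bitwise mutation."""
    weights = [fitness(c) for c in pop]
    parents = rng.choices(pop, weights=weights, k=len(pop))
    children = []
    for a, b in zip(parents[0::2], parents[1::2]):
        a, b = a[:], b[:]
        if rng.random() < pc:
            cut = rng.randrange(1, len(a))
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        children += [a, b]
    # Flip each bit independently with probability pm.
    return [[bit ^ (rng.random() < pm) for bit in c] for c in children]

def run_cga(n_pop=6, n_bits=8, pc=0.7, pm=0.05, generations=200, seed=1):
    """Return (best fitness in population, super-individual fitness) per generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_pop)]
    super_individual = max(pop, key=fitness)[:]
    history = []
    for _ in range(generations):
        pop = next_generation(pop, pc, pm, rng)   # n_pop must be even
        best_now = max(pop, key=fitness)
        if fitness(best_now) > fitness(super_individual):
            super_individual = best_now[:]        # kept outside the population
        history.append((fitness(best_now), fitness(super_individual)))
    return history

history = run_cga()
drops = sum(1 for (b0, _), (b1, _) in zip(history, history[1:]) if b1 < b0)
print("generations where the population's best got worse:", drops)
print("final population best / super individual:", history[-1])
```

The first number in each pair can go down from one generation to the next, while the second can only stay level or go up.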

Contrast of Suzuki (1995) and Rudolph (1994) in Markov Chain Analysis for GA

- Suzuki (1995, IEEE Trans. Systems, Man, and Cyber.) uses Markov chain analysis to study GA with elitism
- Same as CGA of Rudolph (1994) except for elitism

- Suzuki (1995) only considers unique states (populations)
- Rudolph (1994) includes redundant states

- With N = population size and B = no. of bits/chromosome:
$\binom{N + 2^B - 1}{N}$ unique states in Suzuki (1995),

$2^{NB}$ states in Rudolph (1994) (much larger than number of unique states above)

- Above affects bookkeeping; does not fundamentally change relative results of Suzuki (1995) and Rudolph (1994)
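A quick calculation shows how the two ways of counting states compare. The sketch assumes that the number of unique populations is the number of multisets of N chromosomes drawn from the 2^B possible bit strings, and that the count with redundant states is the number of ordered bit strings of length NB. The function names are hypothetical.

```python
from math import comb

def unique_populations(n_pop, n_bits):
    """Distinct populations: multisets of N chromosomes from 2**B bit strings."""
    return comb(n_pop + 2 ** n_bits - 1, n_pop)

def ordered_populations(n_pop, n_bits):
    """Populations counted with redundancy: every string of N*B bits."""
    return 2 ** (n_pop * n_bits)

for n, b in [(2, 2), (4, 4), (6, 6)]:
    print(n, b, unique_populations(n, b), ordered_populations(n, b))

print(unique_populations(6, 6))   # 119877472, roughly 1.2e8
```

Even for N = B = 6 the unique-state count is on the order of 10^8, so a transition matrix over these states has about 10^16 entries.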

Convergence Under Elitism

- In both CGA case (Rudolph, 1994) and case with elitism (Suzuki, 1995) the limit exists:
$$\bar{p}^T \equiv \lim_{k \to \infty} p_k^T$$
(dimension of $\bar{p}$ differs according to definition of states, unique or nonunique as on previous slide)

- Suzuki (1995) assumes each population includes one elite element and that crossover probability $P_c = 1$
- Let $\bar{p}_j$ represent jth element of $\bar{p}$, and J represent indices j where population j includes chromosome achieving $L(\theta^*)$
- Then from Suzuki (1995): $$\sum_{j \in J} \bar{p}_j = 1$$
- Implies GA with elitism converges in probability to set of optima
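The contrast between the two results can be mimicked with a toy two-state chain: state 0 stands for "population does not contain an optimal chromosome" and state 1 for "population contains one". Both matrices below are invented for illustration and are far smaller than any real GA chain. The only structural difference is that the elitist chain never leaves state 1.

```python
def limiting_mass(P, optimal_states, p0, k=200):
    """Approximate the limiting probability of the optimal states by iterating p^T P."""
    n = len(P)
    p = list(p0)
    for _ in range(k):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return sum(p[j] for j in optimal_states)

# Hypothetical chain without elitism: the optimum can be lost (row 1 leaks to state 0).
P_no_elitism = [[0.7, 0.3],
                [0.2, 0.8]]

# Hypothetical chain with elitism: once found, the optimum is retained.
P_elitism = [[0.7, 0.3],
             [0.0, 1.0]]

J = [1]                 # indices of states containing an optimal chromosome
p0 = [1.0, 0.0]         # start without the optimum

print(limiting_mass(P_no_elitism, J, p0))   # settles near 0.6, strictly below 1
print(limiting_mass(P_elitism, J, p0))      # approaches 1
```

In the first chain every state keeps nonzero limiting probability, so the mass on J stays below 1; in the second the mass on J tends to 1.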

Calculation of Stationary Distribution

- Markov chain theory provides useful conceptual device
- Practical calculation difficult due to explosive growth of number of possible populations (states)
- Growth is in terms of factorials of N and bit string length (B)
- Practical calculation of $p_k$ usually impossible due to difficulty in getting P
- Transition matrix can be very large in practice
- E.g., if N = B = 6, P is a $10^8 \times 10^8$ matrix!
- Real problems have N and B much larger than 6

- Ongoing work attempts to severely reduce dimension by limiting states to only most important (e.g., Spears, 1999; Moey and Rowe, 2004)

Example 10.2 from ISSO: Markov Chain Calculations for Small-Scale Implementation

- Consider a scalar loss $L(\theta)$ on the domain $\Theta = [0, 15]$ (formula not preserved in this transcript)
- Function has local and global minimum; plot on next slide
- Several GA implementations with very small population sizes (N) and numbers of bits (B)
- Small scale implementations imply Markov transition matrices are computable
- But still not trivial, as matrix dimensions range from approximately $2000 \times 2000$ to $4000 \times 4000$

Loss Function for Example 10.2 in ISSO

Markov chain theory provides probability of finding solution ($\theta^* = 15$) in given number of iterations

Example 10.2 (cont’d): Probability Calculations for Very Small-Scale GAs

Summary of GA Convergence Theory

- Schema theory (Holland, 1975) was most popular method for theoretical analysis until approximately mid-1990s
- Schema theory not fully rigorous and not fully connected to actual algorithm performance

- Markov chain theory provides more formal means of convergence—and convergence rate—analysis
- Rudolph (1994) used Markov chains to provide largely negative result on convergence for canonical GAs
- Canonical GA does not converge to optimum

- Suzuki (1995) considered GAs with elitism; unlike Rudolph (1994), GA is now convergent
- Challenges exist in practical calculation of Markov transition matrix

No Free Lunch Theorems (Reprise, Chap. 1)

- No free lunch (NFL) Theorems apply to EC algorithms
- Theorems imply there can be no universally efficient EC algorithm
- Performance of one algorithm when averaged over all problems is identical to that of any other algorithm

- Suppose EC algorithm A applied to loss L
- Let $\hat{L}_n$ denote lowest loss value from most recent N population elements after $n \ge N$ unique function evaluations

- Consider the probability that $\hat{L}_n = \ell$ after n unique evaluations of the loss: $P(\hat{L}_n = \ell \mid L, A)$

NFL theorems state that the sum of above probabilities over all loss functions is independent of A
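The statement can be checked by brute force on a tiny problem. The sketch below is an illustration under simplifying assumptions, not an EC algorithm: it uses four search points, three possible loss values, and two deterministic search rules that never revisit a point. It takes the lowest loss over all n evaluations, which corresponds to N = n. For each rule it counts, over all 81 possible loss functions, how often each lowest loss value occurs.

```python
from collections import Counter
from itertools import product

X = range(4)   # search points (hypothetical tiny domain)
Y = range(3)   # possible loss values

def fixed_order(observed):
    """Algorithm A: always visit the lowest-numbered unvisited point."""
    visited = {x for x, _ in observed}
    return min(x for x in X if x not in visited)

def adaptive(observed):
    """Algorithm A': start at point 3, then let the last observed loss steer the choice."""
    if not observed:
        return 3
    visited = {x for x, _ in observed}
    unvisited = [x for x in X if x not in visited]
    return min(unvisited) if observed[-1][1] % 2 == 0 else max(unvisited)

def best_loss_histogram(algorithm, n):
    """Count, over every loss function L: X -> Y, the lowest loss seen in n evaluations."""
    counts = Counter()
    for values in product(Y, repeat=len(X)):   # one tuple per loss function
        observed = []
        for _ in range(n):
            x = algorithm(observed)
            observed.append((x, values[x]))
        counts[min(v for _, v in observed)] += 1
    return dict(counts)

for n in (1, 2, 3):
    print(n, best_loss_histogram(fixed_order, n), best_loss_histogram(adaptive, n))
```

The two histograms agree for every n, even though the rules visit points in different orders and one of them adapts to what it has seen. Since both rules are deterministic, each count is the sum over all loss functions of a probability that is either 0 or 1.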

Comparison of Algorithms for Stochastic Optimization in Chaps. 2 – 10 of ISSO

- Table on next slide is a rough summary of relative merits of several algorithms for stochastic optimization
- Comparisons based on semi-subjective impressions from numerical experience (author and others) and theoretical or analytical evidence
- NFL theorems not generally relevant as only considering “typical” problems of interest, not all possible problems

- Table does not consider root-finding per se
- Table is for “basic” implementation forms of algorithms
- Ratings range over L (low), ML (medium-low), M (medium), MH (medium-high), and H (high)
- These scales are for stochastic optimization setting and have no meaning relative to classical deterministic methods

Comparison of Algorithms in Chaps. 2 – 10 of ISSO (table not included in transcript)
