
# Coevolving Solutions to the Shortest Common Superstring Problem



### Coevolving Solutions to the Shortest Common Superstring Problem

Assaf Zaritsky & Moshe Sipper

Ben-Gurion University, Israel

www.cs.bgu.ac.il/~assafza

Outline

• The “Shortest Common Superstring” problem.

• DNA sequencing and the input domain.

• Standard and cooperative coevolutionary genetic algorithm (GA).

• The Puzzle approach.

• Conclusions and future work.

• Messy Puzzle.

• Let S = {s1,…,sn} be a set of strings (blocks) over some alphabet Σ. A superstring of S is a string x such that each si in S is a substring of x.

• Problem: Find shortest (common) superstring.

• NP-hard (the decision version is NP-complete).

• MAX-SNP hard.

• Motivation: DNA sequencing, data compression.

• S = {ate, half, lethal, alpha, alfalfa}

• A trivial superstring is “atehalflethalalphaalfalfa” of length 25 (a simple concatenation of all blocks).

• A shortest common superstring is “lethalphalfalfate” of length 17.

• Note that a “compressed” permutation of the blocks is actually a superstring.
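The example above can be checked with a short sketch. The helper names `overlap` and `compress` are illustrative, and reading “compressed” as merging each block into the running string with its maximal suffix-prefix overlap is an assumption, not something the slides spell out.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(blocks: list[str]) -> str:
    """Merge blocks left to right, overlapping each with the string built so far."""
    result = ""
    for block in blocks:
        result += block[overlap(result, block):]
    return result


blocks = ["ate", "half", "lethal", "alpha", "alfalfa"]
trivial = "".join(blocks)  # plain concatenation
ordered = compress(["lethal", "alpha", "half", "alfalfa", "ate"])

print(len(trivial), trivial)   # 25 atehalflethalalphaalfalfa
print(len(ordered), ordered)   # 17 lethalphalfalfate
print(all(b in ordered for b in blocks))  # True
```

Under this reading, searching for a short superstring amounts to searching over block orderings, which is what the genetic algorithms later in the talk do.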

• Several linear approximations for SCS have been proposed, most of which rely on greedy approaches.

• GREEDY: the most widely used heuristic in DNA sequencing.

• Conjecture [Blum 1994, Sweedyk 1999]: Superstring produced by GREEDY is of length at most two times the optimal.

• We are not aware of any previous evolutionary approach to the SCS problem.
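The slides name GREEDY without describing it. As commonly stated, the heuristic repeatedly merges the two strings with the largest overlap until one string remains. The following is a minimal sketch of that textbook formulation; the function names and the tie-breaking order are assumptions.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(blocks: list[str]) -> str:
    """Repeatedly merge the ordered pair of strings with the largest overlap."""
    # Drop duplicates and blocks already contained in another block.
    unique = sorted(set(blocks))
    strings = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(strings) > 1:
        i, j = max(permutations(range(len(strings)), 2),
                   key=lambda p: overlap(strings[p[0]], strings[p[1]]))
        merged = strings[i] + strings[j][overlap(strings[i], strings[j]):]
        strings = [s for k, s in enumerate(strings) if k not in (i, j)]
        strings.append(merged)
    return strings[0] if strings else ""


blocks = ["ate", "half", "lethal", "alpha", "alfalfa"]
result = greedy_superstring(blocks)
print(len(result), result)
```

Each merge keeps both inputs as substrings, so the result is always a superstring, though not necessarily a shortest one.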


DNA sequencing: the most common usage of the SCS problem.

• The problem: “read” a string of DNA.

• Short DNA strands can be read in the laboratory.

• To sequence a long DNA strand (the DNA sequence appears in many copies):

• Cut the DNA into short fragments using restriction enzymes.

• Sequence each of the resulting fragments.

• Order those fragments using an SCS algorithm.

The input strings used in the experiments were inspired by DNA sequencing:

NB: increasing the number of blocks results in exponential growth of the problem’s complexity.


    produce an initial population of individuals
    evaluate fitness of all individuals
    while termination condition not met do
        select fitter individuals for reproduction
        recombine individuals
        mutate individuals
        evaluate fitness of modified individuals
        generate a new population
    end while

• Given a set of strings as input, generate an initial population of random candidate solutions.

• The fitness of each individual depends on its length and accuracy.

• The GA uses selection, recombination, and mutation to create the next generation, each individual of which is then evaluated.

• These steps are repeated a predefined number of times or until the solution is deemed satisfactory.
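The loop described above can be sketched generically. This is an illustration rather than the authors' implementation: the function `evolve`, its parameters, the binary-tournament selection (swapped in here for brevity), and the toy bit-counting problem used to exercise it are all assumptions.

```python
import random


def evolve(population, fitness, recombine, mutate, generations, rng):
    """Run a simple generational GA and return (best_score, best_individual).

    Lower fitness is better, matching a superstring-length objective.
    """
    scores = [fitness(ind) for ind in population]
    best = min(zip(scores, population), key=lambda pair: pair[0])
    for _ in range(generations):
        def pick():
            # Binary tournament: an assumed stand-in for the selection step.
            a, b = rng.sample(range(len(population)), 2)
            return population[a] if scores[a] <= scores[b] else population[b]

        # Recombine and mutate selected parents to form the new population.
        population = [mutate(recombine(pick(), pick(), rng), rng)
                      for _ in range(len(population))]
        scores = [fitness(ind) for ind in population]
        generation_best = min(zip(scores, population), key=lambda pair: pair[0])
        if generation_best[0] < best[0]:
            best = generation_best
    return best


# Toy placeholder problem: minimise the number of ones in a bit list.
def one_point(p1, p2, rng):
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]


def flip_one(individual, rng):
    child = list(individual)
    if rng.random() < 0.2:
        i = rng.randrange(len(child))
        child[i] = 1 - child[i]
    return child


rng = random.Random(0)
start = [[rng.randint(0, 1) for _ in range(12)] for _ in range(20)]
best_score, best_individual = evolve(start, sum, one_point, flip_one, 30, rng)
print(best_score, best_individual)
```

The best individual is tracked across generations, so the returned score is never worse than the best score in the initial population.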

• Blocks of the input set are atomic components.

• Representation: An individual’s genome is represented as a sequence of blocks.

• An individual may have missing blocks or contain duplicate copies of the same block.

• Permutation Representation: Good or Bad?

• Evaluation: the fitness of an individual is the length of its compressed genome + the total length of the blocks that are not covered by the individual (lower is better).

• Genetic operators:

• Fitness-proportionate selection.

• Two-point recombination. Allows growth and reduction in the genome’s length.

• Block-change mutation.

• S = {s1,s2,s3,s4}; s1 = 0011, s2 = 1100, s3 = 1001, s4 = 111.

• Fitness(<s2,s1>) = |110011| + |111| = 6 + 3 = 9.

• Fitness(<s4,s2,s1,s4>) = |11100111| = 8.

• Recombination:

• p1 = <s1,|s2,s3|,s4>

• p2 = <s4,|s1,s3,s2|>

• p3 = recombine1(p1,p2) = <s1,s1,s3,s2,s4>

• p4 = recombine2(p1,p2) = <s4,s2,s3>

• mutate (<s1,s2,s2>) = <s1,s4,s2>
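The worked example above can be reproduced with a small sketch. The function names are illustrative, and two details are assumptions: “compressed” is taken to mean merging adjacent blocks with maximal overlap, and a block counts as covered when it is a substring of the compressed genome (which is what the first fitness value implies, since s3 is not penalised).

```python
import random


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome):
    """Merge the genome's blocks left to right with maximal overlap."""
    text = ""
    for block in genome:
        text += block[overlap(text, block):]
    return text


def fitness(genome, blocks):
    """Compressed length plus the total length of the blocks left uncovered."""
    text = compress(genome)
    return len(text) + sum(len(b) for b in blocks if b not in text)


def two_point(p1, p2, cuts1, cuts2):
    """Swap segment p1[a:b] with p2[c:d]; offspring lengths may differ."""
    (a, b), (c, d) = cuts1, cuts2
    return p1[:a] + p2[c:d] + p1[b:], p2[:c] + p1[a:b] + p2[d:]


def block_change(genome, blocks, rng):
    """Replace one randomly chosen position with a randomly chosen input block."""
    child = list(genome)
    child[rng.randrange(len(child))] = rng.choice(blocks)
    return child


s1, s2, s3, s4 = "0011", "1100", "1001", "111"
S = [s1, s2, s3, s4]
print(fitness([s2, s1], S))           # 9 = |110011| + |111|
print(fitness([s4, s2, s1, s4], S))   # 8 = |11100111|

p3, p4 = two_point([s1, s2, s3, s4], [s4, s1, s3, s2], (1, 3), (1, 4))
print(p3 == [s1, s1, s3, s2, s4], p4 == [s4, s2, s3])  # True True

mutated = block_change([s1, s2, s2], S, random.Random(1))
print(mutated)
```

The cut points passed to `two_point` correspond to the `|` markers in p1 and p2 above; in a running GA they would be drawn at random.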

• Simultaneous evolution of two or more species with coupled fitness.

• Coevolving species either compete or cooperate.

• Competitive coevolution: Fitness of individual based on direct competition with individuals of other species, which in turn evolve separately in their own populations (“prey-predator”).

• Cooperative Coevolution involves a number of independently evolving species.

• Interaction between species occurs via fitness function only.

• The fitness of an individual depends on its ability to collaborate with individuals from other species.

Source: Potter & DeJong (1997)

• Two species evolve simultaneously.

• First species contains prefixes of candidate solutions to the SCS problem at hand.

• Second species contains candidate suffixes.

• Fitness of an individual in each species depends on how well it interacts with representatives from the other species to construct a global solution.
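A minimal sketch of this evaluation: a prefix (or suffix) is merged with representatives of the other species and scored as a complete solution. How representatives are chosen is not stated in the slides, so passing them in as a list and keeping the best resulting score is an assumption, as are all function names.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def fitness(genome, blocks):
    """Compressed length plus the total length of uncovered blocks (lower is better)."""
    text = ""
    for block in genome:
        text += block[overlap(text, block):]
    return len(text) + sum(len(b) for b in blocks if b not in text)


def evaluate(individual, representatives, blocks, is_prefix):
    """Score an individual by its best merge with the other species' representatives."""
    merged = (individual + rep if is_prefix else rep + individual
              for rep in representatives)
    return min(fitness(genome, blocks) for genome in merged)


s1, s2, s3, s4 = "0011", "1100", "1001", "111"
S = [s1, s2, s3, s4]

# A prefix individual scored against two suffix representatives.
prefix_score = evaluate([s4, s2], [[s3], [s1, s4]], S, is_prefix=True)
# A suffix individual scored against one prefix representative.
suffix_score = evaluate([s1, s4], [[s4, s2]], S, is_prefix=False)
print(prefix_score, suffix_score)  # 8 8
```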

Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)

[Figure: an individual from one population (prefixes or suffixes) is merged with a representative from the other population, and the merged solution is evaluated.]

Compare: GREEDY, Standard GA, Cooperative Coevolution

Each type of GA was executed twice on each problem instance; the better run of the two was used for statistical purposes.

[Table: average of the best superstring lengths; columns: GREEDY, Genetic, Cooperative; rows: 50 blocks, 80 blocks]

The collaboration between the two populations results in a good decomposition of the problem into two smaller sub-problems, each of which is solved using a standard GA.


The Puzzle Algorithm

“Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm.”

Holland (1975)

“A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks.”

“The success of GAs stems from their ability to combine quality sub-solutions (building blocks) from separate individuals in order to form better global solutions.”

Problems in nature have an inherent structural design. Even when the structure is not known explicitly, GAs detect it implicitly and gradually enhance good building blocks.

Recombination may destroy quality building blocks found by the GA.

0010101010101010101000011110100010000

Example

2. Blond

But not very beautiful…

Example (con’t)

Brain Appearance

0010101010101010101000011110100010000

Puzzle Algorithm: The Idea

• Improve Recombination Operator.

• Preserve good building blocks discovered by GA using selection of recombination loci that do not destroy good building blocks.

• Result: Assembly of good building blocks to construct better solutions (as in a puzzle).

Puzzle Algorithm (cont’d)

• Two populations:

1. Candidate solutions: As in simple GA.

2. Building blocks: Each individual is a sequence of blocks contained in at least one candidate solution.

Puzzle Algorithm (cont’d)

• Interaction between candidate solutions and building blocks is through fitness function.

• Interaction between building blocks and candidate solutions is through constraints on recombination points.

Puzzle Algorithm: Zoom In

[Figure: the candidate solutions population and the building blocks population, linked by fitness evaluation in one direction and crossover location in the other. Each individual is a sequence of blocks. Each building block is contained in at least one individual in the solutions population. Building blocks may overlap.]

The Candidate Solutions Population

• Representation, fitness evaluation, selection, and mutation are identical to the simple GA.

• Recombination-aid vector aids in selecting the recombination loci.

• Recombination-aid vector is updated by building blocks individuals.


The Building Blocks Population

• An individual is represented as a sequence of blocks, contained in at least one candidate solution.

• Fitness of an individual is the average of the fitness of candidate solutions containing it.

• Fitness-proportionate selection.

The Building Blocks Population (cont’d)

• “Unisex” individuals.

• Two modification operators:

• Expansion: Increase its genome by one block. Occurs with high probability.

• Exploration: “Die”, and start over as a new 2-block individual. Occurs with low probability.
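The building-block fitness and the two modification operators can be sketched as follows. Several details are assumptions not given in the slides: the probability `p_explore`, taking the added block from a neighbouring position in a solution that contains the building block, and drawing a new 2-block individual from two adjacent blocks of a random solution. All names are illustrative.

```python
import random


def contains(genome, bb):
    """True if bb appears as a contiguous run of blocks inside genome."""
    m = len(bb)
    return any(genome[i:i + m] == bb for i in range(len(genome) - m + 1))


def building_block_fitness(bb, solutions, solution_fitness):
    """Average fitness of the candidate solutions that contain bb."""
    values = [f for s, f in zip(solutions, solution_fitness) if contains(s, bb)]
    return sum(values) / len(values) if values else 0.0


def modify(bb, solutions, rng, p_explore=0.1):
    """Expansion with high probability, exploration with low probability."""
    hosts = [s for s in solutions if contains(s, bb) and len(s) > len(bb)]
    if hosts and rng.random() >= p_explore:
        # Expansion: grow by one neighbouring block taken from a host solution.
        host = rng.choice(hosts)
        m = len(bb)
        start = next(i for i in range(len(host) - m + 1) if host[i:i + m] == bb)
        if start + m < len(host):
            return host[start:start + m + 1]
        return host[start - 1:start + m]
    # Exploration: "die" and restart as two adjacent blocks of some solution
    # (assumes at least one solution has two or more blocks).
    host = rng.choice([s for s in solutions if len(s) >= 2])
    i = rng.randrange(len(host) - 1)
    return host[i:i + 2]


solutions = [["s1", "s2", "s3", "s4"], ["s4", "s2", "s3"]]
solution_fitness = [10, 12]
print(building_block_fitness(["s2", "s3"], solutions, solution_fitness))  # 11.0
print(modify(["s2", "s3"], solutions, random.Random(0)))
```

Both operators return a sequence that still occurs in some candidate solution, which keeps the population consistent with the rule that every building block is contained in at least one candidate solution.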

Building Blocks – Candidate Solutions

[Figure: fitness evaluation. The fitness values f1, f2, f3, f4 of the candidate solutions feed the fitness of the building blocks they contain.]

Update “recombination-aid” vector

Example: a solution’s genome with three building blocks:

• building block #1 fitness = 0.3

• building block #2 fitness = 0.4

• building block #3 fitness = 0.6

Recombination-aid vector as it is updated:

• Initial: 0, 0, 0, 0, 0, 0, 0

• Intermediate: 0, 0.3, 0.3, 0, 0.4, 0.6, 0

• Final: 0.3, 0.3, 0.3, 0, 0.4, 0.6, 0.6

Recombination-loci selection

* Ties are broken arbitrarily
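One reading of this example that reproduces the final vector is sketched below. The interpretation is an assumption: the genome has eight blocks, the vector holds one value per junction between adjacent blocks, each junction inside a building block receives that building block's fitness (the maximum if several span it), and the junction with the lowest value is preferred as the crossover locus. The block labels A to H and all function names are hypothetical.

```python
import random


def occurrences(genome, building_block):
    """Start indices where building_block appears contiguously in genome."""
    n, m = len(genome), len(building_block)
    return [i for i in range(n - m + 1) if genome[i:i + m] == building_block]


def recombination_aid(genome, building_blocks):
    """One value per junction between adjacent blocks (assumed layout).

    A junction inside a building block gets the highest fitness among the
    building blocks spanning it; a junction inside none of them stays at 0.
    """
    vector = [0.0] * (len(genome) - 1)
    for blocks, fit in building_blocks:
        for start in occurrences(genome, blocks):
            for junction in range(start, start + len(blocks) - 1):
                vector[junction] = max(vector[junction], fit)
    return vector


def pick_locus(vector, rng):
    """Choose a least-protected junction; ties are broken arbitrarily."""
    lowest = min(vector)
    return rng.choice([i for i, v in enumerate(vector) if v == lowest])


genome = list("ABCDEFGH")  # eight hypothetical block labels
building_blocks = [(list("ABCD"), 0.3), (list("EF"), 0.4), (list("FGH"), 0.6)]
vector = recombination_aid(genome, building_blocks)
print(vector)                                 # [0.3, 0.3, 0.3, 0.0, 0.4, 0.6, 0.6]
print(pick_locus(vector, random.Random(0)))   # 3, the junction between D and E
```

Cutting at the zero-valued junction leaves all three building blocks intact, which is the stated goal of the improved recombination operator.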

Experiments

Compare: GREEDY, Standard GA, Puzzle

Building Blocks - Experimental Setup

Results: Experiment III (~50 blocks)

Results: Experiment IV (~80 blocks)

Did we lose to cooperative?

NO!

Results: Summary

[Table: average of the best superstring lengths; columns: GREEDY, Genetic, Puzzle; rows: 50 blocks, 80 blocks]

Relations Between The Algorithms

[Figure: GA plus cooperation gives Cooperative; GA plus puzzle gives Puzzle; Cooperative plus puzzle, or Puzzle plus cooperation, gives Co-Puzzle.]

The Co-Puzzle Algorithm

[Figure: a candidate prefixes population and a candidate suffixes population, linked to each other by fitness evaluation. Each is paired with its own population of possible building blocks through fitness evaluation and crossover location.]

Experiments

Compare: GREEDY, Cooperative Coevolution, Co-Puzzle

Results: Experiment V (~80 blocks)

Results: Experiment VI (~50 blocks)

????

42% improvement over cooperative

Results: Summary

[Table: size of shortest common superstring; columns: GREEDY, Cooperative, Co-puzzle; rows: 50 blocks, 80 blocks]


Results: Summary

[Table: size of shortest common superstring; columns: GREEDY, Cooperative, Puzzle, Co-puzzle; rows: 50, 80, 90, and 100 blocks]

20 problem instances per experiment

Annotations, in row order:

• 50 blocks: 83% better

• 80 blocks: 42% better

• 90 blocks: 25% better

• 100 blocks: 13% better

Larger Problems - Using More Species

[Table: size of shortest common superstring; columns: GREEDY, Co-puzzle, 3-Co-puzzle; rows: 110 blocks, 120 blocks]

Conclusions

• Cooperative coevolution might prove deleterious when too many species are used (when close to optimum?).

• When a suitable number of species is used, cooperative coevolution improves performance by decomposing the problem into several easier subproblems.

Conclusions (cont’d)

• Evolving a population of building blocks to aid in the selection of recombination loci drastically improves the performance of a standard GA.

• Cooperation between cooperative coevolution and Puzzle ultimately improves global performance.

Future Work

• Test the (Co-) Puzzle approach on other problem domains.

• A hybrid GA.

• Tackle larger problems.

• Comparison to greedy-stochastically based local-search algorithms.


The Messy Puzzle Algorithm

Hillel Maoz

Ben-Gurion University, Israel


• A binary genome of size n = 14.

• Genes a and b together encode important information.

• Random crossover is applied.

Survival probability = The chance to appear in the offspring

• Left genome – 4/15

• Right genome – 14/15

In many cases it is hard to know the optimal representation

• Input: undirected weighted graph G=(V, E, W).

• Output: a partition of V into two disjoint sets (S,V\S).

• Goal: maximal sum of edge weights between the sets.

• NP-complete.

Cut = 34

Cut = 47

Simple GA for MaxCut

• Population of candidate solutions

• Assign each node a number

• Assign ‘0’ or ‘1’ to indicate which set the node belongs to

• Iteration step

• Select any two parents

• Recombine and create an offspring

• Repeat until a new population is generated

• Fitness – The weight of the cut
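The fitness of a MaxCut genome can be sketched in a few lines. The edge-list format, the 0-based node numbering, and the small graph below are hypothetical; they are not the graphs from the talk.

```python
def cut_weight(genome, edges):
    """Sum of the weights of edges whose endpoints fall in different sets.

    genome[i] is 0 or 1 and names the set that node i belongs to.
    """
    return sum(w for u, v, w in edges if genome[u] != genome[v])


# A small hypothetical graph: (node, node, weight), nodes numbered from 0.
edges = [(0, 1, 3), (1, 2, 5), (2, 3, 4), (3, 0, 2), (0, 2, 1)]
print(cut_weight([0, 1, 0, 1], edges))  # 14
print(cut_weight([0, 0, 1, 1], edges))  # 8
```

With this encoding, the position of each node in the genome is fixed by its number, which is why the ordering of vertices matters for crossover.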

“How to define the order of the vertices within the genome ?”

• The main difficulty: identifying the related vertices.

• A messy gene is an ordered pair <allele-locus, allele-value>.

• Possible solution:

• Use some sort of messy genes to detect related genes.

• Use the Puzzle approach to keep them together.

A building block’s genome is represented as a sequence of messy genes, for example: <2,0> <1,1> <5,0> <6,1>

Messy Puzzle Algorithm

• Two-population setup, as in the Puzzle algorithm.

• Enhanced recombination operator.

• Evolved building blocks structure (similar to puzzle).

Enhanced Recombination

[Figure: a genome with loci 1 to 8 and three building blocks (BB) with fitness 0.8, 0.7, and 0.6.]

I) Add the 1st BB: success

II) Add the 2nd BB: failure

III) Add the 3rd BB: success

IV) Simple crossover
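The steps above can be sketched as follows. The slides do not say what makes adding a building block fail, so this sketch assumes a block fails when it contradicts a gene already placed by a fitter block. The one-point fill from the parents, the 0-based loci, and all names are likewise assumptions.

```python
import random


def enhanced_recombination(parent1, parent2, building_blocks, rng):
    """Build an offspring from building blocks first, then from the parents.

    building_blocks: list of (messy_genes, fitness), where messy_genes is a
    list of (locus, value) pairs with 0-based loci.
    """
    child = [None] * len(parent1)
    # Try building blocks from fittest to least fit.
    for genes, _ in sorted(building_blocks, key=lambda bb: bb[1], reverse=True):
        # Assumed failure rule: a block fails if it contradicts a gene already set.
        if all(child[locus] in (None, value) for locus, value in genes):
            for locus, value in genes:
                child[locus] = value
    # Fill the remaining loci with a simple one-point crossover of the parents.
    cut = rng.randrange(1, len(parent1))
    filler = parent1[:cut] + parent2[cut:]
    return [f if c is None else c for c, f in zip(child, filler)]


building_blocks = [
    ([(1, 0), (0, 1)], 0.8),  # added first: success
    ([(1, 1), (2, 0)], 0.7),  # conflicts at locus 1: failure
    ([(4, 0), (5, 1)], 0.6),  # no conflict: success
]
child = enhanced_recombination([0] * 8, [1] * 8, building_blocks, random.Random(0))
print(child)
```

Loci fixed by the accepted building blocks are never overwritten by the crossover fill, so related genes stay together regardless of where they sit in the genome.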

Static Detection of Building Blocks

• Building blocks do not truly evolve.

• No Expansion and Exploration operators.

• Building blocks’ fitness is based on a number of generations.

• Purpose: to check and understand the core of the messy puzzle algorithm.

Results

• Randomly generated graphs.

• 1000 generations.

• 10 separate experiments per problem instance.

Conclusions and Future Work

• Do messy work to solve the linkage problem.

• Even a small population of building blocks improves the GA performance.

• Messy puzzle is better when inner structures exist.

• Applying evolution to the building blocks population.

• Comparing to different representation-search techniques.