Coevolving solutions to the shortest common superstring problem



Coevolving Solutions to the Shortest Common Superstring Problem

Assaf Zaritsky & Moshe Sipper

Ben-Gurion University, Israel

www.cs.bgu.ac.il/~assafza


Outline

  • The “Shortest Common Superstring” problem.

  • DNA sequencing and the input domain.

  • Standard and cooperative coevolutionary genetic algorithm (GA).

  • The Puzzle approach.

  • Conclusions and future work.

  • Messy Puzzle.


The Shortest Common Superstring Problem (SCS)

  • Let S = {s1,…,sn} be a set of strings (blocks) over some alphabet Σ. A superstring of S is a string x such that each si in S is a substring of x.

  • Problem: Find shortest (common) superstring.

  • NP-Complete.

  • MAX-SNP hard.

  • Motivation: DNA sequencing, data compression.


SCS: Example

  • S = {ate, half, lethal, alpha, alfalfa}

  • A trivial superstring is “atehalflethalalphaalfalfa” of length 25 (a simple concatenation of all blocks).

  • A shortest common superstring is “lethalphalfalfate” of length 17.

  • Note that a “compressed” permutation of the blocks is actually a superstring.
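The definitions above can be checked mechanically. The following is a minimal Python sketch, not code from the presentation: the names `overlap`, `compress`, and `is_superstring` are illustrative, and `compress` assumes that consecutive blocks are merged on their longest suffix-prefix overlap.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(permutation: list[str]) -> str:
    """Merge blocks left to right, sharing the longest overlap at each step (assumed rule)."""
    result = ""
    for block in permutation:
        result += block[overlap(result, block):]
    return result


def is_superstring(x: str, blocks: list[str]) -> bool:
    """True if every block occurs in x as a contiguous substring."""
    return all(block in x for block in blocks)


blocks = ["ate", "half", "lethal", "alpha", "alfalfa"]
trivial = "".join(blocks)
compressed = compress(["lethal", "alpha", "half", "alfalfa", "ate"])

print(len(trivial), is_superstring(trivial, blocks))
print(compressed, len(compressed), is_superstring(compressed, blocks))
```

Under this merge rule, compressing the permutation <lethal, alpha, half, alfalfa, ate> yields the 17-character superstring from the example.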


Approximation Algorithms

  • Several linear approximations for SCS have been proposed, most of which rely on greedy approaches.

  • GREEDY

    The most widely used heuristic in DNA sequencing.

  • Conjecture [Blum 1994, Sweedyk 1999]: Superstring produced by GREEDY is of length at most two times the optimal.

  • We are not aware of any previous evolutionary approach to the SCS problem.
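GREEDY is commonly described as repeatedly merging the two strings with the largest overlap until a single string remains. The sketch below follows that description. The function names, the tie-breaking rule (first pair found), and the preliminary removal of blocks contained in other blocks are assumptions, not details taken from the slides.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(blocks: list[str]) -> str:
    """Repeatedly merge the pair of strings with the largest overlap."""
    strings = list(dict.fromkeys(blocks))  # drop exact duplicates, keep order
    # Assumed preprocessing: a block inside another block needs no separate handling.
    strings = [s for s in strings if not any(s != t and s in t for t in strings)]
    while len(strings) > 1:
        best_k, best_i, best_j = -1, -1, -1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # ties keep the first pair found
                        best_k, best_i, best_j = k, i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (best_i, best_j)]
        strings.append(merged)
    return strings[0] if strings else ""


blocks = ["ate", "half", "lethal", "alpha", "alfalfa"]
result = greedy_superstring(blocks)
print(result, len(result))
```

The scan over all pairs is quadratic per merge, which is acceptable for a sketch. Practical implementations usually index overlaps more efficiently.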




DNA Sequencing

The most common usage of the SCS problem.


DNA Sequencing (cont’d)

  • The problem: “read” a string of DNA.

  • Short DNA strands can be read in the laboratory.

  • To sequence a long DNA strand:

    (The DNA sequence appears in many copies)

    • Cut the DNA into short fragments using restriction enzymes.

    • Sequence each of the resulting fragments.

    • Order those fragments using an SCS algorithm.


The Input Domain

The input strings used in the experiments were inspired by DNA sequencing:


Input Generation Setup: Parameters

NB: increasing the number of blocks results in exponential growth of the problem’s complexity.




Simple Genetic Algorithm

produce an initial population of individuals
evaluate fitness of all individuals
while termination condition not met do
    select fitter individuals for reproduction
    recombine individuals
    mutate individuals
    evaluate fitness of modified individuals
    generate a new population
end while
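The loop above can be made concrete as follows. This is a generic sketch rather than the authors' implementation: the function `simple_ga`, its parameters, the fixed generation budget, and the inverse-score selection weights are all assumptions chosen for illustration, and the toy problem (minimizing the number of 1s in a bit string) is unrelated to SCS.

```python
import random


def simple_ga(init, fitness, recombine, mutate, pop_size=20, generations=30, seed=0):
    """Generic GA loop; lower fitness is better and fitness values must be >= 0."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]    # initial population
    scores = [fitness(ind) for ind in population]        # evaluate
    for _ in range(generations):                         # termination: fixed budget
        weights = [1.0 / (1.0 + s) for s in scores]      # fitter means larger weight
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = rng.choices(population, weights=weights, k=2)   # select
            offspring.append(mutate(recombine(p1, p2, rng), rng))    # recombine, mutate
        population = offspring                           # new population
        scores = [fitness(ind) for ind in population]    # re-evaluate
    best = min(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]


# Toy usage: minimize the number of 1s in an 8-bit genome.
def init(rng):
    return [rng.randint(0, 1) for _ in range(8)]


def count_ones(genome):
    return sum(genome)


def one_point(p1, p2, rng):
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]


def flip_one(genome, rng):
    i = rng.randrange(len(genome))
    return genome[:i] + [1 - genome[i]] + genome[i + 1:]


best, score = simple_ga(init, count_ones, one_point, flip_one)
print(best, score)
```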


Simple GA for the SCS Problem

  • Given a set of strings as input, generate an initial population of random candidate solutions.

  • The fitness of each individual depends on its length and accuracy.

  • The GA uses selection, recombination, and mutation to create the next generation, each individual of which is then evaluated.

  • These steps are repeated a predefined number of times or until the solution is deemed satisfactory.


Simple GA for the SCS Problem (cont’d)

  • Blocks of the input set are atomic components.

  • Representation: An individual’s genome is represented as a sequence of blocks.

    • An individual may have missing blocks or contain duplicate copies of the same block.

    • Permutation Representation: Good or Bad?


Simple GA for the SCS Problem (cont’d)

  • Evaluation: the fitness of an individual is the length of its compressed genome plus the total length of the blocks that are not covered by the individual (lower is better).

  • Genetic operators:

    • Fitness-proportionate selection.

    • Two-point recombination. Allows growth and reduction in the genome’s length.

    • Block-change mutation.


Simple GA for the SCS Problem (example)

  • S = {s1,s2,s3,s4}; s1 = 0011, s2 = 1100, s3 = 1001, s4 = 111.

  • Fitness(<s2,s1>) = |110011| + |111| = 6 + 3 = 9.

  • Fitness(<s4,s2,s1,s4>) = |11100111| = 8.

  • Recombination:

    • p1 = <s1,|s2,s3|,s4>

    • p2 = <s4,|s1,s3,s2|>

    • p3 = recombine1(p1,p2) = <s1,s1,s3,s2,s4>

    • p4 = recombine2(p1,p2) = <s4,s2,s3>

  • mutate (<s1,s2,s2>) = <s1,s4,s2>
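The fitness and recombination examples above can be reproduced with a short sketch. The helper names, the left-to-right maximal-overlap compression, and the way cut points are passed in are assumptions made for illustration. The slides do not specify how cut points or mutation positions are drawn.

```python
import random


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome: list[str]) -> str:
    """Merge consecutive blocks on their longest overlap (assumed rule)."""
    result = ""
    for block in genome:
        result += block[overlap(result, block):]
    return result


def fitness(genome: list[str], blocks: list[str]) -> int:
    """Compressed length plus total length of uncovered blocks (lower is better)."""
    merged = compress(genome)
    return len(merged) + sum(len(b) for b in blocks if b not in merged)


def two_point_recombine(p1, cuts1, p2, cuts2):
    """Swap the segments lying between the two cut points of each parent."""
    (a1, b1), (a2, b2) = cuts1, cuts2
    return p1[:a1] + p2[a2:b2] + p1[b1:], p2[:a2] + p1[a1:b1] + p2[b2:]


def block_change_mutation(genome, blocks, rng):
    """Replace one randomly chosen position with a randomly chosen input block."""
    i = rng.randrange(len(genome))
    return genome[:i] + [rng.choice(blocks)] + genome[i + 1:]


s1, s2, s3, s4 = "0011", "1100", "1001", "111"
S = [s1, s2, s3, s4]
print(fitness([s2, s1], S))
print(fitness([s4, s2, s1, s4], S))
p3, p4 = two_point_recombine([s1, s2, s3, s4], (1, 3), [s4, s1, s3, s2], (1, 4))
print(p3, p4)
print(block_change_mutation([s1, s2, s2], S, random.Random(0)))
```

Because the swapped segments may differ in length, the offspring can be longer or shorter than their parents, which is how recombination allows growth and reduction of the genome.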


Coevolution

  • Simultaneous evolution of two or more species with coupled fitness.

  • Coevolving species either compete or cooperate.

  • Competitive coevolution: Fitness of individual based on direct competition with individuals of other species, which in turn evolve separately in their own populations (“prey-predator”).



Cooperative Coevolution (cont’d)

  • Cooperative Coevolution involves a number of independently evolving species.

  • Interaction between species occurs via fitness function only.

  • The fitness of an individual depends on its ability to collaborate with individuals from other species.


Cooperative Coevolution (cont’d)

Source: Potter & DeJong (1997)


Cooperative Coevolutionary Algorithm for the SCS Problem

  • Two species evolve simultaneously.

  • First species contains prefixes of candidate solutions to the SCS problem at hand.

  • Second species contains candidate suffixes.

  • Fitness of an individual in each species depends on how well it interacts with representatives from the other species to construct a global solution.
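A minimal sketch of this evaluation step follows. It assumes that merging means concatenating the prefix genome with the suffix genome and then scoring the result with the simple-GA fitness. How the representative is chosen is not stated here, so it is simply passed in. All function names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome: list[str]) -> str:
    """Merge consecutive blocks on their longest overlap (assumed rule)."""
    result = ""
    for block in genome:
        result += block[overlap(result, block):]
    return result


def fitness(genome: list[str], blocks: list[str]) -> int:
    """Compressed length plus total length of uncovered blocks (lower is better)."""
    merged = compress(genome)
    return len(merged) + sum(len(b) for b in blocks if b not in merged)


def evaluate(individual, representative, blocks, individual_is_prefix):
    """Score an individual by the global solution it forms with the other species."""
    if individual_is_prefix:
        merged = individual + representative   # prefix individual + suffix representative
    else:
        merged = representative + individual   # prefix representative + suffix individual
    return fitness(merged, blocks)


S = ["0011", "1100", "1001", "111"]
print(evaluate(["111", "1100"], ["0011", "111"], S, individual_is_prefix=True))
print(evaluate(["0011"], ["1100"], S, individual_is_prefix=False))
```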


Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)

[Figure: Merge. An individual from the prefixes population is merged with the suffix representative from the suffixes population.]


Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)

[Figure: Evaluate. Labels: Prefixes population, Suffixes population, Fitness.]


Experiments

Compare: GREEDY, Standard GA, Cooperative Coevolution


Experimental Setup

Each type of GA was executed twice on each problem instance; the better run of the two was used for statistical purposes.




Results: Summary

Average of the best superstring lengths, by algorithm (GREEDY, Genetic, Cooperative) and problem size (50 blocks, 80 blocks).

[Table: values not preserved.]


Conclusion:

The collaboration between the two populations results in a good decomposition of the problem into two smaller sub-problems, each of which is solved using a standard GA.




The Puzzle Algorithm


The Schema Theorem

“Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm.”

Holland (1975)


Building Blocks Hypothesis

“A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks.”


Our Interpretation

“The success of GAs stems from their ability to combine quality sub-solutions (building blocks) from separate individuals in order to form better global solutions.”


The Main Assumption

Problems in nature have an inherent structural design. Even when the structure is not known explicitly, GAs detect it implicitly and gradually enhance good building blocks.


A Problem

Recombination may destroy quality building blocks found by the GA.


Example

[Figure: the genome 0010101010101010101000011110100010000, with regions labeled Brain and Appearance.]


Example (cont’d)

[Figure: the genome 0010101010101010101000011110100010000, with regions labeled Brain and Appearance.]

  1. Smart (presumably)

  2. Blond

But not very beautiful…



Puzzle Algorithm: The Idea

  • Improve Recombination Operator.

  • Preserve good building blocks discovered by GA using selection of recombination loci that do not destroy good building blocks.

  • Result: Assembly of good building blocks to construct better solutions (as in a puzzle).


Puzzle Algorithm (cont’d)

[Figure: the building blocks population and the candidate solutions population.]

  • Two populations:

    1. Candidate solutions: As in simple GA.

    2. Building blocks: Each individual is a sequence of blocks contained in at least one candidate solution.


Puzzle Algorithm (cont’d)

  • Interaction between candidate solutions and building blocks is through the fitness function.

  • Interaction between building blocks and candidate solutions is through constraints on recombination points.

[Figure: the building blocks population and the candidate solutions population, linked by "Fitness evaluation" and "Crossover location".]


Puzzle Algorithm: Zoom In

  • Candidate solutions population: each individual is a sequence of blocks.

  • Building blocks population: each building block is contained in at least one individual in the solutions population. Building blocks may overlap.

[Figure: the two populations, linked by "Fitness evaluation" and "Crossover location".]


The Candidate Solutions Population

  • Representation, fitness evaluation, selection, and mutation are identical to the simple GA.

  • Recombination-aid vector aids in selecting the recombination loci.

  • Recombination-aid vector is updated by building blocks individuals.


The Building Blocks Population

  • An individual is represented as a sequence of blocks, contained in at least one candidate solution.

  • Fitness of an individual is the average of the fitness of candidate solutions containing it.

  • Fitness-proportionate selection.
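The fitness rule for building blocks can be sketched as below. The sketch assumes that "containing" means the building block appears as a contiguous run of blocks in the candidate solution's genome, and it returns None when no solution contains the block. Both choices, and all names, are illustrative assumptions.

```python
def contains(genome, building_block):
    """True if building_block occurs as a contiguous run inside genome."""
    n, m = len(genome), len(building_block)
    return any(genome[i:i + m] == building_block for i in range(n - m + 1))


def building_block_fitness(building_block, solutions, solution_fitness):
    """Average fitness of the candidate solutions that contain the building block."""
    scores = [f for g, f in zip(solutions, solution_fitness)
              if contains(g, building_block)]
    return sum(scores) / len(scores) if scores else None


solutions = [["a", "b", "c"], ["b", "c", "d"], ["a", "d"]]
solution_fitness = [10, 20, 30]
print(building_block_fitness(["b", "c"], solutions, solution_fitness))
print(building_block_fitness(["c", "a"], solutions, solution_fitness))
```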


The Building Blocks Population (cont’d)

  • “Unisex” individuals.

  • Two modification operators:

    • Expansion: Increase its genome by one block. Occurs with high probability.

    • Exploration: “Die”, and start over as a new 2-block individual. Occurs with low probability.


Building Blocks – Candidate Solutions

[Figure: fitness evaluation. Candidate solutions with fitness values f1, f2, f3, f4 pass those values to the building blocks population; the building blocks then update the “recombination-aid” vector.]


Update Recombination-aid Vector

Building block fitness values: building block #1 = 0.3, building block #2 = 0.4, building block #3 = 0.6.

Recombination-aid vector, shown alongside the solution’s genome:

  Initial:       0     0     0     0     0     0     0
  Intermediate:  0     0.3   0.3   0     0.4   0.6   0
  Final:         0.3   0.3   0.3   0     0.4   0.6   0.6


Recombination-loci Selection

Recombination-aid vector, shown alongside the solution’s genome:

  0.3   0.3   0.3   0     0.4   0.6   0.6

* Ties are broken arbitrarily
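The slides show the vector values but not the update rule. The sketch below is one interpretation that reproduces the final vector: each entry holds the highest fitness among the building blocks covering that genome position, and a recombination locus is drawn from the entries with the lowest value. The block positions, the max rule, the lowest-value selection, and all names are assumptions.

```python
import random


def update_recombination_aid(genome_length, building_blocks):
    """building_blocks: (start, length, fitness) triples located on the genome."""
    vector = [0.0] * genome_length
    for start, length, fit in building_blocks:
        for pos in range(start, start + length):
            vector[pos] = max(vector[pos], fit)   # assumed rule: keep the best cover
    return vector


def select_locus(vector, rng):
    """Pick a position with the lowest value; ties are broken at random."""
    lowest = min(vector)
    return rng.choice([i for i, v in enumerate(vector) if v == lowest])


# Hypothetical placement: #1 covers positions 0-2, #2 covers 4-5, #3 covers 5-6.
placed = [(0, 3, 0.3), (4, 2, 0.4), (5, 2, 0.6)]
vector = update_recombination_aid(7, placed)
print(vector)
print(select_locus(vector, random.Random(0)))
```

With this placement, building blocks #2 and #3 overlap at position 5, where the higher fitness (0.6) wins, and the only unprotected position is index 3.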


Experiments

Compare: GREEDY, Standard GA, Puzzle


Building Blocks - Experimental Setup


Results: Experiment III (~50 blocks)

[Chart: includes a series labeled Cooperative.]


Results: Experiment IV (~80 blocks)

[Chart: includes a series labeled Cooperative.]

Did we lose to cooperative?

NO!


Results: Summary

Average of the best superstring lengths, by algorithm (GREEDY, Genetic, Puzzle) and problem size (50 blocks, 80 blocks).

[Table: values not preserved.]


Relations Between The Algorithms

[Diagram: GA leads to Cooperative via "cooperation" and to Puzzle via "puzzle"; Cooperative leads to Co-Puzzle via "puzzle", and Puzzle leads to Co-Puzzle via "cooperation".]


The Co-Puzzle Algorithm

[Figure: a candidate prefixes population and a candidate suffixes population, each paired with its own possible building blocks population through "Fitness evaluation" and "Crossover location".]


Experiments

Compare: GREEDY, Cooperative Coevolution, Co-Puzzle


Results: Experiment V (~80 blocks)


Results: Experiment VI (~50 blocks)

[Chart: includes a series labeled Puzzle.]

????


Results: Summary

Size of shortest common superstring, by algorithm (GREEDY, Cooperative, Co-puzzle) and problem size (50 blocks, 80 blocks).

[Table: values not preserved. Annotation: 42% improvement over cooperative.]




Results: Summary

Size of shortest common superstring, by algorithm (GREEDY, Cooperative, Puzzle, Co-puzzle) and problem size. 20 problem instances per experiment.

[Table: values not preserved. Annotations by problem size:]

  • 50 blocks: 83% better

  • 80 blocks: 42% better

  • 90 blocks: 25% better

  • 100 blocks: 13% better


Larger Problems - Using More Species

Size of shortest common superstring, by algorithm (GREEDY, Co-puzzle, 3-Co-puzzle) and problem size (110 blocks, 120 blocks).

[Table: values not preserved.]


Conclusions

  • Cooperative coevolution might prove deleterious when too many species are used (when close to optimum?).

  • When a suitable number of species are used, cooperative coevolution improves performance by decomposing the problem to several easier subproblems.


Conclusions (cont’d)

  • Evolving a population of building blocks to aid in the selection of recombination loci drastically improves the performance of a standard GA.

  • Cooperation between cooperative coevolution and Puzzle ultimately improves global performance.


Future Work

  • Test the (Co-) Puzzle approach on other problem domains.

  • A hybrid GA.

    • Tackle larger problems.

    • Comparison to greedy-stochastically based local-search algorithms.




The Messy Puzzle Algorithm



Static Detection of Building Blocks for addressing the Linkage Problem

Hillel Maoz

Ben-Gurion University, Israel


The Linkage Problem

[Figure: two genomes, left and right, each with genes a and b marked.]

  • A binary genome of size n = 14.

  • Genes a and b together encode important information.

  • Random crossover is applied.

Survival probability = The chance to appear in the offspring

  • Left genome – 4/15

  • Right genome – 14/15


The Linkage Problem (cont’d)

In many cases it is hard to know the optimal representation


The MaxCut Problem

  • Input: undirected weighted graph G=(V, E, W).

  • Output: a partition of V into two disjoint sets (S,V\S).

  • Goal: maximal sum of edge weights between the sets.

  • NP-complete.


MaxCut - Example

[Figure: two example cuts, one with Cut = 34 and one with Cut = 47.]


Simple GA for MaxCut

  • Population of candidate solutions

    • Assign each node a number

    • Assign ‘0’ or ‘1’ to indicate which set the node belongs to

  • Iteration step

    • Select any two parents

    • Recombine and create an offspring

    • Repeat until a new population is generated

  • Fitness – The weight of the cut
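The encoding and fitness described above can be sketched in a few lines. The edge-list representation and the name `cut_weight` are assumptions made for illustration, and the small graph is hypothetical, not the one pictured in the example slide.

```python
def cut_weight(genome, edges):
    """Sum of the weights of edges whose endpoints lie in different sets.

    genome[i] is 0 or 1 and gives the set that node i belongs to;
    edges is a list of (u, v, weight) triples for an undirected graph.
    """
    return sum(w for u, v, w in edges if genome[u] != genome[v])


# Hypothetical 4-node weighted graph.
edges = [(0, 1, 3), (1, 2, 5), (0, 2, 2), (2, 3, 4)]
print(cut_weight([0, 1, 0, 1], edges))
print(cut_weight([0, 0, 0, 0], edges))
```

Since the fitness depends only on which set each node is in, the order in which nodes are numbered does not change the fitness of a partition. It does change which genes sit next to each other in the genome and are therefore likely to survive crossover together.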


The Representation Problem

“How to define the order of the vertices within the genome?”


Messy Genes

  • The main difficulty: identifying the related vertices.

  • A messy gene is an ordered pair <allele-locus, allele-value>.

  • Possible solution:

    • Use some sort of messy genes to detect related genes.

    • Use the Puzzle approach to keep them together.


The Messy Puzzle Algorithm

A building block’s genome is represented as a sequence of messy genes


Messy Puzzle Algorithm

Example messy genes: <0,0> <2,0> <1,1> <5,0> <6,1>

  • Two-population setup as in the puzzle algorithm.

  • Enhanced recombination operator.

  • Evolved building blocks structure (similar to puzzle).
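A building block made of messy genes can be modeled as a list of (locus, value) pairs. The sketch below uses the five pairs shown on this slide. The operations `matches` and `apply_building_block` are hypothetical helpers, since the slides do not specify how a building block is matched against or written into a candidate solution.

```python
def matches(genome, messy_genes):
    """True if the genome already carries every allele value named by the block."""
    return all(genome[locus] == value for locus, value in messy_genes)


def apply_building_block(genome, messy_genes):
    """Return a copy of the genome with the block's alleles written at their loci."""
    result = list(genome)
    for locus, value in messy_genes:
        result[locus] = value
    return result


# The pairs from the slide; their order in the list does not affect the result.
block = [(0, 0), (2, 0), (1, 1), (5, 0), (6, 1)]
genome = [1, 1, 1, 1, 1, 1, 1, 1]
child = apply_building_block(genome, block)
print(child, matches(child, block), matches(genome, block))
```

Because each gene carries its own locus, related genes stay linked regardless of where they sit in the genome.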


Enhanced Recombination

[Figure: two genomes with positions 1 2 3 4 5 6 7 8, and three building blocks (BB) with fitness 0.8, 0.7, and 0.6.]

  I) Add the 1st BB - success

  II) Add the 2nd BB - failure

  III) Add the 3rd BB - success

  IV) Simple crossover


Static Detection of Building Blocks

  • Building blocks do not truly evolve.

  • No Expansion and Exploration operators.

  • Building blocks’ fitness is based on a number of generations.

  • Purpose: to check and understand the core of the messy puzzle algorithm.



Results

  • Randomly generated graphs.

  • 1000 generations.

  • 10 separate experiments per problem instance.


Conclusions and Future Work

  • Do messy work to solve the linkage problem.

  • Even a small population of building blocks improves the GA performance.

  • Messy puzzle is better when inner structures exist.

  • Applying evolution to the building blocks population.

  • Comparing to different representation-search techniques.

