Coevolving Solutions to the Shortest Common Superstring Problem


Assaf Zaritsky & Moshe Sipper

Ben-Gurion University, Israel

www.cs.bgu.ac.il/~assafza

Outline
  • The “Shortest Common Superstring” problem.
  • DNA sequencing and the input domain.
  • Standard and cooperative coevolutionary genetic algorithm (GA).
  • The Puzzle approach.
  • Conclusions and future work.
  • Messy Puzzle.
The Shortest Common Superstring Problem (SCS)
  • Let S = {s1,…,sn} be a set of strings (blocks) over some alphabet Σ. A superstring of S is a string x such that each si in S is a substring of x.
  • Problem: Find shortest (common) superstring.
  • NP-hard (the decision version is NP-complete).
  • MAX-SNP hard.
  • Motivation: DNA sequencing, data compression.
SCS: Example
  • S = {ate, half, lethal, alpha, alfalfa}
  • A trivial superstring is “atehalflethalalphaalfalfa” of length 25 (a simple concatenation of all blocks).
  • A shortest common superstring is “lethalphalfalfate” of length 17.
  • Note that any “compressed” permutation of the blocks (adjacent blocks merged along their maximal overlap) is a superstring.
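The example above can be checked with a short sketch. It assumes that "compressing" a sequence of blocks means appending each block after removing its longest overlap with the end of the string built so far; the function names `overlap`, `compress`, and `is_superstring` are illustrative and not taken from the slides.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(blocks: list[str]) -> str:
    """Merge blocks left to right along their maximal overlaps."""
    result = ""
    for block in blocks:
        result += block[overlap(result, block):]
    return result


def is_superstring(x: str, blocks: list[str]) -> bool:
    """True if every block occurs as a substring of x."""
    return all(b in x for b in blocks)


blocks = ["ate", "half", "lethal", "alpha", "alfalfa"]
trivial = "".join(blocks)  # plain concatenation, no compression
best = compress(["lethal", "alpha", "half", "alfalfa", "ate"])
print(len(trivial), best, len(best))
```

Because every block is appended in full (minus an overlap that is already present), any ordering yields a valid superstring; only the length depends on the ordering.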
Approximation Algorithms
  • Several linear (constant-factor) approximation algorithms for SCS have been proposed, most of which rely on greedy approaches.
  • GREEDY: the most widely used heuristic in DNA sequencing.
  • Conjecture [Blum 1994, Sweedyk 1999]: the superstring produced by GREEDY is at most twice as long as the optimal one.
  • We are not aware of any previous evolutionary approach to the SCS problem.
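The slides name GREEDY without spelling it out. The sketch below assumes the usual textbook formulation: repeatedly merge the two strings with the largest overlap until one string remains. The function names and the tie-breaking order are assumptions, not details from the presentation.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(blocks: list[str]) -> str:
    """Greedy merging: always join the pair with the largest overlap."""
    # Drop duplicates and blocks that are substrings of other blocks.
    unique = sorted(set(blocks))
    strings = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (best_i, best_j)]
        strings.append(merged)
    return strings[0] if strings else ""


blocks = ["ate", "half", "lethal", "alpha", "alfalfa"]
result = greedy_superstring(blocks)
print(result, len(result))
```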
DNA Sequencing

The most common usage of the SCS problem.

DNA Sequencing (cont’d)
  • The problem: “read” a string of DNA.
  • Short DNA strands can be read in the laboratory.
  • To sequence a long DNA strand (the DNA sequence appears in many copies):
    • Cut the DNA into short fragments using restriction enzymes.
    • Sequence each of the resulting fragments.
    • Order those fragments using an SCS algorithm.
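The cutting step above can be imitated in a few lines. This is only a toy model: the genome length, the number of copies, and the fragment-length range are hypothetical values chosen for illustration, not the parameters used in the experiments.

```python
import random


def shotgun_fragments(dna: str, copies: int, min_len: int, max_len: int,
                      rng: random.Random) -> list[str]:
    """Cut several copies of dna into consecutive fragments of random length."""
    fragments = []
    for _ in range(copies):
        pos = 0
        while pos < len(dna):
            length = rng.randint(min_len, max_len)
            # The last fragment of a copy may be shorter than min_len.
            fragments.append(dna[pos:pos + length])
            pos += length
    return fragments


rng = random.Random(0)
dna = "".join(rng.choice("ACGT") for _ in range(60))  # hypothetical strand
frags = shotgun_fragments(dna, copies=3, min_len=8, max_len=12, rng=rng)
print(len(frags), frags[:3])
```

Since each copy is cut at different positions, fragments from different copies overlap, which is what lets an SCS algorithm recover their order.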
The Input Domain

The input strings used in the experiments were inspired by DNA sequencing:

Input Generation Setup: Parameters

NB: increasing the number of blocks results in exponential growth of the problem’s complexity.

Simple Genetic Algorithm

produce an initial population of individuals
evaluate fitness of all individuals
while termination condition not met do
    select fitter individuals for reproduction
    recombine individuals
    mutate individuals
    evaluate fitness of modified individuals
    generate a new population
end while
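The loop above can be written as a small generic function. This is a minimal sketch that treats lower fitness as better and takes the operators as parameters; the toy problem (minimising the number of 1s in a bit list) and the helper names `roulette`, `one_point`, and `flip_one` are hypothetical and serve only to make the example runnable.

```python
import random


def run_ga(population, fitness, select, recombine, mutate, generations):
    """Minimising GA loop; returns the best individual seen and its fitness."""
    scores = [fitness(ind) for ind in population]
    best_score, best = min(zip(scores, population), key=lambda pair: pair[0])
    for _ in range(generations):
        offspring = []
        while len(offspring) < len(population):
            parent1 = select(population, scores)
            parent2 = select(population, scores)
            offspring.append(mutate(recombine(parent1, parent2)))
        population = offspring
        scores = [fitness(ind) for ind in population]
        gen_score, gen_best = min(zip(scores, population), key=lambda pair: pair[0])
        if gen_score < best_score:
            best_score, best = gen_score, gen_best
    return best, best_score


# Toy usage (hypothetical problem): minimise the number of 1s in a bit list.
rng = random.Random(1)
initial = [[rng.randint(0, 1) for _ in range(12)] for _ in range(20)]


def roulette(population, scores):
    # Lower score is better, so weight each individual by 1 / (1 + score).
    weights = [1.0 / (1.0 + s) for s in scores]
    return rng.choices(population, weights=weights)[0]


def one_point(p1, p2):
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]


def flip_one(ind):
    pos = rng.randrange(len(ind))
    return ind[:pos] + [1 - ind[pos]] + ind[pos + 1:]


best, best_score = run_ga(initial, sum, roulette, one_point, flip_one, generations=30)
print(best, best_score)
```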

Simple GA for the SCS Problem
  • Given a set of strings as input, generate initial population of random candidate solutions.
  • The fitness of each individual depends on its length and accuracy.
  • The GA uses selection, recombination, and mutation to create the next generation, each individual of which is then evaluated.
  • These steps are repeated a predefined number of times or until the solution is deemed satisfactory.
Simple GA for the SCS Problem (cont’d)
  • Blocks of the input set are atomic components.
  • Representation: An individual’s genome is represented as a sequence of blocks.
    • An individual may have missing blocks or contain duplicate copies of the same block.
    • Permutation Representation: Good or Bad?
Simple GA for the SCS Problem (cont’d)
  • Evaluation: the fitness of an individual is the length of its compressed genome plus the total length of the blocks that are not covered by the individual (lower is better).
  • Genetic operators:
    • Fitness proportionate selection.
    • Two-point recombination. Allows growth and reduction in the genome’s length.
    • Block-change mutation.
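The slides do not give the operators in detail. One way to obtain a two-point recombination that can grow or shrink the genome is to pick two cut points independently in each parent and splice the middle of the second parent into the first. The sketch below makes that assumption, and also assumes that block-change mutation replaces one randomly chosen block with a random input block.

```python
import random


def two_point_recombine(p1: list[str], p2: list[str], rng: random.Random) -> list[str]:
    """Replace a random slice of p1 with a random non-empty slice of p2."""
    i1, j1 = sorted(rng.sample(range(len(p1) + 1), 2))
    i2, j2 = sorted(rng.sample(range(len(p2) + 1), 2))
    return p1[:i1] + p2[i2:j2] + p1[j1:]


def block_change_mutate(genome: list[str], blocks: list[str], rng: random.Random) -> list[str]:
    """Replace one randomly chosen position with a random input block."""
    pos = rng.randrange(len(genome))
    return genome[:pos] + [rng.choice(blocks)] + genome[pos + 1:]


rng = random.Random(2)
blocks = ["0011", "1100", "1001", "111"]
parent1 = ["1100", "0011", "111"]
parent2 = ["111", "1001", "1100", "0011"]
child = two_point_recombine(parent1, parent2, rng)
mutant = block_change_mutate(child, blocks, rng)
print(child, mutant)
```

Because the two slices are chosen independently, the child can be longer or shorter than either parent, and it can contain duplicate or missing blocks, which matches the representation described earlier.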
Simple GA for the SCS Problem (example)
  • S = {s1,s2,s3,s4}; s1 = 0011, s2 = 1100, s3 = 1001, s4 = 111.
  • Fitness(<s2,s1>) = |110011| + |111| = 6 + 3 = 9 (s3 is covered by 110011; s4 = 111 is not).
  • Fitness(<s4,s2,s1,s4>) = |11100111| = 8 (all four blocks are covered).
  • Recombination:
    • p1 =
    • p2 =
    • p3 = recombine1(p1,p2) =
    • p4 = recombine2(p1,p2) =
  • mutate () =
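The two fitness values in this example can be reproduced with a short sketch. It assumes that the compressed genome is built by merging consecutive blocks along their maximal overlap and that a block counts as covered when it is a substring of the result; the helper names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome: list[str]) -> str:
    """Merge the genome's blocks left to right along maximal overlaps."""
    result = ""
    for block in genome:
        result += block[overlap(result, block):]
    return result


def fitness(genome: list[str], blocks: list[str]) -> int:
    """Compressed length plus the total length of uncovered blocks (lower is better)."""
    compressed = compress(genome)
    penalty = sum(len(b) for b in blocks if b not in compressed)
    return len(compressed) + penalty


s1, s2, s3, s4 = "0011", "1100", "1001", "111"
S = [s1, s2, s3, s4]
print(fitness([s2, s1], S))          # 9
print(fitness([s4, s2, s1, s4], S))  # 8
```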
Coevolution
  • Simultaneous evolution of two or more species with coupled fitness.
  • Coevolving species either compete or cooperate.
  • Competitive coevolution: Fitness of individual based on direct competition with individuals of other species, which in turn evolve separately in their own populations (“prey-predator”).
Cooperative Coevolution (cont’d)
  • Cooperative Coevolution involves a number of independently evolving species.
  • Interaction between species occurs via fitness function only.
  • The fitness of an individual depends on its ability to collaborate with individuals from other species.
Cooperative Coevolution (cont’d)

Source: Potter & DeJong (1997)

Cooperative Coevolutionary Algorithm for the SCS Problem
  • Two species evolve simultaneously.
  • First species contains prefixes of candidate solutions to the SCS problem at hand.
  • Second species contains candidate suffixes.
  • Fitness of an individual in each species depends on how well it interacts with representatives from the other species to construct a global solution.
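A sketch of this evaluation scheme: each individual is merged with a representative of the other species and the merged genome is scored with the simple-GA fitness. How the representative is chosen is not stated here; a common choice in cooperative coevolution is the best individual of the previous generation, which is treated as an assumption. All function names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome: list[str]) -> str:
    result = ""
    for block in genome:
        result += block[overlap(result, block):]
    return result


def fitness(genome: list[str], blocks: list[str]) -> int:
    compressed = compress(genome)
    return len(compressed) + sum(len(b) for b in blocks if b not in compressed)


def evaluate_species(population, representative, blocks, is_prefix_species):
    """Score each individual by merging it with the other species' representative."""
    scores = []
    for individual in population:
        if is_prefix_species:
            merged = individual + representative   # prefix followed by suffix
        else:
            merged = representative + individual
        scores.append(fitness(merged, blocks))
    return scores


S = ["0011", "1100", "1001", "111"]
prefixes = [["1100"], ["111"]]
suffix_representative = ["0011", "111"]  # assumed: best suffix of the last generation
print(evaluate_species(prefixes, suffix_representative, S, is_prefix_species=True))
```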
Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)

[Figure: an individual from the prefixes population is merged with the suffix representative from the suffixes population.]

Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)

[Figure: prefixes population and suffixes population; the merged solution is evaluated to obtain the fitness.]

Experiments

Compare: GREEDY, Standard GA, Cooperative Coevolution

Experimental Setup

Each type of GA was executed twice on each problem instance; the better run of the two was used for statistical purposes.

Results: Summary

[Table: average of the best superstring lengths; columns GREEDY, Genetic, Cooperative; rows 50 blocks, 80 blocks.]

Conclusion:

The collaboration between the two populations results in a good decomposition of the problem into two smaller sub-problems, each of which is solved using a standard GA.

The Schema Theorem

“Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm.”

Holland (1975)

Building Blocks Hypothesis

“A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks.”

Our Interpretation

“The success of GAs stems from their ability to combine quality sub-solutions (building blocks) from separate individuals in order to form better global solutions.”

The Main Assumption

Problems in nature have an inherent structural design. Even when the structure is not known explicitly, GAs detect it implicitly and gradually enhance good building blocks.

A Problem

Recombination may destroy quality building blocks found by the GA.

Example

Brain | Appearance
0010101010101010101000011110100010000

Example (cont’d)

Brain | Appearance
0010101010101010101000011110100010000

1. Smart (presumably)
2. Blond

But not very beautiful…

The Preservation of Favoured Building Blocks in the Struggle for Fitness: The Puzzle Algorithm
Puzzle Algorithm: The Idea
  • Improve the recombination operator.
  • Preserve good building blocks discovered by the GA by selecting recombination loci that do not destroy them.
  • Result: Assembly of good building blocks to construct better solutions (as in a puzzle).
Puzzle Algorithm (cont’d)
  • Two populations:
    1. Candidate solutions: As in simple GA.
    2. Building blocks: Each individual is a sequence of blocks contained in at least one candidate solution.

Puzzle Algorithm (cont’d)
  • Interaction between candidate solutions and building blocks is through the fitness function.
  • Interaction between building blocks and candidate solutions is through constraints on recombination points.

[Figure: candidate solutions population and building blocks population, linked by “Fitness evaluation” and “Crossover location”.]

Puzzle Algorithm: Zoom In

[Figure: candidate solutions population, in which each individual is a sequence of blocks, linked to the building blocks population by “Fitness evaluation” and “Crossover location”.]
Puzzle Algorithm: Zoom In

[Figure: building blocks population, in which each building block is contained in at least one individual in the solutions population; overlapping building blocks are shown.]
The Candidate Solutions Population
  • Representation, fitness evaluation, selection, and mutation are identical to the simple GA.
  • A recombination-aid vector aids in selecting the recombination loci.
  • The recombination-aid vector is updated by the building-block individuals.
The Building Blocks Population
  • An individual is represented as a sequence of blocks, contained in at least one candidate solution.
  • Fitness of an individual is the average of the fitness of candidate solutions containing it.
  • Fitness-proportionate selection.
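The building-block fitness rule above can be sketched directly. The sketch assumes that "containing" means the building block occurs as a contiguous run of blocks in the candidate solution's genome; returning `None` for a building block found in no solution is also an assumption.

```python
def contains(genome: list[str], bb: list[str]) -> bool:
    """True if bb occurs as a contiguous run of blocks inside genome."""
    n = len(bb)
    return any(genome[i:i + n] == bb for i in range(len(genome) - n + 1))


def building_block_fitness(bb, solutions, solution_scores):
    """Average fitness of the candidate solutions that contain bb."""
    hits = [score for sol, score in zip(solutions, solution_scores) if contains(sol, bb)]
    return sum(hits) / len(hits) if hits else None


solutions = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
scores = [10, 20, 30]  # hypothetical candidate-solution fitness values
print(building_block_fitness(["b", "c"], solutions, scores))
```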
The Building Blocks Population (cont’d)
  • “Unisex” individuals.
  • Two modification operators:
    • Expansion: Increase its genome by one block. Occurs with high probability.
    • Exploration: “Die”, and start over as a new 2-block individual. Occurs with low probability.
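A possible reading of the two operators, sketched under assumptions: expansion extends a building block by the neighbouring block found in some candidate solution that contains it, and exploration restarts the individual as a random pair of adjacent blocks taken from a candidate solution. The probability `p_explore = 0.1` is a hypothetical value, since the slides only say "high" and "low".

```python
import random


def expand(bb, solutions, rng):
    """Grow bb by one block, using a neighbour from a solution that contains it."""
    options = []
    n = len(bb)
    for sol in solutions:
        for i in range(len(sol) - n + 1):
            if sol[i:i + n] == bb:
                if i > 0:
                    options.append([sol[i - 1]] + bb)
                if i + n < len(sol):
                    options.append(bb + [sol[i + n]])
    return rng.choice(options) if options else bb


def explore(solutions, rng):
    """Start over as a 2-block individual taken from a random solution."""
    sol = rng.choice([s for s in solutions if len(s) >= 2])
    i = rng.randrange(len(sol) - 1)
    return sol[i:i + 2]


def modify(bb, solutions, rng, p_explore=0.1):
    """Apply exploration with low probability, expansion otherwise."""
    if rng.random() < p_explore:
        return explore(solutions, rng)
    return expand(bb, solutions, rng)


rng = random.Random(3)
solutions = [["a", "b", "c", "d"], ["c", "d", "a"]]
print(expand(["b", "c"], solutions, rng), explore(solutions, rng))
```

Both operators keep the invariant stated earlier: the resulting building block is contained in at least one candidate solution.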
Building Blocks – Candidate Solutions

[Figure: fitness evaluation; candidate solutions with fitness values f1, f2, f3, f4 and the building blocks population.]

Building Blocks – Candidate Solutions

[Figure: fitness evaluation between candidate solutions (f1, f2, f3, f4) and building blocks (f1, f2, f3), followed by the step “Update recombination-aid vector”.]

Update Recombination-aid vector

Recombination-aid vector over the solution’s genome: 0, 0, 0, 0, 0, 0, 0

building block #1 fitness = 0.3
building block #2 fitness = 0.4
building block #3 fitness = 0.6
Update Recombination-aid vector

Recombination-aid vector over the solution’s genome: 0, 0.3, 0.3, 0, 0.4, 0.6, 0

building block #1 fitness = 0.3
building block #2 fitness = 0.4
building block #3 fitness = 0.6
Update Recombination-aid vector

Recombination-aid vector over the solution’s genome: 0.3, 0.3, 0.3, 0, 0.4, 0.6, 0.6

building block #1 fitness = 0.3
building block #2 fitness = 0.4
building block #3 fitness = 0.6
Recombination-loci selection

Recombination-aid vector over the solution’s genome: 0.3, 0.3, 0.3, 0, 0.4, 0.6, 0.6

* Ties are broken arbitrarily
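The update and selection steps can be sketched as follows. Several details are assumptions rather than statements from the slides: the vector has one entry per junction between adjacent blocks of the genome, each junction inside a building block receives the highest fitness among the building blocks covering it, and the recombination locus is the junction with the lowest value. The eight-block genome and the building-block positions are hypothetical, chosen so that the result matches the final vector shown above.

```python
import random


def update_aid_vector(genome, building_blocks):
    """Build the recombination-aid vector from (building_block, fitness) pairs."""
    aid = [0.0] * (len(genome) - 1)  # one entry per junction between blocks
    for bb, fit in building_blocks:
        n = len(bb)
        for start in range(len(genome) - n + 1):
            if genome[start:start + n] == bb:
                # Protect every junction that lies inside this occurrence.
                for j in range(start, start + n - 1):
                    aid[j] = max(aid[j], fit)
    return aid


def select_locus(aid, rng):
    """Pick the least-protected junction; ties are broken at random."""
    lowest = min(aid)
    return rng.choice([i for i, value in enumerate(aid) if value == lowest])


genome = list("ABCDEFGH")  # hypothetical genome of eight blocks
building_blocks = [
    (list("ABCD"), 0.3),  # building block #1
    (list("EF"), 0.4),    # building block #2
    (list("FGH"), 0.6),   # building block #3
]
aid = update_aid_vector(genome, building_blocks)
locus = select_locus(aid, random.Random(0))
print(aid, locus)
```

Under these assumptions the only unprotected junction is the one between the fourth and fifth blocks, so the crossover cut falls there and no building block is split.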

Experiments

Compare: GREEDY, Standard GA, Puzzle

Results: Summary

[Table: average of the best superstring lengths; columns GREEDY, Genetic, Puzzle; rows 50 blocks, 80 blocks.]

Relations Between The Algorithms

[Figure: GA plus cooperation gives Cooperative; GA plus puzzle gives Puzzle; Cooperative plus puzzle, or Puzzle plus cooperation, gives Co-Puzzle.]
The Co-Puzzle Algorithm

[Figure: a candidate prefixes population and a candidate suffixes population, each paired with its own possible building blocks population through “Fitness eval” and “Crossover location”; the two candidate populations are linked by “Fitness evaluation”.]

Experiments

Compare: GREEDY, Cooperative Coevolution, Co-Puzzle

results summary64
42% improvement over cooperativeResults: Summary

size of shortest common superstring

Algorithm

Problem size

GREEDY

Cooperative

Co-puzzle

50 blocks

80 blocks

Results: Summary

[Table: size of shortest common superstring; columns GREEDY, Cooperative, Puzzle, Co-puzzle; 20 problem instances per experiment.]

  • 50 blocks: 83% better
  • 80 blocks: 42% better
  • 90 blocks: 25% better
  • 100 blocks: 13% better

Larger Problems - Using More Species

[Table: size of shortest common superstring; columns GREEDY, Co-puzzle, 3-Co-puzzle; rows 110 blocks, 120 blocks.]

Conclusions
  • Cooperative coevolution might prove deleterious when too many species are used (when close to optimum?).
  • When a suitable number of species is used, cooperative coevolution improves performance by decomposing the problem into several easier subproblems.
Conclusions (cont’d)
  • Evolving a population of building blocks to aid in the selection of recombination loci drastically improves the performance of a standard GA.
  • Cooperation between cooperative coevolution and Puzzle ultimately improves global performance.
Future Work
  • Test the (Co-) Puzzle approach on other problem domains.
  • A hybrid GA.
    • Tackle larger problems.
    • Comparison to greedy-stochastically based local-search algorithms.

Static Detection of Building Blocks for addressing the Linkage Problem

Hillel Maoz

Ben-Gurion University, Israel

The Linkage Problem
  • A binary genome of size n = 14.
  • Genes a and b together encode important information.
  • Random crossover is applied.

Survival probability = the chance that a and b appear together in the offspring

  • Left genome – 4/15
  • Right genome – 14/15

[Figure: two genomes, left and right, each with genes a and b marked.]
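The survival probabilities can be reproduced under a simple model: one-point crossover with a uniformly chosen cut point, where the pair survives whenever the cut does not fall between a and b. Note that the fractions on the slide have denominator 15, which under this model corresponds to 15 possible cut points, i.e. a genome of 16 genes; the sketch therefore uses n = 16 and hypothetical gene positions as assumptions.

```python
from fractions import Fraction


def survival_probability(n: int, pos_a: int, pos_b: int) -> Fraction:
    """Chance that a uniform one-point crossover does not separate genes a and b."""
    cut_points = n - 1                  # cuts between adjacent genes
    separating = abs(pos_a - pos_b)     # cuts that fall between a and b
    return Fraction(cut_points - separating, cut_points)


# Genes far apart (11 cut points between them) versus adjacent genes.
print(survival_probability(16, 0, 11))  # 4/15
print(survival_probability(16, 7, 8))   # 14/15
```

The closer the two genes sit in the genome, the more likely the pair is to be inherited intact, which is why the ordering of genes matters.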
The Linkage Problem (cont’d)

In many cases it is hard to know the optimal representation.

The MaxCut Problem
  • Input: undirected weighted graph G=(V, E, W).
  • Output: a partition of V into two disjoint sets (S,V\S).
  • Goal: maximal sum of edge weights between the sets.
  • NP-complete.
MaxCut - Example

Cut = 34

Cut = 47

Simple GA for MaxCut
  • Population of candidate solutions
    • Assign each node a number
    • Assign ‘0’ or ‘1’ to indicate which set the node belongs to
  • Iteration step
    • Select any two parents
    • Recombine and create an offspring
    • Repeat until a new population is generated
  • Fitness – The weight of the cut
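The encoding and fitness described above fit in a few lines. The graph below is a hypothetical four-node example, not one of the instances from the slides; `cut_weight` is an illustrative name.

```python
def cut_weight(genome: list[int], weighted_edges: dict[tuple[int, int], int]) -> int:
    """Sum the weights of edges whose endpoints lie in different sets."""
    return sum(w for (u, v), w in weighted_edges.items() if genome[u] != genome[v])


# Hypothetical graph: node indices are genome positions, values are edge weights.
edges = {(0, 1): 3, (1, 2): 5, (0, 2): 2, (2, 3): 4}

print(cut_weight([0, 1, 0, 1], edges))  # 12
print(cut_weight([0, 0, 1, 1], edges))  # 7
```

Bit i of the genome says which side of the partition node i belongs to, so the order in which nodes are numbered determines which genes end up next to each other.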
The Representation Problem

“How to define the order of the vertices within the genome?”

Messy Genes
  • The main difficulty: identifying the related vertices.
  • A messy gene is an ordered pair <locus, value>.
  • Possible solution:
    • Use some sort of messy genes to detect related genes.
    • Use the Puzzle approach to keep them together.
The Messy Puzzle Algorithm

A building block’s genome is represented as a sequence of messy genes

Messy Puzzle Algorithm
  • Two-population setup as in the puzzle algorithm.
  • Enhanced recombination operator.
  • Evolved building blocks structure (similar to puzzle).

Example building block: <0,0> <2,0> <1,1> <5,0> <6,1>
Enhanced Recombination

[Figure: two genomes with loci numbered 1 to 8, and three building blocks (BB) labeled 0.8, 0.7, and 0.6.]

  I) Add the 1st BB - success
  II) Add the 2nd BB - failure
  III) Add the 3rd BB - success
  IV) Simple crossover
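The steps above leave several details open, so the following is only one possible reading, not the authors' operator. It assumes that building blocks are tried in decreasing order of fitness, that adding a building block fails when one of its loci is already fixed by an earlier building block, and that the remaining loci are filled by an ordinary one-point crossover of the parents. Building blocks are lists of messy genes written as `(locus, value)` pairs with 0-based loci.

```python
import random


def enhanced_recombine(p1, p2, building_blocks, rng):
    """Offspring from building blocks first, then a simple crossover for the rest."""
    child = [None] * len(p1)
    ranked = sorted(building_blocks, key=lambda item: item[1], reverse=True)
    for genes, _fitness in ranked:
        # Success only if none of the block's loci is already taken.
        if all(child[locus] is None for locus, _value in genes):
            for locus, value in genes:
                child[locus] = value
    cut = rng.randrange(1, len(p1))
    template = p1[:cut] + p2[cut:]  # simple one-point crossover
    return [t if c is None else c for c, t in zip(child, template)]


parent1 = [0] * 8
parent2 = [1] * 8
building_blocks = [
    ([(0, 1), (1, 1)], 0.8),  # added first: success
    ([(1, 0), (2, 0)], 0.7),  # clashes at locus 1: failure
    ([(6, 0), (7, 0)], 0.6),  # success
]
child = enhanced_recombine(parent1, parent2, building_blocks, random.Random(4))
print(child)
```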
Static Detection of Building Blocks
  • Building blocks do not truly evolve.
  • No Expansion and Exploration operators.
  • Building blocks’ fitness is based on a number of generations.
  • Purpose: to check and understand the core of the messy puzzle algorithm.
Results
  • Randomly generated graphs.
  • 1000 generations.
  • 10 separate experiments per problem instance.

[Figure: distance to optimum; legend entry: Puzzle addition.]
Conclusions and Future Work
  • Do messy work to solve the linkage problem.
  • Even a small population of building blocks improves the GA performance.
  • Messy puzzle is better when inner structures exist.
  • Applying evolution to the building blocks population.
  • Comparing to different representation-search techniques.