Coevolving Solutions to the Shortest Common Superstring Problem

1 / 86

# Coevolving Solutions to the Shortest Common Superstring Problem - PowerPoint PPT Presentation

Coevolving Solutions to the Shortest Common Superstring Problem Assaf Zaritsky & Moshe Sipper Ben-Gurion University, Israel www.cs.bgu.ac.il/~assafza Outline The “Shortest Common Superstring” problem. DNA sequencing and the input domain.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Coevolving Solutions to the Shortest Common Superstring Problem' - adamdaniel

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Coevolving Solutions to the Shortest Common Superstring Problem

Assaf Zaritsky & Moshe Sipper

Ben-Gurion University, Israel

www.cs.bgu.ac.il/~assafza

Outline
• The “Shortest Common Superstring” problem.
• DNA sequencing and the input domain.
• Standard and cooperative coevolutionary genetic algorithm (GA).
• The Puzzle approach.
• Conclusions and future work.
• Messy Puzzle.
The Shortest Common Superstring Problem (SCS)
• Let S = {s1,…,sn} be a set of strings (blocks) over some alphabet Σ. A superstring of S is a string x such that each si in S is a substring of x.
• Problem: Find shortest (common) superstring.
• NP-Complete.
• MAX-SNP hard.
• Motivation: DNA sequencing, data compression.
SCS: Example
• S = {ate, half, lethal, alpha, alfalfa}
• A trivial superstring is “atehalflethalalphaalfalfa” of length 25 (a simple concatenation of all blocks).
• A shortest common superstring is “lethalphalfalfate” of length 17.
• Note that a “compressed” permutation of the blocks is actually a superstring.
Approximation Algorithms
• Several linear approximations for SCS have been proposed, most of which rely on greedy approaches.
• GREEDY

The most widely heuristic used in DNA sequencing.

• Conjecture [Blum 1994, Sweedyk 1999]: Superstring produced by GREEDY is of length at most two times the optimal.
• We are not aware of any previous evolutionary approach to the SCS problem.
Outline
• The “Shortest Common Superstring” problem.
• DNA sequencing and the input domain.
• Standard and cooperative coevolutionary genetic algorithm (GA).
• The Puzzle approach.
• Conclusions and future work.
• Messy Puzzle.
DNA Sequencing

The most common usage of the SCS problem.

DNA Sequencing (cont’d)
• The problem: “read” a string of DNA.
• Short DNA strands can be read in laboratory.
• To sequence a long DNA strand:

(The DNA sequence appears in many copies)

• Cut the DNA to short fragments using restriction enzymes.
• Sequence each of the resulting fragments.
• Order those fragments using a SCS algorithm.
The Input Domain

The input strings used in the experiments were inspired by DNA sequencing:

Input Generation Setup: Parameters

NB: increasing number of blocks results in exponential growth of the problem’s complexity.

Outline
• The “Shortest Common Superstring” problem.
• DNA sequencing and the input domain.
• Standard and cooperative coevolutionary genetic algorithm (GA).
• The Puzzle approach.
• Conclusions and future work.
• Messy Puzzle.
Simple Genetic Algorithm

produce an initial population of individuals

evaluate fitness of all individuals

while termination condition not met do

select fitter individuals for reproduction

recombine individuals

mutate individuals

evaluate fitness of modified individuals

generate a new population

end while

Simple GA for the SCS Problem
• Given a set of strings as input, generate initial population of random candidate solutions.
• The fitness of each individual depends on its length and accuracy.
• The GA uses selection, recombination, and mutation to create the next generation, each individual of which is then evaluated.
• Theses steps are repeated a predefined number of times or until the solution is deemed satisfactory.
Simple GA for the SCS Problem (cont’d)
• Blocks of the input set are atomic components.
• Representation: An individual’s genome is represented as a sequence of blocks.
• An individual may have missing blocks or contain duplicate copies of the same block.
• Permutation Representation: Good or Bad?
Simple GA for the SCS Problem (cont’d)
• Evaluation: fitness of an individual is the length of it’s compressed genome + the total length of the blocks that are not covered by the individual.
• Genetic operators:
• Fitness proportionate selection.
• Two-points recombination. Allows growth and reduction in genome’s length.
• Block-change mutation.
Simple GA for the SCS Problem (example)
• S = {s1,s2,s3,s4}; s1 = 0011, s2 = 1100, s3 = 1001, s4 = 111.
• Fitness (< s2,s1>) = |110011| + |111| = 6 + 3 = 9.
• Fitness (< s4,s2,s1,s4>) = |11100111| = 8.
• Recombination:
• p1 = <s1,|s2,s3|,s4>
• p2 = <s4,|s1,s3,s2|>
• p3 = recombine1(p1,p2) = <s1,s1,s3,s2,s4>
• p4 = recombine2(p1,p2) = <s4,s2,s3>
• mutate (<s1,s2,s2>) = <s1,s4,s2>
Coevolution
• Simultaneous evolution of two or more species with coupled fitness.
• Coevolving species either compete or cooperate.
• Competitive coevolution: Fitness of individual based on direct competition with individuals of other species, which in turn evolve separately in their own populations (“prey-predator”).
Cooperative Coevolution (cont’d)
• Cooperative Coevolution involves a number of independently evolving species.
• Interaction between species occurs via fitness function only.
• The fitness of an individual depends on its ability to collaborate with individuals from other species.
Cooperative Coevolution (cont’d)

Source: Potter & DeJong (1997)

Cooperative Coevolutionary Algorithm for the SCS Problem
• Two species evolve simultaneously.
• First species contains prefixes of candidate solutions to the SCS problem at hand.
• Second species contains candidate suffixes.
• Fitness of an individual in each species depends on how good it interacts with representatives from other species to construct a global solution.

Suffix Representative

Individual

Prefixes population

Suffixes population

Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)

Merge

Fitness

Prefixes population

Suffixes population

Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process)

Evaluate

Experiments

Compare: GREEDY, Standard GA, Cooperative Coevolution

Experimental Setup

Each type of GA was executed twice on each problem instance; the better run of the two was used for statistical purposes.

Results: Summary

Average of the best superstring lengths

Algorithm

Problem size

GREEDY

Genetic

Cooperative

50 blocks

80 blocks

Conclusion:

The collaboration between the two populations results in a good decomposition of the problem into two smaller sub-problems, each is solved using a standard GA.

Outline
• The “Shortest Common Superstring” problem.
• DNA sequencing and the input domain.
• Standard and cooperative coevolutionary genetic algorithm (GA).
• The Puzzle approach.
• Conclusions and future work.
• Messy Puzzle.
The Schema Theorem

“Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm.”

Holland (1975)

Building Blocks Hypothesis

“A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks.”

Our Interpretation

“The success of GAs stems from their ability to combine quality sub-solutions (building blocks) from separate individuals in order to form better global solutions.”

The Main Assumption

Problems in nature have an inherent structural design. Even when the structure is not known explicitly GAs detect it implicitly and gradually enhance good building blocks.

A Problem

Recombination may destroy quality building blocks found by the GA.

Brain Appearance

0010101010101010101000011110100010000

Example

1. Smart (assumable)

2. Blond

But not very beautiful…

Example (con’t)

Brain Appearance

0010101010101010101000011110100010000

Puzzle Algorithm: The Idea
• Improve Recombination Operator.
• Preserve good building blocks discovered by GA using selection of recombination loci that do not destroy good building blocks.
• Result: Assembly of good building blocks to construct better solutions (as in a puzzle).

Building blocks population

Candidate solutions population

Puzzle Algorithm (cont’d)
• Two populations:

1. Candidate solutions: As in simple GA.

2. Building blocks: Each individual is a sequence of blocks contained in at least one candidate solution.

Building blocks population

Candidate solutions population

Puzzle Algorithm (cont’d)
• Interaction between candidate solutions and building blocks is through fitness function.
• Interaction between building blocks and candidate solutions is through constraints on recombination points.

Fitness evaluation

Crossover location

Candidate solutions population

Building blocks population

Fitness evaluation

each individual is a sequence of blocks

Crossover location

Puzzle Algorithm: Zoom In

Candidate solutions population

Building blocks population

Fitness evaluation

each building block is contained in at least one individual in the solutions population

Crossover location

overlapping building blocks

Puzzle Algorithm: Zoom In

Fitness evaluation

Building blocks population

Candidate solutions population

Crossover location

The Candidate Solutions Population
• Representation, fitness evaluation, selection, and mutation are identical to the simple GA.
• Recombination-aid vector aids in selecting the recombination loci.
• Recombination-aid vector is updated by building blocks individuals.

Fitness evaluation

Building blocks population

Candidate solutions population

Crossover location

The Building Blocks Population
• An individual is represented as a sequence of blocks, contained in at least one candidate solution.
• Fitness of an individual is the average of the fitness of candidate solutions containing it.
• Fitness-proportionate selection.

Fitness evaluation

Building blocks population

Candidate solutions population

Crossover location

The Building Blocks Population (con’t)
• “Unisex” individuals.
• Two modification operators:
• Expansion: Increase it’s genome by one block. Occurs with high probability.
• Exploration: “Die”, and start over as a new 2-block individual. Occurs with low probability.

Candidate solutions population

Building blocks population

Building Blocks – Candidate Solutions

Fitness evaluation

f1

f2

f3

f4

Candidate solutions population

Building blocks population

f3

f2

f1

f1

f2

f3

f4

Building Blocks – Candidate Solutions

Fitness evaluation

f1

f2

f3

f4

Update “recombination-aid” vector

Recombination-aid vector

0

0

0

0

0

0

0

Solution’s genome

building block #1 fitness = 0.3

building block #2 fitness = 0.4

building block #3 fitness = 0.6

Update Recombination-aid vector

Recombination-aid vector

0

0.3

0.3

0

0.4

0.6

0

Solution’s genome

building block #1 fitness = 0.3

building block #2 fitness = 0.4

building block #3 fitness = 0.6

Update Recombination-aid vector

Recombination-aid vector

0.3

0.3

0.3

0

0.4

0.6

0.6

Solution’s genome

building block #1 fitness = 0.3

building block #2 fitness = 0.4

building block #3 fitness = 0.6

Update Recombination-aid vector

Recombination-aid vector

0.3

0.3

0.3

0

0.4

0.6

0.6

Solution’s genome

Recombination-loci selection

* Ties are broken arbitrarily

Experiments

Compare: GREEDY, Standard GA, Puzzle

Results: Summary

Average of the best superstring lengths

Algorithm

Problem size

GREEDY

Genetic

Puzzle

50 blocks

80 blocks

puzzle

cooperation

Cooperative

Puzzle

cooperation

puzzle

Relations Between The Algorithms

Co-Puzzle

GA

The Co-Puzzle Algorithm

Fitnessevaluation

Fitness eval

Fitness eval

Possible building blocks population

Candidate prefixes population

Possible building blocks population

Candidate suffixes population

Crossover location

Crossover location

Experiments

Compare: GREEDY, Cooperative Coevolution, Co-Puzzle

42% improvement over cooperative

Results: Summary

size of shortest common superstring

Algorithm

Problem size

GREEDY

Cooperative

Co-puzzle

50 blocks

80 blocks

Outline
• The “Shortest Common Superstring” problem.
• DNA sequencing and the input domain.
• Standard and cooperative coevolutionary genetic algorithm (GA).
• The Puzzle approach.
• Conclusions and future work.
• Messy Puzzle.
Results: Summary

size of shortest common superstring

Algorithm

Problem size

GREEDY

Cooperative

Puzzle

Co-puzzle

83% better

50 blocks

42% better

80 blocks

20 problem instances per experiment

25% better

90 blocks

13% better

100 blocks

Larger Problems - Using More Species

size of shortest common superstring

Algorithm

Problem size

GREEDY

Co-puzzle

3-Co-puzzle

110 blocks

120 blocks

Conclusions
• Cooperative coevolution might prove deleterious when too many species are used (when close to optimum?).
• When a suitable number of species are used, cooperative coevolution improves performance by decomposing the problem to several easier subproblems.
Conclusions (con’t)
• Evolving a population of building blocks to aid in the selection of recombination loci improves drastically the performance of a standard GA.
• Cooperation between cooperative coevolution and Puzzle ultimately improves global performance.
Future Work
• Test the (Co-) Puzzle approach on other problem domains.
• A hybrid GA.
• Tackle larger problems.
• Comparison to greedy-stochastically based local-search algorithms.
Outline
• The “Shortest Common Superstring” problem.
• DNA sequencing and the input domain.
• Standard and cooperative coevolutionary genetic algorithm (GA).
• The Puzzle approach.
• Conclusions and future work.
• Messy Puzzle.

Hillel Maoz

Ben-Gurion University, Israel

b

b

a

a

• A binary Genome of size n = 14.
• Genes a and btogether encode important information.
• Random cross over is applied.

Survival probability = The chance to appear in the offspring

• Left genome – 4/15
• Right genome – 14/15

In many cases it is hard to know the optimal representation

The MaxCut Problem
• Input: undirected weighted graph G=(V, E, W).
• Output: a partition of V into two disjoint sets (S,V\S).
• Goal: maximal sum of edge weights between the sets.
• NP-complete.
MaxCut - Example

Cut = 34

Cut = 47

Simple GA for MaxCut
• Population of candidate solutions
• Give each node with a number
• Assign ‘0’ or ‘1’ to indicate which set the node belongs to
• Iteration step
• Select any two parents
• Recombine and create an offspring
• Repeat until a new population is generated
• Fitness – The weight of the cut
The Representation Problem

“How to define the order of the vertices within the genome ?”

Messy Genes
• The main difficulty: identifying the related vertexes.
• Messy gene is an ordered pair<allele-locus,allele-value>.
• Possible solution:
• Use some sort of messy genes to detect related genes.
• Use the Puzzle approach to keep them together.
The Messy Puzzle Algorithm

A building block’s genome is represented as a sequence of messy genes

<0,0>

<2,0>

<1,1>

<5,0>

<6,1>

Messy Puzzle Algorithm
• Two population setup as in the puzzle algorithm.
• Enhanced recombination operator.
• Evolved building blocks structure (similar to puzzle).

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

0.8 0.7 0.6

I)

Add the 1st BB - success

II)

Add the 2nd BB - failure

Add the 3rd BB - success

III)

Simple crossover

IV)

Enhanced Recombination
Static Detection of Building Blocks
• Building blocks do not truly evolve.
• No Expansion and Exploration operators.
• Building blocks’ fitness is based on a number of generations.
• Purpose: to check and understand the core of the messy puzzle algorithm.

Distance to optimum