1 / 222

# The Population Haplotyping problem - PowerPoint PPT Presentation

The Population Haplotyping problem. NOTATION : each SNP only two values in a population ( bio ). Call them 0 and 1 . Also , call * the fact that a site is heterozygous. HAPLOTYPE : string over 0, 1 GENOTYPE : string over 0, 1, *

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'The Population Haplotyping problem' - luke-boyer

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### The PopulationHaplotyping problem

NOTATION: each SNP onlytwovalues in a population (bio).

Call them0and 1. Also, call *the factthat a site isheterozygous

HAPLOTYPE: string over 0, 1

GENOTYPE: string over 0, 1, *

where0={0}, 1={1}, *={0,1}

01

0*

00

10

1*

11

11

11

11

01

**

10

10

*0

00

10

10

10

*0

10

00

0 +

0 =

---

0

1 +

1 =

---

1

0 + 1 +

1 = 0 =

--- ---

* *

Heterozygous (ambiguous) sites

Homozygous sites

01

00

0*

10

1*

11

01

**

11

10

**

00

10

00

10

*0

10

10

11101

10000

11100

10001

11001

10100

11000

10101

1**0*

For k heterozygous (ambiguous) sites, there are 2k-1 possible phasings

It is too expensive to determine haplotypes directly

Much cheaper to determine genotypes, and then infer haplotypes in silico:

Given genotypes of k individuals, determine the phasings

of all heterozygous sites.

This yields a set H, of (at most) 2k haplotypes. H is a resolution of G.

The input is GENOTYPE data

*1**1

11**1

11011

****1

00011

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }

The input is GENOTYPE data

11011

01101

*1**1

11011

11101

11**1

11011

11011

11011

00011

11101

****1

00011

00011

00011

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }

OUTPUT: H = { 11011, 11101, 00011, 01101}

Each genotype is resolved by two haplotypes

We will define some objectives for H

-without objectives/constraints, the

haplotyping problem would be

(mathematically)trivial

E.g., always put 0 above and 1 below

**0*1  00001

11011

1*0** 10000

11011

-the objectives/constraints must be

“driven by biology”

1°) Clark’s inference rule

2°) Perfect Phylogeny

3°) DiseaseAssociation

4°) (parsimony): minimize |H|

Obj: Clark’s rule

for a compatible pair h , g

known haplotypeh

1011001011 +

?????????? =

1**1001*1*

known (ambiguos) genotype g

for a compatible pair h , g

known haplotypeh

1011001011 +

1101001110 =

1**1001*1*

new (derived) haplotype h’

known (ambiguos) genotype g

We write h + h’ = g

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

1100

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

1100

SUCCESS

1111

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

0100

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

0100

FAILURE (can’t resolve 1122 )

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic: the algorithm could end without explaining

all genotypes even if an explanation was possible.

The number of genotypes solved depends on order of application.

OBJ: find order of application rule that leaves the fewest elements in G

(ISMB 2000, and Journal of Comp. Biol., 2001)

- problem is APX-hard

- it corresponds to finding largest forest in a graph with

haplotypes as nodes and arcs for possible derivations

-solved via ILP of exponential-size

(practical for small real instances)

Obj: Perfect Phylogeny

mutations/evolution of haplotypes

- parsimony is very relialable on “small”

haplotype blocks

- when haplotypes are large (span several

SNPs, we should consider evolutionionary

events and recombination)

- the cleanest model for evolution is the

perfect phylogeny

3rd objective is based on perfect phylogeny

- A phylogeny expalains set of binary features (e.g. flies, has fur…)

with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

3rd objective is based on perfect phylogeny

- A phylogeny expalains set of binary features (e.g. flies, has fur…)

with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

has 2 legs

has tail

flies

Perfect phylogeny is possible…

- A phylogeny expalains set of binary features (e.g. flies, has fur…)

with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

has 2 legs

has tail

flies

two legs

tail

Human 1 0 0

Mouse 0 1 0

Spider 0 0 0

Eagle 1 0 1

Theorem: such matrix has p.p. iff there is not a 00 4x2 minor

10

01

11

two legs

tail

Human 1 0 0

Mouse 0 1 0

Spider 0 0 0

Eagle 1 0 1

Mickey mouse 1 1 0

Theorem: such matrix has p.p. iff there is not a 00 4x2 minor

10

01

11

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 * 0

* 1 0 *

* 0 * 0

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 0 0

0 1 1 0

1 1 0 1

0 1 0 0

1 0 0 0

0 0 1 0

0 1 * 0

* 1 0 *

* 0 * 0

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 0 0

0 1 1 0

1 1 0 1

0 1 0 0

1 0 0 0

0 0 1 0

0 1 * 0

* 1 0 *

* 0 * 0

NOT a perfect phylogeny solution !

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 * 0

0 1 0 *

0 0 0 *

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 0 0

0 1 1 0

0 1 0 0

1 1 0 1

0 0 0 0

0 0 0 1

0 1 * 0

0 1 0 *

0 0 0 *

A perfect phylogeny

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Algorithms are of combinatorial nature

- There is a graph for which SNPs are columns and edges are of two types

(forced and free)

- forced edges connect pairs of SNPs that must be phased in the same way

** 00 + 11 or ** 01 + 10

- a complex visit of the graph decides how to phase free SNPs

Obj: Disease Association

Some diseases may be due to a gene which has “faulty” configurations

RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased

one must have both copies faulty. With one copy one is a carrier of the disease

DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be

diseased it is enough to have one faulty copy

Two individuals of which one is healthy and the other diseased

may have the same genotype.

The explanation of the disease lies in a difference in their haplotypes

11**1 configurations

*1**1

0*011

11**1

0**01

00011

INPUT: GD = {11**1,*1**1,0*011},

GH ={11**1,0**01,00011}

11011 configurations

01101

11011

11101

01011

00011

11**1

*1**1

0*011

11001

11111

01101

00001

00011

00011

11**1

0**01

00011

INPUT: GD = {11**1,*1**1,0*011},

GH ={11**1,0**01,00011}

OUTPUT: H = { 11011,01011,00001,11111,11101,00011,01101}

H contains HD, s.t. each diseased has >=1 haplotype in HD and each healty none

Theorem 1 is proved via a reduction from 3 SAT configurations

Theorem 2 has a mathematical proof (coloring argument) with

little relation to biology:

There is R (depending on input) s.t. a haplotype is healthy

if the sum of its bits is congruent to R modulo 3

This means the model must be refined!

4th configurations

Obj: MaxParsimony

 See separate slides…

Summary: configurations

- haplotyping in-silico needed for economical reasons

- several objectives, all biologically driven

- nice combinatorial problems

(mostly due to binary nature of SNPs)

- these problems are technology-dependant and may

become obsolete (hopefully after we have retired)

111 configurations

111

011

101

011

000

111

111

010

001

010

011

022

222

012

221

222

011

012

022

022

211

111

012

2nd Objective (parsimony) configurations :

minimize |H|

1. The problem is APX-Hard configurations

Reduction from VERTEX-COVER

B configurations

A

C

D

E

A B C D E * configurations

B

A

C

D

E

A B C D E * configurations

AB

BC

AE

DE

B

A

C

D

E

A B C D E * configurations

AB

BC

AE

DE

A

B

C

D

E

B

A

C

D

E

A B C D E * configurations

AB 2 2

BC 2 2

AE 22

DE 2 2

A

B

C

D

E

B

A

C

D

E

A B C D E * configurations

AB 2 2

BC 2 2

AE 22

DE 2 2

A 0

B 0

C 0

D 0

E 0

B

A

C

D

E

A B C D E * configurations

AB 2 2 2

BC 2 2 2

AE 222

DE 2 22

A 00

B 00

C 00

D 00

E 00

B

A

C

D

E

A B C D E * configurations

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

B

A

C

D

E

A B C D E * configurations

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

B

A

C

D

E

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that

explain all genotypes

A B C D E * configurations

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

B

A

C

D

E

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that

explain all genotypes

A B C D E * configurations

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

A’ 0 1 1 1 1 1

B’ 1 0 1 1 1 1

E’ 1 1 1 1 0 1

B

A

C

D

E

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that

explain all genotypes

A basic ILP formulation configurations

A basic ILP formulation configurations

Expand your input G in all possible ways

220

022

120

A basic ILP formulation configurations

Expand your input G in all possible ways

220

022

120

010 + 100, 000 + 110

000 + 011, 001 + 010

100 + 110

A basic ILP formulation configurations

Expand your input G in all possible ways

220

022

120

010 + 100, 000 + 110

000 + 011, 001 + 010

100 + 110

The input is formulations GENOTYPE data

11011

01101

*1**1

11011

11101

11**1

11011

11011

11O11

00011

11101

****1

00011

00011

OOO11

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011}

OUTPUT: H = { 11011, 11101, 00011, 01101}

Each genotype is explained by two haplotypes

We will define some objectives for H

1st Objective formulations (open research problem):

minimize |H|

2nd Objective based on inference rule:

1st Objective (parsimony) formulations :

minimize |H|

An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence, we need at least LB = SQRT(n) haplotypes.

BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a

solution of <= 2n haplotypes, i.e. <= SQRT(n) * LB

It’s difficult, but not impossible, to come up with better approximations, like constants

(Lancia, Pinotti, Rizzi ’02)

2nd Objective formulations based on inference rule:

Inference Rule formulations

known haplotypeh

xoxxooxoxx +

********** =

x??xoox?x?

known (ambiguos) genotype g

Inference Rule formulations

known haplotypeh

xoxxooxoxx +

xxoxooxxxo =

x??xoox?x?

new (derived) haplotype h’

known (ambiguos) genotype g

Inference Rule formulations

known haplotypeh

xoxxooxoxx +

xxoxooxxxo =

x??xoox?x?

new (derived) haplotype h’

known (ambiguos) genotype g

We write h + h’ = g

g and h must be compatible to derive h’

2nd Objective (Clark, 1990) formulations

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

2nd Objective (Clark, 1990) formulations

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

2nd Objective (Clark, 1990) formulations

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

2nd Objective (Clark, 1990) formulations

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

xxoo

2nd Objective (Clark, 1990) formulations

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

xxoo

SUCCESS

xxxx

2nd Objective (Clark, 1990) formulations

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

2nd Objective (Clark, 1990) formulations

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

oxoo

2nd Objective (Clark, 1990) formulations

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

oxoo

FAILURE (can’t resolve xx?? )

OBJ: find order of application rule that leaves the fewest elements in G

- Problem is APX-hard (Gusfield,00) formulations

- Graph-Model + Integer Programming for practical solution (G.,01)

- Problem is APX-hard (Gusfield,00) formulations

- Graph-Model + Integer Programming for practical solution (G.,01)

x??o?

1. expand

genotypes

- Problem is APX-hard (Gusfield,00) formulations

- Graph-Model + Integer Programming for practical solution (G.,01)

xoooo

xooox

xoxoo

x??o?

xoxox

xxooo

1. expand

genotypes

xxoox

xxxoo

xxxox

- Problem is APX-hard (Gusfield,00) formulations

- Graph-Model + Integer Programming for practical solution (G.,01)

xoooo

xooox

xoxoo

2. create (h, h’) if exists g s.t. h’ can be

derived from g and h

x??o?

xoxox

xxooo

1. expand

genotypes

xxoox

3. Largest number of nodes in forest

rooted at unambiguos genotpes =

= largest number of ambiguous genotypes

resolved

xxxoo

xxxox

Hence, find largest number of nodes in forest rooted at unambiguos genotpes. Use I.P. model with vars x(ij).

This reduction is exponential. Is there a better practical approach?

3rd Objective formulations (open research problem)

Disease Detection:

?x??x

xx??x

??oxx

????x

oooxx

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

3rd Objective formulations (open research problem)

Disease Detection:

xxoxx

oxxox

?x??x

xxoxx

xxxox

xx??x

xxoxx

oooxx

??oxx

oooxx

xxxox

????x

oooxx

oooxx

oooxx

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}

H contains H’, s.t. each diseased has one haplotype in H’ and each healty none

minimize | H|

### Genome Rearrangements and Evolutionary Distances formulations

Each species has a genome (organized in pairs of chromosomes)

tcgtgatggat………………ttgatggattga

tcgattatggat………………ttttgatatcca

Genomes evolve by means of

• Insertions

• Deletions

• Inversions

• Transpositions

• Translocations

of DNA regions

deletion chromosomes)

insertion chromosomes)

deletion

insertion chromosomes)

deletion

translocation

insertion chromosomes)

deletion

translocation

inversion

insertion chromosomes)

deletion

transposition

translocation

inversion

Combinatorial problem chromosomes): given 2 permutations P, Q and operators in a set F find a

shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions)

(…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions,

transpositions…)

We focus on inversions, that are the most important in Nature

5 6 4 8 3 2 1 9 7

Example:

1 2 3 8 4 6 5 9 7

1 2 3 8 4 5 6 9 7

1 2 3 6 5 4 8 9 7

1 2 3 6 5 4 8 7 9

1 2 3 4 5 6 8 7 9

1 2 3 4 5 6 7 8 9

Combinatorial problem chromosomes): given 2 permutations P, Q and operators in a set F find a

shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions)

(…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions,

transposition…)

We focus on inversions, that are the most important in Nature

+5 +6 -4 -8 -3 -2 -1 -9 +7

Example:

+1 +2 +3 +8 +4 -6 -5 -9 +7

+1 +2 +3 +8 +4 +5 +6 -9 +7

+1 +2 +3 -6 -5 -4 -8 -9 +7

+1 +2 +3 -6 -5 -4 -8 -7 +9

+1 +2 +3 +4 +5 +6 -8 -7 +9

+1 +2 +3 +4 +5 +6 +7 +8 +9

There is also a SIGNED VERSION of the problem !

(Unsigned) Sorting by Inversions is chromosomes)NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautifultheory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

(Unsigned) Sorting by Inversions is chromosomes)NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautifultheory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

The concept of breakpoint

Breakpoint at position i if | p(i) - p(i+1) | > 1

0

5 7 8 2 1 4 3 6 9

10

(Unsigned) Sorting by Inversions is chromosomes)NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautifultheory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

The concept of breakpoint

Breakpoint at position i if | p(i) - p(i+1) | > 1

0

5 7 8 2 1 4 3 6 9

10

d(p) = inversion distance

b(p) = # breakpoints

TRIVIAL BOUND: d(p) >= b(p) / 2

Example: d(p) >= 6 / 2 = 3

The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Each node has degree...

10

4

6

0 2 or 4 …

hence the graph can be decomposed in cycles!

The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Alternating cycle decomposition

The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Alternating cycle decomposition

The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Alternating cycle decomposition

c(p) = max # cycles in alternating decomposition

VERY STRONG BOUND : d (p) >= b(p) - c(p)

Example: c(p)= 2 and d (p) >= 6 - 2 = 4

The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The best algorithm for this problem is based on an Integer Programming

formulation of the max cycle decomposition

A variable xC for each cycle (exponential # of vars…)

A constraint SxC = 1 for each edge e

C containing e

Objective: maximize SCxC

PRIMAL chromosomes)

max S xC

C

S xC = 1 for all edges e

C\ni e

xC \in {0,1} for all alt. cycles C

DUAL

min S ye

e

S ye <= 1 for all alt. Cycles C

e\in C

ye \in R for all edges e

PRIMAL chromosomes)

max S xC

C

S xC = 1 for all edges e

C\ni e

xC \in {0,1} for all alt. cycles C

DUAL

min S ye

e

S ye <= 1 for all alt. Cycles C

e\in C

ye \in R for all edges e

0 chromosomes)

5 7 8 2 1 4 3 6 9

10

Pricing out the cycles for which y*(C) < 1

0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

Split the graph in two copies

0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

Connect twins

0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles

0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles

0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles

0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles

0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles

.6 chromosomes)

1

.4

0

5 7 8 2 1 4 3 6 9

10

0

0

5 7 8 2 1 4 3 6 9

10

.5

.2

The weight of the matching is the y*-weight of the cycles

.6 chromosomes)

1

.4

0

5 7 8 2 1 4 3 6 9

10

100000

0

5 7 8 2 1 4 3 6 9

10

.5

.2

Forcing a cycle to use a certain node

- These cycles would not use the same node twice, but with simple trick is possible to model (OMISSIS)

BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999,2001)

BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)

KS can solve at most n=40. Take days for n=50

CLN can solve for n=200. Takes few seconds (say 5) for n=100

NP-hard problem practically solved to optimality!

Statistical view of evolution simple trick is possible to model (OMISSIS)

• Genome evolve by random inversions

• It’s like a random walk on a huge graph with an edge for each permutation an edge for each inversion

• It is not clear why the shortest solution should be the one followed by Nature (in fact, often it isn’t)

• We want to find the most likely number of inversions that lead from (1 2 … n ) to p

• We use the expected number of breakpoints after k inversions as a way to guess the # of inversions

Given a simple trick is possible to model (OMISSIS) p obtained by h random inversions from (1 … n ) we want to estimate h

The inversion distance is only a lower bound: h >= d(p) but the gap could be big

Let B(k)be the (r.v.) number of breakpoint after k random inversions from (1..n)

We estimate E[B(k)]. Then, faced with some p, we pick h such that E[B(h)] is as

close as possible to b(p) (maximum likelihood). CL ,2000, have shown:

E[B(k)] = ( n - 1 ) ( 1 - ())

k

n - 3

n - 1

Question: estimate E[D(k)], the (r.v.) inversion distance after k random inversions

Example: n = 200, k (u.a.r. in 1…n) inversions simple trick is possible to model (OMISSIS)

k k’ d(p) b

8 8 8 16

19 19 19 34

68 67 67 98

69 73 68 104

73 79 73 109

85 91 83 120

86 85 83 115

87 90 84 119

118 117 109 138

184 184 135 168

### Protein Structure Alignments: the Maximum Contact Map Overlap Problem

A Overlap ProblemProtein is a complex molecule with a primary,

linear structure (a sequence of aminoacids) and a

3-Dimensional structure (the protein fold).

Protein STRUCTURE determines its FUNCTION

For instance, the Drug Design problem

calls for constructing peptides with a 3D

shape complementary to a protein, so as

to dock onto it.

Problem: Overlap Problem

Motivation:

Align two 3D protein structures

Structure Alignment is Important for:

- Discovery of Protein Function (shape determines function)

- Search in 3D data bases

- Protein Classification and Evolutionary Studies

- ...

### Contact Maps Overlap Problem

CONTACT MAPS Overlap Problem

Unfolded protein

CONTACT MAPS Overlap Problem

Unfolded protein

Folded protein = contacts

CONTACT MAPS Overlap Problem

Unfolded protein

Folded protein = contacts

Contact map = graph

CONTACT MAPS Overlap Problem

Unfolded protein

Folded protein = contacts

Contact map = graph

OBJECTIVE:align 3d folds of proteins

= align contact maps

### Contact Map Alignments Overlap Problem

Non-crossing Alignments Overlap Problem

Protein 1

Protein 2

non-crossing map of residues in protein 1 and protein 2

The value of an alignment Overlap Problem

The value of an alignment Overlap Problem

The value of an alignment Overlap Problem

The value of an alignment Overlap Problem

Value = 3

The value of an alignment Overlap Problem

Value = 3

We want to maximize the value

The value of an alignment Overlap Problem

NP-Hard

Integer Programming Formulation Overlap Problem

e

0-1 VARIABLES

yef

yef for e and f contacts

f

Integer Programming Formulation Overlap Problem

e

0-1 VARIABLES

yef

yef for e and f contacts

e’

e

f

CONSTRAINTS

yef + ye’f’<= 1

f’

f

Integer Programming Formulation Overlap Problem

e

0-1 VARIABLES

yef

yef for e and f contacts

e’

e

f

CONSTRAINTS

yef + ye’f’<= 1

f’

f

max SeSf yef

OBJECTIVE

Independent Set Problem Overlap Problem

It’s just a huge max independent set problem in Gy:

• a node for each sharing

• an edge for each pair of incompatible sharings

e’

e

e’’

e’

f’

e

f

e’’

f’’

f’’

f

f’

Independent Set Problem Overlap Problem

It’s just a huge max independent set problem in Gy:

• a node for each sharing

• an edge for each pair of incompatible sharings

e’

e

e’’

e’

f’

e

f

e’’

f’’

f’’

f

f’

|Gy|=|E1|*|E2| (approximately 5000 for two proteins with 50 residues and 75 contacts each)

The best exact algorithm for independent set can solve for at most a few hundred

nodes

Node to Node Variables Overlap Problem

New variables x provide an easy check for the non-crossing conditions

e

NEW VARIABLES

i

xij for i and j residues

yef

xij

j

f

Node to Node Variables Overlap Problem

New variables x provide an easy check for the non-crossing conditions

e

NEW VARIABLES

i

xij for i and j residues

yef

xij

j

f

NEW CONSTRAINTS

i

i’

j’

j

xij + xi’j’ <= 1

Node to Node Variables Overlap Problem

New variables x provide an easy check for the non-crossing conditions

e

NEW VARIABLES

i

xij for i and j residues

yef

xij

j

f

NEW CONSTRAINTS

i

i’

p

i

q

j

j’

j

xij + xi’j’<= 1

y(ip)(jq) <= xij and y(ip)(jq)<= xpq

Clique Constraints Overlap Problem

Variables x define a graph Gx:

• A node for each line

• An edge between each pair of crossing lines

i

i’

i’

j’

i

j

j’

j

Clique Constraints Overlap Problem

Variables x define a graph Gx:

• A node for each line

• An edge between each pair of crossing lines

i

i’

i’

j’

i

j

j’

j

• Gx is much smaller than Gy

• Gx has nice proprieties (it’s a perfect graph)

• It’s easier to find large independent sets in Gx

Clique Constraints Overlap Problem

Non-crossing constraints can be extended to

CLIQUE CONSTRAINTS

S xij<= 1

[i,j] in M

For all sets M of mutually incompatible (i.e. crossing) lines

All clique constraints satisfied (and Gx perfect) imply a strong bound!

Structure of Maximal cliques in G Overlap Problemx

1. Pick two subsets of same size

Structure of Maximal cliques in G Overlap Problemx

2. Connect them in a zig-zag fashion

Structure of Maximal cliques in G Overlap Problemx

3. Throw in all lines included in a zig or a zag

Structure of Maximal cliques in G Overlap Problemx

3. Throw in all lines included in a zig or a zag

Structure of Maximal cliques in G Overlap Problemx

The result is a maximal clique in Gx

### Separation of Clique Inequalities Overlap Problem

Separation of Clique Inequalities Overlap Problem

PROBLEM

There exist exponentially many such cliques (O(22n) inequalities).

We need to generate in polynomial time a clique inequality when needed,

i.e., when violated by the current LP solution x*

S x*ij> 1

[i,j] in M

THEOREM

We can find the most violated clique inequality in time O(n2)

Separation of Clique Inequalities Overlap Problem

PROOF (sketch)

1) Clique = zigzag path

Separation of Clique Inequalities Overlap Problem

PROOF (sketch)

1) Clique = zigzag path

1 2 3 4 5 6 7 8

Separation of Clique Inequalities Overlap Problem

PROOF (sketch)

2) Flip one graph: zigzag leftright

1) Clique = zigzag path

1 2 3 4 5 6 7 8

8 7 6 5 4 3 2 1

Separation of Clique Inequalities Overlap Problem

PROOF (sketch)

2) Flip one graph: zigzag leftright

1) Clique = zigzag path

1 2 3 4 5 6 7 8

8 7 6 5 4 3 2 1

3) Define a grid with lengths for arcs so that length(P) = x*(clique(P)). Use Dyn. Progr.

to find longest path in grid, time O(n^2)

Separation of cliques Overlap Problem

2

n1

1

i

1

2

u

n2

Create n1 x n2 grid

Orient all edges and give weights

Separation of cliques Overlap Problem

2

n1

1

i

1

2

x*iu

u

x*iu

n2

Create n1 x n2 grid

Orient all edges and give weights

Separation of cliques Overlap Problem

B=(n1,1)

A=(1,n2)

Create n1 x n2 grid

Orient all edges and give weights

There is violated clique iff longest A,B path has length > 1

Gx is a Perfect Graph Overlap Problem

We show why polynomial separation is possible:

Gx is weakly triangulated (no chordless cycles >= 5 in Gx or Gx)

=> Gx is perfect (Hayward, 1985)

Gx is a Perfect Graph Overlap Problem

L1 and L3 don’t cross. Wlog RIGHT(L3, L1)

PROOF (Sketch, for Gx)

L1

L7

L2

L6

L3

L5

L4

Gx is a Perfect Graph Overlap Problem

L1 and L3 don’t cross. Wlog RIGHT(L3, L1)

L1

L7

L2

L6

L3

L1

L3

L5

L4

Gx is a Perfect Graph Overlap Problem

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L1

L3

L5

L4

Gx is a Perfect Graph Overlap Problem

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L4

L1

L3

L5

L4

Gx is a Perfect Graph Overlap Problem

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L4

L5

L1

L5

L4

Gx is a Perfect Graph Overlap Problem

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L6

L5

L1

L5

L4

Gx is a Perfect Graph Overlap Problem

We get LEFT(L1, {L3, L4, L5, L6})

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5 L6

Gx is a Perfect Graph Overlap Problem

A symmetric argument started at L6, with LEFT(L1, L6) implies LEFT(Li, L6) for i=2,3,4,5

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5 L6

Gx is a Perfect Graph Overlap Problem

A symmetric argument started at L6, with LEFT(L1, L6) implies LEFT(Li, L6) for i=2,3,4,5

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5 L6

L2, L3, L4 L5

Gx is a Perfect Graph Overlap Problem

Then {L3, L4, L5} are between L1 and L6

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5L6

L2, L3, L4 L5

Gx is a Perfect Graph Overlap Problem

Then {L3, L4, L5} are between L1 and L6

But L7 crosses L1 and L6, and so should cross them all !

L1

L7

L2

L6

L3

L6

L7

L1

L5

L4

L3, L4, L5L6

L2, L3, L4 L5

It can be applied to small or moderate proteins (up to 80 residues/150 contacts)

In 2002, a new approach, by Caprara and Lancia, based on LAGRANGIAN

RELAXATION. Approach borrowed from Quadratic Assignment. With new

approach we can solve important proteins (up to 150 residues/300 contacts)

### What about Heuristics? Walenz (2001)E.g., genetic algorithms…

Population Walenz (2001)

at time t

Population

at time t+1

Recombination

operators

Evaluation

function

Genetic Algorithm Overview

• A Population of candidate solutions thatevolve (improve) over time

• Recombination creates new candidate solutions viacrossover and mutation

Crossover Walenz (2001)

• Crossover selects pieces from both parents and creates two offspring solutions

Offspring

Blue Parent

Red Parent

Crossover Walenz (2001)

• Crossover selects pieces from both parents and creates two offspring solutions

• Select a set of edges in one parent to copy to the child

Crossover Walenz (2001)

• Crossover selects pieces from both parents and creates two offspring solutions

• Select a set of edges in one parent to copy to the child

Crossover Walenz (2001)

• Crossover selects pieces from both parents and creates two offspring solutions

• Select a set of edges in one parent to copy to the child

• Copy as many edges as possible from the other parent

These edges conflict with existing Walenz (2001)

edges and are not copied

Crossover

• Crossover selects pieces from both parents and creates two offspring solutions

• Select a set of edges in one parent to copy to the child

• Copy as many edges as possible from the other parent

Crossover Walenz (2001)

• Crossover selects pieces from both parents and creates two offspring solutions

• Select a set of edges in one parent to copy to the child

• Copy as many edges as possible from the other parent

• Add random edges to fill any remaining space

Crossover Walenz (2001)

• Crossover selects pieces from both parents and creates two offspring solutions

• Select a set of edges in one parent to copy to the child

• Copy as many edges as possible from the other parent

• Add random edges to fill any remaining space

Mutation Walenz (2001)

• Mutation introduces small changes to existing solutions by shifting edge endpoints

Mutation Walenz (2001)

• Mutation introduces small changes to existing solutions by shifting edge endpoints

• Select a set of endpoints to shift

Mutation Walenz (2001)

• Mutation introduces small changes to existing solutions by shifting edge endpoints

• Select a set of endpoints to shift

Mutation Walenz (2001)

• Mutation introduces small changes to existing solutions by shifting edge endpoints

• Select a set of endpoints to shift

This edge “fell off” the

end of the contact map

and is removed

Mutation Walenz (2001)

• Mutation introduces small changes to existing solutions by shifting edge endpoints

• Select a set of endpoints to shift

Mutation Walenz (2001)

• Mutation introduces small changes to existing solutions by shifting edge endpoints

• Select a set of endpoints to shift

### Computational Results Walenz (2001)

Computational Results Walenz (2001)

• 269 proteins

• 70 -100 residues

• 80 to 140 contacts

• Picked 10,000 pairs of proteins out of 36046 possible

• Took a weekend on PC

• 500 were solved to optimality

• 2500 had a gap <= 10 contacts

### Skolnick Clustering Test Walenz (2001)

Skolnick Results Walenz (2001)

• Four Families

• Flavodoxin-like fold Che-Y related

• Plastocyanin

• TIM Barrel

• Ferratin

• alpha-beta

• 8 structures

• up to 124 residues

• 15-30% sequence similarity

• < 3Å RMSD

Skolnick Results Walenz (2001)

• Four Families

• Flavodoxin-like fold Che-Y related

• Plastocyanin

• TIM Barrel

• Ferratin

• beta

• 8 structures

• up to 99 residues

• 35-90% sequence similarity

• < 2Å RMSD

Skolnick Results Walenz (2001)

• Four Families

• Flavodoxin-like fold Che-Y related

• Plastocyanin

• TIM Barrel

• Ferratin

• alpha-beta

• 11 structures

• up to 250 residues

• 30-90% sequence similarity

• < 2Å RMSD

Skolnick Results Walenz (2001)

• Four Families

• Flavodoxin-like fold Che-Y related

• Plastocyanin

• TIM Barrel

• Ferratin

• alpha

• 6 structures

• up to 170 residues

• 7-70% sequence similarity

• < 4Å RMSD

Skolnick Results Walenz (2001)

• Four Families

• Flavodoxin-like fold Che-Y related

• Plastocyanin

• TIM Barrel

• Ferratin

Clustering Walenz (2001)

Define score(P1, P2) as

# shared contacts

Min # of contacts of P1,P2

0 <=

<= 1

Put P1, P2 in same family if score(P1, P2) >= threshold

Clustering Walenz (2001)

Define score(P1, P2) as

# shared contacts

Min # of contacts of P1,P2

0 <=

<= 1

Put P1, P2 in same family if score(P1, P2) >= threshold

If P1, P2 too big, use G.A. and local search to compute score

L.P. gives then bounds:

HEUR score <= OPT score <= LP bound

and we know how far off OPT we are

Clustering validation Walenz (2001)

We got some known families from biologists, PDB.

Experiment: Take a family F of proteins and align them against each

other and against the remaining.

Clustering validation Walenz (2001)

We got some known families from biologists, PDB.

Experiment: Take a family F of proteins and align them against each

other and against the remaining.

TYPICAL BEHAVIOUR

score proteins were…

0.05 MISMATCH

0.1 MISMATCH

0.15 MISMATCH

0.2 MISMATCH

0.25 MISMATCH

0.3 MISMATCH

0.35 MATCH

…… ……

1.0 MATCH

Skolnick Results Walenz (2001)

• Performance

• 528 alignments

• 1.3% false negative

• 0.0% false positive

Clustering Walenz (2001)

Computed, for 1st time, provably optimal alignments for 150 pairs

(inter-family)

Used the CMO value to cluster: retrieves the clusters.

Set S(i,j) = 1 if CMO >= a, S(i,j) = 0 otherwise

Use TSP to find a block diagonal structure for S

Clustering Walenz (2001)

Last Open Problem Walenz (2001)

?

?