- 33 Views
- Uploaded on
- Presentation posted in: General

The Population Haplotyping problem

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

The PopulationHaplotyping problem

NOTATION: each SNP onlytwovalues in a population (bio).

Call them0and 1. Also, call *the factthat a site isheterozygous

HAPLOTYPE: string over 0, 1

GENOTYPE: string over 0, 1, *

where0={0}, 1={1}, *={0,1}

01

0*

00

10

1*

11

11

11

11

01

**

10

10

*0

00

10

10

10

*0

10

00

ALGEBRA OF HAPLOTYPES:

0 +

0 =

---

0

1 +

1 =

---

1

0 + 1 +

1 = 0 =

--- ---

* *

Heterozygous (ambiguous) sites

Homozygous sites

01

00

0*

10

1*

11

01

**

11

10

**

00

10

00

10

*0

10

10

Phasing the alleles

11101

10000

11100

10001

11001

10100

11000

10101

1**0*

For k heterozygous (ambiguous) sites, there are 2k-1 possible phasings

THE PHASING (or HAPLOTYPING) PROBLEM

It is too expensive to determine haplotypes directly

Much cheaper to determine genotypes, and then infer haplotypes in silico:

Given genotypes of k individuals, determine the phasings

of all heterozygous sites.

This yields a set H, of (at most) 2k haplotypes. H is a resolution of G.

The input is GENOTYPE data

*1**1

11**1

11011

****1

00011

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }

The input is GENOTYPE data

11011

01101

*1**1

11011

11101

11**1

11011

11011

11011

00011

11101

****1

00011

00011

00011

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }

OUTPUT: H = { 11011, 11101, 00011, 01101}

Each genotype is resolved by two haplotypes

We will define some objectives for H

OBJECTIVES

-without objectives/constraints, the

haplotyping problem would be

(mathematically)trivial

E.g., always put 0 above and 1 below

**0*1 00001

11011

1*0** 10000

11011

-the objectives/constraints must be

“driven by biology”

OBJECTIVES

1°) Clark’s inference rule

2°) Perfect Phylogeny

3°) DiseaseAssociation

4°) (parsimony): minimize |H|

1st

Obj: Clark’s rule

Inference Rule

for a compatible pair h , g

known haplotypeh

1011001011 +

?????????? =

1**1001*1*

known (ambiguos) genotype g

Inference Rule

for a compatible pair h , g

known haplotypeh

1011001011 +

1101001110 =

1**1001*1*

new (derived) haplotype h’

known (ambiguos) genotype g

We write h + h’ = g

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

1100

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

1100

SUCCESS

1111

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

0100

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

0100

FAILURE (can’t resolve 1122 )

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic: the algorithm could end without explaining

all genotypes even if an explanation was possible.

The number of genotypes solved depends on order of application.

OBJ: find order of application rule that leaves the fewest elements in G

The problem was studied by Gusfield

(ISMB 2000, and Journal of Comp. Biol., 2001)

- problem is APX-hard

- it corresponds to finding largest forest in a graph with

haplotypes as nodes and arcs for possible derivations

-solved via ILP of exponential-size

(practical for small real instances)

2nd

Obj: Perfect Phylogeny

- Parsimony does not take into account

mutations/evolution of haplotypes

- parsimony is very relialable on “small”

haplotype blocks

- when haplotypes are large (span several

SNPs, we should consider evolutionionary

events and recombination)

- the cleanest model for evolution is the

perfect phylogeny

3rd objective is based on perfect phylogeny

- A phylogeny expalains set of binary features (e.g. flies, has fur…)

with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

3rd objective is based on perfect phylogeny

- A phylogeny expalains set of binary features (e.g. flies, has fur…)

with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

has 2 legs

has tail

flies

But…a new species may come along so that no

Perfect phylogeny is possible…

- A phylogeny expalains set of binary features (e.g. flies, has fur…)

with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

has 2 legs

has tail

flies

flies

two legs

tail

Human 1 0 0

Mouse 0 1 0

Spider 0 0 0

Eagle 1 0 1

Theorem: such matrix has p.p. iff there is not a 00 4x2 minor

10

01

11

flies

two legs

tail

Human 1 0 0

Mouse 0 1 0

Spider 0 0 0

Eagle 1 0 1

Mickey mouse 1 1 0

Theorem: such matrix has p.p. iff there is not a 00 4x2 minor

10

01

11

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 * 0

* 1 0 *

* 0 * 0

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 0 0

0 1 1 0

1 1 0 1

0 1 0 0

1 0 0 0

0 0 1 0

0 1 * 0

* 1 0 *

* 0 * 0

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 0 0

0 1 1 0

1 1 0 1

0 1 0 0

1 0 0 0

0 0 1 0

0 1 * 0

* 1 0 *

* 0 * 0

NOT a perfect phylogeny solution !

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 * 0

0 1 0 *

0 0 0 *

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 0 0

0 1 1 0

0 1 0 0

1 1 0 1

0 0 0 0

0 0 0 1

0 1 * 0

0 1 0 *

0 0 0 *

A perfect phylogeny

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Algorithms are of combinatorial nature

- There is a graph for which SNPs are columns and edges are of two types

(forced and free)

- forced edges connect pairs of SNPs that must be phased in the same way

** 00 + 11 or ** 01 + 10

- a complex visit of the graph decides how to phase free SNPs

3rd

Obj: Disease Association

Some diseases may be due to a gene which has “faulty” configurations

RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased

one must have both copies faulty. With one copy one is a carrier of the disease

DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be

diseased it is enough to have one faulty copy

Two individuals of which one is healthy and the other diseased

may have the same genotype.

The explanation of the disease lies in a difference in their haplotypes

11**1

*1**1

0*011

11**1

0**01

00011

INPUT: GD = {11**1,*1**1,0*011},

GH ={11**1,0**01,00011}

11011

01101

11011

11101

01011

00011

11**1

*1**1

0*011

11001

11111

01101

00001

00011

00011

11**1

0**01

00011

INPUT: GD = {11**1,*1**1,0*011},

GH ={11**1,0**01,00011}

OUTPUT: H = { 11011,01011,00001,11111,11101,00011,01101}

H contains HD, s.t. each diseased has >=1 haplotype in HD and each healty none

Theorem 1 is proved via a reduction from 3 SAT

Theorem 2 has a mathematical proof (coloring argument) with

little relation to biology:

There is R (depending on input) s.t. a haplotype is healthy

if the sum of its bits is congruent to R modulo 3

This means the model must be refined!

4th

Obj: MaxParsimony

See separate slides…

Summary:

- haplotyping in-silico needed for economical reasons

- several objectives, all biologically driven

- nice combinatorial problems

(mostly due to binary nature of SNPs)

- these problems are technology-dependant and may

become obsolete (hopefully after we have retired)

111

111

011

101

011

000

111

111

010

001

010

011

022

222

012

221

222

011

012

022

022

211

111

012

2nd Objective (parsimony) :

minimize |H|

1. The problem is APX-Hard

Reduction from VERTEX-COVER

B

A

C

D

E

A B C D E *

B

A

C

D

E

A B C D E *

AB

BC

AE

DE

AD

B

A

C

D

E

A B C D E *

AB

BC

AE

DE

AD

A

B

C

D

E

B

A

C

D

E

A B C D E *

AB 2 2

BC 2 2

AE 22

DE 2 2

AD 22

A

B

C

D

E

B

A

C

D

E

A B C D E *

AB 2 2

BC 2 2

AE 22

DE 2 2

AD 22

A 0

B 0

C 0

D 0

E 0

B

A

C

D

E

A B C D E *

AB 2 2 2

BC 2 2 2

AE 222

DE 2 22

AD 222

A 00

B 00

C 00

D 00

E 00

B

A

C

D

E

A B C D E *

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

B

A

C

D

E

A B C D E *

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

B

A

C

D

E

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that

explain all genotypes

A B C D E *

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

B

A

C

D

E

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that

explain all genotypes

A B C D E *

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

A’ 0 1 1 1 1 1

B’ 1 0 1 1 1 1

E’ 1 1 1 1 0 1

B

A

C

D

E

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that

explain all genotypes

A basic ILP formulation

A basic ILP formulation

Expand your input G in all possible ways

220

022

120

A basic ILP formulation

Expand your input G in all possible ways

220

022

120

010 + 100, 000 + 110

000 + 011, 001 + 010

100 + 110

A basic ILP formulation

Expand your input G in all possible ways

220

022

120

010 + 100, 000 + 110

000 + 011, 001 + 010

100 + 110

The resulting Integer Program (IP1):

Other ILP formulation are possible. E.g. POLY-SIZE ILP formulations

The input is GENOTYPE data

11011

01101

*1**1

11011

11101

11**1

11011

11011

11O11

00011

11101

****1

00011

00011

OOO11

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011}

OUTPUT: H = { 11011, 11101, 00011, 01101}

Each genotype is explained by two haplotypes

We will define some objectives for H

1st Objective (open research problem):

minimize |H|

2nd Objective based on inference rule:

1st Objective (parsimony) :

minimize |H|

An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence, we need at least LB = SQRT(n) haplotypes.

BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a

solution of <= 2n haplotypes, i.e. <= SQRT(n) * LB

It’s difficult, but not impossible, to come up with better approximations, like constants

(Lancia, Pinotti, Rizzi ’02)

2nd Objective based on inference rule:

Inference Rule

known haplotypeh

xoxxooxoxx +

********** =

x??xoox?x?

known (ambiguos) genotype g

Inference Rule

known haplotypeh

xoxxooxoxx +

xxoxooxxxo =

x??xoox?x?

new (derived) haplotype h’

known (ambiguos) genotype g

Inference Rule

known haplotypeh

xoxxooxoxx +

xxoxooxxxo =

x??xoox?x?

new (derived) haplotype h’

known (ambiguos) genotype g

We write h + h’ = g

g and h must be compatible to derive h’

2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

xxoo

2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

xxoo

SUCCESS

xxxx

2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

oxoo

2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

oxoo

FAILURE (can’t resolve xx?? )

OBJ: find order of application rule that leaves the fewest elements in G

- Problem is APX-hard (Gusfield,00)

- Graph-Model + Integer Programming for practical solution (G.,01)

- Problem is APX-hard (Gusfield,00)

- Graph-Model + Integer Programming for practical solution (G.,01)

x??o?

1. expand

genotypes

- Problem is APX-hard (Gusfield,00)

- Graph-Model + Integer Programming for practical solution (G.,01)

xoooo

xooox

xoxoo

x??o?

xoxox

xxooo

1. expand

genotypes

xxoox

xxxoo

xxxox

- Problem is APX-hard (Gusfield,00)

- Graph-Model + Integer Programming for practical solution (G.,01)

xoooo

xooox

xoxoo

2. create (h, h’) if exists g s.t. h’ can be

derived from g and h

x??o?

xoxox

xxooo

1. expand

genotypes

xxoox

3. Largest number of nodes in forest

rooted at unambiguos genotpes =

= largest number of ambiguous genotypes

resolved

xxxoo

xxxox

Hence, find largest number of nodes in forest rooted at unambiguos genotpes. Use I.P. model with vars x(ij).

This reduction is exponential. Is there a better practical approach?

3rd Objective (open research problem)

Disease Detection:

?x??x

xx??x

??oxx

????x

oooxx

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

3rd Objective (open research problem)

Disease Detection:

xxoxx

oxxox

?x??x

xxoxx

xxxox

xx??x

xxoxx

oooxx

??oxx

oooxx

xxxox

????x

oooxx

oooxx

oooxx

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}

H contains H’, s.t. each diseased has one haplotype in H’ and each healty none

minimize | H|

Genome Rearrangements and Evolutionary Distances

Each species has a genome (organized in pairs of chromosomes)

tcgtgatggat………………ttgatggattga

tcgattatggat………………ttttgatatcca

Genomes evolve by means of

- Insertions
- Deletions
- Inversions
- Transpositions
- Translocations

of DNA regions

deletion

insertion

deletion

insertion

deletion

translocation

insertion

deletion

translocation

inversion

insertion

deletion

transposition

translocation

inversion

Combinatorial problem: given 2 permutations P, Q and operators in a set F find a

shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions)

(…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions,

transpositions…)

We focus on inversions, that are the most important in Nature

5 6 4 8 3 2 1 9 7

Example:

1 2 3 8 4 6 5 9 7

1 2 3 8 4 5 6 9 7

1 2 3 6 5 4 8 9 7

1 2 3 6 5 4 8 7 9

1 2 3 4 5 6 8 7 9

1 2 3 4 5 6 7 8 9

Combinatorial problem: given 2 permutations P, Q and operators in a set F find a

shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions)

(…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions,

transposition…)

We focus on inversions, that are the most important in Nature

+5 +6 -4 -8 -3 -2 -1 -9 +7

Example:

+1 +2 +3 +8 +4 -6 -5 -9 +7

+1 +2 +3 +8 +4 +5 +6 -9 +7

+1 +2 +3 -6 -5 -4 -8 -9 +7

+1 +2 +3 -6 -5 -4 -8 -7 +9

+1 +2 +3 +4 +5 +6 -8 -7 +9

+1 +2 +3 +4 +5 +6 +7 +8 +9

There is also a SIGNED VERSION of the problem !

(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautifultheory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautifultheory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

The concept of breakpoint

Breakpoint at position i if | p(i) - p(i+1) | > 1

0

5 7 8 2 1 4 3 6 9

10

(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautifultheory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

The concept of breakpoint

Breakpoint at position i if | p(i) - p(i+1) | > 1

0

5 7 8 2 1 4 3 6 9

10

d(p) = inversion distance

b(p) = # breakpoints

TRIVIAL BOUND: d(p) >= b(p) / 2

Example: d(p) >= 6 / 2 = 3

The Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Each node has degree...

10

4

6

0 2 or 4 …

hence the graph can be decomposed in cycles!

The Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Alternating cycle decomposition

The Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Alternating cycle decomposition

The Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Alternating cycle decomposition

c(p) = max # cycles in alternating decomposition

VERY STRONG BOUND : d (p) >= b(p) - c(p)

Example: c(p)= 2 and d (p) >= 6 - 2 = 4

The Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The best algorithm for this problem is based on an Integer Programming

formulation of the max cycle decomposition

A variable xC for each cycle (exponential # of vars…)

A constraint SxC = 1 for each edge e

C containing e

Objective: maximize SCxC

PRIMAL

max S xC

C

S xC = 1 for all edges e

C\ni e

xC \in {0,1} for all alt. cycles C

DUAL

min S ye

e

S ye <= 1 for all alt. Cycles C

e\in C

ye \in R for all edges e

PRIMAL

max S xC

C

S xC = 1 for all edges e

C\ni e

xC \in {0,1} for all alt. cycles C

DUAL

min S ye

e

S ye <= 1 for all alt. Cycles C

e\in C

ye \in R for all edges e

0

5 7 8 2 1 4 3 6 9

10

Pricing out the cycles for which y*(C) < 1

0

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

Split the graph in two copies

0

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

Connect twins

0

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles

0

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles

0

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles

0

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles

0

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles

.6

1

.4

0

5 7 8 2 1 4 3 6 9

10

0

0

5 7 8 2 1 4 3 6 9

10

.5

.2

The weight of the matching is the y*-weight of the cycles

.6

1

.4

0

5 7 8 2 1 4 3 6 9

10

100000

0

5 7 8 2 1 4 3 6 9

10

.5

.2

Forcing a cycle to use a certain node

- These cycles would not use the same node twice, but with simple trick is possible to model (OMISSIS)

BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999,2001)

BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)

KS can solve at most n=40. Take days for n=50

CLN can solve for n=200. Takes few seconds (say 5) for n=100

NP-hard problem practically solved to optimality!

- Genome evolve by random inversions
- It’s like a random walk on a huge graph with an edge for each permutation an edge for each inversion
- It is not clear why the shortest solution should be the one followed by Nature (in fact, often it isn’t)
- We want to find the most likely number of inversions that lead from (1 2 … n ) to p
- We use the expected number of breakpoints after k inversions as a way to guess the # of inversions

Given a p obtained by h random inversions from (1 … n ) we want to estimate h

The inversion distance is only a lower bound: h >= d(p) but the gap could be big

Let B(k)be the (r.v.) number of breakpoint after k random inversions from (1..n)

We estimate E[B(k)]. Then, faced with some p, we pick h such that E[B(h)] is as

close as possible to b(p) (maximum likelihood). CL ,2000, have shown:

E[B(k)] = ( n - 1 ) ( 1 - ())

k

n - 3

n - 1

Question: estimate E[D(k)], the (r.v.) inversion distance after k random inversions

Example: n = 200, k (u.a.r. in 1…n) inversions

k k’ d(p) b

88816

19191934

68676798

697368104

737973109

859183120

868583115

879084119

118117109138

184184135168

Protein Structure Alignments: the Maximum Contact Map Overlap Problem

A Protein is a complex molecule with a primary,

linear structure (a sequence of aminoacids) and a

3-Dimensional structure (the protein fold).

Protein STRUCTURE determines its FUNCTION

For instance, the Drug Design problem

calls for constructing peptides with a 3D

shape complementary to a protein, so as

to dock onto it.

Problem:

Align two 3D protein structures

Structure Alignment is Important for:

- Discovery of Protein Function (shape determines function)

- Search in 3D data bases

- Protein Classification and Evolutionary Studies

- ...

Contact Maps

Unfolded protein

Unfolded protein

Folded protein = contacts

Unfolded protein

Folded protein = contacts

Contact map = graph

Unfolded protein

Folded protein = contacts

Contact map = graph

OBJECTIVE:align 3d folds of proteins

= align contact maps

Contact Map Alignments

Protein 1

Protein 2

non-crossing map of residues in protein 1 and protein 2

The value of an alignment

The value of an alignment

The value of an alignment

Value = 3

The value of an alignment

Value = 3

We want to maximize the value

The value of an alignment

NP-Hard

e

0-1 VARIABLES

yef

yef for e and f contacts

f

e

0-1 VARIABLES

yef

yef for e and f contacts

e’

e

f

CONSTRAINTS

yef + ye’f’<= 1

f’

f

e

0-1 VARIABLES

yef

yef for e and f contacts

e’

e

f

CONSTRAINTS

yef + ye’f’<= 1

f’

f

max SeSf yef

OBJECTIVE

It’s just a huge max independent set problem in Gy:

- a node for each sharing
- an edge for each pair of incompatible sharings

e’

e

e’’

e’

f’

e

f

e’’

f’’

f’’

f

f’

It’s just a huge max independent set problem in Gy:

- a node for each sharing
- an edge for each pair of incompatible sharings

e’

e

e’’

e’

f’

e

f

e’’

f’’

f’’

f

f’

|Gy|=|E1|*|E2| (approximately 5000 for two proteins with 50 residues and 75 contacts each)

The best exact algorithm for independent set can solve for at most a few hundred

nodes

New variables x provide an easy check for the non-crossing conditions

e

NEW VARIABLES

i

xij for i and j residues

yef

xij

j

f

New variables x provide an easy check for the non-crossing conditions

e

NEW VARIABLES

i

xij for i and j residues

yef

xij

j

f

NEW CONSTRAINTS

i

i’

j’

j

xij + xi’j’ <= 1

New variables x provide an easy check for the non-crossing conditions

e

NEW VARIABLES

i

xij for i and j residues

yef

xij

j

f

NEW CONSTRAINTS

i

i’

p

i

q

j

j’

j

xij + xi’j’<= 1

y(ip)(jq) <= xij and y(ip)(jq)<= xpq

Variables x define a graph Gx:

- A node for each line
- An edge between each pair of crossing lines

i

i’

i’

j’

i

j

j’

j

Variables x define a graph Gx:

- A node for each line
- An edge between each pair of crossing lines

i

i’

i’

j’

i

j

j’

j

- Gx is much smaller than Gy
- Gx has nice proprieties (it’s a perfect graph)
- It’s easier to find large independent sets in Gx

Non-crossing constraints can be extended to

CLIQUE CONSTRAINTS

S xij<= 1

[i,j] in M

For all sets M of mutually incompatible (i.e. crossing) lines

All clique constraints satisfied (and Gx perfect) imply a strong bound!

1. Pick two subsets of same size

2. Connect them in a zig-zag fashion

3. Throw in all lines included in a zig or a zag

3. Throw in all lines included in a zig or a zag

The result is a maximal clique in Gx

Separation of Clique Inequalities

PROBLEM

There exist exponentially many such cliques (O(22n) inequalities).

We need to generate in polynomial time a clique inequality when needed,

i.e., when violated by the current LP solution x*

S x*ij> 1

[i,j] in M

THEOREM

We can find the most violated clique inequality in time O(n2)

PROOF (sketch)

1) Clique = zigzag path

PROOF (sketch)

1) Clique = zigzag path

1 2 3 4 5 6 7 8

PROOF (sketch)

2) Flip one graph: zigzag leftright

1) Clique = zigzag path

1 2 3 4 5 6 7 8

8 7 6 5 4 3 2 1

PROOF (sketch)

2) Flip one graph: zigzag leftright

1) Clique = zigzag path

1 2 3 4 5 6 7 8

8 7 6 5 4 3 2 1

3) Define a grid with lengths for arcs so that length(P) = x*(clique(P)). Use Dyn. Progr.

to find longest path in grid, time O(n^2)

2

n1

1

i

1

2

u

n2

Create n1 x n2 grid

Orient all edges and give weights

2

n1

1

i

1

2

x*iu

u

x*iu

n2

Create n1 x n2 grid

Orient all edges and give weights

B=(n1,1)

A=(1,n2)

Create n1 x n2 grid

Orient all edges and give weights

There is violated clique iff longest A,B path has length > 1

We show why polynomial separation is possible:

Gx is weakly triangulated (no chordless cycles >= 5 in Gx or Gx)

=> Gx is perfect (Hayward, 1985)

L1 and L3 don’t cross. Wlog RIGHT(L3, L1)

PROOF (Sketch, for Gx)

L1

L7

L2

L6

L3

L5

L4

L1 and L3 don’t cross. Wlog RIGHT(L3, L1)

L1

L7

L2

L6

L3

L1

L3

L5

L4

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L1

L3

L5

L4

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L4

L1

L3

L5

L4

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L4

L5

L1

L5

L4

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L6

L5

L1

L5

L4

We get LEFT(L1, {L3, L4, L5, L6})

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5 L6

A symmetric argument started at L6, with LEFT(L1, L6) implies LEFT(Li, L6) for i=2,3,4,5

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5 L6

A symmetric argument started at L6, with LEFT(L1, L6) implies LEFT(Li, L6) for i=2,3,4,5

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5 L6

L2, L3, L4 L5

Then {L3, L4, L5} are between L1 and L6

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5L6

L2, L3, L4 L5

Then {L3, L4, L5} are between L1 and L6

But L7 crosses L1 and L6, and so should cross them all !

L1

L7

L2

L6

L3

L6

L7

L1

L5

L4

L3, L4, L5L6

L2, L3, L4 L5

The approach just seen is due to Lancia, Carr, Istrail, Walenz (2001)

It can be applied to small or moderate proteins (up to 80 residues/150 contacts)

In 2002, a new approach, by Caprara and Lancia, based on LAGRANGIAN

RELAXATION. Approach borrowed from Quadratic Assignment. With new

approach we can solve important proteins (up to 150 residues/300 contacts)

What about Heuristics?E.g., genetic algorithms…

Population

at time t

Population

at time t+1

Recombination

operators

Evaluation

function

- A Population of candidate solutions thatevolve (improve) over time
- Recombination creates new candidate solutions viacrossover and mutation

- Crossover selects pieces from both parents and creates two offspring solutions

Offspring

Blue Parent

Red Parent

- Crossover selects pieces from both parents and creates two offspring solutions
- Select a set of edges in one parent to copy to the child

- Crossover selects pieces from both parents and creates two offspring solutions
- Select a set of edges in one parent to copy to the child

- Crossover selects pieces from both parents and creates two offspring solutions
- Select a set of edges in one parent to copy to the child
- Copy as many edges as possible from the other parent

These edges conflict with existing

edges and are not copied

- Crossover selects pieces from both parents and creates two offspring solutions
- Select a set of edges in one parent to copy to the child
- Copy as many edges as possible from the other parent

- Crossover selects pieces from both parents and creates two offspring solutions
- Select a set of edges in one parent to copy to the child
- Copy as many edges as possible from the other parent
- Add random edges to fill any remaining space

- Crossover selects pieces from both parents and creates two offspring solutions
- Select a set of edges in one parent to copy to the child
- Copy as many edges as possible from the other parent
- Add random edges to fill any remaining space

- Mutation introduces small changes to existing solutions by shifting edge endpoints

- Mutation introduces small changes to existing solutions by shifting edge endpoints
- Select a set of endpoints to shift

- Mutation introduces small changes to existing solutions by shifting edge endpoints
- Select a set of endpoints to shift

- Mutation introduces small changes to existing solutions by shifting edge endpoints
- Select a set of endpoints to shift

This edge “fell off” the

end of the contact map

and is removed

- Mutation introduces small changes to existing solutions by shifting edge endpoints
- Select a set of endpoints to shift
- Randomly add new edges

- Mutation introduces small changes to existing solutions by shifting edge endpoints
- Select a set of endpoints to shift
- Randomly add new edges

Computational Results

- 269 proteins
- 70 -100 residues
- 80 to 140 contacts

- Picked 10,000 pairs of proteins out of 36046 possible
- Took a weekend on PC
- 500 were solved to optimality
- 2500 had a gap <= 10 contacts

Skolnick Clustering Test

- Four Families
- Flavodoxin-like fold Che-Y related
- Plastocyanin
- TIM Barrel
- Ferratin

- alpha-beta
- 8 structures
- up to 124 residues
- 15-30% sequence similarity
- < 3Å RMSD

- Four Families
- Flavodoxin-like fold Che-Y related
- Plastocyanin
- TIM Barrel
- Ferratin

- beta
- 8 structures
- up to 99 residues
- 35-90% sequence similarity
- < 2Å RMSD

- Four Families
- Flavodoxin-like fold Che-Y related
- Plastocyanin
- TIM Barrel
- Ferratin

- alpha-beta
- 11 structures
- up to 250 residues
- 30-90% sequence similarity
- < 2Å RMSD

- Four Families
- Flavodoxin-like fold Che-Y related
- Plastocyanin
- TIM Barrel
- Ferratin

- alpha
- 6 structures
- up to 170 residues
- 7-70% sequence similarity
- < 4Å RMSD

- Four Families
- Flavodoxin-like fold Che-Y related
- Plastocyanin
- TIM Barrel
- Ferratin

Define score(P1, P2) as

# shared contacts

Min # of contacts of P1,P2

0 <=

<= 1

Put P1, P2 in same family if score(P1, P2) >= threshold

Define score(P1, P2) as

# shared contacts

Min # of contacts of P1,P2

0 <=

<= 1

Put P1, P2 in same family if score(P1, P2) >= threshold

If P1, P2 too big, use G.A. and local search to compute score

L.P. gives then bounds:

HEUR score <= OPT score <= LP bound

and we know how far off OPT we are

We got some known families from biologists, PDB.

Experiment: Take a family F of proteins and align them against each

other and against the remaining.

We got some known families from biologists, PDB.

Experiment: Take a family F of proteins and align them against each

other and against the remaining.

TYPICAL BEHAVIOUR

score proteins were…

0.05 MISMATCH

0.1 MISMATCH

0.15 MISMATCH

0.2 MISMATCH

0.25 MISMATCH

0.3 MISMATCH

0.35 MATCH

…… ……

1.0 MATCH

- Performance
- 528 alignments
- 1.3% false negative
- 0.0% false positive

Computed, for 1st time, provably optimal alignments for 150 pairs

(inter-family)

Used the CMO value to cluster: retrieves the clusters.

Set S(i,j) = 1 if CMO >= a, S(i,j) = 0 otherwise

Use TSP to find a block diagonal structure for S

?

?