The population haplotyping problem
Download
1 / 222

The Population Haplotyping problem - PowerPoint PPT Presentation


  • 49 Views
  • Uploaded on

The Population Haplotyping problem. NOTATION : each SNP only two values in a population ( bio ). Call them 0 and 1 . Also , call * the fact that a site is heterozygous. HAPLOTYPE : string over 0, 1 GENOTYPE : string over 0, 1, *

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' The Population Haplotyping problem' - luke-boyer


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
The population haplotyping problem

The PopulationHaplotyping problem


NOTATION: each SNP onlytwovalues in a population (bio).

Call them0and 1. Also, call *the factthat a site isheterozygous

HAPLOTYPE: string over 0, 1

GENOTYPE: string over 0, 1, *

where0={0}, 1={1}, *={0,1}

01

0*

00

10

1*

11

11

11

11

01

**

10

10

*0

00

10

10

10

*0

10

00


ALGEBRA OF HAPLOTYPES:

0 +

0 =

---

0

1 +

1 =

---

1

0 + 1 +

1 = 0 =

--- ---

* *

Heterozygous (ambiguous) sites

Homozygous sites

01

00

0*

10

1*

11

01

**

11

10

**

00

10

00

10

*0

10

10


Phasing the alleles

11101

10000

11100

10001

11001

10100

11000

10101

1**0*

For k heterozygous (ambiguous) sites, there are 2k-1 possible phasings


THE PHASING (or HAPLOTYPING) PROBLEM

It is too expensive to determine haplotypes directly

Much cheaper to determine genotypes, and then infer haplotypes in silico:

Given genotypes of k individuals, determine the phasings

of all heterozygous sites.

This yields a set H, of (at most) 2k haplotypes. H is a resolution of G.


The input is GENOTYPE data

*1**1

11**1

11011

****1

00011

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }


The input is GENOTYPE data

11011

01101

*1**1

11011

11101

11**1

11011

11011

11011

00011

11101

****1

00011

00011

00011

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }

OUTPUT: H = { 11011, 11101, 00011, 01101}

Each genotype is resolved by two haplotypes

We will define some objectives for H


OBJECTIVES

-without objectives/constraints, the

haplotyping problem would be

(mathematically)trivial

E.g., always put 0 above and 1 below

**0*1  00001

11011

1*0** 10000

11011

-the objectives/constraints must be

“driven by biology”


OBJECTIVES

1°) Clark’s inference rule

2°) Perfect Phylogeny

3°) DiseaseAssociation

4°) (parsimony): minimize |H|


1st

Obj: Clark’s rule


Inference Rule

for a compatible pair h , g

known haplotypeh

1011001011 +

?????????? =

1**1001*1*

known (ambiguos) genotype g


Inference Rule

for a compatible pair h , g

known haplotypeh

1011001011 +

1101001110 =

1**1001*1*

new (derived) haplotype h’

known (ambiguos) genotype g

We write h + h’ = g


1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while


1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic


1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**


1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

1100


1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

1100

SUCCESS

1111


1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**


1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

0100


1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000

1000

**00

11**

0100

FAILURE (can’t resolve 1122 )


1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G

3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic: the algorithm could end without explaining

all genotypes even if an explanation was possible.

The number of genotypes solved depends on order of application.

OBJ: find order of application rule that leaves the fewest elements in G


The problem was studied by Gusfield

(ISMB 2000, and Journal of Comp. Biol., 2001)

- problem is APX-hard

- it corresponds to finding largest forest in a graph with

haplotypes as nodes and arcs for possible derivations

-solved via ILP of exponential-size

(practical for small real instances)


2nd

Obj: Perfect Phylogeny


- Parsimony does not take into account

mutations/evolution of haplotypes

- parsimony is very relialable on “small”

haplotype blocks

- when haplotypes are large (span several

SNPs, we should consider evolutionionary

events and recombination)

- the cleanest model for evolution is the

perfect phylogeny


3rd objective is based on perfect phylogeny

- A phylogeny expalains set of binary features (e.g. flies, has fur…)

with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it


3rd objective is based on perfect phylogeny

- A phylogeny expalains set of binary features (e.g. flies, has fur…)

with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

has 2 legs

has tail

flies


But…a new species may come along so that no

Perfect phylogeny is possible…

- A phylogeny expalains set of binary features (e.g. flies, has fur…)

with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

has 2 legs

has tail

flies


flies

two legs

tail

Human 1 0 0

Mouse 0 1 0

Spider 0 0 0

Eagle 1 0 1

Theorem: such matrix has p.p. iff there is not a 00 4x2 minor

10

01

11


flies

two legs

tail

Human 1 0 0

Mouse 0 1 0

Spider 0 0 0

Eagle 1 0 1

Mickey mouse 1 1 0

Theorem: such matrix has p.p. iff there is not a 00 4x2 minor

10

01

11


We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)


We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 * 0

* 1 0 *

* 0 * 0


We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 0 0

0 1 1 0

1 1 0 1

0 1 0 0

1 0 0 0

0 0 1 0

0 1 * 0

* 1 0 *

* 0 * 0


We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 0 0

0 1 1 0

1 1 0 1

0 1 0 0

1 0 0 0

0 0 1 0

0 1 * 0

* 1 0 *

* 0 * 0

NOT a perfect phylogeny solution !


We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 * 0

0 1 0 *

0 0 0 *


We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently

along a tree)

0 1 0 0

0 1 1 0

0 1 0 0

1 1 0 1

0 0 0 0

0 0 0 1

0 1 * 0

0 1 0 *

0 0 0 *

A perfect phylogeny


Theorem: The Perfect Phylogeny Haplotyping problem is polynomial


Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Algorithms are of combinatorial nature

- There is a graph for which SNPs are columns and edges are of two types

(forced and free)

- forced edges connect pairs of SNPs that must be phased in the same way

** 00 + 11 or ** 01 + 10

- a complex visit of the graph decides how to phase free SNPs


3rd

Obj: Disease Association


Some diseases may be due to a gene which has “faulty” configurations

RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased

one must have both copies faulty. With one copy one is a carrier of the disease

DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be

diseased it is enough to have one faulty copy

Two individuals of which one is healthy and the other diseased

may have the same genotype.

The explanation of the disease lies in a difference in their haplotypes


11**1 configurations

*1**1

0*011

11**1

0**01

00011

INPUT: GD = {11**1,*1**1,0*011},

GH ={11**1,0**01,00011}


11011 configurations

01101

11011

11101

01011

00011

11**1

*1**1

0*011

11001

11111

01101

00001

00011

00011

11**1

0**01

00011

INPUT: GD = {11**1,*1**1,0*011},

GH ={11**1,0**01,00011}

OUTPUT: H = { 11011,01011,00001,11111,11101,00011,01101}

H contains HD, s.t. each diseased has >=1 haplotype in HD and each healty none


Theorem 1 is proved via a reduction from 3 SAT configurations

Theorem 2 has a mathematical proof (coloring argument) with

little relation to biology:

There is R (depending on input) s.t. a haplotype is healthy

if the sum of its bits is congruent to R modulo 3

This means the model must be refined!


4th configurations

Obj: MaxParsimony

 See separate slides…


Summary: configurations

- haplotyping in-silico needed for economical reasons

- several objectives, all biologically driven

- nice combinatorial problems

(mostly due to binary nature of SNPs)

- these problems are technology-dependant and may

become obsolete (hopefully after we have retired)


111 configurations

111

011

101

011

000

111

111

010

001

010

011

022

222

012

221

222

011

012

022

022

211

111

012


2nd Objective (parsimony) configurations :

minimize |H|


1. The problem is APX-Hard configurations

Reduction from VERTEX-COVER


B configurations

A

C

D

E


A B C D E * configurations

B

A

C

D

E


A B C D E * configurations

AB

BC

AE

DE

AD

B

A

C

D

E


A B C D E * configurations

AB

BC

AE

DE

AD

A

B

C

D

E

B

A

C

D

E


A B C D E * configurations

AB 2 2

BC 2 2

AE 22

DE 2 2

AD 22

A

B

C

D

E

B

A

C

D

E


A B C D E * configurations

AB 2 2

BC 2 2

AE 22

DE 2 2

AD 22

A 0

B 0

C 0

D 0

E 0

B

A

C

D

E


A B C D E * configurations

AB 2 2 2

BC 2 2 2

AE 222

DE 2 22

AD 222

A 00

B 00

C 00

D 00

E 00

B

A

C

D

E


A B C D E * configurations

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

B

A

C

D

E


A B C D E * configurations

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

B

A

C

D

E

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that

explain all genotypes


A B C D E * configurations

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

B

A

C

D

E

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that

explain all genotypes


A B C D E * configurations

AB 2 2 1 1 1 2

BC 1 2 2 1 1 2

AE 2 1 1 1 22

DE 1 1 1 2 22

AD 2 1 1 2 1 2

A 0 1 1 1 1 0

B 1 0 1 1 1 0

C 1 1 0 1 1 0

D 1 1 1 0 1 0

E 1 1 1 1 00

A’ 0 1 1 1 1 1

B’ 1 0 1 1 1 1

E’ 1 1 1 1 0 1

B

A

C

D

E

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that

explain all genotypes


A basic ILP formulation configurations


A basic ILP formulation configurations

Expand your input G in all possible ways

220

022

120


A basic ILP formulation configurations

Expand your input G in all possible ways

220

022

120

010 + 100, 000 + 110

000 + 011, 001 + 010

100 + 110


A basic ILP formulation configurations

Expand your input G in all possible ways

220

022

120

010 + 100, 000 + 110

000 + 011, 001 + 010

100 + 110




The input is formulations GENOTYPE data

11011

01101

*1**1

11011

11101

11**1

11011

11011

11O11

00011

11101

****1

00011

00011

OOO11

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011}

OUTPUT: H = { 11011, 11101, 00011, 01101}

Each genotype is explained by two haplotypes

We will define some objectives for H


1st Objective formulations (open research problem):

minimize |H|

2nd Objective based on inference rule:


1st Objective (parsimony) formulations :

minimize |H|

An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence, we need at least LB = SQRT(n) haplotypes.

BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a

solution of <= 2n haplotypes, i.e. <= SQRT(n) * LB

It’s difficult, but not impossible, to come up with better approximations, like constants

(Lancia, Pinotti, Rizzi ’02)


2nd Objective formulations based on inference rule:


Inference Rule formulations

known haplotypeh

xoxxooxoxx +

********** =

x??xoox?x?

known (ambiguos) genotype g


Inference Rule formulations

known haplotypeh

xoxxooxoxx +

xxoxooxxxo =

x??xoox?x?

new (derived) haplotype h’

known (ambiguos) genotype g


Inference Rule formulations

known haplotypeh

xoxxooxoxx +

xxoxooxxxo =

x??xoox?x?

new (derived) haplotype h’

known (ambiguos) genotype g

We write h + h’ = g

g and h must be compatible to derive h’


2nd Objective (Clark, 1990) formulations

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while


2nd Objective (Clark, 1990) formulations

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic


2nd Objective (Clark, 1990) formulations

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??


2nd Objective (Clark, 1990) formulations

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

xxoo


2nd Objective (Clark, 1990) formulations

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

xxoo

SUCCESS

xxxx


2nd Objective (Clark, 1990) formulations

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??


2nd Objective (Clark, 1990) formulations

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

oxoo


2nd Objective (Clark, 1990) formulations

1. Start with H = nonambiguos genotypes

2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

oooo

xooo

??oo

xx??

oxoo

FAILURE (can’t resolve xx?? )

OBJ: find order of application rule that leaves the fewest elements in G


- Problem is APX-hard (Gusfield,00) formulations

- Graph-Model + Integer Programming for practical solution (G.,01)


- Problem is APX-hard (Gusfield,00) formulations

- Graph-Model + Integer Programming for practical solution (G.,01)

x??o?

1. expand

genotypes


- Problem is APX-hard (Gusfield,00) formulations

- Graph-Model + Integer Programming for practical solution (G.,01)

xoooo

xooox

xoxoo

x??o?

xoxox

xxooo

1. expand

genotypes

xxoox

xxxoo

xxxox


- Problem is APX-hard (Gusfield,00) formulations

- Graph-Model + Integer Programming for practical solution (G.,01)

xoooo

xooox

xoxoo

2. create (h, h’) if exists g s.t. h’ can be

derived from g and h

x??o?

xoxox

xxooo

1. expand

genotypes

xxoox

3. Largest number of nodes in forest

rooted at unambiguos genotpes =

= largest number of ambiguous genotypes

resolved

xxxoo

xxxox

Hence, find largest number of nodes in forest rooted at unambiguos genotpes. Use I.P. model with vars x(ij).

This reduction is exponential. Is there a better practical approach?


3rd Objective formulations (open research problem)

Disease Detection:

?x??x

xx??x

??oxx

????x

oooxx

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }


3rd Objective formulations (open research problem)

Disease Detection:

xxoxx

oxxox

?x??x

xxoxx

xxxox

xx??x

xxoxx

oooxx

??oxx

oooxx

xxxox

????x

oooxx

oooxx

oooxx

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}

H contains H’, s.t. each diseased has one haplotype in H’ and each healty none

minimize | H|



Each species has a genome (organized in pairs of chromosomes)

tcgtgatggat………………ttgatggattga

tcgattatggat………………ttttgatatcca

Genomes evolve by means of

  • Insertions

  • Deletions

  • Inversions

  • Transpositions

  • Translocations

of DNA regions


deletion chromosomes)


insertion chromosomes)

deletion


insertion chromosomes)

deletion

translocation


insertion chromosomes)

deletion

translocation

inversion


insertion chromosomes)

deletion

transposition

translocation

inversion


Combinatorial problem chromosomes): given 2 permutations P, Q and operators in a set F find a

shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions)

(…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions,

transpositions…)

We focus on inversions, that are the most important in Nature

5 6 4 8 3 2 1 9 7

Example:

1 2 3 8 4 6 5 9 7

1 2 3 8 4 5 6 9 7

1 2 3 6 5 4 8 9 7

1 2 3 6 5 4 8 7 9

1 2 3 4 5 6 8 7 9

1 2 3 4 5 6 7 8 9


Combinatorial problem chromosomes): given 2 permutations P, Q and operators in a set F find a

shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions)

(…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions,

transposition…)

We focus on inversions, that are the most important in Nature

+5 +6 -4 -8 -3 -2 -1 -9 +7

Example:

+1 +2 +3 +8 +4 -6 -5 -9 +7

+1 +2 +3 +8 +4 +5 +6 -9 +7

+1 +2 +3 -6 -5 -4 -8 -9 +7

+1 +2 +3 -6 -5 -4 -8 -7 +9

+1 +2 +3 +4 +5 +6 -8 -7 +9

+1 +2 +3 +4 +5 +6 +7 +8 +9

There is also a SIGNED VERSION of the problem !


(Unsigned) Sorting by Inversions is chromosomes)NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautifultheory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown


(Unsigned) Sorting by Inversions is chromosomes)NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautifultheory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

The concept of breakpoint

Breakpoint at position i if | p(i) - p(i+1) | > 1

0

5 7 8 2 1 4 3 6 9

10


(Unsigned) Sorting by Inversions is chromosomes)NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautifultheory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

The concept of breakpoint

Breakpoint at position i if | p(i) - p(i+1) | > 1

0

5 7 8 2 1 4 3 6 9

10

d(p) = inversion distance

b(p) = # breakpoints

TRIVIAL BOUND: d(p) >= b(p) / 2

Example: d(p) >= 6 / 2 = 3


The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10


The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10


The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10


The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10


The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Each node has degree...

10

4

6

0 2 or 4 …

hence the graph can be decomposed in cycles!


The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Alternating cycle decomposition


The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Alternating cycle decomposition


The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

Alternating cycle decomposition

c(p) = max # cycles in alternating decomposition

VERY STRONG BOUND : d (p) >= b(p) - c(p)

Example: c(p)= 2 and d (p) >= 6 - 2 = 4


The chromosomes)Breakpoint Graph

0

5 7 8 2 1 4 3 6 9

10

The best algorithm for this problem is based on an Integer Programming

formulation of the max cycle decomposition

A variable xC for each cycle (exponential # of vars…)

A constraint SxC = 1 for each edge e

C containing e

Objective: maximize SCxC


PRIMAL chromosomes)

max S xC

C

S xC = 1 for all edges e

C\ni e

xC \in {0,1} for all alt. cycles C

DUAL

min S ye

e

S ye <= 1 for all alt. Cycles C

e\in C

ye \in R for all edges e


PRIMAL chromosomes)

max S xC

C

S xC = 1 for all edges e

C\ni e

xC \in {0,1} for all alt. cycles C

DUAL

min S ye

e

S ye <= 1 for all alt. Cycles C

e\in C

ye \in R for all edges e


0 chromosomes)

5 7 8 2 1 4 3 6 9

10

Pricing out the cycles for which y*(C) < 1


0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

Split the graph in two copies


0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

Connect twins


0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles


0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles


0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles


0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles


0 chromosomes)

5 7 8 2 1 4 3 6 9

10

0

5 7 8 2 1 4 3 6 9

10

A perfect matching corresponds to (a set of) alternating cycles


.6 chromosomes)

1

.4

0

5 7 8 2 1 4 3 6 9

10

0

0

5 7 8 2 1 4 3 6 9

10

.5

.2

The weight of the matching is the y*-weight of the cycles


.6 chromosomes)

1

.4

0

5 7 8 2 1 4 3 6 9

10

100000

0

5 7 8 2 1 4 3 6 9

10

.5

.2

Forcing a cycle to use a certain node


- These cycles would not use the same node twice, but with simple trick is possible to model (OMISSIS)

BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999,2001)

BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)

KS can solve at most n=40. Take days for n=50

CLN can solve for n=200. Takes few seconds (say 5) for n=100

NP-hard problem practically solved to optimality!


Statistical view of evolution
Statistical view of evolution simple trick is possible to model (OMISSIS)

  • Genome evolve by random inversions

  • It’s like a random walk on a huge graph with an edge for each permutation an edge for each inversion

  • It is not clear why the shortest solution should be the one followed by Nature (in fact, often it isn’t)

  • We want to find the most likely number of inversions that lead from (1 2 … n ) to p

  • We use the expected number of breakpoints after k inversions as a way to guess the # of inversions


Given a simple trick is possible to model (OMISSIS) p obtained by h random inversions from (1 … n ) we want to estimate h

The inversion distance is only a lower bound: h >= d(p) but the gap could be big

Let B(k)be the (r.v.) number of breakpoint after k random inversions from (1..n)

We estimate E[B(k)]. Then, faced with some p, we pick h such that E[B(h)] is as

close as possible to b(p) (maximum likelihood). CL ,2000, have shown:

E[B(k)] = ( n - 1 ) ( 1 - ())

k

n - 3

n - 1

Question: estimate E[D(k)], the (r.v.) inversion distance after k random inversions


Example: n = 200, k (u.a.r. in 1…n) inversions simple trick is possible to model (OMISSIS)

k k’ d(p) b

8 8 8 16

19 19 19 34

68 67 67 98

69 73 68 104

73 79 73 109

85 91 83 120

86 85 83 115

87 90 84 119

118 117 109 138

184 184 135 168



A Overlap ProblemProtein is a complex molecule with a primary,

linear structure (a sequence of aminoacids) and a

3-Dimensional structure (the protein fold).

Protein STRUCTURE determines its FUNCTION

For instance, the Drug Design problem

calls for constructing peptides with a 3D

shape complementary to a protein, so as

to dock onto it.


Motivation

Problem: Overlap Problem

Motivation:

Align two 3D protein structures

Structure Alignment is Important for:

- Discovery of Protein Function (shape determines function)

- Search in 3D data bases

- Protein Classification and Evolutionary Studies

- ...


Contact maps

Contact Maps Overlap Problem


Contact maps1
CONTACT MAPS Overlap Problem

Unfolded protein


Contact maps2
CONTACT MAPS Overlap Problem

Unfolded protein

Folded protein = contacts


Contact maps3
CONTACT MAPS Overlap Problem

Unfolded protein

Folded protein = contacts

Contact map = graph


Contact maps4
CONTACT MAPS Overlap Problem

Unfolded protein

Folded protein = contacts

Contact map = graph

OBJECTIVE:align 3d folds of proteins

= align contact maps


Contact map alignments

Contact Map Alignments Overlap Problem


Non crossing alignments
Non-crossing Alignments Overlap Problem

Protein 1

Protein 2

non-crossing map of residues in protein 1 and protein 2


The value of an alignment
The value of an alignment Overlap Problem


The value of an alignment Overlap Problem


The value of an alignment Overlap Problem


The value of an alignment Overlap Problem

Value = 3


The value of an alignment Overlap Problem

Value = 3

We want to maximize the value


The value of an alignment Overlap Problem

NP-Hard



Integer programming formulation1
Integer Programming Formulation Overlap Problem

e

0-1 VARIABLES

yef

yef for e and f contacts

f


Integer programming formulation2
Integer Programming Formulation Overlap Problem

e

0-1 VARIABLES

yef

yef for e and f contacts

e’

e

f

CONSTRAINTS

yef + ye’f’<= 1

f’

f


Integer programming formulation3
Integer Programming Formulation Overlap Problem

e

0-1 VARIABLES

yef

yef for e and f contacts

e’

e

f

CONSTRAINTS

yef + ye’f’<= 1

f’

f

max SeSf yef

OBJECTIVE


Independent set problem
Independent Set Problem Overlap Problem

It’s just a huge max independent set problem in Gy:

  • a node for each sharing

  • an edge for each pair of incompatible sharings

e’

e

e’’

e’

f’

e

f

e’’

f’’

f’’

f

f’


Independent set problem1
Independent Set Problem Overlap Problem

It’s just a huge max independent set problem in Gy:

  • a node for each sharing

  • an edge for each pair of incompatible sharings

e’

e

e’’

e’

f’

e

f

e’’

f’’

f’’

f

f’

|Gy|=|E1|*|E2| (approximately 5000 for two proteins with 50 residues and 75 contacts each)

The best exact algorithm for independent set can solve for at most a few hundred

nodes


Node to node variables
Node to Node Variables Overlap Problem

New variables x provide an easy check for the non-crossing conditions

e

NEW VARIABLES

i

xij for i and j residues

yef

xij

j

f


Node to node variables1
Node to Node Variables Overlap Problem

New variables x provide an easy check for the non-crossing conditions

e

NEW VARIABLES

i

xij for i and j residues

yef

xij

j

f

NEW CONSTRAINTS

i

i’

j’

j

xij + xi’j’ <= 1


Node to node variables2
Node to Node Variables Overlap Problem

New variables x provide an easy check for the non-crossing conditions

e

NEW VARIABLES

i

xij for i and j residues

yef

xij

j

f

NEW CONSTRAINTS

i

i’

p

i

q

j

j’

j

xij + xi’j’<= 1

y(ip)(jq) <= xij and y(ip)(jq)<= xpq


Clique constraints
Clique Constraints Overlap Problem

Variables x define a graph Gx:

  • A node for each line

  • An edge between each pair of crossing lines

i

i’

i’

j’

i

j

j’

j


Clique constraints1
Clique Constraints Overlap Problem

Variables x define a graph Gx:

  • A node for each line

  • An edge between each pair of crossing lines

i

i’

i’

j’

i

j

j’

j

  • Gx is much smaller than Gy

  • Gx has nice proprieties (it’s a perfect graph)

  • It’s easier to find large independent sets in Gx


Clique constraints2
Clique Constraints Overlap Problem

Non-crossing constraints can be extended to

CLIQUE CONSTRAINTS

S xij<= 1

[i,j] in M

For all sets M of mutually incompatible (i.e. crossing) lines

All clique constraints satisfied (and Gx perfect) imply a strong bound!


Structure of maximal cliques in g x
Structure of Maximal cliques in G Overlap Problemx

1. Pick two subsets of same size


Structure of maximal cliques in g x1
Structure of Maximal cliques in G Overlap Problemx

2. Connect them in a zig-zag fashion











Structure of maximal cliques in g x11
Structure of Maximal cliques in G Overlap Problemx

3. Throw in all lines included in a zig or a zag


Structure of maximal cliques in g x12
Structure of Maximal cliques in G Overlap Problemx

3. Throw in all lines included in a zig or a zag


Structure of maximal cliques in g x13
Structure of Maximal cliques in G Overlap Problemx

The result is a maximal clique in Gx



Separation of clique inequalities1
Separation of Clique Inequalities Overlap Problem

PROBLEM

There exist exponentially many such cliques (O(22n) inequalities).

We need to generate in polynomial time a clique inequality when needed,

i.e., when violated by the current LP solution x*

S x*ij> 1

[i,j] in M

THEOREM

We can find the most violated clique inequality in time O(n2)


Separation of clique inequalities2
Separation of Clique Inequalities Overlap Problem

PROOF (sketch)

1) Clique = zigzag path


Separation of clique inequalities3
Separation of Clique Inequalities Overlap Problem

PROOF (sketch)

1) Clique = zigzag path

1 2 3 4 5 6 7 8


Separation of clique inequalities4
Separation of Clique Inequalities Overlap Problem

PROOF (sketch)

2) Flip one graph: zigzag leftright

1) Clique = zigzag path

1 2 3 4 5 6 7 8

8 7 6 5 4 3 2 1


Separation of clique inequalities5
Separation of Clique Inequalities Overlap Problem

PROOF (sketch)

2) Flip one graph: zigzag leftright

1) Clique = zigzag path

1 2 3 4 5 6 7 8

8 7 6 5 4 3 2 1

3) Define a grid with lengths for arcs so that length(P) = x*(clique(P)). Use Dyn. Progr.

to find longest path in grid, time O(n^2)


Separation of cliques
Separation of cliques Overlap Problem

2

n1

1

i

1

2

u

n2

Create n1 x n2 grid

Orient all edges and give weights


Separation of cliques1
Separation of cliques Overlap Problem

2

n1

1

i

1

2

x*iu

u

x*iu

n2

Create n1 x n2 grid

Orient all edges and give weights


Separation of cliques2
Separation of cliques Overlap Problem

B=(n1,1)

A=(1,n2)

Create n1 x n2 grid

Orient all edges and give weights

There is violated clique iff longest A,B path has length > 1


Gx is a perfect graph
Gx is a Perfect Graph Overlap Problem

We show why polynomial separation is possible:

Gx is weakly triangulated (no chordless cycles >= 5 in Gx or Gx)

=> Gx is perfect (Hayward, 1985)


Gx is a perfect graph1
Gx is a Perfect Graph Overlap Problem

L1 and L3 don’t cross. Wlog RIGHT(L3, L1)

PROOF (Sketch, for Gx)

L1

L7

L2

L6

L3

L5

L4


Gx is a perfect graph2
Gx is a Perfect Graph Overlap Problem

L1 and L3 don’t cross. Wlog RIGHT(L3, L1)

L1

L7

L2

L6

L3

L1

L3

L5

L4


Gx is a perfect graph3
Gx is a Perfect Graph Overlap Problem

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L1

L3

L5

L4


Gx is a perfect graph4
Gx is a Perfect Graph Overlap Problem

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L4

L1

L3

L5

L4


Gx is a perfect graph5
Gx is a Perfect Graph Overlap Problem

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L4

L5

L1

L5

L4


Gx is a perfect graph6
Gx is a Perfect Graph Overlap Problem

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

L1

L7

L2

L6

L3

L6

L5

L1

L5

L4


Gx is a perfect graph7
Gx is a Perfect Graph Overlap Problem

We get LEFT(L1, {L3, L4, L5, L6})

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5 L6


Gx is a perfect graph8
Gx is a Perfect Graph Overlap Problem

A symmetric argument started at L6, with LEFT(L1, L6) implies LEFT(Li, L6) for i=2,3,4,5

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5 L6


Gx is a perfect graph9
Gx is a Perfect Graph Overlap Problem

A symmetric argument started at L6, with LEFT(L1, L6) implies LEFT(Li, L6) for i=2,3,4,5

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5 L6

L2, L3, L4 L5


Gx is a perfect graph10
Gx is a Perfect Graph Overlap Problem

Then {L3, L4, L5} are between L1 and L6

L1

L7

L2

L6

L3

L6

L1

L5

L4

L3, L4, L5L6

L2, L3, L4 L5


Gx is a perfect graph11
Gx is a Perfect Graph Overlap Problem

Then {L3, L4, L5} are between L1 and L6

But L7 crosses L1 and L6, and so should cross them all !

L1

L7

L2

L6

L3

L6

L7

L1

L5

L4

L3, L4, L5L6

L2, L3, L4 L5


The approach just seen is due to Lancia, Carr, Istrail, Walenz (2001)

It can be applied to small or moderate proteins (up to 80 residues/150 contacts)

In 2002, a new approach, by Caprara and Lancia, based on LAGRANGIAN

RELAXATION. Approach borrowed from Quadratic Assignment. With new

approach we can solve important proteins (up to 150 residues/300 contacts)


What about heuristics e g genetic algorithms

What about Heuristics? Walenz (2001)E.g., genetic algorithms…


Genetic algorithm overview

Population Walenz (2001)

at time t

Population

at time t+1

Recombination

operators

Evaluation

function

Genetic Algorithm Overview

  • A Population of candidate solutions thatevolve (improve) over time

  • Recombination creates new candidate solutions viacrossover and mutation


Crossover
Crossover Walenz (2001)

  • Crossover selects pieces from both parents and creates two offspring solutions

Offspring

Blue Parent

Red Parent


Crossover1
Crossover Walenz (2001)

  • Crossover selects pieces from both parents and creates two offspring solutions

    • Select a set of edges in one parent to copy to the child


Crossover2
Crossover Walenz (2001)

  • Crossover selects pieces from both parents and creates two offspring solutions

    • Select a set of edges in one parent to copy to the child


Crossover3
Crossover Walenz (2001)

  • Crossover selects pieces from both parents and creates two offspring solutions

    • Select a set of edges in one parent to copy to the child

    • Copy as many edges as possible from the other parent


Crossover4

These edges conflict with existing Walenz (2001)

edges and are not copied

Crossover

  • Crossover selects pieces from both parents and creates two offspring solutions

    • Select a set of edges in one parent to copy to the child

    • Copy as many edges as possible from the other parent


Crossover5
Crossover Walenz (2001)

  • Crossover selects pieces from both parents and creates two offspring solutions

    • Select a set of edges in one parent to copy to the child

    • Copy as many edges as possible from the other parent

    • Add random edges to fill any remaining space


Crossover6
Crossover Walenz (2001)

  • Crossover selects pieces from both parents and creates two offspring solutions

    • Select a set of edges in one parent to copy to the child

    • Copy as many edges as possible from the other parent

    • Add random edges to fill any remaining space


Mutation
Mutation Walenz (2001)

  • Mutation introduces small changes to existing solutions by shifting edge endpoints


Mutation1
Mutation Walenz (2001)

  • Mutation introduces small changes to existing solutions by shifting edge endpoints

    • Select a set of endpoints to shift


Mutation2
Mutation Walenz (2001)

  • Mutation introduces small changes to existing solutions by shifting edge endpoints

    • Select a set of endpoints to shift


Mutation3
Mutation Walenz (2001)

  • Mutation introduces small changes to existing solutions by shifting edge endpoints

    • Select a set of endpoints to shift

This edge “fell off” the

end of the contact map

and is removed


Mutation4
Mutation Walenz (2001)

  • Mutation introduces small changes to existing solutions by shifting edge endpoints

    • Select a set of endpoints to shift

    • Randomly add new edges


Mutation5
Mutation Walenz (2001)

  • Mutation introduces small changes to existing solutions by shifting edge endpoints

    • Select a set of endpoints to shift

    • Randomly add new edges


Computational results

Computational Results Walenz (2001)


Computational results1
Computational Results Walenz (2001)

  • 269 proteins

    • 70 -100 residues

    • 80 to 140 contacts

  • Picked 10,000 pairs of proteins out of 36046 possible

  • Took a weekend on PC

  • 500 were solved to optimality

  • 2500 had a gap <= 10 contacts



Skolnick results
Skolnick Results Walenz (2001)

  • Four Families

    • Flavodoxin-like fold Che-Y related

    • Plastocyanin

    • TIM Barrel

    • Ferratin

  • alpha-beta

  • 8 structures

  • up to 124 residues

  • 15-30% sequence similarity

  • < 3Å RMSD


Skolnick results1
Skolnick Results Walenz (2001)

  • Four Families

    • Flavodoxin-like fold Che-Y related

    • Plastocyanin

    • TIM Barrel

    • Ferratin

  • beta

  • 8 structures

  • up to 99 residues

  • 35-90% sequence similarity

  • < 2Å RMSD


Skolnick results2
Skolnick Results Walenz (2001)

  • Four Families

    • Flavodoxin-like fold Che-Y related

    • Plastocyanin

    • TIM Barrel

    • Ferratin

  • alpha-beta

  • 11 structures

  • up to 250 residues

  • 30-90% sequence similarity

  • < 2Å RMSD


Skolnick results3
Skolnick Results Walenz (2001)

  • Four Families

    • Flavodoxin-like fold Che-Y related

    • Plastocyanin

    • TIM Barrel

    • Ferratin

  • alpha

  • 6 structures

  • up to 170 residues

  • 7-70% sequence similarity

  • < 4Å RMSD


Skolnick results4
Skolnick Results Walenz (2001)

  • Four Families

    • Flavodoxin-like fold Che-Y related

    • Plastocyanin

    • TIM Barrel

    • Ferratin


Clustering
Clustering Walenz (2001)

Define score(P1, P2) as

# shared contacts

Min # of contacts of P1,P2

0 <=

<= 1

Put P1, P2 in same family if score(P1, P2) >= threshold


Clustering1
Clustering Walenz (2001)

Define score(P1, P2) as

# shared contacts

Min # of contacts of P1,P2

0 <=

<= 1

Put P1, P2 in same family if score(P1, P2) >= threshold

If P1, P2 too big, use G.A. and local search to compute score

L.P. gives then bounds:

HEUR score <= OPT score <= LP bound

and we know how far off OPT we are


Clustering validation
Clustering validation Walenz (2001)

We got some known families from biologists, PDB.

Experiment: Take a family F of proteins and align them against each

other and against the remaining.


Clustering validation1
Clustering validation Walenz (2001)

We got some known families from biologists, PDB.

Experiment: Take a family F of proteins and align them against each

other and against the remaining.

TYPICAL BEHAVIOUR

score proteins were…

0.05 MISMATCH

0.1 MISMATCH

0.15 MISMATCH

0.2 MISMATCH

0.25 MISMATCH

0.3 MISMATCH

0.35 MATCH

…… ……

1.0 MATCH


Skolnick results5
Skolnick Results Walenz (2001)

  • Performance

    • 528 alignments

    • 1.3% false negative

    • 0.0% false positive


Clustering2
Clustering Walenz (2001)

Computed, for 1st time, provably optimal alignments for 150 pairs

(inter-family)

Used the CMO value to cluster: retrieves the clusters.

Set S(i,j) = 1 if CMO >= a, S(i,j) = 0 otherwise

Use TSP to find a block diagonal structure for S


Clustering3
Clustering Walenz (2001)


Last open problem
Last Open Problem Walenz (2001)

?

?


ad