
### Optimization Problems for Polymorphisms of Single Nucleotides

### The Single-Individual Haplotyping problem

Polymorphisms

A polymorphism is a feature

- common to everybody

- not identical in everybody

- the possible variants (alleles) are just a few

E.g. think of eye color, or of blood type for a feature not visible from outside

At the DNA level, a polymorphism is a sequence of nucleotides that varies in a population.

The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP)

atcggcttagttagggcacaggacgtac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacgtac

atcggcttagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

- SNPs are the predominant form of human variation

- On average, one every 1,000 bases

- Used for drug design, disease studies, forensics, evolutionary studies...


- Multimillion-dollar SNP Consortium project

- 1st step: build maps of several thousand SNPs

- Goal: associate SNPs (or groups of SNPs) with genetic diseases

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPE: chromosome content at SNP sites

ct

cg

ag

at

at

at

ct

ag

ag

cg

ag

ag

ag

cg


GENOTYPE: “union” of 2 haplotypes

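(one individual per line: the two haplotypes, then the genotype; {x,y} marks a heterozygous site)

ct, cg → c {t,g}

ag, at → a {g,t}

at, at → a t

ct, ag → {c,a} {t,g}

ag, cg → {a,c} g

ag, ag → a g

ag, cg → {a,c} g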

CHANGE OF SYMBOLS: each SNP takes only two values in a population (a biological fact).

Call them 1 and O. Also, let * denote the fact that a site is heterozygous

HAPLOTYPE: string over 1,O

GENOTYPE: string over 1,O,*


(one individual per line: the two haplotypes, then the genotype)

O1, OO → O*

1O, 11 → 1*

11, 11 → 11

O1, 1O → **

1O, OO → *O

1O, 1O → 1O

1O, OO → *O
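The symbol change above can be sketched in a few lines of Python. This is only an illustration: the function name `genotype` is hypothetical, and the digit `0` stands in for the slides' `O`.

```python
def genotype(h1: str, h2: str) -> str:
    """Combine two haplotypes over {0,1} into a genotype over {0,1,*}.

    A site where both haplotypes agree is homozygous and keeps its value;
    a site where they differ is heterozygous and is written '*'.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    return "".join(a if a == b else "*" for a, b in zip(h1, h2))


# Haplotype pairs of four individuals (0 is used in place of the slides' O).
pairs = [("01", "00"), ("10", "11"), ("11", "11"), ("01", "10")]
for h1, h2 in pairs:
    print(h1, h2, "->", genotype(h1, h2))
```

Note that the map is not invertible: a genotype with two or more `*` sites can be explained by several haplotype pairs, which is the ambiguity the population problem has to resolve.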

Single Individual: Given genomic data of one individual, determine 2 haplotypes (one per chromosome)

Population: Given genomic data of k individuals, determine (at most) 2k haplotypes (one per chromosome per individual)

For the individual problem, the input is erroneous haplotype data, from sequencing

For the population problem, the input is ambiguous genotype data, from screening

The objective is led by Occam’s razor: find a minimum explanation of the observed data under the given hypothesis (a.k.a. the parsimony principle)

Theory and Results

Single individual

- Polynomial algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02)

- Polynomial algorithms for bounded-length gapped haplotyping (BLIR 02)

- NP-hardness for general gapped haplotyping (LBILS 01)

Population

- APX-hardness (Gusfield 00)

- Reduction to a graph-theoretic model and I.P. approach (Gusfield 01)

- New formulations and disease detection (L, Ravi, Rizzi 02)

- Exact algorithms for min-size solution (L, Serafini 2011)

- Heuristics (Tininini, L, Bertolazzi 2010)

Shotgun Assembly of a Chromosome

[Weber and Myers, 1997]

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

fragmentation

ACTGA GATTT GCCTAG CTATCTT

ATAGATA GAGATTTC TAGAAATC TGAGCCTAG

TAGAGATTTC TCCTAAAGAT CGCATAGATA

sequencing

TGAGCCTAG GATTT GCCTAG CTATCTT

ATAGATA GAGATTTCTAGAAATC ACTGA

TAGAGATTTC TCCTAAAGAT CGCATAGATA

assembly

ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT

ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT

ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT

ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT

MAIN ERROR SOURCES

-Sequencing errors:

ACTGCCTGGCCAATGGAACGGACAAG

CTGGCCAAT

CATTGGAAC

AATGGAACGGA

-Contaminants

Given errors, the data may be inconsistent with exactly 2 haplotypes

Hence, the assembler is unable to build 2 chromosomes

PROBLEM: Find and remove the errors so that the data becomes consistent with exactly 2 haplotypes

ACTGAAAGCGA ACTAGAGACAGCATG

ACTGATAGC GTAGAGTCA

ACTG TCGACTAGA CATG

ACTGA CGATCCATCG TCAGC

ACTGAAA ATCGATC AGCATG

ACTGAAAGCGAACTAGAGACAGCATG

ACTGATAGCGTAGAGTCA

ACTGTCGACTAGACATG

ACTGACGATCCATCGTCAGC

ACTGAAAATCGATCAGCATG

11O

OO1

1

11

1 O

SNPs 1,..,n

1 2 3 4 5 6 7 8 9

1 - - - O 1 1 O O -

2 - O - O 1 - - - 1

3 1 1 O 1 1 - - - -

4 O O 1 - - - - O -

5 - - - - - - - 1 O

6 - - - - O O O 1 -

Fragments 1,..,m


Fragment conflict: can’t be on same haplotype


Fragment Conflict Graph GF(M)

[Figure: GF(M), drawn on nodes 1 to 6, one per fragment]

We have 2 haplotypes iff GF is BIPARTITE


PROBLEM (Fragment Removal): make GF Bipartite


1 2 3 4 5 6 7 8 9

1 - - - O 1 1 O O -

2 - O - O 1 - - - 1

4 O O 1 - - - - O -

Resulting haplotype: O O 1 O 1 1 O O 1

3 1 1 O 1 1 - - - -

5 - - - - - - - 1 O

Resulting haplotype: 1 1 O 1 1 - - 1 O

[Figure: GF(M) on nodes 1 to 6]

Removing the fewest fragments is equivalent to finding a maximum induced bipartite subgraph

NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]

O(|V| (log log |V| / log |V|)^2)-approximable [Halldórsson, 1999]

not O(|V|^ε)-approximable for some ε > 0 [Lund and Yannakakis, 1993]
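A minimal Python sketch of the fragment conflict graph and of fragment removal on the 6-fragment example matrix. The function names are hypothetical, `0` replaces the slides' `O`, and the search is plain brute force (exponential time), used here only to make the definitions concrete; it is not one of the algorithms cited in the slides.

```python
from itertools import combinations

# Example fragment matrix ('-' = SNP not covered, 0 written for O).
M = [
    "---01100-",
    "-0-01---1",
    "11011----",
    "001----0-",
    "-------10",
    "----0001-",
]

def conflict(r1, r2):
    """Two fragments conflict if they disagree at a SNP that both cover."""
    return any(a != b and a != "-" and b != "-" for a, b in zip(r1, r2))

def conflict_graph(rows):
    """Adjacency sets of the fragment conflict graph GF(M)."""
    adj = {i: set() for i in range(len(rows))}
    for i, j in combinations(range(len(rows)), 2):
        if conflict(rows[i], rows[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj

def two_coloring(adj, keep):
    """2-color the subgraph induced by `keep`; return None if it is not bipartite."""
    color = {}
    for start in keep:
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in keep:
                    continue
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle found
    return color

def min_fragment_removal(rows):
    """Try removal sets of increasing size until GF becomes bipartite."""
    adj = conflict_graph(rows)
    n = len(rows)
    for size in range(n + 1):
        for removed in combinations(range(n), size):
            keep = set(range(n)) - set(removed)
            coloring = two_coloring(adj, keep)
            if coloring is not None:
                return set(removed), coloring

removed, coloring = min_fragment_removal(M)
print("removed fragments (1-based):", sorted(i + 1 for i in removed))
```

The two color classes of the returned coloring are the two groups of fragments, one per haplotype. A minimum removal set need not be unique, so the fragment reported here may differ from the one dropped in the slides.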

Are there cases of M for which GF(M) is easier?

YES: the gapless M

---O11OO1O1O1OO1--- gapless

---O11OO---O1OO1--- gap

---O11--1O----O1--- 2 gaps

Sequencing errors (don’t call with low confidence)

---OO11?11--- ===> ---OO11-11---

Celera’s mate pairs

attcgttgtagtggtagcctaaatgtcggtagaccttga

attcgttgtagtggtagcctaaatgtcggtagaccttga

For a gapless M, the Min Fragment Removal Problem is polynomial

NOTE: M does not need to be gapless. It is enough that it can be sorted to become such (Consecutive Ones Property, Booth and Lueker, 1976)
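As a small illustration of the gapless / gapped distinction, the sketch below counts the gaps of a fragment, using `0` for the slides' `O`. The helper names are hypothetical. It checks the matrix in its given column order only; deciding whether some reordering makes it gapless (the consecutive ones property) needs a dedicated algorithm, which is not attempted here.

```python
def num_gaps(row: str) -> int:
    """Count maximal runs of '-' strictly between the first and last covered SNP."""
    core = row.strip("-")  # drop the uncovered prefix and suffix
    # After mapping 0 -> 1, splitting on '1' leaves only the runs of '-'.
    return sum(1 for run in core.replace("0", "1").split("1") if run)

def is_gapless(rows) -> bool:
    """A matrix is gapless (in this column order) if no fragment has a gap."""
    return all(num_gaps(r) == 0 for r in rows)

# The three example fragments, with 0 in place of O.
print(num_gaps("---0110010101001---"))
print(num_gaps("---01100---01001---"))
print(num_gaps("---011--10----01---"))
```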


An O(nm + n^3) D.P. algorithm

LFT(i), RGT(i): leftmost and rightmost SNP covered by fragment i

1 - O O 1 1 O O - -

2 - - 1 O 1 1 O - -

3 - - - 1 1 O - - -

4 - - - - O O 1 O -

5 - - - - - 1 O 1 O

Sort the rows according to LFT

D(i; h,k) := min cost to solve the problem up to row i, with k, h not removed and put in different haplotypes, and maximizing RGT(k), RGT(h)

$$
D(i; h,k) =
\begin{cases}
D(i-1; h,k) & \text{if } i, k \text{ compatible and } RGT(i) \le RGT(k), \text{ or } i, h \text{ compatible and } RGT(i) \le RGT(h) \\
1 + D(i-1; h,k) & \text{otherwise}
\end{cases}
$$

OPT is the min over h,k of D(n; h,k), and can be found in time O(nm + n^3)


Th: NP-hard if 2 gaps per fragment

Proof (simple): use the fact that for every G there is an M s.t. G = GF(M), and reduce from Max Bipartite Induced Subgraph on 3-regular graphs (in each row, max 3 non-gap entries, hence max 2 gaps)

Th: NP-hard even if 1 gap per fragment

Proof: technical. Reduction from MAX2SAT

But gaps must be long for the problem to be difficult.

We have an O(2^(2L) mn + 2^(3L) n^3) D.P. for MFR on a matrix with total gap length L
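The fact used in the simple proof (every graph G is the fragment conflict graph of some matrix M) can be illustrated with one possible construction: one row per node and one column per edge, with the two endpoints of the edge receiving opposite values. This construction is an assumption made for illustration (the slides do not spell it out), the function names are hypothetical, and `0` replaces the slides' `O`.

```python
from itertools import combinations

def matrix_from_graph(num_nodes, edges):
    """One row per node, one column per edge; the edge's endpoints get 0 and 1."""
    rows = [["-"] * len(edges) for _ in range(num_nodes)]
    for col, (u, v) in enumerate(edges):
        rows[u][col] = "0"
        rows[v][col] = "1"
    return ["".join(r) for r in rows]

def conflicts(rows):
    """Edge set of GF(M): pairs of rows that disagree at a SNP both cover."""
    return {
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if any(a != b and "-" not in (a, b) for a, b in zip(rows[i], rows[j]))
    }

# K4 is 3-regular: every row ends up with exactly 3 non-gap entries.
k4_edges = list(combinations(range(4), 2))
M = matrix_from_graph(4, k4_edges)
for row in M:
    print(row)
print(conflicts(M) == set(k4_edges))
```

Two rows share a covered column only if the corresponding nodes are joined by an edge, and in that column they always disagree, so the conflicts of M are exactly the edges of G.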

What about MFR with gaps? Why not ILP...

[Figure: example with fractional values (1/4, 1/3, 5/12, 1/2, ...)]

Randomized rounding heuristic: round and repeat. Worked well at Celera

Fragment removal is good for getting rid of contaminants. However, we may want to keep all fragments and correct errors in another way.

A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes. All fragments get assigned to one of the two haplotypes.

We describe the min SNP removal problem: remove the fewest columns from M so that the fragment graph becomes bipartite.

- - - O 1 1 O O -

- O 1 O 1 - - - 1

1 1 O 1 1 - - - -

O O 1 - - - O O -

- - - - - - 1 1 O

- - - - O O O 1 -

[Figure sequence: pairs of SNPs (columns) of M are checked one at a time; some pairs are marked OK, others CONFLICT]

SNP conflict graph GS(M)

1 node for each SNP (column)

edge between conflicting SNPs
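A short Python sketch of the SNP conflict graph. The slide text does not spell out when two SNPs conflict; the code assumes the definition usually given for this problem: two columns conflict if some pair of fragments covers both and the four entries consist of three equal symbols and one different one. Function names are hypothetical and `0` replaces the slides' `O`.

```python
from itertools import combinations

def snp_conflict(rows, i, j):
    """True if two fragments covering SNPs i and j show 3 equal symbols and 1 different."""
    covering = [r for r in rows if r[i] != "-" and r[j] != "-"]
    for r1, r2 in combinations(covering, 2):
        ones = [r1[i], r1[j], r2[i], r2[j]].count("1")
        if ones in (1, 3):  # odd number of 1s among the four entries
            return True
    return False

def snp_conflict_graph(rows):
    """Edge set of GS(M): one node per column, one edge per conflicting pair."""
    n = len(rows[0])
    return {(i, j) for i, j in combinations(range(n), 2) if snp_conflict(rows, i, j)}

# Three pairwise-conflicting fragments: GF is a triangle, and GS has an edge.
print(snp_conflict_graph(["01", "00", "10"]))
# Two complementary fragments: GF is a single edge, and GS has no edges.
print(snp_conflict_graph(["01", "10"]))
```

With an even number of 1s the two fragments either agree on both SNPs or disagree on both, so the two columns give consistent information about whether the fragments belong to the same haplotype.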


1 2 3 4 5 6 7 8 9

- - - O 1 1 O O -

- O 1 O 1 - - - 1

1 1 O 1 1 - - - -

O O 1 - - - O O -

- - - - - - 1 1 O

- - - - O O O 1 -

[Figure: GS(M), drawn on nodes 1 to 9, one per SNP]

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

THEOREM 2

For a gapless M, GS(M) is a perfect graph

COROLLARY

For a gapless M, the min SNP removal problem is polynomial

For a gapless M, GF(M) is bipartite if and only if

GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--OO11OO---------

----OO1OO1O11O---

--------11O1O111-

----11OO1O11O----

-------1OOO1-----

------11111O-----

--11O11O1OO------

Assume M is gapless and GS(M) is an independent set, but GF(M) is not bipartite.

Take an odd cycle in GF


--O?1???---------

----O????????O---

--------??O??1??-

----??????1??----

-------???O?-----

------????1?-----

--1???????O------

There is a generic structure: a horizontal-vertical cycle


“vertical lines”

There cannot be only one vertical line in an odd cycle

We merge the rightmost vertical line and the next one, reducing their number by 1

Hence, there cannot be a minimal (in number of vertical lines) counterexample


Must be 1


--O?1???---------

----O?????1??O---

--------??O??1??-

----??????1??----

-------???O?-----

------????1?-----

--1???????O------


Merge the rightmost lines


--O?1???---------

----O?????1------

--------??O------

----??????1------

-------???O------

------????1------

--1???????O------


Still a counterexample!

For a gapless M, GS(M) is a perfect graph

PROOF: GS(M) is the complement of a comparability graph A

Comparability graphs are perfect

Comparability graphs: undirected graphs that can be oriented to become a partial order

LEMMA: If i<j<k and (i,k) is a SNP conflict then either (i,j) or (j,k) is also a SNP conflict

[Figure: two fragments covering SNPs i < j < k, with (i,k) a conflict; compare their two entries at SNP j]

Equal: j conflicts with i

Different: j conflicts with k

I.e. if (i,j) is not a conflict and (j,k) is not a conflict, then (i,k) is not a conflict either

So the graph A, with an edge (u,v) whenever u < v and u is not in conflict with v, is a comparability graph, and GS is the complement of A

NOTE: independent set on perfect graphs is in P (Grötschel, Lovász, Schrijver, 84)

Hence gapless MSR is polynomial (max stable set on a perfect graph).

There are better D.P. algorithms, O(mn + m^2)

What if there are gaps?

THEOREM: The min SNP removal problem is NP-hard if there can be gaps (reduction from MAXCUT)

Again, gaps must be long for the problem to be difficult.

We have an O(mn^(2L+1) + n^(2L+2)) D.P. for MSR on a matrix with total gap length L
