Resolving ambiguity in DNA sequences

Judith Keijsper

Steven Kelk

Leen Stougie

Leo van Iersel



DNA


SNPs: sites where variation is observed


SNPs: sites that are interesting


SNPs: binary strings

Two binary strings: haplotypes

+

One ternary string: genotype
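As a minimal sketch of how two haplotypes combine into a genotype, the snippet below assumes the 0/1/2 encoding that the later slides use: a site where both haplotypes agree keeps that value, and a site where they differ becomes 2. The function name `combine` is hypothetical and does not appear in the slides.

```python
def combine(h1: str, h2: str) -> str:
    """Combine two binary haplotype strings into one ternary genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Sites where the haplotypes agree keep the common value;
    # sites where they differ are recorded as '2'.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The haplotypes 0110 and 0011 differ at the second and fourth site.
genotype = combine("0110", "0011")
print(genotype)  # 0212
```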



Resolving a genotype



Genotypes of a population

One possible set of haplotypes that resolves these genotypes




Parsimony Haplotyping (PH)

Input: set of genotypes

Output: smallest set of haplotypes such that every genotype is resolved by two of these haplotypes

  • Two haplotypes h1, h2 resolve a genotype g if:

    • At each site where g has a 0, h1 and h2 both have a 0;

    • At each site where g has a 1, h1 and h2 both have a 1;

    • At each site where g has a 2, h1 and h2 differ, i.e. 0/1 or 1/0.
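The three conditions above translate directly into a predicate. The sketch below is illustrative only: the names `resolves` and `resolutions` are hypothetical, and haplotypes and genotypes are assumed to be strings over '0', '1' (and '2' for genotypes).

```python
from itertools import product


def resolves(h1: str, h2: str, g: str) -> bool:
    """True if haplotypes h1 and h2 resolve genotype g."""
    if not (len(h1) == len(h2) == len(g)):
        return False
    return all(
        (a == b == c) if c in "01" else (a != b)  # a 2 in g means h1, h2 differ
        for a, b, c in zip(h1, h2, g)
    )


def resolutions(g: str):
    """Yield every unordered pair of haplotypes that resolves g."""
    twos = [i for i, c in enumerate(g) if c == "2"]
    seen = set()
    for bits in product("01", repeat=len(twos)):
        h1, h2 = list(g), list(g)
        for i, b in zip(twos, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"  # complement at ambiguous sites
        pair = tuple(sorted(("".join(h1), "".join(h2))))
        if pair not in seen:
            seen.add(pair)
            yield pair


for pair in resolutions("022"):
    print(pair, resolves(*pair, "022"))
```

A genotype with k ≥ 1 ambiguous sites has 2^(k-1) unordered resolutions, which is why choosing a small common set of haplotypes for a whole population is non-trivial.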


Perfect Phylogeny Haplotyping



Minimum Perfect Phylogeny Haplotyping

Input: set of genotypes

Output: smallest set of haplotypes such that every genotype is explained by two of these haplotypes and the haplotypes admit a perfect phylogeny

Lemma: a haplotype matrix admits a perfect phylogeny if and only if it does not have a particular forbidden matrix (shown on the original slide) as a submatrix.
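The forbidden matrix itself did not survive in this transcript. The sketch below therefore rests on an assumption: that the lemma refers to the standard four-gamete condition, under which a binary matrix is rejected when some pair of columns contains all four row patterns 00, 01, 10 and 11. The function name `admits_perfect_phylogeny` is hypothetical.

```python
from itertools import combinations


def admits_perfect_phylogeny(haplotypes: list[str]) -> bool:
    """Four-gamete test (ASSUMED form of the lemma's forbidden submatrix)."""
    if not haplotypes:
        return True
    sites = len(haplotypes[0])
    for i, j in combinations(range(sites), 2):
        # Collect the distinct (column i, column j) patterns over all rows.
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:  # 00, 01, 10 and 11 all occur
            return False
    return True


print(admits_perfect_phylogeny(["00", "01", "10"]))
print(admits_perfect_phylogeny(["00", "01", "10", "11"]))
```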



  • Both problems are NP-hard in general. Sharan, Halldórsson and Istrail therefore began a search for bounded instances that admit polynomial-time algorithms in their 2005 paper “Islands of tractability for parsimony haplotyping”.

  • Let PH(k,l) denote the problem PH where the input matrix has at most k 2’s per row and at most l 2’s per column.

  • A ‘*’ denotes no restriction e.g. PH(3,*) is the problem with no restriction on the number of 2’s per column, but at most three 2’s per row.

  • MPPH(k,l) is defined in the same way.



  • Parsimony Haplotyping (PH)

    • PH(4,3) is APX-hard (Sharan, Halldórsson, Istrail - 2005)

    • PH(3,*) is APX-hard (Lancia, Pinotti, Rizzi - 2004)

    • PH(2,*) is in P (Lancia et al., independently Cilibrasi et al. - 2005)

    • PH(3,3) is APX-hard (our result)

    • PH(*,1) is in P (our result)

    • An approximation algorithm for PH(*,l) (our result)

  • Minimum Perfect Phylogeny Haplotyping (MPPH)

    • NP-hard in general (Bafna, Gusfield, Hannenhalli, Yooseph - 2004)

    • MPPH(3,3) is APX-hard (our result)

    • MPPH(2,*) is in P (our result)

    • MPPH(*,1) is in P (our result)



PH(*,1)

  • A haplotype is consistent with a genotype if it can be used in a resolution of this genotype. For example, 010 is consistent with 022 but 110 is not.

  • Genotypes are compatible if they can share a haplotype. For example, 020 and 210 are compatible but 020 and 120 are not.
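Both definitions reduce to site-by-site checks, and the compatibility graph follows from testing every pair of genotypes. The sketch below reproduces the examples from this slide; the function names and the adjacency-set representation of the graph are assumptions made for illustration.

```python
from itertools import combinations


def consistent(h: str, g: str) -> bool:
    """True if haplotype h can be used in some resolution of genotype g."""
    return len(h) == len(g) and all(c == "2" or c == a for a, c in zip(h, g))


def compatible(g1: str, g2: str) -> bool:
    """True if some haplotype is consistent with both genotypes."""
    # The only obstruction is a site with a 0 in one genotype and a 1 in the other.
    return len(g1) == len(g2) and all(
        a == b or "2" in (a, b) for a, b in zip(g1, g2)
    )


def compatibility_graph(genotypes: list[str]) -> dict[int, set[int]]:
    """Adjacency sets over genotype indices; edges join compatible genotypes."""
    adj = {i: set() for i in range(len(genotypes))}
    for i, j in combinations(range(len(genotypes)), 2):
        if compatible(genotypes[i], genotypes[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj


print(consistent("010", "022"), consistent("110", "022"))  # True False
print(compatible("020", "210"), compatible("020", "120"))  # True False
print(compatibility_graph(["020", "210", "120"]))
```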



The compatibility graph

[Figure: example input genotype matrix with rows g1 to g7, and its compatibility graph on the vertices g1 to g7]



  • If two genotypes are compatible, then there is precisely one haplotype that is consistent with both of them. (At each column, read off the non-2 element.) So each edge corresponds to a unique haplotype.

  • The compatibility graph is a 1-sum of cliques, and is thus chordal. Every chordal graph contains a simplicial vertex, a vertex whose neighbourhood is a clique.

  • Given any mutually compatible set of genotypes (which thus appear as a clique in the compatibility graph), there is precisely one haplotype that is consistent with all of them. (At each column, read off the non-2 element.)

  • We call this the clique haplotype for that clique.
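The "read off the non-2 element" rule can be sketched as follows. The code assumes the PH(*,1) setting of these slides, where every column of a clique with at least two genotypes contains exactly one distinct non-2 value; it raises an error when that assumption fails. The name `clique_haplotype` is hypothetical.

```python
def clique_haplotype(genotypes: list[str]) -> str:
    """Haplotype consistent with every genotype in a mutually compatible set."""
    haplotype = []
    for column in zip(*genotypes):
        values = set(column) - {"2"}
        if len(values) != 1:
            # Either a 0/1 conflict (not a clique) or a column of only 2's
            # (cannot happen for two or more genotypes in PH(*,1)).
            raise ValueError("column does not determine a unique value")
        haplotype.append(values.pop())
    return "".join(haplotype)


# The compatible pair from the earlier example shares exactly one haplotype.
print(clique_haplotype(["020", "210"]))  # 010
```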



The compatibility graph

[Figure: the example genotype matrix and its compatibility graph on g1 to g7, with edges labelled by the clique haplotypes h1, h2 and h3]

  • Clique haplotype for yellow clique is: h1 = 0011001

  • Clique haplotype for pink clique is: h2 = 0010001

  • Clique haplotype for red clique is: h3 = 1000001



Algorithm for PH(*,1)

  • H is initially the empty set.

  • Find a simplicial vertex.

  • Resolve the corresponding genotype g by applying the first case that holds:

    • If g is already resolved by H, there is no need to add new haplotypes.

    • If g has no 2’s, simply add g to H.

    • If adding just the clique haplotype to H lets it resolve g, do so.

    • If adding just some non-clique haplotype to H lets it resolve g, do so.

    • If g is not an isolated vertex, add hc and h to H, where hc is the clique haplotype and g = hc + h.

    • Otherwise, add any two haplotypes h1, h2 to H such that g = h1 + h2.

  • Remove the simplicial vertex; the resulting graph is again chordal. Repeat until no vertices remain.
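The graph bookkeeping in this algorithm (find a simplicial vertex, remove it, repeat) can be sketched independently of the case analysis for resolving g, which is not implemented here. The graph is assumed to be a dict of adjacency sets, and both function names are hypothetical.

```python
from itertools import combinations


def find_simplicial_vertex(adj: dict[int, set[int]]):
    """Return a vertex whose neighbourhood is a clique, or None if there is none."""
    for v, neighbours in adj.items():
        if all(b in adj[a] for a, b in combinations(sorted(neighbours), 2)):
            return v
    return None


def simplicial_elimination_order(adj: dict[int, set[int]]) -> list[int]:
    """Repeatedly remove a simplicial vertex; succeeds exactly on chordal graphs."""
    adj = {v: set(n) for v, n in adj.items()}  # work on a copy
    order = []
    while adj:
        v = find_simplicial_vertex(adj)
        if v is None:
            raise ValueError("no simplicial vertex: graph is not chordal")
        order.append(v)
        for u in adj.pop(v):
            adj[u].discard(v)
    return order


path = {0: {1}, 1: {0, 2}, 2: {1}}
print(simplicial_elimination_order(path))
```

In the algorithm above, each vertex in this order would be resolved against the current set H before it is removed.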



In MPPH(*,1) some resolutions are forbidden…

[Figure: two columns in G and the corresponding columns in H after resolving and eliminating duplicates; one outcome is marked as a forbidden resolution, the other as a safe resolution]



Reducing MPPH(*,1) to PH(*,1) by discouraging forbidden resolutions

  • In PH(*,1), there will – for each pair of columns – be at most one row that is 22, and if such a row exists there will be no other 2’s in those columns.

  • Idea: to reduce MPPH(*,1) to PH(*,1), we have to discourage such rows from resolving the forbidden way.

  • We do this by adding, for each pair of columns where a 22 can be seen, a ‘blocking’ column that biases resolutions in favour of the safe way.

The idea is that, within PH(*,1), the 22 row might still choose the forbidden resolution (e.g. 00/11 in this case), but the haplotypes used to do so cannot be shared with any other genotype because of the extra column. Choosing the safe resolution is therefore at least as good. So (assuming feasibility) there exist optimal solutions to PH(*,1) in which all such 22 resolutions are safe.

[Figure: the genotype matrix before and after adding the blocking column]



Approximation Algorithms

  • A simple matching algorithm and a combination of two lower bounds.

  • Let l be the maximum number of 2’s per column.

  • Lemma: any solution contains at least LB(n) haplotypes

  • If there are no genotypes without 2’s this bound can be significantly improved.



Proof

  • If the compatibility graph is a clique we can combine two known lower bounds to get the new bound.

  • Otherwise there is a column where one genotype has a 1 and another genotype has a 0.

  • Deleting the at most l genotypes with a 2 in this column disconnects the compatibility graph into components. Let n_i be the number of genotypes in the i-th component and apply induction to each component.



Matching algorithm (PHM)

  • Construct the compatibility graph C(G)

  • Find a maximum matching M in C(G)

  • For every edge {g1,g2} in M:

    • If either g1 or g2 contains no 2’s, then resolve both genotypes together by two haplotypes.

    • Otherwise, resolve g1 and g2 by three haplotypes.

  • Resolve each remaining genotype by two haplotypes
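The steps above can be sketched as follows. Several parts are assumptions made for illustration: the maximum matching is found by exhaustive search (suitable only for small inputs, and a stand-in for a proper matching algorithm), the shared haplotype takes a 0 wherever both genotypes have a 2, and all function names are hypothetical.

```python
from itertools import combinations


def compatible(g1, g2):
    return all(a == b or "2" in (a, b) for a, b in zip(g1, g2))


def maximum_matching(edges):
    """Exact maximum matching by exhaustive search (small inputs only)."""
    if not edges:
        return []
    (u, v), rest = edges[0], edges[1:]
    without = maximum_matching(rest)  # skip edge (u, v)
    with_edge = [(u, v)] + maximum_matching(
        [e for e in rest if u not in e and v not in e]  # take edge (u, v)
    )
    return with_edge if len(with_edge) > len(without) else without


def partner(h, g):
    """The haplotype h2 with h + h2 = g, assuming h is consistent with g."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[a] if c == "2" else c for a, c in zip(h, g))


def shared_haplotype(g1, g2):
    """A haplotype consistent with both compatible genotypes."""
    return "".join(a if a != "2" else (b if b != "2" else "0")
                   for a, b in zip(g1, g2))


def phm(genotypes):
    n = len(genotypes)
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if compatible(genotypes[i], genotypes[j])]
    haplotypes, matched = set(), set()
    for i, j in maximum_matching(edges):
        h = shared_haplotype(genotypes[i], genotypes[j])
        # Three haplotypes per matched pair; only two if one genotype has no 2's.
        haplotypes |= {h, partner(h, genotypes[i]), partner(h, genotypes[j])}
        matched |= {i, j}
    for i in set(range(n)) - matched:
        h = genotypes[i].replace("2", "0")
        haplotypes |= {h, partner(h, genotypes[i])}
    return haplotypes


print(sorted(phm(["020", "210", "111"])))
```

In this example n = 3, the matching has size q = 1 and one genotype has no 2's, so the sketch returns 2n - q - 1 = 4 haplotypes.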



Theorem: if there are at most l 2’s per column, the matching algorithm PHM achieves an approximation ratio that depends only on l.

Proof: let n be the number of genotypes, q the size of the maximum matching and nt the number of genotypes without 2’s.

  • PHM uses 2n - q - nt haplotypes.

  • The vertices not covered by the matching form an independent set of size n - 2q.

  • Hence at least n - 2q haplotypes are needed, since pairwise incompatible genotypes cannot share a haplotype.
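Taken together, the upper bound on the number of haplotypes PHM uses and the independent-set lower bound give the following ratio. This is only a sketch of how the two counts combine: OPT (the optimal number of haplotypes) is a symbol introduced here, not one from the slides, and the bound is meaningful only when n > 2q.

```latex
\[
\frac{\mathrm{PHM}}{\mathrm{OPT}}
\;\le\;
\frac{2n - q - n_t}{\,n - 2q\,}
\qquad (n > 2q).
\]
```

The bound is strongest when the maximum matching is small relative to n and degrades as q approaches n/2.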



Proof (continued)

  • If we are done.

  • Otherwise we use the lower bound LB(n) and prove that:



Summary

  • Many new “Islands of tractability”.

  • Approximation algorithms depending on the number of 2’s per column.

  • Main open problems are PH(*,2) and MPPH(*,2)



Leo van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie, Beaches of islands of tractability: Algorithms for parsimony and minimum perfect phylogeny haplotyping problems, WABI 2006, LNCS 4175, pp. 80-91.

Questions?

Leo van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie, Shorelines of islands of tractability: Algorithms for parsimony and minimum perfect phylogeny haplotyping problems, submitted for journal publication.

