Resolving ambiguity in DNA sequences

## Resolving ambiguity in DNA sequences


**Resolving ambiguity in DNA sequences**
Judith Keijsper, Steven Kelk, Leen Stougie, Leo van Iersel

**SNPs: binary strings**
• Two binary strings: haplotypes
• One ternary string: genotype
[Figure: two haplotypes added together to give one genotype.]

**Genotypes of a population**
One possible set of haplotypes that resolves these genotypes.
[Figure: a population of genotypes and a resolving set of haplotypes.]

**Parsimony Haplotyping (PH)**
Input: a set of genotypes.
Output: a smallest set of haplotypes such that every genotype is resolved by two of these haplotypes.
• Two haplotypes h1, h2 resolve a genotype g if:
  - at each site where g has a 0, h1 and h2 both have a 0;
  - at each site where g has a 1, h1 and h2 both have a 1;
  - at each site where g has a 2, h1 and h2 differ, i.e. 0/1 or 1/0.

**Minimum Perfect Phylogeny Haplotyping**
Input: a set of genotypes.
Output: a smallest set of haplotypes such that every genotype is explained by two of these haplotypes and the haplotypes admit a perfect phylogeny.
Lemma: a haplotype matrix admits a perfect phylogeny if and only if it does not have [forbidden matrix, shown as an image in the original slide] as a submatrix.

• Both problems are NP-hard in general, and therefore Sharan, Halldórsson and Istrail started a search for bounded instances that admit polynomial-time algorithms in their 2005 paper “Islands of tractability for parsimony haplotyping”.
• Let PH(k,l) denote the problem PH where the input matrix has at most k 2’s per row and at most l 2’s per column.
• A ‘*’ denotes no restriction, e.g. PH(3,*) is the problem with no restriction on the number of 2’s per column, but at most three 2’s per row.
• Same definition for MPPH.

**Parsimony Haplotyping (PH)**
• PH(4,3) is APX-hard (Sharan, Halldórsson, Istrail, 2005)
• PH(3,*) is APX-hard (Lancia, Pinotti, Rizzi, 2004)
• PH(2,*) is in P (Lancia et al., independently Cilibrasi et al., 2005)
• Our results:
  - PH(3,3) is APX-hard
  - PH(*,1) is in P
  - An approximation algorithm for PH(*,l) [approximation ratio missing in source]

**Minimum Perfect Phylogeny Haplotyping (MPPH)**
• NP-hard in general (Bafna, Gusfield, Hannenhalli, Yooseph, 2004)
• Our results:
  - MPPH(3,3) is APX-hard
  - MPPH(2,*) is in P
  - MPPH(*,1) is in P

**PH(*,1)**
• A haplotype is consistent with a genotype if it can be used in a resolution of this genotype. For example, 010 is consistent with 022 but 110 is not.
• Genotypes are compatible if they can share a haplotype. For example, 020 and 210 are compatible but 020 and 120 are not.

**The compatibility graph**
[Figure: example input genotype matrix with genotypes g1 to g7, and its compatibility graph.]

• (1) If two genotypes are compatible, then there is precisely one haplotype that is consistent with both of them. (Each column contains at most one 2, so at each column read off the non-2 element.) So each edge corresponds to a unique haplotype.
• (2) The compatibility graph is a 1-sum of cliques, and is thus chordal. Every chordal graph contains a simplicial vertex, i.e. a vertex whose neighbourhood is a clique.
• (3) Given any mutually compatible set of genotypes (which thus appear as a clique in the compatibility graph), there is precisely one haplotype that is consistent with all of them. (At each column, read off the non-2 element.) We call this the clique haplotype for that clique.

**The compatibility graph**
[Figure: the compatibility graph on g1 to g7 with its cliques coloured and its edges labelled by clique haplotypes h1, h2, h3.]
• Clique haplotype for the yellow clique: h1 = 0011001
• Clique haplotype for the pink clique: h2 = 0010001
• Clique haplotype for the red clique: h3 = 1000001

**Algorithm for PH(*,1)**
• H is initially the empty set.
• Find a simplicial vertex.
• Resolve the corresponding genotype g as follows (use the first case that applies):
  - If g is already resolved by H, there’s no need to add new haplotypes.
  - If g has no 2’s, simply add g to H.
  - If just adding the clique haplotype to H lets it resolve g, do it.
  - If just adding some non-clique haplotype to H lets it resolve g, do it.
  - If g is not an isolated vertex, add hc and h to H (where hc is the clique haplotype and g = hc + h).
  - Otherwise, add any two haplotypes h1, h2 to H such that g = h1 + h2.
• Remove the simplicial vertex; the resulting graph is again chordal, so the procedure can be repeated.

**In MPPH(*,1) some resolutions are forbidden…**
[Figure: two columns in G and the corresponding columns in H after resolving and eliminating duplicates. One way of resolving gives a FORBIDDEN RESOLUTION, the other a SAFE RESOLUTION.]

**Reducing MPPH(*,1) to PH(*,1) by discouraging forbidden resolutions**
• In PH(*,1), there will be, for each pair of columns, at most one row that is 22, and if such a row exists there will be no other 2’s in those columns.
• Idea: to reduce MPPH(*,1) to PH(*,1), we have to discourage such rows from resolving the forbidden way.
• We do this by adding, for each pair of columns where a 22 can be seen, a ‘blocking’ column that biases resolutions in favour of the safe way.
• Within PH(*,1), the 22 might still choose the forbidden resolution (e.g. 00/11 in this case), but the haplotypes used to do this cannot be shared by any other genotypes, because of the extra column. So it is just as good, if not better, to choose the safe resolution.
• So (assuming feasibility) there exist optimal solutions to PH(*,1) where all such 22 resolutions are safe.
[Figure: the genotype matrix before and after adding the blocking column.]

**Approximation Algorithms**
• A simple matching algorithm and a combination of two lower bounds.
• Let l be the maximum number of 2’s per column.
• Lemma: any solution contains at least LB(n) haplotypes [expression for LB(n) missing in source].
• If there are no genotypes without 2’s, this bound can be significantly improved.

**Proof**
• If the compatibility graph is a clique, we can combine two known lower bounds to get the new bound.
• Otherwise there is a column where one genotype has a 1 and another genotype has a 0.
• Deleting the at most l genotypes with a 2 in this column disconnects the compatibility graph into components. Let ni be the number of genotypes in the i-th component and apply induction [inequality missing in source].

**Matching algorithm (PHM)**
• Construct the compatibility graph C(G).
• Find a maximum matching M in C(G).
• For every edge {g1,g2} in M:
  - If either g1 or g2 contains no 2’s, then resolve g1 and g2 by two haplotypes.
  - Otherwise, resolve g1 and g2 by three haplotypes.
• Resolve each remaining genotype by two haplotypes.

**Theorem**
The matching algorithm PHM achieves an approximation ratio of [expression in l missing in source] if there are at most l 2’s per column.
Proof: let q be the size of the maximum matching and nt the number of genotypes without 2’s.
• PHM uses 2n - q - nt haplotypes.
• The vertices not covered by the matching form an independent set of size n - 2q.
• Hence at least n - 2q haplotypes are needed.

**Proof (continued)**
• If [condition missing in source], we are done.
• Otherwise we use the lower bound LB(n) and prove that: [inequality missing in source]

**Summary**
• Many new “Islands of tractability”.
• Approximation algorithms depending on the number of 2’s per column.
• Main open problems are PH(*,2) and MPPH(*,2).

**References**
• Leo van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie, Beaches of islands of tractability: Algorithms for parsimony and minimum perfect phylogeny haplotyping problems, WABI 2006, LNCS 4175, pp. 80-91.
• Leo van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie, Shorelines of islands of tractability: Algorithms for parsimony and minimum perfect phylogeny haplotyping problems, submitted for journal publication.

Questions?
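
The definitions used throughout the slides (resolution, consistency, compatibility, the clique haplotype and the compatibility graph) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes genotypes are strings over 0/1/2 and haplotypes are strings over 0/1, all of the same length.

```python
from itertools import combinations


def resolves(h1, h2, g):
    """Return True if haplotypes h1 and h2 resolve genotype g."""
    if not (len(h1) == len(h2) == len(g)):
        return False
    for a, b, site in zip(h1, h2, g):
        if site == "2":
            if a == b:  # a 2 requires 0/1 or 1/0
                return False
        elif not (a == b == site):  # a 0 or 1 must appear in both haplotypes
            return False
    return True


def consistent(h, g):
    """Return True if h can be used in some resolution of g."""
    return len(h) == len(g) and all(s == "2" or s == a for a, s in zip(h, g))


def partner(h, g):
    """Return the unique h' with g = h + h', or None if h is not consistent with g."""
    if not consistent(h, g):
        return None
    flip = {"0": "1", "1": "0"}
    return "".join(flip[a] if s == "2" else s for a, s in zip(h, g))


def compatible(g1, g2):
    """Return True if g1 and g2 can share a haplotype."""
    return len(g1) == len(g2) and all(
        a == "2" or b == "2" or a == b for a, b in zip(g1, g2)
    )


def clique_haplotype(genotypes):
    """Read off the non-2 element in every column.

    Returns None if some column has only 2's or has both a 0 and a 1,
    i.e. if the haplotype is not uniquely determined.
    """
    result = []
    for column in zip(*genotypes):
        values = {s for s in column if s != "2"}
        if len(values) != 1:
            return None
        result.append(values.pop())
    return "".join(result)


def compatibility_edges(genotypes):
    """Return the edges of the compatibility graph as pairs of indices."""
    return [
        (i, j)
        for i, j in combinations(range(len(genotypes)), 2)
        if compatible(genotypes[i], genotypes[j])
    ]


if __name__ == "__main__":
    # Examples taken from the PH(*,1) slide.
    print(consistent("010", "022"), consistent("110", "022"))
    print(compatible("020", "210"), compatible("020", "120"))
    print(clique_haplotype(["020", "210"]))
    print(partner("010", "022"))
```

The `partner` helper corresponds to the step "add hc, h to H (where g = hc + h)": once one haplotype of a resolution is fixed, the other is determined by flipping the sites where g has a 2.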