## Resolving ambiguity in DNA sequences


Parsimony Haplotyping (PH)

Input: set of genotypes

Output: smallest set of haplotypes such that every genotype is resolved by two of these haplotypes

- Two haplotypes h1, h2 resolve a genotype g if:
  - At each site where g has a 0, h1 and h2 both have a 0;
  - At each site where g has a 1, h1 and h2 both have a 1;
  - At each site where g has a 2, h1 and h2 differ, i.e. 0/1 or 1/0.
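The resolution rule above can be written as a small check. The following is a minimal sketch, assuming genotypes and haplotypes are stored as strings over `0`, `1`, `2`; the function names `resolves` and `min_haplotypes_bruteforce` are illustrative and not taken from the slides. The exhaustive search is only meant to make the problem statement concrete for tiny inputs, since PH is NP-hard in general.

```python
from itertools import combinations, product

def resolves(h1: str, h2: str, g: str) -> bool:
    """Return True if haplotypes h1 and h2 together resolve genotype g."""
    if not (len(h1) == len(h2) == len(g)):
        return False
    for a, b, s in zip(h1, h2, g):
        if s == "2":
            if a == b:           # site with a 2: the haplotypes must differ
                return False
        elif a != s or b != s:   # site with 0 or 1: both haplotypes must match g
            return False
    return True

def min_haplotypes_bruteforce(genotypes):
    """Exhaustive search for a smallest resolving set (tiny inputs only)."""
    m = len(genotypes[0])
    candidates = ["".join(bits) for bits in product("01", repeat=m)]
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            # a genotype without 2's is resolved by using one haplotype twice
            if all(any(resolves(a, b, g) for a in subset for b in subset)
                   for g in genotypes):
                return set(subset)
    return None
```

For example, `min_haplotypes_bruteforce(["022", "020"])` needs three haplotypes: 020 forces 000 and 010, and neither pair resolving 022 is contained in those two.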

Minimum Perfect Phylogeny Haplotyping

Input: set of genotypes

Output: smallest set of haplotypes such that every genotype is explained by two of these haplotypes and the haplotypes admit a perfect phylogeny

Lemma: a haplotype matrix admits a perfect phylogeny if and only if it does not have [forbidden matrix, shown as a figure on the slide] as a submatrix.

Both problems are NP-hard in general. Sharan, Halldórsson and Istrail therefore started a search for bounded instances that admit polynomial-time algorithms in their 2005 paper “Islands of tractability for parsimony haplotyping”.

- Let PH(k,l) denote the problem PH where the input matrix has at most k 2’s per row and at most l 2’s per column.
- A ‘*’ denotes no restriction e.g. PH(3,*) is the problem with no restriction on the number of 2’s per column, but at most three 2’s per row.
- Same definition for MPPH.
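To make the PH(k,l) notation concrete, here is a minimal sketch that counts 2's per row and per column of a genotype matrix. It assumes the matrix is given as a list of equal-length strings; `two_counts` and `in_class` are hypothetical helper names, and `None` stands in for '*'.

```python
def two_counts(genotypes):
    """Return (k, l): the maximum number of 2's in any row and in any column."""
    k = max(row.count("2") for row in genotypes)
    l = max(sum(row[j] == "2" for row in genotypes)
            for j in range(len(genotypes[0])))
    return k, l

def in_class(genotypes, k=None, l=None):
    """Check whether the matrix is a PH(k,l) instance; None plays the role of '*'."""
    rows, cols = two_counts(genotypes)
    return (k is None or rows <= k) and (l is None or cols <= l)
```

With `G = ["0221", "2010", "0100"]`, `two_counts(G)` gives `(2, 1)`, so `G` is an instance of PH(*,1) and PH(2,*) but not of PH(1,*).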

Parsimony Haplotyping (PH)

- PH(4,3) is APX-hard (Sharan, Halldórsson, Istrail, 2005)
- PH(3,*) is APX-hard (Lancia, Pinotti, Rizzi, 2004)
- PH(2,*) is in P (Lancia et al.; independently Cilibrasi et al., 2005)
- Our results:
  - PH(3,3) is APX-hard
  - PH(*,1) is in P
  - An approximation algorithm for PH(*,l)

Minimum Perfect Phylogeny Haplotyping (MPPH)

- NP-hard in general (Bafna, Gusfield, Hannenhalli, Yooseph, 2004)
- Our results:
  - MPPH(3,3) is APX-hard
  - MPPH(2,*) is in P
  - MPPH(*,1) is in P

PH(*,1)

- A haplotype is consistent with a genotype if it can be used in a resolution of this genotype. For example, 010 is consistent with 022 but 110 is not.
- Genotypes are compatible if they can share a haplotype. For example, 020 and 210 are compatible but 020 and 120 are not.
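Both definitions reduce to a column-by-column comparison. A minimal sketch, assuming string-encoded genotypes and haplotypes of equal length; the names `consistent` and `compatible` are illustrative:

```python
def consistent(h: str, g: str) -> bool:
    """h can take part in a resolution of g: it agrees with g wherever g is not 2."""
    return len(h) == len(g) and all(s == "2" or s == a for a, s in zip(h, g))

def compatible(g1: str, g2: str) -> bool:
    """g1 and g2 can share a haplotype: no column holds a 0 in one and a 1 in the other."""
    return len(g1) == len(g2) and all(
        a == b or "2" in (a, b) for a, b in zip(g1, g2)
    )
```

Applied to the examples above, `consistent("010", "022")` and `compatible("020", "210")` hold, while `consistent("110", "022")` and `compatible("020", "120")` do not.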

- (1) If two genotypes are compatible, then there is precisely one haplotype that is consistent with both of them. (At each column, read off the non-2 element.) So each edge of the compatibility graph corresponds to a unique haplotype.
- (2) The compatibility graph is a 1-sum of cliques, and is thus chordal. Every chordal graph contains a simplicial vertex, a vertex whose neighbourhood is a clique.
- (3) Given any mutually compatible set of genotypes (which thus appear as a clique in the compatibility graph), there is precisely one haplotype that is consistent with all of them. (At each column, read off the non-2 element.)
- We call this the clique haplotype for that clique.
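The "read off the non-2 element" rule can be sketched as follows. This assumes a PH(*,1) instance and a set of at least two mutually compatible genotypes, so that every column contains exactly one distinct non-2 value; `clique_haplotype` is a hypothetical name.

```python
def clique_haplotype(genotypes):
    """Read off the non-2 entry in each column of a mutually compatible set."""
    hap = []
    for column in zip(*genotypes):
        values = set(column) - {"2"}
        if len(values) != 1:
            # empty: the column holds only 2's; two values: genotypes are incompatible
            raise ValueError("column does not determine a unique entry")
        hap.append(values.pop())
    return "".join(hap)
```

For the compatible pair 020 and 210 this yields 010, the one haplotype consistent with both.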

The compatibility graph

[Figure: compatibility graph on the genotypes g1 to g7, with edges labelled by the clique haplotypes h1, h2 and h3.]

- Clique haplotype for the yellow clique: h1 = 0011001
- Clique haplotype for the pink clique: h2 = 0010001
- Clique haplotype for the red clique: h3 = 1000001

Algorithm for PH(*,1)

- H is initially the empty set.
- Find a simplicial vertex.
- Resolve the corresponding genotype g using the first of the following cases that applies:
  - If g is already resolved by H, there’s no need to add new haplotypes.
  - If g has no 2’s, simply add g to H.
  - If just adding the clique haplotype to H lets it resolve g, do it.
  - If just adding some non-clique haplotype to H lets it resolve g, do it.
  - If g is not an isolated vertex, add hc and h to H, where hc is the clique haplotype and g = hc + h.
  - Otherwise, add any two haplotypes h1, h2 to H such that g = h1 + h2.
- Remove the simplicial vertex; the resulting graph is again chordal. Repeat from the second step until no vertices remain.
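The graph-handling part of the algorithm (build the compatibility graph, repeatedly pick and remove a simplicial vertex) can be sketched as below. This covers only the vertex ordering, not the case analysis for resolving each genotype, and all function names are illustrative assumptions rather than the authors' implementation.

```python
from itertools import combinations

def compatible(g1, g2):
    """No column holds a 0 in one genotype and a 1 in the other."""
    return all(a == b or "2" in (a, b) for a, b in zip(g1, g2))

def compatibility_graph(genotypes):
    """Adjacency sets over genotype indices; an edge joins compatible genotypes."""
    adj = {i: set() for i in range(len(genotypes))}
    for i, j in combinations(range(len(genotypes)), 2):
        if compatible(genotypes[i], genotypes[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj

def find_simplicial(adj):
    """Return a vertex whose neighbourhood is a clique, or None if there is none."""
    for v, nbrs in adj.items():
        if all(b in adj[a] for a, b in combinations(nbrs, 2)):
            return v
    return None

def elimination_order(adj):
    """Repeatedly remove a simplicial vertex; the order in which genotypes are processed."""
    adj = {v: set(n) for v, n in adj.items()}  # work on a copy
    order = []
    while adj:
        v = find_simplicial(adj)
        if v is None:
            raise ValueError("graph is not chordal")
        order.append(v)
        for u in adj.pop(v):
            adj[u].discard(v)
    return order
```

On a PH(*,1) instance the graph is chordal, so `find_simplicial` always succeeds and `elimination_order` visits every genotype exactly once.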

In MPPH(*,1) some resolutions are forbidden…

[Figure: two columns in G and the corresponding columns in H after resolving and eliminating duplicates, shown once for a forbidden resolution and once for a safe resolution.]

Reducing MPPH(*,1) to PH(*,1) by discouraging forbidden resolutions

- In PH(*,1), there will – for each pair of columns – be at most one row that is 22, and if such a row exists there will be no other 2’s in those columns.
- Idea: to reduce MPPH(*,1) to PH(*,1), we have to discourage such rows from resolving the forbidden way.
- We do this by adding, for each pair of columns where a 22 can be seen, a ‘blocking’ column that biases resolutions in favour of the safe way.

The idea is that, within PH(*,1), the 22 row might still choose the forbidden resolution (e.g. 00/11 in this case), but the haplotypes used to do this cannot be shared by any other genotype because of the extra column. Choosing the safe resolution is therefore at least as good. So (assuming feasibility) there exist optimal solutions to PH(*,1) in which all such 22 resolutions are safe.


Approximation Algorithms

- A simple matching algorithm and a combination of two lower bounds.
- Let l be the maximum number of 2’s per column.
- Lemma: any solution contains at least LB(n) haplotypes
- If there are no genotypes without 2’s this bound can be significantly improved.

Proof

- If the compatibility graph is a clique we can combine two known lower bounds to get the new bound.
- Otherwise there is a column where one genotype has a 1 and another genotype has a 0.
- Deleting the at most l genotypes with a 2 in this column disconnects the compatibility graph into components. Let ni be the number of genotypes in the i-th component and apply induction:

Matching algorithm (PHM)

- Construct the compatibility graph C(G)
- Find a maximum matching M in C(G)
- For every edge {g1,g2} in M:
  - If either g1 or g2 contains no 2’s, resolve g1 and g2 together by two haplotypes.
  - Otherwise, resolve g1 and g2 together by three haplotypes.
- Resolve each remaining genotype by two haplotypes.
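The steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes distinct string-encoded genotypes, and it swaps in an exhaustive maximum matching that is only practical for small graphs, where a real implementation would use a polynomial-time routine such as Edmonds' blossom algorithm. The helper names (`max_matching`, `shared_haplotype`, `partner`, `phm`) are hypothetical.

```python
from itertools import combinations

def compatible(g1, g2):
    return all(a == b or "2" in (a, b) for a, b in zip(g1, g2))

def max_matching(edges):
    """Exhaustive maximum matching (exponential time; small graphs only)."""
    if not edges:
        return []
    (u, v), rest = edges[0], edges[1:]
    without = max_matching(rest)
    with_edge = [(u, v)] + max_matching(
        [e for e in rest if u not in e and v not in e])
    return with_edge if len(with_edge) > len(without) else without

def shared_haplotype(g1, g2):
    """A haplotype consistent with both genotypes (0 where both have a 2)."""
    return "".join(a if a != "2" else (b if b != "2" else "0")
                   for a, b in zip(g1, g2))

def partner(h, g):
    """The haplotype that resolves g together with a consistent haplotype h."""
    return "".join(("1" if a == "0" else "0") if s == "2" else s
                   for a, s in zip(h, g))

def phm(genotypes):
    n = len(genotypes)
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if compatible(genotypes[i], genotypes[j])]
    matching = max_matching(edges)
    haplotypes, covered = set(), set()
    for i, j in matching:
        g1, g2 = genotypes[i], genotypes[j]
        h = shared_haplotype(g1, g2)
        # if one genotype has no 2's, h equals it and its partner is h itself,
        # so only two distinct haplotypes are added for this edge
        haplotypes |= {h, partner(h, g1), partner(h, g2)}
        covered |= {i, j}
    for i in set(range(n)) - covered:
        g = genotypes[i]
        h = g.replace("2", "0")
        haplotypes |= {h, partner(h, g)}
    return haplotypes, matching
```

For `["020", "210", "111"]` the only edge is between 020 and 210, so the matched pair is resolved by 010, 000 and 110, and the unmatched genotype 111 resolves itself.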

Theorem: the matching algorithm PHM achieves an approximation ratio of if there are at most l 2’s per column.

Proof: let q be the size of the maximum matching and nt the number of genotypes without 2’s.

- PHM uses 2n - q - nt haplotypes.
- The vertices not covered by the matching form an independent set of size n - 2q.
- Hence at least n - 2q haplotypes are needed.

Proof (continued)

- If we are done.
- Otherwise we use the lower bound LB(n) and prove that:

Summary

- Many new “Islands of tractability”.
- Approximation algorithms depending on the number of 2’s per column.
- Main open problems are PH(*,2) and MPPH(*,2)

Leo van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie, Beaches of islands of tractability: Algorithms for parsimony and minimum perfect phylogeny haplotyping problems, WABI 2006, LNCS 4175, pp. 80-91.

Questions?

Leo van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie, Shorelines of islands of tractability: Algorithms for parsimony and minimum perfect phylogeny haplotyping problems, submitted for journal publication.
