
On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities

Ilan Gronau Shlomo Moran

Technion – Israel Institute of Technology

Haifa, Israel


Pairwise-Distance Based Reconstruction

[Figure: a dissimilarity matrix D over the taxa B, E, G, H, L, M is used to reconstruct an edge-weighted tree T; the tree metric DT is then calculated from T.]
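The "calculate" step, which derives the tree metric DT from an edge-weighted tree T, can be sketched as follows. This is a minimal illustration only: the function name `tree_metric`, the edge-list representation, and the 4-leaf tree are assumptions made for the example and are not the tree shown on the slide.

```python
from collections import defaultdict


def tree_metric(edges, leaves):
    """Return DT as a dict {(u, v): path length in the tree} over all leaf pairs."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def distances_from(src):
        # Depth-first traversal; in a tree every node is reached by a unique path.
        dist = {src: 0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        return dist

    dt = {}
    for u in leaves:
        d = distances_from(u)
        for v in leaves:
            dt[(u, v)] = d[v]
    return dt


# Hypothetical tree: internal vertices p and q, leaves B, E, G, H.
edges = [("B", "p", 4), ("E", "p", 2), ("p", "q", 1), ("G", "q", 5), ("H", "q", 3)]
DT = tree_metric(edges, ["B", "E", "G", "H"])
```

Running one traversal per leaf keeps the sketch short; for n leaves it visits the tree n times.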


Optimization Criteria

We wish the tree-metric DT to approximate all the pairwise distances in D simultaneously: DT should be “close” to D.

[Figure: the matrices D and DT over the taxa B, E, G, H, L, M.]

Two “closeness” measures are studied here:

  • Maximal Difference (l∞)

  • Maximal Distortion


Maximal Difference (l∞) vs. Maximal Distortion

[Figure: the matrices D and DT over the taxa B, E, G, H, L, M.]

Goal: find an optimal tree T, one which minimizes the maximal difference/distortion between D and DT.
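The two closeness measures can be illustrated with a short sketch. The slides do not spell out the definition of maximal distortion; the version below assumes the usual one for metrics (largest expansion ratio times largest contraction ratio). The function names and the 3-taxon data are hypothetical.

```python
def max_difference(D, DT):
    """l-infinity difference: the largest absolute gap over all pairs."""
    return max(abs(D[p] - DT[p]) for p in D)


def max_distortion(D, DT):
    """Assumed definition: (max DT/D) * (max D/DT) over pairs of distinct taxa."""
    pairs = [p for p in D if p[0] != p[1]]
    expansion = max(DT[p] / D[p] for p in pairs)
    contraction = max(D[p] / DT[p] for p in pairs)
    return expansion * contraction


# Hypothetical input dissimilarities and a candidate tree metric on 3 taxa.
D = {("a", "b"): 4.0, ("a", "c"): 6.0, ("b", "c"): 8.0}
DT = {("a", "b"): 5.0, ("a", "c"): 6.0, ("b", "c"): 6.0}

diff = max_difference(D, DT)   # largest gap is |8 - 6|
dist = max_distortion(D, DT)   # (5/4) * (8/6)
```

Under this definition the distortion is unchanged when DT is multiplied by a constant, whereas the maximal difference is not, which is one reason the two criteria behave differently.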


Previous Works on Approximating Dissimilarities by Tree Distances

  • Negative results (NP-hardness):

    • Closest tree-metric (even ultrametric) to a dissimilarity matrix under l1 and l2 [Day ‘87]

    • Closest tree-metric to a dissimilarity matrix under l∞ [ABFPT99]

      • Hard to approximate within a factor better than 1.125

      • Implicit: hard to approximate the closest MaxDist tree within any constant factor

  • Positive results:

    • Closest ultrametric to a dissimilarity matrix under l∞ [Krivanek ‘88]

    • 3-approximation of the closest additive metric to a given metric [ABFPT99]

      • (implicit 6-approximation for general dissimilarity matrices)


This Work: Triplet-Distances – Distances to Triplet Midpoints

[Figure: a tree in which the paths from the taxa i, j, k meet at the midpoint C(i,j,k); τT (i ; jk) is the distance in T from i to C(i,j,k).]

  • τT (i ; jk) = τT (i ; kj)

  • τT (i ; ij) = 0

  • τT (i ; jj) = DT (i, j)


Triplet-Distances Defined by 2-Distances

  • Each distance matrix D defines 3-trees:

  • τ(i ; jk) = ½[D(i,j) + D(i,k) - D(j,k)].

Any metric on 3 taxa is realizable by a 3-tree.

[Figure: a metric on the taxa i, j, k with pairwise distances 8, 9, 7 is realized by a 3-tree whose three edges, of weights 5, 3, 4, meet at C(i,j,k).]
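The formula can be checked against the numbers in the figure. The sketch below assumes the assignment D(i,j) = 8, D(i,k) = 9, D(j,k) = 7, which the transcript does not state explicitly; the function name `triplet_distance` is also an assumption.

```python
def triplet_distance(D, i, j, k):
    """tau(i ; jk) = 1/2 [D(i,j) + D(i,k) - D(j,k)]."""
    return (D[i][j] + D[i][k] - D[j][k]) / 2


# Assumed reading of the figure: D(i,j) = 8, D(i,k) = 9, D(j,k) = 7.
D = {
    "i": {"i": 0, "j": 8, "k": 9},
    "j": {"i": 8, "j": 0, "k": 7},
    "k": {"i": 9, "j": 7, "k": 0},
}

edge_i = triplet_distance(D, "i", "j", "k")  # (8 + 9 - 7) / 2
edge_j = triplet_distance(D, "j", "i", "k")  # (8 + 7 - 9) / 2
edge_k = triplet_distance(D, "k", "i", "j")  # (9 + 7 - 8) / 2
```

With this reading the three values are the edge weights 5, 3, 4 of the 3-tree, and each pairwise distance is recovered as the sum of two edges (for example 5 + 3 = 8). The same function also reproduces the degenerate cases τ(i ; ij) = 0 and τ(i ; jj) = D(i,j).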


Triplet-Distance Based Reconstruction

τ(i ; jk) = ½[D(i,j) + D(i,k) - D(j,k)].

[Figure: a table of triplet distances, indexed by the taxa B, E, G, H, L, M and the taxon pairs BB, BE, BG, …, LL, LM, MM, is used to reconstruct an edge-weighted tree T.]


Why Use Triplet-Distances?

1. They enable more accurate estimations of 2-distances.

2. They are used (de facto) by known reconstruction algorithms.


Improved Estimations of Pairwise Distances: “Information Loss”

[Figure: the matrix D over the taxa B, E, G, H, L, M; the entry D(H,E) is calculated by maximum likelihood from H and E alone (value shown: 13).]

In calculating D(H,E), all other taxa are ignored.


Improved Estimations (cont):

[Figure: 3-trees on H, E and a third taxon X, one for each of X = B, L, G, M, built from the sequences B=(..AAGT..), L=(..AATA..), G=(..CCGT..), M=(..CGCG..), H=(..AACG..), E=(..CAGA..).]

  • Estimate D(H,E) by calculating all the 3-trees on {H,E,X : X≠H,E}

  • (Or: calculate just one 3-tree, for a “trusted” 3rd taxon X)

  • V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11) 1952–1963 (2002).


(Implicit) Use of Triplet-Distances in 2-Distance Reconstruction Algorithms

τ(i ; jk) = ½[D(i,j) + D(i,k) - D(j,k)].

[Figure: the 2-distance matrix D over the taxa B, E, G, H, L, M is converted into a table of triplet distances (pairs BB, BE, BG, …, LL, LM, MM), from which an edge-weighted tree T is reconstructed.]


1st Use: “Triplet Distances from a Single Source”

  • Fix a taxon r, and construct a tree T which minimizes: [formula not preserved in the transcript; the accompanying figure shows the taxa i, r, j]

  • An optimal solution can be computed in O(n²) time, and is used e.g. in:

    • (FKW95): Optimal approximation of distances by ultrametric trees.

    • (ABFPT99): The best known approximation of distances by general trees.

    • (BB99): Fast construction of Buneman trees.
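The input to such single-source methods is the table of triplet distances seen from the fixed taxon r. The sketch below only builds that table from a 2-distance matrix, using the formula τ(r ; ij) = ½[D(r,i) + D(r,j) - D(i,j)]; it does not implement the optimization itself. The function name and the 4-taxon matrix are hypothetical (the matrix is the metric of a small made-up tree).

```python
def triplets_from_source(D, r):
    """Return {(i, j): tau(r ; ij)} for all taxa i, j other than r."""
    others = [x for x in D if x != r]
    return {
        (i, j): (D[r][i] + D[r][j] - D[i][j]) / 2
        for i in others
        for j in others
    }


# Hypothetical tree metric: edges B-p 4, E-p 2, p-q 1, G-q 5, H-q 3.
D = {
    "B": {"B": 0, "E": 6, "G": 10, "H": 8},
    "E": {"B": 6, "E": 0, "G": 8, "H": 6},
    "G": {"B": 10, "E": 8, "G": 0, "H": 8},
    "H": {"B": 8, "E": 6, "G": 8, "H": 0},
}

tau_B = triplets_from_source(D, "B")
```

The table has one entry per ordered pair of the remaining taxa, so building it takes time quadratic in the number of taxa. In this example τ(B ; EG) is the distance from B to p and τ(B ; GH) is the distance from B to q.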


2nd Use: Saitou & Nei Neighbour Joining

The neighbor-selection criterion of NJ selects a taxon pair i,j which maximizes the sum: [formula not preserved in the transcript]

[Figure: the taxa i and j, surrounded by the remaining taxa r.]


Previous Works on Triplet-Dissimilarities/Distances

  • I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1) pp. 1-15 (2007).

  • Works which use the total weights of 3-trees:

    • S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12 pp. 191-205 (1995).

    • L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17 pp. 615-621 (2004).

    • D. Levy, R. Yoshida, L. Pachter, Beyond pairwise distances: Neighbor-joining with phylogenetic diversity estimates, Mol. Biol. Evol. 23(3) 491–498 (2006).


Summary of Results

  • Results for Maximal Difference (l∞):

    • The decision problem is NP-hard:

      • Is there a tree T s.t. ||τ,τT||∞ ≤ Δ?

    • Hardness of approximation of the optimization problem:

      • Finding a tree T s.t. ||τ,τT||∞ ≤ 1.4||τ,τOPT||∞

    • A 15-approximation algorithm:

      • Using the 6-approximation algorithm for 2-dissimilarities from [ABFPT99]

  • Result for Maximal Distortion:

    • Hardness of approximation within any constant factor


NP-Hardness of the Decision Problem

We use a reduction from 3SAT (the problem of determining whether a 3CNF formula is satisfiable).

[Figure: an example 3CNF formula with its literals and a clause marked, together with a satisfying assignment.]

We show: if one can determine for (τ,Δ) whether there exists a tree T s.t. ||τ,τT||∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.


The Reduction

Given a 3CNF formula φ, we define triplet distances τ and an error bound Δ which force the output tree to imply a satisfying assignment to φ.

  • The set of taxa:

    • Taxa T, F.

    • A taxon for every literal (xi, ¬xi).

    • 3 taxa for every clause Cj (yj1, yj2, yj3).
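The taxa set can be enumerated mechanically. The sketch below assumes that "a taxon for every literal" means one taxon for each variable and one for its negation; the function name and the string labels are hypothetical.

```python
def reduction_taxa(num_vars, num_clauses):
    """List the taxa for a 3CNF formula with the given numbers of variables and clauses."""
    taxa = ["T", "F"]
    for i in range(1, num_vars + 1):
        # Assumed reading: one taxon for x_i and one for its negation.
        taxa += [f"x{i}", f"not_x{i}"]
    for j in range(1, num_clauses + 1):
        # Three taxa y_j1, y_j2, y_j3 for clause C_j.
        taxa += [f"y{j}_{k}" for k in (1, 2, 3)]
    return taxa


taxa = reduction_taxa(num_vars=3, num_clauses=2)
```

Under this reading a formula with n variables and m clauses yields 2 + 2n + 3m taxa, so the size of the instance is linear in the size of the formula.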


Properties Enforced by the Input (τ,Δ)

  • One of the following can be enforced on each taxa triplet (u,v,w):

    • taxon u is close to Path(v,w), or

    • taxon u is far from Path(v,w)

[Figure: a taxon u and the path between the taxa v and w.]


Enforcing Truth Assignment

  • A truth assignment to φ is implied by the following:

    • T is far from F

    • For each i, xi is far from ¬xi, and both xi and ¬xi are close to Path(T,F)

Thus we set xi = T iff xi is close to T.


Enforcing Clauses-Satisfaction

A clause C = (l1 ∨ l2 ∨ l3) is satisfied iff at least one literal li is true, i.e. is close to T.

(l1 ∨ l2 ∨ l3) is satisfied iff it is not like this:

[Figure: the literals l1, l2, l3 all located close to F.]

We need to guarantee, by the close/far relations, that all clauses avoid the above.


Clauses-Satisfaction (cont)

(l1 ∨ l2 ∨ l3) is satisfied iff, out of the three paths Path(l1,l2), Path(l1,l3), Path(l2,l3), at least two paths are close to T.

But we don’t know which two paths.

[Figure: the literals l1, l2, l3 located along the path between T and F.]


Clauses-Satisfaction (cont)

We attach a taxon to each such path:

y1 is close to Path(l2,l3)

y2 is close to Path(l1,l3)

y3 is close to Path(l1,l2)

(l1 ∨ l2 ∨ l3) is satisfied iff at least two yi’s can be located close to T. …

[Figure: the taxa y1, y2, y3 attached to the paths between the literals l1, l2, l3, which lie along the path between T and F.]


Clauses-Satisfaction (end)

… and at least two of the yi’s can be located close to T iff Path(y2,y3), Path(y1,y3), Path(y1,y2) are all close to T.

So, (l1 ∨ l2 ∨ l3) is satisfied iff all the above paths are close to T.

[Figure: the taxa y1, y2, y3 and the literals l1, l2, l3, located along the path between T and F.]
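The chain of equivalences in the clause gadget can be checked by brute force over the eight truth assignments of a clause. The sketch rests on a modelling assumption that is not stated in the slides: a path is treated as close to T exactly when at least one of its endpoints is close to T, and each yi is placed close to T whenever its path allows it. The function names are hypothetical.

```python
from itertools import combinations, product


def path_near_T(a_near_T, b_near_T):
    # Modelling assumption: a path is close to T iff one of its endpoints is.
    return a_near_T or b_near_T


def check_clause_gadget():
    """Return one row (satisfied, >=2 y's near T, all y-paths near T) per assignment."""
    rows = []
    for lits in product([False, True], repeat=3):
        satisfied = any(lits)
        # y1 ~ Path(l2,l3), y2 ~ Path(l1,l3), y3 ~ Path(l1,l2)
        y_near_T = [path_near_T(lits[a], lits[b]) for a, b in ((1, 2), (0, 2), (0, 1))]
        at_least_two_y = sum(y_near_T) >= 2
        all_y_paths = all(
            path_near_T(y_near_T[a], y_near_T[b]) for a, b in combinations(range(3), 2)
        )
        rows.append((satisfied, at_least_two_y, all_y_paths))
    return rows


rows = check_clause_gadget()
```

Under this model the three columns agree on every assignment: the only failing row is the one where all three literals are false.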


Construction Example

[Figure: the constructed tree, with the taxa T and F attached at the vertices vT and vF, the clause taxa y11, y12, y13, y21, y22, y23, and edges of weight α.]

φ is satisfiable ⇔ there is a tree T which satisfies all bounds:

A1: τT (T , F) ≥ 2α+2β

A2: for i=1..n: τT (T ; xi ¬xi) ≤ α ; τT (F ; xi ¬xi) ≤ α

B1: for j=1..m: τT (yj1 ; lj2 lj3) ≤ α ; τT (yj2 ; lj1 lj3) ≤ α ; τT (yj3 ; lj1 lj2) ≤ α

B2: for j=1..m: τT (yj1 ; T F) ≥ α ; τT (yj2 ; T F) ≥ α ; τT (yj3 ; T F) ≥ α

B3: for j=1..m: τT (T ; yj2 yj3) ≤ α ; τT (T ; yj1 yj3) ≤ α ; τT (T ; yj1 yj2) ≤ α


Hardness of Approximation Results

By “stretching” the close/far restrictions, the following problems are also shown to be NP-hard:

  • Approximating Maximal Difference:

    • Finding a tree T s.t. ||τ,τT||∞ ≤ 1.4||τ,τOPT||∞

  • Approximating Maximal Distortion:

    • Finding a tree T s.t. MaxDist(τ,τT) ≤ C·MaxDist(τ,τOPT), for any constant C

Details in:

I. Gronau and S. Moran, On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007, pp. 44-55.


Open Problems/Further Research

  • Extending the hardness results to 3-diss tables induced by 2-diss matrices

    • (τ(i ; jk) = ½[D(i,j) + D(i,k) - D(j,k)])

  • Extending the hardness results to “natural-looking” trees

    • (binary trees with constant-bounded edge weights)

  • Check the performance of NJ when the neighbor-selection formula is computed from “real” 3-distances.

  • Devise algorithms which use 3-distances as input.

  • Does optimization of 3-diss lead to good topological accuracy (under accepted models of sequence evolution)?

    • (it is known that optimization of 2-diss doesn’t lead to good topological accuracy)



Thank You


Distance-Based Phylogenetic Reconstruction

  • Compute distances between all taxon pairs

  • Find an edge-weighted tree that best describes the distances


The Reduction – τ(φ)

[Figure: the constructed tree, with the taxa T and F attached at the vertices vT and vF, the clause taxa y11, y12, y13, y21, y22, y23, and edges of weight α.]

Bounds to be satisfied by the tree T:

A1: τT (T , F) ≥ 2α+2β

A2: for i=1..n: τT (T ; xi ¬xi) ≤ α ; τT (F ; xi ¬xi) ≤ α

B1: for j=1..m: τT (yj1 ; lj2 lj3) ≤ α ; τT (yj2 ; lj1 lj3) ≤ α ; τT (yj3 ; lj1 lj2) ≤ α

B2: for j=1..m: τT (yj1 ; T F) ≥ α ; τT (yj2 ; T F) ≥ α ; τT (yj3 ; T F) ≥ α

B3: for j=1..m: τT (T ; yj2 yj3) ≤ α ; τT (T ; yj1 yj3) ≤ α ; τT (T ; yj1 yj2) ≤ α

  • In our constructed tree:

    • All 2-distances are in [2α , 2α+2β].

    • All 3-distances are in [α , α+2β].

    • Δ = β.

The input τ(φ):

A1: τ(T , F) = 2α+3β

A2: for i=1..n: τ(T ; xi ¬xi) = α-β ; τ(F ; xi ¬xi) = α-β

B1: for j=1..m: τ(yj1 ; lj2 lj3) = α-β ; τ(yj2 ; lj1 lj3) = α-β ; τ(yj3 ; lj1 lj2) = α-β

B2: for j=1..m: τ(yj1 ; T F) = α+β ; τ(yj2 ; T F) = α+β ; τ(yj3 ; T F) = α+β

B3: for j=1..m: τ(T ; yj2 yj3) = α-β ; τ(T ; yj1 yj3) = α-β ; τ(T ; yj1 yj2) = α-β

Other 2-distances: τ(s , t) = 2α+2β

Other 3-distances: τ(s ; t u) = α+2β
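The relation between the input values and the bounds can be made concrete: requiring ||τ,τT||∞ ≤ Δ with Δ = β forces every tree value to lie within β of its input value, and each of the bounds A1 to B3 is one side of that interval. The sketch below lists the specially set entries for a formula and computes the bound each one implies. The function names, the string labels, the example formula, and the pairing of each variable taxon with its negation in A2 are all assumptions made for illustration.

```python
def specified_entries(clauses, num_vars, alpha, beta):
    """Map each specially set entry of tau(phi) to (input value, kind of bound)."""
    entries = {("T", "F"): (2 * alpha + 3 * beta, "lower")}           # A1 (a 2-distance)
    for i in range(1, num_vars + 1):                                   # A2 (assumed pair)
        pair = (f"x{i}", f"-x{i}")
        entries[("T", pair)] = (alpha - beta, "upper")
        entries[("F", pair)] = (alpha - beta, "upper")
    for j, (l1, l2, l3) in enumerate(clauses, start=1):
        y1, y2, y3 = (f"y{j}{k}" for k in (1, 2, 3))
        for y, pair in ((y1, (l2, l3)), (y2, (l1, l3)), (y3, (l1, l2))):
            entries[(y, pair)] = (alpha - beta, "upper")               # B1
            entries[(y, ("T", "F"))] = (alpha + beta, "lower")         # B2
        for pair in ((y2, y3), (y1, y3), (y1, y2)):
            entries[("T", pair)] = (alpha - beta, "upper")             # B3
    return entries


def implied_bound(value, kind, delta):
    """Bound on the tree value implied by |tree value - input value| <= delta."""
    return value + delta if kind == "upper" else value - delta


# Hypothetical formula with 3 variables and 2 clauses; literals are plain strings.
alpha, beta = 10, 1
clauses = [("x1", "-x2", "x3"), ("-x1", "x2", "x3")]
entries = specified_entries(clauses, num_vars=3, alpha=alpha, beta=beta)
bounds = {key: implied_bound(value, kind, beta) for key, (value, kind) in entries.items()}
```

For A1 the input 2α+3β minus Δ = β gives the lower bound 2α+2β, and for every other listed entry the input α-β or α+β shifted by β gives exactly α, matching the bounds above.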

