
On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities


Ilan Gronau, Shlomo Moran

Technion – Israel Institute of Technology

Haifa, Israel

[Figure: a distance matrix D over the taxa B, E, G, H, L, M is calculated, and an edge-weighted tree T is reconstructed from it; T induces the tree metric DT over the same taxa]

We want the tree metric DT to approximate all pairwise distances in D simultaneously: DT should be “close” to D.

Two “closeness” measures studied here:

- Maximal Difference (l∞)
- Maximal Distortion

[Figure: the input matrix D and the tree metric DT, both over the taxa B, E, G, H, L, M]

Goal: find an optimal tree T, one which minimizes the maximal difference/distortion between D and DT.
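The two closeness measures can be sketched in a few lines of Python. The maximal difference is the l∞ distance between the two matrices. The slides do not define the maximal distortion, so the sketch assumes the common definition (largest expansion times largest contraction); the function names and the 3-taxon matrices are hypothetical.

```python
from itertools import combinations


def max_difference(d, d_t):
    """l-infinity distance: largest absolute gap over all taxon pairs."""
    n = len(d)
    return max(abs(d[i][j] - d_t[i][j]) for i, j in combinations(range(n), 2))


def max_distortion(d, d_t):
    """ASSUMED definition: largest expansion times largest contraction."""
    pairs = list(combinations(range(len(d)), 2))
    expansion = max(d_t[i][j] / d[i][j] for i, j in pairs)
    contraction = max(d[i][j] / d_t[i][j] for i, j in pairs)
    return expansion * contraction


# Hypothetical input matrix D and tree metric DT on three taxa.
d = [[0, 4, 6], [4, 0, 8], [6, 8, 0]]
d_t = [[0, 5, 6], [5, 0, 6], [6, 6, 0]]

print(max_difference(d, d_t))  # 2
print(max_distortion(d, d_t))  # 1.25 * (8/6), about 1.667
```

Under this assumed definition the distortion is a ratio, so it is unchanged when DT is rescaled by a constant, whereas the maximal difference is not.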

- Negative results (NP-hardness):
- Closest tree-metric (even ultrametric) to a dissimilarity matrix under l1 and l2 [Day ‘87]
- Closest tree-metric to a dissimilarity matrix under l∞ [ABFPT99]
- Hard to approximate better than 1.125
- Implicit: hard to approximate the closest MaxDist tree within any constant factor

- Positive results:
- Closest ultrametric to a dissimilarity matrix under l∞ [Krivanek ‘88]
- 3-approximation of the closest additive metric to a given metric [ABFPT99]
- (implicit 6-approximation for general dissimilarity matrices)

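- τT (i ; jk): the distance in T from taxon i to C(i,j,k), the center of the 3-tree spanned by i, j, k
- τT (i ; jk) = τT (i ; kj)
- τT (i ; ij) = 0
- τT (i ; jj) = DT (i, j)

[Figure: a 3-tree on the taxa i, j, k with center C(i,j,k)]

Any metric on 3 taxa is realizable by a 3-tree.

[Figure: a metric on three taxa i, j, k with pairwise distances 8, 9, 7, realized by a 3-tree with center C(i,j,k) and edge weights 5, 3, 4]

- Each distance matrix D defines 3-trees:
- τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)].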
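The formula τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)] can be checked on the numbers in the figure. The Python sketch below assumes D(i,j) = 8, D(i,k) = 9 and D(j,k) = 7; the figure does not make clear which pair carries which distance, and the function name is hypothetical.

```python
def triplet_distance(d, i, j, k):
    """tau(i ; jk) = 1/2 [D(i,j) + D(i,k) - D(j,k)]."""
    return (d[i][j] + d[i][k] - d[j][k]) / 2


# ASSUMED assignment of the figure's distances 8, 9, 7 to the three pairs.
d = {
    "i": {"i": 0, "j": 8, "k": 9},
    "j": {"i": 8, "j": 0, "k": 7},
    "k": {"i": 9, "j": 7, "k": 0},
}

# Edge weights of the 3-tree that realizes the metric.
edges = {
    "i": triplet_distance(d, "i", "j", "k"),
    "j": triplet_distance(d, "j", "i", "k"),
    "k": triplet_distance(d, "k", "i", "j"),
}
print(edges)  # {'i': 5.0, 'j': 3.0, 'k': 4.0}

# The degenerate cases listed above also follow from the formula.
print(triplet_distance(d, "i", "i", "j"))  # 0.0, i.e. tau(i ; ij) = 0
print(triplet_distance(d, "i", "j", "j"))  # 8.0, i.e. tau(i ; jj) = D(i,j)
```

Each pairwise distance is recovered as the sum of two edge weights: 5 + 3 = 8, 5 + 4 = 9, 3 + 4 = 7.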

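[Figure: a table of 3-distances indexed by the taxa (B, E, G, H, L, M) and the taxon pairs (BB, BE, BG, …, LL, LM, MM); a tree T is reconstructed from the table. A 2-distance matrix D defines such a table via τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)].]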

1. 3-distances enable more accurate estimates of 2-distances.

2. 3-distances are used (de facto) by known reconstruction algorithms.

[Figure: calculating D(H,E) by Maximum Likelihood from the sequences H = (..AACG..) and E = (..CAGA..) alone. In calculating D(H,E), all other taxa (B = (..AAGT..), L = (..AATA..), G = (..CCGT..), M = (..CGCG..)) are ignored, which causes “information loss”.]

- Estimate D(H,E) by calculating all the 3-trees on {H, E, X}, for every taxon X ≠ H, E
- (Or: calculate just one 3-tree, for a “trusted” 3rd taxon X):
- V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11), pp. 1952–1963 (2002).
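In a 3-tree on {H, E, X}, the path from H to E consists of the pendant edges of H and E, so each 3-tree yields one estimate τ(H ; EX) + τ(E ; HX) of D(H,E). The Python sketch below assumes the per-triplet estimates are combined by a plain average, which the slides do not specify. The data layout, the function name and the edge weights are all hypothetical.

```python
from statistics import mean


def estimate_pair_distance(three_trees, a, b):
    """Average the a-b path length over all 3-trees containing both a and b.

    three_trees maps a frozenset of three taxa to a dict
    {taxon: weight of the edge from that taxon to the 3-tree's center}.
    Averaging is an ASSUMPTION made for illustration.
    """
    estimates = [
        pendant[a] + pendant[b]
        for taxa, pendant in three_trees.items()
        if a in taxa and b in taxa
    ]
    return mean(estimates)


# Hypothetical 3-trees, e.g. each fitted separately from three sequences.
three_trees = {
    frozenset("HEB"): {"H": 2.0, "E": 3.0, "B": 1.0},
    frozenset("HEL"): {"H": 2.5, "E": 3.5, "L": 4.0},
    frozenset("HBL"): {"H": 2.0, "B": 1.5, "L": 4.5},  # no E: ignored for D(H,E)
}

print(estimate_pair_distance(three_trees, "H", "E"))  # 5.5
```

Using a single “trusted” third taxon X corresponds to passing a dictionary with just the one 3-tree on {H, E, X}.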

[Figure: the 2-distance matrix D over the taxa B, E, G, H, L, M, the tree T, and the induced 3-distances τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)]; a fixed taxon r and two taxa i, j]

- Fix a taxon r, and construct a tree T which minimizes:
- An optimal solution can be computed in O(n²) time, and is used, e.g., in:
- (FKW95): Optimal approximation of distances by ultrametric trees.
- (ABFPT99): The best known approximation of distances by general trees.
- (BB99): Fast construction of Buneman trees.

The neighbor-selection criterion of NJ selects a taxon pair i, j which maximizes the sum:

[Figure: a tree with the two taxa i, j and the remaining taxa r]

- I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1), pp. 1-15 (2007).
- Works which use the total weights of 3-trees:
- S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12, pp. 191-205 (1995).
- L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17, pp. 615-621 (2004).
- D. Levy, R. Yoshida, L. Pachter, Beyond pairwise distances: Neighbor-joining with phylogenetic diversity estimates, Mol. Biol. Evol. 23(3), pp. 491–498 (2006).

- Results for Maximal Difference (l∞):
- The decision problem is NP-hard:
- Is there a tree T s.t. ||τ,τT||∞ ≤ Δ?
- Hardness of approximation of the optimization problem:
- Finding a tree T s.t. ||τ,τT||∞ ≤ 1.4 ||τ,τOPT||∞ is NP-hard
- A 15-approximation algorithm
- Using the 6-approximation algorithm for 2-dissimilarities from [ABFPT99]
- Result for Maximal Distortion:
- Hardness of approximation within any constant factor

[Figure: an example 3CNF formula with its literals and a clause marked, and a satisfying assignment]

We use a reduction from 3SAT

(the problem of determining whether a 3CNF formula is satisfiable)

We show:

If one can determine for (τ, Δ) whether there exists a tree T s.t. ||τ,τT||∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.

Given a 3CNF formula φ, we define triplet distances and an error bound Δ which force the output tree to imply a satisfying assignment to φ.

- The set of taxa:
- Taxa T, F.
- A taxon for every literal (xi, ¬xi).
- 3 taxa for every clause Cj (yj1, yj2, yj3).

- One of the following can be enforced on each taxa triplet (u, v, w):
- taxon u is close to Path(v,w), or
- taxon u is far from Path(v,w)

[Figure: taxon u and Path(v,w)]

- A truth assignment to φ is implied by the following:
- T is far from F
- For each i, xi is far from ¬xi, and both xi and ¬xi are close to Path(T, F)

Thus we set xi = T iff xi is close to T.

A clause C = (l1 ∨ l2 ∨ l3) is satisfied iff at least one literal li is true, i.e. is close to T.

(l1 ∨ l2 ∨ l3) is satisfied iff it is not like this:

[Figure: the literals l1, l2, l3 all located close to F]

We need to guarantee that all clauses avoid the above by the close/far relations.

- (l1 ∨ l2 ∨ l3) is satisfied iff, out of the three paths Path(l1, l2), Path(l1, l3), Path(l2, l3), at least two paths are close to T.

But we don’t know which two paths.
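The "at least two paths" claim can be checked by enumerating all eight truth assignments of a clause. The Python sketch below relies on a modeling assumption that abstracts away the actual distances: a true literal sits close to T, a false one close to F, and Path(la, lb) is close to T exactly when at least one of its endpoints is true. The function name is hypothetical.

```python
from itertools import combinations, product


def paths_close_to_T(values):
    """Count the literal pairs whose connecting path passes close to T.

    ASSUMPTION: Path(la, lb) is close to T iff la or lb is true.
    """
    return sum(1 for a, b in combinations(values, 2) if a or b)


for values in product([False, True], repeat=3):
    satisfied = any(values)
    # The clause is satisfied exactly when at least two paths are close to T.
    assert satisfied == (paths_close_to_T(values) >= 2)
    print(values, paths_close_to_T(values))
```

With no true literal the count is 0, with exactly one it is 2, and with two or three it is 3, so the count is never exactly 1.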

We attach a taxon to each such path:

- y1 is close to Path(l2, l3)
- y2 is close to Path(l1, l3)
- y3 is close to Path(l1, l2)

[Figure: the taxa T, F, the literals l1, l2, l3, and the attached taxa y1, y2, y3]

(l1 ∨ l2 ∨ l3) is satisfied iff at least two yi’s can be located close to T …

… and at least two of the yi’s can be located close to T iff Path(y2, y3), Path(y1, y3), Path(y1, y2) are all close to T.

So, (l1 ∨ l2 ∨ l3) is satisfied iff all the above paths are close to T.

[Figure: a tree with the taxa T and F attached to internal vertices vT and vF, the clause taxa y11, y12, y13, y21, y22, y23, and edge weights α and 2β]

φ is satisfiable ⇔ there is a tree T which satisfies all of the following bounds:

A1: τT (T, F) ≥ 2α + 2β

A2: for i = 1..n: τT (T ; xi ¬xi) ≤ α ; τT (F ; xi ¬xi) ≤ α

B1: for j = 1..m: τT (yj1 ; lj2 lj3) ≤ α ; τT (yj2 ; lj1 lj3) ≤ α ; τT (yj3 ; lj1 lj2) ≤ α

B2: for j = 1..m: τT (yj1 ; T F) ≥ α ; τT (yj2 ; T F) ≥ α ; τT (yj3 ; T F) ≥ α

B3: for j = 1..m: τT (T ; yj2 yj3) ≤ α ; τT (T ; yj1 yj3) ≤ α ; τT (T ; yj1 yj2) ≤ α

By “stretching” the close/far restrictions, the following problems are also shown to be NP-hard:

- Approximating Maximal Difference:
- Finding a tree T s.t. ||τ,τT||∞ ≤ 1.4 ||τ,τOPT||∞
- Approximating Maximal Distortion:
- Finding a tree T s.t. MaxDist(τ,τT) ≤ C·MaxDist(τ,τOPT), for any constant C

Details in:

I. Gronau and S. Moran, On The Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007, pp. 44-55.

- Extending hardness results to 3-diss tables induced by 2-diss matrices
- (τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)])
- Extending hardness results to “naturally looking” trees
- (binary trees with constant-bounded edge weights)
- Check the performance of NJ when the neighbor-selection formula is computed from “real” 3-distances.
- Devise algorithms which use 3-distances as input.
- Does optimization of 3-diss lead to good topological accuracy (under accepted models of sequence evolution)?
- (it is known that optimization of 2-diss doesn’t lead to good topological accuracy)

Thank You

Distance-Based Phylogenetic Reconstruction

- Compute distances between all taxon pairs
- Find an edge-weighted tree that best describes the distances


- In our constructed tree:
- All 2-distances are in [2α, 2α+2β].
- All 3-distances are in [α, α+2β].
- Δ = β.

The input 3-distances τ are set as follows:

A1: τ(T, F) = 2α + 3β

A2: for i = 1..n: τ(T ; xi ¬xi) = α − β ; τ(F ; xi ¬xi) = α − β

B1: for j = 1..m: τ(yj1 ; lj2 lj3) = α − β ; τ(yj2 ; lj1 lj3) = α − β ; τ(yj3 ; lj1 lj2) = α − β

B2: for j = 1..m: τ(yj1 ; T F) = α + β ; τ(yj2 ; T F) = α + β ; τ(yj3 ; T F) = α + β

B3: for j = 1..m: τ(T ; yj2 yj3) = α − β ; τ(T ; yj1 yj3) = α − β ; τ(T ; yj1 yj2) = α − β

Other 2-distances: τ(s, t) = 2α + 2β

Other 3-distances: τ(s ; t u) = α + 2β
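The target table for a given formula could be assembled as in the following Python sketch. It rests on several assumptions that are not in the slides: literals are the strings "x1", "~x1", …; every clause has three distinct literals; the literal pair in A2 is (xi, ¬xi); and a 2-distance τ(s, t) is stored as τ(s ; t t). All names are hypothetical.

```python
def reduction_targets(clauses, n_vars, alpha, beta):
    """Explicitly specified entries (A1-B3) of the 3-distance table."""
    table = {}

    def put(u, v, w, value):
        # tau(u ; v w) is symmetric in v and w, hence the frozenset.
        table[(u, frozenset((v, w)))] = value

    # A1: the 2-distance between T and F, stored in both directions.
    put("T", "F", "F", 2 * alpha + 3 * beta)
    put("F", "T", "T", 2 * alpha + 3 * beta)
    # A2: ASSUMED literal pair (xi, ~xi).
    for i in range(1, n_vars + 1):
        put("T", f"x{i}", f"~x{i}", alpha - beta)
        put("F", f"x{i}", f"~x{i}", alpha - beta)
    for j, (l1, l2, l3) in enumerate(clauses, start=1):
        y1, y2, y3 = (f"y{j}_{k}" for k in (1, 2, 3))
        # B1: each clause taxon is close to the path of the other two literals.
        put(y1, l2, l3, alpha - beta)
        put(y2, l1, l3, alpha - beta)
        put(y3, l1, l2, alpha - beta)
        # B2: clause taxa are far from Path(T, F).
        for y in (y1, y2, y3):
            put(y, "T", "F", alpha + beta)
        # B3: T is close to every path between two clause taxa.
        put("T", y2, y3, alpha - beta)
        put("T", y1, y3, alpha - beta)
        put("T", y1, y2, alpha - beta)
    return table


def target(table, u, v, w, alpha, beta):
    """Look up tau(u ; v w), falling back to the 'other' default values."""
    if u in (v, w):
        return 0  # tau(i ; ij) = 0
    default = 2 * alpha + 2 * beta if v == w else alpha + 2 * beta
    return table.get((u, frozenset((v, w))), default)


# Hypothetical formula with the single clause (x1 or ~x2 or x3).
alpha, beta = 10, 1
table = reduction_targets([("x1", "~x2", "x3")], n_vars=3, alpha=alpha, beta=beta)
print(len(table))                                      # 17
print(target(table, "T", "F", "F", alpha, beta))       # 23
print(target(table, "y1_1", "F", "T", alpha, beta))    # 11
print(target(table, "x1", "x2", "x3", alpha, beta))    # 12
```

With Δ = β, a target of α − β allows τT ≤ α at most, a target of α + β forces τT ≥ α, and the target 2α + 3β forces τT (T, F) ≥ 2α + 2β. These are exactly the bounds A1 through B3.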