On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities


Ilan Gronau, Shlomo Moran
Technion – Israel Institute of Technology, Haifa, Israel


[Figure: a dissimilarity matrix D over the taxa B, E, G, H, L, M is used to reconstruct an edge-weighted tree T, from which the tree metric DT is calculated.]
Pairwise-Distance Based Reconstruction

[Figure: the tree T over the taxa B, E, G, H, L, M and its tree metric DT.]

Optimization Criteria

We want the tree metric DT to approximate all pairwise distances in D simultaneously: DT should be “close” to D.

Two “closeness” measures are studied here:

• Maximal Difference (l∞)
• Maximal Distortion

Maximal Difference (l∞) vs. Maximal Distortion

[Figure: example matrices D and DT over the taxa B, E, G, H, L, M.]

Goal: find an optimal T, one which minimizes the maximal difference/distortion between D and DT.
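The two measures can be illustrated with a minimal Python sketch. The transcript does not spell out either formula, so both definitions here are assumptions: maximal difference is taken as the largest absolute entry-wise difference, and maximal distortion as the product of the largest ratio DT/D and the largest ratio D/DT. The toy matrices are hypothetical and not taken from the slides.

```python
from itertools import combinations

def max_difference(D, DT):
    """Largest absolute difference |D(i,j) - DT(i,j)| over all taxon pairs."""
    pairs = combinations(range(len(D)), 2)
    return max(abs(D[i][j] - DT[i][j]) for i, j in pairs)

def max_distortion(D, DT):
    """Assumed definition: (largest DT/D ratio) * (largest D/DT ratio)."""
    pairs = list(combinations(range(len(D)), 2))
    expansion = max(DT[i][j] / D[i][j] for i, j in pairs)
    contraction = max(D[i][j] / DT[i][j] for i, j in pairs)
    return expansion * contraction

# Hypothetical toy matrices over three taxa (not taken from the slides).
D = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]
DT = [[0, 4, 4], [4, 0, 4], [4, 4, 0]]
DT_scaled = [[2 * x for x in row] for row in DT]

print(max_difference(D, DT))         # 2
print(max_distortion(D, DT))         # 2.0
print(max_distortion(D, DT_scaled))  # 2.0
```

Under this assumed definition, distortion is unchanged when DT is rescaled, whereas the l∞ difference is not.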

• Negative results (NP-hardness):
  • Closest tree metric (even ultrametric) to a dissimilarity matrix under l1, l2 [Day ‘87]
  • Closest tree metric to a dissimilarity matrix under l∞ [ABFPT99]
    • Hard to approximate better than 1.125
    • Implicit: hard to approximate the closest MaxDist tree within any constant factor
• Positive results:
  • Closest ultrametric to a dissimilarity matrix under l∞ [Krivanek ‘88]
  • 3-approximation of the closest additive metric to a given metric [ABFPT99]
    • (implicit 6-approximation for general dissimilarity matrices)

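[Figure: taxa i, j, k in T, their center C(i,j,k), and the triplet distance τT (i ; jk).]

• τT (i ; jk) = τT (i ; kj)
• τT (i ; ij) = 0
• τT (i ; jj) = DT (i, j)

Triplet-Distances Defined by 2-Distances
• Each distance matrix D defines 3-trees:
• τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)]
• Any metric on 3 taxa is realizable by a 3-tree.

[Figure: taxa i, j, k at pairwise distances 8, 9, 7, realized by a 3-tree with edge weights 5, 3, 4 meeting at C(i,j,k).]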
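The formula τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)] can be checked on a small example. The sketch below assumes a metric on three taxa with D(i,j) = 8, D(i,k) = 9, D(j,k) = 7; which pair carries which value is an illustrative choice. The function name `triplet_distance` is hypothetical.

```python
def triplet_distance(D, i, j, k):
    """tau(i ; jk) = 1/2 [D(i,j) + D(i,k) - D(j,k)]."""
    return 0.5 * (D[i][j] + D[i][k] - D[j][k])

# Illustrative metric on three taxa (assignment of values to pairs is assumed).
D = {
    "i": {"i": 0, "j": 8, "k": 9},
    "j": {"i": 8, "j": 0, "k": 7},
    "k": {"i": 9, "j": 7, "k": 0},
}

# The three triplet distances are the edge weights of the realizing 3-tree.
edges = {
    "i": triplet_distance(D, "i", "j", "k"),  # 5.0
    "j": triplet_distance(D, "j", "i", "k"),  # 3.0
    "k": triplet_distance(D, "k", "i", "j"),  # 4.0
}

# Each pairwise distance is recovered as the sum of two edge weights.
for a, b in [("i", "j"), ("i", "k"), ("j", "k")]:
    print(a, b, edges[a] + edges[b], D[a][b])
```

The same function reproduces the degenerate cases listed for τT: `triplet_distance(D, "i", "i", "j")` is 0 and `triplet_distance(D, "i", "j", "j")` equals D(i,j).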

Triplet-Distance Based Reconstruction

τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)]

[Figure: a triplet-distance table, with columns indexed by taxon pairs (BB, BE, BG, …, LL, LM, MM) and rows by the taxa B, E, G, H, L, M, is used to reconstruct the tree T.]

Why use Triplet-Distances?

1. They enable more accurate estimations of 2-distances.

2. They are used (de facto) by known reconstruction algorithms

Improved Estimations of Pairwise Distances: “Information Loss”

[Figure: D(H,E) is calculated (Maximum Likelihood) from the sequences H=(..AACG..) and E=(..CAGA..) only; the sequences B=(..AAGT..), L=(..AATA..), G=(..CCGT..) and M=(..CGCG..) are masked as (..****..).]

In calculating D(H,E), all other taxa are ignored.

Improved Estimations (cont):
• Estimate D(H,E) by calculating all the 3-trees on {H,E,X}, for X ≠ H,E
• (Or: calculate just one 3-tree, for a “trusted” 3rd taxon X:
  • V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11) pp. 1952–1963 (2002))
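A minimal sketch of the first option follows. In a 3-tree on {H,E,X} with pendant edge weights e_H, e_E, e_X, the H–E distance is e_H + e_E, so every third taxon X yields one estimate of D(H,E). The slides do not say how these estimates are combined; taking their mean is an assumption made here, and the edge weights are hypothetical (in practice they would come from a 3-tree fitted to the sequences).

```python
from statistics import mean

def estimate_pair_distance(three_trees):
    """Combine per-X estimates of D(H,E).

    three_trees maps a third taxon X to the pendant edge weights
    (e_H, e_E, e_X) of the 3-tree on {H, E, X}.
    Averaging is an assumed aggregation rule, not one given in the slides.
    """
    return mean(e_h + e_e for e_h, e_e, _ in three_trees.values())

# Hypothetical 3-tree edge weights for three choices of X.
three_trees = {
    "B": (2.0, 3.0, 4.0),  # estimate 5.0
    "L": (1.5, 3.0, 2.0),  # estimate 4.5
    "G": (2.5, 3.0, 1.0),  # estimate 5.5
}

print(estimate_pair_distance(three_trees))  # 5.0
```

The second option on the slide corresponds to passing a single entry for the trusted taxon X.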

(Implicit) use of Triplet-Distances in 2-Distance Reconstruction Algorithms

τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)]

[Figure: the matrix D and the triplet table (BB, BE, BG, …, LL, LM, MM) over the taxa B, E, G, H, L, M, the tree T, and a triplet i, j, r.]

1st use: “Triplet Distances from a Single Source”:
• Fix a taxon r, and construct a tree T which minimizes:
• An optimal solution is computable in O(n²) time, and is used e.g. in:
  • (FKW95): Optimal approximation of distances by ultrametric trees.
  • (ABFPT99): The best known approximation of distances by general trees.
  • (BB99): Fast construction of Buneman trees.

2nd use: Saitou & Nei Neighbor Joining

The neighbor-selection criterion of NJ selects a taxon pair i, j which maximizes the sum:

[Figure: the pair i, j and the remaining taxa r.]

Previous Works on Triplet-Dissimilarities/Distances
• I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1) pp. 1-15 (2007).
• Works which use the total weights of 3-trees:
  • S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12 pp. 191-205 (1995).
  • L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17 pp. 615-621 (2004).
  • D. Levy, R. Yoshida, L. Pachter, Beyond pairwise distances: Neighbor-joining with phylogenetic diversity estimates, Mol. Biol. Evol. 23(3) pp. 491–498 (2006).
Summary of Results
• Results for Maximal Difference (l∞):
  • The decision problem is NP-hard:
    • Is there a tree T s.t. ||τ,τT||∞ ≤ Δ?
  • Hardness of approximation of the optimization problem:
    • Finding a tree T s.t. ||τ,τT||∞ ≤ 1.4||τ,τOPT||∞
  • A 15-approximation algorithm
    • Using the 6-approximation algorithm for 2-dissimilarities from [ABFPT99]
• Result for Maximal Distortion:
  • Hardness of approximation within any constant factor

NP Hardness of the Decision Problem

We use a reduction from 3SAT (the problem of determining whether a 3CNF formula is satisfiable).

[Figure: an example 3CNF formula with its literals and a clause marked, together with a satisfying assignment.]

We show: if one can determine for (τ,Δ) whether there exists a tree T s.t. ||τ,τT||∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.
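The hard part of the decision problem is finding T; checking a given candidate is easy. The sketch below assumes the candidate tree is supplied through its tree metric DT, from which τT (i ; jk) = ½[DT(i,j) + DT(i,k) − DT(j,k)], and that the input τ is a dictionary keyed by (i, j, k). The names `linf_error` and `fits`, and all numbers, are hypothetical.

```python
def tree_triplet(DT, i, j, k):
    """Triplet distance induced by a tree metric DT."""
    return 0.5 * (DT[i][j] + DT[i][k] - DT[j][k])

def linf_error(tau, DT):
    """||tau, tau_T||_inf over the triplets listed in tau."""
    return max(abs(value - tree_triplet(DT, i, j, k))
               for (i, j, k), value in tau.items())

def fits(tau, DT, delta):
    """True if the candidate tree metric is within delta of the input."""
    return linf_error(tau, DT) <= delta

# Hypothetical candidate: the 3-tree with pendant edge weights 5, 3, 4.
DT = {
    "i": {"i": 0, "j": 8, "k": 9},
    "j": {"i": 8, "j": 0, "k": 7},
    "k": {"i": 9, "j": 7, "k": 0},
}
# Hypothetical input triplet dissimilarities; (i, j, j) encodes a 2-distance.
tau = {
    ("i", "j", "k"): 4.5,
    ("j", "i", "k"): 3.0,
    ("k", "i", "j"): 5.0,
    ("i", "j", "j"): 8.5,
}

print(linf_error(tau, DT))  # 1.0
print(fits(tau, DT, 1.0))   # True
print(fits(tau, DT, 0.5))   # False
```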

The Reduction

Given a 3CNF formula φ, we define triplet distances τ and an error bound Δ which force the output tree to imply a satisfying assignment to φ.

• The set of taxa:
  • Taxa T, F.
  • A taxon for every literal (xi and ¬xi).
  • 3 taxa for every clause Cj (y j1, y j2, y j3).
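The taxon set can be sketched as follows for a formula with n variables and m clauses, giving 2 + 2n + 3m taxa. The clause encoding (signed integers, as in DIMACS files) and the taxon naming scheme are assumptions made for illustration only.

```python
def build_taxa(num_vars, clauses):
    """List the taxa of the reduction for a 3CNF formula.

    clauses is a list of 3-tuples of signed variable indices
    (an assumed encoding); names such as "~x1" are hypothetical labels.
    """
    taxa = ["T", "F"]
    for i in range(1, num_vars + 1):
        taxa += [f"x{i}", f"~x{i}"]          # one taxon per literal
    for j in range(1, len(clauses) + 1):
        taxa += [f"y{j}_{k}" for k in (1, 2, 3)]  # three taxa per clause
    return taxa

# Hypothetical formula: (x1 or ~x2 or x3) and (~x1 or x2 or x3).
clauses = [(1, -2, 3), (-1, 2, 3)]
taxa = build_taxa(3, clauses)
print(len(taxa))  # 14
print(taxa[:4])   # ['T', 'F', 'x1', '~x1']
```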

Properties Enforced by the Input (τ,Δ)
• One of the following can be enforced on each taxon triplet (u,v,w):
  • taxon u is close to Path(v,w), or
  • taxon u is far from Path(v,w)

[Figure: a taxon u and the path between taxa v and w; the taxa T and F.]

Enforcing Truth Assignment
• A truth assignment to φ is implied by the following:
  • T is far from F
  • For each i, xi is far from ¬xi, and both xi and ¬xi are close to Path(T, F)

Thus we set xi = T iff xi is close to T.

Enforcing Clauses-Satisfaction

A clause C = (l1 ∨ l2 ∨ l3) is satisfied iff at least one literal li is true, i.e. is close to T.

(l1 ∨ l2 ∨ l3) is satisfied iff it is not like this:

[Figure: T, F and the literals l1, l2, l3 in the unsatisfied configuration.]

We need to guarantee that all clauses avoid the above by the close/far relations.

Clauses-Satisfaction (cont)

(l1 ∨ l2 ∨ l3) is satisfied iff, out of the three paths Path(l1, l2), Path(l1, l3), Path(l2, l3), at least two paths are close to T.

But we don’t know which two paths.

Clauses-Satisfaction (cont)

We attach a taxon to each such path:

y1 is close to Path(l2, l3)

y2 is close to Path(l1, l3)

y3 is close to Path(l1, l2)

[Figure: the taxa y1, y2, y3 attached to the paths between l1, l2, l3 in a tree with T and F.]

(l1 ∨ l2 ∨ l3) is satisfied iff at least two yi’s can be located close to T …

Clauses-Satisfaction (end)

… and at least two of the yi’s can be located close to T iff

Path(y2, y3), Path(y1, y3), Path(y1, y2) are close to T.

So, (l1 ∨ l2 ∨ l3) is satisfied iff all the above paths are close to T.

Construction Example

[Figure: the constructed tree for an example formula, with the taxa T, F, the vertices vT, vF, the clause taxa y11, y12, y13, y21, y22, y23, and edges of weight α.]

φ is satisfiable ⇔ there is a tree T which satisfies all bounds:

A1: τT (T , F) ≥ 2α+2β

A2: i=1..n : τT (T ; xi ¬xi) ≤ α ; τT (F ; xi ¬xi) ≤ α

B1: j=1..m : τT (y j1 ; l j2 l j3) ≤ α ; τT (y j2 ; l j1 l j3) ≤ α ; τT (y j3 ; l j1 l j2) ≤ α

B2: j=1..m : τT (y j1 ; T F) ≥ α ; τT (y j2 ; T F) ≥ α ; τT (y j3 ; T F) ≥ α

B3: j=1..m : τT (T ; y j2 y j3) ≤ α ; τT (T ; y j1 y j3) ≤ α ; τT (T ; y j1 y j2) ≤ α

Hardness of Approximation Results

By “stretching” the close/far restrictions, the following problems are also shown to be NP-hard:

• Approximating Maximal Difference:
  • Finding a tree T s.t. ||τ,τT||∞ ≤ 1.4||τ,τOPT||∞
• Approximating Maximal Distortion:
  • Finding a tree T s.t. MaxDist(τ,τT) ≤ C·MaxDist(τ,τOPT), for any constant C

Details in:

I. Gronau and S. Moran, On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007, pp. 44-55.

Open Problems/Further Research
• Extending the hardness results to 3-diss tables induced by 2-diss matrices
  • (τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)])
• Extending the hardness results to “naturally looking” trees
  • (binary trees with constant-bounded edge weights)
• Checking the performance of NJ when the neighbor-selection formula is computed from “real” 3-distances.
• Devising algorithms which use 3-distances as input.
• Does optimization of 3-diss lead to good topological accuracy (under accepted models of sequence evolution)?
  • (it is known that optimization of 2-diss doesn’t lead to good topological accuracy)

### Distance-Based Phylogenetic Reconstruction

• Compute distances between all taxon pairs
• Find an edge-weighted tree which best describes the distances

The Reduction – τ(φ)

[Figure: the constructed tree, with the taxa T, F, the vertices vT, vF, the clause taxa y11, y12, y13, y21, y22, y23, and edges of weight α.]

Bounds to be enforced on the tree T:

A1: τT (T , F) ≥ 2α+2β

A2: i=1..n : τT (T ; xi ¬xi) ≤ α ; τT (F ; xi ¬xi) ≤ α

B1: j=1..m : τT (y j1 ; l j2 l j3) ≤ α ; τT (y j2 ; l j1 l j3) ≤ α ; τT (y j3 ; l j1 l j2) ≤ α

B2: j=1..m : τT (y j1 ; T F) ≥ α ; τT (y j2 ; T F) ≥ α ; τT (y j3 ; T F) ≥ α

B3: j=1..m : τT (T ; y j2 y j3) ≤ α ; τT (T ; y j1 y j3) ≤ α ; τT (T ; y j1 y j2) ≤ α

• In our constructed tree:
  • All 2-distances are in [2α, 2α+2β].
  • All 3-distances are in [α, α+2β].
• Δ = β.

The input τ(φ):

A1: τ(T , F) = 2α+3β

A2: i=1..n : τ(T ; xi ¬xi) = α−β ; τ(F ; xi ¬xi) = α−β

B1: j=1..m : τ(y j1 ; l j2 l j3) = α−β ; τ(y j2 ; l j1 l j3) = α−β ; τ(y j3 ; l j1 l j2) = α−β

B2: j=1..m : τ(y j1 ; T F) = α+β ; τ(y j2 ; T F) = α+β ; τ(y j3 ; T F) = α+β

B3: j=1..m : τ(T ; y j2 y j3) = α−β ; τ(T ; y j1 y j3) = α−β ; τ(T ; y j1 y j2) = α−β

Other 2-distances: τ(s , t) = 2α+2β

Other 3-distances: τ(s ; t u) = α+2β
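The link between the input values and the bounds on the tree is plain interval arithmetic: an input value c with error bound Δ = β allows tree values in [c − β, c + β]. The short sketch below checks this for the three kinds of constrained entries; the numeric values of α and β are hypothetical and chosen only to make the arithmetic visible.

```python
def implied_interval(value, delta):
    """Tree values allowed by |tree_value - value| <= delta."""
    return (value - delta, value + delta)

alpha, beta = 10, 1   # hypothetical values
delta = beta          # error bound of the reduction

close = implied_interval(alpha - beta, delta)         # "close" entries
far = implied_interval(alpha + beta, delta)           # "far" entries
t_f = implied_interval(2 * alpha + 3 * beta, delta)   # the pair T, F

print(close)  # (8, 10): upper end is alpha
print(far)    # (10, 12): lower end is alpha
print(t_f)    # (22, 24): lower end is 2*alpha + 2*beta
```

So an input value of α−β yields the bound τT ≤ α, an input value of α+β yields τT ≥ α, and the input 2α+3β for the pair T, F yields τT (T , F) ≥ 2α+2β.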