# Fitting Tree Metrics: Hierarchical Clustering and Phylogeny

Fitting Tree Metrics: Hierarchical Clustering and Phylogeny. Nir Ailon, Moses Charikar, Princeton University. Data with dissimilarity information, represented by a matrix D with complete information.


## Fitting Tree Metrics: Hierarchical Clustering and Phylogeny

Nir Ailon, Moses Charikar

Princeton University

### Data with dissimilarity information

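• Represented by matrix D

• Complete information

[Figure: the five points u, v, w, x, y with a dissimilarity label on every pair, e.g. D(u,v)=1; the remaining edge labels are 10, 7, 6, 5, 3, 2, 13, 8, 5.]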

(big number = high dissimilarity)

### Goal: Fit data to tree structure

• Preserve dissimilarity info

• Tree metric $d_T$ close to D

[Figure: tree T with leaves u, v, w, x, y; $d_T(u,v)$ is the length of the path between u and v in T.]

### Objective function

Minimize:

$\mathrm{cost}(T) = \| D - d_T \|_p$

where D and $d_T$ are treated as $\binom{n}{2}$-dimensional real vectors.
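The objective above can be evaluated directly once both D and the tree metric are available as matrices. The following is a minimal sketch in Python, assuming symmetric list-of-lists matrices; the function name `lp_cost` and its parameters are illustrative and do not come from the slides.

```python
import itertools

def lp_cost(D, dT, p=1.0):
    """Return ||D - dT||_p over the n-choose-2 unordered pairs.

    D and dT are symmetric n x n matrices (lists of lists).
    p = float('inf') gives the maximum absolute difference.
    """
    n = len(D)
    diffs = [abs(D[i][j] - dT[i][j])
             for i, j in itertools.combinations(range(n), 2)]
    if p == float("inf"):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Hypothetical 3-point example: the two matrices differ by 1 and by 2.
D = [[0, 1, 3], [1, 0, 5], [3, 5, 0]]
dT = [[0, 1, 4], [1, 0, 3], [4, 3, 0]]
print(lp_cost(D, dT, 1), lp_cost(D, dT, float("inf")))
```

Only pairs with i < j are counted, matching the view of D as a vector with one coordinate per unordered pair.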

### Applications

• Evolutionary biology

• Molecular phylogeny: Dissimilarity information from DNA

• Gene expression analysis

• Historical linguistics

• ...

### Special case: Ultrametrics

(Hierarchical clustering)

[Figure: rooted tree T with M=3 levels and leaves y, u, v, x, w; for example $d_T(v,x)=1$ and $d_T(u,w)=3$.]

Equivalently: the two largest distances in every triangle are equal.
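The triangle characterization can be checked by brute force over all triples. This is a minimal sketch in Python, assuming D is a symmetric list-of-lists matrix; the name `is_ultrametric`, the tolerance parameter, and the example matrices are illustrative, not from the slides.

```python
import itertools

def is_ultrametric(D, tol=1e-9):
    """True if the two largest distances in every triangle are equal."""
    n = len(D)
    for i, j, k in itertools.combinations(range(n), 3):
        smallest, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if largest - middle > tol:
            return False
    return True

# Hypothetical 4-point examples.
good = [[0, 1, 3, 3],
        [1, 0, 3, 3],
        [3, 3, 0, 2],
        [3, 3, 2, 0]]
bad = [[0, 1, 2],
       [1, 0, 3],
       [2, 3, 0]]   # triangle with distances 1, 2, 3
print(is_ultrametric(good), is_ultrametric(bad))
```

The check takes time cubic in the number of points, which is fine for illustrating the definition but not meant as an efficient test.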

### Previous results

• Fitting ultrametrics under ||.|| in P[FKW95]

• Fitting trees under ||.|| APX-Hard[ABFPT99]

• Fitting ultrametrics under ||.||1 APX-Hard[W93] under ||.||2 NP-Hard

• f(n)-approximation algorithm for ultrametrics(3f(n))-approximation algorithm for trees(under any ||.||p) [ABFPT99]

### Previous results

• $O(\min\{n^{1/p}, (k \log n)^{1/p}\})$-approximation for trees under $\|\cdot\|_p$ [HKM05]

• Fitting ultrametrics for M=2 under $\|\cdot\|_1$: Correlation Clustering [BBC02, CGW03, ACN05, ...]

• . . .

### Our results

• (M+1)-approximation for fitting level-M ultrametrics under $\|\cdot\|_1$

• $O((\log n \log\log n)^{1/p})$-approximation for general weighted trees under $\|\cdot\|_p$

### Reconstructing T from ultrametric D

• Given ultrametric $D \in \{1,\dots,M\}^{n \times n}$

• Pick pivot vertex u

• Recursively solve for neighbor-classes

[Figure: pivot u with its neighbor classes at distances 1, 2 and 3; labels M=3 and M=2.]
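The pivot recursion can be sketched as follows, assuming a neighbor class is the set of vertices at one fixed distance from the pivot. The sketch rebuilds the whole matrix while reading only the pivot's row at each step; distances between two different classes follow from the ultrametric condition (the larger of the two distances to the pivot). The function name `reconstruct` and the choice of the first vertex as pivot are illustrative assumptions.

```python
def reconstruct(D, nodes=None, T=None):
    """Rebuild an ultrametric matrix from pivot rows only.

    Assumes D is a genuine ultrametric with integer entries in 1..M.
    """
    n = len(D)
    if nodes is None:
        nodes = list(range(n))
        T = [[0] * n for _ in range(n)]
    if len(nodes) <= 1:
        return T
    u, rest = nodes[0], nodes[1:]          # pivot (any vertex works here)
    classes = {}                           # distance from u -> neighbor class
    for v in rest:
        T[u][v] = T[v][u] = D[u][v]
        classes.setdefault(D[u][v], []).append(v)
    for i, Ci in classes.items():          # pairs in different classes
        for j, Cj in classes.items():
            if i < j:
                for v in Ci:
                    for w in Cj:
                        T[v][w] = T[w][v] = j   # max(i, j)
    for Ci in classes.values():            # pairs inside a class: recurse
        reconstruct(D, Ci, T)
    return T

# Hypothetical 4-point ultrametric with M = 3.
D = [[0, 1, 3, 3],
     [1, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]
print(reconstruct(D) == D)
```

For a consistent ultrametric input the output equals D regardless of which pivots are chosen.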

### Minimizing $\|\cdot\|_1$ for inconsistent D

• $D \in \{1,\dots,M\}^{n \times n}$

• Same algorithm!

• Pick pivot vertex u (uniformly at random)

• Freeze distances incident to u

• Fix inter-class distances

• Fix intra-class distances

[Figure: pivot u with its neighbor classes; edge labels 1, 2, 3, with the edges whose distance is changed marked X.]

• (Total cost contribution: 4)

• Recurse...

• Lemma: no cancellations

• Theorem: M+1 approximation
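The steps listed on this slide can be sketched in Python as below. This is one reading of the bullets, not the authors' code: "fix inter-class distances" is taken to mean setting the distance between classes i and j to max(i, j), and "fix intra-class distances" is taken to mean lowering any distance inside class i that exceeds i down to i (implemented with a `cap` argument). The names `pivot_fit` and `l1_cost` and the `seed` parameter are hypothetical.

```python
import random

def pivot_fit(D, M, seed=0):
    """Fit an M-level ultrametric to a symmetric matrix D with entries in 1..M."""
    rng = random.Random(seed)
    n = len(D)
    T = [[0] * n for _ in range(n)]

    def solve(nodes, cap):
        if len(nodes) <= 1:
            return
        u = rng.choice(nodes)               # pivot, uniformly at random
        classes = {}
        for v in nodes:
            if v == u:
                continue
            d = min(D[u][v], cap)           # assumption: values above the cap are lowered
            T[u][v] = T[v][u] = d           # freeze distances incident to u
            classes.setdefault(d, []).append(v)
        levels = sorted(classes)
        for a, i in enumerate(levels):      # inter-class distances: max(i, j) = j
            for j in levels[a + 1:]:
                for v in classes[i]:
                    for w in classes[j]:
                        T[v][w] = T[w][v] = j
        for i in levels:                    # intra-class distances are at most i
            solve(classes[i], i)

    solve(list(range(n)), M)
    return T

def l1_cost(D, T):
    """Sum of |D - T| over unordered pairs."""
    n = len(D)
    return sum(abs(D[i][j] - T[i][j]) for i in range(n) for j in range(i + 1, n))

# Hypothetical inconsistent input: a triangle with distances 1, 2, 3.
D = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
T = pivot_fit(D, M=3, seed=0)
print(T, l1_cost(D, T))
```

On this triangle the cost is 1 or 2 depending on which vertex is drawn as pivot, while the best possible ultrametric fit costs 1.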

### Proof idea

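• Triangle (u, v, w) with distances $d_1 \ge d_2 \ge d_3$ is violating if: $d_1 > d_2 \ge d_3$

• Optimal solution pays $\ge d_1 - d_2$

• Algorithm charging scheme: a triangle is charged when one of its vertices is chosen as pivot

[Figure: violating triangle on u, v, w. Depending on the pivot, the opposite edge is changed $d_1 \Rightarrow d_2$, $d_2 \Rightarrow d_1$, or $d_3 \Rightarrow d_2 \Rightarrow d_1$, with charges $d_1 - d_2$ or $(d_2 - d_3) + (d_1 - d_2)$.]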
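To make the notion of a violating triangle concrete, the sketch below lists every triangle whose largest distance strictly exceeds its second largest, together with the gap between the two, which is the least total change that makes that single triangle consistent. The function name `violating_triangles` is illustrative. The gaps of different triangles should not simply be added up as a bound, since triangles share edges.

```python
import itertools

def violating_triangles(D):
    """Return [((i, j, k), gap)] for triangles whose two largest distances differ."""
    n = len(D)
    found = []
    for i, j, k in itertools.combinations(range(n), 3):
        smallest, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if largest > middle:
            found.append(((i, j, k), largest - middle))
    return found

# Hypothetical input: a single triangle with distances 1, 2, 3.
D = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
print(violating_triangles(D))
```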


### General ultrametrics

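• $D \in \mathbb{R}_+^{n \times n}$

• Fit D to weighted ultrametric

M possible distances: $L_1$, $L_1 + L_2$, ..., $L_1 + \dots + L_M$ (the t-th distance is $L_1 + \dots + L_t$)

Ex: $d_T(v,w) = L_1 + L_2$

[Figure: tree T with leaves y, u, v, x, w and level weights $L_M, \dots, L_2, L_1$ from the top down to the leaves; for the pair u, y: $x^1_{uy} = 1$, $x^2_{uy} = 0$, ..., $x^M_{uy} = 0$.]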

### Fitting D to M-level weighted ultrametric under $\|\cdot\|_1$

• Integer program formulation: $x^t_{uv} \in \{0,1\}$ (linear relaxation: $x^t_{uv} \in [0,1]$)

• $x^t_{uv} = 1 \iff$ u, v separated at level t

• $0 \le x^M_{uv} \le x^{M-1}_{uv} \le \dots \le x^1_{uv} = 1$

• Triangle inequality at each level: $x^t_{uv} \le x^t_{uw} + x^t_{wv}$

• Cost: $\min \sum_{t=1}^{M} L_t \Big( \sum_{D(u,v) \le t} x^t_{uv} + \sum_{D(u,v) > t} (1 - x^t_{uv}) \Big)$

### Rounding the LP: An $O(\log n \log\log n)$-approximation

• A divisive (top-down) algorithm

• At each level t=M, M-1,..., 1:

• Solve a multi-cut-like problem

• Cluster so as to separate pairs u, v such that $x^t_{uv} \ge 2/3$

• Danger: High levels influence low ones!

### General $\|\cdot\|_p$ cost

• Similar analysis gives same bound for $\|\cdot\|_p^p$

• Therefore: $O((\log n \log\log n)^{1/p})$-approximation

• By [ABFPT99], applies also to fitting trees
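The step from the bound on $\|\cdot\|_p^p$ to the $1/p$ exponent is just taking p-th roots. A short worked version, where $T^*$ (an optimal solution) and $\alpha$ (the approximation factor) are symbols introduced here for illustration:

```latex
\|D - d_T\|_p^p \;\le\; \alpha \cdot \|D - d_{T^*}\|_p^p
\quad\Longrightarrow\quad
\|D - d_T\|_p \;\le\; \alpha^{1/p} \cdot \|D - d_{T^*}\|_p ,
\qquad \alpha = O(\log n \,\log\log n).
```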

### Future work

• O(log n)-approximation algorithm? Better?

• Stronger lower bounds

• Derandomize (M+1)-approx algorithm

• Aggregation [ACN05]

• Applications

Thank You !!!