Fitting Tree Metrics: Hierarchical Clustering and Phylogeny

Nir Ailon, Moses Charikar

Princeton University


Data with dissimilarity information

  • Represented by matrix D

  • Complete information

[Figure: complete graph on the five points u, v, w, x, y with edge labels 1, 2, 3, 5, 5, 6, 7, 8, 10, 13; for example D(u,v)=1. Big number = high dissimilarity.]


Goal: Fit data to tree structure

  • Preserve dissimilarity info

  • Tree metric dT close to D

[Figure: tree T with leaves u, v, w, x, y; dT(u,v) is the length of the path between u and v in T]
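A tree metric is the path-length distance between leaves of a weighted tree. The sketch below computes dT for a small tree; the tree shape, the internal node names `a` and `b`, and the edge weights are made up for illustration and are not taken from the slide's figure.

```python
from collections import defaultdict

def tree_metric(edges, leaves):
    """Return {(p, q): path length} for all leaf pairs p < q of a weighted tree."""
    adj = defaultdict(list)
    for a, b, w in edges:
        adj[a].append((b, w))
        adj[b].append((a, w))

    def dist_from(src):
        # Depth-first walk; in a tree every node is reached by exactly one path.
        dist = {src: 0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        return dist

    d = {}
    for p in leaves:
        dp = dist_from(p)
        for q in leaves:
            if p < q:
                d[(p, q)] = dp[q]
    return d

# Hypothetical tree: internal nodes "a" and "b", leaves u, v, w, x, y.
EDGES = [("a", "u", 1), ("a", "v", 2), ("a", "b", 3),
         ("b", "w", 1), ("b", "x", 1), ("b", "y", 4)]
d_T = tree_metric(EDGES, ["u", "v", "w", "x", "y"])
```

Here `d_T[("u", "v")]` is the weight of the path u-a-v, and pairs that cross the a-b edge also pay its weight.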


Objective function

Minimize:

cost(T) = || D - dT ||p

where D and dT are viewed as (n choose 2)-dimensional real vectors
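As a minimal sketch of this objective, the function below evaluates ||D - dT||p over the n-choose-2 unordered pairs. The function name `fit_cost`, the dictionary layout and the three-point example values are assumptions made for illustration.

```python
from itertools import combinations

def fit_cost(D, d_T, points, p=1.0):
    """||D - d_T||_p over unordered pairs; p=float('inf') gives the max norm."""
    diffs = [abs(D[pair] - d_T[pair]) for pair in combinations(points, 2)]
    if p == float("inf"):
        return max(diffs)
    return sum(x ** p for x in diffs) ** (1.0 / p)

# Hypothetical data: D is not a tree-like fit, T_EXAMPLE is a candidate metric.
POINTS = ["u", "v", "w"]
D_EXAMPLE = {("u", "v"): 1, ("u", "w"): 3, ("v", "w"): 5}
T_EXAMPLE = {("u", "v"): 1, ("u", "w"): 4, ("v", "w"): 4}

cost_l1 = fit_cost(D_EXAMPLE, T_EXAMPLE, POINTS, p=1)
cost_l2 = fit_cost(D_EXAMPLE, T_EXAMPLE, POINTS, p=2)
cost_linf = fit_cost(D_EXAMPLE, T_EXAMPLE, POINTS, p=float("inf"))
```

Two of the three pairs are off by 1, so the L1 cost is 2, the L2 cost is the square root of 2, and the max-norm cost is 1.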


Applications

  • Evolutionary biology

    • Molecular phylogeny: Dissimilarity information from DNA

  • Gene expression analysis

  • Historical linguistics

  • ...


Special case: Ultrametrics

(Hierarchical clustering)

[Figure: rooted tree T with M=3 levels over leaves y, u, v, x, w; for example dT(v,x)=1 and dT(u,w)=3]

Equivalently: the two largest distances in every triangle are equal
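The triangle characterization can be checked directly. In the sketch below, the values dT(v,x)=1 and dT(u,w)=3 come from the slide; every other distance, and the helper names, are assumptions chosen so that the table corresponds to a 3-level hierarchy.

```python
from itertools import combinations

def is_ultrametric(d, points):
    """Three-point condition: in every triangle the two largest distances are equal."""
    for a, b, c in combinations(points, 3):
        sides = sorted([d[frozenset((a, b))],
                        d[frozenset((a, c))],
                        d[frozenset((b, c))]])
        if sides[1] != sides[2]:
            return False
    return True

def table(pairs):
    """Store symmetric distances under unordered (frozenset) keys."""
    return {frozenset(k): v for k, v in pairs.items()}

POINTS = ["u", "v", "w", "x", "y"]
# Hypothetical hierarchy: {v,x} and {u,y} merge at level 1, they join at level 2,
# and w joins everything at level 3.
D_TREE = table({("v", "x"): 1, ("u", "y"): 1,
                ("u", "v"): 2, ("u", "x"): 2, ("v", "y"): 2, ("x", "y"): 2,
                ("u", "w"): 3, ("v", "w"): 3, ("w", "x"): 3, ("w", "y"): 3})

# Changing one entry breaks the condition on the triangle u, v, x (sides 3, 2, 1).
D_BAD = dict(D_TREE)
D_BAD[frozenset(("u", "v"))] = 3
```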


Previous results

  • Fitting ultrametrics under ||.||∞ in P [FKW95]

  • Fitting trees under ||.||∞ APX-Hard [ABFPT99]

  • Fitting ultrametrics under ||.||1 APX-Hard [W93], under ||.||2 NP-Hard

  • f(n)-approximation algorithm for ultrametrics implies (3f(n))-approximation algorithm for trees (under any ||.||p) [ABFPT99]


Previous results (continued)

  • O(min{n^(1/p), (k log n)^(1/p)})-approx for trees under ||.||p [HKM05]

  • Fitting ultrametrics for M=2 under ||.||1: Correlation Clustering [BBC02, CGW03, ACN05, ...]

  • ...


Our results

  • (M+1)-approx for fitting level M ultrametrics under ||.||1

  • O((log n loglog n)^(1/p))-approx for general weighted trees under ||.||p


Reconstructing T from ultrametric D

  • Given ultrametric D ∈ {1..M}^(n×n)

  • Pick pivot vertex u

  • Recursively solve for neighbor-classes

[Figure: pivot u with its neighbor-classes at distances 1, 2 and 3; labels M=3 and M=2]
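One way to sketch the pivot recursion: group the other points by their distance to the pivot, solve each neighbor-class recursively, and hang the resulting subtrees at the matching height. The nested-tuple tree representation, the helper names, the fixed choice of the first point as pivot and the example distances are all assumptions made for illustration.

```python
def build_tree(points, D):
    """Leaf = point name (a string); internal node = (height, [children])."""
    points = list(points)
    if len(points) == 1:
        return points[0]
    pivot, rest = points[0], points[1:]
    classes = {}
    for v in rest:
        classes.setdefault(D[frozenset((pivot, v))], []).append(v)
    tree = pivot
    for dist in sorted(classes):            # attach classes from nearest to farthest
        sub = build_tree(classes[dist], D)
        if isinstance(sub, tuple) and sub[0] == dist:
            tree = (dist, [tree] + sub[1])  # same height: splice children into one node
        else:
            tree = (dist, [tree, sub])
    return tree

def leaves(tree):
    if isinstance(tree, tuple):
        return [x for child in tree[1] for x in leaves(child)]
    return [tree]

def tree_distances(tree, out=None):
    """Distance between two leaves = height of their lowest common ancestor."""
    out = {} if out is None else out
    if isinstance(tree, tuple):
        height, children = tree
        groups = [leaves(c) for c in children]
        for i, g in enumerate(groups):
            for h in groups[i + 1:]:
                for a in g:
                    for b in h:
                        out[frozenset((a, b))] = height
        for c in children:
            tree_distances(c, out)
    return out

PTS = ["u", "v", "w", "x", "y"]
# Hypothetical ultrametric with M=3.
D_ULTRA = {frozenset(k): v for k, v in {
    ("v", "x"): 1, ("u", "y"): 1,
    ("u", "v"): 2, ("u", "x"): 2, ("v", "y"): 2, ("x", "y"): 2,
    ("u", "w"): 3, ("v", "w"): 3, ("w", "x"): 3, ("w", "y"): 3}.items()}

T = build_tree(PTS, D_ULTRA)
```

Reading the distances back with `tree_distances(T)` reproduces `D_ULTRA`, which is what "reconstructing T" should mean for a consistent input.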


Minimizing ||.||1 for inconsistent D

D ∈ {1..M}^(n×n)

  • Same algorithm!

  • Pick pivot vertex u (at random)

  • Freeze distances incident to u

  • Fix inter-class distances

  • Fix intra-class distances

  • (Total cost contribution in the pictured example: 4)

  • Recurse...

  • Lemma: no cancellations

  • Theorem: M+1 approximation

[Figure: worked example around pivot u with neighbor-classes at distances 1, 2 and 3; inconsistent inter-class and intra-class distances are crossed out (X) and replaced]
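The steps above can be sketched as follows. This is one reading of the slide, not a verified transcription of the authors' algorithm: it assumes that "fix inter-class distances" sets a pair from different neighbor-classes to the larger of the two pivot distances, and that "fix intra-class distances" caps a pair inside a class at the class's pivot distance before recursing. All names and the example triangle are illustrative.

```python
import random
from itertools import combinations

def pivot_fit(points, D, rng=None):
    """Return an ultrametric d (keys: frozenset pairs) built by random pivoting."""
    if rng is None:
        rng = random.Random(0)
    points = list(points)
    d = {}
    if len(points) < 2:
        return d
    pivot = rng.choice(points)
    classes = {}
    for v in points:
        if v != pivot:
            dist = D[frozenset((pivot, v))]
            d[frozenset((pivot, v))] = dist       # freeze distances incident to the pivot
            classes.setdefault(dist, []).append(v)
    # Inter-class pairs: assumed rule, take the larger of the two pivot distances.
    for (di, ci), (dj, cj) in combinations(classes.items(), 2):
        for v in ci:
            for w in cj:
                d[frozenset((v, w))] = max(di, dj)
    # Intra-class pairs: assumed rule, cap at the class distance, then recurse.
    for dist, members in classes.items():
        capped = {frozenset(p): min(D[frozenset(p)], dist)
                  for p in combinations(members, 2)}
        d.update(pivot_fit(members, capped, rng))
    return d

def is_ultrametric(d, points):
    for a, b, c in combinations(points, 3):
        s = sorted(d[frozenset(p)] for p in ((a, b), (a, c), (b, c)))
        if s[1] != s[2]:
            return False
    return True

def l1_cost(D, d):
    return sum(abs(D[k] - d[k]) for k in D)

# Hypothetical inconsistent input: the triangle has sides 3 > 2 >= 1.
TRI = ["u", "v", "w"]
D_IN = {frozenset(("u", "v")): 1, frozenset(("u", "w")): 3, frozenset(("v", "w")): 2}
fitted = pivot_fit(TRI, D_IN, random.Random(1))
```

The output depends on which pivot is drawn; on this triangle the L1 cost is 1 for two of the pivots and 2 for the third. On an input that is already an ultrametric, the sketch returns the input unchanged.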


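Proof idea

  • Triangle (u,v,w) with distances δ1, δ2, δ3 is violating if: δ1 > δ2 ≥ δ3

  • Optimal solution pays ≥ δ1-δ2

  • Algorithm charging scheme: a vertex of the triangle chosen as pivot ⇒ the triangle is charged

[Figure: violating triangle u, v, w with edge labels δ1, δ2, δ3, drawn once per choice of pivot; the edge opposite the pivot is changed (δ2 ⇒ δ1, δ1 ⇒ δ2, or δ3 ⇒ δ2 ⇒ δ1) and the charge is δ1-δ2, δ1-δ2, or (δ2-δ3) + (δ1-δ2) respectively]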
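A small arithmetic check of the charging idea, under the assumption that the pivot step replaces the side opposite the pivot by the larger of the two sides incident to the pivot (for a violating triangle, both the inter-class and the intra-class rule reduce to this). The function name and the sample values are illustrative.

```python
def charge(d1, d2, d3, opposite):
    """Cost paid on a violating triangle (d1 > d2 >= d3) when the pivot is the
    vertex opposite side number `opposite` (1, 2 or 3)."""
    assert d1 > d2 >= d3
    sides = {1: d1, 2: d2, 3: d3}
    incident = [v for k, v in sides.items() if k != opposite]
    new_value = max(incident)      # assumed effect of the pivot step on this triangle
    return abs(sides[opposite] - new_value)

# Sample violating triangle with sides 3 > 2 >= 1.
charges = [charge(3, 2, 1, k) for k in (1, 2, 3)]
lower_bound = 3 - 2                # any ultrametric pays at least d1 - d2 on it
```

The three charges are d1-d2, d1-d2 and (d2-d3) + (d1-d2), so every pivot choice pays at least the lower bound on the optimal cost for this triangle.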


General ultrametrics

  • D ∈ R+^(n×n)

  • Fit D to weighted ultrametric

M possible distances:

δ1 = L1

δ2 = L1+L2

:

δM = L1 + ... + LM

Ex: dT(v,w) = L1+L2

[Figure: tree T over leaves y, u, v, x, w with level weights L1, L2, ..., LM marked from the bottom level to the top]


Fitting D to M-level weighted ultrametric under ||.||1

  • Integer program formulation: x^t_uv ∈ {0,1} (linear relaxation: x^t_uv ∈ [0,1])

  • x^t_uv = 1 ⇔ u,v separated at level t

  • 0 ≤ x^M_uv ≤ x^(M-1)_uv ≤ ... ≤ x^1_uv = 1

  • Triangle inequality at each level: x^t_uv ≤ x^t_uw + x^t_wv

  • Cost: min Σ_{t=1..M} L_t ( Σ_{D(u,v) ≤ t} x^t_uv + Σ_{D(u,v) > t} (1 - x^t_uv) )

[Figure: tree T with level weights L1, L2, ..., LM over leaves y, u, v, x, w; example for the pair u,y: x^1_uy = 1, x^2_uy = 0, ..., x^M_uy = 0]


Rounding the LP: An O(log n loglog n)-approximation

  • A divisive (top-down) algorithm

  • At each level t=M, M-1,..., 1:

  • Solve a multi-cut-like problem

  • Cluster so as to separate u,v's s.t. x^t_uv ≥ 2/3

  • Danger: High levels influence low ones!


General ||.||p cost

  • Similar analysis gives same bound for ||.||p^p

  • Therefore: O((log n loglog n)^(1/p))-approximation

  • By [ABFPT99], applies also to fitting trees
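The step from a bound on ||.||p^p to the 1/p exponent is just taking p-th roots. Writing α for the factor O(log n loglog n) and T* for an optimal tree (both symbols are introduced here for illustration):

```latex
\[
\|D - d_T\|_p^p \;\le\; \alpha \,\|D - d_{T^*}\|_p^p
\quad\Longrightarrow\quad
\|D - d_T\|_p \;\le\; \alpha^{1/p}\,\|D - d_{T^*}\|_p ,
\qquad \alpha = O(\log n \,\log\log n).
\]
```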


Future work

  • O(log n)-approximation algorithm? Better?

  • Stronger lower bounds

  • Derandomize (M+1)-approx algorithm

  • Aggregation [ACN05]

  • Applications

Thank You !!!

