Fitting Tree Metrics: Hierarchical Clustering and Phylogeny

Nir Ailon, Moses Charikar

Princeton University

Data with dissimilarity information
  • Represented by matrix D
  • Complete information

[Figure: five points u, v, w, x, y with a dissimilarity on every pair, for example D(u,v)=1; the other pairs are labeled 10, 7, 6, 5, 3, 2, 13, 8, 5 (big number = high dissimilarity)]

Goal: Fit data to tree structure
  • Preserve dissimilarity info
  • Tree metric $d_T$ close to D

[Figure: tree T with leaves u, v, w, x, y; $d_T(u,v)$ is the distance between u and v in T]

Objective function

Minimize:

$\mathrm{cost}(T) = \| D - d_T \|_p$

where D and $d_T$ are viewed as $\binom{n}{2}$-dimensional real vectors.
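A minimal sketch of this objective, assuming D and the tree metric are given as symmetric n×n lists of lists. The function name `lp_cost` and the sample matrices are hypothetical, chosen only for illustration.

```python
from itertools import combinations

def lp_cost(D, dT, p):
    """||D - dT||_p, taken over the n-choose-2 unordered pairs of points."""
    n = len(D)
    diffs = [abs(D[i][j] - dT[i][j]) for i, j in combinations(range(n), 2)]
    if p == float("inf"):
        return max(diffs, default=0)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Hypothetical 3-point example: only the pair (0, 2) disagrees, by 3.
D  = [[0, 1, 5], [1, 0, 2], [5, 2, 0]]
dT = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
print(lp_cost(D, dT, 1), lp_cost(D, dT, 2), lp_cost(D, dT, float("inf")))
```

Each unordered pair is counted once, which matches the view of D as a vector with one coordinate per pair.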

Applications
  • Evolutionary biology
    • Molecular phylogeny: dissimilarity information from DNA
  • Gene expression analysis
  • Historical linguistics
  • ...
Special case: Ultrametrics

(Hierarchical clustering)

[Figure: rooted tree T with M=3 levels and leaves y, u, v, x, w; for example $d_T(v,x)=1$ and $d_T(u,w)=3$]

Equivalently: the two largest distances in every triangle are equal
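The triangle characterization can be checked directly. The sketch below assumes a symmetric matrix with exactly comparable entries (for instance integer levels); `is_ultrametric` and the two sample matrices are hypothetical names and data.

```python
from itertools import combinations

def is_ultrametric(D):
    """True if, in every triangle, the two largest distances are equal."""
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        a, b, c = sorted((D[i][j], D[i][k], D[j][k]))
        if b != c:          # the two largest sides differ
            return False
    return True

D_ok  = [[0, 1, 3], [1, 0, 3], [3, 3, 0]]   # sides 1, 3, 3
D_bad = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]   # sides 1, 2, 3
print(is_ultrametric(D_ok), is_ultrametric(D_bad))  # True False
```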

Previous results
  • Fitting ultrametrics under $\|\cdot\|_\infty$ is in P [FKW95]
  • Fitting trees under $\|\cdot\|_\infty$ is APX-hard [ABFPT99]
  • Fitting ultrametrics under $\|\cdot\|_1$ is APX-hard [W93]; under $\|\cdot\|_2$ it is NP-hard
  • An f(n)-approximation algorithm for ultrametrics gives a 3f(n)-approximation algorithm for trees (under any $\|\cdot\|_p$) [ABFPT99]
Previous results
  • $O(\min\{n^{1/p}, (k \log n)^{1/p}\})$-approx for trees under $\|\cdot\|_p$ [HKM05]
  • Fitting ultrametrics for M=2 under $\|\cdot\|_1$: Correlation Clustering [BBC02, CGW03, ACN05..]
  • . . .
Our results
  • (M+1)-approx for fitting level-M ultrametrics under $\|\cdot\|_1$
  • $O((\log n \log\log n)^{1/p})$-approx for general weighted trees under $\|\cdot\|_p$
Reconstructing T from ultrametric D
  • Given ultrametric $D \in \{1..M\}^{n \times n}$
  • Pick pivot vertex u
  • Recursively solve for neighbor-classes

[Figure: pivot u with its neighbor classes at distances 1, 2 and 3; labels M=3 and M=2]
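The pivot recursion can be sketched as follows, assuming D really is an ultrametric with values in 1..M. The nested-list tree representation and the names `build`, `leaves` and `tree_distance` are illustrative choices, not taken from the slides. A leaf is a point index, and the distance between two leaves is read off as the height of their lowest common ancestor.

```python
def build(D, pts, M):
    """Subtree of height M spanning pts, as nested lists (a leaf is an index)."""
    u = pts[0]                 # pivot; any vertex works when D is consistent
    node = u                   # height-0 subtree: the pivot itself
    for t in range(1, M + 1):
        children = [node]      # u's own subtree of height t-1
        cls = [v for v in pts if D[u][v] == t]   # neighbor class at distance t
        if cls:
            # build() returns a height-t node; its children are the
            # height-(t-1) subtrees that sit next to u's subtree.
            children.extend(build(D, cls, t))
        node = children
    return node

def leaves(node):
    if not isinstance(node, list):
        return [node]
    return [x for child in node for x in leaves(child)]

def tree_distance(node, height, a, b):
    """Height of the lowest common ancestor of leaves a and b."""
    if a == b:
        return 0
    for child in node:
        ls = leaves(child)
        if a in ls and b in ls:
            return tree_distance(child, height - 1, a, b)
    return height

# Hypothetical 5-point ultrametric with M = 3.
D = [[0, 1, 2, 3, 3],
     [1, 0, 2, 3, 3],
     [2, 2, 0, 3, 3],
     [3, 3, 3, 0, 1],
     [3, 3, 3, 1, 0]]
T = build(D, list(range(5)), 3)
print(T)  # [[[0, 1], [2]], [[3, 4]]]
```

On a consistent input the tree distances reproduce D exactly, which is what the test of the sketch checks.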

Minimizing $\|\cdot\|_1$ for inconsistent $D \in \{1..M\}^{n \times n}$

[Figure: pivot u with neighbor classes at distances 1, 2 and 3; inconsistent inter-class and intra-class distances are crossed out (X) and replaced]

  • Fix inter-class distances
  • Fix intra-class distances
  • (Total cost contribution: 4)
  • Recurse...
  • Lemma: no cancellations
  • Theorem: M+1 approximation
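One way to read the steps above as code is sketched here. This is an interpretation of the slide, not the authors' implementation. It assumes a random pivot, sets the distance between two different neighbor classes to the larger of their two pivot distances, and caps distances inside a class at that class's pivot distance before recursing. The name `pivot_fit` and the sample input are hypothetical.

```python
import random

def pivot_fit(D, rng=None):
    """Return an ultrametric dT obtained from D by recursive pivoting."""
    rng = rng or random.Random(0)
    dT = [row[:] for row in D]          # working copy, modified in place

    def solve(pts):
        if len(pts) <= 1:
            return
        u = rng.choice(pts)             # pivot
        classes = {}                    # pivot distance -> neighbor class
        for v in pts:
            if v != u:
                classes.setdefault(dT[u][v], []).append(v)
        # Fix inter-class distances: the larger pivot distance wins.
        for s, Cs in classes.items():
            for t, Ct in classes.items():
                if s < t:
                    for v in Cs:
                        for w in Ct:
                            dT[v][w] = dT[w][v] = t
        # Fix intra-class distances (cap at t), then recurse on the class.
        for t, Ct in classes.items():
            for i, v in enumerate(Ct):
                for w in Ct[i + 1:]:
                    if dT[v][w] > t:
                        dT[v][w] = dT[w][v] = t
            solve(Ct)

    solve(list(range(len(D))))
    return dT

# Hypothetical inconsistent input: the triangle has sides 1, 1, 2.
D = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]
F = pivot_fit(D)
cost = sum(abs(D[i][j] - F[i][j]) for i in range(3) for j in range(i + 1, 3))
print(F, cost)
```

For this input every pivot choice changes exactly one distance by 1, so the cost is 1 regardless of the random seed.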
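Proof idea
  • A triangle on u, v, w with distances $\delta_1, \delta_2, \delta_3$ is violating if: $\delta_1 > \delta_2 \ge \delta_3$
  • Optimal solution pays $\ge \delta_1 - \delta_2$
  • Algorithm charging scheme (chosen as pivot $\Rightarrow$ charged):
    • pivot changes $\delta_1 \Rightarrow \delta_2$: charged $\delta_1 - \delta_2$
    • pivot changes $\delta_2 \Rightarrow \delta_1$: charged $\delta_1 - \delta_2$
    • pivot changes $\delta_3 \Rightarrow \delta_2 \Rightarrow \delta_1$: charged $(\delta_2 - \delta_3) + (\delta_1 - \delta_2)$

[Figure: violating triangle on u, v, w; each vertex, when chosen as pivot, changes the distance on the opposite edge]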

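General ultrametrics
  • $D \in \mathbb{R}_+^{n \times n}$
  • Fit D to weighted ultrametric

[Figure: tree T with leaves y, u, v, x, w and level lengths $L_1, L_2, \dots, L_M$ from the leaves up to the root]

M possible distances:

$\delta_1 = L_1$

$\delta_2 = L_1 + L_2$

$\vdots$

$\delta_M = L_1 + \dots + L_M$

Ex: $d_T(v,w) = L_1 + L_2$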
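As a small illustration of how the level lengths determine the M possible distances, the sketch below accumulates hypothetical level lengths into prefix sums and maps a matrix of level indices to weighted distances. The function name `weighted_distances` and the sample values are assumptions, not from the slides.

```python
from itertools import accumulate

def weighted_distances(level, L):
    """Map level indices 1..M to distances L_1 + ... + L_t (0 stays 0)."""
    delta = [0] + list(accumulate(L))   # delta[t] = L_1 + ... + L_t
    return [[delta[s] for s in row] for row in level]

L = [1, 2, 4]                           # hypothetical level lengths, M = 3
level = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
print(weighted_distances(level, L))     # [[0, 1, 3], [1, 0, 3], [3, 3, 0]]
```

A pair that first meets at level 2 gets distance 1 + 2 = 3, the analogue of the example distance L1+L2.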

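Fitting D to M-level weighted ultrametric under $\|\cdot\|_1$

[Figure: tree T with level lengths $L_1, L_2, \dots, L_M$ and leaves y, u, v, x, w; for the pair u, y: $x^1_{uy} = 1$, $x^2_{uy} = 0$, ..., $x^M_{uy} = 0$]

  • Integer program formulation: $x^t_{uv} \in \{0,1\}$ (linear relaxation: $x^t_{uv} \in [0,1]$)
  • $x^t_{uv} = 1 \iff$ u, v separated at level t
  • $0 \le x^M_{uv} \le x^{M-1}_{uv} \le \dots \le x^1_{uv} = 1$
  • Triangle inequality at each level: $x^t_{uv} \le x^t_{uw} + x^t_{wv}$
  • Cost: $\min \sum_{t=1}^{M} L_t \Big( \sum_{D(u,v) \le \delta_{t-1}} x^t_{uv} + \sum_{D(u,v) > \delta_{t-1}} (1 - x^t_{uv}) \Big)$, where $\delta_t = L_1 + \dots + L_t$ and $\delta_0 = 0$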

Rounding the LP: an $O(\log n \log\log n)$-approximation
  • A divisive (top-down) algorithm
  • At each level t = M, M-1, ..., 1:
    • Solve a multi-cut-like problem
    • Cluster so as to separate pairs u, v such that $x^t_{uv} \ge 2/3$
  • Danger: High levels influence low ones!
General $\|\cdot\|_p$ cost
  • Similar analysis gives same bound for $\|\cdot\|_p^p$
  • Therefore: $O((\log n \log\log n)^{1/p})$-approximation
  • By [ABFPT99], applies also to fitting trees
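The step from the p-th-power bound to the stated ratio is a p-th root. As a sketch, write T for the tree returned by the algorithm, T* for an optimal tree, and α = O(log n log log n); these names are introduced here only for illustration.

```latex
\| D - d_T \|_p^p \;\le\; \alpha \, \| D - d_{T^*} \|_p^p
\quad\Longrightarrow\quad
\| D - d_T \|_p \;\le\; \alpha^{1/p} \, \| D - d_{T^*} \|_p
```

Minimizing the norm and minimizing its p-th power select the same optimal tree, so the right-hand side is the optimum of the original objective.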
Future work
  • O(log n)-approximation algorithm? Better?
  • Stronger lower bounds
  • Derandomize (M+1)-approx algorithm
  • Aggregation [ACN05]
  • Applications

Thank You !!!
