Fitting Tree Metrics: Hierarchical Clustering and Phylogeny

Nir Ailon, Moses Charikar
Princeton University

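Data with dissimilarity information
• Represented by matrix D
• Complete information
• Big number = high dissimilarity
[Figure: complete graph on points u, v, w, x, y with a dissimilarity on every edge; D(u,v)=1, other edge labels 10, 7, 6, 5, 3, 2, 13, 8, 5]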

Goal: Fit data to tree structure
• Preserve dissimilarity info
• Tree metric $d_T$ close to D
[Figure: tree T with leaves u, v, w, x, y, illustrating $d_T(u,v)$]

Objective function
• Minimize: $\mathrm{cost}(T) = \|D - d_T\|_p$
• $D$ and $d_T$ are treated as $\binom{n}{2}$-dimensional real vectors
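The objective can be evaluated directly from two distance matrices. The sketch below is only an illustration: the function name `fit_cost` and the list-of-lists matrix layout are assumptions, not part of the talk. Each unordered pair is counted once, matching the $\binom{n}{2}$-dimensional view.

```python
from itertools import combinations
import math

def fit_cost(D, dT, p=1):
    """Return ||D - dT||_p, counting each unordered pair {u, v} once.

    D and dT are symmetric n x n matrices given as lists of lists
    (an assumed layout); p may be any value >= 1 or float("inf").
    """
    n = len(D)
    diffs = [abs(D[u][v] - dT[u][v]) for u, v in combinations(range(n), 2)]
    if math.isinf(p):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Hypothetical 3-point example: only the pair (1, 2) differs, by 1.
D = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
T = [[0, 1, 3], [1, 0, 3], [3, 3, 0]]
print(fit_cost(D, T, p=1))
```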

Applications
• Evolutionary biology
• Molecular phylogeny: dissimilarity information from DNA
• Gene expression analysis
• Historical linguistics
• ...

Special case: Ultrametrics (hierarchical clustering)
[Figure: tree T with M=3 levels over leaves y, u, v, x, w; $d_T(v,x)=1$, $d_T(u,w)=3$]
• Equivalently: the two largest distances in every triangle are equal
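The triangle characterization translates into a direct test over all triples. This is a minimal sketch under assumed conventions (a symmetric list-of-lists matrix and a hypothetical function name `is_ultrametric`); it runs in cubic time and is meant for small inputs.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """True if the two largest distances in every triangle are equal."""
    n = len(D)
    for u, v, w in combinations(range(n), 3):
        # Sort the three side lengths; the top two must coincide.
        _, second, largest = sorted((D[u][v], D[u][w], D[v][w]))
        if largest - second > tol:
            return False
    return True

# Hypothetical examples: the first matrix passes, the second has a
# triangle with distances 1, 2, 3 and fails.
print(is_ultrametric([[0, 1, 3], [1, 0, 3], [3, 3, 0]]))
print(is_ultrametric([[0, 1, 3], [1, 0, 2], [3, 2, 0]]))
```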

Previous results
• Fitting ultrametrics under $\|\cdot\|_\infty$ is in P [FKW95]
• Fitting trees under $\|\cdot\|_\infty$ is APX-hard [ABFPT99]
• Fitting ultrametrics: under $\|\cdot\|_1$ APX-hard [W93]; under $\|\cdot\|_2$ NP-hard
• An $f(n)$-approximation algorithm for ultrametrics implies a $3f(n)$-approximation algorithm for trees (under any $\|\cdot\|_p$) [ABFPT99]

Previous results
• $O(\min\{n^{1/p}, (k \log n)^{1/p}\})$-approximation for trees under $\|\cdot\|_p$ [HKM05]
• Fitting ultrametrics for M=2 under $\|\cdot\|_1$: Correlation Clustering [BBC02, CGW03, ACN05, ...]
• ...

Our results
• $(M+1)$-approximation for fitting level-M ultrametrics under $\|\cdot\|_1$
• $O((\log n \log\log n)^{1/p})$-approximation for general weighted trees under $\|\cdot\|_p$

Reconstructing T from ultrametric D
• Given ultrametric $D \in \{1,\dots,M\}^{n \times n}$
• Pick pivot vertex u
• Recursively solve for neighbor-classes
[Figure: pivot u with neighbor classes at distances 1, 2, 3; subtrees labelled M=3 and M=2]

Minimizing $\|\cdot\|_1$ for inconsistent $D \in \{1,\dots,M\}^{n \times n}$
• Same algorithm!
• Pick pivot vertex u (uniformly at random)
• Freeze distances incident to u
• Fix inter-class distances
• Fix intra-class distances
• (Total cost contribution: 4)
• Recurse...
• Lemma: no cancellations
• Theorem: (M+1)-approximation
[Figure: pivot u with neighbor classes at distances 1, 2, 3; inconsistent inter-class and intra-class distances are crossed out and replaced]
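The pivot steps can be sketched as follows. This is one reading of the slide, not the authors' code: the name `pivot_fit`, the `seed` parameter, and the exact repair rules are assumptions. Vertices are grouped by their frozen distance to the pivot; a pair in two different classes is set to the larger of the two pivot distances, and a pair inside one class is capped at that class's pivot distance before recursing. On an input that is already an ultrametric, these rules change nothing.

```python
import random

def pivot_fit(D, seed=None):
    """Fit an ultrametric to a symmetric matrix D with entries in {1..M}.

    Returns a new matrix; D itself is not modified.
    """
    rng = random.Random(seed)
    T = [list(row) for row in D]            # working copy, becomes the output

    def set_dist(a, b, value):
        T[a][b] = T[b][a] = value

    def solve(vertices):
        if len(vertices) <= 1:
            return
        u = rng.choice(vertices)            # pivot, uniformly at random
        classes = {}                        # frozen distance to u -> vertices
        for v in vertices:
            if v != u:
                classes.setdefault(T[u][v], []).append(v)
        levels = sorted(classes)
        for i, low in enumerate(levels):
            # Inter-class pairs: the larger pivot distance wins.
            for high in levels[i + 1:]:
                for a in classes[low]:
                    for b in classes[high]:
                        set_dist(a, b, high)
            # Intra-class pairs: cap at the class's pivot distance.
            members = classes[low]
            for k, a in enumerate(members):
                for b in members[k + 1:]:
                    set_dist(a, b, min(T[a][b], low))
            solve(members)                  # recurse inside the class

    solve(list(range(len(D))))
    return T

# Hypothetical inconsistent input: the triangle has distances 1, 2, 3.
print(pivot_fit([[0, 1, 2], [1, 0, 3], [2, 3, 0]], seed=0))
```

The recursion depth can reach n, so this form is suited to small inputs only.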

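Proof idea
• A triangle (u, v, w) with distances $\delta_1, \delta_2, \delta_3$ is violating if $\delta_1 > \delta_2 \ge \delta_3$
• Optimal solution pays $\ge \delta_1 - \delta_2$
• Algorithm charging scheme: the triangle is charged according to which of u, v, w is chosen as pivot
[Figure: table "chosen as pivot ⇒ charged" for triangle u, v, w; legible entries include $\delta_1 - \delta_2$ and $(\delta_2 - \delta_3) + (\delta_1 - \delta_2)$]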

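General ultrametrics
• $D \in \mathbb{R}_+^{n \times n}$
• Fit D to weighted ultrametric
• M possible distances: $\delta_1 = L_1$, $\delta_2 = L_1 + L_2$, ..., $\delta_M = L_1 + \dots + L_M$
• Ex: $d_T(v,w) = L_1 + L_2$
[Figure: tree T with level weights $L_1, L_2, \dots, L_M$ over leaves y, u, v, x, w]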
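The M possible distances are running sums of the level weights. The short sketch below assumes a list `L` holding $L_1, \dots, L_M$ and a hypothetical helper `tree_distance` that takes the level at which two leaves first share an ancestor; neither name comes from the talk.

```python
from itertools import accumulate

def level_distances(L):
    """Given [L_1, ..., L_M], return [L_1, L_1+L_2, ..., L_1+...+L_M]."""
    return list(accumulate(L))

def tree_distance(meet_level, L):
    """Distance between two leaves whose subtrees merge at meet_level (1-based)."""
    return level_distances(L)[meet_level - 1]

# With assumed weights L_1=1, L_2=2, L_3=4, two leaves that merge at
# level 2 are at distance L_1 + L_2.
print(level_distances([1, 2, 4]))
print(tree_distance(2, [1, 2, 4]))
```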

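Fitting D to M-level weighted ultrametric under $\|\cdot\|_1$: linear [0,1] relaxation
• Integer program formulation: $x^t_{uv} \in \{0,1\}$
• $x^t_{uv} = 1 \iff$ u, v separated at level t
• $0 \le x^M_{uv} \le x^{M-1}_{uv} \le \dots \le x^1_{uv} = 1$
• Triangle inequality at each level: $x^t_{uv} \le x^t_{uw} + x^t_{wv}$
• Cost: $\min \sum_{t=1}^{M} L_t \Big( \sum_{D(u,v) \le \delta_t} x^t_{uv} + \sum_{D(u,v) > \delta_t} (1 - x^t_{uv}) \Big)$, with $\delta_t = L_1 + \dots + L_t$
[Figure: tree T with level weights $L_1, L_2, \dots, L_M$ over leaves y, u, v, x, w; for the pair u, y: $x^1_{uy} = 1$, $x^2_{uy} = 0$, ..., $x^M_{uy} = 0$]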
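The constraints of the relaxation can be checked mechanically for a candidate solution. The sketch below covers only the feasibility conditions listed above (the chain $0 \le x^M \le \dots \le x^1 = 1$ and the per-level triangle inequality), not the cost. The function name `is_feasible` and the layout `x[t-1][u][v]` for $x^t_{uv}$ are assumptions made for illustration.

```python
def is_feasible(x, tol=1e-9):
    """Check the chain and triangle constraints for x[t-1][u][v] = x^t_uv."""
    M, n = len(x), len(x[0])
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            chain = [x[t][u][v] for t in range(M)]
            if abs(chain[0] - 1.0) > tol:          # x^1_uv = 1
                return False
            if chain[-1] < -tol:                   # 0 <= x^M_uv
                return False
            # x^{t+1}_uv <= x^t_uv for consecutive levels
            if any(chain[t + 1] > chain[t] + tol for t in range(M - 1)):
                return False
            for t in range(M):
                for w in range(n):
                    if w != u and w != v:
                        # triangle inequality at level t
                        if x[t][u][v] > x[t][u][w] + x[t][w][v] + tol:
                            return False
    return True

# Hypothetical 3-point, 2-level instance: points 0 and 1 stay together
# at level 2, point 2 is separated from both.
level1 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
level2 = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
print(is_feasible([level1, level2]))
```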

Rounding the LP: an $O(\log n \log\log n)$-approximation
• A divisive (top-down) algorithm
• At each level t = M, M-1, ..., 1:
  • Solve a multi-cut-like problem
  • Cluster so as to separate pairs u, v such that $x^t_{uv} \ge 2/3$
• Danger: High levels influence low ones!

General $\|\cdot\|_p$ cost
• Similar analysis gives the same bound for $\|\cdot\|_p^p$
• Therefore: $O((\log n \log\log n)^{1/p})$-approximation
• By [ABFPT99], applies also to fitting trees
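The step from the $\|\cdot\|_p^p$ bound to the $\|\cdot\|_p$ bound is a $p$-th root. As a sketch, with assumed notation not taken from the slides ($\alpha$ for the factor proved for $\|\cdot\|_p^p$, $T$ for the algorithm's output, $T^*$ for an optimal tree):

```latex
% Assumed notation: \alpha = O(\log n \log\log n), T = output, T^* = optimum.
% T^* minimizes both \|\cdot\|_p and \|\cdot\|_p^p because t -> t^p is increasing on t >= 0.
\[
  \|D - d_T\|_p^p \;\le\; \alpha \,\|D - d_{T^*}\|_p^p
  \quad\Longrightarrow\quad
  \|D - d_T\|_p \;\le\; \alpha^{1/p}\,\|D - d_{T^*}\|_p .
\]
```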

Future work
• $O(\log n)$-approximation algorithm? Better?
• Stronger lower bounds
• Derandomize the (M+1)-approximation algorithm
• Aggregation [ACN05]
• Applications

Thank you!
