
Cluster Ranking with an Application to Mining Mailbox Networks




Presentation Transcript


  1. Cluster Ranking with an Application to Mining Mailbox Networks Ziv Bar-Yossef Technion, Google Ido Guy Technion, IBM Ronny Lempel IBM Yoelle Maarek Google Vova Soroka IBM

  2. Clustering • A network: undirected graph with non-negative edge weights • w(u,v): “Similarity” between u and v • Does not necessarily correspond to a proper metric • The induced distance may not respect the triangle inequality • Examples: • Social networks. w(u,v) = strength of the relationship between u and v. • Biological networks. w(u,v) = genetic similarity between species u and v. • Document networks. w(u,v) = topical similarity between u and v. • Image networks. w(u,v) = color similarity/proximity between u and v. • Clustering: a partitioning of the network into regions of similarity • Communities in social networks • Species families in biological networks • Groups of documents on the same topic • Segments of an image
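The network model above can be sketched as a small data structure. A minimal sketch; the class and method names are illustrative, not from the paper:

```python
from collections import defaultdict

class Network:
    """Undirected graph with non-negative edge weights (similarities)."""

    def __init__(self):
        self.w = defaultdict(dict)  # w[u][v] = similarity between u and v

    def add_edge(self, u, v, weight):
        # Undirected: store the same non-negative weight in both directions.
        assert weight >= 0
        self.w[u][v] = weight
        self.w[v][u] = weight

    def nodes(self):
        return set(self.w)

g = Network()
g.add_edge("a", "b", 1.0)
g.add_edge("b", "c", 0.5)
print(g.nodes())
```

Note that nothing here assumes the weights form a metric: w("a","c") may be absent or arbitrarily small even though both pairs above are strongly similar.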

  3. The cluster abundance problem • Problem: A clustering algorithm sometimes produces masses of clusters • Large networks • Fuzzy/soft clustering • Needle in a haystack problem – which are the important clusters?

  4. Cluster ranking • Goals: • Define a cluster strength measure • Assigns a strength score to each subset of nodes • Design cluster ranking algorithm • Outputs the clusters in the network, ordered by their strength

  5. A simple example • strength(C) = |C|, if C is a clique • strength(C) = 0, if C is not a clique • Cluster ranking: • {a,b,c}, {d,e,f} • {c,g}, {g,f} • (slide figure: example graph on nodes a–g)
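The toy strength measure on this slide can be coded directly. The measure itself is from the slide; the edge set below is a guess at the slide's figure, and the last cluster is an extra non-clique added for contrast:

```python
from itertools import combinations

def is_clique(nodes, edges):
    """True iff every pair of nodes is joined by an edge."""
    return all(frozenset(p) in edges for p in combinations(nodes, 2))

def strength(cluster, edges):
    # The slide's toy measure: strength(C) = |C| if C is a clique, else 0.
    return len(cluster) if is_clique(cluster, edges) else 0

# Reconstructed edge set on nodes a-g (assumed from the ranking shown).
edges = {frozenset(e) for e in [("a","b"), ("b","c"), ("a","c"),
                                ("d","e"), ("e","f"), ("d","f"),
                                ("c","g"), ("g","f")]}
clusters = [{"a","b","c"}, {"d","e","f"}, {"c","g"}, {"g","f"},
            {"a","b","g"}]  # last one is not a clique here
ranking = sorted(clusters, key=lambda c: strength(c, edges), reverse=True)
print([strength(c, edges) for c in ranking])  # [3, 3, 2, 2, 0]
```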

  6. Our contributions • Cluster ranking framework • New cluster strength measure • Properly captures similarity among cluster members • Applicable to both weighted and unweighted networks • Arbitrary similarity weights • Efficiently computable • Cluster ranking algorithm • Application to mining communities in “personal mailbox networks”

  7. Cluster strength measure: Unweighted networks • (slide figure: two unweighted clusters, G1 and G2) • Which is a stronger cluster? • Cohesion = measure of strength for unweighted clusters • Cohesive cluster = does not “easily” break into pieces

  8. Edge separators • Edge separator: A subset of the network’s edges whose removal breaks the network into two or more connected components. • All previous work: cohesion(C) = “density” of “sparsest” edge separator • Different notions of density for edge separators: • Conductance [KannanVempalaVetta00] • Normalized cut [ShiMalik00] • Relative neighborhoods [FlakeLawrenceGiles00] • Edge betweenness [GirvanNewman02] • Modularity [GirvanNewman04]

  9. Edge separators are not good enough • (slide figure: two cliques of size m joined by an edge (u,v), and two cliques of size m sharing a single vertex v) • True: sparse edge separator ⇒ non-cohesive cluster • False: no sparse edge separator ⇒ cohesive cluster

  10. Vertex separators • Vertex separator: A subset S of the network’s vertices whose removal breaks the network into two or more connected components, A and B • Our strength measure: cohesion(C) = “density” of the “sparsest” vertex separator • A separator is “sparse” if • S is small • A, B are “balanced”
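Checking whether a vertex set is a separator is mechanical: remove it and count connected components. A minimal sketch with illustrative names; the paper's actual sparsity/density scoring of separators is not reproduced here:

```python
def components(nodes, adj):
    """Connected components of the subgraph induced on `nodes` (iterative DFS)."""
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(v for v in adj.get(u, ()) if v in nodes)
        seen |= comp
        comps.append(comp)
    return comps

def is_vertex_separator(S, nodes, adj):
    # S is a vertex separator iff removing it leaves >= 2 components.
    return len(components(nodes - S, adj)) >= 2

# Toy graph: a-b, a-c, b-d, c-d, d-e; removing d isolates e.
adj = {"a": {"b","c"}, "b": {"a","d"}, "c": {"a","d"}, "d": {"b","c","e"}, "e": {"d"}}
nodes = set(adj)
print(is_vertex_separator({"d"}, nodes, adj))  # True
print(is_vertex_separator({"e"}, nodes, adj))  # False
```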

  11. Vertex separators are better • (slide figure: two cliques of size m joined by an edge (u,v), and two cliques of size m sharing a single vertex v) • Sparse edge separator ⇒ sparse vertex separator ⇒ non-cohesive cluster • Sparse vertex separator ⇒ non-cohesive cluster

  12. Cluster strength measure: Weighted networks • (slide figure: two weighted clusters, G1 with edge weights 1 and G2 with edge weights 10) • Which is a stronger cluster? • Cohesion is no longer the sole factor determining cluster strength

  13. Thresholding • Traditional approach for dealing with weighted networks • Transforms the weighted network into an unweighted network GT via a threshold T, keeping only edges whose weight passes the threshold • (slide figures: GT for a threshold T < 1 and for 1 ≤ T < 5) • No single threshold is suitable
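Thresholding itself is a one-liner. A sketch; whether the comparison is strict or not is a convention the slides leave open, and `>=` is assumed here:

```python
def threshold_graph(weights, T):
    """Unweighted edge set of G_T: keep edges with weight >= T.
    `weights` maps frozenset({u, v}) -> non-negative weight."""
    return {e for e, w in weights.items() if w >= T}

# Toy weighted network with weights 1 and 10 (illustrative, not the slide's).
weights = {frozenset(e): w
           for e, w in [(("a","b"), 10), (("b","c"), 1), (("c","d"), 10)]}
print(len(threshold_graph(weights, 1)))   # 3: all edges survive
print(len(threshold_graph(weights, 5)))   # 2: the weight-1 edge is dropped
```

The "no single threshold is suitable" point falls out immediately: any fixed T either keeps the weak edges or discards structure held together by them.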

  14. Integrated cohesion • (slide figure: Cohesion(GT) as a function of T, for G1 and G2) • Which is a stronger cluster? • Small T ⇒ G1 is stronger • Large T ⇒ G2 is stronger • Integrated cohesion: area under the curve • Strong cluster: sustains high cohesion while the threshold increases
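Since Cohesion(GT) is a step function of T, the area under the curve reduces to a finite sum. A sketch of that integration; the curve below is hypothetical, not the slide's example:

```python
def integrated_cohesion(curve):
    """Area under the step function Cohesion(G_T).
    `curve` is a sorted list of (threshold, cohesion) pairs; each cohesion
    value is assumed to hold from its threshold up to the next breakpoint,
    and the last pair marks the end of integration."""
    area = 0.0
    for (t0, c), (t1, _) in zip(curve, curve[1:]):
        area += c * (t1 - t0)
    return area

# Hypothetical step curve: cohesion 1 on [0,3), 0.667 on [3,7), 0.333 on [7,10).
curve = [(0, 1.0), (3, 0.667), (7, 0.333), (10, 0.0)]
print(round(integrated_cohesion(curve), 3))  # 6.667
```

The breakpoints need only be the distinct edge weights, which is exactly the observation made later on slide 27.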

  15. C-Rank - Cluster Ranking Algorithm • Candidate identification • Ranking by strength score • Elimination of non-maximal clusters

  16. Candidate identification: Unweighted networks • Given an unweighted network G • Find a sparse vertex separator S of G • The network splits into disconnected components A1,…,Ak • Clusters = S∪A1,…,S∪Ak • Recurse on S∪A1,…,S∪Ak • (slide figure: separator S and components A1,…,A5)
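The recursion on this slide can be sketched as follows. The separator-finding subroutine is abstracted into a caller-supplied oracle (the paper computes it separately via vertex betweenness); all names are illustrative:

```python
def _components(nodes, adj):
    """Connected components of the subgraph induced on `nodes` (iterative DFS)."""
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(v for v in adj.get(u, ()) if v in nodes)
        seen |= comp
        comps.append(comp)
    return comps

def candidates(nodes, adj, find_separator):
    """Split on a sparse vertex separator S, add S back to every component,
    and recurse. `find_separator` returning None stops the recursion
    (e.g. when the subgraph is a clique)."""
    out = [frozenset(nodes)]
    S = find_separator(nodes, adj)
    if not S:
        return out
    for comp in _components(set(nodes) - set(S), adj):
        out += candidates(comp | set(S), adj, find_separator)
    return out

# The example of slides 17-18: S = {c,d}, components {a,b} and {e}.
adj = {"a": {"b","c","d"}, "b": {"a","c","d"}, "c": {"a","b","d","e"},
       "d": {"a","b","c","e"}, "e": {"c","d"}}
sep = lambda nodes, adj: {"c", "d"} if len(nodes) == 5 else None  # toy oracle
print(sorted(sorted(c) for c in candidates(set(adj), adj, sep)))
# [['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd', 'e'], ['c', 'd', 'e']]
```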

  17. Candidate identification - Example • (slide figure: graph on nodes a–e) • Sparse separator: S = {c,d} • Connected components: A1 = {a,b}, A2 = {e} • Add back {c,d} to A1 and A2

  18. Candidate identification - Example • (slide figure: S∪A1 = {a,b,c,d} and S∪A2 = {c,d,e}) • Sparse separator: S = {c,d} • Connected components: A1 = {a,b}, A2 = {e} • Add back {c,d} to A1 and A2 • Since both components are cliques, no recursive calls are made

  19. Mailbox networks • Nodes: contacts appearing in headers of messages in a person’s mailbox • Excluding the mailbox owner • Edges: connect contacts who co-occur in the same message header • Edge weights: frequency of co-occurrence • This is an egocentric social network • Reflects the subjective perspective of the mailbox owner
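The construction can be sketched as a co-occurrence count over message headers; function and variable names are illustrative:

```python
from collections import Counter
from itertools import combinations

def mailbox_network(messages, owner):
    """Egocentric mailbox network: nodes are contacts (owner excluded);
    the weight of edge (u, v) is the number of message headers in which
    u and v co-occur. `messages` is an iterable of contact lists
    (sender plus recipients for each header)."""
    weights = Counter()
    for header in messages:
        contacts = sorted(set(header) - {owner})  # drop the owner, dedupe
        for u, v in combinations(contacts, 2):
            weights[(u, v)] += 1
    return weights

# Three hypothetical headers: a,b co-occur twice; b,c co-occur twice.
msgs = [["owner", "a", "b", "c"], ["owner", "a", "b"], ["b", "c", "owner"]]
print(mailbox_network(msgs, "owner"))
```

Sorting the contacts gives each unordered pair a canonical key, so (u, v) and (v, u) accumulate into the same counter entry.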

  20. Mining mailbox networks • Our goal • Given: a mailbox network G • Output: a ranking of communities in G • Motivation • Advanced email client features • Automatic group completion and correction • Automatic group classification (colleagues, friends, spouse, etc.) • Identification of “spam groups” and management of blocked lists • Intelligence & law enforcement • Mine mailboxes of suspected terrorists and criminals

  21. Ziv Bar-Yossef’s top 10 communities

  22. Experiments • Enron Email Dataset (http://www.cs.cmu.edu/~enron/) • Made publicly available during the investigation of the Enron fraud • ~150 mailboxes of Enron employees • More than 500,000 messages • Compared with another clustering algorithm • EB-Rank - Adaptation of the popular edge betweenness algorithm [GirvanNewman02] to our framework

  23. Relative recall

  24. Score comparison

  25. Conclusions • The cluster ranking problem as a novel framework for clustering • Integrated cohesion as a strength measure for overlapping clusters in weighted networks • C-Rank: A new cluster ranking algorithm • Application: mining mailbox networks

  26. Thank You

  27. Integrated cohesion • (slide figure: Cohesion(GT) as a function of T, for G1 and G2) • Which is a stronger cluster? • Note: to compute the integral, we need GT only for values of T that equal the distinct edge weights

  28. Integrated cohesion - Example • (slide figure: weighted example graph G and the Cohesion(GT) plot) • Cohesion = 1

  29. Integrated cohesion - Example • (slide figure: GT after raising the threshold) • Cohesion = 0.667

  30. Integrated cohesion - Example • (slide figure: GT at a higher threshold) • Cohesion = 0.333 • int_cohesion(G) = 3 + 2.333 + 1 = 6.333

  31. Cluster subsumption and maximality • C is maximal iff partitioning any super-set of C into clusters leaves C intact • S = sparsest separator of C • (C1, C2): induced cover of C • S = sparsest separator of D • (D1, D2): induced cover of D • C1 ⊆ D1, C2 ⊆ D2 • D subsumes C ⇒ C is not maximal • (slide figure: D with cover (D1, D2) subsuming C with cover (C1, C2))

  32. Candidate identification: Weighted networks • Apply a threshold T = 0 on G • (slide figure: weighted graph G on nodes a–e, edge weights 2 and 5)

  33. Candidate identification: Weighted networks • Run unweighted candidate identification on G0 • (slide figure: unweighted graph on nodes a–e)

  34. Candidate identification: Weighted networks • Recurse on ‘abcd’ and ‘cde’ separately • (slide figure: the two subgraphs)

  35. Candidate identification: Weighted networks • Apply threshold T = 2 on ‘abcd’ • (slide figure: weighted subgraph on nodes a–d)

  36. Candidate identification: Weighted networks • Apply threshold T = 2 on ‘abcd’ • Recurse on ‘abc’ • No recursive call on singleton ‘d’ • (slide figure: the resulting subgraph)

  37. Candidate identification: Weighted networks • Apply threshold T = 5 on ‘abc’ • (slide figure: triangle on a, b, c with edge weights 5)

  38. Candidate identification: Weighted networks • Apply threshold T = 5 on ‘abc’ • No recursive call on singletons ‘a’, ‘b’, ‘c’

  39. Candidate identification: Weighted networks • Final candidate list: • ‘abcde’ • ‘abcd’ • ‘abc’ • ‘cde’ • (slide figure: the original weighted graph)

  40. Computing sparse vertex separators • Complexity of Sparsest Vertex Separator • NP-hard • Can be approximated in polynomial time via Semi-Definite Programming [FeigeHajiaghayiLee05] • SDP might be inefficient in practice • We find sparse vertex separators via Vertex Betweenness [Freeman77] • Efficiently computable via dynamic programming • Works well empirically • In worst-case, approximation can be weak


  42. Normalized Vertex Betweenness (NVB) [Freeman77] • (slide figure: two cliques of size m sharing a vertex v) • Vertex Betweenness (VB) of a node v: number of shortest paths passing through v • Ex: ~m² for v, 0 for the other vertices • Normalized Vertex Betweenness (NVB): divide by the maximum possible value to get values in [0,1] • NVB(G): maximum NVB value over all nodes • Theorem: cohesion(G) ≥ 1/(1 + |G| · NVB(G)) • In practice: cohesion(G) ≈ 1/(1 + |G| · NVB(G))
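For small graphs, vertex betweenness can be computed by brute force over all node pairs. A sketch only: real implementations use Brandes-style dynamic programming, and the normalizer used here (the number of pairs not involving v) is an assumption, since the slide's exact formula is in the figure:

```python
from collections import deque
from itertools import combinations

def shortest_path_counts(adj, s):
    """BFS from s: returns (dist, sigma), where sigma[v] counts shortest s-v paths."""
    dist, sigma = {s: 0}, {s: 1}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                q.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def normalized_betweenness(adj):
    """NVB(v): summed fraction of s-t shortest paths through v, divided by the
    number of pairs not involving v, so values lie in [0, 1]."""
    n = len(adj)
    nvb = dict.fromkeys(adj, 0.0)
    for s, t in combinations(adj, 2):
        dist_s, sig_s = shortest_path_counts(adj, s)
        dist_t, sig_t = shortest_path_counts(adj, t)
        if t not in dist_s:
            continue  # s and t are disconnected
        for v in adj:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            if dist_s[v] + dist_t[v] == dist_s[t]:  # v lies on a shortest s-t path
                nvb[v] += sig_s[v] * sig_t[v] / sig_s[t]
    pairs = (n - 1) * (n - 2) / 2
    return {v: b / pairs for v, b in nvb.items()}

# The slide's example shape with m = 3: two triangles sharing vertex v.
adj = {"a": {"b","v"}, "b": {"a","v"}, "c": {"d","v"}, "d": {"c","v"},
       "v": {"a","b","c","d"}}
print(round(normalized_betweenness(adj)["v"], 3))  # 0.667
```

Every cross-pair (one node from each triangle) routes its unique shortest path through v, so v's betweenness dominates, exactly the behavior the slide's bound exploits.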

  43. Candidate identification: Weighted networks • Ideal algorithm: • Iterate over all possible thresholds T • Output all clusters in GT • Somewhat inefficient • Actual algorithm: • Apply threshold T = min weight in G • Output the clusters of GT • For each clique C in GT, recurse on C

  44. C-Rank: Analysis • Theorem: C-Rank is guaranteed to output all the maximal clusters. • Lemma: C-Rank runs in time polynomial in its output length.

  45. Mailbox networks • An egocentric social network • Reflects the subjective perspective of the mailbox owner • Nodes: contacts appearing in message headers • Excluding the mailbox owner • Edges: connect contacts who co-occur in the same message header • Edge weights: frequency of co-occurrence • Message: a → b, c, d, and owner • Message: c → d, e, and owner • (slide figure: the resulting weighted network on nodes a–e)

  46. Mailbox networks • Same construction, after one more message • Message: b → owner • (slide figure: the updated network)

  47. Mailbox networks • Same construction • (slide figure: the network with final edge weights)

  48. Ido Guy’s top 10 communities

  49. Estimated precision
