
Local Sparsification for Scalable Module Identification in Networks

Local Sparsification for Scalable Module Identification in Networks. Srinivasan Parthasarathy. Joint work with V. Satuluri, Y. Ruan, D. Fuhry, Y. Zhang. Data Mining Research Laboratory, Dept. of Computer Science and Engineering, The Ohio State University.


Presentation Transcript


  1. Local Sparsification for Scalable Module Identification in Networks • Srinivasan Parthasarathy • Joint work with V. Satuluri, Y. Ruan, D. Fuhry, Y. Zhang • Data Mining Research Laboratory, Dept. of Computer Science and Engineering, The Ohio State University

  2. The Data Deluge “Every 2 days we create as much information as we did up to 2003” – Eric Schmidt, former Google CEO

  3. Data Storage Costs are Low $600 buys a disk drive that can store all of the world’s music [McKinsey Global Institute Special Report, June ’11]

  4. Data does not exist in isolation.

  5. Data almost always exists in connection with other data.

  6. Social networks Protein Interactions Internet Neighborhood graphs Data dependencies VLSI networks

  7. All this data is only useful if we can scalably extract useful knowledge

  8. Challenges 1. Large Scale Billion-edge graphs commonplace Scalable solutions are needed

  9. Challenges 2. Noise Links on the web and protein interactions are noisy; this noise needs to be alleviated

  10. Challenges 3. Novel structure Hub nodes, small world phenomena, clusters of varying densities and sizes, directionality Novel algorithms or techniques are needed

  11. Challenges 4. Domain Specific Needs E.g. Balance, Constraints etc. Need mechanisms to specify

  12. Challenges 5. Network Dynamics How do communities evolve? Which actors have influence? How do clusters change as a function of external factors?

  13. Challenges 6. Cognitive Overload Need to support guided interaction for human in the loop

  14. Our Vision and Approach • Application Domains: Bioinformatics (ISMB’07, ISMB’09, ISMB’12, ACM BCB’11, BMC’12); Social Network and Social Media Analysis (TKDD’09, WWW’11, WebSci’12, WebSci’12) • Graph Pre-processing: Sparsification (SIGMOD’11, WebSci’12); Near-Neighbor Search for non-graph data (PVLDB’12); Symmetrization for directed graphs (EDBT’10) • Core Clustering: Consensus Clustering (KDD’06, ISMB’07); Viewpoint Neighborhood Analysis (KDD’09); Graph Clustering via Stochastic Flows (KDD’09, BCB’10) • Dynamic Analysis and Visualization: Event-Based Analysis (KDD’07, TKDD’09); Network Visualization (KDD’08); Density Plots (SIGMOD’08, ICDE’12) • Scalable Implementations and Systems Support on Modern Architectures: Multicore Systems (VLDB’07, VLDB’09), GPUs (VLDB’11), STI Cell (ICS’08), Clusters (ICDM’06, SC’09, PPoPP’07, ICDE’10)

  15. Graph Sparsification for Community Discovery SIGMOD ’11, WebSci’12

  16. Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?

  17. Graph Clustering and Community Discovery Given a graph, discover groups of nodes that are strongly connected to one another but weakly connected to the rest of the graph.

  18. Graph Clustering: Applications Social network and graph compression; direct analytics on the compressed representation

  19. Graph Clustering: Applications Optimize VLSI layout

  20. Graph Clustering: Applications Protein function prediction

  21. Graph Clustering: Applications Data distribution to minimize communication and balance load

  22. Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?

  23. Preview Original vs. sparsified [Automatically visualized using Prefuse]

  24. The promise Clustering algorithms can run much faster and be more accurate on a sparsified graph. Ditto for Network Visualization

  25. Utopian Objective Retain edges which are likely to be intra-cluster edges, while discarding likely inter-cluster edges.

  26. A way to rank edges on “strength” or similarity.

  27. Algorithm: Global Sparsification (G-Spar) • Parameter: sparsification ratio s • 1. For each edge <i,j>: calculate Sim(<i,j>) • 2. Retain the top s% of edges in order of Sim; discard the others
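The two steps above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the graph is an adjacency dict, and Sim is taken to be Jaccard similarity of the endpoints' neighborhoods (one common choice; the slide leaves the measure abstract).

```python
# Hypothetical sketch of G-Spar (not the authors' code).
# adj: dict mapping each node to the set of its neighbors (undirected graph).

def jaccard(adj, i, j):
    """Jaccard similarity of the neighbor sets of nodes i and j."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

def g_spar(adj, s):
    """Keep the top fraction s of all edges, ranked globally by similarity."""
    edges = {tuple(sorted((i, j))) for i in adj for j in adj[i]}
    ranked = sorted(edges, key=lambda e: jaccard(adj, *e), reverse=True)
    return set(ranked[: max(1, int(s * len(ranked)))])

# A dense triangle {1,2,3} plus a sparse pair {4,5}: with s = 0.5 only
# triangle edges survive, illustrating the bias discussed on the next slide.
demo = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
print(g_spar(demo, 0.5))
```

Note how the global ranking starves the sparse pair: edge (4,5) has zero neighborhood overlap and is always discarded first.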

  28. Dense clusters are over-represented, sparse clusters under-represented. Works well when the goal is just to find the top communities

  29. Algorithm: Local Sparsification (L-Spar) • Parameter: sparsification exponent e (0 < e < 1) • 1. For each node i of degree d_i: (a) for each neighbor j, calculate Sim(<i,j>); (b) retain the top (d_i)^e neighbors in order of Sim for node i Underscoring the importance of local ranking
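A minimal sketch of the local variant, under the same assumptions as before (Jaccard as the similarity; the exact rounding of d_i^e is my assumption, not taken from the talk). An edge is kept if either endpoint ranks it among its top (d_i)^e neighbors.

```python
# Hypothetical sketch of L-Spar (not the authors' code).
# adj: dict mapping each node to the set of its neighbors (undirected graph).

def jaccard(adj, i, j):
    """Jaccard similarity of the neighbor sets of nodes i and j."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

def l_spar(adj, e):
    """Each node i keeps its top round(d_i ** e) neighbors by similarity."""
    kept = set()
    for i, nbrs in adj.items():
        k = max(1, round(len(nbrs) ** e))   # rounding rule is an assumption
        for j in sorted(nbrs, key=lambda j: jaccard(adj, i, j), reverse=True)[:k]:
            kept.add(tuple(sorted((i, j))))
    return kept

# Same graph as in the G-Spar sketch: the sparse pair {4,5} now keeps its
# edge, because ranking is local to each node rather than global.
demo = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
print(l_spar(demo, 0.5))
```

The contrast with the global variant is the point of the slide: node 4 has only one neighbor, and locally that neighbor is its best, so the edge survives regardless of how weak it looks globally.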

  30. Ensures representation of clusters of varying densities

  31. But... Similarity computation is expensive!

  32. A randomized, approximate solution based on Minwise Hashing [Broder et al., 1998]

  33. Minwise Hashing Universe: { dog, cat, lion, tiger, mouse } Two random orderings: [ cat, mouse, lion, dog, tiger ] and [ lion, cat, mouse, dog, tiger ] For A = { mouse, lion }: mh1(A) = min( { mouse, lion } ) = mouse (under the first ordering); mh2(A) = min( { mouse, lion } ) = lion (under the second ordering)
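The slide's example can be replayed directly. This is a toy implementation that uses explicit permutations of the universe; a practical version would use random hash functions instead.

```python
# Toy min-hashing, mirroring the slide: the min-hash of a set is whichever
# of its members appears earliest in a random ordering of the universe.

def minhash(s, perm):
    """Member of s appearing earliest in the permutation perm."""
    return min(s, key=perm.index)

A = {"mouse", "lion"}

perm1 = ["cat", "mouse", "lion", "dog", "tiger"]  # first random ordering
perm2 = ["lion", "cat", "mouse", "dog", "tiger"]  # second random ordering

print(minhash(A, perm1))  # mouse
print(minhash(A, perm2))  # lion
```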

  34. Key Fact For two sets A, B and a min-hash function mh_i(): Pr[ mh_i(A) = mh_i(B) ] = |A ∩ B| / |A ∪ B| = Sim(A, B) Unbiased estimator for Sim using k hashes: (1/k) Σ_{i=1..k} 1[ mh_i(A) = mh_i(B) ]
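A quick numerical check of the estimator (a sketch: the universe, the two sets, and the hash count k = 200 are all made up for illustration).

```python
import random

def signature(s, perms):
    """k-element min-hash signature of set s, one entry per permutation."""
    return [min(s, key=p.index) for p in perms]

def est_sim(sa, sb):
    """Fraction of agreeing hash positions: unbiased estimate of Jaccard."""
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

universe = list("abcdefgh")
A, B = set("abcd"), set("abef")   # true Jaccard = |{a,b}| / |{a..f}| = 1/3

rng = random.Random(0)            # fixed seed for reproducibility
perms = []
for _ in range(200):              # k = 200 hashes
    p = universe[:]
    rng.shuffle(p)
    perms.append(p)

print(est_sim(signature(A, perms), signature(B, perms)))  # close to 0.333
```

With k = 200 the estimate lands near the true value of 1/3; as the slide notes later, a much smaller k already suffices when only the relative ranking of edges matters.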

  35. Time complexity using Minwise Hashing: O(k·m) for k hashes and m edges Only 2 sequential passes over input. Great for disk-resident data Note: exact similarity is less important than relative ranking, so a lower k suffices

  36. Theoretical Analysis of L-Spar: Main Results • Q: Why choose top d^e edges for a node of degree d? • A: Conservatively sparsify low-degree nodes, aggressively sparsify hub nodes. Easy to control degree of sparsification. • Proposition: If the input graph has a power-law degree distribution with exponent α, then the sparsified graph also has a power-law degree distribution, with a steeper exponent determined by α and e. • Corollary: The sparsification ratio corresponding to exponent e is bounded as a function of α and e. • For α = 2.1 and e = 0.5, ~17% of edges will be retained. • Higher α (steeper power-laws) and/or lower e leads to more sparsification.
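A back-of-the-envelope change-of-variables calculation (an illustration only, not the paper's proof; it ignores edges a node inherits because a neighbor ranked them highly) suggests why the power law survives sparsification:

```latex
% Degree density p(d) \propto d^{-\alpha}; each node of degree d retains d' = d^e edges.
% Change of variables d = (d')^{1/e}, so \; dd/dd' = \tfrac{1}{e}(d')^{1/e-1}:
p(d') \;\propto\; (d')^{-\alpha/e}\,(d')^{1/e-1}
      \;=\; (d')^{-\left(\frac{\alpha-1}{e}+1\right)}
% i.e. again a power law, with a steeper exponent whenever e < 1 and \alpha > 1.
```

Under this rough model, α = 2.1 and e = 0.5 would give an exponent of 3.2 for the sparsified graph, consistent with the slide's claim that the distribution steepens.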

  37. Experiments • Datasets: 3 PPI networks (BioGrid, DIP, Human); 2 information networks (Wiki, Flickr) and 2 social networks (Orkut, Twitter); largest network (Orkut) has roughly a billion edges; ground truth available for the PPI networks and Wiki • Clustering algorithms: Metis [Karypis & Kumar ’98], MLR-MCL [Satuluri & Parthasarathy ’09], Metis+MQI [Lang & Rao ’04], Graclus [Dhillon et al. ’07], spectral methods [Shi ’00], edge-based agglomerative/divisive methods [Newman ’04] • Compared sparsifications: L-Spar, G-Spar, RandomEdge, and ForestFire

  38. Results Using Metis [Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM ]

  39. Results Using Metis Same sparsification ratio for all 3 methods.

  40. Results Using Metis Good speedups, but typically a loss in quality.

  41. Results Using Metis Great speedups and quality.

  42. L-Spar: Results Using MLR-MCL

  43. L-Spar: Qualitative Examples Twitter executives, Silicon Valley figures

  44. Impact of Sparsification on Noisy Data As the graphs get noisier, L-Spar is increasingly beneficial.

  45. Impact of Sparsification on Spectrum: Yeast PPI

  46. Impact of Sparsification on Spectrum: Epinions Local sparsification closely tracks the spectral trends of the original graph; global sparsification results in multiple disconnected components

  47. Impact of Sparsification on Spectrum: Human PPI

  48. Impact of Sparsification on Spectrum: Flickr

  49. Anatomy of a density plot: some measure of density, plotted against a specific ordering of the vertices in the graph

  50. Density Overlay Plots Visual comparison between global and local sparsification
