Local Sparsification for Scalable Module Identification in Networks
  • Srinivasan Parthasarathy
  • Joint work with V. Satuluri, Y. Ruan, D. Fuhry, Y. Zhang

Data Mining Research Laboratory

Dept. of Computer Science and Engineering

The Ohio State University

The Data Deluge

“Every 2 days we create as much information as we did up to 2003” – Eric Schmidt, Google ex-CEO

Data Storage Costs are Low

$600 to buy a disk drive that can store all of the world’s music

[McKinsey Global Institute Special Report, June ’11]


Social networks

Protein Interactions

Internet

Neighborhood graphs

Data dependencies

VLSI networks


Challenges

1. Large Scale

Billion-edge graphs commonplace

Scalable solutions are needed


Challenges

2. Noise

Links on the web, protein interactions

Need to alleviate this noise


Challenges

3. Novel structure

Hub nodes, small world phenomena, clusters of varying densities and sizes, directionality

Novel algorithms or techniques are needed


Challenges

4. Domain Specific Needs

E.g. balance, constraints, etc.

Need mechanisms to specify them


Challenges

5. Network Dynamics

How do communities evolve? Which actors have influence? How do clusters change as a function of external factors?


Challenges

6. Cognitive Overload

Need to support guided interaction for human in the loop


Our Vision and Approach

  • Application Domains
    • Bioinformatics (ISMB’07, ISMB’09, ISMB’12, ACM BCB’11, BMC’12)
    • Social Network and Social Media Analysis (TKDD’09, WWW’11, WebSci’12, WebSci’12)
  • Graph Pre-processing
    • Sparsification (SIGMOD ’11, WebSci’12)
    • Near Neighbor Search, for non-graph data (PVLDB ’12)
    • Symmetrization, for directed graphs (EDBT ’10)
  • Core Clustering
    • Consensus Clustering (KDD’06, ISMB’07)
    • Viewpoint Neighborhood Analysis (KDD ’09)
    • Graph Clustering via Stochastic Flows (KDD ’09, BCB ’10)
  • Dynamic Analysis and Visualization
    • Event-Based Analysis (KDD’07, TKDD’09)
    • Network Visualization (KDD’08)
    • Density Plots (SIGMOD’08, ICDE’12)
  • Scalable Implementations and Systems Support on Modern Architectures
    • Multicore Systems (VLDB’07, VLDB’09), GPUs (VLDB’11), STI Cell (ICS’08), Clusters (ICDM’06, SC’09, PPoPP’07, ICDE’10)


Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?


Graph Clustering and Community Discovery

Given a graph, discover groups of nodes that are strongly connected to one another but weakly connected to the rest of the graph.


Graph Clustering: Applications

Social Network and Graph Compression

Direct Analytics on compressed representation


Graph Clustering: Applications

Optimize VLSI layout


Graph Clustering: Applications

Protein function prediction


Graph Clustering: Applications

Data distribution to minimize communication and balance load


Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?


Preview

[Figure: the original graph vs. the sparsified graph, automatically visualized using Prefuse]


The promise

Clustering algorithms can run much faster and be more accurate on a sparsified graph.

Ditto for Network Visualization


Utopian Objective

Retain edges which are likely to be intra-cluster edges, while discarding likely inter-cluster edges.


Algorithm: Global Sparsification (G-Spar)

  • Parameter: sparsification ratio, s
  • 1. For each edge <i,j>:
      • (i) Calculate Sim(<i,j>)
  • 2. Retain the top s% of edges in order of Sim; discard the others (sketched below)
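A minimal sketch of this global procedure, assuming Sim is the Jaccard similarity of the two endpoints' adjacency lists (as in the min-hash discussion later); the helper names are illustrative, not from the talk's code:

import heapq

def jaccard(a, b):
    """Jaccard similarity of two neighbor sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def global_sparsify(adj, s):
    """G-Spar sketch: keep the top s fraction of edges, ranked globally by similarity.

    adj: dict mapping each node to its set of neighbors (undirected graph).
    s:   sparsification ratio in (0, 1].
    """
    edges = {(min(i, j), max(i, j)) for i in adj for j in adj[i]}
    scored = [(jaccard(adj[i], adj[j]), (i, j)) for (i, j) in edges]
    top = heapq.nlargest(max(1, int(s * len(scored))), scored)
    return {edge for (_, edge) in top}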

Dense clusters are over-represented, sparse clusters under-represented

Works great when the goal is to just find the top communities


Algorithm: Local Sparsification (L-Spar)

  • Parameter: sparsification exponent, e (0 < e < 1)
  • 1. For each node i of degree d_i:
      • (i) For each neighbor j:
          • (a) Calculate Sim(<i,j>)
      • (ii) Retain the top (d_i)^e neighbors of i in order of Sim (sketched below)

Underscoring the importance of Local Ranking
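A matching sketch of the local variant. The similarity here is again exact Jaccard; the talk's actual approach swaps in the min-hash estimate described next, and the ceiling on (d_i)^e is an assumption for handling non-integer values:

import math

def jaccard(a, b):
    """Jaccard similarity of two neighbor sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def local_sparsify(adj, e=0.5):
    """L-Spar sketch: node i keeps its ceil(d_i ** e) most similar neighbors.

    adj: dict mapping each node to its set of neighbors (undirected graph).
    e:   sparsification exponent, 0 < e < 1.
    """
    kept = set()
    for i, neighbors in adj.items():
        ranked = sorted(neighbors, key=lambda j: jaccard(adj[i], adj[j]), reverse=True)
        for j in ranked[: math.ceil(len(neighbors) ** e)]:
            kept.add((min(i, j), max(i, j)))  # an edge survives if either endpoint selects it
    return kept

Because each node ranks only its own neighbors, sparse clusters keep some edges rather than being crowded out by globally denser regions.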


But...

Similarity computation is expensive!


A randomized, approximate solution based on Minwise Hashing [Broder et al., 1998]


Minwise Hashing

Universe: { dog, cat, lion, tiger, mouse }

Random permutation 1: [ cat, mouse, lion, dog, tiger ]

Random permutation 2: [ lion, cat, mouse, dog, tiger ]

A = { mouse, lion }

mh_i(A) returns the element of A that appears earliest under permutation i:

mh1(A) = min( { mouse, lion } ) = mouse

mh2(A) = min( { mouse, lion } ) = lion


Key Fact

For two sets A, B, and a min-hash function mh_i():  Pr[ mh_i(A) = mh_i(B) ] = |A ∩ B| / |A ∪ B| = Jaccard(A, B)

Unbiased estimator for Sim using k hashes: the fraction of hash functions on which A and B collide, i.e. Sim_est(A, B) = (1/k) · Σ_{i=1..k} 1[ mh_i(A) = mh_i(B) ]
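A small illustrative sketch of this estimator. Salted built-in hashing stands in for random permutations, and the second set B is made up here just to have something to compare against:

import random

def minhash_signature(items, k=100, seed=42):
    """k min-hash values of a set, one per salted hash function."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimated_sim(sig_a, sig_b):
    """Fraction of hash functions on which the two signatures collide."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = {"mouse", "lion"}   # the set from the slide
B = {"mouse", "cat"}    # hypothetical second set; true Jaccard(A, B) = 1/3
print(estimated_sim(minhash_signature(A), minhash_signature(B)))  # roughly 0.33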


Time complexity using Minwise Hashing: O(#hashes × #edges)

Only 2 sequential passes over the input – great for disk-resident data.

Note: exact similarity is less important – we really just care about relative ranking, so a lower k suffices.

Theoretical Analysis of L-Spar: Main Results
  • Q: Why choose the top d^e edges for a node of degree d?
    • A: Conservatively sparsify low-degree nodes, aggressively sparsify hub nodes (e.g. with e = 0.5, a node of degree 4 keeps 2 edges, while a hub of degree 10,000 keeps only 100). Easy to control the degree of sparsification.
  • Proposition: If the input graph has a power-law degree distribution with exponent α, then the sparsified graph also has a power-law degree distribution, with an exponent determined by α and e.
  • Corollary: The sparsification ratio corresponding to exponent e is bounded in terms of α and e.
    • For α = 2.1 and e = 0.5, ~17% of edges will be retained.
    • Higher α (steeper power laws) and/or lower e leads to more sparsification.
Experiments
  • Datasets
    • 3 PPI networks (BioGrid, DIP, Human)
    • 2 Information (Wiki, Flickr) & 2 Social (Orkut, Twitter) networks
    • Largest network (Orkut), roughly a Billion edges
    • Ground truth available for PPI networks and Wiki
  • Clustering algorithms
    • Metis [Karypis & Kumar ‘98], MLR-MCL [Satuluri & Parthasarathy ‘09], Metis+MQI [Lang & Rao ‘04], Graclus [Dhillon et al. ’07], Spectral methods [Shi ’00], Edge-based agglomerative/divisive methods [Newman ’04]
  • Compared sparsifications
    • L-Spar, G-Spar, RandomEdge and ForestFire
Results Using Metis

[Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM]

Same sparsification ratio for all 3 methods.

Good speedups, but typically loss in quality.

Great speedups and quality.

L-Spar: Results Using MLR-MCL

L-Spar: Qualitative Examples

Twitter executives, Silicon Valley figures


Impact of Sparsification on Noisy Data

As the graphs get noisier, L-Spar is increasingly beneficial.


Impact of Sparsification on Spectrum: Epinions

Local sparsification seems to match the trends of the original graph; global sparsification results in multiple components.

Anatomy of a density plot

  • Some measure of density
  • A specific ordering of the vertices in the graph

Density Overlay Plots

Visual comparison of global vs. local sparsification

Summary

  • Sparsification: simple pre-processing that makes a big difference
    • Only tens of seconds to execute on multi-million-node graphs.
    • Reduces clustering time from hours down to minutes.
    • Improves accuracy for several algorithms by removing noisy edges.
    • Helps visualization.
  • Ongoing and future work
    • The spectral results suggest one might be able to provide a theoretical rationale – can we tease it out?
    • Investigate other kinds of graphs, incorporate content, and explore novel applications (e.g. wireless sensor networks, VLSI design).


Prior Work

  • Random edge sampling [Karger ‘94]
  • Sampling in proportion to effective resistances: good guarantees, but very slow [Spielman and Srivastava ‘08]
  • Matrix sparsification [Arora et al. ’06]: fast, but same as random sampling in the absence of weights.
Modularity (from Wikipedia )
  • Modularity is the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random. The value of the modularity lies in the range [−1/2,1). It is positive if the number of edges within groups exceeds the number expected on the basis of chance.
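In symbols, the standard formulation of this definition (A is the adjacency matrix, k_i the degree of node i, m the number of edges, c_i the community of node i, and δ the Kronecker delta):

Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)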
The MCL algorithm

Input: A, the adjacency matrix

Initialize M to M_G, the canonical transition matrix: M := M_G := (A + I) D^-1

Repeat until converged:

  • Expand: M := M*M – enhances flow to well-connected nodes (i.e. nodes within a community).
  • Inflate: M := M.^r (r usually 2), then renormalize columns – increases inequality in each column (“rich get richer, poor get poorer”), reducing flow across communities.
  • Prune – saves memory by removing entries close to zero; enables faster convergence.

Clustering interpretation: nodes flowing into the same sink node are assigned the same cluster label.

Output clusters
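A compact numpy sketch of this loop, using dense matrices for clarity (real MCL implementations are sparse; the parameter names and convergence test here are illustrative):

import numpy as np

def mcl(A, r=2, prune_eps=1e-5, tol=1e-6, max_iter=100):
    """Markov Clustering sketch: expand, inflate, prune until the flow matrix stabilizes.

    A: symmetric adjacency matrix as a numpy array.
    r: inflation exponent (usually 2).
    """
    M = A + np.eye(len(A))                    # add self-loops: A + I
    M = M / M.sum(axis=0, keepdims=True)      # column-normalize into a transition matrix
    for _ in range(max_iter):
        previous = M
        M = M @ M                             # Expand: flow spreads to well-connected nodes
        M = M ** r                            # Inflate: strengthen strong flows, weaken weak ones
        M[M < prune_eps] = 0.0                # Prune: drop entries close to zero
        M = M / M.sum(axis=0, keepdims=True)  # renormalize columns
        if np.abs(M - previous).max() < tol:  # converged?
            break
    # Nodes whose columns send flow into the same attractor row get the same label.
    return [np.nonzero(row)[0].tolist() for row in M if row.any()]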
