Local Sparsification for Scalable Module Identification in Networks


### “Every 2 days we create as much information as we did up to 2003”- Eric Schmidt, Google ex-CEO

Local Sparsification for Scalable Module Identification in Networks

- Srinivasan Parthasarathy
- Joint work with V. Satuluri, Y. Ruan, D. Fuhry, Y. Zhang

Data Mining Research Laboratory

Dept. of Computer Science and Engineering

The Ohio State University

The Data Deluge

$600 buys a disk drive that can store all of the world’s music

[McKinsey Global Institute Special Report, June ’11]

All this data is only useful if we can scalably extract useful knowledge

3. Novel structure

Hub nodes, small world phenomena, clusters of varying densities and sizes, directionality

Novel algorithms or techniques are needed

5. Network Dynamics

How do communities evolve? Which actors have influence? How do clusters change as a function of external factors?

Application Domains

Bioinformatics (ISMB’07, ISMB’09, ISMB’12, ACM BCB’11, BMC’12)

Social Network and Social Media Analysis (TKDD’09, WWW’11, WebSci’12, WebSci’12)

Graph Pre-processing

Sparsification

SIGMOD ’11, WebSci’12

Near Neighbor Search

For non-graph data

PVLDB ’12

Symmetrization

For directed graphs

EDBT ’10

Core Clustering

Consensus Clustering

KDD’06, ISMB’07

Viewpoint Neighborhood Analysis

KDD ’09

Graph Clustering via

Stochastic Flows

KDD ’09, BCB ’10

Dynamic Analysis and Visualization

Event Based Analysis KDD’07,TKDD’09

Network Visualization

KDD’08

Density Plots

SIGMOD’08, ICDE’12

Scalable Implementations and Systems Support on Modern Architectures

Multicore Systems (VLDB’07, VLDB’09), GPUs (VLDB’11), STCI Cell (ICS’08),

Clusters (ICDM’06, SC’09, PPoPP’07, ICDE’10)

Graph Sparsification for Community Discovery

SIGMOD ’11, WebSci’12

Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?

Graph Clustering and Community Discovery

Given a graph, discover groups of nodes that are strongly connected to one another but weakly connected to the rest of the graph.

Graph Clustering: Applications

- Social network and graph compression; direct analytics on the compressed representation
- Optimizing VLSI layout
- Protein function prediction
- Data distribution to minimize communication and balance load

Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?

Clustering algorithms can run much faster and be more accurate on a sparsified graph.

Ditto for Network Visualization

Retain edges which are likely to be intra-cluster edges, while discarding likely inter-cluster edges.

Algorithm: Global Sparsification (G-Spar)

- Parameter: Sparsification ratio, s
- 1. For each edge <i,j>:
  - (i) Calculate Sim(<i,j>)
- 2. Retain the top s% of edges in order of Sim, discard the others
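The global procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slides do not define Sim, so the sketch assumes Jaccard similarity between the two endpoints' neighbor sets, takes `s` as a fraction rather than a percentage, and rounds the number of kept edges up. The names `jaccard` and `global_sparsify` are hypothetical.

```python
from math import ceil

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def global_sparsify(adj, s):
    """Keep the top fraction s (0 < s <= 1) of edges ranked by similarity.

    adj maps each node to the set of its neighbors (undirected graph).
    Returns a set of edges, each stored as a sorted (i, j) tuple.
    """
    edges = {tuple(sorted((i, j))) for i in adj for j in adj[i]}
    # Assumed Sim: Jaccard overlap of the endpoints' neighbor sets.
    # Ties are broken by the edge itself so the ranking is deterministic.
    ranked = sorted(edges, key=lambda e: (-jaccard(adj[e[0]], adj[e[1]]), e))
    keep = ceil(s * len(ranked))  # rounding up is an assumption
    return set(ranked[:keep])

# Two triangles joined by a single bridge edge (2, 3).
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
kept = global_sparsify(graph, 0.8)
```

In this toy graph the bridge's endpoints share no neighbors, so it ranks last and is the edge that gets discarded.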

Dense clusters are over-represented, sparse clusters under-represented

Works great when the goal is to just find the top communities

Algorithm: Local Sparsification (L-Spar)

- Parameter: Sparsification exponent, e (0 < e < 1)
- 1. For each node i of degree $d_i$:
  - (i) For each neighbor j:
    - (a) Calculate Sim(<i,j>)
  - (ii) Retain the top $d_i^e$ neighbors in order of Sim, for node i
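A minimal Python sketch of the local procedure follows, under stated assumptions: Sim is taken to be Jaccard similarity between neighbor sets (the slides do not define it), the per-node quota is rounded up to an integer, and an edge survives if either endpoint retains it. The function names are hypothetical.

```python
from math import ceil

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def local_sparsify(adj, e):
    """For each node of degree d, keep its top ceil(d**e) neighbors by similarity.

    adj maps each node to the set of its neighbors (undirected graph).
    Returns the union of the edges retained by any endpoint.
    """
    kept = set()
    for i, nbrs in adj.items():
        if not nbrs:
            continue
        quota = ceil(len(nbrs) ** e)  # rounding up is an assumption
        # Rank this node's neighbors locally; ties broken by neighbor id.
        ranked = sorted(nbrs, key=lambda j: (-jaccard(nbrs, adj[j]), j))
        for j in ranked[:quota]:
            kept.add((min(i, j), max(i, j)))
    return kept

# Two triangles joined by a single bridge edge (2, 3).
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
kept = local_sparsify(graph, 0.5)
```

Because the ranking is per node, every node keeps at least one edge regardless of how dense its neighborhood is, which is the point of ranking locally rather than globally.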

Underscoring the importance of Local Ranking

Similarity computation is expensive!

Minwise Hashing

- Universe: { dog, cat, lion, tiger, mouse }
- A = { mouse, lion }
- Permutation 1: [ cat, mouse, lion, dog, tiger ], so mh1(A) = min({ mouse, lion }) = mouse
- Permutation 2: [ lion, cat, mouse, dog, tiger ], so mh2(A) = min({ mouse, lion }) = lion
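The minwise hashing example above can be reproduced with a short Python sketch. It uses explicit permutations as on the slide; the fraction of matching min-hashes between two sets is the standard estimator of their Jaccard similarity. The function names and the seeded helper `random_permutations` are illustrative assumptions, not part of the original material.

```python
import random

def minhash_signature(items, permutations):
    """One min-hash per permutation: the member of `items` that appears first."""
    signature = []
    for perm in permutations:
        rank = {x: r for r, x in enumerate(perm)}
        signature.append(min(items, key=rank.__getitem__))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def random_permutations(universe, k, seed=0):
    """k random orderings of the universe (hypothetical helper, seeded for repeatability)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        perm = sorted(universe)
        rng.shuffle(perm)
        perms.append(perm)
    return perms

# The two permutations from the slide.
perms = [
    ["cat", "mouse", "lion", "dog", "tiger"],
    ["lion", "cat", "mouse", "dog", "tiger"],
]
A = {"mouse", "lion"}
print(minhash_signature(A, perms))  # ['mouse', 'lion']
```

With more permutations the estimate tightens, but for sparsification only the relative order of the similarities matters.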

Time complexity using Minwise Hashing

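[Complexity expression not preserved in this transcript; its annotated terms were the number of hashes and the number of edges.]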

Only 2 sequential passes over input.

Great for disk-resident data

Note: exact similarity is less important. We really just care about relative ranking, so a lower k suffices.

Theoretical Analysis of L-Spar: Main Results

- Q: Why choose the top $d^e$ edges for a node of degree $d$?
- A: Conservatively sparsify low-degree nodes, aggressively sparsify hub nodes. Easy to control the degree of sparsification.
- Proposition: If the input graph has a power-law degree distribution with exponent $\alpha$, then the sparsified graph also has a power-law degree distribution, with exponent $\frac{\alpha + e - 1}{e}$.
- Corollary: The sparsification ratio corresponding to exponent $e$ is no more than $\frac{\alpha - 2}{\alpha - e - 1}$.
- For $\alpha = 2.1$ and $e = 0.5$, ~17% of edges will be retained.
- Lower $e$ leads to more sparsification. By this bound, higher $\alpha$ (steeper power laws) retains a larger fraction of edges.
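The ~17% figure can be checked numerically. The sketch below assumes the bound has the form $(\alpha - 2)/(\alpha - e - 1)$ and the sparsified exponent the form $(\alpha + e - 1)/e$, where $\alpha$ is the power-law exponent of the input graph. The symbols are missing from the transcript, so these forms are a reconstruction: they follow from substituting $d \to d^e$ into a continuous power law, and the first reproduces the quoted figure. The function names are hypothetical.

```python
def sparsified_exponent(alpha, e):
    """Power-law exponent after each degree d is mapped to d**e (assumed form)."""
    return (alpha + e - 1) / e

def max_retained_fraction(alpha, e):
    """Upper bound on the fraction of edges kept (assumed form).

    Equals the ratio of the mean of d**e to the mean of d under a continuous
    power law with exponent alpha; requires alpha > 2 and 0 < e < 1.
    """
    if not (alpha > 2 and 0 < e < 1):
        raise ValueError("need alpha > 2 and 0 < e < 1")
    return (alpha - 2) / (alpha - e - 1)

print(round(max_retained_fraction(2.1, 0.5), 3))  # 0.167
print(round(sparsified_exponent(2.1, 0.5), 3))    # 3.2
```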

Experiments

- Datasets
  - 3 PPI networks (BioGrid, DIP, Human)
  - 2 Information (Wiki, Flickr) & 2 Social (Orkut, Twitter) networks
  - Largest network (Orkut): roughly a billion edges
  - Ground truth available for PPI networks and Wiki
- Clustering algorithms
  - Metis [Karypis & Kumar ’98], MLR-MCL [Satuluri & Parthasarathy ’09], Metis+MQI [Lang & Rao ’04], Graclus [Dhillon et al. ’07], Spectral methods [Shi ’00], Edge-based agglomerative/divisive methods [Newman ’04]
- Compared sparsifications
  - L-Spar, G-Spar, RandomEdge and ForestFire

[Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM ]

Same sparsification ratio for all 3 methods.

Good speedups, but typically loss in quality.

Great speedups and quality.

L-Spar: Qualitative Examples

Twitter executives, Silicon Valley figures

Impact of Sparsification on Noisy Data

As the graphs get noisier, L-Spar is increasingly beneficial.

Impact of Sparsification on Spectrum: Epinion

Local sparsification seems to match trends of the original graph.

Global sparsification results in multiple components.

Density Overlay Plots

Visual Comparison between Global vs Local Sparsification

Sparsification: Simple pre-processing that makes a big difference

Only tens of seconds to execute on multi-million-node graphs.

Reduces clustering time from hours down to minutes.

Improves accuracy by removing noisy edges for several algorithms.

Helps visualization

Ongoing and future work

Spectral results suggest one might be able to provide a theoretical rationale. Can we tease it out?

Investigate other kinds of graphs, incorporating content, novel applications (e.g. wireless sensor networks, VLSI design)

Summary

- Random edge sampling [Karger ’94]
- Sampling in proportion to effective resistances: good guarantees but very slow [Spielman and Srivastava ’08]
- Matrix sparsification [Arora et al. ’06]: fast, but the same as random sampling in the absence of weights.

Modularity (from Wikipedia)

- Modularity is the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random. The value of the modularity lies in the range [−1/2,1). It is positive if the number of edges within groups exceeds the number expected on the basis of chance.
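As a concrete illustration of this definition, the sketch below computes modularity for an undirected, unweighted graph without self-loops, using the standard per-group form: the fraction of edges inside the group minus the squared fraction of edge endpoints attached to it. The function name and the adjacency-set input format are assumptions made for the example.

```python
def modularity(adj, community):
    """Modularity of a partition.

    adj maps node -> set of neighbors (undirected, no self-loops, at least one edge);
    community maps node -> group label.
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())  # twice the number of edges
    intra = {}   # label -> edge endpoints whose other end is in the same group
    degree = {}  # label -> total degree of the group's nodes
    for i, nbrs in adj.items():
        c = community[i]
        degree[c] = degree.get(c, 0) + len(nbrs)
        intra[c] = intra.get(c, 0) + sum(1 for j in nbrs if community[j] == c)
    # Observed within-group fraction minus the fraction expected at random.
    return sum(intra[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)

# Two triangles joined by a single bridge edge (2, 3).
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(graph, split)  # 5/14, about 0.357
```

Putting every node in one group gives a modularity of 0, since the observed and expected fractions are then both 1.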

The MCL algorithm

Input: A, the adjacency matrix

- 1. Initialize M to $M_G$, the canonical transition matrix: $M := M_G := (A + I)D^{-1}$
- 2. Expand: M := M*M. Enhances flow to well-connected nodes (i.e. nodes within a community).
- 3. Inflate: M := M.^r (r usually 2), then renormalize columns. Increases inequality in each column: “Rich get richer, poor get poorer.” (Reduces flow across communities.)
- 4. Prune: remove entries close to zero. Saves memory and enables faster convergence.
- 5. Converged? If no, go back to Expand. If yes, output clusters.

Clustering interpretation: Nodes flowing into the same sink node are assigned the same cluster label.
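The MCL loop described above can be sketched in plain Python on a dense matrix. This is a toy illustration under several assumptions, not the MLR-MCL implementation used in the experiments: the pruning threshold, the convergence test (largest entry change below `tol`), the extra renormalization after pruning, and the rule of labeling each node by the row receiving most of its flow are all choices made for the example.

```python
def normalize_columns(m):
    """Scale each column of a square matrix (list of rows) to sum to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) or 1.0 for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mcl(adj, r=2, prune_below=1e-6, max_iter=100, tol=1e-9):
    """Return one cluster label (an attractor node index) per node."""
    n = len(adj)
    # M_G = (A + I) D^-1: add self-loops, then make every column sum to 1.
    m = normalize_columns(
        [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    )
    for _ in range(max_iter):
        new_m = matmul(m, m)  # Expand
        # Inflate: raise entries to the power r, then renormalize columns.
        new_m = normalize_columns([[x ** r for x in row] for row in new_m])
        # Prune near-zero entries, then renormalize again (an assumption).
        new_m = normalize_columns(
            [[x if x >= prune_below else 0.0 for x in row] for row in new_m]
        )
        delta = max(abs(new_m[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = new_m
        if delta < tol:  # Converged?
            break
    # Column j holds node j's flow; label it by the row that receives the most.
    return [max(range(n), key=lambda i: m[i][j]) for j in range(n)]

# Two disconnected triangles: flow can never cross between them.
A = [
    [0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0],
]
labels = mcl(A)
```

A practical implementation would use sparse matrices, which is where pruning pays off; the dense lists here only keep the sketch dependency-free.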
