Local Sparsification for Scalable Module Identification in Networks

  • Srinivasan Parthasarathy

  • Joint work with V. Satuluri, Y. Ruan, D. Fuhry, Y. Zhang

Data Mining Research Laboratory

Dept. of Computer Science and Engineering

The Ohio State University


The Data Deluge

"Every 2 days we create as much information as we did up to 2003." - Eric Schmidt, former Google CEO


Data Storage Costs are Low

$600 buys a disk drive that can store all of the world's music.

[McKinsey Global Institute Special Report, June '11]




Social networks

Protein Interactions

Internet

Neighborhood graphs

Data dependencies

VLSI networks


All this data is only useful if we can scalably extract useful knowledge


Challenges

1. Large Scale

Billion-edge graphs commonplace

Scalable solutions are needed


Challenges

2. Noise

Links on the web and protein interactions are noisy

Need mechanisms to alleviate noise


Challenges

3. Novel structure

Hub nodes, small world phenomena, clusters of varying densities and sizes, directionality

Novel algorithms or techniques are needed


Challenges

4. Domain Specific Needs

E.g., balance, constraints, etc.

Need mechanisms to specify


Challenges

5. Network Dynamics

How do communities evolve? Which actors have influence? How do clusters change as a function of external factors?


Challenges

6. Cognitive Overload

Need to support guided interaction for human in the loop


Our Vision and Approach

Application Domains

  • Bioinformatics (ISMB'07, ISMB'09, ISMB'12, ACM BCB'11, BMC'12)

  • Social Network and Social Media Analysis (TKDD'09, WWW'11, WebSci'12)

Graph Pre-processing

  • Sparsification (SIGMOD '11, WebSci'12)

  • Near Neighbor Search, for non-graph data (PVLDB '12)

  • Symmetrization, for directed graphs (EDBT '10)

Core Clustering

  • Consensus Clustering (KDD'06, ISMB'07)

  • Viewpoint Neighborhood Analysis (KDD '09)

  • Graph Clustering via Stochastic Flows (KDD '09, BCB '10)

Dynamic Analysis and Visualization

  • Event-Based Analysis (KDD'07, TKDD'09)

  • Network Visualization (KDD'08)

  • Density Plots (SIGMOD'08, ICDE'12)

Scalable Implementations and Systems Support on Modern Architectures

  • Multicore Systems (VLDB'07, VLDB'09), GPUs (VLDB'11), STI Cell (ICS'08), Clusters (ICDM'06, SC'09, PPoPP'07, ICDE'10)



Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?


Graph Clustering and Community Discovery

Given a graph, discover groups of nodes that are strongly connected to one another but weakly connected to the rest of the graph.


Graph Clustering: Applications

Social Network and Graph Compression

Direct Analytics on compressed representation


Graph Clustering: Applications

Optimize VLSI layout


Graph Clustering: Applications

Protein function prediction


Graph Clustering: Applications

Data distribution to minimize communication and balance load


Is there a simple pre-processing of the graph to reduce the edge set that can “clarify” or “simplify” its cluster structure?


Preview

Original

Sparsified

[Automatically visualized using Prefuse]


The Promise

Clustering algorithms can run much faster and be more accurate on a sparsified graph.

Ditto for Network Visualization


Utopian Objective

Retain edges which are likely to be intra-cluster edges, while discarding likely inter-cluster edges.


A way to rank edges on "strength" or similarity.


Algorithm: Global Sparsification (G-Spar)

  • Parameter: Sparsification ratio, s

  • 1. For each edge <i,j>:

    • (i) Calculate Sim ( <i,j> )

  • 2. Retain top s% of edges in order of Sim, discard others


  • Dense clusters are over-represented, sparse clusters under-represented

    Works great when the goal is to just find the top communities
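
    To make the ranking step concrete, here is a minimal Python sketch of the G-Spar procedure above. The Jaccard similarity of the endpoints' neighbor sets stands in for Sim, and the dict-of-sets graph representation and function names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of Global Sparsification (G-Spar), assuming Sim(i, j) is the
# Jaccard similarity of the two endpoints' neighbor sets and the graph is an
# undirected dict-of-sets: {node: set(neighbors)}.

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def g_spar(adj, s):
    """Keep the top s fraction (0 < s <= 1) of edges, ranked globally by similarity."""
    edges = [(i, j) for i in adj for j in adj[i] if i < j]           # each edge once
    edges.sort(key=lambda e: jaccard(adj[e[0]], adj[e[1]]), reverse=True)
    keep = set(edges[:int(s * len(edges))])
    return {i: {j for j in adj[i] if (min(i, j), max(i, j)) in keep} for i in adj}

# Toy example: two triangles {1,2,3} and {4,5,6} joined by the bridge edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(g_spar(adj, 0.5))   # the weak bridge (3, 4) is among the first edges dropped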


    Algorithm: Local Sparsification (L-Spar)

    • Parameter: Sparsification exponent, e (0 < e < 1)

    • 1. For each node i of degree d_i:

      • (i) For each neighbor j: calculate Sim ( <i,j> )

      • (ii) Retain the top (d_i)^e neighbors of node i, in order of Sim

    Underscoring the importance of Local Ranking
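
    A matching sketch of L-Spar, in the same illustrative style. The ceil() rounding of d_i^e and the rule that an edge survives if either endpoint retains it are assumptions made for this sketch, not necessarily the authors' exact choices.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def l_spar(adj, e=0.5):
    """L-Spar sketch: node i keeps its top ceil(d_i ** e) neighbors ranked by
    similarity; an edge survives if either endpoint retains it (an assumption)."""
    kept = set()
    for i, nbrs in adj.items():
        ranked = sorted(nbrs, key=lambda j: jaccard(adj[i], adj[j]), reverse=True)
        for j in ranked[:math.ceil(len(nbrs) ** e)]:
            kept.add((min(i, j), max(i, j)))
    return {i: {j for j in adj[i] if (min(i, j), max(i, j)) in kept} for i in adj}

# Same toy graph: two triangles joined by a bridge edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(l_spar(adj, e=0.5))   # hubs are sparsified aggressively, low-degree nodes conservatively
```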



    But...

    Similarity computation is expensive!


    A randomized, approximate solution based on

    Minwise Hashing

    [Broder et al., 1998]


    Minwise Hashing

    Universe: { dog, cat, lion, tiger, mouse }

    A = { mouse, lion }

    Permutation 1: [ cat, mouse, lion, dog, tiger ]  =>  mh1(A) = min ( { mouse, lion } ) = mouse

    Permutation 2: [ lion, cat, mouse, dog, tiger ]  =>  mh2(A) = min ( { mouse, lion } ) = lion


    Key Fact

    For two sets A, B, and a min-hash function mh_i():

    Pr[ mh_i(A) = mh_i(B) ] = |A ∩ B| / |A ∪ B| = Jaccard(A, B)

    Unbiased estimator for Sim using k hashes:

    Sim_est(A, B) = (1/k) * Σ_{i=1..k} 1[ mh_i(A) = mh_i(B) ]

    Time Complexity using Minwise Hashing

    O(k · |E|), for k hashes and |E| edges.

    Only 2 sequential passes over the input.

    Great for disk-resident data.

    Note: exact similarity is less important – we really just care about relative ranking, so a lower k suffices.


    Theoretical Analysis of L-Spar: Main Results

    • Q: Why choose the top d^e edges for a node of degree d?

      • A: Conservatively sparsify low-degree nodes, aggressively sparsify hub nodes. Easy to control the degree of sparsification.

    • Proposition: If the input graph has a power-law degree distribution with exponent α, then the sparsified graph also has a power-law degree distribution, with exponent α/e.

    • Corollary: The sparsification ratio corresponding to exponent e is no more than (α − 2) / (α − e − 1).

      • For α = 2.1 and e = 0.5, ~17% of edges will be retained.

      • Higher α (steeper power laws) and/or lower e leads to more sparsification.
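
    A quick numeric check of the corollary in Python. The closed form (α − 2) / (α − e − 1) is a reconstruction of the bound (an assumption); the slide itself gives only the α = 2.1, e = 0.5 → ~17% data point.

```python
# Numeric check of the corollary above. The closed form is a reconstruction
# (an assumption); the slide only states the alpha = 2.1, e = 0.5 -> ~17% case.
def retained_fraction(alpha, e):
    """Upper bound on the fraction of edges L-Spar retains for a power-law graph."""
    return (alpha - 2) / (alpha - e - 1)

print(retained_fraction(2.1, 0.5))   # ~0.167, i.e. about 17% of edges retained
```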


    Experiments

    • Datasets

      • 3 PPI networks (BioGrid, DIP, Human)

      • 2 Information (Wiki, Flickr) & 2 Social (Orkut, Twitter) networks

      • Largest network (Orkut) has roughly a billion edges

      • Ground truth available for PPI networks and Wiki

    • Clustering algorithms

      • Metis [Karypis & Kumar '98], MLR-MCL [Satuluri & Parthasarathy '09], Metis+MQI [Lang & Rao '04], Graclus [Dhillon et al. '07], Spectral methods [Shi '00], Edge-based agglomerative/divisive methods [Newman '04]

    • Compared sparsifications

      • L-Spar, G-Spar, RandomEdge and ForestFire


    Results Using Metis

    Same sparsification ratio for all 3 methods.

    Good speedups, but typically a loss in quality.

    Great speedups and quality.

    [Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM]


    L-Spar: Results Using MLR-MCL

    [Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM ]


    L-Spar: Qualitative Examples

    Twitter executives, Silicon Valley figures


    Impact of Sparsification on Noisy Data

    As the graphs get noisier, L-Spar is increasingly beneficial.



    Impact of Sparsification on Spectrum: Epinion

    Local sparsification seems to match the trends of the original graph.

    Global sparsification results in multiple components.




    Anatomy of a Density Plot

    Some measure of density

    Specific ordering of the vertices in the graph


    Density Overlay Plots

    Visual comparison between Global and Local Sparsification


    Summary

    Sparsification: simple pre-processing that makes a big difference.

    Only tens of seconds to execute on multi-million-node graphs.

    Reduces clustering time from hours down to minutes.

    Improves accuracy, for several algorithms, by removing noisy edges.

    Helps visualization.

    Ongoing and future work

    The spectral results suggest one might be able to provide a theoretical rationale – can we tease it out?

    Investigate other kinds of graphs, incorporate content, and explore novel applications (e.g. wireless sensor networks, VLSI design).


    Prior Work

    • Random edge sampling [Karger '94]

    • Sampling in proportion to effective resistances: good guarantees, but very slow [Spielman and Srivastava '08]

    • Matrix sparsification [Arora et al. '06]: fast, but the same as random sampling in the absence of weights.



    Modularity (from Wikipedia)

    • Modularity is the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random. The value of the modularity lies in the range [−1/2,1). It is positive if the number of edges within groups exceeds the number expected on the basis of chance.
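
    For reference, the standard formula behind this verbal definition (not shown on the slide) is, in LaTeX:

```latex
% Newman-Girvan modularity of a partition: A_{ij} is the adjacency matrix,
% k_i the degree of node i, m the total number of edges, c_i the community
% of node i, and \delta the Kronecker delta.
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```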


    The MCL Algorithm

    Input: A, the adjacency matrix

    Initialize M to M_G, the canonical transition matrix: M := M_G := (A + I) D^-1

    Expand: M := M * M
    Enhances flow to well-connected nodes (i.e. nodes within a community).

    Inflate: M := M.^r (r usually 2), renormalize columns
    Increases inequality in each column: "Rich get richer, poor get poorer." (Reduces flow across communities.)

    Prune
    Saves memory by removing entries close to zero. Enables faster convergence.

    Repeat Expand / Inflate / Prune until converged.

    Clustering interpretation: nodes flowing into the same sink node are assigned the same cluster label.

    Output clusters
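
    For illustration, a compact NumPy sketch of the expand / inflate / prune loop described above; the convergence test, pruning threshold, and toy example are illustrative choices, not the tuned MLR-MCL implementation used in the experiments.

```python
import numpy as np

def mcl(A, r=2, prune_threshold=1e-5, max_iter=100):
    """Basic MCL sketch on a dense adjacency matrix (illustrative, not MLR-MCL)."""
    n = A.shape[0]
    M = A + np.eye(n)                          # add self-loops: A + I
    M = M / M.sum(axis=0, keepdims=True)       # canonical transition matrix (A + I) D^-1
    for _ in range(max_iter):
        prev = M.copy()
        M = M @ M                              # Expand: spread flow within communities
        M = M ** r                             # Inflate: "rich get richer" in each column
        M[M < prune_threshold] = 0.0           # Prune: drop near-zero entries
        M = M / M.sum(axis=0, keepdims=True)   # renormalize columns
        if np.allclose(M, prev, atol=1e-9):    # Converged?
            break
    # Interpretation: nodes flowing into the same sink (attractor) row form one cluster.
    return {int(s): [int(j) for j in np.flatnonzero(M[s] > 1e-6)]
            for s in np.flatnonzero(M.sum(axis=1) > 1e-6)}

# Toy example: two triangles joined by a single bridge edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(mcl(A))   # expected: roughly {0, 1, 2} and {3, 4, 5}
```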

