The Block Two-Level Erdös-Rényi (BTER) Graph Model

The Block Two-Level Erdös-Rényi (BTER) Graph Model Ali Pinar, C. Seshadri, and Tamara G. Kolda Sandia National Laboratories U.S. Department of EnergyOffice of Advanced Scientific Computing Research U.S. Department of Defense DefenseAdvanced Research Projects Agency Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. Pinar - MMDS12

Why model massive graphs? • Enable sharing of surrogate data • Computer network traffic • Social networks • Financial transactions • Testing graph algorithms • Scalability • Versatility (e.g., vary degree distributions) • Characterizing algorithm performance • Verification & validation • Insight into… • Generative process • Community structure • Comparison • Evolution • Uncertainty Block Two-Level Erdös-Rényi(BTER) graph; image courtesy of NurcanDurak. Pinar - MMDS12

Model Desiderata Actor Collaboration • Captures heavy-tailed degree distribution • Not necessarily power law • Captures community structure • Measured indirectly through clustering coefficient, and other measures • Able to “fit” real-world data • Reproduce degree distribution • Reproduce community structure • Scales up to ~240 nodes • Motivated by GRAPH500 benchmark WWW Power Grid A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286(5349):509-512, 1999. M.E.J. Newman and M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69, 026113, 2004. http://www.graph500.org/ Pinar - MMDS12

Preserving Degree Distribution • Majority of models are amazingly bad at matching degree distributions • Tend to focus on matching “spirit” of the distribution • May not even have the capability of matching some distributions • Exception is the Chung-Lu (CL) model [Chung & Lu, 2002] • Also known as configuration model [Newman, 2003] or weighted Erdős-Rényi (ER) model • Premise: Probability of edge is proportional to degrees of endpoints • Pr(eij) = didj/(2m) • Can be implemented in a scalable way by repeated edge insertions. The CL model is nearly perfect at matching desired degree distributions. Pinar - MMDS12

Community Structure in Graphs • Numerous community finding algorithms exist • Difficult to validate • Trouble in finding full range of sizes • Instead, use related measures like clustering coefficient • Triangles arise because of community structure • What “community” structure must be presentto ensure a high clustering coefficient, especially for low-degree nodes? • How many communities? • What do they look like? M.E.J. Newman and M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69, 026113, 2004. Pinar - MMDS12

Clustering Coefficients Clustering Coefficient ti = # triangles at vertex i di = degree of vertex i Global Clustering Coeff. Pinar - MMDS12

Clustering coefficients are higher for lower degree vertices Pinar - MMDS12

Model building approach • Find features that restrict the space and identifythe structure imposed by these features • Given • askewed degree distribution • high clustering coefficients • What structure should we expect? • Let’s take it to an extreme, i.e., clustering coefficients ~1. • Implication 1: The graph must be a union of cliques. • Implication 2: The sizes of the cliques (communities) will have the same distribution as the degree distribution. • The challenge is in characterizing the gray area: clustering coefficients are high but not 1 Pinar - MMDS12

Building the basis for a model Seshadhri, Kolda, & Pinar, Phys. Rev. Let. E 2012 Empirical Observations • Degree distributions are heavy tailed. • Clustering coefficients are highest for small degree vertices. Theoretical Analysis Theorem: If a community has s edges then there must be Ω(√s) vertices with degree Ω(√s). We are not only trying to build a formal model, we are trying to formalize the model building process itself. Hypothesis: Real-world interaction networks consist of a scale –free collection of relatively dense Erdős-Rényi graphs. Pinar - MMDS12

Verifying the model Hypothesis: Real-world interaction networks consist of a scale –free collection of dense Erdős-Rényi graphs. Prediction: the number of communities grows with the number of nodes. Prediction: Within a community, the degree variance should be small. Verification: Vertex degree is correlated to the average degree of neighbors. And degrees of triangle vertices are close. Verification: The sizes of the communities are small, and do not grow (or grow very slowly) for larger graphs. (e.g., Dunbar number). Pinar - MMDS12

Matching the Clustering Coefficients • Random pairings cannot close wedges! • Models that achieve high clustering coefficients require history to guide future edge insertions • To get high clustering coefficient without history, we have to pre-group the nodes into affinity blocks • Our theory: CL graph with high clustering coefficient must have dense ER block at its core • Each affinity block is a near-clique • Simplifying assumption: Non-overlapping blocks • Implies blocks must comprise nodes of similar degree • Can tune clustering coefficient to be appropriate for the degree by changing connectivity of the block 1 2 4 3 If nodes of different degrees are in the same block, then the high-degree nodes will have low clustering coefficients. Pinar - MMDS12

Preprocessing: Determining Blocks heterogeneous • Input: Desired degree-distribution • Assume nodes sorted by degree, least to greatest • Ignore degree-1 vertices in this phase • Algorithm: • Group nodes in order • Bulk assign, if possible • Output: Compressed information about blocks • Size of each block • Starting index of each block • “Weight” of each block (depends on connectivity) • O(duniq) information, if compressed • Observe: If degree distribution is power law, then so is the distribution of the block sizes • Evocative of Dunbar’s number: Max block size = 100 for γ=2 2 2 3 2 2 3 3 2 homogeneous 2 3 3 2 2 3 3 2 2 3 3 2 2 3 3 2 Pinar - MMDS12

BTER: Block Two-Level Erdős-Rényi Phase 1: Erdős-Rényi graphs in each block. User-specified connectivity determines number of links within each block. A formula is given that depends on lowest-degree node in block. Preprocessing: Create explicit affinity blocks of nodes with same (or nearly same) degree. The blocks are determined by the degree distribution. Phase 2:CL (aka weighted ER) model on “excess” degree, creates connections across blocks. Can run in tandem to Phase 1 by worked with expected excess degree. Pinar - MMDS12

Scalable BTER: Independent Edges Choose phase 1 or 2? Choose block proportional to number of “samples” per block Create Phase 2 edgeusing CL model on “excess degree” Create Block 1 edge per ER model with connectivity ρ1 Create Block K edge per ER model with connectivity ρK Choose 1st endpoint Choose 2nd endpoint Requires O(n) data to determine the various probabilities, and can be compressed to O(duniq). Choose 1st endpoint Choose 2nd endpoint Choose 1st endpoint Choose 2nd endpoint Pinar - MMDS12

Visualization of BTER Adjacency Matrix Red = Phase 1 Blue = Phase 2 Pinar - MMDS12

Visualization of BTER Graph Nodes colored by degree (darker=higher) Phase 1 = Blue Phase 2 = Green Image courtesy of Nurcan Durak. Pinar - MMDS12

Co-authorship (ca-AstroPh) • 18,771 nodes 396,100 edges; based on arxiv • BTER Phase 1 edges: 290,268 Phase 2 edges: 112,808 • Global clustering coefficient Original: 0.32, BTER: 0.31, CL: 0.01 • Normalized size of the largest connected component: Original: 0.95, BTER: 0.86, CL: 1.0 Eigenvalues are not determined by the degree distribution. Pinar - MMDS12

Trust Network • 75,879vertices, 811,480edges; based on Epinion web site; edges represent trust between two users • BTER Phase 1 Edges: 300,162 Phase 2 Edges: 515,192 • Global clustering coefficientOriginal: 0.07, BTER: 0.07, CL: 0.03 • Normalized size of the largest connected componentOriginal: 1.0, BTER: 0.96, CL: 0.98 BTER provides good matches to real data, even when the clustering coefficients are not too high Pinar - MMDS12

Concluding Remarks • Models are a critical challenge in network analysis • The challenge is not in building a formal model, but formalizing the modeling process itself • We propose the Block Two-Level Erdös-Rényi (BTER) • New theory says there must be many dense subgraphs for high clustering coefficient • New BTER model explicitly creates dense communities using ER • Exceptional similarities to real data in terms of clustering coefficients and eigenvalues • The code is available at http://www.sandia.gov/~tgkolda/bter_supplement/. • Scalable versions (Hadoopand MPI) will be available soon • For more information, • Ali Pinar apinar@sandia.gov Pinar - MMDS12

A new workshop • SIAM Workshop on Network Science • Proposed; under review • Dates: July 7-8, 2013 • Place: San Diego, CA • Co-located with SIAM Annual Meeting • Contact: • Ali Pinar (apinar@sandia.gov), Sandia National Labs • MadhavMarathe(mmarathe@vbi.vt.edu), Virginia Tech Pinar - MMDS12

Relevant Publications • Modeling • C. Seshadhri, T. Kolda, and A. Pinar, “The Blocked Two-Level ErdosRenyi Graph Model,” Phys. Rev. E 2012. • C. Seshadhri, A. Pinar, and T. Kolda, “An In Depth analysis of Stochastic Kronecker Graphs," submitted. • A. Pinar, C. Seshadhri, and T. Kolda, “The Similarity of Stochastic Kronecker Graphs to Edge-Configuration Models,” SDM’12 • C. Seshadhri, A. Pinar, and T. Kolda, “An In Depth study of Stochastic Kronecker Graphs,” ICDM’12 • Generating a random graph • J. Ray, A. Pinar, and C. Sehadhri, Are we there yet? When to stop a Markov chain while generating random graphs," submitted. • I. Stanton and A. Pinar, “Constructing and uniform sampling graphs with prescribed joint degree distribution using Markov Chains,” to appear in ACM JEA. • I. Stanton and A. Pinar, “Sampling graphs with prescribed joint degree distribution using Markov Chains,” ALENEX’11. • Community structure • C. Seshadhri, A. Pinar, and T. Kolda, “Fast Triangle Counting through Wedge Sampling," submitted. • M. Rocklin and A. Pinar, “On Clustering on Graphs with Multiple Edge Types,” Internet Math. 2012. • M. Rocklin and A. Pinar, “Latent Clustering on Graphs with Multiple Edge Types,” Proc. 8th Workshop on Algorithms and Models for the Web Graph WAW’ 11. • M. Rocklin and A. Pinar, “Computing an Aggregate Edge-weight function for Clustering Graphs with Multiple Edge Types,” WAW’10. Pinar - MMDS12

The Block Two-Level Erdös-Rényi (BTER) Graph Model

The Block Two-Level Erdös-Rényi (BTER) Graph Model

Presentation Transcript

The s-Block Elements

Graph Problems in the Streaming Model

S - R S - R S - R S - R S - R S - R S - R S - R S - R

Block-level Link Analysis

Two Level Morphology

S - R S - R S - R S - R S - R S - R S - R S - R S - R S - R S - R S - R

Phenotype R S R S S R R S

R-MAT: A Recursive Model for Graph Mining

CPP Training Block Two

The Two Stream Model

A Two-Level Electricity Demand Model

Ne two r k s

S Block Elements

S-Block Elements

ERD

The s -Block Elements

The s -Block Elements

The s -Block Elements

The s -Block Elements

Two Level Networks

Mapping from Data Model (ERD) to Relational Model

Mapping from Data Model (ERD) to Relational Model