
Estimating Clique Composition and Size Distributions from Sampled Network Data

Minas Gjoka, Emily Smith, Carter T. Butts. University of California, Irvine.



Presentation Transcript

  1. Estimating Clique Composition and Size Distributions from Sampled Network Data. Minas Gjoka, Emily Smith, Carter T. Butts, University of California, Irvine

  2. Outline • Problem statement • Estimation methodology • Results with real-life graphs

  3. Cliques • A complete subgraph that contains i vertices is an order-i clique. • A maximal clique is a clique that is not contained in a larger clique. [Figure: example cliques of order 1, 2, 3, 4, 5, …, i]

  4. Cliques • A complete subgraph that contains i vertices is an order-i clique. • A maximal clique is a clique that is not contained in a larger clique. • Example: an order-4 clique on vertices a, b, c, d contains 4 non-maximal order-3 cliques. [Figure: the order-4 clique {a, b, c, d} and its four order-3 sub-cliques]

  5. Counting of Cliques • Ci is the count of order-i cliques (maximal or non-maximal). • Clique distribution of G: C = (C1, C2, C3, C4) = (0, 1, 2, 1). • Goal 1: Estimate Ci (for all i) in graph G from sampled network data. [Figure: example graph G with vertices 1 to 8 and its order-1 to order-4 cliques]
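
The clique distribution can be illustrated with a brute-force count on a small graph. This is a minimal sketch only: the edge list `EDGES` and the helpers `build_adjacency` and `count_cliques_by_order` are hypothetical names, the graph is not the one drawn on the slide, and the count covers all cliques (maximal or not). Brute force is only practical for tiny graphs.

```python
from itertools import combinations

# Hypothetical 8-vertex graph (an assumption, not the graph on the slide).
EDGES = [(2, 5), (2, 6), (5, 6), (1, 7), (1, 8), (7, 8),
         (4, 5), (4, 8), (5, 8), (2, 3), (3, 4)]


def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbors."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def count_cliques_by_order(adjacency, max_order):
    """Count all cliques (maximal or not) of each order by brute force."""
    vertices = sorted(adjacency)
    counts = {}
    for order in range(1, max_order + 1):
        counts[order] = sum(
            1
            for group in combinations(vertices, order)
            # A group is a clique when every pair in it is adjacent.
            if all(b in adjacency[a] for a, b in combinations(group, 2))
        )
    return counts


clique_distribution = count_cliques_by_order(build_adjacency(EDGES), max_order=4)
print(clique_distribution)
```

For this hypothetical graph the order-1 count is the number of vertices and the order-2 count is the number of edges, which gives a quick sanity check.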

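  6. Counting of Cliques: Vertex Attributes • Vertex attribute vector Xj, j = 1..p, p <= N (p = 3 in the example). • Example clique compositions in graph G: u = [3 0 0], u = [2 1 0], u = [2 0 1]. • Clique composition distribution of G: Cu is the count of order-u cliques, where u gives the number of clique vertices in each of the p attribute categories. • Goal 2: Estimate Cu (for all u) in graph G from sampled network data. [Figure: example graph G with vertices 1 to 8, colored by attribute]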
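
A composition count can be sketched the same way. The code below assumes that each vertex falls into exactly one of p = 3 attribute categories and that u records how many clique vertices fall in each category. The graph `EDGES`, the mapping `CATEGORY`, and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical 8-vertex graph and vertex categories (assumptions).
EDGES = [(2, 5), (2, 6), (5, 6), (1, 7), (1, 8), (7, 8),
         (4, 5), (4, 8), (5, 8), (2, 3), (3, 4)]
CATEGORY = {1: 0, 2: 0, 3: 0, 4: 2, 5: 0, 6: 1, 7: 0, 8: 0}


def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbors."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def composition_counts(adjacency, category, order, num_categories):
    """Count cliques of the given order, grouped by composition vector u."""
    counts = Counter()
    for group in combinations(sorted(adjacency), order):
        if all(b in adjacency[a] for a, b in combinations(group, 2)):
            u = [0] * num_categories
            for vertex in group:
                u[category[vertex]] += 1  # tally vertices per category
            counts[tuple(u)] += 1
    return counts


composition = composition_counts(build_adjacency(EDGES), CATEGORY,
                                 order=3, num_categories=3)
print(dict(composition))
```

Summing the counts over all u of a given order recovers the plain clique count for that order.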

  7. Motivation • Counting of Cliques • cliques describe local structure (clustering, cohesive subgroups) • algorithmic implications of cliques in engineering context • cliques used as input in network models • Sampled network data • unknown graphs with access limitations • massive known graphs

  8. Related Work • Model-based methods • Do not scale • Do not help with counting • Design-based methods • Subgraph (or motif) counting tools that use sampling e.g. MFinder, FANMOD, MODA • No support for subgraphs of size larger than 10 • No support for vertex attributes • Biased Estimation

  9. Estimation

  10. Methodology • Collect an egocentric network sample H1,..,Hn • Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n • Sampling designs: uniform independence sampling, weighted independence sampling, link-trace sampling; with replacement or without replacement

  11. Methodology • Collect an egocentric network sample H1,..,Hn • Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n [Figure: graph G(V,E) with n=2 sampled nodes highlighted]

  12. Methodology • Collect an egocentric network sample H1,..,Hn • Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n • Fetch the egonet of each sampled node: G[Vj], j=1..n [Figure: graph G(V,E) and the egonets of the n=2 sampled nodes]

  13. Methodology • Collect an egocentric network sample H1,..,Hn • Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n • Fetch the egonet of each sampled node: G[Vj], j=1..n • Calculate the clique count Ci (or Cu) in each egonet Hj [Figure: graph G(V,E) and the egonets of the n=2 sampled nodes]

  14. Methodology • Collect an egocentric network sample H1,..,Hn • Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n • Fetch the egonet of each sampled node: G[Vj], j=1..n • Calculate the clique count Ci (or Cu) in each egonet Hj • can use existing exact clique counting algorithms • clique type is determined by the counting algorithm [Figure: graph G(V,E) and the egonets of the n=2 sampled nodes, with per-egonet C3 counts]

  15. Methodology • Collect an egocentric network sample H1,..,Hn • Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n • Fetch the egonet of each sampled node: G[Vj], j=1..n • Calculate the clique count Ci (or Cu) in each egonet Hj • Apply an estimation method that combines the per-egonet calculations • Clique Degree Sums (CDS) • Distinct Clique Counting (CC) [Figure: graph G(V,E) and the egonets of the n=2 sampled nodes, with per-egonet C3 counts]

  16. Methodology • Collect an egocentric network sample H1,..,Hn • Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n • Fetch the egonet of each sampled node: G[Vj], j=1..n • Calculate the clique count Ci (or Cu) in each egonet Hj • Apply an estimation method that combines the per-egonet calculations • Clique Degree Sums (CDS) • labeling of neighbors not required, more space efficient • Distinct Clique Counting (CC) • higher accuracy [Figure: graph G(V,E) and the egonets of the n=2 sampled nodes, with per-egonet C3 counts]
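
The sampling, egonet, and per-egonet counting steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the graph `EDGES` is hypothetical, sampling is uniform without replacement with n = 2, and the per-egonet count is restricted to cliques that include the ego. The helper names `egonet` and `cliques_containing_ego` are also hypothetical.

```python
import random
from itertools import combinations

# Hypothetical 8-vertex graph (an assumption, not the graph on the slides).
EDGES = [(2, 5), (2, 6), (5, 6), (1, 7), (1, 8), (7, 8),
         (4, 5), (4, 8), (5, 8), (2, 3), (3, 4)]


def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbors."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def egonet(adjacency, ego):
    """Induced subgraph on the ego and its neighbors, as an adjacency dict."""
    members = adjacency[ego] | {ego}
    return {v: adjacency[v] & members for v in members}


def cliques_containing_ego(ego_adjacency, ego, order):
    """Count cliques of the given order in the egonet that include the ego."""
    neighbors = sorted(ego_adjacency[ego])
    return sum(
        1
        for group in combinations(neighbors, order - 1)
        if all(b in ego_adjacency[a] for a, b in combinations(group, 2))
    )


adjacency = build_adjacency(EDGES)
rng = random.Random(0)                     # fixed seed, for repeatability only
sample = rng.sample(sorted(adjacency), 2)  # uniform, without replacement, n = 2
egonets = {ego: egonet(adjacency, ego) for ego in sample}
triangle_counts = {ego: cliques_containing_ego(egonets[ego], ego, 3)
                   for ego in sample}
print(sample, triangle_counts)
```

Each egonet is processed independently of the others, which is why this step parallelizes naturally.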

  17. Labeling of neighbors [Figure: graph G with vertices 1 to 9 and its order-3 cliques]

  18. Labeling of neighbors • Sampled data: Vj, X[Vj], G[Vj] [Figure: graph G and the egonets of the n=2 sampled nodes]

  19. Labeling of neighbors • Distinct Clique Counting (CC) • labeled neighbors [Figure: graph G, n=2; the count C3 is calculated from egonets with labeled neighbors]

  20. Labeling of neighbors • Distinct Clique Counting (CC) • labeled neighbors • Clique Degree Sums (CDS) • unlabeled neighbors [Figure: graph G, n=2; the count C3 is calculated from egonets with labeled neighbors (CC) and from egonets with unlabeled neighbors (CDS)]

  21. Clique Degree Sums (unlabeled neighbors) • The order-i clique degree dij is the number of i-cliques to which node j belongs.

  22. Clique Degree Sums (unlabeled neighbors) • The order-i clique degree dij is the number of i-cliques to which node j belongs. • Example: d38 = 2 [Figure: graph G(V,E) with vertices 1 to 8; the egonet H8 of node 8 contains two order-3 cliques that include node 8]

  23. Clique Degree Sums (unlabeled neighbors) • Di is the order-i clique degree sum over all nodes: $D_i = \sum_{j \in V} d_{ij}$, where dij is the number of i-cliques to which node j belongs. • Each i-clique is counted once by each of its i vertices, so $D_i = i\,C_i$.

  24. Clique Degree Sums (unlabeled neighbors) • Di is the order-i clique degree sum over all nodes. • Example on graph G(V,E): D3 = d31 + d32 + d33 + d34 + d35 + d36 + d37 + d38 = 1 + 1 + 0 + 1 + 2 + 1 + 1 + 2 = 9 • D3 = 3C3, so C3 = 3 [Figure: graph G(V,E) with vertices 1 to 8 and its order-3 cliques]
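
The identity D3 = 3C3 can be checked numerically. The sketch below uses a hypothetical edge list chosen so that the order-3 clique degrees of nodes 1 to 8 match the values listed on the slide (1, 1, 0, 1, 2, 1, 1, 2). The actual edges of the slide's graph are not recoverable from the transcript, so `EDGES` is an assumption, as is the helper name `clique_degree`.

```python
from itertools import combinations

# Hypothetical edge list (an assumption) that reproduces the slide's d_3j values.
EDGES = [(2, 5), (2, 6), (5, 6), (1, 7), (1, 8), (7, 8),
         (4, 5), (4, 8), (5, 8), (2, 3), (3, 4)]


def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbors."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def clique_degree(adjacency, node, order):
    """Number of cliques of the given order to which `node` belongs."""
    neighbors = sorted(adjacency[node])
    # An order-i clique containing `node` is `node` plus an
    # (i - 1)-clique among its neighbors.
    return sum(
        1
        for group in combinations(neighbors, order - 1)
        if all(b in adjacency[a] for a, b in combinations(group, 2))
    )


adjacency = build_adjacency(EDGES)
d3 = [clique_degree(adjacency, j, 3) for j in range(1, 9)]
D3 = sum(d3)   # order-3 clique degree sum
C3 = D3 // 3   # each triangle is counted once per vertex
print(d3, D3, C3)
```

Only the ego's own neighborhood is needed to compute each clique degree, so the sum never requires knowing which neighbors are shared between egonets.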

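  25. Clique Degree Sums (unlabeled neighbors) • Over all nodes: $C_i = \frac{1}{i} \sum_{j \in V} d_{ij}$, where $d_{ij}$ is the number of i-cliques to which node j belongs. • Over the sampled nodes S, with $\pi_j$ the inclusion probability of node j: $\hat{C}_i = \frac{1}{i} \sum_{j \in S} \frac{d_{ij}}{\pi_j}$ is a design-unbiased Horvitz-Thompson estimator.

  26. Clique Degree Sums (unlabeled neighbors) • Over all nodes: $C_i = \frac{1}{i} \sum_{j \in V} d_{ij}$, where $d_{ij}$ is the number of i-cliques to which node j belongs. • For compositions, $d_{uj}$ is the number of u-cliques to which node j belongs. • Over the sampled nodes S, with $\pi_j$ the inclusion probability of node j: $\hat{C}_i = \frac{1}{i} \sum_{j \in S} \frac{d_{ij}}{\pi_j}$ and $\hat{C}_u = \frac{1}{|u|} \sum_{j \in S} \frac{d_{uj}}{\pi_j}$, where $|u|$ is the number of vertices in a u-clique, are design-unbiased Horvitz-Thompson estimators.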
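
A minimal sketch of the Clique Degree Sums estimator follows, assuming it has the Horvitz-Thompson form $\hat{C}_i = \frac{1}{i} \sum_{j \in S} d_{ij} / \pi_j$. The clique degrees are the d3 values listed on slide 24. The function name `cds_estimate`, the sample `one_sample`, and the choice of uniform sampling without replacement with n = 4 are assumptions made for illustration.

```python
from itertools import combinations

# Order-3 clique degrees d_3j for nodes j = 1..8, as listed on slide 24.
D3_BY_NODE = {1: 1, 2: 1, 3: 0, 4: 1, 5: 2, 6: 1, 7: 1, 8: 2}


def cds_estimate(sample_degrees, inclusion_probs, order):
    """Horvitz-Thompson estimate of C_i from clique degrees of sampled nodes."""
    weighted_sum = sum(d / p for d, p in zip(sample_degrees, inclusion_probs))
    return weighted_sum / order


N, n = 8, 4
pi = n / N  # node inclusion probability, uniform sampling without replacement

one_sample = [5, 8, 3, 1]  # hypothetical sample of node ids
estimate = cds_estimate([D3_BY_NODE[j] for j in one_sample], [pi] * n, order=3)

# Averaging over every possible sample of size n illustrates design-unbiasedness:
# the mean of the estimates equals the true count C3 = 3.
all_estimates = [
    cds_estimate([D3_BY_NODE[j] for j in s], [pi] * n, order=3)
    for s in combinations(sorted(D3_BY_NODE), n)
]
mean_estimate = sum(all_estimates) / len(all_estimates)
print(estimate, mean_estimate)
```

A single sample can land above or below the true count; only the average over the sampling design is guaranteed to match it.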

  27. Clique Degree Sums: Estimator Variance • We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$. • These require the node inclusion probability $\pi_j$ and the joint node inclusion probability $\pi_{jk}$.

  28. Clique Degree Sums: Estimator Variance • We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$. • Uniform Independence Sampling • Weighted Independence Sampling • Link-trace Sampling • Without replacement • With replacement

  29. Clique Degree Sums: Estimator Variance • We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$. • Uniform Independence Sampling • Without replacement • With n sampled nodes out of N nodes in total, the node inclusion probability is $\pi_j = \frac{n}{N}$ and the joint node inclusion probability is $\pi_{jk} = \frac{n(n-1)}{N(N-1)}$ for $j \neq k$.
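
The variance formulas did not survive in the transcript, so the sketch below uses the textbook Horvitz-Thompson variance estimator, $\hat{V} = \frac{1}{i^2} \sum_{j \in S} \sum_{k \in S} \frac{\pi_{jk} - \pi_j \pi_k}{\pi_{jk}} \frac{d_{ij}}{\pi_j} \frac{d_{ik}}{\pi_k}$ with $\pi_{jj} = \pi_j$, under uniform sampling without replacement. Whether the slides use exactly this form is an assumption. The function name `cds_variance_estimate` is hypothetical, and the clique degrees are the d3 values from slide 24.

```python
from itertools import combinations

# Order-3 clique degrees d_3j for nodes j = 1..8, as listed on slide 24.
D3 = [1, 1, 0, 1, 2, 1, 1, 2]


def cds_variance_estimate(sample_degrees, population_size, order):
    """Horvitz-Thompson variance estimate, uniform sampling without replacement."""
    n, N = len(sample_degrees), population_size
    pi = n / N                                # node inclusion probability
    pi_joint = n * (n - 1) / (N * (N - 1))    # joint inclusion probability
    total = 0.0
    for a, d_a in enumerate(sample_degrees):
        for b, d_b in enumerate(sample_degrees):
            pair_prob = pi if a == b else pi_joint
            total += (pair_prob - pi * pi) / pair_prob * (d_a / pi) * (d_b / pi)
    return total / order ** 2


N, n, order = 8, 4, 3
samples = list(combinations(D3, n))  # every possible sample of clique degrees

# True design variance of the point estimator, by full enumeration.
estimates = [sum(s) / (n / N) / order for s in samples]
mean_estimate = sum(estimates) / len(estimates)
true_variance = sum((e - mean_estimate) ** 2 for e in estimates) / len(estimates)

# The variance estimator averages to the true variance over all samples.
mean_variance_estimate = sum(
    cds_variance_estimate(s, N, order) for s in samples
) / len(samples)
print(true_variance, mean_variance_estimate)
```

Enumerating all samples is feasible only for toy sizes; it is used here to illustrate unbiasedness, not as a practical procedure.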

  30. Distinct Clique Counting (labeled neighbors) • Count the distinct i-cliques observed in H1,..,Hn. • Weight each distinct observed i-clique k by the inverse of its inclusion probability $\pi_k$: $\hat{C}_i = \sum_{k} \frac{1}{\pi_k}$ is a design-unbiased Horvitz-Thompson estimator. • Uniform Independence Sampling • Weighted Independence Sampling • Link-trace Sampling • With replacement • Without replacement

  31. Distinct Clique Counting (labeled neighbors) • Count the distinct i-cliques observed in H1,..,Hn; call this number $c_i$. • Uniform Independence Sampling • With replacement • Every i-clique has the same inclusion probability $\pi = 1 - \left(1 - \frac{i}{N}\right)^n$, so the design-unbiased Horvitz-Thompson estimator is $\hat{C}_i = \frac{c_i}{\pi}$.
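
Distinct Clique Counting under uniform independence sampling with replacement can be sketched as below. The sketch assumes that an i-clique is observed whenever at least one of its i vertices is drawn, which gives the inclusion probability $1 - (1 - i/N)^n$. The graph `EDGES`, the draw `sample`, and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical 8-vertex graph (an assumption, not the graph on the slides).
EDGES = [(2, 5), (2, 6), (5, 6), (1, 7), (1, 8), (7, 8),
         (4, 5), (4, 8), (5, 8), (2, 3), (3, 4)]


def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbors."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def cliques_containing(adjacency, ego, order):
    """Cliques of the given order that include `ego`, as frozensets of labels."""
    return {
        frozenset(group) | {ego}
        for group in combinations(sorted(adjacency[ego]), order - 1)
        if all(b in adjacency[a] for a, b in combinations(group, 2))
    }


def distinct_clique_estimate(adjacency, sample, order):
    """Return (number of distinct observed cliques, estimated total count)."""
    N = len(adjacency)   # assumes every vertex appears in the edge list
    n = len(sample)
    distinct = set()
    for ego in sample:
        # Labels let repeated sightings of the same clique collapse to one.
        distinct |= cliques_containing(adjacency, ego, order)
    inclusion_prob = 1 - (1 - order / N) ** n
    return len(distinct), len(distinct) / inclusion_prob


adjacency = build_adjacency(EDGES)
sample = [5, 8, 3, 5]   # hypothetical draw with replacement; node 5 drawn twice
c3, c3_hat = distinct_clique_estimate(adjacency, sample, order=3)
print(c3, c3_hat)
```

The set of observed cliques must be kept in memory to detect repeats, which is the extra space cost relative to Clique Degree Sums.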

  32. Distinct Clique Counting (labeled neighbors) • UIS with replacement, N=8, n=4 [Figure: graph G with vertices 1 to 8 and three order-3 cliques a, b, c]

  33. Distinct Clique Counting (labeled neighbors) • UIS with replacement, N=8, n=4 [Figure: graph G with vertices 1 to 8 and three order-3 cliques a, b, c; the observed order-3 cliques, including repeats, are reduced to the distinct order-3 cliques]

  34. Computational complexity • Space complexity to count Ci or Cu • $O(1)$ for the Clique Degree Sums method • $O(c_i)$ or $O(c_u)$ for the Distinct Clique Counting method • Time complexity • from $O(3^{N/3})$ to $O(n \cdot 3^{D/3})$, where N is the graph size, D is the maximum degree, and n is the sample size • from $O(n \cdot 3^{D/3})$ to $O(3^{D/3})$ via parallel computations per egonet

  35. Benefits of our methodology • Full knowledge of graph not required • Fast estimation for massive known graphs • Estimation or exact computation easily parallelizable for massive known graphs • Estimation with or without neighbor labels • Supports vertex attributes • Supports a variety of sampling designs

  36. Results

  37. Simulation Results

  38. Simulation Results: Facebook New Orleans • Egonet sample size n=1,000 • Uniform independence sampling, without replacement • 1000 simulations [Figure panels: Distinct Clique Counting; Clique Degree Sums]

  39. Simulation Results • 1000 simulations • Error metric: Normalized Mean Absolute Error [Figure panels: Clique Degree Sums; Distinct Clique Counting]

  40. Simulation Results [Figure panels: Clique Degree Sums; Distinct Clique Counting]

  41. Which estimation method to use? Heuristic • Average Edge Count = (all edges between egos and neighbors) / (unique edges between egos and neighbors) • Example: for graph G with N=8 and n=3 sampled egos, Average Edge Count = 9 / 6 = 1.5 [Figure: graph G with vertices 1 to 8, order-3 cliques a, b, c, and the three sampled egonets]
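
The formula for the error metric did not survive in the transcript. The sketch below assumes one common definition of normalized mean absolute error, the mean of |estimate − true value| over simulation runs divided by the true value; the definition used in the talk may differ. The function name and the example numbers are hypothetical.

```python
def normalized_mean_absolute_error(estimates, true_value):
    """Mean absolute error over runs, divided by the true value (assumed form)."""
    mean_abs_error = sum(abs(e - true_value) for e in estimates) / len(estimates)
    return mean_abs_error / true_value


# Hypothetical estimates from four simulation runs of a quantity whose true value is 3.
example_estimates = [2.0, 4.0, 3.0, 3.5]
nmae = normalized_mean_absolute_error(example_estimates, true_value=3.0)
print(nmae)
```

Normalizing by the true value makes errors comparable across clique orders whose counts differ by orders of magnitude.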
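
The Average Edge Count ratio can be computed as sketched below. The sketch assumes that only ego-to-neighbor edges are counted, once per sampled ego, and that an edge is "unique" when its pair of endpoint labels has not been seen before. The graph `EDGES`, the sampled egos, and the function name are hypothetical, so the resulting value differs from the slide's 9 / 6 example.

```python
# Hypothetical 8-vertex graph (an assumption, not the graph on the slide).
EDGES = [(2, 5), (2, 6), (5, 6), (1, 7), (1, 8), (7, 8),
         (4, 5), (4, 8), (5, 8), (2, 3), (3, 4)]


def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbors."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def average_edge_count(adjacency, egos):
    """All ego-neighbor edges divided by the unique ego-neighbor edges."""
    all_edges = [frozenset((ego, neighbor))
                 for ego in egos
                 for neighbor in adjacency[ego]]
    return len(all_edges) / len(set(all_edges))


adjacency = build_adjacency(EDGES)
ratio = average_edge_count(adjacency, [5, 8, 4])  # hypothetical sampled egos
print(ratio)
```

A ratio of 1 means the sampled egonets share no ego-neighbor edges; larger values indicate more overlap between egonets.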

  42. Estimation Results: Facebook ‘09 • Facebook ‘09 crawled dataset [1] • 36,628 unique egonets • [1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, IEEE INFOCOM 2010.

  43. Estimation Results: vertex attributes, Facebook ‘09 • Complemented dataset with gender attributes • about 6 million users

  44. Unbiased estimation methods of clique distributions • Clique Degree Sums • Distinct Clique Counting • Facebook cliques • Future work • support estimation of any subgraphs (beyond cliques) • References • [1] M. Gjoka, E. Smith, C. T. Butts, “Estimating Clique Composition and Size Distributions from Sampled Network Data”, IEEE NetSciCom '14. • [2] Facebook datasets: http://odysseas.calit2.uci.edu/research/osn.html • [3] Python code for Clique Estimators: http://tinyurl.com/clique-estimators • Thank you!
