1 / 30

Walking in facebook : a case study of unbiased sampling of osnS

Walking in facebook : a case study of unbiased sampling of osnS. 2010.6.9 junction. Outline. Motivation and Problem Statement Sampling Methodology Evaluation of Sampling Techniques Facebook Data Analysis Conclusion. Online Social Networks (OSNs). Social Graph undirected graph

tymon
Download Presentation

Walking in facebook : a case study of unbiased sampling of osnS

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Walking in facebook:a case study of unbiased sampling of osnS 2010.6.9 junction

  2. Outline • Motivation and Problem Statement • Sampling Methodology • Evaluation of Sampling Techniques • Facebook Data Analysis • Conclusion

  3. Online Social Networks (OSNs) • Social Graph • undirected graph • G = (V, E) • V: nodes (users) • E: edges (relationships) • kv : node degree • A network of declared friendships between users, allowing users to maintain relationships • Many popular OSNs with different focus • Facebook, LinkedIn, Flickr, … • Facebook • More than 400 million active users • 50% of themlog on to Facebook in any given day • Average user has 130 friends • People spend over 500 billion minutes per month on Facebook • more than 100 million mobile users • Mobile user are twice more active than non-mobile users.

  4. Why Sample OSNs? • Representative samples desirable • study properties • test algorithms • Obtaining complete dataset difficult • companies usually unwilling to share data • tremendous overhead to measure all (~100TB for Facebook)

  5. Problem Statement • Obtain a representative sample of usersin a given OSN by exploration of the social graph. • Uniform sample of Facebook users • explore graph using various crawling techniques

  6. Outline • Motivation and Problem Statement • Sampling Methodology • Crawling Methods • Convergence Evaluation • Data Collection • Evaluation of Sampling Techniques • Facebook Data Analysis • Conclusion

  7. Crawling Methods • Crawling Methods • Breadth First Search (BFS) • Random Walk (RW) • Re-Weighted Random Walk (RWRW) • Metropolis-Hastings Random Walk (MHRW) • Uniform Sampling (UNI)

  8. Breadth First Search (BFS) • Early measurement studies of OSNs use BFS as primary sampling technique • Starting from a seed, explores all neighbor nodes. • As this method discovers all nodes within some distance from the starting point, an incomplete BFS is likely to densely cover only some specific region of the graph. • BFS leads to bias towards high degree nodes

  9. Random Walk (RW) • Explores graph one node at a time with replacement • In the stationary distribution • biased towards higher degree nodes (πv ~ kv) Degree of node υ Number of edges

  10. Re-Weighted Random Walk (RWRW) • Corrects for degree bias at the end of collection • Without re-weighting, the probability distribution for node property A is: (e.g. the degree, network size ...) • Re-Weighted probability distribution : Degree of node u

  11. Metropolis-Hastings Random Walk (MHRW) • Explore graph one node at a time with replacement • In the stationary distribution • Exactly the uniform distribution

  12. Uniform Sampling (UNI) • As a basis for comparison (ground truth) • Rejection sampling • uniform sampling of on the 32-bit IDs • discarding the non-existing ones • yields a uniform sample of the existing user IDs in Facebook for any allocation policy (i.e. even if the userIDs are not evenly allocated in the 32-bit address space) • UNI not a general solution for sampling OSNs • userID space must not be sparse • names instead of numbers • must be supported by the systems

  13. Convergence Detection • Number of samples (iterations) to loose dependence from starting points?

  14. Convergence Evaluation • Using Multiple Parallel Walks to improve convergence • avoid getting trapped in certain region • starting from 28 different randomly chosen initial nodes • Detecting Convergence with Online Diagnostics • sampling longer and discard a number of initial “burn-in” iterations • Consumed BW (TB) and measurement time (days) • Crucial to decide appropriate ‘burn-in’ and total running time • Grweke Diagnostic • Gelman-Rubin Diagnostic

  15. Geweke Diagnostic • Detect the convergence of a single Markov chain • With increasing number of iterations, Xa and Xb move further apart, which limits the correlation between them. • according to the law of large numbers, the z values become normally distributed ~ (0, 1) • Declare convergence when most values fall in the [-1,1] interval Xa Xb

  16. Gelman-Rubin Diagnostic • Detects convergence for m>1 walks (m: # of chains) • Compare the empirical distributions of individual chains with the empirical distribution of all sequences together • if they are similar enough (R,1.02) , declare convergence Between walks variance Walk 1 Walk 2 Walk 3 Within walks variance

  17. Data Collection • Information collected

  18. Data Collection • 28 x 81K = 2.26 M • 28 initial starting nodes • crawl until exactly 81K samples are collected • Summary of data set • 18.53M nodes picked uniform from [1, 232] • only 1216 K users existed • 228 K users had zero friends • repeat the same node in a walk • # of rejected nodes without repetition : 645 K • RW: 97 % nodes are unique • BFS: 97 % nodes are unique • confirms that the random seeding chose different areas of FB

  19. Outline • Motivation and Problem Statement • Sampling Methodology • Evaluation of Sampling Techniques • Convergence Analysis • Methods Comparison • Unbiased Estimation • Facebook Data Anaylsis • Conclusion

  20. Convergence Analysis • What is a fair way to compare the results of MHRW with RW and BFS? • MHRW visits fewer unique nodes than RW and BFS • MHRW stays at some nodes for relatively long time/iterations • Happens usually at some low degree node • An appropriate practical comparison should be based on the number of visited unique nodes

  21. Convergence Test • When does it reach equilibrium? • Burn-in determined to be 3K -> discard 6K • converge when all 28 values fall in the [-1, 1] interval • 500 iterations Node Degree • converge when all R scores drop below 1.02 • (0,1): not in / in • 3000 iterations

  22. Methods Comparison • MHRW, RWRW produce good in estimating the probability of a node degree • The degree distribution will converge fast to a good uniform sample • Poor performance for BFS, RW 28 crawls

  23. Unbiased Estimation (BFS, RW) • Node degree distribution • introduce a strong bias towards the high degree nodes • the low-degree nodes are under-represented

  24. Unbiased Estimation (MHRW) • Degree distribution identical to UNI (MHRW, RWRW)

  25. Outline • Motivation and Problem Statement • Sampling Methodology • Evaluation of Sampling Techniques • Facebook Data Analysis • Conclusion

  26. FB Social Graph– degree distribution • Degree distribution not a power law a1=1.32 a2=3.38

  27. FB Social Graph - Assortativity • Assortativity • nodes tend to connect to similar or different nodes? • positive correlation: high degree nodes tend to connect to other high degree nodes

  28. FB Social Graph – Privacy Awareness

  29. Outline • Motivation and Problem Statement • Sampling Methodology • Evaluation of Sampling Techniques • Facebook Data Analysis • Conclusion

  30. Conclusion • Compared graph crawling methods • MHRW, RWRW performed remarkably well • BFS, RW lead to substantial bias • Practical recommendations • correct for bias • usage of online convergence diagnostics • proper use of multiple chains

More Related