Respondent-driven Sampling for Characterizing Unstructured Overlays

Respondent-driven Sampling for Characterizing Unstructured Overlays Graciously Presented By: Shubho Sen AT&T Labs - Research

Motivation P2P systems are very popular in practice. • Millions of simultaneous users. • A significant fraction of Internet traffic Measurement studies aid understanding existing systems and user behavior. Capturing an accurate global “snapshot” is often infeasible. • P2P systems are distributed, large, and rapidly changing. • P2P crawlers are likely to capture incomplete or distorted snapshots Sampling is a natural approach, and has been used implicitly in most earlier P2P measurement studies. How can we collect representative samples?

The Graph Sampling Problem We focus on sampling peer properties, such as number of neighbors (degree), access link bandwidth, session time, # files Sampling peer properties has two steps: • Discovering and selecting peers (or samples) • Measuring the desired properties of selected peers Selecting peers uniformly at randomis hard – there are two sources of bias [Stutzbach:IMC06] • Topological: high-degree peers are more likely to be selected • Temporal: short-lived peers are more likely to be selected Random walks are a promising approach to sampling • The resulting bias is precisely known • Samples can be collected in parallel by multiple walkers

Sampling Using Random Walk Random walks can be described with a transition matrix P(x,y) P(x,y) : probability of moving from x to y Pr(x,y) : probability of moving from x to y after r moves Random walks converge to a stationary distribution Problem: we need a uniform distribution

Metropolized Random Walk (MRW) The Metropolis-Hastings method modifies the transition matrix to yield the desired uniform distribution [Stutzbach:IMC06] MRW method: Select a neighbor y of x uniformly at random Transition to y with probability min( deg(x)/deg(y) , 1) Otherwise, self-loop to x. Results in uniform stationary dist. (x)= 1/|V| MRW compensates for bias as samples are collected y x

This paper Presents a new graph sampling technique, Respondent-Driven Sampling (RDS) Compares the performance of RDS and MRW sampling techniques using simulations & experiments

Respondent-driven Sampling A development of Snowball Sampling [Salganik04] Commonly used in social sciences to sample “hidden” populations, e.g. HIV+ individuals Social relationships (references) are used by sampler to diffuse into hidden populations • Each person introduces n other persons • Similar to random walk (n = 1) We adopt the RDS technique from social sciences for sampling P2P networks

RDS Formulation Goal: Estimate the distribution of node property X Perform regular random walk, collect values of property X and node degree (deg(v)) at each visited node Deal with the bias during the post-processing as follows: Divide possible values for X into several ranges: {R1, . . . ,Rm} Partition nodes with the X value within the same range: {V1, . . . ,Vm} Using Hansen-Hurwitz estimator to compensate for the bias, the proportion of all nodes in group i is estimated as follows: • Ti: visited samples in group i • T: all visited samples

Evaluation Overview Performance metric • Consider only peer properties that may interact with the walk: • 1) Peer Degree, 2) Peer Uptime, 3) Peer RTT • Compare the dist. of the these peer properties from samples and “ground truth” using Kolmogorov-Smirnov (KS) statistics Evaluation Methodology • Evaluation over static graphs • Effect of graph structure • Evaluation over dynamic graphs (session level simulation) • Benefits of parallel Sampling (see the paper) • Effect of 1) churn, 2) peer discovery, 3) target peer degree • Experiments over Gnutella network

Evaluation: Static Graphs Using graphs with different degree distribution & clustering characteristics: • Random graphs (ER): Erdos-Renyi • Small-world graphs (SW): Watts and Strogatz • Scale-free graphs (BA): Barabasi and Albert • Hierarchical Scale-Free graphs (HSF): Barabasi ‘02 • Power-law degree distribution • Node clustering is inversely proportional to node degree • Gnutella graphs (GA): Snapshots of Gnutella Ultrapeer topology

Hierarchical Scale-Free (HSF)

Static Graphs • Accuracy of both techniques is improved with the number of samples in most cases • The rate of improvement in accuracy is much lower over HSF especially for MRW • Walkers are likely to get trapped within clusters in HSF graphs • Leaving a cluster requires visiting high degree nodes but MRW is less likely to visit these nodes • Rewiring a small fraction of randomly selected edges in HSF significantly improves accuracy for both techniques • RDS is less sensitive to graph clustering than MRW

Dynamic Graphs • Churn is a primary limiting factor for accuracy • Session len.> 5m  Very good sampling accuracy • Churn model has little effect • Similar impact on other peer properties (see the paper) • Sampling error is small once nodes have sufficient connectivity (> 5) • Lower accuracy for smaller degree is due to graph partitioning • Partitioned nodes in History mech. reduce the accuracy of sampling

Experiment: Gnutella • Run crawler, 1000 RDS & 1000 MRW walkers in parallel • 500 steps per walker • Use captured snapshots by crawler as a “rough” reference • Show min, max, avg KS over 6 experiments • Focus only on degree dist • The degree dist from samples & crawls are very similar (KS~0.03) • The accuracy is an order of magnitude lower than dynamic sim due to inaccurate reference. • Both sampling technique achieve similar accuracy

Conclusions & Future Work • RDS always performs as good or better than MRW • High level of graph clustering can significantly degrade the accuracy of both RDS and MRW • RDS is less sensitive than MRW to graph clustering • There is sweet spot for the number of parallel samplers. • Poor connectivity & high dynamics adversely affect the accuracy of both techniques. • Future Work: • RDS is a promising approach for sampling user properties in Online Social Networks • Sampling over directed graphs raises new challenges.

Thank You !

Different Grpah Structures

Dynamic Simulation Setting Simulation environment • Session-time distributions : Weibull, Exponential, Pareto • Poisson arrival process • Peer discovery : Oracle, FIFO, HeartBeat, History • Target population : 100’000 • Min. Degree : 3-30 • Sampling Parameters: • Node degree (DEG) • Node query latency (RTT) • Session length/uptime (UT)

Dynamic Graphs: Effect of Parallelism • Too much parallelism does not improve performance • Too long random walks have negative effect • Sweet spot exists

Respondent-driven Sampling for Characterizing Unstructured Overlays

Respondent-driven Sampling for Characterizing Unstructured Overlays

Presentation Transcript

Respondent Conditioning

Sampling and Characterizing Plaster

A Framework for Architecting Peer-to-Peer Receiver-driven Overlays

Concrete Overlays

Unstructured information integration through data-driven similarity discovery

Network Overlays.

Respondent Burden

An iterative approach to respondent driven sampling (RDS) using community-based

Respondent Conditioning

Schema-Driven Relationship Extraction from Unstructured Text

Testing the Random Recruitment Assumption of Respondent-Driven Sampling: Practical Implications

Ontology-Driven Automatic Entity Disambiguation in Unstructured Text

Respondent-Driven Sampling

Population Size Estimations With Respondent Driven Sampling

Schema-Driven Relationship Extraction from Unstructured Text

Unstructured overlays: construction, optimization, applications

Respondent Driven Sampling (RDS)

Content Overlays

IT use in a respondent driven sampling survey, Kampala, Uganda

A stata program for Respondent Driven Sampling

Schema-Driven Relationship Extraction from Unstructured Text

Sea Ice

Sea Ice