Sampling from Large Graphs
E N D
Presentation Transcript
Motivation • Our purpose is to analyze and model social networks • An online social network graph is composed of millions of nodes and edges • In order to analyze it we have to store the whole graph in the computers memory • Sometimes this is impossible • Even when it is possible it is extremely time consuming only to compute some basic graph properties • Thus we need to extract a small sample of the graph and analyze it
Problem • Given a huge real graph, how can we derive a representative sample? • Which sampling method to use? • How small can the sample size be? • How do we measure success?
Problem • What do we compare against? • Scale down sampling: • Given a graph G with n nodes, derive a sample graph G’ with n’ nodes (n’ << n) that will be most similar to G • Back in time sampling: • Let Gn’ denote graph G at some point in time when it had n’ nodes • Find a sample S on n’ nodes that is most similar to Gn’ (when graph G had the same size as S)
Evaluation Techniques • Criteria for scale down sampling • In degree distribution • Out degree distribution • Distribution of sizes of weakly connected components • Distribution of sizes of strongly connected components • Hop plot, number of reachable pairs of nodes at distance h • Hop plot on the largest WCC • Distribution of the clustering coefficient • Distribution of singular values of the graph adjacency matrix versus the rank
Evaluation Techniques • Criteria for back in time sampling • Densification Power Law: • Number of edges vs number of nodes over time • The effective diameter of the graph over time • Observed that shrinks and stabilizes over time • Normalized size of the largest WCC over time • Average clustering coefficient over time • Largest singular value of graph adjacency matrix over time
Statistical Tests • Comparing graph patterns using Kolmogorov-Smirnov D-statistic • Measure the agreement between two distributions using D = maxx{|F’(x) – F(x)|} • Where F and F’ are two cumulative distribution functions • Does not address the issue of scaling • Just compares the shape of the distributions • Comparing graph patterns using the visiting probability • For each node u E G, calculate the probability of visiting node w E G • Use of Frobenius norm to calculate the difference in visiting probability.
Algorithms • Sampling by random node selection • Random Node Sampling: • Uniformly at random select a set of nodes • Random PageRank sampling • Set the probability of a node being selected into the sample proportional to its PageRank weight • Random Degree Node • Se the probability of a node being selected into the sample proportional to its degree
Algorithms • Sampling by random edge selection • Random edge sampling • Uniformly select edges at random • Random node – edge sampling • Uniformly at random select a node, then uniformly at random select an edge incident to it • Hybrid sampling • With probability p perform RNE sampling, with probability 1-p perform RE sampling
Algorithms • Sampling by exploration • Random node neighbor • Select a node uniformly at random together with all his out-going neighbors • Random walk sampling • Uniformly at random select a random node and perform a random walk with restarts • If we get stuck, randomly select another node to start • Random jump sampling • Same as random walk sampling but with a probability p we jump to a new node • Forest fire sampling • Choose a node u uniformly at random • Generate a random number z and select z out links of u that are not yet visited • Apply this step recursively for all z links selected
Evaluation • Three groups of algorithms: • RDN, RJ, RW: biased towards high degree nodes and densely connected part of the graph • FF, RPN, RN: not biased towards high degree nodes, match the temporal densification of the true graph • RE, RNE, HYB: For small sample size the resulting graph is very sparsely connected • Conclusion: • For the scale down goal methods based on random walks perform best • For the back in time goal forest fire algorithm performs best • No single perfect answer to graph sampling • Experiments showed that a 15% sample is usually enough
Further thoughts • Wrong approach trying to match all properties? Maybe we should try matching one at a time • Test methods for sampling on graphs with weighted – labeled edges • Current algorithms are extremely slow when we read a graph from a file • Need to implement better versions of them in order to decrease the I/O cost
Bibliography • Sampling from large graphs, J. Leskovec and C. Faloutsos • Unbiased sampling of Facebook, M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou • What is the real size of a sampled network? The case of the Internet, F. Viger, A. Barrat. L. Dall’Asta, C. Zhang and E. D. Kolaczyk • Sampling large Internet topologies for simulation purposes, V. Krishnamurthy, M. Faloutsos, M. Chrobak, J. Cui, L. Lao and A. G. Percus