
Streaming and MapReduce for Graphs



  1. Streaming and MapReduce for Graphs

  2. Sample problems • How do we solve these problems? • finding connected components • estimating the clustering coefficient • minimum spanning tree (weighted) • minimum cut and other partitioning • maximum matching (weighted) • random walks

  3. Streaming Model • Stream = m elements from a universe of size n (possibly with weights), e.g. …, (v1, 1), (v2, 2), (v2, 1), (v1, 300), … • Vector interpretation • a stream over universe [n] => a frequency vector of size n • Restrictions • restricted memory, preferably logarithmic • small number of passes over the input, preferably constant • fast update time • Different models • Simple: each element (e, w) arrives only once • Cash register: multiple arrivals, i.e. updates (e, +w), but all are increments • Turnstile: updates (e, ±w) may be both positive and negative
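The vector interpretation of the three models can be made concrete with a minimal sketch (the function name and example streams are illustrative, not from the slides):

```python
def apply_stream(n, updates):
    """Apply a stream of (element, delta) updates to a frequency
    vector over the universe [n] = {0, ..., n-1}."""
    freq = [0] * n
    for e, w in updates:
        freq[e] += w
    return freq

# Cash-register model: every delta is a positive increment.
cash_register = [(1, 1), (2, 2), (2, 1), (1, 300)]
print(apply_stream(4, cash_register))   # [0, 301, 3, 0]

# Turnstile model: deltas may also be negative (deletions).
turnstile = [(1, 5), (2, 2), (1, -3)]
print(apply_stream(4, turnstile))       # [0, 2, 2, 0]
```

A streaming algorithm is not allowed to store this vector explicitly; the point of the sketches on the next slides is to answer questions about it in much less than n space.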

  4. Estimating moments • a stream over [n] => a frequency vector f; the p-th moment is Fp = Σi fi^p • Goal: estimate Fp to a factor (1 ± ε) with probability 1 − δ • (AMS) In order to (ε, δ)-estimate Fp • poly(1/ε, log n) space is sufficient for 0 < p ≤ 2 • n^(1 − 2/p) space is necessary for p > 2

  5. Estimating F2 • Pick a random hash function h: [n] -> {+1, −1} • For each update (i, v) perform Z ← Z + v·h(i) • At the end estimate X = Z², which satisfies E[X] = F2 • Finally, use median of means to boost accuracy and confidence.
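The AMS F2 sketch above can be sketched in a few lines; the function name and the repetition/grouping parameters are illustrative choices, and a true ±1 hash is simulated here by a random sign table:

```python
import random
import statistics

def ams_f2(stream, n, reps=30, groups=5, seed=0):
    """Estimate F2 = sum_i f_i^2: maintain Z = sum v*h(i) for a random
    +/-1 hash h, estimate by Z^2, then take a median of means over
    independent repetitions for concentration."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        h = [rng.choice((-1, 1)) for _ in range(n)]  # random +/-1 "hash"
        z = 0
        for i, v in stream:
            z += v * h[i]                            # one O(1) update per item
        estimates.append(z * z)                      # E[Z^2] = F2
    means = [statistics.mean(estimates[j::groups]) for j in range(groups)]
    return statistics.median(means)
```

On a stream with frequencies (3, 0, 4, 1) the true F2 is 9 + 16 + 1 = 26, and the sketch concentrates around that value while storing only the running sums, never the vector.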

  6. Estimating F0 • Define a hash function h: [n] -> [M] • Set k = 1/ε² • On element (x, v) • compute h(x) and maintain v = the k-th minimum distinct hash value seen so far • Finally, output X = k·M/v
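A minimal sketch of this k-th-minimum-value estimator, assuming a hash into [0, M) built from SHA-256 (the function name and default k are illustrative):

```python
import hashlib

def kmv_f0(stream, k=64):
    """KMV distinct-count sketch: hash each element into [0, M), keep
    the k smallest distinct hash values; if the k-th smallest is v,
    output k*M/v (the k minima of F0 uniform points sit near k*M/F0)."""
    M = 2 ** 32
    def h(x):
        return int(hashlib.sha256(str(x).encode()).hexdigest(), 16) % M
    smallest = set()              # at most k smallest distinct hash values
    for x, _w in stream:
        smallest.add(h(x))
        if len(smallest) > k:
            smallest.remove(max(smallest))
    if len(smallest) < k:         # fewer than k distinct elements: exact
        return len(smallest)
    return k * M / max(smallest)
```

The relative error behaves like 1/sqrt(k), matching the k = 1/ε² choice on the slide.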

  7. Graph Streams and Problems • Stream = edges • e1, e2, e3,…. • other variants too • Space used = O(n*polylog(n)) • Problems • Connectivity • Matching • Spanners • Clustering coefficient • Moments of degree distribution

  8. Connectivity • Doable in O(n log n) space • keep a label L(u) with every node u • equal labels indicate the same component • update label information as a new edge (u, v) arrives • set L(w) ← L(u) for all w with label L(v) • At the end, each connected component has a single common label
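The label-merging pass can be sketched as follows (a minimal one-pass version; relabeling the smaller class is an added efficiency choice, not from the slide):

```python
def stream_components(n, edge_stream):
    """One pass over the edge stream, O(n) labels: when edge (u, v)
    arrives, merge v's label class into u's."""
    label = list(range(n))
    members = {i: [i] for i in range(n)}   # nodes carrying each label
    for u, v in edge_stream:
        lu, lv = label[u], label[v]
        if lu == lv:
            continue                       # already in the same component
        if len(members[lu]) < len(members[lv]):
            lu, lv = lv, lu                # relabel the smaller class
        for w in members[lv]:
            label[w] = lu
        members[lu].extend(members.pop(lv))
    return label
```

After the stream ends, two nodes are connected iff their labels agree.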

  9. Connectivity (lower bound) • Not doable in o(n) space • P is a “balanced” property if there exist a graph G and a node u such that • V1 = {v : G + (u, v) satisfies P} • V2 = {v : G + (u, v) does not satisfy P} • min(|V1|, |V2|) = Ω(n) • Any such P needs Ω(n) space

  10. Spanners • dG(u, v) = shortest-path distance in G • Want a subgraph H = (V, E′) such that dH(u, v) ≤ t·dG(u, v) for all u, v, i.e. H is a t-spanner • Can construct a (2t − 1)-spanner in O(n^(1+1/t)) space

  11. Spanner Algorithm • Initialize H = empty • For each new edge (u, v) • if the current dH(u, v) > 2t − 1, include (u, v) in H • Claim • H is a (2t − 1)-spanner • Number of edges is O(n^(1+1/t)), since H has girth > 2t • Takes time O(n) per edge, but faster algorithms exist
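The greedy distance test can be sketched directly; this minimal version recomputes dH with a truncated BFS on each arrival (function names are illustrative, and a real implementation would use the faster clustering-based algorithms the slide mentions):

```python
from collections import deque, defaultdict

def bfs_dist(H, s, g, limit):
    """Distance from s to g in H, truncated at `limit` (inf if farther)."""
    if s == g:
        return 0
    dist = {s: 0}
    q = deque([s])
    while q:
        x = q.popleft()
        if dist[x] >= limit:
            continue                 # never need distances beyond limit
        for y in H[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                if y == g:
                    return dist[y]
                q.append(y)
    return float("inf")

def streaming_spanner(edge_stream, t):
    """Keep edge (u, v) only if its endpoints are currently at distance
    > 2t-1 in H; every dropped edge has a detour of length <= 2t-1,
    so H is a (2t-1)-spanner."""
    H = defaultdict(set)
    kept = []
    for u, v in edge_stream:
        if bfs_dist(H, u, v, limit=2 * t - 1) > 2 * t - 1:
            H[u].add(v)
            H[v].add(u)
            kept.append((u, v))
    return kept
```

For t = 2 on a triangle, the third edge closes a path of length 2 ≤ 3 and is dropped.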

  12. Counting triangles • Clustering coefficient = (#closed triplets)/(#connected triplets) • a signature of community structure • Different types of signed triangles measure the “balance” of the network (+++ or --- vs. ++-) • Algorithms • sampling based: sparsify the graph so that it fits into memory • streaming: reduce to frequency moments • linear algebra based: reduce triangle counting to a trace estimation problem and use randomized approximations

  13. Naïve triangle counting • For every edge (u, v), scan all vertices w and test whether w is a common neighbor • Time O(mn)
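A minimal sketch of the naïve O(mn) count (each triangle is found once per edge, hence the division by 3):

```python
def naive_triangles(n, edges):
    """For every edge (u, v), scan all n vertices w and test whether
    (u, w) and (v, w) are edges: O(m*n) time, each triangle counted
    three times (once per edge)."""
    E = set()
    for u, v in edges:
        E.add((u, v))
        E.add((v, u))
    count = 0
    for u, v in edges:
        for w in range(n):
            if (u, w) in E and (v, w) in E:
                count += 1
    return count // 3
```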

  14. Improving Exact Counting (Alon, Yuster, Zwick) • Algorithm: • Divide vertices into low- and high-degree according to a degree threshold β • For all low-degree vertices • check neighbor pairs and whether they are connected • For the high-degree subgraph • use (fast) matrix multiplication to count triangles • Asymptotically the fastest algorithm, but not practical for large graphs.

  15. AYZ triangle counting • Use degree threshold β • Time spent counting triangles with low-degree pivots = O(E·β) • Number of high-degree vertices ≤ 2E/β • Time spent in matrix multiplication = (2E/β)^ω • Total time = O(E·β + (2E/β)^ω) • By an appropriate choice of β, minimized at O(E^(2ω/(ω+1))) ≈ O(E^1.41)
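The matrix-multiplication step used on the high-degree subgraph rests on the identity trace(A³) = 6·(#triangles), since every triangle contributes six closed walks of length 3. A minimal sketch with naïve cubic multiplication (AYZ's speedup comes from substituting fast matrix multiplication here):

```python
def triangles_via_matmul(adj):
    """Count triangles from trace(A^3)/6, where adj is a dense 0/1
    adjacency matrix: A^3[i][i] counts closed 3-walks from i, and each
    triangle through i contributes two of them."""
    n = len(adj)
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    A3 = matmul(matmul(adj, adj), adj)
    return sum(A3[i][i] for i in range(n)) // 6
```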

  16. Naïve sampling • Take r independent samples of three distinct vertices; Ti = number of triplets with exactly i edges • Estimate T3 as the fraction of sampled triplets that form triangles, scaled by the total number of triplets • Then the estimate is within a (1 ± ε) factor with probability at least 1 − δ, provided r is large enough relative to n³/T3 • Works for dense graphs, e.g., T3 ≥ n² log n [WAW '10]

  17. Edge sampling • Triangle Sparsifiers • Keep each edge independently with probability p; count the triangles in the sparsified graph and multiply by 1/p³ • If the graph has at least n·logᶜ(n) triangles we get concentration • Proof of concentration is tricky • uses the Kim–Vu concentration result for multivariate polynomials, which have a bad Lipschitz constant but behave “well” on average • improved using a colorability result of Hajnal–Szemerédi • works for suitable p, where t = #triangles, Δ = max degree, d = avg. degree
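The sparsify-then-count estimator is easy to sketch (names are illustrative; the exact counter uses the common-neighbor trick and divides by 3 since each triangle is seen once per edge):

```python
import random

def count_triangles(edges):
    """Exact count: for each edge (u, v), common neighbors of u and v
    close a triangle; each triangle is found once per edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    t = 0
    for u, v in edges:
        t += len(adj[u] & adj[v])
    return t // 3

def sparsified_triangles(edges, p, seed=0):
    """Keep each edge independently w.p. p, count triangles exactly in
    the sparsified graph, and scale by 1/p^3: a triangle survives iff
    all three of its edges do, which happens w.p. p^3, so the estimate
    is unbiased."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3
```

With p = 1 this recovers the exact count; the concentration argument on the slide is about how small p can be while the estimate stays accurate.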

  18. Streaming Triangle counting • Consider a pseudo-array where each element is a triplet, e.g. t1 = (a1, b1, c1): each edge (u, v) contributes the triplet {u, v, w} for every other vertex w, so a triplet's frequency equals the number of its edges present • Ti = number of triplets with exactly i edges • Estimate F0, F1, F2 of this pseudo-array using sketches • Use the relation T3 = F0 − 1.5·F1 + 0.5·F2 to estimate T3 • Number of samples grows with (mn/T3)² • Better in the incidence model
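The moment identity can be checked exactly on small graphs: with Ti triplets having i edges, F0 = T1+T2+T3, F1 = T1+2T2+3T3, F2 = T1+4T2+9T3, and F0 − 1.5F1 + 0.5F2 cancels the T1 and T2 terms, leaving T3. A minimal sketch that builds the pseudo-stream explicitly (in the streaming algorithm the moments would come from sketches, not exact counting):

```python
from collections import Counter

def triplet_moments(n, edges):
    """Build the triplet pseudo-stream: each edge (u, v) emits the
    triplet {u, v, w} for every other vertex w, so a triplet's
    frequency equals the number of its edges present.
    Return (F0, F1, F2) of that stream."""
    freq = Counter()
    for u, v in edges:
        for w in range(n):
            if w != u and w != v:
                freq[frozenset((u, v, w))] += 1
    F0 = len(freq)
    F1 = sum(freq.values())
    F2 = sum(c * c for c in freq.values())
    return F0, F1, F2

def t3_from_moments(n, edges):
    """T3 = F0 - 1.5*F1 + 0.5*F2: the coefficients zero out triplets
    with one or two edges and leave exactly the triangles."""
    F0, F1, F2 = triplet_moments(n, edges)
    return F0 - 1.5 * F1 + 0.5 * F2
```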

  19. Random Walks on a stream • Naïve methods • for each step of the random walk, do a pass over the network: using space O(n), k steps need k passes • or sample O(kn) edges, k from every node: in one pass, can do a walk of length k • Main result: • using space O(n), can do k steps of the random walk using only O(√k) passes • Used to approximate PageRank, conductance, etc.
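The second naïve method is worth making concrete, since the main result refines exactly this idea. A minimal sketch, assuming every node has at least one out-neighbor (function name is illustrative): pre-draw k neighbor samples per node in one conceptual pass, then replay them offline; since the walk takes k steps in total, no node ever needs more than its k samples.

```python
import random

def presampled_walk(adj, start, k, seed=0):
    """One conceptual pass: draw k independent neighbor samples per
    node. Then run a k-step random walk offline, consuming a fresh,
    unused sample each time a node is visited."""
    rng = random.Random(seed)
    samples = {u: [rng.choice(nbrs) for _ in range(k)]
               for u, nbrs in adj.items()}
    cur = start
    for _ in range(k):
        cur = samples[cur].pop()   # never reuse a sample: steps stay independent
    return cur
```

This uses O(kn) space for one walk; the √k-pass result keeps only O(n) space by stitching shorter pre-sampled walks together, as the next slide describes.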

  20. Random walks • Multiple start points: sample each node w.p. p and create a w-length random walk from each sampled node in w passes • Try to stitch these walks together • Can get stuck because • the endpoint was not in the original sample (i.e. no random walk starts there), or • the endpoint's walk was already used (i.e. cannot take independent steps from it) • Handling stuck nodes: • maintain the stuck node(s) and the set of “used” start points • take a new random sample of s edges from each of these (maybe multiple times) • Crucial step: • whenever stuck, either the new random sample is enough to make progress, or we discover new nodes (and there are not many of them)

  21. Key-value groups • [Figure: MapReduce data flow — input key-value pairs → map → intermediate key-value pairs → group by key → reduce → output key-value pairs; slide from J. Ullman, CS345A]

  22. MapReduce formalization • Number of machines = N^(1−ε) • Memory per machine = N^(1−ε) • Total communication = N^(2−2ε), over all rounds • MRCᵏ = problems that need ≤ k rounds • Each round has a two-phase map-then-reduce structure • Ideally, want the same “total work” as the optimal sequential algorithm

  23. Connectivity (via MST)

  24. MST • Suppose |E| = |V|^(1+c) • Assume #machines = |V|^(1−ε) • memory per machine = |V|^(1−ε) • Claim: • number of iterations = O(c/ε) • total work = O(m·α(m, n)/ε)

  25. Back to triangle counting: the curse of the last reducer • Naïve MapReduce algorithm • in the first pass, collect edge pairs [(u, v), (v, x)], i.e. length-2 paths • in the second pass, check which of these paths close into triangles • Problem • reducers that deal with high-degree vertices take a very long time

  26. Trick 1: pivoting on smallest degrees • Pivot on the smallest-degree node of each triangle • Reduces counting time to O(m^(3/2)) • Intuition: • similar to the AYZ proof, divide the analysis by the pivot-degree threshold m^(1/2) • In the MapReduce setting, just use this trick to decide which vertices should be pivots
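A minimal sequential sketch of the pivoting trick (the MapReduce version distributes the same pair generation across reducers): orient every edge from its (degree, id)-smaller endpoint to the larger one, so each triangle is counted exactly once, at its lowest-degree vertex, and every out-neighborhood has size O(√m).

```python
from collections import defaultdict

def pivot_triangles(edges):
    """Count each triangle once at its lowest-degree vertex: orient
    edges from the (degree, id)-smaller endpoint to the larger, then
    intersect out-neighborhoods. Total work O(m^(3/2))."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda x: (deg[x], x)          # break degree ties by id
    out = defaultdict(set)
    for u, v in edges:
        a, b = (u, v) if rank(u) < rank(v) else (v, u)
        out[a].add(b)
    count = 0
    for u in list(out):
        for v in out[u]:
            count += len(out[u] & out.get(v, set()))  # w with u->w, v->w
    return count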

  27. Trick 2: Overlapping partitions • Divide vertices V = {V1, V2, …, Vt} • Vijk = Vi ∪ Vj ∪ Vk; Eijk = corresponding edges • In the first pass, partition the graph and weight each triangle so that it is counted exactly once across the overlapping partitions • Run the previous algorithm on each partition in parallel • Total work done is still O(m^(3/2))

  28. Datasets

  29. Runtimes • Note the reduction in the number of paths using Trick 1 • However, running it on MapReduce incurs overheads

  30. Runtimes

  31. Models + Bag of Algorithmic tricks • Models • streaming, semi-streaming, stream + sort, MapReduce • Algorithmic tricks • moment estimation on a data stream • edge sampling rather than triplet sampling • reducing triangle counting to moment estimation • piecing together random walks • pivoting on the smallest degree to count triangles • overlapping partitions to fit the graph into memory

  32. Not covered • Streaming + dynamic: • a model in which graph edges can appear and disappear • how can we test connectivity? • Multigraph streams • how do we compute different functions of node degrees? • Streaming + sort • can solve a number of the discussed problems in poly(log) space • interesting only if there is an efficient way to do disk-based sort • Clustering • Are these the right computational models?
