
Massive Data Streams in Graph Theory and Computational Geometry Ph.D. Dissertation Defense

Jian Zhang. Advisor: Joan Feigenbaum. Committee: Ravi Kannan, Avi Silberschatz, Sampath Kannan (UPenn). Support: NSF grants 0105337 and 0331548.


Presentation Transcript


  1. Massive Data Streams in Graph Theory and Computational Geometry: Ph.D. Dissertation Defense • Jian Zhang • Advisor: Joan Feigenbaum • Committee: Ravi Kannan, Avi Silberschatz, Sampath Kannan (UPenn) • Support: NSF grants 0105337 and 0331548

  2. Talk Outline • Streaming computational model • Overview of results • Approximate graph distances in the streaming model • Future research directions J. Zhang - Ph.D. Dissertation Defense

  3. Data Streams • A data stream is a sequence of data elements: a1, a2, …, an. • Stream of stock prices • Stream of IP packets • Data elements have different forms in different applications. • Scalar value • Tuple • The semantics of the data elements are also different in different applications.

  4. Streaming Computational Model • Sequential access to the input stream • Order of data elements in the stream is not controlled by the algorithm and may be adversarial. • Algorithms may perform pre- or post-processing without access to the data stream. [Figure labels: Working Space, Stream]

  5. Features of Streaming Algorithms • Small working space compared to the stream length n • Polylog n • n • Small number of passes over the stream • One pass • Constant number of passes • Fast per-data-element processing time

  6. Sliding-Window Model • A variation of streaming • Data stream is a time series and may be infinite. • Consider the n most recent data elements. • As time progresses, new data elements arrive, and old data elements expire. • The deletion of old data elements is implicit.
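To make the implicit deletion concrete, here is a minimal sketch of a window over the n most recent elements. The class name `SlidingWindow` is hypothetical, and the sketch stores the whole window, so it only illustrates the model's semantics; the algorithms in this talk are designed to use much less space than the window itself.

```python
from collections import deque


class SlidingWindow:
    """Hold the n most recent elements of a possibly infinite stream."""

    def __init__(self, n):
        # A deque with maxlen=n discards its oldest element on overflow,
        # so expiry is implicit: no delete operation appears in the stream.
        self.items = deque(maxlen=n)

    def push(self, x):
        self.items.append(x)

    def contents(self):
        return list(self.items)


w = SlidingWindow(3)
for x in [5, 1, 4, 2, 8]:
    w.push(x)
window = w.contents()  # the three most recent elements, oldest first
```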

  7. Why Streaming? • Data streams occur in real systems. • IP-traffic flow • Need to distinguish the working space from the data storage. • Storage devices: large capacity but slow access • Working space: small capacity but fast random access • We want to restrict random access to the mass storage but still see every element of the input set at least once.

  8. Earlier Work on Streaming • Despite the restrictions of the model, a lot can be done, e.g.: • Lp norms [FKSV02, Indyk00] • histograms [GKS01] • clustering [GMMO00] • Much of the work focuses on computing statistics. • Often the working-space size is restricted to polylog space.

  9. Talk Outline • Streaming computational model • Overview of results • Approximate graph distances in the streaming model • Future research directions

  10. Dissertation Contributions • Investigate important problem domains. • Computational geometry problems • Graph problems • Show the importance of a more relaxed model. • Sublinear space instead of polylog space • Multiple passes • There are problems that are provably hard in the restricted model but feasible in the more relaxed model.

  11. Results on Geometric Problems (1) [ Feigenbaum-S. Kannan-Zhang ] • Exact computation is hard using sublinear space. Computing the exact Diameter, Closest Pair, or Convex Hull requires Ω(n) bits of space, where n is the number of points in the stream. • Approximation is feasible. We give a one-pass, ε-approximation, streaming algorithm for diameter. The algorithm needs storage for O(1/ε) points and processes each point in O(log(1/ε)) time.
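As a rough illustration of why approximate diameter fits in small space, the sketch below uses a standard technique that is not necessarily the algorithm from the dissertation: project every point onto k fixed directions and remember only the smallest and largest projection per direction. The class name `DiameterSketch` and the parameter `k` are assumptions made for this example. For planar points the estimate lies between cos(π/(2k)) times the true diameter and the true diameter. Unlike the result quoted above, this sketch spends O(k) time per point.

```python
import math


class DiameterSketch:
    """One-pass diameter estimate from extreme projections onto k directions."""

    def __init__(self, k):
        # k unit vectors with angles evenly spaced over [0, pi).
        self.dirs = [(math.cos(math.pi * i / k), math.sin(math.pi * i / k))
                     for i in range(k)]
        self.lo = [math.inf] * k    # smallest projection seen per direction
        self.hi = [-math.inf] * k   # largest projection seen per direction

    def add(self, point):
        x, y = point
        for i, (dx, dy) in enumerate(self.dirs):
            proj = x * dx + y * dy
            self.lo[i] = min(self.lo[i], proj)
            self.hi[i] = max(self.hi[i], proj)

    def estimate(self):
        # Widest extent over all directions; never exceeds the true diameter.
        return max(h - l for h, l in zip(self.hi, self.lo))


points = [(0, 0), (3, 4), (1, 1), (-2, 0.5)]
k = 8
sketch = DiameterSketch(k)
for p in points:
    sketch.add(p)
estimate = sketch.estimate()
```

The working space is 2k numbers regardless of how many points arrive, which is the property the streaming model asks for.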

  12. Results on Geometric Problems (2) • We give an ε-approximation algorithm that maintains the diameter in the sliding-window model. • The algorithm uses O((1/ε) log³ n log R) bits of space, where R is the largest diameter attained in any window. The amortized processing time for each point is O(log n). • We show that Ω((1/ε) log n log R) space is required for such an approximation.

  13. Graph Stream • Consider an undirected graph G = (V, E), with V = {v1, v2, …, vn} and E = {e1, e2, …, em}. • A graph stream is a sequence of edges in E. • Edges arrive in arbitrary order in the stream. • More general than adjacency matrices or adjacency lists • [Figure: example graph on vertices 1 to 5 with edge stream (4,5) (2,3) (1,3) (3,5) (1,2) (2,4) (1,5) (3,4)]

  14. Results on Graph Problems (1) [ Feigenbaum-S. Kannan-McGregor-Suri-Zhang ] • Many problems require Ω(n) bits of space: graph distances (even approximation), connectivity testing, planarity testing, … • Consider streaming algorithms that use O(n·polylog n) space and O(1) passes. In such a model, we can compute or approximate: • Spanning trees • Graph distances
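A spanning forest is the easiest of these to see in one pass. The following is a minimal sketch using a union-find structure, which is a standard approach and is offered here as an illustration rather than as the construction from the paper. The function name `spanning_forest` is hypothetical, vertices are assumed to be numbered 1 to n, and the input is the example edge stream from the Graph Stream slide.

```python
def spanning_forest(n, edge_stream):
    """One pass over the edges, keeping O(n) words of state."""
    parent = list(range(n + 1))  # index 0 unused; vertices are 1..n

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v in edge_stream:
        ru, rv = find(u), find(v)
        if ru != rv:          # the edge joins two components: keep it
            parent[ru] = rv
            forest.append((u, v))
        # otherwise the edge would close a cycle and is discarded
    return forest


stream = [(4, 5), (2, 3), (1, 3), (3, 5), (1, 2), (2, 4), (1, 5), (3, 4)]
forest = spanning_forest(5, stream)
```

The state is one parent entry per vertex plus at most n - 1 kept edges, so it fits within the O(n·polylog n) budget no matter how many edges the stream contains.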

  15. Results on Graph Problems (2) [ Elkin-Zhang ] We give a randomized streaming algorithm that approximates graph distances: • (1+ε, β)-approximation: Our algorithm outputs estimates {δ(u,v)} s.t. δ(u,v) ≤ (1+ε)·dist_G(u,v) + β, where dist_G(u,v) is the true distance between vertices u and v. • The algorithm uses O(n^(1+1/k)) space. • Processing time per edge is O(n^(1/k)). • Needs multiple passes. • 1/k and ε are arbitrarily small parameters. β and the number of passes are functions of k and 1/ε.

  16. Results on Graph Problems (3) [ Feigenbaum-S. Kannan-McGregor-Suri-Zhang ] • We give a one-pass, streaming algorithm for approximating graph distances. • (2t+1)-approximation: the estimate δ(u,v) satisfies δ(u,v) ≤ (2t+1)·dist_G(u,v) • O(t·n^(1+1/t)·log n) space • Processing time per edge: O(t²·n^(1/t)·log n) • Needs one pass. • Lower bound: The space complexity of one-pass, t-approximation is Ω(n^(1+1/t)). • For t = log n, this gives a one-pass, O(log n)-approximation algorithm using n·polylog space and polylog time per edge.
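The t = log n case follows by direct substitution into the bounds on this slide. The short derivation below assumes base-2 logarithms; another base only changes the constant.

```latex
\begin{align*}
n^{1/t} &= n^{1/\log_2 n} = 2^{\log_2 n / \log_2 n} = 2, \\
\text{space: } O\!\left(t \cdot n^{1+1/t} \cdot \log n\right)
  &= O\!\left(\log n \cdot 2n \cdot \log n\right) = O\!\left(n \log^2 n\right), \\
\text{time per edge: } O\!\left(t^2 \cdot n^{1/t} \cdot \log n\right)
  &= O\!\left(\log^2 n \cdot 2 \cdot \log n\right) = O\!\left(\log^3 n\right), \\
\text{approximation factor: } 2t + 1 &= 2\log_2 n + 1 = O(\log n).
\end{align*}
```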

  17. Publications • J. Feigenbaum, S. Kannan, and J. Zhang, “Computing Diameter in the Streaming and Sliding-Window Models,” Algorithmica 41 (2005), pp. 25-41. • J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang, “On Graph Problems in a Semi-Streaming Model,” ICALP 2004, pp. 531-543. Journal version to appear in Theoretical Computer Science. • M. Elkin and J. Zhang, “Efficient Algorithms for Constructing (1+ε,β)-Spanners in the Distributed and Streaming Models,” PODC 2004, pp. 160-168. • J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang, “Graph Distances in the Streaming Model: The Value of Space,” SODA 2005, pp. 745-754.

  18. Other Results in Thesis • Streaming-space requirement can be reduced by annotating the stream. J. Feigenbaum, S. Kannan, and J. Zhang, “Annotation and Computational Geometry in the Streaming Model,” Yale University Technical Report YALEU/DCS/TR-1249, 2003. • Using streaming algorithms to detect BGP-update anomalies. J. Zhang, J. Rexford, and J. Feigenbaum, “Learning-Based Anomaly Detection in BGP Updates,” to appear in SIGCOMM Workshop on Mining Network Data 2005.

  19. Talk Outline • Streaming computational model • Overview of results • Approximate graph distances in the streaming model • Future research directions

  20. Shortest-Path Distances • Distance is the length of the shortest path. • Fundamental problem in graph theory • Many algorithms and approximations • Most of them use BFS-like subroutines, which are hard to adapt to the streaming model.

  21. The “Sketch” Approach • A two-stage approach • First stage: While going through the stream, construct a small sketch of the input graph. • Second stage: Compute the distance using the sketch, without further access to the stream. • Perform BFS-like computations in the second stage.

  22. Graph Spanners as Sketches • Edge subgraph H of a graph G, s.t., for any pair of vertices u and v, their distance in H, dist_H(u,v), is not far from their distance in G, dist_G(u,v). • Multiplicative spanner [t-Spanner]: dist_H(u,v) ≤ t·dist_G(u,v). • Spanners are sparse. A t-Spanner has O(n^(1+1/t)) edges. • Reduce streaming graph distance to streaming spanner construction. • BFS-like subroutines are used in most existing spanner constructions.

  23. Streaming Spanner Construction • For each incoming edge, decide whether it should be in the spanner. • If the edge causes a cycle of length ≤ t, do not put the edge in the spanner. • This gives a t-spanner, because there is a path P of length < t connecting the two endpoints of any discarded edge. • This spanner is sparse. Thm [Bollobás78]: A graph whose girth is larger than k can only have O(n^(1+2/(k-1))) edges. • Need to know: For an incoming edge, does the path P exist?
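The rule on this slide can be sketched directly if the path question is answered naively, with a depth-limited BFS over the spanner built so far. That per-edge search is only an assumption made to keep the example self-contained; it is the costly step the talk identifies as hard in the streaming model, and presumably what the cluster construction on the following slides is meant to avoid. The function names `within` and `stream_spanner` are hypothetical.

```python
from collections import deque


def within(adj, src, dst, limit):
    """True if dst is reachable from src using at most `limit` edges of adj."""
    if src == dst:
        return True
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        x, d = frontier.popleft()
        if d == limit:
            continue  # neighbors of x would be more than `limit` edges away
        for y in adj.get(x, ()):
            if y == dst:
                return True
            if y not in seen:
                seen.add(y)
                frontier.append((y, d + 1))
    return False


def stream_spanner(edge_stream, t):
    adj, kept = {}, []
    for u, v in edge_stream:
        # (u, v) would close a cycle of length <= t exactly when the spanner
        # already contains a u-v path with at most t - 1 edges.
        if within(adj, u, v, t - 1):
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        kept.append((u, v))
    return kept


stream = [(4, 5), (2, 3), (1, 3), (3, 5), (1, 2), (2, 4), (1, 5), (3, 4)]
kept = stream_spanner(stream, 3)
```

Every discarded edge has its endpoints joined by a path of fewer than t kept edges, which is the argument the slide gives for the stretch bound. In the second stage, distances are computed on `kept` alone, without the stream.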

  24. Partial Solution: Clusters (1) • A cluster is a subset of vertices and a small-diameter spanning tree built on these vertices. • Intra-cluster edge

  25. Partial Solution: Clusters (2) • Inter-cluster edges • Bollobás’s result no longer applies. • Need to control the number of clusters (i.e., make it [bound not preserved in transcript]).

  26. Summary of the One-Pass Algorithm • Use a vertex-labeling scheme to construct the clusters. • Structure of the algorithm: • In the pre-processing phase, generate a multi-level set of labels. • Go through the stream; for each edge: • According to the current assignment of labels to vertices, decide whether to put this edge in the spanner. • Depending on the type of edge, possibly assign more labels to one of its endpoints. • Next, an example with t = log n

  27. Labels • (log n)/2 levels • w.h.p., there are [count not preserved in transcript] top-level labels. • Semantics of labels: • The set of vertices assigned the same top-level label forms a cluster. • The set of vertices assigned the same lower-level label forms a “pre-cluster.” • [Figure: label hierarchy. Level 2: (2,2) (2,7). Level 1: (1,2) (1,4) (1,7) (1,11). Level 0: (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12).]

  28. Initial Label Assignment • [Figure: label hierarchy above the vertices. Level 2: (2,2) (2,7). Level 1: (1,2) (1,4) (1,7) (1,11). Level 0: (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12). Vertices: v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12.]

  29. On Arrival of an Edge • Already know what to do with: • Intra-cluster/pre-cluster edges • Inter-cluster edges • Edges connecting pre-clusters: the sticky edges • They are added to the spanner. • They may lead to new label assignment and cluster growth.

  30. “Good” Neighbor (1) • [Figure: edge between v and u; labels shown: (3,2) (3,2) (2,2) (1,2) (0,2) (2,2) (1,6) (0,6); annotation: “Has marked labels”]

  31. “Good” Neighbor (2) • [Figure: edge between v and u; clusters and pre-clusters shown: C(3,2), C(2,2), C(1,2), C(1,6)]

  32. “Bad” Neighbor • [Figure: edge between v and u; labels shown: (1,6) (3,2); annotation: “No marked labels”]

  33. Properties of the Clusters • Small diameter • Number of clusters bounded by [bound not preserved in transcript]. • Do not need to cover the whole graph with clusters, but the uncovered subgraph is sparse. The uncovered subgraph consists of sticky edges, and there are not too many of them.

  34. Sticky Edges are Rare • [Figure: vertex v with its sequence of neighbors u1, u2, u3, u4, …] • A neighbor is good with probability at least ½. • After seeing at most (log n)/2 good neighbors, v will be assigned a top-level label and be included in a cluster. No more sticky edges for v. • The number of sticky edges can be bounded by the length of the shortest prefix in the above sequence that contains (log n)/2 good neighbors.
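The prefix-length bound can be turned into an expectation with a short calculation. The derivation below is a sketch under an assumption the slide does not state explicitly: the neighbors are good independently, each with probability p ≥ 1/2. The symbols r, p, and T_v are introduced here for illustration only.

```latex
\begin{align*}
r &= \tfrac{1}{2}\log n
  && \text{(good neighbors needed before $v$ joins a cluster)} \\
T_v &= \text{length of the shortest prefix of } u_1, u_2, \dots
      \text{ containing $r$ good neighbors} \\
\mathbb{E}[T_v] &= \frac{r}{p} \le 2r = \log n
  && \text{(negative binomial with success probability } p \ge \tfrac{1}{2}) \\
\mathbb{E}\Big[\sum_{v \in V} T_v\Big] &\le n \log n
  && \text{(linearity of expectation)}
\end{align*}
```

Under that assumption the sticky edges contribute at most n log n edges in expectation, which is consistent with the claim that the uncovered subgraph is sparse.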

  35. Talk Outline • Streaming computational model • Overview of results • Approximate graph distances in the streaming model • Future research directions

  36. Summary • We investigated two important problem domains. • Exact computation is hard; approximation may be feasible. • For some problems, particularly graph problems, considering a more general model is important, because polylog space is too restrictive. • Constructing a sketch of non-numerical input is an important tool in streaming-algorithm design.

  37. Future Research Directions • Geometric problems: • High-dimensional geometric problems • Sliding-window with flexible size • Graph problems: • Dynamic graph problems
