html5-img
1 / 40

Towards Efficient Stream Processing in the Wide Area

Towards Efficient Stream Processing in the Wide Area. Matvey A rye Siddhartha Sen , Ariel Rabkin , Michael J. Freedman Princeton University. Our Problem Domain. Also Our Problem Domain. Use Cases. Network Monitoring Internet Service Monitoring Military Intelligence Smart Grid

quanda
Download Presentation

Towards Efficient Stream Processing in the Wide Area

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Towards Efficient Stream Processing in the Wide Area Matvey Arye Siddhartha Sen, Ariel Rabkin, Michael J. Freedman Princeton University

  2. Our Problem Domain

  3. Also Our Problem Domain

  4. Use Cases • Network Monitoring • Internet Service Monitoring • Military Intelligence • Smart Grid • Environmental Sensing • Internet of Things

  5. The World of Analytical Processing Real-Time Historical Streaming OLAP Databases

  6. The World of Analytical Processing Real-Time Historical Simpler queries Standing queries Real-time answers Borealis/Streambase System-S, Storm Streaming OLAP Databases Single Datacenter High ingest time Fast query time Oracle, SAP, IBM

  7. Data Transfer

  8. Trends in Cost/Performance2003-2008 [Above the Clouds, Armbrust et. al.]

  9. Aggregate At Local Datacenters

  10. The World of Analytical Processing Real-Time Historical Streaming OLAP Databases Single Datacenter JetStream Wide area JetStream = Real-time + Historical + Wide Area

  11. Large Caveat • Preliminary work • We want feedback and suggestions

  12. Challenges • Query placement and scheduling • Approximation of answers • Supporting User Defined Functions (UDFs) • Queries on historical data • Adaptation to network changes • Handling node failures

  13. Motivating Example • “Top-K domains served by a CDN” • Recall CDN is globally distributed • Services many domains • Main Challenge: Minimize backhaul of data

  14. How Is the Query Specified Union Count Sort Limit

  15. Problems Single aggregation point Union Count Sort Limit Runs on a single node

  16. Aggregate at local DC Less Data DC1 Count Partial DC3 Union Count Sort Limit DC2 Count Partial

  17. Count Partials (Google,1) (Google,1) Count Partial Union Count (Google,5) (Google,1) (Google,1) (Google,1)

  18. Non-Distributed Computation DC3 DC1 Union Count Sort Limit DC2

  19. Split Count DC3 Count A-H DC1 Union Sort Limit Count I-M DC2 Count N-Z

  20. Split Union DC3 Count A-H DC1 Load Bal. Sort Limit Count I-M Load Bal. DC2 Count N-Z

  21. Do Partial Sort DC3 Count A-H Sort Partial DC1 Load Bal. Sort Count I-M Sort Partial Limit Load Bal. DC2 Count N-Z Sort Partial

  22. Push Limit Back DC3 Count A-H Sort Partial Limit DC1 Load Bal. Count I-M Sort Partial Limit Sort Limit Load Bal. DC2 Count N-Z Sort Partial Limit

  23. Distributed Version DC3 Single Host Count A-H Sort Partial Limit DC1 Load Bal. Count I-M Sort Partial Limit Sort Limit Load Bal. DC2 Count N-Z Sort Partial Limit

  24. What Is New • Previous streaming systems • User guided transformations (System-S, Storm) • Simple transforms (Aurora) • JetStream • More complex transforms • Transformation is network aware • Annotations for user defined functions

  25. Joint Problems • Transformations • Choosing which ones • Placement • Network constrained • Heterogeneous nodes • Resource availability • Decision has to be made at run-time

  26. Tackling the Joint Problems • Using heuristics • Split into increasingly more local decisions • Global decisions are coarse grained • Example: Assign operators to DCs • Localized decisions • Operate only on local part of subgraph • Have more current view of available resources • Do not affect other parts of of query graph placement

  27. Bottlenecks Still Possible Possible Bottleneck DC1 Count Partial DC3 Union Count Sort Limit DC2 Count Partial Use Approximations when necessary

  28. Adjusting Amount of ApproximationAs a reaction to network dynamism DC1 Count Partial DC3 Union Count Sort Limit DC2 Count Partial If bottleneck goes away, return to exact answers

  29. Approximation Challenges • How to quantify error for approximations? • Uniform across approximation methods • Easy to understand • Integrates well with metrics for source/node failures • How do we allow UDF approximation algorithms • Which exact operators can they replace • Quantifying the tradeoffs • Placement & Scheduling

  30. Approximation Composition DC1 Count Partial DC3 Union Count Sort Limit DC2 Count Partial Error=? Error=e If we approximate count, how does that error affect sort & final answer?

  31. Approximations in Uneven Networks DC1 High Bandwidth Link Count Partial DC3 Union Count Sort Limit DC2 Count Partial Low Bandwidth Link; Needs Approximation Do we need to approximate link DC1-DC3 if we approximate link DC2-DC3?

  32. Discovering data trends? • How has top-k changed over past hour? • Current streaming systems don’t answer this • Except by using centralized DBs. • JetStream proposes using storage at the edges

  33. Hypercube Data Structure 1 60 Minute …

  34. Hypercube Data Structure 01 1 1 24 12 31 … … … All Month Day Hour 1 60 Minute …

  35. Hypercube Data Structure 01 1 1 1 12 60 24 31 … … … … All Month Day Aggregate Hour Minute

  36. Query: “Last Hour and a half”(without materializing intermediate nodes) 1 01 31 12 … … All Month Day 2 1 … Hour 60 1 1 30 Minute … … …

  37. Query: “Last Hour and a half”by materializing intermediate nodes 1 01 31 12 … … All Month Day 2 1 … Hour 60 1 1 30 Minute … … …

  38. Historical Queries • Hypercubes have been used before • In the database literature • What’s Novel • Storage at the edges (and in the network) • Time hierarchy

  39. Challenges we talked about • Query placement and scheduling • Approximation of answers • Supporting User Defined Functions (UDFs) • Queries on historical data • Adaptation to network changes • Handling node failures

  40. Conclusion JetStream Explores… + Stream Processing + Historical Data / Trend Analysis + Wide Area Thanks! arye@cs.princeton.edu

More Related