
Computing on Jetstream: Streaming Analytics In the Wide-Area


  1. Computing on JetStream: Streaming Analytics in the Wide-Area Matvey Arye Joint work with: Ari Rabkin, Sid Sen, Mike Freedman and Vivek Pai

  2. The Rise of Global Distributed Systems [Figure: a CDN]

  3. Traditional Analytics: Centralized Database [Figure: CDN nodes feeding a centralized database]

  4. Bandwidth is Expensive [Chart: price trends 2005-2008] [Above the Clouds, Armbrust et al.]

  5. Bandwidth Trends [Chart: 20%, 20%] [TeleGeography's Global Bandwidth Research Service]

  6. Bandwidth Costs • Amazon EC2 bandwidth: $0.05 per GB • Wireless broadband: $2 per GB • Cell phone broadband (AT&T/Verizon): $6 per GB (other providers are similar) • Satellite bandwidth: $200-$460 per GB; may drop to ~$20

  7. This Approach is Not Scalable: Centralized Database [Figure: CDN nodes feeding a centralized database]

  8. The Coming Future: Dispersed Data [Figure: dispersed databases at many sites]

  9. Wide-Area Computer Systems • Military: global network, drones/UAVs, surveillance • Web Services: CDNs, ad services, IaaS, social media • Infrastructure: energy grid

  10. Need: Queries on a Global View • CDNs: popularity of websites globally; tracking security threats • Military: threat “chatter” correlation; big-picture view of battlefield • Energy Grid: wide-area view of energy production and expenditure

  11. Standing Computation [Diagram: source cubes feed processing operators whose outputs are unioned into a central cube and processed for the user; the processed data crosses a network bottleneck]

  12. Some queries are easy [Figure: a crashed server] “Alert me when servers crash”

  13. Others are hard [Figure: CDN nodes each receiving request streams] “How popular are all of my domains? URLs?”

  14. Before JetStream [Chart over two days: bandwidth needed for backhaul vs. a fixed 95th-percentile provisioning level] • Analyst’s remorse: not enough data, wasted bandwidth • Buyer’s remorse: system overload or overprovisioning

  15. What Happens During Overload? [Chart over one day: bandwidth needed for backhaul exceeds available bandwidth; latency rises] Queue size grows without bound!

  16. The JetStream Vision [Chart over two days: bandwidth needed for backhaul, available bandwidth, and bandwidth used by JetStream] JetStream lets programs adapt to shortages and backfill later. Need new abstractions for programmers.

  17. System Architecture [Diagram: the JetStream API and planner library turn a query graph into an optimized query, deployed by a coordinator to daemons on worker nodes; the control plane links coordinator and workers, the data plane links stream sources and compute resources at several sites]

  18. An Example Query [Diagram: at Sites A and B, a file-read operator parses log files into local storage, which a query reads every 10 s; results flow to central storage at Site C]

  19. Adaptive Degradation [Diagram: dataflow operators send local data across the network as summarized or approximated data] • Feedback control to decide when to degrade • User-defined policies for how to degrade data

  20. Monitoring Available Bandwidth [Diagram: data stream with a time marker inserted every k seconds] • Sources insert time markers into the data stream every k seconds • Network monitor records the time t it took to process the interval • => k/t estimates available capacity
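The k/t estimate can be sketched in a few lines (an illustrative sketch, not JetStream's actual monitor; the class and method names are made up):

```python
import time

class CongestionMonitor:
    """Estimate spare capacity from periodic time markers.

    Sources emit a marker after every k seconds' worth of data; the
    monitor measures the wall-clock time t between markers. A ratio
    k/t > 1 means the link is keeping up; k/t < 1 means congestion.
    """

    def __init__(self, marker_interval_s):
        self.k = marker_interval_s
        self.last_arrival = None
        self.capacity_ratio = None

    def on_marker(self, now=None):
        """Call when a time marker arrives; returns the latest k/t estimate."""
        now = time.monotonic() if now is None else now
        if self.last_arrival is not None:
            t = now - self.last_arrival  # time taken by one k-second interval
            self.capacity_ratio = self.k / t
        self.last_arrival = now
        return self.capacity_ratio
```

A ratio of 0.25 corresponds to the slides' "sending 4x too much" situation.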

  21. Ways to Degrade Data • Can coarsen a dimension • Can drop low-rank values
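Both strategies are easy to state concretely. A toy sketch over (time, url) -> count tuples (illustrative only, not the paper's implementation):

```python
from collections import Counter

def coarsen_time(counts, bucket_s):
    """Coarsen the time dimension: merge per-second counts into
    bucket_s-second buckets (fewer tuples, lower resolution)."""
    out = Counter()
    for (ts, url), n in counts.items():
        out[(ts - ts % bucket_s, url)] += n
    return dict(out)

def drop_low_rank(counts, k):
    """Keep only the k highest-count entries, dropping the long tail."""
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)
```

Coarsening preserves every key at lower resolution; dropping low-rank values preserves full resolution for the heavy hitters only.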

  22. An Interface for Degradation (I) [Diagram: incoming data passes through a coarsening operator, emerging as sampled data onto a network that is receiving 4x too much] First attempt: policy specified by choosing an operator. Operators read the congestion sensor and respond.

  23. Coarsening reduces data volumes

  24. But not always

  25. Depends on level of coarsening [Chart: data from CoralCDN logs]

  26. Getting the Most Data Quality for the Least BW • Issue: some degradation techniques give good quality but have unpredictable savings • Solution: use multiple techniques: start with the technique that gives the best quality; supplement with other techniques when BW is scarce • => Keeps latency bounded; minimizes analyst’s remorse

  27. Allowing Composite Policies [Diagram: incoming data passes through a sampling operator and a coarsening operator before a network that is receiving 4x too much] • Chaos if two operators are simultaneously responding to the same sensor • Operator placement is constrained in ways that don’t match the degradation policy

  28. Introducing a Controller [Diagram: as before, but a controller tells the operators “Drop 75% of data!”] • Introduce a controller for each network connection that determines which degradations to apply • Degradation policies are given to each controller • Policy is no longer constrained by operator topology

  29. Degradation

  30. Mergeability is Nontrivial [Diagram: windows aggregated every 5 s (01-05, 06-10, …) and every 6 s (01-06, 07-12, …) align only at every 30 s; every-10-s windows (01-10, 11-20, 21-30) merge cleanly from the every-5-s ones] • Can’t cleanly unify data at arbitrary degradation • Degradation operators need to have fixed levels
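The fixed-levels rule is visible in code: windows merge cleanly only when one bucket size divides the other (illustrative sketch):

```python
def merge_windows(counts, src_bucket, dst_bucket):
    """Roll per-window counts from src_bucket-second windows up to
    dst_bucket-second windows. Legal only when dst_bucket is a multiple
    of src_bucket; this is why degradation operators advertise fixed,
    nesting levels rather than arbitrary ones."""
    if dst_bucket % src_bucket != 0:
        raise ValueError("windows don't nest: %ds into %ds" % (src_bucket, dst_bucket))
    out = {}
    for start, n in counts.items():
        key = start - start % dst_bucket  # start of the enclosing window
        out[key] = out.get(key, 0) + n
    return out
```

Merging 5-second windows into 10-second windows works; merging them into 6-second windows has no clean answer, matching the "??????" on the slide.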

  31. Interfacing with the Controller [Diagram: operators report status (“sending 4x too much”, “shrinking data by 50%”) and advertise fixed levels (“possible levels: [0%, 50%, 75%, 95%, …]”); the controller replies “go to level 75%”]
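A controller along these lines might pick levels like this (a hypothetical sketch, not JetStream's actual protocol; levels here are fractions of data dropped, and the priority order encodes "best quality first"):

```python
class DegradationController:
    """One controller per network connection: given each operator's
    advertised fixed degradation levels and the current overload
    factor, raise operators, best-quality first, until the data fits."""

    def __init__(self, operator_levels):
        # operator_levels: {op_name: [fractions dropped]}, in priority order
        self.operator_levels = operator_levels

    def choose_levels(self, overload):
        """overload: data sent / link capacity (4.0 = 'sending 4x too much').
        Returns {op_name: level} whose combined retention fits the link."""
        target_keep = 1.0 / overload  # fraction of data the link can carry
        keep = 1.0
        choice = {}
        for op, levels in self.operator_levels.items():
            for level in sorted(levels):
                if keep * (1.0 - level) <= target_keep:
                    choice[op] = level
                    keep *= (1.0 - level)
                    break
            else:
                # even this operator's maximum level is not enough;
                # take it and move on to the next technique
                level = max(levels)
                choice[op] = level
                keep *= (1.0 - level)
        return choice
```

With one coarsening operator advertising [0%, 50%, 75%, 95%] and a 4x overload, the controller answers "go to level 75%", as on the slide; with milder operators it composes several techniques.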

  32. A Planner for Policy • Query planners: query + data distribution => execution plan. Why not do this for degradation policy? • What is the query? For us the policy affects data ingestion => it affects all subsequent queries • Planning: all potential queries + data distribution => policy

  33. Experimental Setup • 80 nodes on the VICCI testbed in the US and Germany [Map: Princeton] • Policy: drop data if insufficient BW

  34. Without Adaptation [Chart: bandwidth shaping]

  35. With Adaptation [Chart: bandwidth shaping]

  36. Composite policies

  37. Operating on Dispersed Data [Figure: dispersed databases at many sites]

  38. Cube Dimensions [Figure: cube with a Time dimension (01:01:00, 01:01:01) and a URL dimension (bar.com/m, bar.com/n, foo.com/q, foo.com/r)]

  39. Cube Aggregates [Figure: the cell (01:01:01, bar.com/m) holds aggregates: count of requests, max latency]

  40. Cube Rollup [Figure: the URL dimension rolled up to bar.com/* and foo.com/* across the same time cells]
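A toy cube over the slides' dimensions and aggregates makes insert and rollup concrete (illustrative only, not JetStream's implementation):

```python
from collections import defaultdict

class Cube:
    """Tiny data-cube sketch: dimensions (time, url), aggregates
    (request count, max latency) per cell."""

    def __init__(self):
        self.cells = defaultdict(lambda: (0, 0))

    def insert(self, ts, url, latency):
        """Fold one request into its cell's aggregates."""
        count, mx = self.cells[(ts, url)]
        self.cells[(ts, url)] = (count + 1, max(mx, latency))

    def rollup_url_to_domain(self):
        """Coarsen the URL dimension: bar.com/m -> bar.com/*."""
        out = defaultdict(lambda: (0, 0))
        for (ts, url), (count, mx) in self.cells.items():
            domain = url.split("/")[0] + "/*"
            c, m = out[(ts, domain)]
            out[(ts, domain)] = (c + count, max(m, mx))
        return dict(out)
```

Counts add and maxima take the max under rollup, which is what makes these aggregates safe to coarsen.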

  41. Full Hierarchy [Figure: rollup hierarchy of (count, max) cells over Time 01:01:00-01:01:01 and URL; leaf cells such as (8, 90), (29, 199), (5, 90), (3, 75), (8, 199), (21, 40) roll up to (37, 199) at URL: *, Time: 01:01:01]

  42. Rich Structure [Figure: lattice of partial rollups across Time (01:01:00 through 01:01:59) and URL (bar.com/m, bar.com/n, foo.com/q, foo.com/r), with aggregate cells such as (5, 90), (3, 75), (8, 199), (21, 40)]

  43. Two Kinds of Aggregation • Rollups: across dimensions • Inserts: across sources • The data cube model constrains the system to use the same aggregate function for both • Constraint: no queries on tuple arrival order. Makes reasoning easier!
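The constraint pays off: one merge function serves both kinds of aggregation, shown here on the slides' (count, max latency) cells:

```python
def merge(a, b):
    """Single aggregate function used both for inserts (merging the
    same cell's partials from different sources) and for rollups
    (merging cells across a dimension). Cells are (count, max_latency)."""
    return (a[0] + b[0], max(a[1], b[1]))

# Rolling up two per-URL cells (numbers from the hierarchy slide):
assert merge((8, 90), (29, 199)) == (37, 199)
# Merging one cell's partials from two sources uses the same code:
assert merge((5, 90), (3, 75)) == (8, 90)
```

Because merge is associative and commutative, the result is the same whether data is combined at the sources, in flight, or centrally, so arrival order never matters.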

  44. An Example Query [Diagram: at Sites A and B, a file-read operator parses log files into local storage, which a query reads every 10 s; results flow to central storage at Site C]

  45. Subscribers • Extract data from cubes to send downstream • Control latency vs. completeness trade-off [Diagram: at Site A, a file-read operator parses log files into local storage, queried every 10 s]

  46. Subscriber API Subscriber is an operator++: • Notified of every tuple inserted into cube • Can slice and rollup cube Possible policies: • Wait for all upstream nodes to contribute • Wait for a timer to go off
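The two example policies can be sketched as callbacks (a hypothetical API shape, not JetStream's actual interface; `emit` stands in for slicing the cube and sending downstream):

```python
class Subscriber:
    """Sketch of a cube subscriber: notified of every insert, it decides
    when to ship a window downstream, trading latency for completeness."""

    def __init__(self, expected_sources, emit):
        self.expected_sources = set(expected_sources)
        self.emit = emit          # callback: slice/rollup the cube and send
        self.seen = set()

    def on_insert(self, source_id):
        """Completeness policy: fire once every upstream node contributed."""
        self.seen.add(source_id)
        if self.seen >= self.expected_sources:
            self._fire("complete")

    def on_timer(self):
        """Latency policy: fire on a timer, shipping whatever has arrived."""
        self._fire("timeout")

    def _fire(self, reason):
        self.emit(reason)
        self.seen.clear()
```

Waiting for all sources maximizes completeness; the timer bounds latency when some site is slow or degraded.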

  47. Future Work • Reliability • Individual queries • Statistical methods • Multi-round protocols • Currently working on improving top-k • Fairness that gives best data quality Thanks for listening!
