Making Every Bit Count in Wide Area Analytics

Making Every Bit Count in Wide Area Analytics. Ariel Rabkin. Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai. Global Systems Have Global Data. The Rise of Big Distributed Data. CDNs: Akamai has ~20 million requests per second


Presentation Transcript


  1. Making Every Bit Count in Wide Area Analytics • Ariel Rabkin • Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai

  2. Global Systems Have Global Data

  3. The Rise of Big Distributed Data • CDNs: • Akamai has ~20 million requests per second • CloudFlare has about 300 MB/s of logs, volume doubles every 4 months • Sensor data (e.g., power grid, highways) • Smart camera networks

  4. Trends [Figure: data volumes and wide-area bandwidth, plotted as amount per dollar over time]

  5. Analyzing Low-rate Events is Easy • Query: "Alert me when server crashes!" • Event: "Server Crashed!"

  6. High-rate Events can be Costly • Query: "Every minute, compute request counts by URL" [Figure: multiple streams of requests]

  7. Backhaul has Bad Dynamics • Example: backhaul a count of events every 5 minutes • The choice of summaries is made upfront, statically • Buyer's remorse: chose to collect unnecessary and expensive data • Analyst's remorse: summaries are insufficient for analysis, and there is no way to retroactively get more data

  8. Local Storage! • Query: "Every minute, compute request counts by URL" [Figure: multiple streams of requests, each feeding Local Aggregation and Storage]

  9. Challenge: Bandwidth Scarcity • Analyst: "I want the request count for every URL every second." • System: "I can't do that, Ari. That costs 100 MB/sec. You only have 12 MB/sec. Want to impose a rank cutoff, value cutoff, or change frequency?" • Analyst: "Can I get the top 1000 URLs every second?" • System: "I can do that for 900 KB/sec." • Analyst: "Great, do it!"
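
The rank cutoff in the dialogue on slide 9 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `top_k_within_budget`, the fixed `bytes_per_entry` cost model, and the sample counts are all assumptions made for the example.

```python
from collections import Counter

def top_k_within_budget(counts, budget_bytes, bytes_per_entry):
    """Keep the highest-count URLs that fit in a per-interval byte budget.

    Assumption: every (URL, count) entry costs the same number of bytes.
    """
    k = budget_bytes // bytes_per_entry  # how many entries the link can carry
    return Counter(counts).most_common(k)

# Hypothetical per-second request counts by URL.
counts = {"/a": 50, "/b": 30, "/c": 5, "/d": 1}

# With room for only two entries, the two most requested URLs survive.
summary = top_k_within_budget(counts, budget_bytes=200, bytes_per_entry=100)
```

A value cutoff (drop entries below some count) or a lower reporting frequency would be alternative ways to meet the same budget.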

  10. Challenge: Varying Scarcity • Policy: "First aggregate over longer time periods, up to 30 seconds. Then only keep the top URLs." [Figure: bandwidth over time, showing needed, available, and "can do" levels]
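
The two-stage policy on slide 10 (lengthen the aggregation window first, then cut down to the top URLs) might look like the sketch below. The function `choose_degradation` is hypothetical, and it assumes, as a simplification, that sending one aggregate every `w` seconds costs `needed_bps / w`.

```python
import math

def choose_degradation(needed_bps, available_bps, max_window_s=30):
    """Return (window_seconds, keep_fraction) for the current bandwidth.

    Assumption: the cost of reporting scales inversely with the window length.
    """
    if needed_bps <= available_bps:
        return 1, 1.0  # enough bandwidth: report every second, keep everything
    # Stage 1: widen the aggregation window, up to the 30-second cap.
    window = min(max_window_s, math.ceil(needed_bps / available_bps))
    rate_after_window = needed_bps / window
    # Stage 2: if still over budget, keep only a fraction of the (top) URLs.
    keep = min(1.0, available_bps / rate_after_window)
    return window, keep

mild = choose_degradation(100, 12)     # the window alone is enough
severe = choose_degradation(1000, 10)  # window capped at 30 s, then a cutoff
```

As available bandwidth changes over time, re-running the policy yields a different operating point without changing the query itself.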

  11. Data Processing Requirements • Aggregatable: StoredData += Update • Merge-able: Data + Data = Merged Representation • Reducible: Data to smaller Data
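
The three requirements on slide 11 can be illustrated with plain request counters. This sketch uses Python's `collections.Counter` as a stand-in for the stored data; the URLs and counts are made up.

```python
from collections import Counter

# Aggregatable: each update folds into the stored data in place.
stored = Counter()
for url in ["/a", "/b", "/a"]:
    stored[url] += 1

# Merge-able: two independently built summaries combine into one.
other = Counter({"/a": 1, "/c": 4})
merged = stored + other

# Reducible: the data can be shrunk to a smaller approximation,
# here by keeping only the two largest counts.
reduced = Counter(dict(merged.most_common(2)))
```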

  12. Raw byte strings (e.g. MapReduce) • Database tables

  13. The Data Cube Model • Cube: a multidimensional array, with one or more aggregates, indexed by a set of dimensions • Aggregation function used for: updates, roll-ups, merging cubes, degrading cubes • Roll-up of mysite.com by time from 12:00 to 12:01: 8 • Roll-up of sites at time 12:00: 16
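
A small sketch of the cube model on slide 13, with one aggregation function shared by updates, roll-ups, and merges. The `Cube` class is hypothetical, not JetStream's API, and the cell values are invented so that the roll-ups reproduce the totals quoted on the slide (8 and 16); the slide's actual table is not in the transcript.

```python
class Cube:
    """Cells indexed by a tuple of dimension values, holding one aggregate."""

    def __init__(self, dims, agg=lambda a, b: a + b):
        self.dims = tuple(dims)
        self.agg = agg
        self.cells = {}

    def update(self, key, value):
        # Fold the new value into the existing cell with the aggregation function.
        self.cells[key] = self.agg(self.cells[key], value) if key in self.cells else value

    def rollup(self, keep):
        """Aggregate away every dimension not named in `keep`."""
        idx = [self.dims.index(d) for d in keep]
        out = Cube(keep, self.agg)
        for key, value in self.cells.items():
            out.update(tuple(key[i] for i in idx), value)
        return out

    def merge(self, other):
        """Combine two cubes with the same dimensions into a new cube."""
        out = Cube(self.dims, self.agg)
        for cube in (self, other):
            for key, value in cube.cells.items():
                out.update(key, value)
        return out

cube = Cube(("site", "time"))
cube.update(("mysite.com", "12:00"), 3)    # invented cell values
cube.update(("mysite.com", "12:01"), 5)
cube.update(("yoursite.com", "12:00"), 13)

by_site = cube.rollup(("site",))   # mysite.com over 12:00 to 12:01
by_time = cube.rollup(("time",))   # all sites at each time
```

Degrading a cube could reuse the same machinery, for example by rolling up to a coarser time dimension before sending.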

  14. Raw byte strings (e.g. MapReduce) • Database tables • Data Cube

  15. A Vision for Wide-Area Analytics • Dataflow adapted to bandwidth [Figure: local cubes and dataflow operators feed a merged cube across a network bottleneck]

  16. Adaptivity [Figure: a local cube and dataflow operators, with a network bottleneck]

  17. Adaptivity • Key ingredients: • Cube summarization as mechanism • User-defined policies • Feedback control [Figure: a local cube is reduced to a summarized cube before the network bottleneck, with feedback control around the dataflow operators]
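
One way to picture the feedback-control ingredient on slide 17 is a loop that raises a degradation level when the link is overloaded and lowers it when there is headroom. Everything here is an assumption for illustration: the function `adjust_level`, the 80% headroom threshold, and the model in which each level halves the data sent.

```python
def adjust_level(level, sent_bytes, capacity_bytes, max_level=10):
    """Step the degradation level up when over capacity, down when there is headroom."""
    if sent_bytes > capacity_bytes:
        return min(max_level, level + 1)   # congested: summarize more
    if sent_bytes < 0.8 * capacity_bytes:
        return max(0, level - 1)           # headroom: send more detail
    return level                           # close to capacity: hold steady

# Toy simulation: undegraded demand is 100 units, the link carries 30,
# and each degradation level is assumed to halve what is sent.
demand, capacity = 100, 30
level = 0
for _ in range(5):
    sent = demand / (2 ** level)
    level = adjust_level(level, sent, capacity)
```

In this toy run the level settles once the sent volume falls between 80% and 100% of capacity; a user-defined policy would decide what each level actually does to the cube.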

  18. Backup Slides

  19. Conclusions • The hard problems in wide-area analysis: • Reasoning about bandwidth/data quality tradeoffs • Optimizing data quality under changing conditions • Jointly optimizing bandwidth and other resources • We are building a system. • We call it JetStream. Stay tuned.

  20. Bandwidth Costs do not Decline Smoothly [TeleGeography's Global Bandwidth Research Service]

  21. 2012 Bandwidth Price Shifts Frankfurt-London 20% 20% [TeleGeography's Global Bandwidth Research Service]

  22. Diurnal Load Makes Overprovisioning Expensive • Leased lines waste capacity during off-peak • Public internet gets congested during peak

  23. Benefit: Iteration • Can iteratively pose different queries [Figure: multiple streams of requests, each feeding Local Aggregation and Storage, which receives a revised query]

  24. Benefit: Adaptation • Can adapt the data volume collected to the available bandwidth [Figure: multiple streams of requests, each feeding Local Aggregation and Storage, with limited bandwidth]

  25. Benefit: Adaptation • Can adapt the data volume collected to the available bandwidth [Figure: multiple streams of requests, each feeding Local Aggregation and Storage, with ample bandwidth]

  26. A dataflow model for wide-area analytics • Operator: defines data transformation on tuples; can do input or output • Cube: structured storage of data

  27. Generated data ingested into local cubes [Figure: sources feed local cubes; processing produces processed data that crosses the network bottleneck]

  28. Processing Processed Data
