
CS 410/510 Data Streams Lecture 12: Stream Warehousing with DataDepot








1. CS 410/510 Data Streams, Lecture 12: Stream Warehousing with DataDepot
Kristin Tufte, David Maier
Thanks to: Theodore Johnson; Lukasz Golab; Spence Seidel; Vladislav Shkapenyuk
AT&T Labs - Research

2. DSMS vs. Stream Warehouse
• DSMS
  • Temporal data (append only)
  • Best-effort consistency
  • Real-time response
• Stream Warehouse
  • 10s to 100s of TB of historical data
  • Time windows measured in years or decades
  • Real-time queries on recent data
  • Deep analyses on historical data

3. Stream Warehousing
• Real-time warehouse sourced from data streams
• All data has a timestamp (perhaps more than one)
• Real-time updates, multi-year window
• Deeply nested materialized views
• Propagate incremental updates
• A DSMS in slow motion
[Diagram: raw tables s1, b1, ..., sn, bn feeding a DAG of derived tables d1, ..., di, dj, dk]

4. DataDepot – Key Features
• ETL: raw tables are non-materialized views
• Timestamp-based horizontal partitioning
• Multiple timestamps -> timestamp correlation
• Partition dependencies for updates
• Real-time scheduling (table priorities)
• Consistency issues (leading/trailing edge)
• Warehouse dashboard & data quality tools

5. DataDepot Architecture
[Architecture diagram]

6. Partitioning
• Temporal partitioning for temporal tables
• Roll in at the leading edge, roll off at the trailing edge
• Set the partition size to be the update increment
• Avoid expensive incremental updates and index rebuilding
• Correlate non-partitioning temporal fields with the partitioning field
[Diagram: a window of partitions; an update (StartTime=55, EndTime=55) rolls in at the leading edge while the oldest partition rolls out]
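
To make the roll-in/roll-off mechanics concrete, here is a minimal Python sketch; the names, partition size, and window length are illustrative, not DataDepot's actual interface. Expiration drops whole partitions at the trailing edge rather than deleting individual rows.

```python
# Minimal temporal partitioning sketch (illustrative names and sizes).
from collections import OrderedDict

PARTITION_SIZE = 5          # partition size = update increment (e.g., minutes)
WINDOW_PARTITIONS = 12      # e.g., a 60-minute window

partitions = OrderedDict()  # partition_start -> list of records

def partition_of(ts):
    """Map a timestamp to the start of its partition."""
    return ts - ts % PARTITION_SIZE

def roll_in(records):
    """Append new records at the leading edge (assumes records arrive
    roughly in timestamp order); roll whole partitions off the trailing
    edge instead of deleting individual rows or rebuilding indices."""
    for rec in records:
        partitions.setdefault(partition_of(rec["ts"]), []).append(rec)
    while len(partitions) > WINDOW_PARTITIONS:
        partitions.popitem(last=False)   # drop the oldest partition whole
```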

7. Raw Tables
• Non-materialized views of source data files
  • Except for indirect/loaded-to tables
• File finding: where are the files?
• File classification: filename metadata
• Data extraction: e.g., gzcat
• Schema (create table)

8. Direct Raw Tables
• Non-materialized view
• Temporal partitioning attribute
  • Timestamps in the file name
• Partition directory
  • Maps partitions to data files
  • New data/files -> update to the partition directory
• Records are extracted only when a derived table is updated
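
A sketch of the partition-directory idea, assuming (for illustration only) a filename convention with an embedded epoch timestamp. Registering a file only updates the directory; record extraction is deferred until a derived table update reads the partition.

```python
# Hypothetical partition directory for a direct raw table:
# partition_start -> set of source files covering that partition,
# classified purely from filename metadata.
import re

partition_directory = {}   # partition_start -> set of file names

def register_file(filename, partition_size=300):
    """Classify a file by the epoch timestamp in its name, e.g.
    'feed.1262304000.gz' (naming convention assumed for illustration)."""
    m = re.search(r"\.(\d{10})\.", filename)
    if m:
        ts = int(m.group(1))
        part = ts - ts % partition_size
        partition_directory.setdefault(part, set()).add(filename)
```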

9. Indirect Raw Tables
• Timestamps are in the data in the file
• Loaded-to table
  • Extract data only once
  • Efficient append-only update
• Partition table structure similar to direct raw tables

10. Derived Tables
• Materialized views (EMF-SQL query)
• Defined by: query; indices; partitioning function; priority
• Derived table partitions are recomputed (not incrementally updated)
  • Partitions should rarely be recomputed
• Precisely identify the partitions to be recomputed
  • Specify the mapping between source and destination partitions (SLB, SUB)
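
Reading SLB/SUB as source lower/upper bound functions that map a destination partition to the range of source partitions it can depend on, here is an illustrative Python sketch for a 1-hour aggregate over 5-minute source partitions; the particular bound functions are assumptions.

```python
# Partition dependency via source lower/upper bounds (illustrative):
# for a destination partition d, only source partitions in
# [slb(d), sub(d)] can affect it.
def slb(dest_part):                 # earliest source partition feeding dest
    return dest_part                # hour boundary = first 5-minute slot

def sub(dest_part):                 # latest source partition feeding dest
    return dest_part + 3600 - 300   # last 5-minute slot inside the hour

def affected_dest_partitions(changed_src, dest_parts):
    """Precisely identify which destination partitions must be
    recomputed when a single source partition changes."""
    return [d for d in dest_parts if slb(d) <= changed_src <= sub(d)]
```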

11. Source Dependencies
[Diagram]

12. Update Propagation
• Basic algorithm: recompute affected partitions
  • In general, most of a changed partition is affected
  • Handles non-SQL views (outputs of analyses)
• Determine the source partitions of a derived partition; recompute if a source changes
• Eventual consistency
• Partition on the timestamp (temporal partitioning attribute)
• Partitions are marked with a last-update timestamp
[Diagram: source partitions of S feeding a derived partition of D via an update]
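
A minimal sketch of recompute-based propagation using SLB/SUB bounds as above; the callback and bookkeeping names are illustrative, not DataDepot's API.

```python
# Recompute every destination partition whose source range
# [slb(d), sub(d)] contains a changed source partition, and stamp it
# with a last-update time (the basis for eventual consistency).
import time

def propagate(changed_src_parts, dest_parts, slb, sub, recompute, last_update):
    dirty = {d for d in dest_parts
             for s in changed_src_parts if slb(d) <= s <= sub(d)}
    for d in sorted(dirty):
        recompute(d)                     # full recompute of the partition
        last_update[d] = time.time()     # mark when it was last refreshed
    return dirty
```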

13. Issues & Features
• Data loading architecture (files & queries)
  • Data generated externally
  • No user updates or inserts
• Provenance is documented (by the queries)
• Declarative specification enables optimization
• Late data
  • Most data arrive in order (direct raw tables)
  • Partitions computed only once
  • Late data motivates indirect raw tables

14. Real-time Warehousing
• Provide real-time data as well as a historical perspective
• Important warehouse features:
  • Multi-Version Concurrency Control (MVCC)
  • Multiple-granularity partitioning
  • Real-time-aware resource control
• Golab et al., Scheduling Updates in a Real-Time Stream Warehouse, ICDE 2009

15. MVCC
• Continuous updates: don't block queries (or other updates)
• Simple implementation
  • Single updater, multiple readers
  • Swap in a new partition directory; old versions are garbage-collected later
[Diagram: a real-time table of 1-day summaries; readers see the old partition directory until the new one is swapped in]
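
A sketch of the single-updater/multiple-readers scheme the slide describes: readers capture a reference to the current partition directory and are never blocked, while the updater builds a copy and swaps it in atomically. Class and method names are assumptions.

```python
# Copy-on-write partition directory (illustrative). In CPython,
# rebinding an attribute is atomic, so readers always see either the
# old or the new directory, never a partial update.
import threading

class VersionedTable:
    def __init__(self):
        self._directory = {}            # partition -> data version
        self._lock = threading.Lock()   # serializes the single updater

    def snapshot(self):
        """Readers get a consistent directory without blocking."""
        return self._directory

    def publish(self, partition, new_data):
        with self._lock:
            new_dir = dict(self._directory)   # copy-on-write
            new_dir[partition] = new_data
            self._directory = new_dir         # atomic reference swap
# Superseded directories are reclaimed by ordinary garbage collection
# once the last reader drops its snapshot reference.
```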

16. Temporal Correlation
• Typical to have multiple timestamps
  • A record represents an interval -> start/end timestamps
  • Timestamps assigned by several entities
• Multiple timestamps are often highly correlated
• Temporal correlations can be specified in derived tables
• Query rewrites: convert a range predicate on a correlated temporal attribute into a range predicate on the partitioning attribute
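
A tiny sketch of that rewrite, assuming (for illustration) a declared correlation stating that end_time lies within MAX_DURATION of the partitioning attribute start_time; a range on end_time then implies a slightly wider range on start_time, limiting the partitions scanned.

```python
# Query rewrite from a correlated temporal attribute to the
# partitioning attribute (names and bound are assumptions).
MAX_DURATION = 3600   # declared: 0 <= end_time - start_time <= 1 hour

def rewrite_end_time_range(lo, hi):
    """Convert 'end_time BETWEEN lo AND hi' into a partition-pruning
    bound: 'start_time BETWEEN lo - MAX_DURATION AND hi'."""
    return (lo - MAX_DURATION, hi)
```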

17. Multi-granularity Partitioning
• Small partitions for new data, large partitions for old data (well known in the folklore)
• Two or more levels of granularity
  • Maintained as separate tables, with separate storage and indices
  • Presented as a coherent combined view
• Query execution is granularity-cognizant
[Diagram: a combined table presenting a fine-grained real-time table together with its coarse-grained rollup]
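
A minimal sketch of a granularity-cognizant scan over the combined view: recent data comes from the fine-grained table, older data from the coarse rollup, with an assumed split point between them.

```python
# Combined view over two granularities (illustrative). fine/coarse
# map partition_start -> rows; the query range is [lo, hi); 'split'
# is the boundary between rollup history and real-time data.
def scan_combined(fine, coarse, lo, hi, split):
    for part, rows in coarse.items():
        if lo <= part < min(hi, split):     # old data: coarse partitions
            yield from rows
    for part, rows in fine.items():
        if max(lo, split) <= part < hi:     # recent data: fine partitions
            yield from rows
```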

18. Real-time Resource Control
• How should we schedule derived-table updates?
• Lazy updates (recompute at query time)? Rejected:
  • May require huge computations across multiple tables
  • Not compatible with a real-time warehouse
• Eager (immediate) updates? Rejected:
  • Dozens of data feeds, hundreds of tables
  • Must avoid over-subscription of server resources (memory, disk-arm thrashing, context-switch thrashing)
  • Must ensure that resources remain available for queries
• Complicating factors: parallel scheduling, overload, priorities, table hierarchies, non-preemptive scheduling

19. Table Staleness
[Graph: staleness Si(t) versus time t. Staleness grows while the table waits for a load and drops when a load completes. Annotated events: an update arrives (data up to t=3) and is loaded; an update arrives (data between t=3 and t=7); an update arrives (data between t=7 and t=9) and is loaded]
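
A sketch of the staleness measure the graph plots: Si(t) is the gap between the current time and the latest data time loaded into table i. The load timings below are assumptions chosen to reproduce the shape of the slide's figure.

```python
# Staleness climbs between loads and drops when a load completes.
def staleness(t, loads):
    """loads: list of (load_completion_time, data_loaded_up_to)."""
    loaded_up_to = 0
    for done_at, up_to in loads:
        if done_at <= t:
            loaded_up_to = max(loaded_up_to, up_to)
    return t - loaded_up_to

# Assumed timings: data up to t=3 loaded at t=4, data up to t=9
# loaded at t=10.
loads = [(4, 3), (10, 9)]
print([staleness(t, loads) for t in range(1, 12)])
# -> [1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2]
```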

20. Real-time Aware Update Manager
• Abstract processing resources as tracks (CPU, memory, disk bandwidth, etc.)
• Partition m jobs among n tracks
  • Don't let a long-running task block short-period tasks
• Schedule each track set independently
  • Prioritized Earliest Deadline First
• Update chopping
  • Break up updates of long-deferred tables (overload, feed problems, etc.)
[Diagram: jobs J1-J10 partitioned across Tracks 1-3, separating short, frequent jobs from long, infrequent jobs]
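
A minimal sketch of the track idea; the job fields and the track-assignment rule are assumptions for illustration. Long, infrequent jobs get their own track so they cannot block short-period updates, and each track runs prioritized earliest-deadline-first, non-preemptively.

```python
# Track-based update scheduling (illustrative).
import heapq

def assign_track(job, long_threshold=60.0):
    """Segregate long, infrequent jobs from short, frequent ones."""
    return "long" if job["expected_runtime"] > long_threshold else "short"

def run_track(jobs):
    """Prioritized EDF: higher priority first, earlier deadline next."""
    heap = [(-j["priority"], j["deadline"], j["table"]) for j in jobs]
    heapq.heapify(heap)
    while heap:
        _, deadline, table = heapq.heappop(heap)
        print(f"updating {table} (deadline {deadline})")  # runs to completion
```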

21. Consistency
• DataDepot provides eventual consistency
  • All data are derived from the raw sources
  • If all updates get propagated, the tables will converge
• What happens before "eventually"?
  • Leading-edge consistency: use all the data that is available
  • Trailing-edge consistency: use only stable data
  • But when is the data stable?
[Diagram: sources S1 and S2 feeding a derived table D, with the leading and trailing edges marked]

22. Streaming Data Quality
• Dozens of data feeds collected world-wide
• How do we know which data are "correct" and which are "dirty"?
  • Integrity constraints define correct data
  • Constraint violations = dirty data
• Exploratory data mining via integrity constraints
  • "Try out" a constraint on the data and see which records satisfy or fail it
• Data Auditor: a system for analyzing the quality of a DataDepot streaming warehouse

23. Data Quality Analysis
• Want to find patterns in the tuples that satisfy or fail a constraint (rather than listing all violations)
• Example:
  • The table ROUTER_CPU_COUNTS stores the number of CPU-utilization polls (num_polls) returned by each router in each 5-minute time interval
  • Expected behaviour: num_polls = 1 for each row of ROUTER_CPU_COUNTS
[Figure: a timeline of 5-minute intervals, 00:00 to 00:20]

24. Data Quality Analysis
• Constraint: for all t in ROUTER_CPU_COUNTS, num_polls > 0
• Summary of violating subsets with a 25%-or-under polling rate ('-' denotes a wildcard pattern)
[Tables: the ROUTER_CPU_COUNTS sample and the violating-subset summary are not recoverable from the transcript]
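
A sketch in the spirit of Data Auditor, not its actual interface: "try out" the constraint num_polls > 0 and summarize the violations as wildcard patterns over the dimension attributes, rather than listing every violating row. Since one poll is expected per row, a polling rate of 25% or under corresponds to a violation rate of 75% or more.

```python
# Wildcard-pattern summary of constraint violations (illustrative
# dimension names; '-' matches any value).
from itertools import product

def violation_rate(rows, dims, pattern):
    """Fraction of rows matching the wildcard pattern that violate
    num_polls > 0; None if the pattern matches no rows."""
    matching = [r for r in rows
                if all(p == "-" or r[d] == p for d, p in zip(dims, pattern))]
    if not matching:
        return None
    return sum(r["num_polls"] == 0 for r in matching) / len(matching)

def summarize(rows, dims=("router", "interval"), min_rate=0.75):
    """Yield patterns, e.g. ('router7', '-'), whose violation rate is
    at least min_rate."""
    values = {d: sorted({"-"} | {r[d] for r in rows}) for d in dims}
    for pattern in product(*(values[d] for d in dims)):
        rate = violation_rate(rows, dims, pattern)
        if rate is not None and rate >= min_rate:
            yield pattern, rate
```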

25. Other Types of Constraints
• Sequential dependencies
  • Golab et al., Sequential Dependencies, VLDB 2009
  • Example: "time gaps" between successive CPU-utilization polls should be between G1 and G2
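
A small sketch of checking the slide's sequential dependency; the concrete G1/G2 values are assumptions (5-minute polling with one minute of slack).

```python
# Sequential-dependency check on inter-poll gaps (illustrative bounds).
G1, G2 = 240, 360   # allowed gap between successive polls, in seconds

def gap_violations(poll_times):
    """Return (prev, curr) pairs whose gap falls outside [G1, G2]."""
    ts = sorted(poll_times)
    return [(a, b) for a, b in zip(ts, ts[1:]) if not (G1 <= b - a <= G2)]
```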

26. DataDepot Applications
• Darkstar
  • Collection of feeds related to the core network
  • Scripts over feeds → queries over tables
  • Time to get answers: weeks → minutes
  • Internet measurement experiments become repeatable
• Other applications: production
  • Performance management, operations support
  • Warehouse backend for GS Tool installations
  • 5+ projects
  • Highest volume: 2 GB/minute and growing


30. Applications of Darkstar
• The point of the warehouse is to support critical analysis and operations tasks
• PathMiner
  • Analyzes performance problems on the path between two endpoints
• NICE (Network-wide Information, Correlation & Exploration)
  • Mines for root causes of network problems

31. PathMiner: Customer Troubleshooting Tool
Kobus Van der Merwe, Seungjoon Lee, Carsten Lund
• Challenge: reduce the Tier 3 workload investigating customer-impacting events
• PathMiner automatically identifies network events between customer locations (temporal and spatial correlation)
• Rapid analysis using a wide range of data sources
• Accurate (compared with human "guesstimates")
[Diagram: the AT&T network between customer locations A and B; PathMiner collects interesting events along the path between the customer locations]

32. PathMiner: SYSLOG Events Along the Path
[Figure: SYSLOG events along the path, e.g., interface down, bundle down]

33. Network-wide Information Correlation & Exploration (NICE)
Mahimkar et al., Troubleshooting Chronic Conditions in Large IP Networks, CoNEXT '08
• Statistical correlation analysis across multiple data sets
• A recurring condition manifests in many measurements
• "Blind" data mining: the analyst drowns in data
• NICE starts with a symptom and identifies correlated events
[Diagram: a recurring symptom plus other network events feed NICE's spatial proximity model, unified data model, and statistical correlation, producing statistically correlated events]

34. Example: dlstx34c3 Uplink Overruns
• Four interfaces with very high correlations with uplink overruns: all the same customer
[Figure: input time series and event counts feed event-series joins, producing correlation scores and correlated time series]

35. Conclusions
• Stream warehousing
  • Inherently temporal data
  • Gathered from world-wide sources
  • Critical applications
• Temporal management via partitioning
• Consistency management
• Real-time updates in a multi-year warehouse
• Real-time and historical analysis on the same data sets
• Stream quality management
