Streaming Data, Continuous Queries, and Adaptive Dataflow

Michael Franklin, UC Berkeley. NRC, June 2002.

Data Stream Processing
• Networked data streams are central to current and future computing.
• Existing data management and query processing infrastructure is lacking:
  • Adaptability
  • Continuous and Incremental Processing
  • Work Sharing for large scale
  • Resource scalability: from “smart dust” up to clusters to grids.
• XML provides additional opportunities.

Example 1: “Transactional Flows”
• E-Commerce, clickstream, swipestream, logs…
• Network Monitoring
• B2B and Enterprise apps
  • Supply-Chain, CRM, ERP
• (Quasi) real-time flow of events and data
• Must manage these flows to drive business processes.
• Mine flows to create and adjust business rules.
• Can also “tap into” flows for on-line analysis.

Example 2: Information Dissemination
• Document creation or a crawler initiates a flow of data towards users.
• Profiles are aggregated back towards data.
[Diagram labels: Data Sources, User Profiles, Filtered Data, Users]

Example 3: Sensor Nets
• Tiny (or not so tiny) devices measure the physical world.
  • Berkeley “motes”, Smart Dust, Smart Tags, …
• Many monitoring applications
  • Transportation, Seismic, Energy, Military…
• Form dynamic ad hoc networks.
• Aggregate and communicate streams of values.
• Not one-way: can actuate to affect or actively monitor the environment.

Common Features
• Centrality of Dataflow and Data Routing
  • Architecture is focused on data movement
  • Moving streams of data through code in a network
• Volatility of the environment
  • Dynamic resources & topology, partial failures
  • Long-running (never-ending?) tasks
  • Potential for user interaction during the flow
• Large Scale: users, data, resources, …
• Resource Constraints
  • Bandwidth, memory, processing, battery, …
  • Time and human attention

In The Beginning
[Diagram labels: Query, Index, Data, Result]

Pub Sub/CQ/Filtering
[Diagram labels: Data, Index, Queries, Result]
• Effectively processes all queries simultaneously.
• Shares work for common sub-expressions.
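The slide's idea of indexing the queries and letting each arriving data item probe that index can be sketched as follows. This is only a minimal illustration, assuming standing queries are single equality predicates; `QueryIndex`, `register`, and `match` are hypothetical names and not part of any system named in the talk.

```python
from collections import defaultdict


class QueryIndex:
    """Hypothetical index over standing equality predicates (field == value)."""

    def __init__(self):
        # (field, value) -> ids of the queries that registered that predicate
        self._by_predicate = defaultdict(set)

    def register(self, query_id, field, value):
        """Store a standing query; queries with the same predicate share an entry."""
        self._by_predicate[(field, value)].add(query_id)

    def match(self, item):
        """Return the ids of all standing queries satisfied by one data item."""
        matched = set()
        for field, value in item.items():
            # One lookup serves every query that shares this predicate.
            matched |= self._by_predicate.get((field, value), set())
        return matched


index = QueryIndex()
index.register("q1", "symbol", "MSFT")
index.register("q2", "symbol", "MSFT")
index.register("q3", "symbol", "IBM")
matches = index.match({"symbol": "MSFT", "price": 54})
```

Because `q1` and `q2` share a predicate, a single lookup answers both, which is one simple form of sharing work across common sub-expressions.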

Telegraph/PSoup: Query & Data Duality
[Diagram labels: Data, Index, Index, Queries, Data, Result]

Telegraph/PSoup: Query & Data Duality
[Diagram labels: Query, Index, Index, Queries, Data, Result]
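One way to read the two duality diagrams is that data and queries are both stored, and a new arrival of either kind probes what is stored of the other kind. The sketch below illustrates that symmetry under strong simplifying assumptions: linear scans stand in for the two indexes, queries are plain predicates, and `DualityStore` and its methods are hypothetical names rather than the Telegraph or PSoup implementation.

```python
class DualityStore:
    """Hypothetical sketch: data and queries are stored and probed symmetrically."""

    def __init__(self):
        self.data = []      # stored data items
        self.queries = {}   # query id -> predicate (callable: item -> bool)
        self.results = {}   # query id -> matching items seen so far

    def insert_data(self, item):
        """New data probes the stored queries."""
        self.data.append(item)
        for query_id, predicate in self.queries.items():
            if predicate(item):
                self.results[query_id].append(item)

    def insert_query(self, query_id, predicate):
        """A new query probes the stored data."""
        self.queries[query_id] = predicate
        self.results[query_id] = [item for item in self.data if predicate(item)]


store = DualityStore()
store.insert_data({"temp": 30})                       # arrives before the query
store.insert_query("hot", lambda r: r["temp"] > 25)   # sees the earlier data
store.insert_data({"temp": 20})
store.insert_data({"temp": 40})                       # arrives after the query
hot = store.results["hot"]
```

In this sketch the result for `hot` covers data that arrived both before and after the query was registered, which is the point of treating the two arrivals symmetrically.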

PSoup: Query Invocation
• PSoup continuously maintains materialized views over streaming data and queries.
• Data is returned to the user when a query is invoked.
• Invocation requires applying “windows” to precomputed results.
• Adaptive approach allows the system to continuously absorb new data and new queries without recompilation.
• Lots of issues to study:
  • Query indexing, spilling to disk, bulk processing
  • Other semantics and interaction models (e.g., alerts)
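The invocation step, applying a window to results that are already materialized, can be sketched as below. This is an illustration only, assuming items arrive in timestamp order and that a window is the closed interval [now - window, now]; `MaterializedQuery`, `absorb`, and `invoke` are hypothetical names, not PSoup's interface.

```python
import bisect


class MaterializedQuery:
    """Hypothetical standing query whose matches are kept ready for invocation."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.timestamps = []  # sorted, assuming in-order arrival
        self.items = []       # items[i] matched at timestamps[i]

    def absorb(self, timestamp, item):
        """Continuously fold new data into the precomputed result."""
        if self.predicate(item):
            self.timestamps.append(timestamp)
            self.items.append(item)

    def invoke(self, now, window):
        """Apply the window [now - window, now] to the precomputed result."""
        lo = bisect.bisect_left(self.timestamps, now - window)
        hi = bisect.bisect_right(self.timestamps, now)
        return self.items[lo:hi]


query = MaterializedQuery(lambda r: r["temp"] > 25)
query.absorb(1, {"temp": 30})
query.absorb(2, {"temp": 20})   # does not match, so it is not materialized
query.absorb(3, {"temp": 28})
query.absorb(4, {"temp": 31})
recent = query.invoke(now=4, window=1)
```

Invocation here does no predicate evaluation at all; it only slices the stored matches, which is why the work done at query time stays small.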

Stream Processing Research Agenda
• Need continuously-adaptive processing.
• Need appropriate data model & query language.
  • Window semantics: input and output
  • Notification semantics & thresholds
• Approximation, satisficing, and QoS
  • must be driven by user needs and context
  • adapt to available resources & time constraints
• Integration & interaction with “pooled” data.
  • time travel, archiving, “normal” databases
• Structured, semi-structured, and unstructured data; XML, etc.
• Sensor-sensitive processing.
• Metrics and Benchmarks (challenge problems).

Conclusions
• Dataflow and streaming are central to many emerging application areas.
• Solutions require a mixture of database and networking approaches:
  • adaptivity and tolerance of partial failure
  • exploitation of user, app, and data semantics
• A new infrastructure is needed for solving these problems.
• Duality of Data and Queries
• Currently a topic of major interest in the research community.
