Stream Data Management System Prototypes

Ying Sheng, Richard Sia

June 1, 2004

Professor Carlo Zaniolo

CS 240B

Spring 2004

Outline
  • Motivation of DSMS
  • Aurora (Brown, Brandeis, MIT)
    • Model
    • Operator Scheduling
    • Storage/Memory Management
    • QoS issue
  • STREAM (Stanford)
    • System Architecture
    • Query Language
    • Query Plans and Execution
    • Performance Issues
    • Approximation Techniques
    • STREAM Interface
  • Conclusion
Motivation of DSMS
  • Continuous data and static queries
    • Monitoring using sensors
      • Military
      • Traffic
      • Environment
    • Financial analysis
    • Object tracking
Aurora – Model
  • General Purpose DSMS
  • Continuous stream data arrives
  • Flows through a set of operators
  • Output goes to an application or is materialized
Aurora – Model
  • Components
    • Storage manager
    • Scheduler
    • Load Shedder
    • Router
    • QoS Monitor
    • GUI
Aurora – Model
  • Three kinds of queries are supported
    • Continuous queries
    • Views
    • Ad-hoc queries
Aurora – Model
  • 8 primitive operators (boxes)
    • Windowed
      • Slide
      • Tumble
      • Latch
      • Resample
    • Non-windowed
      • Filter
      • Map
      • GroupBy
      • Join
Aurora – Operator Optimization
  • Each operator (box) b is associated with
    • Selectivity: s(b), also written sel(b)
    • Computation time per tuple: c(b), also written cost(b)
  • General Optimization Techniques
    • Pushing projection upstream
    • Combining boxes
    • Reordering boxes
Aurora – Operator Optimization
  • Case 1 : cost of ab
    • c(a) + s(a)c(b)
  • Case 2: cost of ba
    • c(b) + s(b)c(a)
  • Criteria for switching box position
    • c(a)+s(a)c(b) > c(b)+s(b)c(a)
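The two cost expressions and the switching criterion can be checked numerically. The following is a minimal sketch; the function names and the sample costs and selectivities are made up for illustration and do not come from Aurora.

```python
def pair_cost(c_first, s_first, c_second):
    """Expected per-tuple cost of running one box and then another.

    Every tuple pays c_first; only the fraction s_first that survives
    the first box goes on to pay c_second.
    """
    return c_first + s_first * c_second


def should_swap(c_a, s_a, c_b, s_b):
    """True when order ba is cheaper than order ab."""
    return pair_cost(c_a, s_a, c_b) > pair_cost(c_b, s_b, c_a)


# Hypothetical numbers: a is cheap but passes most tuples,
# b is more expensive but filters aggressively.
c_a, s_a = 1.0, 0.9
c_b, s_b = 2.0, 0.1
print("cost of ab:", pair_cost(c_a, s_a, c_b))
print("cost of ba:", pair_cost(c_b, s_b, c_a))
print("swap?", should_swap(c_a, s_a, c_b, s_b))
```

For positive costs, the criterion rearranges to (1 - s(a))/c(a) < (1 - s(b))/c(b): the box that discards more tuples per unit of computation should run first.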





Aurora – Operator Scheduling
  • Scheduling by the OS
    • One thread per box; scheduling is delegated to the OS
    • Easier to program
  • Aurora Scheduler
    • A single thread runs the scheduler
    • The scheduler picks the box with the highest priority and calls it to consume tuples from its queue
    • Allows finer control of resources
  • Scalable!
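The single-threaded design can be illustrated with a toy scheduler loop. This is a sketch only: the `Box` class, its fixed priorities, and the convention that a box function returns `None` to drop a tuple are assumptions made for the example, not Aurora's actual interfaces.

```python
from collections import deque


class Box:
    """A hypothetical operator box with an input queue and a fixed priority."""

    def __init__(self, name, priority, fn, downstream=None):
        self.name = name
        self.priority = priority
        self.fn = fn                  # returns an output tuple, or None to drop it
        self.downstream = downstream  # next box, or None if this box feeds the output
        self.queue = deque()

    def run(self, output):
        """Consume every tuple currently queued in front of this box."""
        while self.queue:
            result = self.fn(self.queue.popleft())
            if result is None:
                continue
            if self.downstream is not None:
                self.downstream.queue.append(result)
            else:
                output.append(result)


def run_scheduler(boxes):
    """One thread: repeatedly run the highest-priority box that has input."""
    output = []
    while True:
        ready = [b for b in boxes if b.queue]
        if not ready:
            return output
        max(ready, key=lambda b: b.priority).run(output)


double = Box("map", priority=2, fn=lambda x: x * 2)
keep_large = Box("filter", priority=1,
                 fn=lambda x: x if x > 10 else None, downstream=double)
keep_large.queue.extend([5, 12, 20])
result = run_scheduler([keep_large, double])
print(result)
```

Because the scheduler decides which box runs and for how long, it can apply policies such as Min-Cost, Min-Latency, or Min-Memory, which an OS thread scheduler knows nothing about.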
Aurora – Operator Scheduling
  • Problem: which box to execute next?
  • Min-Cost (MC)
    • Reduce computation cost
  • Min-Latency (ML)
    • Return result as soon as possible
  • Min-Memory (MM)
    • Reduce memory usage of queue
Aurora – Operator Scheduling
  • Example (figure not recovered)
Aurora – Operator Scheduling
  • Min-Cost
    • Objective: avoid the overhead of calling boxes
  • Min-Latency
    • Prefer the box that can deliver tuples to the output in the shortest time
  • Min-Memory
    • Prefer the box that consumes the most tuples for the least computation time
    • Similar to “Chain Operator Scheduling”
  • More in: D. Carney et al., “Operator Scheduling in a Data Stream Manager”, VLDB 2003
Aurora – Storage/Memory Management
  • Manages the queue in front of each box
    • Two boxes may share the same queue
    • Windowed operators
  • The initial queue size is 128 KB
  • Each queue is managed as a circular queue
    • On overflow, the queue size is doubled; when the queue is underused, it is halved
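A circular queue that doubles on overflow and halves when underused can be sketched as follows. The class name, the tuple-count capacity (the slide gives the size in bytes), and the quarter-full shrink threshold are assumptions chosen for illustration.

```python
class GrowableRing:
    """Circular FIFO buffer that doubles when full and halves when mostly empty."""

    def __init__(self, capacity=4):
        self._min = capacity            # never shrink below the initial capacity
        self._buf = [None] * capacity
        self._head = 0                  # index of the oldest element
        self._size = 0

    def __len__(self):
        return self._size

    @property
    def capacity(self):
        return len(self._buf)

    def _resize(self, new_capacity):
        # Copy the live elements, oldest first, into a fresh buffer.
        items = [self._buf[(self._head + i) % len(self._buf)]
                 for i in range(self._size)]
        self._buf = items + [None] * (new_capacity - self._size)
        self._head = 0

    def push(self, item):
        if self._size == len(self._buf):
            self._resize(2 * len(self._buf))        # overflow: double
        self._buf[(self._head + self._size) % len(self._buf)] = item
        self._size += 1

    def pop(self):
        if self._size == 0:
            raise IndexError("pop from empty queue")
        item = self._buf[self._head]
        self._buf[self._head] = None
        self._head = (self._head + 1) % len(self._buf)
        self._size -= 1
        # Assumed policy: halve once the buffer is at most a quarter full.
        if len(self._buf) > self._min and self._size <= len(self._buf) // 4:
            self._resize(max(self._min, len(self._buf) // 2))
        return item


q = GrowableRing(capacity=4)
for i in range(5):
    q.push(i)
print(q.capacity)                    # grew past the initial capacity
print([q.pop() for _ in range(3)])   # FIFO order is preserved across resizes
```

Shrinking at a quarter full rather than half full avoids repeated grow/shrink cycles when the queue length hovers around a power of two.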
Aurora – Storage/Memory Management
  • Queues are swapped between memory and disk based on the priority of the boxes using them
  • Works with the operator scheduler to exchange box priority and buffer-state information
  • Connection Point Management
    • A B-tree indexed on timestamp is built to support random access to tuples by ad-hoc queries
Aurora – QoS Issue
  • Different queries/applications have different QoS requirements
    • Stock market monitoring
    • Average temperature of a set of sensors
  • QoS Graph
Latency-based QoS Graph
  • [Figure: latency-based QoS graph with the critical point marked]
Aurora – QoS-driven Scheduling
  • Assign a priority to each box b
    • priority(b) = [utility(b), est(b)]
    • utility(b) = gradient(eol(b)), where eol(b) is the expected output latency
      • How fast QoS is degrading at the time the tuple would leave the system if we process it now
    • est(b): expected slack time
      • How soon the box will suffer another QoS degradation if we do not process it now
  • Performance
    • 200 queries/applications, each with 5 boxes
    • Round robin: 0.43
    • QoS-driven scheduling: 0.85
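One way to read the priority pair is sketched below, under stated assumptions: the QoS graph is piecewise linear, utility is the rate at which QoS drops at the expected output latency, slack is the time remaining until the next point where the slope changes, and boxes are ranked by highest utility with ties broken by smallest slack. The graph, the box names, and the latencies are invented for the example.

```python
from bisect import bisect_right


def _segment(points, latency):
    """Index of the last graph point at or before the given latency."""
    return bisect_right([x for x, _ in points], latency) - 1


def utility(points, eol):
    """Rate at which QoS is dropping at the expected output latency."""
    i = _segment(points, eol)
    if i < 0 or i >= len(points) - 1:
        return 0.0  # outside the graph: treated as flat in this sketch
    (x0, y0), (x1, y1) = points[i], points[i + 1]
    return (y0 - y1) / (x1 - x0)


def slack(points, eol):
    """Time left until the next point where the graph changes slope."""
    i = _segment(points, eol)
    if i >= len(points) - 1:
        return float("inf")
    return points[i + 1][0] - eol


def rank(boxes, points):
    """Order box names by (highest utility, then smallest slack)."""
    return sorted(boxes, key=lambda name: (-utility(points, boxes[name]),
                                           slack(points, boxes[name])))


# Hypothetical graph: full QoS up to latency 10 (the critical point),
# then a linear drop to zero at latency 30.
graph = [(0, 1.0), (10, 1.0), (30, 0.0)]
expected_latency = {"A": 5, "B": 15, "C": 20}
print(rank(expected_latency, graph))
```

Box A is still on the flat part of the graph, so delaying it costs nothing yet; B and C are already losing QoS at the same rate, and C is closer to the next breakpoint.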
Aurora – Current Status
  • Main components of a DSMS are introduced
    • Operator scheduler
    • Memory/storage management
    • QoS concept under stress conditions
    • Load shedding
  • Implemented in C++, with a Java-based GUI
    • Depends on a few external software packages and libraries
  • More?
    • Distributed architecture – Aurora*
    • Fault tolerance or disaster recovery?
STREAM – Introduction
  • General-purpose prototype DSMS
  • Supports data streams and stored relations
  • Declarative language for registering continuous queries
  • Flexible query plans and execution strategies
  • Aggressive sharing of state and computation among queries
STREAM – Introduction
  • Designed to cope with
    • Stream rates that may be high, variable, bursty
    • Continuous query loads that may be high, volatile
  • Primary coping techniques
    • Graceful approximation as necessary
    • Careful resource allocation and use
    • Continuous self-monitoring and reoptimization
STREAM – System Architecture
  • [Figure: system architecture diagram; surviving labels: Input streams, Scratch Store]
STREAM – Query Language
  • Continuous Query Language – CQL
    • Extends SQL with
      • Streams as new data type
        • Stream: unbounded bag of <tuple, timestamp> pairs
        • Relation: time-varying bag of tuples
      • Continuous instead of one-time semantics
      • Three classes of operators
        • Relation-to-relation
        • Stream-to-relation
        • Relation-to-stream
STREAM – CQL Operators
  • Relation-to-relation
    • SQL constructs
  • Stream-to-relation
    • Tuple-based sliding window: [Rows N], [Rows Unbounded]
    • Time-based sliding window: [Range ω], [Now]
    • Partitioned sliding window: [Partition By A1,…Ak Rows N]
  • Relation-to-stream
    • Istream: insert stream
    • Dstream: delete stream
    • Rstream: relation stream
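The stream-to-relation and relation-to-stream operators can be mimicked on a small in-memory stream. This is a sketch, not the STREAM implementation: the function names are invented, the stream is a timestamp-ordered list of (timestamp, tuple) pairs, the range window is assumed to be inclusive at both ends, and Istream/Dstream are taken as the bag differences between the relation at two consecutive instants.

```python
from collections import Counter


def rows_window(stream, n, now):
    """[Rows N]: the N most recent tuples with timestamp <= now."""
    seen = [tup for ts, tup in stream if ts <= now]
    return seen[-n:] if n > 0 else []


def range_window(stream, width, now):
    """[Range width]: tuples with now - width <= timestamp <= now.
    A width of 0 behaves like [Now]."""
    return [tup for ts, tup in stream if now - width <= ts <= now]


def istream(prev, cur):
    """Tuples in the relation now that were not there at the previous instant."""
    return sorted((Counter(cur) - Counter(prev)).elements())


def dstream(prev, cur):
    """Tuples that were in the relation at the previous instant but are gone now."""
    return sorted((Counter(prev) - Counter(cur)).elements())


def rstream(cur):
    """Every tuple in the relation at the current instant."""
    return sorted(cur)


S = [(1, "a"), (2, "b"), (3, "c")]
before = rows_window(S, 2, now=2)
after = rows_window(S, 2, now=3)
print(before, after)
print(istream(before, after), dstream(before, after), rstream(after))
print(range_window(S, 1, now=3), range_window(S, 0, now=3))
```

`Counter` subtraction keeps only positive counts, which gives bag (multiset) difference and so handles duplicate tuples correctly.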
STREAM – Example Query 1
  • Two example streams:
    Orders (orderID, customer, cost)
    Fulfillments (orderID, clerk)
  • Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”:
    Select Sum(O.cost)
    From Orders O, Fulfillments F [Range 1 Day]
    Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
STREAM – Example Query 2
  • Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost:
    Select F.clerk, Max(O.cost)
    From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample
    Where O.orderID = F.orderID
    Group By F.clerk
STREAM – Simplified Query 2
  • Result is a relation, updated as stream elements arrive:
    Select F.clerk, Max(O.cost)
    From O, F [Rows 100]
    Where O.orderID = F.orderID
    Group By F.clerk
STREAM – Simplified Query 2
  • Result is streamed: emits a <clerk, max> stream element whenever the max changes for a clerk (or a new clerk appears):
    Select Istream(F.clerk, Max(O.cost))
    From O, F [Rows 100]
    Where O.orderID = F.orderID
    Group By F.clerk
STREAM – Example Query 3
  • Relation: CurPrice(stock, price)
  • Average price over the last day for each stock:
    Select stock, Avg(price)
    From Istream(CurPrice) [Range 1 Day]
    Group By stock
  • Istream provides history of CurPrice
  • Window on history (back to relation), group and aggregate
STREAM – Query plans and Execution
  • When a continuous query is registered, generate a query plan
    • New plan merged with existing plans
    • Users can also create & manipulate plans directly
  • Plans composed of three main components:
    • Operators
      • Flag: insertion(+), deletion (-)
      • Elements: <tuple, timestamp, flag> triples
      • Streams: only + elements
      • Relations: both + and - elements
    • Queues
      • Enforce nondecreasing timestamps (“heartbeats”)
      • Mechanisms for buffering tuples
    • States (Synopses)
  • Global scheduler for plan execution



STREAM – States
  • States (Synopses)
    • Summarize elements seen so far (exact or approximate) for operators requiring history
    • To implement windows
  • Example: synopsis join
    • Sliding-window join
    • Approximation of full join
STREAM – Simple Query Plan
    Select *
    From S1 [Rows 1000], S2 [Range 2 Minutes]
    Where S1.A = S2.A And S1.A > 10
STREAM – Performance Issues
  • Synopsis Sharing
    • Eliminate data redundancy
  • Exploiting Constraints
    • Selectively discard data to reduce state
  • Operator Scheduling
    • Reduce queue sizes
STREAM – Synopsis Sharing
  • Eliminate redundancy by
    • replacing nearly identical synopses with lightweight stubs
    • using a single store to hold the actual tuples
  • The store tracks the progress of each stub and presents the appropriate view to each stub
  • The store contains the union of the tuples in its corresponding stubs
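The stub-and-store idea can be sketched for two row-based windows over the same stream. The class names and window sizes are hypothetical, and the sketch simplifies progress tracking by assuming every stub sees each insertion at the same time.

```python
from collections import deque


class SharedStore:
    """Holds the actual tuples once; stubs are lightweight views onto it."""

    def __init__(self):
        self._tuples = deque()
        self._stub_rows = []

    def make_stub(self, rows):
        """Register a [Rows N] synopsis and return its stub."""
        self._stub_rows.append(rows)
        return RowsStub(self, rows)

    def insert(self, tup):
        self._tuples.append(tup)
        # Keep only what the largest window needs: the union of all stubs.
        limit = max(self._stub_rows, default=0)
        while len(self._tuples) > limit:
            self._tuples.popleft()

    def view(self, rows):
        """The most recent `rows` tuples, oldest first."""
        n = len(self._tuples)
        return list(self._tuples)[max(0, n - rows):]

    def __len__(self):
        return len(self._tuples)


class RowsStub:
    """A synopsis stub: stores no tuples, only its window size."""

    def __init__(self, store, rows):
        self._store = store
        self._rows = rows

    def contents(self):
        return self._store.view(self._rows)


store = SharedStore()
wide = store.make_stub(rows=5)     # stands in for a large window such as [Rows 1000]
narrow = store.make_stub(rows=2)   # stands in for a small window such as [Rows 200]
for value in range(7):
    store.insert(value)
print(len(store), wide.contents(), narrow.contents())
```

Memory use is that of the largest window rather than the sum of all windows, which is the saving synopsis sharing aims for.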
STREAM – Synopsis Sharing
    Select *
    From S1 [Rows 1000], S2 [Range 2 Minutes]
    Where S1.A = S2.A And S1.A > 10

    Select A, Max(B)
    From S1 [Rows 200]
    Group By A
STREAM – Exploiting Constraints
  • Specify an adherence parameter k that captures how closely a given stream or set of streams adheres to a constraint of a given type
    • Referential integrity k-constraint
    • Ordered-arrival k-constraint
    • Clustered-arrival k-constraint
  • Query execution plans reduce or eliminate state based on k-constraints
  • If a constraint is violated, the result is approximate
STREAM – Operator Scheduling
  • Goal: minimize total queue size for unpredictable, bursty stream arrival patterns
  • Chain Scheduling Algorithm:
    • Step 1: Mark the first operator in the plan as the “current” operator
    • Step 2: Find the block of consecutive operators starting at the “current” operator that maximizes the reduction in total queue size per unit time
    • Step 3: Mark the first operator following this block as the “current” operator and repeat Step 2 until all operators have been assigned to chains
    • Chains are scheduled greedily, but within a chain, execution proceeds in FIFO order
  • Proven: within a constant factor of any “clairvoyant” strategy, i.e., the optimal strategy based on knowledge of future input, for some queries
  • Empirical results: large savings over naive strategies for many queries
  • But minimizing queue sizes is at odds with minimizing latency
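The chain-construction steps can be sketched for a single linear plan. Assumptions in this sketch: each operator is described by a positive per-tuple processing time and a selectivity, queue size is measured as the fraction of an input tuple that remains, and ties between equally good blocks go to the shortest block. The function name and the sample plan are illustrative only.

```python
def build_chains(ops):
    """Partition a linear plan into chains.

    ops: list of (time_per_tuple, selectivity) in plan order.
    Returns a list of (operator_indices, slope), where slope is the
    reduction in queue size per unit of processing time for that chain.
    """
    # Cumulative processing time and remaining tuple size after each operator.
    times, sizes = [0.0], [1.0]
    for cost, sel in ops:
        times.append(times[-1] + cost)
        sizes.append(sizes[-1] * sel)

    chains, cur = [], 0
    while cur < len(ops):
        best_end, best_slope = None, None
        # Step 2: the block starting at `cur` with the steepest size reduction.
        for end in range(cur + 1, len(ops) + 1):
            slope = (sizes[cur] - sizes[end]) / (times[end] - times[cur])
            if best_slope is None or slope > best_slope:
                best_end, best_slope = end, slope
        chains.append((list(range(cur, best_end)), best_slope))
        cur = best_end  # Step 3: continue after the chosen block
    return chains


# Hypothetical plan: a pass-through operator, a selective filter, a mild filter.
plan = [(1.0, 1.0), (1.0, 0.1), (1.0, 0.5)]
for members, slope in build_chains(plan):
    print(members, round(slope, 3))
```

The first operator frees no memory by itself, but grouping it with the selective filter that follows produces a chain with a steep slope, so a greedy scheduler that favors steep chains runs the two together.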
STREAM – Approximation
  • CPU-Limited Approximation
    • Insufficient CPU time to process every stream element because of the high data arrival rate
    • Load shedding
      • Sampling operators
      • Approximate by probabilistically dropping elements before they are processed
  • Memory-Limited Approximation
    • The total state required for all registered queries exceeds available memory.
    • The system selectively shrinks or discards synopses.
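Probabilistic dropping can be sketched as a sampling operator placed in front of the expensive part of a plan. The function name, the `keep_probability` parameter, and the seeded generator are assumptions for illustration.

```python
import random


def shed(stream, keep_probability, rng=None):
    """Yield each element independently with the given probability.

    Elements that are not yielded are dropped before any downstream
    operator spends CPU time on them.
    """
    rng = rng or random.Random()
    for element in stream:
        if rng.random() < keep_probability:
            yield element


# Keep roughly 10% of a hypothetical input stream; the seed makes runs repeatable.
kept = list(shed(range(1000), 0.1, random.Random(0)))
print(len(kept), "of 1000 elements kept")
```

Surviving elements keep their arrival order, so downstream windows and timestamps remain consistent; only the answer becomes approximate.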
STREAM – Query Interface
  • View the structure of query plans and their component entities.
  • View the detailed properties of each entity.
  • Dynamically adjust entity properties.
  • View monitoring graphs that display time-varying entity properties plotted dynamically against time.
    • Queue sizes, throughput, overall memory usage, and join selectivity.
STREAM – Current Status
  • Version 1.0 up and running
  • Includes a new monitoring and adaptive query processing infrastructure – StreaMon
    • Executor runs query plans to produce results.
    • Profiler collects and maintains statistics about stream and plan characteristics.
    • Reoptimizer ensures that the plans and memory structures are the most efficient for current characteristics.
  • Web demo available at
  • Future Directions:
    • Distributed Stream Processing
    • Crash Recovery
    • Improved Approximation
    • Classification of Applications
Conclusion
  • Ideal DSMS
    • Well defined and flexible query language
    • User-friendly interface
    • Scalable
      • Operator scheduling
      • Storage management
      • Synopsis sharing
      • Approximation
    • Quality assurance
    • Fault tolerant
References
  • R. Motwani et al., “Query Processing, Approximation, and Resource Management in a Data Stream Management System”, in Proceedings of the 1st CIDR Conference, 2003.
  • S. Madden et al., “Continuously Adaptive Continuous Queries over Streams”, in Proceedings of the SIGMOD Conference, 2002.
  • D. Carney et al., “Monitoring Streams - A New Class of Data Management Applications”, in Proceedings of the VLDB Conference, 2002.
  • D. Carney et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of the VLDB Conference, 2003.
  • Stanford STREAM Project Website:
  • Aurora Project Website: