
Stream Data Management System Prototypes

Ying Sheng, Richard Sia

June 1, 2004

Professor Carlo Zaniolo

CS 240B

Spring 2004


Outline

  • Motivation of DSMS

  • Aurora (Brown, Brandeis, MIT)

    • Model

    • Operator Scheduling

    • Storage/Memory Management

    • QoS issue

  • STREAM (Stanford)

    • System Architecture

    • Query Language

    • Query Plans and Execution

    • Performance Issues

    • Approximation Techniques

    • STREAM Interface

  • Conclusion


Motivation

  • HADP → DAHP (human-active, DBMS-passive → DBMS-active, human-passive)

  • Continuous data and static queries

    • Monitoring using sensors

      • Military

      • Traffic

      • Environment

    • Financial analysis

    • Object tracking



Aurora – Model

  • General Purpose DSMS

  • Continuous stream data arrives

  • Flows through a set of operators

  • Output goes to applications or is materialized


Aurora – Model

  • Components

    • Storage manager

    • Scheduler

    • Load Shedder

    • Router

    • QoS Monitor

    • GUI


Aurora – Model

  • 3 kinds of queries supported

    • Continuous

    • View

    • Ad-Hoc Query


Aurora – Model

  • 8 primitive operators (boxes)

    • Windowed

      • Slide

      • Tumble

      • Latch

      • Resample

    • Non-windowed

      • Filter

      • Map

      • GroupBy

      • Join


Aurora – Operator Optimization

  • Each operator associated with

    • Selectivity: s(b), sel(b)

    • Computation time: c(b), cost(b)

  • General Optimization Techniques

    • Pushing projection upstream

    • Combining boxes

    • Reordering boxes


Aurora – Operator Optimization

  • Case 1 : cost of ab

    • c(a) + s(a)c(b)

  • Case 2: cost of ba

    • c(b) + s(b)c(a)

  • Criteria for switching box position

    • c(a)+s(a)c(b) > c(b)+s(b)c(a)

[Figure: box a followed by box b (Case 1), and box b followed by box a (Case 2)]
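The switching criterion can be checked numerically. The sketch below is only an illustration: the `Box` class, its field names, and the sample statistics are assumptions, not Aurora code. It also assumes the two boxes commute, so that swapping them does not change the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """An operator box with the two statistics named on the slide."""
    name: str
    cost: float         # c(b): processing time per input tuple
    selectivity: float  # s(b): output tuples per input tuple


def pipeline_cost(first: Box, second: Box) -> float:
    """Expected cost per input tuple of running `first`, then `second`."""
    return first.cost + first.selectivity * second.cost


def should_swap(a: Box, b: Box) -> bool:
    """True when the order b -> a is strictly cheaper than a -> b."""
    return pipeline_cost(a, b) > pipeline_cost(b, a)


# Hypothetical statistics: a is expensive and passes most tuples,
# b is cheap and drops most tuples.
a = Box("a", cost=4.0, selectivity=0.9)
b = Box("b", cost=1.0, selectivity=0.1)
print(pipeline_cost(a, b))  # c(a) + s(a)c(b)
print(pipeline_cost(b, a))  # c(b) + s(b)c(a)
print(should_swap(a, b))
```

With these made-up numbers the cheap, highly selective box b should run first, which is the usual intuition behind the criterion.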


Aurora – Operator Scheduling

  • Scheduling by OS

    • One thread per box, shift the job to OS

    • Easier to program

  • Aurora Scheduler

    • Single thread for the scheduler

    • The scheduler picks the box with the highest priority and calls it to consume tuples from its queue

    • Allows finer control of resources

  • Scalable!
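The single-threaded design can be pictured as one loop that repeatedly picks a box and lets it drain its queue. This is a minimal sketch under stated assumptions: the `Box` class, the static integer priority, and the drain-everything policy are all hypothetical simplifications, not Aurora's actual scheduler.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class Box:
    """A box with an input queue and a static priority (hypothetical fields)."""

    def __init__(self, name: str, priority: int, fn: Callable[[int], int]):
        self.name = name
        self.priority = priority
        self.fn = fn
        self.queue: Deque[int] = deque()
        self.output: List[int] = []

    def run(self) -> None:
        """Consume every tuple currently waiting in the queue."""
        while self.queue:
            self.output.append(self.fn(self.queue.popleft()))


def pick_next(boxes: List[Box]) -> Optional[Box]:
    """Return the highest-priority box that has queued tuples, or None."""
    ready = [b for b in boxes if b.queue]
    return max(ready, key=lambda b: b.priority) if ready else None


def schedule(boxes: List[Box]) -> List[str]:
    """One scheduler thread: no thread per box, one decision at a time."""
    order = []
    while (box := pick_next(boxes)) is not None:
        box.run()
        order.append(box.name)
    return order


low = Box("low", 1, lambda x: x + 1)
high = Box("high", 5, lambda x: x * 2)
low.queue.extend([1, 2])
high.queue.extend([10])
print(schedule([low, high]))  # the high-priority box runs first
```

Because every scheduling decision passes through `pick_next`, the priority function is the single place where a policy such as Min-Cost, Min-Latency, or Min-Memory could be plugged in.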



Aurora – Operator Scheduling

  • Problem: which box to execute next?

  • Min-Cost (MC)

    • Reduce computation cost

  • Min-Latency (ML)

    • Return result as soon as possible

  • Min-Memory (MM)

    • Reduce memory usage of queue


Aurora – Operator Scheduling

  • Example

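[Figure: example query network in which input streams pass through boxes b1 to b6 on the way to the application; the downstream direction is marked.]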


Aurora – Operator Scheduling

  • Min-Cost

    • Objective: avoid overhead of calling boxes

  • Min-Latency

    • Prefer the box that can deliver tuples to the output soonest

  • Min-Memory

    • Prefer the box that consumes more tuples with less computation time

    • Similar to “Chain Operator Scheduling”

  • More at: Operator Scheduling in a Data Stream Manager, VLDB 2003


Aurora – Storage/Memory Management

  • Manage the queue in front of each box

    • 2 boxes sharing the same queue

    • windowed operator

  • The initial queue size is 128 KB

  • Queues are managed as a circular queue

    • On overflow, double the queue size; shrink it again when the queue is underused
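A growable circular queue of this kind can be sketched as follows. The capacity here is counted in elements rather than bytes, and the shrink rule (halve when at most a quarter full, never below the initial capacity) is an assumption; the slide only says that the size changes in both directions.

```python
class CircularQueue:
    """Ring buffer that doubles on overflow and halves when mostly empty."""

    def __init__(self, capacity: int = 4):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self._min = capacity          # never shrink below the initial size
        self._buf = [None] * capacity
        self._head = 0                # index of the oldest element
        self._size = 0

    @property
    def capacity(self) -> int:
        return len(self._buf)

    def __len__(self) -> int:
        return self._size

    def _resize(self, new_capacity: int) -> None:
        """Copy live elements, oldest first, into a fresh buffer."""
        items = [self._buf[(self._head + i) % len(self._buf)]
                 for i in range(self._size)]
        self._buf = items + [None] * (new_capacity - self._size)
        self._head = 0

    def push(self, item) -> None:
        if self._size == len(self._buf):
            self._resize(2 * len(self._buf))   # overflow: double
        tail = (self._head + self._size) % len(self._buf)
        self._buf[tail] = item
        self._size += 1

    def pop(self):
        if self._size == 0:
            raise IndexError("pop from empty queue")
        item = self._buf[self._head]
        self._buf[self._head] = None
        self._head = (self._head + 1) % len(self._buf)
        self._size -= 1
        # Assumed shrink rule: halve once the queue is at most a quarter full.
        if len(self._buf) > self._min and self._size <= len(self._buf) // 4:
            self._resize(max(self._min, len(self._buf) // 2))
        return item
```

Shrinking at a quarter full rather than at half full avoids repeated resizing when the queue length hovers around a boundary.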


Aurora – Storage/Memory Management

  • Swap queues between memory and disk based on the priority of the boxes using them

  • Work with Operator Scheduler to exchange box priority and buffer-state information

  • Connection Point Management

    • A B-tree indexed on timestamp is built to support random access of tuples by ad-hoc query
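The point of a timestamp index is that an ad-hoc query can jump straight to a time range instead of scanning the whole history. The sketch below stands in for the B-tree with a sorted list and binary search from Python's `bisect` module; the class and method names are hypothetical.

```python
import bisect
from typing import Any, List, Tuple


class ConnectionPointHistory:
    """Tuples kept in timestamp order for range lookups by ad-hoc queries.

    A sorted list with binary search is used here in place of a real B-tree.
    """

    def __init__(self) -> None:
        self._timestamps: List[float] = []
        self._tuples: List[Any] = []

    def append(self, timestamp: float, tup: Any) -> None:
        """Insert a tuple, keeping both lists sorted by timestamp."""
        pos = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(pos, timestamp)
        self._tuples.insert(pos, tup)

    def range(self, start: float, end: float) -> List[Tuple[float, Any]]:
        """Return all (timestamp, tuple) pairs with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_right(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._tuples[lo:hi]))


history = ConnectionPointHistory()
for ts, tup in [(1, "a"), (3, "c"), (2, "b"), (5, "e")]:
    history.append(ts, tup)
print(history.range(2, 3))
```

A disk-resident B-tree offers the same logarithmic lookup while also supporting cheap inserts, which a plain Python list does not.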



Aurora – QoS Issue

  • Different queries/applications have different QoS requirements

    • Stock market monitoring

    • Average temperature of a set of sensors

  • QoS Graph


Latency-based QoS Graph

[Figure: QoS plotted against time for a box b followed by D(b). Marked on the plot: the critical point, 0, est(b), eol(b), latency(b), and cost(D(b)).]


Aurora – QoS-driven Scheduling

  • Assign priority to each box based on

    • priority (b) = [utility (b), est (b)]

    • utility (b) = gradient (eol (b))

      • How much the QoS will have degraded by the time the tuple leaves the system if we process it now.

    • est (b)

      • How soon it will exhibit another performance degradation if we don’t process it now.

  • Performance

    • 200 queries/application, each with 5 boxes

    • Round-robin scheduling: 0.43

    • QoS-driven scheduling: 0.85
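The two-part priority can be read as a sort key. The sketch below assumes that utility is taken as the magnitude of the QoS slope (steeper means more urgent) and that, among boxes of equal utility, a smaller est is more urgent. The `BoxState` type and the sample numbers are illustrative only.

```python
from typing import List, NamedTuple


class BoxState(NamedTuple):
    """Scheduling state of one box (hypothetical representation)."""
    name: str
    utility: float  # steepness of the QoS curve at eol(b)
    est: float      # time left before the next QoS drop if b is not run


def by_priority(boxes: List[BoxState]) -> List[BoxState]:
    """Order boxes by descending utility, then by ascending est."""
    return sorted(boxes, key=lambda b: (-b.utility, b.est))


boxes = [
    BoxState("b1", utility=0.2, est=5.0),
    BoxState("b2", utility=0.8, est=9.0),
    BoxState("b3", utility=0.8, est=3.0),
]
print([b.name for b in by_priority(boxes)])
```

Here b3 and b2 lose QoS equally fast, so the one that will degrade sooner is run first, and b1 waits.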


Aurora – Current Status

  • Main components of a DSMS are introduced

    • Operator scheduler

    • Memory/storage management

    • QoS concept in a stressed (overloaded) environment

    • Load shedding

  • Implemented in C++, with Java-based GUI

    • Depends on a few external software libraries

  • More?

    • Distributed architecture – Aurora*

    • Fault tolerance or disaster recovery?



STREAM – Introduction

  • General-purpose prototype DSMS

  • Supports data streams and stored relations

  • Declarative language for registering continuous queries

  • Flexible query plans and execution strategies

  • Aggressive sharing of state and computation among queries


STREAM – Introduction

  • Designed to cope with

    • Stream rates that may be high, variable, bursty

    • Continuous query loads that may be high, volatile

  • Primary coping techniques

    • Graceful approximation as necessary

    • Careful resource allocation and use

    • Continuous self-monitoring and reoptimization


STREAM – System Architecture

[Figure: queries are registered with the DSMS, which consumes input streams and returns streamed results and stored results; the DSMS also uses a scratch store, stored relations, and an archive.]


STREAM – Query Language

  • Continuous Query Language – CQL

    • Extends SQL with

      • Streams as new data type

        • Stream: Unbounded bag of pairs <tuple, timestamp>

        • Relation: time-varying bags of tuples

      • Continuous instead of one-time semantics

      • Three classes of operators

        • Relation-to-relation

        • Stream-to-relation

        • Relation-to-stream


STREAM – CQL Operators

  • Relation-to-relation

    • SQL constructs

  • Stream-to-relation

    • Tuple-based sliding window: [Rows N], [Rows Unbounded]

    • Time-based sliding window: [Range ω], [Now]

    • Partitioned sliding window: [Partition By A1,…Ak Rows N]

  • Relation-to-stream

    • Istream: insert stream

    • Dstream: delete stream

    • Rstream: relation stream
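The three operator classes can be illustrated with bags of tuples. In the sketch below a stream is a list of (tuple, timestamp) pairs in timestamp order and a relation at one instant is a `collections.Counter` used as a bag. The function names are made up for illustration, and each result is recomputed from scratch, which a real DSMS would not do.

```python
from collections import Counter
from typing import Any, List, Tuple

Element = Tuple[Any, int]  # (tuple, timestamp)


def rows_window(stream: List[Element], now: int, n: int) -> Counter:
    """[Rows N]: the N most recent tuples with timestamp <= now."""
    seen = [t for t, ts in stream if ts <= now]
    return Counter(seen[max(0, len(seen) - n):])


def range_window(stream: List[Element], now: int, width: int) -> Counter:
    """[Range w]: tuples with now - width <= timestamp <= now ([Now] is w = 0)."""
    return Counter(t for t, ts in stream if now - width <= ts <= now)


def istream(prev: Counter, cur: Counter) -> Counter:
    """Insert stream: tuples in the relation now but not at the previous instant."""
    return cur - prev


def dstream(prev: Counter, cur: Counter) -> Counter:
    """Delete stream: tuples in the relation before but no longer present."""
    return prev - cur


def rstream(cur: Counter) -> Counter:
    """Relation stream: every tuple currently in the relation."""
    return Counter(cur)


stream = [("a", 1), ("b", 2), ("c", 3), ("d", 5)]
before = rows_window(stream, now=3, n=2)
after = rows_window(stream, now=5, n=2)
print(sorted(istream(before, after)))  # tuples that entered the window
print(sorted(dstream(before, after)))  # tuples that left the window
```

Between time 3 and time 5 the two-row window slides from {b, c} to {c, d}, so the insert stream carries d and the delete stream carries b.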


STREAM – Example Query 1

  • Two example streams:

    Orders (orderID, customer, cost)

    Fulfillments (orderID, clerk)

  • Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”:

    Select Sum(O.cost)

    From Orders O, Fulfillments F [Range 1 Day]

    Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”


STREAM – Example Query 2

  • Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost:

    Select F.clerk, Max(O.cost)

    From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample

    Where O.orderID = F.orderID

    Group By F.clerk


STREAM – Simplified Query 2

  • Result is a relation, updated as stream elements arrive:

    Select F.clerk, Max(O.cost)

    From O, F [Rows 100]

    Where O.orderID = F.orderID

    Group By F.clerk


STREAM – Simplified Query 2

  • Result is streamed: Emits <clerk, max> stream element whenever max changes for a clerk (or new clerk):

    Select Istream(F.clerk, Max(O.cost))

    From O, F [Rows 100]

    Where O.orderID = F.orderID

    Group By F.clerk


STREAM – Example Query 3

  • Relation: CurPrice(stock, price)

  • Average price over last day for each stock:

    Select stock, Avg(price)

    From Istream(CurPrice) [Range 1 Day]

    Group By stock

  • Istream provides history of CurPrice

  • Window on history (back to relation), group and aggregate


STREAM – Query Plans and Execution

  • When a continuous query is registered, generate a query plan

    • New plan merged with existing plans

    • Users can also create & manipulate plans directly

  • Plans composed of three main components:

    • Operators

      • Flag: insertion(+), deletion (-)

      • Elements: <tuple, timestamp, flag> triples

      • Streams: only + elements

      • Relations: both + and - elements

    • Queues

      • Enforce nondecreasing timestamps (“heartbeats”)

      • Mechanisms for buffering tuples

    • States (Synopses)

  • Global scheduler for plan execution
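The flagged elements are easiest to see in a window operator, which reads a stream (only + elements) and produces a relation (both + and - elements). The sketch below is an assumed illustration of a [Rows n] window with n >= 1; the `Element` type and the function name are not taken from STREAM.

```python
from collections import deque
from typing import Any, Deque, Iterator, List, NamedTuple


class Element(NamedTuple):
    tup: Any
    timestamp: int
    flag: str  # "+" for insertion, "-" for deletion


def rows_window_operator(stream: List[Element], n: int) -> Iterator[Element]:
    """Turn a stream into the +/- elements of the relation S [Rows n]."""
    window: Deque[Any] = deque()
    last_ts = None
    for el in stream:
        if el.flag != "+":
            raise ValueError("streams carry only + elements")
        if last_ts is not None and el.timestamp < last_ts:
            raise ValueError("timestamps must be nondecreasing")
        last_ts = el.timestamp
        if len(window) == n:
            # The oldest tuple leaves the window when the new one arrives.
            yield Element(window.popleft(), el.timestamp, "-")
        window.append(el.tup)
        yield Element(el.tup, el.timestamp, "+")


stream = [Element("x", 1, "+"), Element("y", 2, "+"), Element("z", 3, "+")]
for out in rows_window_operator(stream, n=2):
    print(out.flag, out.tup, out.timestamp)
```

The `deque` plays the role of the operator's state (synopsis), and the timestamp check mirrors the nondecreasing-order guarantee that the queues enforce.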


STREAM – States

  • States (Synopses)

    • Summarize elements seen so far (exact or approximate) for operators requiring history

    • To implement windows

  • Example: synopsis join

    • Sliding-window join

    • Approximation of full join


STREAM – Simple Query Plan

Select *

From S1 [Rows 1000],

S2 [Range 2 Minutes]

Where S1.A = S2.A

And S1.A > 10


STREAM – Performance Issues

  • Synopsis Sharing

    • Eliminate data redundancy

  • Exploiting Constraints

    • Selectively discard data to reduce state

  • Operator Scheduling

    • Reduce queue sizes


STREAM – Synopsis Sharing

  • Eliminate redundancy by

    • replacing nearly identical synopses with lightweight stubs

    • using a single store to hold the actual tuples

  • The store tracks the progress of each stub and presents the appropriate view to each stub.

  • The store contains the union of the tuples needed by its stubs
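A scaled-down illustration of the idea: two row-based windows over the same stream share one store, and each stub only remembers how many rows it needs. The `SharedStore` class and its methods are hypothetical, and only [Rows N] windows are handled.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class SharedStore:
    """One store of tuples shared by several window stubs over one stream."""

    def __init__(self) -> None:
        self._tuples: Deque[Any] = deque()
        self._stub_rows: Dict[str, int] = {}

    def add_stub(self, name: str, rows: int) -> None:
        """Register a lightweight stub that wants the last `rows` tuples."""
        self._stub_rows[name] = rows

    def insert(self, tup: Any) -> None:
        """Append a tuple, then trim to the union of what the stubs need."""
        self._tuples.append(tup)
        keep = max(self._stub_rows.values(), default=0)
        while len(self._tuples) > keep:
            self._tuples.popleft()

    def view(self, name: str) -> List[Any]:
        """Present one stub's view: only its most recent `rows` tuples."""
        rows = self._stub_rows[name]
        items = list(self._tuples)
        return items[max(0, len(items) - rows):]

    def __len__(self) -> int:
        return len(self._tuples)


store = SharedStore()
store.add_stub("join_window", rows=5)       # stands in for S1 [Rows 1000]
store.add_stub("aggregate_window", rows=2)  # stands in for S1 [Rows 200]
for value in range(8):
    store.insert(value)
print(len(store), store.view("join_window"), store.view("aggregate_window"))
```

For nested row windows the union is simply the largest window, so the store holds five tuples instead of the seven that two separate synopses would keep.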


STREAM – Synopsis Sharing

Select *

From S1 [Rows 1000],

S2 [Range 2 Minutes]

Where S1.A = S2.A

And S1.A > 10

Select A, Max(B)

From S1 [Rows 200]

Group By A


STREAM – Exploiting Constraints

  • Specify an adherence parameter k to capture how closely a given stream or set of streams adheres to a constraint of one of the following types

    • Referential integrity k-constraint

    • Ordered-arrival k-constraint

    • Clustered-arrival k-constraint

  • Query execution plans reduce or eliminate state based on k-constraints

  • If constraint violated, get approximate result


STREAM – Operator Scheduling

  • Goal: minimize total queue size for unpredictable, bursty stream arrival patterns

  • Chain Scheduling Algorithm:

    • Mark the first operator in the plan as the “current” operator

    • Find the block of consecutive operators starting at the “current” operator that maximizes the reduction in total queue size per unit time.

    • Mark the first operator following this block as the “current” operator and repeat the previous step until all operators have been assigned to chains.

    • Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order.

  • Proven: within constant factor of any “clairvoyant” strategy, i.e., the optimal strategy based on knowledge of future input, for some queries

  • Empirical results: large savings over naive strategies for many queries

  • But minimizing queue sizes is at odds with minimizing latency
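The chain-construction steps can be sketched for a single linear plan. The sketch assumes each operator is described by a processing time per tuple (strictly positive) and a selectivity, and that a tuple entering the plan has size 1. It covers only the partitioning into chains, not the run-time choice among chains, and the function name is hypothetical.

```python
from typing import List, Tuple


def build_chains(ops: List[Tuple[float, float]]) -> List[List[int]]:
    """Partition the operators of a linear plan into chains.

    ops[i] = (time per tuple, selectivity) of the i-th operator.
    Returns one list of operator indices per chain, in plan order.
    """
    # Cumulative processing time and remaining tuple size after each operator.
    times, sizes = [0.0], [1.0]
    for t, s in ops:
        times.append(times[-1] + t)
        sizes.append(sizes[-1] * s)

    chains: List[List[int]] = []
    current = 0
    while current < len(ops):
        # End the block where the size reduction per unit time is largest.
        best_end = max(
            range(current + 1, len(ops) + 1),
            key=lambda j: (sizes[current] - sizes[j]) / (times[j] - times[current]),
        )
        chains.append(list(range(current, best_end)))
        current = best_end
    return chains


# Hypothetical plan: a weak filter, a strong filter, then a moderate one.
print(build_chains([(1.0, 0.9), (1.0, 0.1), (1.0, 0.5)]))
```

The weak first filter is grouped with the strong second one because running both together shrinks the queue faster per unit time than stopping after the first.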


STREAM – Approximation

  • CPU-Limited Approximation

    • Insufficient CPU time to process each stream element due to the high data arrival rate.

    • load-shedding

      • sampling operators

      • Approximate by probabilistically dropping elements before they are processed

  • Memory-Limited Approximation

    • The total state required for all registered queries exceeds available memory.

    • The system selectively shrinks or discards synopses.
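Probabilistic load shedding amounts to a sampling operator placed in front of the expensive part of a plan. The sketch below is a generic illustration rather than STREAM's implementation; the function name and the `keep_fraction` parameter are assumptions.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def sample(stream: Iterable[T], keep_fraction: float,
           rng: Optional[random.Random] = None) -> Iterator[T]:
    """Keep each element independently with probability keep_fraction."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    rng = rng or random.Random()
    for element in stream:
        # Elements that fail the coin flip are dropped before any processing.
        if rng.random() < keep_fraction:
            yield element


kept = list(sample(range(1000), keep_fraction=0.1, rng=random.Random(0)))
print(len(kept))  # roughly 100, depending on the seed
```

Aggregates computed downstream are then approximate. A sum over the kept elements, for example, would typically be divided by `keep_fraction` to estimate the sum over the full stream.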


STREAM – Query Interface

  • View the structure of query plans and their component entities.

  • View the detailed properties of each entity.

  • Dynamically adjust entity properties.

  • View monitoring graphs that display time-varying entity properties plotted dynamically against time.

    • Queue sizes, throughput, overall memory usage, and join selectivity.



STREAM – Current Status

  • Version 1.0 up and running

  • Includes a new monitoring and adaptive query processing infrastructure – StreaMon

    • Executor runs query plans to produce results.

    • Profiler collects and maintains statistics about stream and plan characteristics.

    • Reoptimizer ensures that the plans and memory structures are the most efficient for current characteristics.

  • Web demo available at http://shark.stanford.edu:8080/

  • Future Directions:

    • Distributed Stream Processing

    • Crash Recovery

    • Improved Approximation

    • Classification of Applications


Conclusion

  • Ideal DSMS

    • Well defined and flexible query language

    • User-friendly interface

    • Scalable

      • Operator scheduling

      • Storage management

      • Synopsis sharing

      • Approximation

    • Quality assurance

    • Fault tolerant


References

  • R. Motwani et al., “Query Processing, Approximation, and Resource Management in a Data Stream Management System”, in Proceedings of the 1st CIDR Conference, 2003.

  • S. Madden et al., “Continuously Adaptive Continuous Queries over Streams”, in Proceedings of the SIGMOD Conference, 2002.

  • D. Carney et al., “Monitoring Streams - A New Class of Data Management Applications”, in Proceedings of the VLDB Conference, 2002.

  • D. Carney et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of the VLDB Conference, 2003.

  • Stanford STREAM Project Website: http://www-db.stanford.edu/stream/index.html

  • Aurora Project Website: http://www.cs.brown.edu/research/aurora


