STREAM: The Stanford Data Stream Management System. STREAM stands for STanford stREam datA Manager. http://infolab.stanford.edu/stream/ 69521001 陳盈君 69521038 吳哲維 69521040 林冠良. Outline: • Introduction • The Continuous Query Language • Query Plan and Execution • Performance Issues (for query plans)

Data Stream • A continuous, unbounded, rapid, time-varying stream of data elements. • Data streams occur in a variety of modern applications: • Network monitoring and traffic engineering • Sensor networks, RFID tags • Telecom call records • Financial analysis • Manufacturing processes

STREAM system • STREAM: the Stanford Data Stream Management System • Simple view of a traditional DBMS: first load data, then index it, then run queries • STREAM instead handles: • Continuous data streams • Continuous queries

CQL & SQL • CQL starts with SQL, then adds: • Streams as a new data type • Continuous instead of one-time semantics • Windows on streams • Sampling on streams • Operators between streams and relations

CQL: abstract semantics • The abstract semantics is based on two data types: • Streams • Relations • Both data types are defined over a discrete, ordered time domain.

Data type: stream • A stream S is an unbounded bag of pairs <s,T>, where s is a tuple and T is the timestamp that denotes the logical arrival time of tuple s on stream S. • In other words, a stream is a collection of timestamped tuples. • The element <s,T> of stream S indicates that tuple s arrives on S at time T.

Data type: relation • A relation R is a time-varying bag of tuples. • The bag of tuples at time T is denoted R(T), and we call R(T) an instantaneous relation. • For example: • At time 0, R(0)=ø • At time 1, R(1)={<50k>} • At time 2, R(2)={<100k>,<50k>} • At time 3, R(3)={<900k>,<100k>}

Three classes of operators over streams and relations • A relation-to-relation operator takes one or more relations as input and produces a relation as output. • A stream-to-relation operator takes a stream as input and produces a relation as output. • A relation-to-stream operator takes a relation as input and produces a stream as output.

Operator classes • What about stream-to-stream operators? They are absent as primitives: they are composed from operators of the above three classes. • A continuous query Q is a tree of operators belonging to the above three classes. Streams and relations are the inputs to the leaf operators, and the output of Q is the output of the root operator.

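Illustration of a continuous query tree • [Figure: a continuous query tree. Two stream inputs at the leaves each feed a stream-to-relation operator; their outputs feed a relation-to-relation operator, whose result passes through a relation-to-stream operator at the root to produce the stream output.]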

Relation-to-relation operators in CQL • CQL uses SQL constructs to express its relation-to-relation operators. • Some relation-to-relation operators: select, project, binary-join, union, except, etc.

Stream-to-relation operators in CQL • The stream-to-relation operators in CQL are based on the concept of a sliding window over a stream. • Three sliding window types: • Tuple-based sliding window • Time-based sliding window • Partitioned sliding window

Tuple-based sliding window • A tuple-based sliding window on a stream S takes an integer N > 0 as a parameter and produces a relation R. At time τ, R(τ) contains the N tuples of S with the largest timestamps ≤ τ. • It is specified by following S with “[Rows N]”. • As a special case, “[Rows unbounded]” denotes the append-only window “[Rows ∞]”.
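The “[Rows N]” definition can be sketched in a few lines of Python. This is only an illustration, not STREAM code: the representation of a stream as a list of (tuple, timestamp) pairs and the name `rows_window` are assumptions made for the example, and ties on the timestamp are broken by list order.

```python
from typing import Any, List, Tuple

# Assumed representation for this sketch: a stream element is (tuple s, timestamp).
Element = Tuple[Any, int]


def rows_window(stream: List[Element], n: int, now: int) -> List[Any]:
    """Return R(now) for S [Rows n]: the n tuples of S with the largest timestamps <= now."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # sorted() is stable, so elements with equal timestamps keep their list order.
    arrived = [s for (s, ts) in sorted(stream, key=lambda e: e[1]) if ts <= now]
    return arrived[-n:]


S = [("a", 1), ("b", 2), ("c", 2), ("d", 5)]
print(rows_window(S, 2, 2))  # ['b', 'c']
print(rows_window(S, 2, 5))  # ['c', 'd']
```

Before any tuple has arrived the window is empty, which matches R(0)=ø in the relation example earlier in the deck.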

Time-based sliding window • A time-based sliding window on a stream S takes a time interval ω as a parameter and produces a relation R. At time τ, R(τ) contains all tuples of S with timestamps between τ−ω and τ. • It is specified by following S with “[Range ω]”. • As a special case, “[Now]” denotes the window with ω=0.
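A time-based window is even simpler to sketch. As before, the list-of-pairs stream representation and the function name `range_window` are assumptions for illustration only; timestamps are taken to be integers on a discrete time domain.

```python
from typing import Any, List, Tuple

# Assumed representation for this sketch: a stream element is (tuple s, timestamp).
Element = Tuple[Any, int]


def range_window(stream: List[Element], width: int, now: int) -> List[Any]:
    """Return R(now) for S [Range width]: tuples with now - width <= timestamp <= now."""
    if width < 0:
        raise ValueError("width must be non-negative")
    return [s for (s, ts) in stream if now - width <= ts <= now]


S = [("a", 1), ("b", 2), ("c", 2), ("d", 5)]
print(range_window(S, 1, 2))  # ['a', 'b', 'c']
print(range_window(S, 0, 2))  # ['b', 'c']   (the [Now] window)
print(range_window(S, 2, 5))  # ['d']
```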

Partitioned sliding window • A partitioned sliding window on a stream S takes an integer N and a set of attributes {A1, …, Ak} of S as parameters. • It is specified by following S with “[Partition By A1,…,Ak Rows N]”.
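The slide gives only the parameters of the partitioned window. In the usual CQL definition, S is logically split into substreams by equality on A1, …, Ak, a tuple-based window of size N is computed on each substream, and the results are unioned. The sketch below follows that reading; representing tuples as dictionaries and the name `partitioned_window` are assumptions for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple

# Assumed representation: a stream element is (tuple as a dict of attributes, timestamp).
Row = Dict[str, Any]
Element = Tuple[Row, int]


def partitioned_window(stream: List[Element], key_attrs: Sequence[str],
                       n: int, now: int) -> List[Row]:
    """Return R(now) for S [Partition By key_attrs Rows n]."""
    groups: Dict[tuple, List[Row]] = defaultdict(list)
    for row, ts in sorted(stream, key=lambda e: e[1]):
        if ts <= now:
            groups[tuple(row[a] for a in key_attrs)].append(row)
    result: List[Row] = []
    for rows in groups.values():
        result.extend(rows[-n:])  # per-partition [Rows n] window
    return result


S = [({"A": 1, "B": 10}, 1), ({"A": 2, "B": 20}, 2),
     ({"A": 1, "B": 30}, 3), ({"A": 1, "B": 40}, 4)]
window = partitioned_window(S, ["A"], 2, 4)
print(sorted(r["B"] for r in window))  # [20, 30, 40]
```

The oldest tuple of partition A=1 (B=10) has dropped out, while partition A=2 keeps its single tuple.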

Relation-to-stream operators in CQL • CQL has three relation-to-stream operators: • Istream (for “insert stream”) • Dstream (for “delete stream”) • Rstream (for “relation stream”)

Istream • Istream applied to a relation R contains a stream element <s,τ> whenever tuple s is in R(τ)−R(τ−1), i.e., whenever s is inserted into R at time τ. • Assuming R(−1)=ø for notational simplicity, we have: Istream(R) = ⋃_{τ≥0} ((R(τ) − R(τ−1)) × {τ})

Dstream • Dstream applied to a relation R contains a stream element <s,τ> whenever tuple s is in R(τ−1)−R(τ), i.e., whenever s is deleted from R at time τ. • Formally: Dstream(R) = ⋃_{τ>0} ((R(τ−1) − R(τ)) × {τ})

Istream example • Suppose we have the relation R with: • At time τ, R(τ)={<23k>,<45k>,<200k>,<90k>} • At time τ−1, R(τ−1)={<45k>,<200k>,<90k>,<10k>} • Then Istream yields R(τ)−R(τ−1)={<23k>} at time τ.

Dstream example • Suppose we have the relation R with: • At time τ, R(τ)={<23k>,<45k>,<200k>,<90k>} • At time τ−1, R(τ−1)={<45k>,<200k>,<90k>,<10k>} • Then Dstream yields R(τ−1)−R(τ)={<10k>} at time τ.

Rstream • Rstream applied to a relation R contains a stream element <s,τ> whenever tuple s is in R(τ). • Formally: Rstream(R) = ⋃_{τ≥0} (R(τ) × {τ})
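The three relation-to-stream operators can be checked against the Istream and Dstream examples with a small Python sketch. Bags are modelled with `collections.Counter` so that duplicates are respected; the function names, the list representation of instantaneous relations, and the concrete value chosen for τ are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def istream(r_now: List[str], r_prev: List[str], tau: int) -> List[Tuple[str, int]]:
    """Elements <s, tau> for every s in the bag difference R(tau) - R(tau-1)."""
    inserted = Counter(r_now) - Counter(r_prev)
    return [(s, tau) for s in sorted(inserted.elements())]


def dstream(r_now: List[str], r_prev: List[str], tau: int) -> List[Tuple[str, int]]:
    """Elements <s, tau> for every s in the bag difference R(tau-1) - R(tau)."""
    deleted = Counter(r_prev) - Counter(r_now)
    return [(s, tau) for s in sorted(deleted.elements())]


def rstream(r_now: List[str], tau: int) -> List[Tuple[str, int]]:
    """Elements <s, tau> for every s in R(tau)."""
    return [(s, tau) for s in sorted(r_now)]


tau = 7  # arbitrary time instant chosen for the example
r_prev = ["45k", "200k", "90k", "10k"]   # R(tau-1)
r_now = ["23k", "45k", "200k", "90k"]    # R(tau)

print(istream(r_now, r_prev, tau))  # [('23k', 7)]
print(dstream(r_now, r_prev, tau))  # [('10k', 7)]
print(len(rstream(r_now, tau)))     # 4
```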

Example CQL queries • Example 1. A continuous query that filters a stream S: Select Istream(*) From S [Rows unbounded] Where S.A > 10 • Stream S is converted into a relation by applying an unbounded window. • The relation-to-relation filter “S.A > 10” is applied to that relation. • Tuples inserted into the filtered relation are streamed as the result (using the relation-to-stream operator Istream(*)). • Note: the query can be rewritten in the following more intuitive form: Select * From S Where S.A > 10
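The equivalence claimed in the note for Example 1 can be illustrated by evaluating the query both ways. This is a sketch under simplifying assumptions: each tuple is reduced to its attribute A (an integer), the stream is a list of (A, timestamp) pairs, and the names `query_via_relation` and `query_direct` are made up for the example.

```python
from collections import Counter
from typing import Callable, List, Tuple

Element = Tuple[int, int]  # assumed: (value of attribute A, timestamp)


def query_via_relation(stream: List[Element],
                       pred: Callable[[int], bool]) -> List[Element]:
    """Istream of the filtered relation over S [Rows unbounded]."""
    out: List[Element] = []
    prev: Counter = Counter()
    # The relation only changes at instants where something arrives.
    for tau in sorted({ts for _, ts in stream}):
        rel = Counter(a for a, ts in stream if ts <= tau and pred(a))
        out.extend((a, tau) for a in (rel - prev).elements())
        prev = rel
    return out


def query_direct(stream: List[Element],
                 pred: Callable[[int], bool]) -> List[Element]:
    """The intuitive form: filter the stream element by element."""
    return [(a, ts) for a, ts in stream if pred(a)]


S = [(5, 1), (12, 2), (30, 2), (7, 3), (12, 4)]
greater_than_10 = lambda a: a > 10
print(sorted(query_via_relation(S, greater_than_10)))  # [(12, 2), (12, 4), (30, 2)]
print(sorted(query_direct(S, greater_than_10)))        # [(12, 2), (12, 4), (30, 2)]
```

Because the unbounded window never deletes tuples, every tuple that passes the filter is inserted exactly once, which is why both forms produce the same stream.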

Example 2 • Window join example: Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10 • The answer to this query is a relation. • At any given time, the answer relation contains the join of the last 1000 tuples of S1 with the tuples of S2 that have arrived in the previous 2 minutes. • To get a stream result instead, replace “*” with “Istream(S1.A)”.

Exercise (example 3) • We have a stored table R and a stream S, and we want a stream result containing the tuples whose attribute A in R matches attribute A in S. We are interested only in attribute A of S and attribute B of R. • Answer: Select Rstream(S.A, R.B) From S [Now], R Where S.A = R.A

Introduction • When a continuous query specified in CQL is registered with the STREAM system, a query plan is compiled from it. • Query plans are composed of operators, which perform the actual processing; queues, which buffer tuples as they move between operators; and synopses, which store operator state.

Operators • Each query plan operator reads from one or more input queues, processes the input based on its semantics, and writes its output to an output queue.

Queues • A queue in a query plan connects its “producing” plan operator Op to its “consuming” operator Oc. • A queue logically contains sequences of elements representing either streams or relations. • Many of the operators in the STREAM system require that elements on their input queues be read in non-decreasing timestamp order. • For example, a stream-to-relation (sliding window) operator has this requirement.

Synopses • Logically, a synopsis belongs to a specific plan operator, storing state that may be required for future evaluation of that operator. • Operators differ in the number of synopses they need: for example, binary-join has 2 synopses and select has none. • Synopses store summaries of tuples. • Synopses can be shared.

Query Plan • When a CQL query is registered, STREAM constructs a query plan: a tree of operators, connected by queues, with synopses attached to operators as needed. • Example: Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10

Query plan execution • A flag “+” or “−” is added to each pair <s,τ>, so that it becomes <s,τ,+> or <s,τ,−>. • “+” means insert; “−” means delete. • How is the flag used? We show this by executing the query plan example: Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10

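Query plan execution • [Figure: the [Rows 1000] window on S1 at time τ and at time τ+1. At time τ it holds tuples s1 … s1000 with timestamps t1 … tn. At time τ+1, s1001 is inserted and s1 is deleted, so the window operator outputs a delete element for s1 (flag “−”) and the insert element <s1001,τ+1,+> to queue q3.]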
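The way a “[Rows N]” window operator turns arrivals into “+” and “−” elements can be sketched as a Python generator. This is an illustration rather than STREAM's implementation: the name `rows_window_operator` is made up, the input is assumed to arrive in non-decreasing timestamp order, and stamping the delete element with the timestamp of the arrival that caused it is an assumption of this sketch.

```python
from collections import deque
from typing import Any, Iterable, Iterator, Tuple

Element = Tuple[Any, int]        # assumed input: (tuple s, timestamp)
Flagged = Tuple[Any, int, str]   # output: (tuple s, timestamp, "+" or "-")


def rows_window_operator(stream: Iterable[Element], n: int) -> Iterator[Flagged]:
    """Emit insert/delete elements describing how S [Rows n] changes over time."""
    window: deque = deque()  # plays the role of the operator's synopsis
    for s, ts in stream:
        if len(window) == n:
            oldest = window.popleft()
            yield (oldest, ts, "-")   # oldest tuple leaves the window
        window.append(s)
        yield (s, ts, "+")            # new tuple enters the window


S1 = [("s1", 1), ("s2", 2), ("s3", 3)]
for element in rows_window_operator(S1, 2):
    print(element)
# ('s1', 1, '+')
# ('s2', 2, '+')
# ('s1', 3, '-')
# ('s3', 3, '+')
```

With N = 1000, the arrival of s1001 would produce the delete for s1 and the insert for s1001 in the same way.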

Performance Issue • Simply generating the straightforward query plans and executing them as described can be very inefficient. • Ideas: • Eliminating data redundancy • Selectively discarding data that will not be used • Scheduling operators to most efficiently reduce intermediate state • All of these aim to reduce memory overhead.

Synopsis sharing (1/2) • Synopsis sharing falls into two classes: sharing within a single query plan and sharing across multiple query plans. • Purpose: • To eliminate data redundancy, we replace the synopses with lightweight stubs and a single store that holds the actual tuples. (Figure 4.2) • Elements: • Stubs implement the same interfaces as synopses. • A single synopsis store can present different views of the data to different operators. • Store responsibilities: • Tracking the progress of each stub • Presenting the appropriate view (subset of tuples) to each stub

Synopsis sharing (2/2) • A tuple is inserted into the store as soon as it is inserted by any one of the stubs, and it is removed only when it has been removed from all of the stubs. • To decrease state redundancy, multiple query plans involving similar intermediate relations can share synopses as well. • Example: • Select A, Max(B) From S1 [Rows 200] Group By A (Fig. 3) • Idea: S1 [Rows 200] is a subset of S1 [Rows 1000], so both windows can be served from one shared store.
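The insert-on-first, remove-on-last rule for a shared store can be sketched as follows. This is a deliberately simplified model, not STREAM's design: the class names `SynopsisStore` and `SynopsisStub` are hypothetical, and the store here keeps an explicit set of holders per tuple instead of tracking each stub's progress. Two small windows ([Rows 3] and [Rows 2]) stand in for [Rows 1000] and [Rows 200].

```python
from typing import Any, Dict, Hashable, List, Set


class SynopsisStore:
    """Single store holding the actual tuples for several stubs (hypothetical sketch)."""

    def __init__(self) -> None:
        self._tuples: Dict[Hashable, Any] = {}        # key -> tuple
        self._holders: Dict[Hashable, Set[str]] = {}  # key -> stubs whose view has it

    def stub(self, name: str) -> "SynopsisStub":
        return SynopsisStub(self, name)

    def __len__(self) -> int:
        return len(self._tuples)


class SynopsisStub:
    """Lightweight view that offers insert/delete/scan like a private synopsis."""

    def __init__(self, store: SynopsisStore, name: str) -> None:
        self._store = store
        self._name = name

    def insert(self, key: Hashable, tup: Any) -> None:
        # Stored once, on the first insert by any stub.
        self._store._tuples.setdefault(key, tup)
        self._store._holders.setdefault(key, set()).add(self._name)

    def delete(self, key: Hashable) -> None:
        holders = self._store._holders[key]
        holders.discard(self._name)
        if not holders:  # removed from all stubs: drop the tuple itself
            del self._store._holders[key]
            del self._store._tuples[key]

    def scan(self) -> List[Any]:
        return [t for k, t in self._store._tuples.items()
                if self._name in self._store._holders[k]]


store = SynopsisStore()
big, small = store.stub("rows3"), store.stub("rows2")
for i in range(1, 5):
    big.insert(i, f"s{i}")
    small.insert(i, f"s{i}")
    if i > 2:
        small.delete(i - 2)  # [Rows 2] expires its oldest tuple
    if i > 3:
        big.delete(i - 3)    # [Rows 3] expires its oldest tuple

print(small.scan())  # ['s3', 's4']
print(big.scan())    # ['s2', 's3', 's4']
print(len(store))    # 3
```

Private synopses would hold five tuples in total at this point; the shared store holds three.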

Exploiting constraints (1/2) • Streams may exhibit data or arrival patterns that can be exploited to reduce run-time synopsis sizes. • Data constraints can either be specified at stream-registration time or inferred by gathering statistics over time. • Example: a continuous query that joins a stream Orders with a stream Fulfillments based on attributes orderID and itemID, perhaps to monitor average fulfillment delays.

Exploiting constraints (2/2) • Idea: • In the general case, answering this query precisely requires synopses of unbounded size. However, if we know that all elements for a given orderID and itemID arrive on Orders before the corresponding elements arrive on Fulfillments, then we need not maintain a join synopsis for the Fulfillments operand at all. • If Fulfillments elements arrive clustered by orderID, then we need only save Orders tuples for a given orderID until the next orderID is seen.
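Both constraints can be combined in a small sketch of the Orders/Fulfillments join. It assumes exactly the two properties stated above (orders arrive before their fulfillments, and fulfillments arrive clustered by orderID); the event format, the function name `constrained_join`, and the `peak` counter are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed event format: (kind, order_id, item_id, timestamp), in arrival order.
Event = Tuple[str, int, str, int]


def constrained_join(events: Iterable[Event]) -> Tuple[List[Tuple[int, str, int]], int]:
    """Join Fulfillments with Orders, keeping only an Orders synopsis.

    Returns (results, peak), where each result is (order_id, item_id, delay)
    and peak is the largest number of Orders tuples held at once.
    """
    orders: Dict[int, Dict[str, int]] = {}  # order_id -> {item_id: order timestamp}
    current: Optional[int] = None           # order_id of the current Fulfillments cluster
    results: List[Tuple[int, str, int]] = []
    peak = 0
    for kind, oid, item, ts in events:
        if kind == "order":
            orders.setdefault(oid, {})[item] = ts
        else:
            if current is not None and oid != current:
                orders.pop(current, None)   # previous cluster is over: discard its orders
            current = oid
            placed = orders.get(oid, {}).get(item)
            if placed is not None:
                results.append((oid, item, ts - placed))
        peak = max(peak, sum(len(items) for items in orders.values()))
    return results, peak


events = [
    ("order", 1, "a", 0), ("order", 1, "b", 1), ("order", 2, "a", 2),
    ("fulfillment", 1, "a", 5), ("fulfillment", 1, "b", 6), ("fulfillment", 2, "a", 9),
]
results, peak = constrained_join(events)
print(results)  # [(1, 'a', 5), (1, 'b', 5), (2, 'a', 7)]
print(peak)     # 3
```

No Fulfillments tuples are ever stored, and the Orders tuples for orderID 1 are dropped as soon as the first fulfillment for orderID 2 is seen.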

Operator scheduling (1/6) • An operator consumes elements from its input queues and produces elements on its output queue. • The global operator scheduling policy can have a large effect on memory utilization. • Two scheduling strategies: • FIFO scheduling • Greedy scheduling • FIFO scheduling: each accumulated batch of n elements is passed through both operators of the plan in two consecutive time units, during which no other element is processed.

Operator scheduling (2/6) • Greedy scheduling: gives preference to the operator that has the greatest rate of reduction in total queue size per unit time. • Example: a query plan has two operators, O1 followed by O2. O1 takes one time unit to process a batch of n elements and outputs 0.2n elements per input batch (i.e., its selectivity is 0.2). O2 takes one time unit to process 0.2n elements and sends its output out of the system (i.e., its selectivity is 0). • Consider the following arrival pattern: n elements arrive at every time instant from t = 0 to t = 6, then no elements arrive from time t = 7 through t = 13.

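Operator scheduling (3/6) • Total queue size (in units of n) just after each arrival, from t = 0 to t = 6: • Both strategies start with 1 at t = 0 and reach (1 − 1 + 0.2) + 1 = 1.2 at t = 1, since O1 runs first and a new batch then arrives. • Greedy keeps running O1: (1.2 − 1 + 0.2) + 1 = 1.4 at t = 2, and so on. • FIFO runs O2 next: 1.2 − 0.2 + 1 = 2.0 at t = 2, then O1 again: (2 − 1 + 0.2) + 1 = 2.2 at t = 3. • Result: the greedy strategy performs better because it runs O1 whenever it has input, reducing the queue size by 0.8n elements each time step.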
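The two-operator example can be replayed with a short simulation. It is a sketch built only from the parameters stated in the example (one batch of n arrives per time instant, O1 has selectivity 0.2 and takes one time unit per batch, O2 clears 0.2n per time unit); the function name `queue_sizes` and the use of exact fractions are choices made for illustration. Sizes are reported in units of n, measured just after each arrival.

```python
from fractions import Fraction
from typing import List


def queue_sizes(policy: str, batches: int = 7,
                selectivity: Fraction = Fraction(1, 5)) -> List[float]:
    """Total queue size (in units of n) just after each arrival at t = 0 .. batches-1."""
    if policy not in ("fifo", "greedy"):
        raise ValueError("policy must be 'fifo' or 'greedy'")
    q1 = Fraction(0)   # elements waiting for O1
    q2 = Fraction(0)   # elements waiting for O2
    o2_turn = False    # FIFO only: the batch just run through O1 must go through O2 next
    sizes: List[float] = []
    for _ in range(batches):
        q1 += 1                        # one batch of n elements arrives
        sizes.append(float(q1 + q2))
        if policy == "greedy":
            run_o1 = q1 >= 1           # O1 removes 0.8n per time unit, O2 only 0.2n
        else:
            run_o1 = q1 >= 1 and not o2_turn
        if run_o1:
            q1 -= 1
            q2 += selectivity
            o2_turn = True
        elif q2 > 0:
            q2 -= min(q2, selectivity)
            o2_turn = False
    return sizes


print(queue_sizes("greedy"))  # [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2]
print(queue_sizes("fifo"))    # [1.0, 1.2, 2.0, 2.2, 3.0, 3.2, 4.0]
```

By t = 6 the FIFO policy is holding 4n elements while the greedy policy holds 2.2n, which is the memory gap the slide points to.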