
Analysis of: Operator Scheduling in a Data Stream Manager


Presentation Transcript


  1. Analysis of: Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom

  2. Agenda
• Overview of Stream Processing
• Aurora Project Goals
• Aurora Processing Example
• Aurora Architecture
• Multi-Thread vs. Single-Thread Processing
• Important Definitions
• Superbox Scheduling and Processing
• Tuple Batching
• Experimental Evaluation
• Quality of Service (QoS) Scheduling
• QoS Scheduling Scalability
• Related Work

  3. Overview of Stream Processing
Stream processing is the processing of potentially unbounded, continuous streams of data.
• Data streams are created by micro-sensors, GPS devices, and monitoring devices
• Examples include: soldier location tracking, traffic sensors, stock market exchanges, heart monitors
• Data may be received evenly or in bursts

  4. Aurora Project Goals
• To build a data stream manager that addresses the performance and processing requirements of stream-based applications
• To support multiple concurrent continuous queries on one or more application data streams
• To use Quality-of-Service (QoS) based criteria to make resource allocation decisions

  5. Aurora Processing Example
[Diagram: input data streams flow through a network of operator boxes, serving continuous and ad hoc queries, drawing on historical storage, and producing output to applications]

  6. Aurora Architecture
[Diagram: inputs pass through a router to box processors (B1–B4); the scheduler, buffer manager, catalogs, persistent store, load shedder, and QoS monitor coordinate processing before results are routed to outputs]

  7. Multi-Thread vs. Single-Thread Processing
• Multi-Thread Processing
• Each query is processed in its own thread
• The operating system manages resource allocation
• Advantages
• Processing can take advantage of efficient operating system algorithms
• Easier to program
• Disadvantages
• Software has limited control of resource management
• Additional overhead due to cache misses, lock contention and context switching

  8. Multi-Thread vs. Single-Thread Processing
• Single-Thread Processing
• All operations are processed within a single thread
• All resource allocation decisions are made by the scheduler
• Advantages
• Allows processing to be scheduled based on latency and other Quality of Service factors driven by query needs
• Avoids the limitations of multi-thread processing
• Disadvantages
• More complex to program
• Aurora has chosen to implement a single-threaded scheduling model
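The single-threaded model above can be sketched as a scheduler loop that owns every operator and decides execution order itself, rather than delegating to the OS. This is a minimal illustrative sketch, not Aurora's actual code; the `Operator` class and the static-priority heap are assumptions made for the example.

```python
import heapq

class Operator:
    """A query operator (box) with an input queue of pending tuples."""
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority   # lower value = scheduled sooner (assumption)
        self.queue = []            # tuples waiting to be processed

    def process(self, batch):
        # Placeholder for the box's real work (filter, map, join, ...)
        return batch

class SingleThreadScheduler:
    """All boxes run in one thread; the scheduler, not the OS,
    decides execution order -- here via a simple priority heap."""
    def __init__(self, operators):
        self._heap = [(op.priority, i, op) for i, op in enumerate(operators)]
        heapq.heapify(self._heap)

    def run_once(self):
        """Run the highest-priority box on its queued tuples."""
        prio, i, op = heapq.heappop(self._heap)
        batch, op.queue = op.queue, []
        out = op.process(batch)
        heapq.heappush(self._heap, (prio, i, op))  # box stays schedulable
        return op.name, out
```

In a real system the priorities would be recomputed from runtime statistics on each pass; here they are fixed only to keep the sketch short.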

  9. Important Definitions
• Quality of Service (QoS) – Specific requirements that represent the needs of a specific query. In Aurora, the primary QoS factor is latency.
• Query Tree – The set of operators (boxes) and data streams that represent a query.
• Superbox – A sequence of operators that are scheduled and executed as an atomic group. Aurora treats each query as a separate superbox.
• Two-Level Scheduling – Scheduling is done at two levels: first at the superbox level (deciding which superbox to process), and second deciding what order to execute the operators within the superbox once one is selected.

  10. Important Definitions (Cont.)
• Scheduling Plan – The combination of dynamic superbox scheduling and algorithm-based operator execution order within the superbox.
• Application-at-a-time (AAAT) – A term used in Aurora that statically defines each query (application) as a superbox.
• Box-at-a-time (BAAT) – Scheduling at the box level rather than the superbox level.
• Static and dynamic scheduling approaches – Static approaches to scheduling are defined prior to runtime. Dynamic scheduling approaches use runtime information and statistics to adjust and prioritize scheduling order during execution.
• Traversing a superbox – Deciding how the operators within a superbox should be scheduled and executed.
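The two-level scheme in the definitions above can be captured in a few lines: one policy picks the superbox, another orders the boxes inside it. This is an illustrative sketch, not Aurora's API; the policy functions passed in are hypothetical stand-ins.

```python
def two_level_schedule(superboxes, pick_superbox, order_operators):
    """Two-level scheduling: level one picks which superbox (query) runs
    next; level two picks the execution order of the boxes inside it.
    Both policies are supplied as functions (illustrative only)."""
    chosen = pick_superbox(superboxes)   # dynamic: would use runtime statistics
    plan = order_operators(chosen)       # e.g. an MC/ML/MM traversal
    return chosen, plan

# Toy demo: pick the superbox with the most boxes, then run its
# boxes upstream-first (modeled here as simply reversed order).
superboxes = {"query1": ["B1", "B2"], "query2": ["B3"]}
chosen, plan = two_level_schedule(
    superboxes,
    pick_superbox=lambda sbs: max(sbs, key=lambda k: len(sbs[k])),
    order_operators=lambda name: list(reversed(superboxes[name])),
)
```

Separating the two decisions is what lets Aurora mix a dynamic level-one policy with a statically chosen traversal at level two.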

  11. Non-Superbox Processing
[Diagram: a query tree whose boxes are numbered 1–16 in box-at-a-time execution order]

  12. Superbox Processing
[Diagram: three query trees scheduled as superboxes, with boxes labeled A1–A5, B1–B6 and C1–C5 executed as atomic groups]

  13. Superbox Traversal
Superbox traversal refers to how the operators within a superbox should be executed.
• Min-Cost (MC) – Attempts to optimize per-output-tuple processing costs by minimizing the number of operator calls per output tuple
• Min-Latency (ML) – Attempts to produce initial tuples as soon as possible
• Min-Memory (MM) – Attempts to minimize memory usage

  14. Superbox Traversal Processing
[Diagram: query tree with output box B1; B2 and B6 feed B1, B3 and B4 feed B2, B5 feeds B3]
• Min-Cost (MC): B4 > B5 > B3 > B2 > B6 > B1
• Min-Latency (ML): B1 > B2 > B1 > B6 > B1 > B4 > B2 > B1 > B3 > B2 > B1 > B5 > B3 > B2 > B1
• Min-Memory (MM): B3 > B6 > B2 > B5 > B3 > B2 > B1 > B4 > B2 > B1

  15. Tuple Batching (Train Processing)
• A tuple train is a batch of tuples executed within a single operator call.
• The goal of tuple train processing is to reduce the overall processing cost per tuple.
• Advantages of tuple train processing:
• Decreased number of total operator executions
• Reduced low-level overhead such as context switching, scheduling, memory management and execution queue maintenance
• Some windowing and merge-join operators work more efficiently when batching tuples
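The batching idea can be shown in a short sketch: instead of one operator call per tuple, the queue is drained in trains. This is an illustration of the technique, not Aurora's implementation; `run_with_trains` and its `train_size` parameter are invented for the example.

```python
def run_with_trains(operator, input_queue, train_size=100):
    """Tuple-train processing: drain up to train_size tuples from the
    queue and hand them to the operator in a single call, amortizing
    per-call overhead (scheduling, context switches, queue bookkeeping)."""
    outputs = []
    while input_queue:
        train = input_queue[:train_size]    # one train of tuples
        del input_queue[:train_size]
        outputs.extend(operator(train))     # one operator call per train
    return outputs
```

With `train_size=1` this degenerates to per-tuple processing, which makes the overhead saving easy to see: the number of operator calls drops by roughly the train size.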

  16. Experimental Evaluation Definitions
• Stream-based applications do not currently have a standardized benchmark
• Aurora modeled queries as a rooted tree structure from a stream input box to an application output box
• Trees are categorized by depth and fan-out
• Depth is the number of box levels from input to output
• Fan-out is the average number of children of each box
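The two tree metrics above are easy to compute; a small sketch, using one reasonable reading of the slide's fan-out definition (average over boxes that have children):

```python
def tree_stats(children, root):
    """Depth: number of box levels from input to output.
    Fan-out: average number of children per non-leaf box
    (one plausible reading of the slide's definition)."""
    def depth(box):
        kids = children.get(box, [])
        return 1 + (max(depth(k) for k in kids) if kids else 0)
    parents = [b for b in children if children[b]]
    fan_out = sum(len(children[b]) for b in parents) / len(parents) if parents else 0.0
    return depth(root), fan_out
```

Categorizing generated query trees by these two numbers lets the experiments vary "tall and skinny" versus "short and bushy" workloads independently.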

  17. Experimental Evaluation Results
• At low volumes, Round-Robin Box-At-A-Time (RR-BAAT) scheduling was almost as efficient as Minimum-Cost Application-At-A-Time (MC-AAAT), but it was much less efficient at higher volumes
• At low volumes, the efficiencies of MC-AAAT were offset by its more complex scheduling overhead
• As volumes increased, the efficiencies of MC-AAAT became more apparent as scheduling overhead became a lower percentage of total processing
• Experimentation was also done to compare ML, MC and MM scheduling techniques
• As expected, each technique minimized its target attribute (latency, cost and memory respectively)
• However, at very low processing levels the simplest algorithms tended to do best (but who cares :)

  18. Quality of Service (QoS) Scheduling
• Definitions
• Utility – how useful a tuple will be when it exits the query
• Urgency – represented by the angle of the downward slope of the latency-utility QoS graph; in other words, how fast the utility deteriorates
• Approach
• Keep track of the latency of tuples that reside in the queues, and pick tuples for processing whose execution will provide the highest aggregate QoS delivered to the applications
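The utility and urgency notions above come from a piecewise-linear latency-utility graph. A sketch of evaluating such a graph, assuming it is given as a sorted list of critical points (an assumption for this example, not Aurora's data structure):

```python
def utility(latency, critical_points):
    """Evaluate a piecewise-linear latency-utility QoS graph given as
    (latency, utility) critical points sorted by latency. The downward
    slope of the current segment is the 'urgency'."""
    pts = critical_points
    if latency <= pts[0][0]:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if latency <= x1:
            # linear interpolation between two critical points
            return y0 + (y1 - y0) * (latency - x0) / (x1 - x0)
    return pts[-1][1]   # past the last critical point

# Example graph: full utility up to latency 10, then linear decay to zero at 20.
qos = [(0, 1.0), (10, 1.0), (20, 0.0)]
```

A scheduler following the slide's approach would evaluate each query's graph at the current tuple latencies and run whichever box contributes the most aggregate utility.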

  19. Latency-Utility Relationship
[Chart: QoS graph with quality of service (0 to 1) on the y-axis, latency on the x-axis, and critical points where the slope changes]
The older the data gets, the less it is worth, and the lower the quality of service.
Aurora combines the QoS charts of each query being executed with the average latency of the tuples in each box to decide which superbox to execute next. The idea is to maintain, on average, the highest quality of service.

  20. QoS Scheduling Scalability
• Problem
• A per-tuple approach to QoS-based scheduling will not scale because of the amount of processing needed to maintain it
• Solution
• Latency is not calculated at the tuple level; rather, it is calculated as the average latency of tuples in the box input queue
• Priority is assigned based on the combination of utility and urgency
• Once a box's priority (priority tuple, or "p-tuple") is calculated, the boxes are placed in logical buckets based on their priority value
• Scheduling is then done based on the priority of the bucket
• All boxes in a given bucket are considered equal
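The bucketing step above can be sketched in a few lines. This is an illustration of the idea, not Aurora's code; in particular, the assumption that p-tuple values are normalized to [0, 1) is made only for this example.

```python
def bucketize(p_tuples, num_buckets=10):
    """Place boxes into coarse priority buckets so the scheduler only
    compares bucket priorities, not individual boxes. Assumes each
    p-tuple value is normalized to [0, 1), higher meaning more urgent."""
    buckets = [[] for _ in range(num_buckets)]
    for box, p in p_tuples.items():
        idx = min(int(p * num_buckets), num_buckets - 1)
        buckets[idx].append(box)   # boxes in one bucket are treated as equal
    return buckets
```

Because only bucket priorities are compared, the scheduling cost stops growing with the number of tuples and grows only with the (small, fixed) number of buckets.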

  21. Related Work
• Eddies – has a tuple-at-a-time scheduler providing adaptability, but does not scale well
• Urhan – works on rate-based pipeline scheduling of data between operators
• NiagaraCQ – query optimization for streaming data from wide-area information sources
• STREAM – provides comprehensive data stream management using chain scheduling algorithms
• Note that none of the above projects have a notion of QoS
