
Freddies: DHT-Based Adaptive Query Processing via Federated Eddies


Presentation Transcript


1. Freddies: DHT-Based Adaptive Query Processing via Federated Eddies
Ryan Huebsch, Shawn Jeffery
CS 294-4: Peer-to-Peer Systems, 12/9/03

2. Outline
• Background: PIER
• Motivation: Adaptive Query Processing (Eddies)
• Federated Eddies (Freddies)
  • System Model
  • Routing Policies
  • Implementation
• Experimental Results
• Conclusions and Continuing Work

3. PIER
• A fully decentralized relational query processing engine
• Principles:
  • Relaxed consistency
  • Organic scaling
  • Data in its natural habitat
  • Standard schemas via grassroots software
• Relational queries can be executed in a number of logically equivalent ways
  • An optimization step chooses the best one performance-wise
• Currently, PIER has no means to optimize queries

4. Adaptive Query Processing
• Traditional query optimization occurs at query time and is based on statistics. This is hard because:
  • The catalog (statistics) must be accurate and maintained
  • The optimizer cannot recover from poor choices
• The story gets worse!
  • Long-running queries:
    • Changing selectivities/costs of operators
    • Assumptions made at query time may no longer hold
  • Federated/autonomous data sources:
    • No control over, or knowledge of, statistics
  • Heterogeneous data sources:
    • Different arrival rates
• Thus, adaptive query processing systems attempt to change the execution order during the query
  • Query Scrambling, Tukwila, Wisconsin, Eddies

5. Eddies
• Eddy: a tuple router that dynamically chooses the order of operators in a query plan (a minimal sketch of this loop follows)
• Optimizes the query at runtime on a per-tuple basis
• Monitors the selectivities and costs of operators to determine where to send a tuple next
• Currently centralized in design and implementation
• There are other efforts on distributed Eddies from Wisconsin and Singapore (neither uses a DHT)
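To make the idea concrete, here is a minimal sketch of a centralized eddy's per-tuple routing loop. This is our illustration, not Telegraph's actual code; the Operator and Tuple types and the naive pickNext policy are all assumed names:

```java
// Minimal sketch of a centralized eddy's per-tuple routing loop.
// All names here (Operator, Tuple, pickNext) are illustrative.
import java.util.List;

interface Operator {
    // Returns null if the tuple is dropped (e.g., fails a selection or finds no match).
    Tuple apply(Tuple t);
}

class Tuple {
    long doneBits;     // bit i set => operator i has already processed this tuple
    Object[] fields;
}

class Eddy {
    private final List<Operator> ops;

    Eddy(List<Operator> ops) { this.ops = ops; }

    // Route one tuple until every operator has seen it or it is dropped.
    Tuple route(Tuple t) {
        long allDone = (1L << ops.size()) - 1;
        while (t != null && t.doneBits != allDone) {
            int next = pickNext(t);              // the routing policy decides the order
            t = ops.get(next).apply(t);
            if (t != null) t.doneBits |= 1L << next;
        }
        return t;                                // fully processed tuple, or null if dropped
    }

    // Naive policy: the first operator whose done bit is unset. A real eddy
    // weighs observed selectivities and costs (e.g., via lottery scheduling) here.
    private int pickNext(Tuple t) {
        for (int i = 0; i < ops.size(); i++)
            if ((t.doneBits & (1L << i)) == 0) return i;
        throw new IllegalStateException("tuple already fully processed");
    }
}
```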

6. Why use Eddies in P2P? (The easy answers)
• Much of the promise of P2P lies in its fully distributed nature
  • No central point of synchronization → no central catalog
  • A distributed catalog with statistics helps, but does not solve all problems
    • Possibly stale, hard to maintain
    • Need CAP to do the best optimization
  • No knowledge of available resources or the current state of the system (load, etc.)
  • This is the PIER philosophy!
• Eddies were designed for a federated query processor
  • Changing operator selectivities and costs
  • Federated/heterogeneous data sources

7. Why Eddies in P2P? (The not-so-obvious answers)
• Available compute resources in a P2P network are heterogeneous and dynamically changing
  • Where should the query be processed?
• In a large P2P system, local data distributions, arrival rates, etc. may differ from the global ones

8. Freddies: Federated Eddies
• A Freddy is an adaptive query processing operator within the PIER framework
• Goals:
  • Show feasibility of adaptive query processing in PIER
  • Build foundation and infrastructure for smarter adaptive query processing
  • Establish a baseline for Freddy performance to improve upon with smarter routing policies

9. An Example Freddy
[Diagram: tuples from R, S, and T arrive from the DHT via Get(R), Get(S), and Get(T) and flow into the Freddy, which routes them among the local operators R join S and S join T; rehashes go back to the DHT via Put(join value RS) and Put(join value ST), and finished tuples go to the output.]

10. System Model
• Same functionality as a centralized Eddy
  • Allows easy concept reuse
• The Freddy uses its routing policy to determine the next operator for a tuple
• Tuples in a Freddy are tagged with DoneBits indicating which operators have processed them
• The Freddy does all state management, so existing operators require no modifications
• Local processing comes first (in most cases)
  • Conserves network bandwidth
• Not as simple as it seems
  • The Freddy must decide how to rehash a tuple
  • This determines the join order
• Challenge: decoupling of the routing decision from the operator; most Eddy techniques are no longer valid

11. Query Processing in Freddies
• The query origin creates a query plan with a Freddy
  • The possible routings are determined at this time, but not their order
• Freddy operators on all participating nodes initiate data flow
• As tuples arrive, the Freddy determines the next operator for each tuple based on its DoneBits and the routing policy (a minimal sketch follows this slide)
• Source tuples are tagged with clean DoneBits and routed appropriately
• When all DoneBits are set, the tuple is sent to the output operator (returned to the query origin)
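A minimal sketch of this flow under our reading of the slides: in a Freddy, routing a tuple "to an operator" largely means choosing which join key to rehash (Put) it under, since the chosen Put fixes the join order. All types below (Dht, JoinSpec, RoutingPolicy, FreddyTuple) are illustrative names, not PIER's actual interfaces:

```java
// Sketch: the Freddy routing decision is which join to rehash a tuple under next.
import java.util.function.Function;

class FreddyTuple {
    long doneBits;     // bit i set => join i has already handled this tuple
    Object[] fields;
}

class JoinSpec {
    final String table;                          // DHT namespace holding this join's state
    final Function<FreddyTuple, String> keyOf;   // extracts the join key from a tuple
    JoinSpec(String table, Function<FreddyTuple, String> keyOf) {
        this.table = table;
        this.keyOf = keyOf;
    }
}

interface Dht { void put(String table, String key, FreddyTuple t); }

interface RoutingPolicy {
    int nextOp(FreddyTuple t, JoinSpec[] joins);   // pick a join whose done bit is unset
}

class Freddy {
    private final JoinSpec[] joins;   // e.g., {R join S on value RS, S join T on value ST}
    private final RoutingPolicy policy;
    private final Dht dht;

    Freddy(JoinSpec[] joins, RoutingPolicy policy, Dht dht) {
        this.joins = joins;
        this.policy = policy;
        this.dht = dht;
    }

    // Called for every tuple that arrives at this node.
    void route(FreddyTuple t) {
        if (t.doneBits == (1L << joins.length) - 1) {
            returnToQueryOrigin(t);                  // all DoneBits set => output
            return;
        }
        int next = policy.nextOp(t, joins);          // the routing decision
        t.doneBits |= 1L << next;                    // mark: sent toward join `next`
        JoinSpec j = joins[next];
        dht.put(j.table, j.keyOf.apply(t), t);       // the rehash fixes the join order
    }

    private void returnToQueryOrigin(FreddyTuple t) { /* ship to the query's root node */ }
}
```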

12. Tuple Routing Policy
• Determines to which operator to send a tuple
• Local information
  • Messages are expensive
  • Monitor local usage and adjust locally
• "Processing buddy" information
  • During processing, discover general trends in input/output nodes' processing capabilities, output rates, etc.
  • For instance, we may want to alert the previous Freddy to poor PUT decisions
• The design space is huge → a large research area

13. Freddy Routing Policies
• Simple (KISS):
  • Static
  • Random: not as bad as you may think
  • Local statistics monitoring (sampling)
• More complex:
  • Queue lengths (see the sketch after this slide)
    • Somewhat analogous to the "back-pressure" effect
    • Monitors DHT PUT ACKs
  • Load balancing through "learning" of the global join key distribution
    • Piggyback stats on other messages
    • Don't need global information, only stats about processing buddies (nodes with which we communicate)
    • A different sample than the local one – may or may not be better
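As one concrete point in this design space, the queue-length idea can be sketched on top of the RoutingPolicy interface from the previous example: count outstanding (un-ACKed) DHT PUTs per join and route each tuple toward the least-backlogged one. Again, this is our illustration, not the actual EddyQL code:

```java
// Sketch of a queue-length ("back-pressure") policy: route to the join whose
// DHT Puts have the fewest outstanding ACKs. Illustrative, not PIER's code.
import java.util.concurrent.atomic.AtomicInteger;

class QueueLengthPolicy implements RoutingPolicy {
    private final AtomicInteger[] outstanding;   // un-ACKed Puts per join operator

    public QueueLengthPolicy(int numJoins) {
        outstanding = new AtomicInteger[numJoins];
        for (int i = 0; i < numJoins; i++) outstanding[i] = new AtomicInteger();
    }

    // Hook these into the DHT layer: called when a Put is issued / when its ACK arrives.
    void onPutSent(int join) { outstanding[join].incrementAndGet(); }
    void onPutAcked(int join) { outstanding[join].decrementAndGet(); }

    @Override
    public int nextOp(FreddyTuple t, JoinSpec[] joins) {
        int best = -1;
        int bestLen = Integer.MAX_VALUE;
        for (int i = 0; i < joins.length; i++) {
            if ((t.doneBits & (1L << i)) != 0) continue;   // this join is already done
            int len = outstanding[i].get();
            if (len < bestLen) { bestLen = len; best = i; }
        }
        return best;   // shortest outstanding-ACK queue => least back-pressure
    }
}
```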

14. Implementation & Experimental Setup
• Design decisions:
  • Simplicity is key
    • Roughly 300 NCSS (PIER is about 5300)
  • A single query processing operator
  • A separate routing policy module is loaded at query time (see the sketch below)
  • Possible routing orders are determined by a simple optimizer
• Required generalizations to the PIER execution engine to deal with generic operators
  • Allows PIER to run any dataflow operator
• Simulator with 256 nodes, 100 tuples/table/node
  • Feasibility, not scalability
• In the absence of global (or stale) knowledge, a static optimizer could choose any join ordering → we compare Freddy performance to all possible static plans
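One way to read "separate routing policy module loaded at query time" is a by-name lookup when the query plan arrives at each node. The reflective convention below (a class name carried in the plan, a single int-argument constructor) is a hypothetical sketch, not necessarily how PIER does it:

```java
// Hypothetical loader for a routing policy named in the query plan.
// The (int) constructor convention and the class-name scheme are assumptions.
class PolicyLoader {
    static RoutingPolicy load(String className, int numJoins) {
        try {
            Class<?> c = Class.forName(className);
            return (RoutingPolicy) c.getConstructor(int.class).newInstance(numJoins);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("unknown routing policy: " + className, e);
        }
    }
}

// Usage: each participating node instantiates the policy the plan names, e.g.
//   RoutingPolicy p = PolicyLoader.load("QueueLengthPolicy", joins.length);
```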

15. 3-Way Join
• R join S join T
• R join S is highly selective (drops 90% of tuples)
• S join T is expensive (multiplies the tuple count by 25)
• Possible static join orderings: [diagram of the two static join trees, (R join S) join T and (S join T) join R]
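To see why the ordering matters here, a rough cardinality estimate (our arithmetic, assuming roughly N source tuples per table as in the simulator setup): routing through R join S first leaves about 0.1N tuples, so the expensive join then produces 0.1N × 25 = 2.5N results; routing through S join T first materializes 25N intermediate tuples before the selective join prunes them to the same 2.5N, shipping roughly 250 times as much intermediate state through the DHT along the way.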

16. 3-Way Join Results
[Results chart not reproduced in the transcript.]

17. 4-Way Join
• R join S join T join U
• S join T is still expensive
• Possible static join orderings: [diagram of the candidate static join trees over R, S, T, and U; one plan carries the note "A traditional optimizer can't make this plan"]

18. 4-Way Join
[Results chart not reproduced in the transcript.]

19. The Promise of Routing Policy
• An illustrative example of how routing policy can improve performance
• This is not meant to be an exhaustive comparison of policies, but rather to show the possibilities
• EddyQL considers the number of outstanding PUTs (queue length) to decide where to send a tuple

20. Conclusions and Continuing Work
• Freddies provide adaptive query processing in a P2P system
  • Require no global knowledge
• Baseline performance shows promise for smarter policies
• In the future…
  • Explore Freddy performance in a dynamic environment
  • Explore more complex routing policies

21. Questions? Comments? Snide remarks for Ryan? Glorious praise for Shawn?
Thanks!
