Adaptive Query Processing: Progress and Challenges Alon Halevy University of Washington [Gore] Joint work with Zack Ives, Dan Weld (later: Nimble Technology)
Data Integration Systems Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet: in enterprises, WWW, big science.
Recent Trends in Data Integration Research • Issues such as architectures, query reformulation, and wrapper construction are reasonably well understood (though good work is still going on). • Query execution and optimization raise significant challenges. • Problems for the traditional query processing model: • Few statistics (autonomous sources). • Unanticipated delays and failures (network-bound sources). • Conclusion (ours): we cannot afford to separate optimization from execution. Need to be adaptive. • See IEEE Data Engineering Bulletin, June 2000.
Outline • Tukwila (version 1): • Interleaving optimization and execution at the core. • The unsolved problem: when to switch? • The complicating new challenges: • XML, want first tuples fast. • Tukwila (version 2): • completely pipelined XML query processing. • Some experiences from Nimble
Tukwila: Version 1 • Key idea: build adaptive features into the core of the system. • Interleave planning and execution (replan when you know more about your data). • Rule-based mechanism for changing behavior. • Adaptive query operators: • Revival of the double-pipelined join. • Collectors (a.k.a. “smart union”). • See details in SIGMOD-99.
Tukwila Data Integration System Novel components: • Event handler • Optimization-execution loop
Handling Execution Events • Adaptive execution via event-condition-action rules • During execution, events are generated: timeout, n tuples read, operator opens/closes, memory overflows, execution fragment completes, … • Events trigger rules: • Test conditions: memory free, tuples read, operator state, operator active, … • Execution actions: re-optimize, reduce memory, activate/deactivate operator, …
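The event-condition-action mechanism on this slide can be illustrated with a small sketch. This is not Tukwila's implementation: the class names (`Rule`, `EventHandler`), the state keys, and the 512 KB limit are all hypothetical, chosen only to show how an event selects rules, a condition filters them, and an action runs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

State = Dict[str, Any]  # hypothetical snapshot of execution state


@dataclass
class Rule:
    event: str                          # event name that triggers the rule
    condition: Callable[[State], bool]  # test over the execution state
    action: Callable[[State], None]     # execution action to perform


@dataclass
class EventHandler:
    rules: List[Rule] = field(default_factory=list)

    def register(self, rule: Rule) -> None:
        self.rules.append(rule)

    def dispatch(self, event: str, state: State) -> int:
        """Run every rule registered for `event` whose condition holds.

        Returns the number of rules that fired.
        """
        fired = 0
        for rule in self.rules:
            if rule.event == event and rule.condition(state):
                rule.action(state)
                fired += 1
        return fired


handler = EventHandler()
handler.register(Rule(
    event="memory_overflow",
    condition=lambda s: s["memory_free_kb"] < 512,   # assumed limit
    action=lambda s: s["actions"].append("reduce_memory"),
))
handler.register(Rule(
    event="timeout",
    condition=lambda s: s["tuples_read"] == 0,
    action=lambda s: s["actions"].append("re-optimize"),
))

state = {"memory_free_kb": 128, "tuples_read": 40, "actions": []}
handler.dispatch("memory_overflow", state)  # condition holds, action runs
handler.dispatch("timeout", state)          # condition fails, nothing runs
```

In this sketch actions only record what they would do; in a real engine they would call into the optimizer or the operators.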
Interleaving Planning and Execution Re-optimize if at an unexpected state: • Evaluate at key points, re-optimize the un-executed portion of the plan [Kabra/DeWitt SIGMOD98] • Plan is divided into pipelined units (fragments) • Send statistics back to the optimizer. • Maintain optimizer state for later reuse. Example rule: WHEN end_of_fragment(0) IF card(result) > 100,000 THEN re-optimize
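A minimal sketch of the optimization-execution loop, under simplifying assumptions: fragments run one after another, each fragment is a hypothetical `(name, run)` pair whose `run()` returns a list of tuples, and `reoptimize` stands in for the optimizer. Unlike the slide's rule, which is tied to fragment 0, the sketch applies the same cardinality check after every fragment.

```python
from typing import Callable, Dict, List, Tuple

Fragment = Tuple[str, Callable[[], list]]  # (name, function producing tuples)


def run_with_reoptimization(
    fragments: List[Fragment],
    reoptimize: Callable[[List[Fragment], Dict[str, int]], List[Fragment]],
    threshold: int = 100_000,
) -> Tuple[List[str], Dict[str, int]]:
    """Execute fragments in order, re-planning the rest when a result is large."""
    stats: Dict[str, int] = {}   # observed cardinalities, fed back to the optimizer
    order: List[str] = []        # fragments in the order they actually ran
    remaining = list(fragments)
    while remaining:
        name, run = remaining.pop(0)
        result = run()
        stats[name] = len(result)
        order.append(name)
        # Corresponds to: WHEN end_of_fragment IF card(result) > threshold
        # THEN re-optimize (only the un-executed portion is re-planned).
        if remaining and stats[name] > threshold:
            remaining = reoptimize(remaining, stats)
    return order, stats


# Toy usage: the "optimizer" simply reverses the remaining fragments.
demo_fragments = [
    ("f0", lambda: list(range(5))),
    ("f1", lambda: [1, 2]),
    ("f2", lambda: [1]),
]
order, stats = run_with_reoptimization(
    demo_fragments, lambda rest, observed: list(reversed(rest)), threshold=3
)
```

Here `f0` returns more tuples than the toy threshold, so the remaining fragments are re-planned before execution continues.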
Adaptive Operators: Double Pipelined Join Hybrid Hash Join: • Partially pipelined: no output until the inner relation has been read • Asymmetric (inner vs. outer): optimization requires knowledge of source behavior Double Pipelined Hash Join (enhancement to [Wilschut PDIS91]: uses multithreading, handles overflow): • Outputs data immediately • Symmetric: requires less source knowledge to optimize
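The core idea of the double pipelined join can be sketched in a single thread: keep a hash table per input, and for each arriving tuple insert it into its own table and probe the other one. This sketch leaves out the multithreading and memory-overflow handling that the slide lists as Tukwila's enhancements; the function name and the `("L" | "R", row)` arrival encoding are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, Iterator, Tuple


def double_pipelined_join(
    arrivals: Iterable[Tuple[str, tuple]],
    left_key: Callable[[tuple], Hashable],
    right_key: Callable[[tuple], Hashable],
) -> Iterator[Tuple[tuple, tuple]]:
    """Symmetric hash join over tuples arriving in arbitrary interleaved order.

    `arrivals` yields ("L", row) or ("R", row). Each row is inserted into its
    own side's hash table, then probed against the other side's table, so a
    match is emitted as soon as both of its tuples have arrived.
    """
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for side, row in arrivals:
        if side == "L":
            key = left_key(row)
            left_table[key].append(row)
            for match in right_table.get(key, []):
                yield (row, match)
        else:
            key = right_key(row)
            right_table[key].append(row)
            for match in left_table.get(key, []):
                yield (match, row)


# Toy usage: books (publisher id, title) joined with companies (id, name).
stream = [
    ("L", ("mkp", "Readings in Database Systems")),
    ("R", ("mkp", "Morgan Kaufmann")),
    ("L", ("mkp", "Another Title")),      # hypothetical row
    ("R", ("xyz", "Unmatched Press")),    # hypothetical row
]
joined = list(double_pipelined_join(stream, lambda b: b[0], lambda c: c[0]))
```

Because neither input is designated as the build side, the operator does not need to know in advance which source is smaller or faster.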
Adaptive Operators: Collector Utilize mirrors and overlapping sources to produce results quickly • Dynamically adjust to source speed & availability • Scale to many sources without exceeding net bandwidth • Based on policy expressed via rules WHEN timeout(CustReviews) DO activate(NYTimes), activate(alt.books) WHEN timeout(NYTimes) DO activate(alt.books)
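The collector policy above can be sketched as a fallback table keyed by source name. This is a simplified, sequential illustration rather than Tukwila's operator: sources are hypothetical callables that either return rows or raise `TimeoutError`, whereas a real collector would read from active sources concurrently. Dropping duplicate rows is an assumption about what the "smart union" does with overlapping sources.

```python
from typing import Callable, Dict, List


def collect(
    sources: Dict[str, Callable[[], List[str]]],
    policy: Dict[str, List[str]],
    initial: List[str],
) -> List[str]:
    """Union rows from active sources, activating alternates on timeout."""
    queue = list(initial)       # sources waiting to be read
    activated = set(initial)    # every source activated so far
    results: List[str] = []
    seen_rows = set()
    while queue:
        name = queue.pop(0)
        try:
            rows = sources[name]()
        except TimeoutError:
            # WHEN timeout(name) DO activate(each alternate not yet active)
            for alt in policy.get(name, []):
                if alt not in activated:
                    activated.add(alt)
                    queue.append(alt)
            continue
        for row in rows:
            if row not in seen_rows:   # overlapping sources: keep one copy
                seen_rows.add(row)
                results.append(row)
    return results


def cust_reviews() -> List[str]:
    raise TimeoutError("CustReviews did not respond")  # simulated timeout


demo_sources = {
    "CustReviews": cust_reviews,
    "NYTimes": lambda: ["review-1", "review-2"],    # hypothetical rows
    "alt.books": lambda: ["review-2", "review-3"],  # hypothetical rows
}
demo_policy = {
    "CustReviews": ["NYTimes", "alt.books"],
    "NYTimes": ["alt.books"],
}
reviews = collect(demo_sources, demo_policy, initial=["CustReviews"])
```

The policy table mirrors the two rules on the slide: a timeout on `CustReviews` activates both alternates, and a timeout on `NYTimes` would activate `alt.books` if it were not already active.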
Highlights from Version 1 • It worked well (graphs to prove it)! • Unified architecture that encompassed previous techniques: • Choose nodes (Cole & Graefe) • Mid-stream re-optimization (Kabra & DeWitt) • Query scrambling (Urhan, Franklin, Amsaleg) • Optimizer can have global view of different factors affecting adaptive behavior.
The Unsolved Problem • How do we find interleaving points? When should we switch from optimization to execution? • Some straightforward solutions worked reasonably well, but the student who was supposed to solve the problem graduated prematurely. • Some work on this problem: • Rick Cole (Informix) • Benninghoff & Maier (OGI). • One solution being explored: execute first and break the pipeline later as needed. • Another solution: change operator ordering in mid-flight (Eddies, Avnur & Hellerstein).
More Urgent Problems • Users want answers immediately: • Optimize time to first tuple • Give approximate results earlier. • XML emerges as a preferred platform for data integration: • But all existing XML query processors are based on first loading XML into a repository.
Tukwila Version 2 • Able to transform, integrate and query arbitrary XML documents. • Support for output of query results as early as possible: • Streaming model of XML query execution. • Efficient execution over remote sources that are subject to frequent updates. • Philosophy: how can we adapt relational and object-relational execution engines to work with XML?
Tukwila V2 Highlights • The X-scan operator that maps XML data into tuples of subtrees. • Support for efficient memory representation of subtrees (use references to minimize replication). • Special operators for combining and structuring bindings as XML output.
Example XML File
<db>
  <book publisher="mkp">
    <title>Readings in Database Systems</title>
    <editors>
      <name>Stonebraker</name>
      <name>Hellerstein</name>
    </editors>
    <isbn>123-456-X</isbn>
  </book>
  <company ID="mkp">
    <name>Morgan Kaufmann</name>
    <city>San Mateo</city>
    <state>CA</state>
  </company>
</db>
Example Query
WHERE <db>
        <book publisher=$pID>
          <title>$t</>
        </> ELEMENT_AS $b
      </> IN "books.xml",
      <db>
        <publication title=$t>
          <source ID=$pID>$p</>
          <price>$pr</>
        </>
      </> IN "amazon.xml",
      $pr < 49.95
CONSTRUCT <book>
            <name>$t</>
            <publisher>$p</>
          </>
Query Execution Plan
X-Scan • The operator at the leaves of the plan. • Given an XML stream and a set of regular expressions, produces a set of bindings. • Supports both tree and graph data. • Uses a set of state machines to traverse the data and match the patterns. • Maintains a list of unseen element IDs, and resolves them upon arrival.
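The streaming behavior of X-scan can be approximated with Python's standard incremental XML parser. This is only a sketch of the idea, not the operator itself. It matches one fixed path instead of running state machines over regular expressions, and it does not resolve ID references, so it handles tree data only. The function name `scan_bindings` and the binding names `pID` and `t` are assumptions taken from the example query.

```python
import io
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator, List


def scan_bindings(
    stream,
    element_path: List[str],
    extract: Callable[[ET.Element], Dict[str, str]],
) -> Iterator[Dict[str, str]]:
    """Yield one binding tuple per element found at `element_path`.

    Bindings are produced as soon as the matching element's end tag has been
    parsed, without waiting for the rest of the document.
    """
    path: List[str] = []  # tags of the currently open elements
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
        else:
            if path == element_path:
                yield extract(elem)
                elem.clear()  # release the subtree once it has been bound
            path.pop()


# The example document, with the company name closed by </name>.
DOC = b"""<db>
  <book publisher="mkp">
    <title>Readings in Database Systems</title>
    <editors><name>Stonebraker</name><name>Hellerstein</name></editors>
    <isbn>123-456-X</isbn>
  </book>
  <company ID="mkp">
    <name>Morgan Kaufmann</name>
    <city>San Mateo</city>
    <state>CA</state>
  </company>
</db>"""

book_bindings = list(scan_bindings(
    io.BytesIO(DOC),
    ["db", "book"],
    lambda book: {"pID": book.get("publisher"), "t": book.findtext("title")},
))
```

Each yielded dictionary plays the role of a binding tuple that downstream relational-style operators, such as the joins above, could consume.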
Other Features of Tukwila V.2 • X-scan: • Can also be made to preserve XML order. • Careful handling of cycles in the XML graph. • Can apply certain selections to the bindings. • Uses much of the code of Tukwila I. • No modifications to traditional operators. • XML output producing operators. • Nest operator.
In the “Pipeline” • Partial answers: no blocking. Produce approximate answers as data is streaming. • Policies for recovering from memory overflow [More Zack]. • Efficient updating of XML documents (and an XML update language) [w/Tatarinov] • Dan Suciu: a modular/composable toolset for manipulating XML. • Automatic generation of data source descriptions (Doan & Domingos)
Intermediate Conclusions • First scalable XML query processor for networked data. • Work done in relational query processing is very relevant to XML query processing. • We want to avoid decomposing XML data into relational structures.
Some Observations from Nimble • What is Nimble? • Founded in June, 1999 with Dan Weld. • Data integration engine built on an XML platform. • Query language is XML-QL. • Mostly geared to enterprise integration, some advanced web applications. • 70+ person company (and hiring!) • Ships in trucks (first customer is Paccar).
System Architecture [Figure: Nimble system architecture diagram. Labels recovered from the figure: XML Query; data sources (XML, Relational, Data Warehouse/Mart, Legacy, Flat File, Web Pages); Front-End (Lens Builder™, User Applications, Lens™, File, InfoBrowser™, Software Developers Kit, NIMBLE™ APIs, Management Tools); Integration Layer (Nimble Integration Engine™, Metadata Server, Cache, Compiler, Executor, Security Tools, Common XML View, Integration Builder, Concordance Developer, Data Administrator).]
The Current State of Enterprise Information • Explosion of intranet and extranet information • 80% of corporate information is unmanaged • By 2004 30X more enterprise data than 1999 • The average company: • maintains 49 distinct enterprise applications • spends 35% of total IT budget on integration-related efforts Source: Gartner, 1999
Design Issues • Query language for XML: tracking the W3C committee. • The algebra: • Needs to handle XML, relational, hierarchical and support it all efficiently! • Need to distinguish physical from logical algebra. • Concordance tables need to be an integral part of the system. Need to think of data cleaning. • Need to deal with down times of data sources (or refusal times). • Need to provide range of options between on-demand querying and pre-materialization.
Non-Technical Issues • SQL not really a standard. • Legacy systems are not necessarily old. • IT managers skeptical of truths. • People are very confused out there. • Need a huge organization to support the effort.