XML + Query Processing: A Foundation for Intelligent Networks

XML + Query Processing: A Foundation for Intelligent Networks Michael Franklin UC Berkeley September 2003

Outline • Earlier (non-XML) Projects • Client-Server EXODUS -> SHORE • DIMSUM - Distributed Query Architecture • DBIS - Dissemination-Based Information Systems • Telegraph and TelegraphCQ • Lessons Learned • The XML-enabled Computing Landscape • Some Research Suggestions

Client-Server Exodus • Issue: How to split the functionality of an OODB across Clients and Servers? [Figure: OODB layers: Applications, Object Access/QP, Access Methods, Buffer Manager, Transaction Mgr, Disk Manager.]

Distribution of OODB Functions • Server is the owner of data. • Shared resources: data and log disks, server memory. • Clients cache second-class (i.e., soft state) copies to reduce latency. • Can share client caches too… • Query vs. Data Shipping. • For Data Shipping: • Object or Page granularity. • Ref: [Sigmod 91,92,94; VLDB 92,93] [Figure: OODB layers (Applications, Object Access/QP, Access Methods, Buffer Manager, Transaction Mgr, Disk Manager) with a Client-Server Protocol.]

SHORE - A Peer Server (P2P?) Model • Follow-on to Exodus [Sigmod 94] • Among other things, took caching to its logical conclusion: • All can be clients and servers. • You manage the data you own (server) • You cache data owned by others (client) • Wide-area is a reasonable next step • But massive scale changes everything (more on this later).

So, What Happened? • Well, all the OODB/ORDB stuff • But isn’t XML DB just OODB redux? • More to the point: • Models were tightly-coupled: • Synchronous • Need intimate knowledge of the schema • Limited (and late) standardization • for query languages, data model, and schema interchange • This is bad for: • Scalability • Interoperability • Incremental Deployment • Resilience to Change • Also, some people really did want queries (vs. navigation).

DIMSUM - Adding Queries to the Mix • Goal - mix declarative specification & caching. • raises mapping problems similar to materialized view maintenance, but more dynamic. • “Hybrid-Shipping” - Sometimes neither pure strategy is best. • Semantic Caching - remainder queries, semantic replacement functions, … • Query Scrambling - query re-optimization for wide-area delays (vague “deep web” theme) • XJoin - Adaptive, pipelined join operator. • Cache Investment - Multiple query cache optimization. • Ref: [Sigmod 96,98; VLDB 96,01; TODS 00]
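The remainder-query idea behind semantic caching can be sketched roughly as follows. This is a minimal illustration, assuming a one-dimensional range predicate and a cache described by a single cached range; the function name `split_query` and the half-open interval representation are hypothetical and are not DIMSUM's actual interfaces. A query is split into a probe part that the cache can answer and remainder parts that still have to be sent to the server.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # half-open interval [lo, hi); an assumption of this sketch


def split_query(query: Range, cached: Range) -> Tuple[Optional[Range], List[Range]]:
    """Split a range query into a probe (cache hit) and remainder queries."""
    q_lo, q_hi = query
    c_lo, c_hi = cached
    lo, hi = max(q_lo, c_lo), min(q_hi, c_hi)
    if lo >= hi:
        # No overlap with the cached region: the whole query is the remainder.
        return None, [query]
    remainder: List[Range] = []
    if q_lo < lo:
        remainder.append((q_lo, lo))   # part of the query below the cached range
    if hi < q_hi:
        remainder.append((hi, q_hi))   # part of the query above the cached range
    return (lo, hi), remainder


probe, remainder = split_query((10, 50), (20, 40))
print(probe, remainder)  # (20, 40) [(10, 20), (40, 50)]
```

Only the remainder ranges need to cross the network; the probe is answered locally.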

So, What Happened? • Still Tightly Coupled: • Synchronous (modulo Query Scrambling delay tolerance). • Need to know (and exchange) schemata • Basically, a federated database approach with caching added. • But, federated databases still haven’t caught on. • Q: Why is data warehousing so popular? • Still, some interesting issues raised: • adaptivity for networked query processing. • semantic cache content descriptors raise duality of queries and data. • pipelined operators for incoming data.

DBIS Framework: Dissemination-Based Information Systems • Outgrowth of “Broadcast Disks” project. [SIGMOD 95] • Framework in OOPSLA 97, SIGMOD 98 (Franklin & Zdonik) • Toolkit developed and demonstrated at SIGMOD 99 • The DBIS Framework is based on three fundamental principles: 1) No one data delivery mechanism is best for all situations (e.g., apps, workloads, topologies). 2) Network Transparency: Must allow different mechanisms for data delivery to be applied at different points in the system. 3) Topology, routing, and delivery mechanism should vary adaptively in response to system changes.

Dissemination Network Components [Figure: Data Sources, Information Brokers, and Client Proxies, linked by profile, query, and response flows.]

Data Delivery Mechanisms [Figure: taxonomy of delivery mechanisms along three dimensions: Pull vs. Push, Aperiodic vs. Periodic, and Unicast vs. 1-to-n. Example mechanisms shown: request/response, on-demand broadcast, polling, polling w/snoop, email lists, publish/subscribe, personalized news, broadcast disks.] Dimensions are largely orthogonal – all combinations are potentially useful.

Network Transparency • A fundamental principle for systems design: • Type of a link matters only to nodes on each end. [Figure: Sources, Brokers, Clients.]

More on Brokers • Brokers are middleware components that can act as both clients and servers. • Must support data caching • Needed to convert pushed-data to pulled-data • Also allows implementation of hierarchical caching • Profile Management • Allow informed data management: push, prefetch, staging, etc. • Profile Matching • Our assumptions were: • No profile language sufficient for all applications. • Need an API for adding app-specific profiling

So, What Happened? • Focus on combo of Push and Pull. • Big deal: Integration of Database and Networking • If I had a Euro for every review that said “why is this a db problem?” • Published in DB and Comms venues. • But, we were missing 2 big pieces of the puzzle: • How to deploy this stuff (in the routers?)? • What should the language for profiles and queries be? • These have since been answered.

Telegraph: Querying the Networked World • Increasingly ubiquitous networking at all scales. • ad hoc sensor nets, wireless, global Internet • Explosion in number, types, and locations of data sources and sinks. • mobile devices, P2P networks, data centers • Emerging software infrastructure to put it all together. “When processing, storage, and transmission cost micro-dollars, the only real value is the data and its organization.” (Jim Gray’s 1998 Turing Award Paper)

Telegraph Overview • An adaptive system for large-scale shared dataflow processing. • Sharing and adaptivity go hand-in-hand • Based on an extensible set of operators: 1) Ingress (data access) operators • File readers, Sensor Proxies, Screen-Scrapers 2) Non-Blocking Data processing operators • Selections (filters), XJoins, … 3) Adaptive Routing Operators • Eddies, STeMs, FLuX, etc. • Operators connected through “Fjords” [MF02] • queue-based framework unifying push & pull.
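To give a feel for adaptive routing, here is a toy sketch of an eddy-like router over selection operators. It is an illustration under stated assumptions, not Telegraph's implementation: the class name `Eddy`, the dictionary-based tuples, and the greedy "lowest observed pass rate first" policy are all hypothetical simplifications (the published Eddy design uses a lottery-style routing policy). The point is only that the operator order is re-decided per tuple from running statistics rather than fixed at plan time.

```python
from typing import Callable, Dict


class Eddy:
    """Toy eddy: routes each tuple through all filters, most selective first."""

    def __init__(self, filters: Dict[str, Callable[[dict], bool]]) -> None:
        self.filters = filters
        self.seen = {name: 0 for name in filters}     # tuples routed to each filter
        self.passed = {name: 0 for name in filters}   # tuples each filter let through

    def pass_rate(self, name: str) -> float:
        # Filters with no observations get rate 0.0 so they are tried early.
        return self.passed[name] / self.seen[name] if self.seen[name] else 0.0

    def route(self, tup: dict) -> bool:
        pending = set(self.filters)
        while pending:
            # Pick the next operator for this tuple from current statistics.
            name = min(sorted(pending), key=self.pass_rate)
            pending.remove(name)
            self.seen[name] += 1
            if not self.filters[name](tup):
                return False        # tuple dropped; remaining filters are skipped
            self.passed[name] += 1
        return True                 # tuple satisfied every filter


eddy = Eddy({
    "even": lambda t: t["x"] % 2 == 0,
    "tens": lambda t: t["x"] % 10 == 0,
})
results = [t["x"] for t in ({"x": i} for i in range(100)) if eddy.route(t)]
print(results)    # the multiples of 10 below 100
print(eddy.seen)  # how often each filter was actually invoked
```

After a few tuples the router learns that the "tens" filter rejects more input and sends tuples there first, so the less selective filter is invoked far less often.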

The Telegraph Project • We’ve explored sharing and adaptivity in … • Eddies: Continuously adaptive queries • Fjords: Inter-module communication • CACQ: Sharing, Tuple-lineage • PSoup: Query=Data duality • STeMs: Half-a-symmetric-join, tuple store • FLuX: Fault tolerance, load balancing • … and built a first generation prototype [SIGMODRec01] • Built from scratch in Java • Rewrote as “TelegraphCQ” [CIDR 03] • In “C”, based on open-source PostgreSQL • Focus on continuous queries over streams • Released in July 2003

The TelegraphCQ Architecture [Figure: components shown include the TelegraphCQ Front End (Listener, Parser, Planner, Mini-Executor, Proxy, Catalog), TelegraphCQ Back Ends (CQEddy, Modules, Split, Scans), the TelegraphCQ Wrapper ClearingHouse (Wrappers), Shared Memory (Query Plan Queue, Eddy Control Queue, Query Result Queues, Buffer Pool), and Disk.]

Queries Need Windows: Landmark query
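As a rough illustration of a landmark window (one whose start is fixed at a landmark timestamp while its end advances with the stream), the sketch below computes a running maximum. The function name `landmark_max` and the `(timestamp, value)` tuple layout are assumptions made for this example and are not TelegraphCQ syntax.

```python
from typing import Iterable, Iterator, Optional, Tuple


def landmark_max(stream: Iterable[Tuple[int, float]],
                 landmark: int) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp, max value seen since `landmark`) for each arriving tuple."""
    current: Optional[float] = None
    for ts, value in stream:
        if ts < landmark:
            continue  # before the landmark: outside every window of this query
        current = value if current is None else max(current, value)
        yield ts, current


readings = [(1, 5.0), (2, 9.0), (3, 4.0), (4, 7.0), (5, 11.0)]
print(list(landmark_max(readings, landmark=3)))
# [(3, 4.0), (4, 7.0), (5, 11.0)]
```

Because the window only grows, the aggregate can be maintained incrementally without storing past tuples; a sliding window would also need to expire old values.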

So, What Happened? • Decision was made to do relational first. • Enough hard problems w/o XML • Our early apps weren’t XML • Q: Will they eventually be? • Note: Streams and Aurora made same choice • Developed lots of stream-related technology • Project still going strong • Storage manager, archives, and historical queries • Adaptive Adaptivity • Performance Tuning • Query Language and Window semantics • Distribution

Summary So Far • 4 projects over 14 or so years. All exploring aspects of networked data management. • Exodus/SHORE - centrality of caching, work sharing and work splitting paradigms. • DIMSUM - Benefits and challenges of declarative specifications via queries. • DBIS - Push, Profiles, broader notion of integrating networking and data management. • Telegraph - Adaptivity, Sharing, CQs, Stream processing. But, they all suffer to some extent from the problem of tight coupling in terms of both timing and semantics.

Meta Lessons Learned 1. You don’t have to predict the technology correctly to get a bunch of papers published. 2. Sometimes you actually get it right, but the timing is a bit off. A lot of pieces have to fall into place before a new technology or architecture clicks. XML is one such piece, and it’s a BIG one.

How to Make Systems More Network-Friendly • “Messaging enables distributed communication that is loosely coupled. A component sends a message to a destination, and the recipient can retrieve the message from the destination. However, the sender and the receiver do not have to be available at the same time in order to communicate. In fact, the sender does not need to know anything about the receiver; nor does the receiver need to know anything about the sender. The sender and the receiver need to know only what message format and what destination to use.” (Java Message Service (JMS) API Tutorial, Sun Microsystems)
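The loose coupling described in the quotation can be illustrated with a toy in-memory broker. This is a sketch only: the `Broker` class and its `send`/`receive` methods are hypothetical names for this example and are not the JMS API. The sender and receiver share nothing except a destination name and a message format, and they never need to be active at the same time.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class Broker:
    """Toy in-memory broker: named destinations hold messages until retrieved."""

    def __init__(self) -> None:
        self._destinations: Dict[str, Deque[str]] = defaultdict(deque)

    def send(self, destination: str, message: str) -> None:
        # The sender only names a destination; it knows nothing about receivers.
        self._destinations[destination].append(message)

    def receive(self, destination: str) -> Optional[str]:
        # The receiver polls the destination later; None means nothing is waiting.
        queue = self._destinations[destination]
        return queue.popleft() if queue else None


broker = Broker()
broker.send("orders", "<order id='1'/>")   # no receiver exists yet
print(broker.receive("orders"))            # retrieved later: <order id='1'/>
print(broker.receive("orders"))            # None, the destination is now empty
```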

Preaching to the Choir • XML (not JMS!) solves both these issues. • Senders and Receivers can agree on message format (or at least figure most of it out). • Destinations should be encoded by value not by address. (Didn’t we learn anything during the OODB battles?). • Database people live and breathe both of these. So who better to fix the networked application infrastructure problem? (Ahem, but, better keep that slow DBMS out of the message flow! e.g., FedEx tracking involves 100,000,000 transactions a day, and RFID will be even more fun.)

XML Message Brokers • A platform for dynamic, loosely-coupled integration of enterprise applications and data. • Interaction accomplished through exchange of messages in the wide area. (e.g., Adam Bosworth’s VLDB 02 keynote: http://www.cs.ust.hk/vldb2002/VLDB2002-proceedings/slides/S01P01slides.pdf)

Underlying Technology: Filtering • The challenge is to efficiently and quickly match incoming XML documents against the potentially huge set of user profiles. [Figure: Data Sources, XML Conversion, XML Documents, User Profiles, Filter Engine, Filtered Data, Users.]

Our View on Message Brokers (YFilter) • Message Brokers perform three main tasks: • Filtering - matching of interests. • Transformation - format conversion for app integration and preferences. • Routing - moving bits through the overlay network • Must be lightweight and scalable. • Effectively they are high-function routers. • Large-scale deployments may entail handling 10’s or 100’s of thousands of queries (subscriptions) • XML is a natural substrate.

YFilter: Shared Path Matching (Yanlei Diao et al., ACM TODS, Dec. 2003) • For large-scale systems, shared processing is essential. • YFilter uses an NFA-based approach to share path matching work among queries. [Figure: NFA fragments for the basic location steps /a, //a, /*, and //*.]

Constructing a Query NFA • Concatenate NFA fragments for location steps in a path expression. [Figure: the fragments for /a and //b concatenated into the NFA for the query “/a//b”.]
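A single-query NFA of this kind can be sketched as follows. This is a simplified illustration, not YFilter's code: the names `parse_path` and `matches` are hypothetical, only `/`, `//`, name tests, and `*` are handled, and the `//` step is simulated by letting a state stay active across any element rather than by materializing a separate ε-transition. The input is the list of element names from the document root down to the element being tested.

```python
import re
from typing import List, Sequence, Tuple

Step = Tuple[str, str]  # (axis, name test); axis is "/" or "//"


def parse_path(expr: str) -> List[Step]:
    """Split a path such as "/a//b" into its location steps."""
    return re.findall(r"(//|/)([^/]+)", expr)


def matches(expr: str, element_path: Sequence[str]) -> bool:
    """Run the NFA for `expr` over a root-to-element path of element names."""
    steps = parse_path(expr)
    active = {0}                      # state i means "the first i steps matched"
    for element in element_path:
        nxt = set()
        for state in active:
            if state == len(steps):
                continue              # accepting state has no outgoing edges
            axis, name = steps[state]
            if name in ("*", element):
                nxt.add(state + 1)    # this element satisfies the next step
            if axis == "//":
                nxt.add(state)        # descendant axis: skip over this element
        active = nxt
    return len(steps) in active


print(matches("/a//b", ["a", "b"]))        # True
print(matches("/a//b", ["a", "x", "b"]))   # True
print(matches("/a//b", ["b"]))             # False
```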

Constructing the Combined NFA • Q1=/a/b • Q2=/a/c • Q3=/a/b/c • Q4=/a//b/c • Q5=/a/*/b • Q6=/a//c • Q7=/a/*/*/c • Q8=/a/b/c [Figure: a single NFA combining all eight queries with shared prefixes; accepting states are labeled with the queries they satisfy, e.g. {Q3, Q8}.]

NFA Execution • An XML fragment: <a> <b> <c> </c> </b> </a> [Figure: the combined NFA with numbered states, and the Runtime Stack of active state sets after each event (initial, read <a>, read <b>, read <c>, read </c>, read </b>, read </a>). Match annotations in the figure: Q1; then Q3, Q8, Q5, Q4, Q6.]
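A combined NFA with a runtime stack can be sketched roughly as below. This is an illustration under stated assumptions rather than YFilter's implementation: the `State` class, `add_query`, `closure`, and `run` are hypothetical names, the document is supplied as a list of `("start", name)` / `("end", name)` events instead of being parsed, and the four queries P1 to P4 are made up for the example (they are not the slide's Q1 to Q8). Queries with a common prefix share states; a start tag pushes the new set of active states, and an end tag pops it.

```python
import re
from typing import Dict, Iterable, List, Optional, Set, Tuple


class State:
    def __init__(self, self_loop: bool = False) -> None:
        self.edges: Dict[str, "State"] = {}    # element name or "*" -> next state
        self.desc: Optional["State"] = None    # epsilon edge used by "//" steps
        self.self_loop = self_loop             # "//" states stay active on any element
        self.queries: Set[str] = set()         # ids of queries accepted in this state


def add_query(root: State, qid: str, expr: str) -> None:
    """Insert a path query, reusing states already created for shared prefixes."""
    state = root
    for axis, name in re.findall(r"(//|/)([^/]+)", expr):
        if axis == "//":
            if state.desc is None:
                state.desc = State(self_loop=True)
            state = state.desc
        state = state.edges.setdefault(name, State())
    state.queries.add(qid)


def closure(states: Iterable[State]) -> Set[State]:
    """Add every state reachable through epsilon edges."""
    result: Set[State] = set()
    for state in states:
        while state is not None and state not in result:
            result.add(state)
            state = state.desc
    return result


def run(root: State, events: List[Tuple[str, str]]) -> Set[str]:
    stack = [closure({root})]
    matched: Set[str] = set()
    for kind, name in events:
        if kind == "end":
            stack.pop()                 # backtrack to the parent's active states
            continue
        nxt: Set[State] = set()
        for state in stack[-1]:
            for label in (name, "*"):
                if label in state.edges:
                    nxt.add(state.edges[label])
            if state.self_loop:
                nxt.add(state)
        active = closure(nxt)
        for state in active:
            matched |= state.queries
        stack.append(active)
    return matched


root = State()
for qid, expr in {"P1": "/a/b", "P2": "/a//c", "P3": "/a/*/c", "P4": "/a/c"}.items():
    add_query(root, qid, expr)
doc = [("start", "a"), ("start", "b"), ("start", "c"),
       ("end", "c"), ("end", "b"), ("end", "a")]
print(sorted(run(root, doc)))  # ['P1', 'P2', 'P3']
```

All four queries share the state reached on `a`, so the common prefix is processed once per document regardless of how many queries use it.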

Performance Overview • Sharing provides order-of-magnitude improvements. • In our experiments, even with 100,000 concurrent queries, filtering was faster than the parser. • No exponential blow-up of active states in NFA execution • Little sensitivity to occurrence of ‘*’ and “//” • YFilter shows little sensitivity to these two parameters because effective prefix sharing keeps the machine size small • Efficient for query updates • Tens of milliseconds for inserting 1000 queries, and stabilizes at 5 msec after 50,000 queries exist in the system.

Message Transformation • Shred FLWR expressions into paths that can be pushed down into the path matching engine. • Post-process the output using relational-style operators to produce customized messages. • Can apply MQO techniques to these post-plans • Three approaches (differ in the extent to which they push work to the engine) • PathSharing-F: For clause paths only • PathSharing-FW: For & Where clause paths • PathSharing-FWR: For, Where & Return • Inherent tension between path sharing and result customization! • See Yanlei Diao’s VLDB 03 paper (Thursday afternoon)

Message Broker – Wrap Up • Sharing is the key to performance: • NFA provides excellent scalability/performance • PathSharing-FWR performs best, when combined with optimizations based on the queries and DTD. • When the post-processing is shared, even more scalability can be achieved. • This sharing is facilitated by using relational-like query plans. • On-going work - How to deploy in the wide area?: • Distributed Filtering and Content Delivery Network • Combining distributed query processing and state-of-the-art application-level multicast protocols. • What semantics can/should be provided? • For more information see: www.cs.berkeley.edu/~daioyl/yfilter

Beyond Message-Based Systems • Distributed systems need traceability • Particularly highly dynamic (loosely-coupled) ones • Need to carry provenance information with data • Workflow description • XML-based workflow languages with appropriate versioning models can provide the platform for the above. • Data needs to be long-lived - Archiving • Marked up data provides an opportunity for future interpretation? • Schema versioning needed for this. • Semantic Web? • Try it if you like…

Deep/Hidden Web Querying • XML is a great way to describe sources. • Routing queries to sources is the inverse of the data dissemination problem. • Yet another instance of the query and data duality. • Stream query processing can help here too.

Self-Publishing/Crawling • Following the query routing idea further… • Queries can be continuously crawling through the network acquiring new data. • This can be random or focused (e.g., navigating your Friendster chains). • Even more fun: Mutant Queries (Papadimos et al. OGI) • Queries are partially evaluated and bound as they traverse the network. • “Hybrid Shipping” on steroids

Topics in Need of Work • Query Languages and semantics in streaming, loosely-coupled, semi-structured environments. • Update consistency models, transactions, exactly-once delivery - How 80’s! • Dynamism and on-the-fly modifications • User interaction • Platform questions: In or out of the DBMS? • Making XML appropriate for other environments (e.g., sensor networks). • …

Conclusions • Two technologies are combining to make distributed/decentralized computing a reality: overlay networks and XML. • Query processing is a way to route data through a network by value. • This is the right way to build an overlay network. • We are the right people to do it. • XML is the common substrate that enables it. • My plan: revisit many earlier distributed data management ideas in light of this new reality. • And do some new stuff too!
