Languages for Distributed Information Val Tannen Computer and Information Science Department

Languages for Distributed Information
• Val Tannen
• Computer and Information Science Department
• University of Pennsylvania
• Acknowledgements:
• Philippa Gardner and Sergio Maffeis, Imperial College
• Arnaud Sahuguet, Bell Labs (formerly at UPenn)
• Zack Ives and Benjamin Pierce, UPenn

Distributed information
• Past: data was distributed because it did not fit on one system; data placement was a big topic.
• Present: independent data sources agree to collaborate.
• Queries require integration across multiple sources.
• Heterogeneity: is it solved by XML? Results are mixed.
• Redundancy is what makes things really interesting!

A query arrives… and demands satisfaction
The query arrives at any node and refers to names. Names relate to data in various nodes.
For example:
    node A: proj (join R (sel S))
We need to discover that R is at node A and S is at node B.
We need a plan to compute the result of
    proj (join tableR@A (sel tableS@B))
and return the answer to the client that asked the query.

One language for queries, views, and query plans
[Diagram labels: data language; query language (SQL, XQuery); processes; view language; query plan language]

Motives and intentions
Existing view and plan languages are limited.
Capture distributed query processing in a high-level language (compare SQL vs. C++):
• Easier to program self-tuning features.
• Easier to generate automatically, e.g., by optimizers.
• Easier to model toward formal verification.
My ambition :) is to convince (1) database people that process calculi are useful, and (2) process calculi people to look at some new problems.

Concurrency and process calculi
Why process calculi? Because queries arrive asynchronously and must be processed concurrently within a node (and across nodes).
Composition primitives (e1, e2 are expressions denoting processes):
    e1 | e2     parallel
    e1 ; e2     sequential
    e1 or e2    nondeterministic choice

Synchronization and communication
CCS:
    notif L ; e1 | wait L ; e2   →   e1 | e2
pi-calculus (blocking synchronization plus communication on channel c):
    new c . ( send c v ; e1 | recv c x . e2(x) )   →   new c . ( e1 | e2(v) )
Anticipation: the blocking operational semantics does not capture query execution well. The old dataflow paradigm might be better for that. But channels are essential.
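The pi-calculus exchange on channel c can be mimicked with threads. The following is only a minimal sketch, assuming a Python encoding that is not part of the talk: the channel is a `queue.Queue`, the sender waits with `join()` until the receiver has taken the value (to imitate blocking synchronization), and the continuation `e2` is a hypothetical function that doubles its argument.

```python
import queue
import threading

def run_send_recv(v):
    """Sketch of: new c . ( send c v ; e1 | recv c x . e2(x) )."""
    c = queue.Queue()          # 'new c': a fresh channel private to this call
    results = {}

    def sender():
        c.put(v)               # send c v
        c.join()               # block until the receiver has consumed the value
        results["e1"] = "done" # continuation e1 runs only after the handshake

    def receiver():
        x = c.get()            # recv c x: blocks until a value is available
        c.task_done()          # releases the sender blocked in join()
        results["e2"] = x * 2  # continuation e2(x); doubling is a made-up example

    threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_send_recv(21)` runs both processes to completion; the order in which `e1` and `e2` finish is not fixed, which is why the results are collected in a dictionary.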

Capturing query plans
The original query may be “decomposed” into subordinate queries running at different nodes.
Recall
    node A: proj (join R (sel S))
Get from here… to here:
[Plan diagram labels: Client; node A; node B; proj; join; sel; R; S]

Query plan “development”
    node A: proj (join R (sel S))
  > meta-step (discovery) >
    node A: proj (join R@A (sel S@B))
  > meta-step (discovery) >
    node A: proj (join tableR (sel S@B))
  > meta-step (optimization) >
    node A: new c . proj (join tableR (split c (sel S@B)))
  > operational semantics step >
    node A: new c . proj (join tableR (recv c)) | send c (sel S@B)
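The two discovery meta-steps can be pictured as rewrites on a plan tree. This is a sketch under stated assumptions: plans are encoded as nested Python tuples, and `CATALOG` is a hypothetical name-to-node mapping; how discovery actually obtains that knowledge is not modeled here.

```python
# Plan encoding (an assumption of this sketch):
#   ("name", "R")     a bare name          R
#   ("at", "R", "A")  a located name       R@A
#   ("table", "R")    a local table        tableR
#   ("proj", e), ("sel", e), ("join", e1, e2)  operators
CATALOG = {"R": "A", "S": "B"}   # hypothetical catalog: name -> node
LEAVES = ("name", "at", "table")

def discover(plan, catalog):
    """Meta-step: replace every bare name by a located name (R becomes R@A)."""
    op = plan[0]
    if op == "name":
        return ("at", plan[1], catalog[plan[1]])
    if op in LEAVES:
        return plan
    return (op,) + tuple(discover(child, catalog) for child in plan[1:])

def localize(plan, here):
    """Meta-step: a name located at the current node becomes direct table access."""
    op = plan[0]
    if op == "at" and plan[2] == here:
        return ("table", plan[1])
    if op in LEAVES:
        return plan
    return (op,) + tuple(localize(child, here) for child in plan[1:])

query = ("proj", ("join", ("name", "R"), ("sel", ("name", "S"))))
located = discover(query, CATALOG)   # proj (join R@A (sel S@B))
at_a = localize(located, "A")        # proj (join tableR (sel S@B))
```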

A new primitive: split
    new c . e1(split c (e2))   →   new c . e1(recv c) | send c e2
(We are using a simplified form of recv since the value passed through the channel is consumed in just one place.)
split is typically introduced by optimization steps.
Alternative (query shipping vs. data shipping):
    node A: new c . proj (join tableR (sel (split c S@B)))
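The reduction rule for split can be sketched as a tree rewrite: each `split c e` inside a plan is replaced by `recv c`, and a matching `send c e` process is placed in parallel. The tuple encoding and the function name `lift_split` are assumptions made for illustration.

```python
# Plan encoding (an assumption of this sketch): nested tuples such as
#   ("proj", e), ("join", e1, e2), ("sel", e), ("split", "c", e),
#   ("recv", "c"), ("send", "c", e), ("table", "R"), ("at", "S", "B").
LEAVES = ("name", "at", "table", "recv")

def lift_split(plan):
    """Return (plan_with_recv, sends).

    plan_with_recv is the plan with every ("split", c, e) replaced by
    ("recv", c); sends lists the ("send", c, e) processes to run in parallel.
    """
    op = plan[0]
    if op == "split":
        _, chan, body = plan
        inner, sends = lift_split(body)          # handle nested splits first
        return ("recv", chan), sends + [("send", chan, inner)]
    if op in LEAVES:
        return plan, []
    children, sends = [], []
    for child in plan[1:]:
        new_child, child_sends = lift_split(child)
        children.append(new_child)
        sends.extend(child_sends)
    return (op,) + tuple(children), sends

# Query shipping: the selection runs on the sending side.
query_shipping = ("proj", ("join", ("table", "R"),
                           ("split", "c", ("sel", ("at", "S", "B")))))
# Data shipping: the raw table is sent and the selection runs on the receiving side.
data_shipping = ("proj", ("join", ("table", "R"),
                          ("sel", ("split", "c", ("at", "S", "B")))))
```

Applying `lift_split` to the two plans shows the difference between the alternatives: in the first the `send` carries `sel S@B`, in the second it carries `S@B` alone.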

Distribution and migration
Each node handles a “soup” of parallel processes:
    ( node A: e1 | e2 ) || ( node B: e3 | e4 | e5 )
Process migration:
    ( node A: e1 | migrate B e2 ) || ( node B: e3 )   →   ( node A: e1 ) || ( node B: e3 | e2 )
Migration will be used to move subordinate queries.
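The migration step can be sketched on a toy model of the network. The representation is an assumption of this sketch: a dictionary from node names to lists of processes (the “soup”), with a migrating process written as a hypothetical tuple `("migrate", target, body)`.

```python
def is_migrate(proc):
    """True for processes of the (hypothetical) form ("migrate", target, body)."""
    return isinstance(proc, tuple) and len(proc) == 3 and proc[0] == "migrate"

def migrate_step(network):
    """Move the body of every migrate process into the soup of its target node."""
    # Processes that are not migrating stay where they are.
    result = {node: [p for p in soup if not is_migrate(p)]
              for node, soup in network.items()}
    # Migrating processes are dropped at their origin and added at the target.
    for soup in network.values():
        for proc in soup:
            if is_migrate(proc):
                _, target, body = proc
                result.setdefault(target, []).append(body)
    return result

before = {"A": ["e1", ("migrate", "B", "e2")], "B": ["e3"]}
after = migrate_step(before)   # node A: e1   ||   node B: e3 | e2
```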

Query plan, continued
    node A: new c . proj (join tableR (recv c)) | send c (sel S@B)
  > meta-step (subordination) >
    node A: new c . proj (join tableR (recv c)) | migrate B (send c (sel S@B))
  > operational semantics step >
    new c@A . ( node A: proj (join tableR (recv c)) || node B: send c@A (sel S@B) )
Global channels vs. located channels: we can get by with located channels.

Distributed query evaluation
  > meta-step (discovery) >
    new c@A . ( node A: proj (join tableR (recv c)) || node B: send c@A (sel tableS) )
[Plan diagram labels: Client; node A; node B; proj; join; sel; R; S]
• All-push: send/recv should not block on the entire table. Data should be streamed.
• All-pull (“iterators”): recv/send blocks on each tuple; not good.
• Push-pull (queued) is the most reasonable.
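The push-pull (queued) option can be sketched with a bounded buffer between the two sides of the plan. Everything concrete here is an assumption for illustration: tables are lists of pairs, the selection keeps rows whose second field is at least 10, the join is on the first field, and `_DONE` is an end-of-stream marker that the calculus itself does not have.

```python
import queue
import threading

_DONE = object()  # end-of-stream marker (an assumption of this sketch)

def send_stream(rows, chan):
    """Node B side: push rows, blocking only while the buffer is full."""
    for row in rows:
        chan.put(row)
    chan.put(_DONE)

def recv_stream(chan):
    """Node A side: pull rows, blocking only while the buffer is empty."""
    while True:
        row = chan.get()
        if row is _DONE:
            return
        yield row

def run_plan(table_r, table_s, buffer_size=2):
    c = queue.Queue(maxsize=buffer_size)             # new c@A, a bounded queue
    selected = (s for s in table_s if s[1] >= 10)    # sel tableS (made-up predicate)
    node_b = threading.Thread(target=send_stream, args=(selected, c))
    node_b.start()                                   # node B: send c@A (sel tableS)
    # node A: proj (join tableR (recv c)), joining on the first field
    result = [(r[1], s[1])
              for s in recv_stream(c)
              for r in table_r if r[0] == s[0]]
    node_b.join()
    return result
```

With `buffer_size=2` the sender can run at most two tuples ahead of the receiver, so neither side waits for the whole table and neither side synchronizes on every single tuple.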

“Lay down the pipes … Turn on the faucets”
The approach of ubQL [Sahuguet & T., 2001]. Separate:
• Deployment phase: queries are split, channels are created, and subordinate query processes migrate, carrying channel names with them.
• Execution phase: data is streamed through channels and processed according to the subordinate queries.
• Adaptive behavior: execution can be interrupted and reset, channels can be flushed and reset, and alternative plans can be started.

Adaptive query processing
Another new primitive (like Cardelli’s “Web combinators”):
    adapt stream AltPlan   →   stream     if adequate
    adapt stream AltPlan   →   AltPlan    if drying up
Example:
    new c@A . ( node A: proj (join tableR (adapt (recv c) AltPlan)) || node B: send c@A (sel tableS) )
where, e.g.,
    AltPlan ≡ new c’@A . split c’ (sel S@C)
(redundancy: S is also available at node C)
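One possible reading of adapt can be sketched with Python generators. The details are assumptions, not the talk's semantics: a stream item of `None` stands for a tick in which no data arrived, the stream counts as “drying up” after `patience` consecutive empty ticks, and the alternative plan then takes over without any attempt to remove tuples that were already delivered.

```python
def adapt(stream, alt_plan, patience=2):
    """Yield from `stream` while it is adequate; switch to `alt_plan()` if it dries up.

    `stream` yields data items, or None for a tick without data (hypothetical
    convention). `alt_plan` is a zero-argument function returning an iterator.
    """
    idle = 0
    for item in stream:
        if item is None:
            idle += 1
            if idle >= patience:       # the stream is drying up
                yield from alt_plan()  # start the alternative plan
                return
        else:
            idle = 0                   # data arrived, so the stream is adequate
            yield item

def alt_plan():
    """Stand-in for AltPlan, e.g. a redundant source at another node."""
    return iter([10, 20])

healthy = list(adapt(iter([1, None, 2]), alt_plan))
stalled = list(adapt(iter([1, 2, None, None, 3]), alt_plan))
```

In the stalled case the item `3` is never read, because the original stream is abandoned once the alternative plan starts.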

Distributed database systems: 25+ years
A lot of useful techniques, especially optimizations to reduce bandwidth, e.g., the use of semijoins.
[Plan diagram labels: Client; node A; node B; proj; join; sel; proj; R; S]
These are also subordinate queries: not quite decomposition, but still easily expressible.
Essential limitation: a powerful, all-knowing node is needed to generate the query plans.
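The semijoin optimization mentioned on this slide can be sketched as follows. The sketch assumes tables are lists of tuples joined on their first field, with R at one node and S at another; the two counters stand in for the data that would cross the network.

```python
def semijoin_plan(table_r, table_s):
    """Join R and S while shipping only join keys and matching S rows.

    Returns (joined_rows, keys_shipped, s_rows_shipped).
    """
    # At R's node: project the distinct join keys and ship them to S's node.
    keys = {r[0] for r in table_r}
    # At S's node: keep only the S rows that can join, and ship those back.
    reduced_s = [s for s in table_s if s[0] in keys]
    # Back at R's node: the final join uses the reduced S.
    joined = [(r, s) for r in table_r for s in reduced_s if r[0] == s[0]]
    return joined, len(keys), len(reduced_s)

table_r = [(1, "x"), (2, "y")]
table_s = [(1, "a"), (3, "b"), (4, "c"), (1, "d")]
joined, keys_shipped, rows_shipped = semijoin_plan(table_r, table_s)
```

Here two keys travel one way and two S rows travel back, instead of all four S rows. The projection and selection that run at S's node are subordinate queries, even though they are not pieces cut out of the original query.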

Where is the data? Discovery!
• Distributed database systems: centralized, complete, consistent knowledge.
• No clue: go out and search everywhere (Gnutella)!
• Keyword-node indexes (catalogs) in each node.
• Finally something clever: distributed hash tables.
• Layered organization using views (simple version in mediated information integration: Tsimmis, Kleisli, Information Manifold, K2, Garlic, etc.).

Views can organize distributed data
Node A:
    query = expr( R@B, S1@C )
    V = expr( S1@C, S2@C )
Node B:
    R = expr( localTables, S2@C )
Node C:
    S1 = expr( localTables )
    S2 = expr( localTables )
Techniques: composition-with-views and rewriting-with-views.

New kinds of views
Generalize split:
    e1(spawn e2 e3)   →   e1(e2) | e3
    split c e ≡ spawn (recv c) (send c e)
View definition that invokes a remote continuous query:
    R = new cr . spawn (recv cr) (send cq cr)
Continuous query “installed” elsewhere:
    recv* cq x . send x (sel tableT)
Here cr is a data stream channel and cq is a standard pi-calculus channel.
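The interplay between the view R and the installed continuous query can be sketched with threads and queues. This is an illustration under assumptions: channels are `queue.Queue` objects, `_END` marks the end of a data stream, and sending `None` on cq is a made-up shutdown signal so the example terminates; none of these exist in the calculus.

```python
import queue
import threading

_END = object()  # end-of-stream marker on data channels (assumption)

def install_continuous_query(cq, table_t, predicate):
    """Sketch of: recv* cq x . send x (sel tableT)."""
    def serve():
        while True:                    # recv*: replicated receive on cq
            reply = cq.get()           # x: the name of a reply channel
            if reply is None:          # hypothetical shutdown signal
                return
            for row in table_t:        # send x (sel tableT), row by row
                if predicate(row):
                    reply.put(row)
            reply.put(_END)
    worker = threading.Thread(target=serve)
    worker.start()
    return worker

def view_r(cq):
    """Sketch of: R = new cr . spawn (recv cr) (send cq cr)."""
    cr = queue.Queue()                 # new cr: a fresh data stream channel
    cq.put(cr)                         # send cq cr: pass the channel name
    rows = []
    while True:                        # recv cr, consumed as a stream
        row = cr.get()
        if row is _END:
            return rows
        rows.append(row)

def demo():
    cq = queue.Queue()                 # standard channel carrying channel names
    worker = install_continuous_query(cq, [3, 8, 12, 5], lambda row: row > 4)
    first = view_r(cq)
    second = view_r(cq)                # the replicated query serves every request
    cq.put(None)
    worker.join()
    return first, second
```

Each evaluation of the view creates its own `cr`, so concurrent uses of R do not mix their result streams.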

Some tentative conclusions
• Very useful process primitives: parallel and sequential composition, migration, channels.
• Need a new behavior for certain channels: streaming data (at least at the language level; underneath we have bounded buffers/queues).
• Standard channel semantics is still useful for passing small values or channel names, e.g., for continuous queries.
• Some new primitives should be considered: split/spawn, adapt.

Where to?
Service calls and active (dynamic) data: done by Gardner and Maffeis.
Clarify the semantics of the execution phase so we can apply verification techniques for adaptive query plans.
In the presence of redundancy, building globally optimal plans seems hopeless. How to define local optimality? Is there even a concept of “good enough”?
Combine layers of views with distributed hashing at the level of schema?

The End

An interesting complication: services
Service calls in queries: calls to applications that provide additional processing that cannot be programmed in the query language, e.g., analysis tools, scientific (BLAST) or financial.
Service calls in data (active data): part of the data is to be retrieved only on demand, e.g., Active XML.
Service definition, result computation, and result consumption may all be in different nodes.

Mixing query features with process features
In principle we can take any query language (SQL, OQL, XQuery) and add the process primitives we saw. Really?
The query language had better be nicely “compositional” and “referentially transparent”. SQL: lots of special cases…
Even better is to also have a robust static type system.
Things can get complicated:
    foreach r in Dept@A collect
        foreach s in Emp@B where s.name = r.manager collect {s.age}
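To make the intended meaning of the nested foreach concrete, here is a purely local sketch. It assumes that `collect` takes the union of the singleton sets `{s.age}`, and it uses made-up sample data; it says nothing about where each loop should run, which is exactly the part that gets complicated when Dept is at node A and Emp is at node B.

```python
# Hypothetical sample data standing in for Dept@A and Emp@B.
dept_at_a = [
    {"name": "databases", "manager": "ann"},
    {"name": "languages", "manager": "bob"},
]
emp_at_b = [
    {"name": "ann", "age": 41},
    {"name": "bob", "age": 35},
    {"name": "cy", "age": 29},
]

def manager_ages(dept, emp):
    """foreach r in dept collect foreach s in emp where s.name = r.manager collect {s.age}"""
    return {s["age"]
            for r in dept
            for s in emp
            if s["name"] == r["manager"]}

ages = manager_ages(dept_at_a, emp_at_b)
```

The inner loop depends on `r.manager` from the outer loop, so a distributed plan has to decide whether to ship managers to node B, employees to node A, or something in between.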

Where is the relevant data? (2)
The basic algorithm can be iterated and made useful to databases (Ives). But what to do with irregular pre-existing data with complex schemas?

Some tentative conclusions (2)
• Services are modeled in the same framework with little effort. Service call arguments can also be carried as part of migrating processes.
• Did not show it, but subscriptions, continuous queries, caching, and proxying (query redirecting) can be approached too. Useful feature from process calculi: replicated processes.
• Well-designed query languages will mix nicely with process calculi primitives.
• Operational semantics steps and meta-steps must be interleaved (this makes me uncomfortable!).
