On Efficient Matching of Streaming XML Documents and Queries

Laks V.S. Lakshmanan (University of British Columbia, Canada), P. Sailaja (Indian Inst. of Tech., Bombay, India). Work performed while visiting IIT-Bombay.

Outline
• Motivating Applications
• Problem
• Dual Index
• Algorithms
• Experiments
• Summary & Future Work
EDBT 2002, Prague.

Motivating Application 1
• Information dissemination in the large
• Numerous data sources on the web
• Traditional means: search and browse
• Alternative: publish and subscribe
• System matches (new) data to subscribers' interests
• Periodic notification

Motivating Application 2
• Supply chain automation
• Catalog of products and services from suppliers (data)
• Registered sets of requirements (subscriptions) from manufacturing units
• Notify relevant consumers upon arrival of new data
• Other applications include electronic auctioning, online shopping, etc.

Problem
• Matching specifications (of products, services, etc.) to requirements (subscriptions) efficiently.
• Specs: akin to data.
• Requirements: akin to queries.
• Data may stream through.
• Quickly determine which subscribers/users a piece of data is relevant to.

Problem
• Traditional setting:
  • Large DB
  • One (at most a few) query at a time
• Our problem:
  • A small DB (a tuple, XML doc, etc.)
  • Large no. of queries
• Dual to traditional problem
• Focus of this paper: data = XML docs
• Queries = a fragment of XPath

Problem (Formalized)
• Given
  • an XML document
  • a large number of XPath queries
• Determine which queries are answered by each element (formalized using matching)
• Query labeling: label each node with the set of queries answered by the subtree rooted there
• Naïve approach doesn't scale with the number of queries
• Main challenge: keep the number of passes over the data tree small (1 or 2)

An Example Query
<Result>
  FOR $p IN document("catalog.xml")//part,
      $b IN $p/brand,
      $q IN $p//part
  WHERE A2D IN $q/name AND AMD IN $q/brand
  RETURN $p
</Result>

Problem (An Example)

Dual Index
• Traditional index: quickly localizes the search for data matching a query pattern
• Dual index: for each primitive pattern, determines the (sub)queries to which it is relevant
• Choice of primitive patterns depends on the type of data (e.g., XML vs. relational)
• And on the classes of queries considered (e.g., chains vs. trees)

Tree Dual Index
• Primitive "access path" questions to be answered:
  • For a constant c, what are its leaf appearances?
  • For a tag t, what are its non-leaf appearances? What query nodes are its pc- and ad-children?
• Example: index entries for tag a:
  • DI(a)[L]: (P, 3, {}), (Q, 6, {4,6})
  • DI(a)[N]: (P, 1, F, {2,3}, {}), (P, 4, T, {6}, {5})
[Figure: query trees P and Q, with numbered nodes tagged a, b, c, or *]
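As a rough illustration of how such an index could be populated, the sketch below walks a set of query trees and records, per tag, the leaf and non-leaf appearances. The query encoding (tuples of node id, tag, parent id, edge kind), the function name `build_dual_index`, and the example query R are all hypothetical and not taken from the talk. The boolean third field of the DI(t)[N] entries is omitted because the slides do not define it, and entries with an empty leaf set are not emitted.

```python
from collections import defaultdict

def build_dual_index(queries):
    """Build a simplified dual index (an assumption-laden sketch).

    queries: {name: (distinguished_node_id, nodes)}, where each node is
    (node_id, tag, parent_id, edge) and edge is "pc" (parent-child) or
    "ad" (ancestor-descendant); the query root has parent_id None.
    """
    leaf_index = defaultdict(list)     # tag -> [(query, dist_node, {leaf ids})]
    nonleaf_index = defaultdict(list)  # tag -> [(query, node, pc_children, ad_children)]
    for name, (dist, nodes) in queries.items():
        pc = defaultdict(set)
        ad = defaultdict(set)
        for node_id, _tag, parent, edge in nodes:
            if parent is not None:
                (pc if edge == "pc" else ad)[parent].add(node_id)
        leaves_by_tag = defaultdict(set)
        for node_id, tag, _parent, _edge in nodes:
            if pc[node_id] or ad[node_id]:
                # Non-leaf appearance: remember which children must be matched.
                nonleaf_index[tag].append((name, node_id, pc[node_id], ad[node_id]))
            else:
                leaves_by_tag[tag].add(node_id)
        for tag, leaves in leaves_by_tag.items():
            leaf_index[tag].append((name, dist, leaves))
    return leaf_index, nonleaf_index

# Hypothetical query R, roughly a[b]//c with node 3 (the c) as distinguished node.
example = {"R": (3, [(1, "a", None, None), (2, "b", 1, "pc"), (3, "c", 1, "ad")])}
leaf_index, nonleaf_index = build_dual_index(example)
```

Keying both maps by tag means that a data node only ever touches the entries for its own tag, which is the "dual" of looking up data by query pattern.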

Tree Labeling Algorithm: 3 Lists
• 3 lists (conceptually) per data node u:
  • TML(u): (Query, query node, DN, ans-node)
  • PL(u): (P, l, m, x):rel
  • QL(u): Query Ids
[Figure: data tree with nodes 1 to 15, tagged a1, b2, c3, c4, a5, d6, b7, a8, c9, d10, a11, c12, d13, b14, b15]

Tree Labeling Algo.: TML base case
• (P, m, {v1, …, vk}) ∈ DI(t)[L] ⇒ (P, v1, m, ?), …, (P, vk, m, ?) ∈ TML(u), whenever u.tag = t.
• If vi = m, ? ← u.
• e.g.: DI(a)[L] has (Q, 6, {4,6}). So add (Q,4,6,?) and (Q,6,6,?) to TML(u) for each data node u tagged a; (Q,6,6,?) → (Q,6,6,i), i = 1, 5, 8, 11.
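The base case can be replayed on the slide's own example. The sketch below is a minimal illustration, not the matchMaker code: the function name `tml_base` is hypothetical, `None` stands in for the "?" placeholder, and the node tags are read off the data tree figure. The tree shape is not needed for this step.

```python
# Tags of the 15 data nodes, as labelled in the slide's data tree figure.
data_tags = {1: "a", 2: "b", 3: "c", 4: "c", 5: "a", 6: "d", 7: "b", 8: "a",
             9: "c", 10: "d", 11: "a", 12: "c", 13: "d", 14: "b", 15: "b"}

# Leaf part of the dual index for tag a, copied from the slide:
# (query, distinguished node m, set of leaf query nodes tagged a).
leaf_index = {"a": [("P", 3, set()), ("Q", 6, {4, 6})]}

def tml_base(node_id, tag, leaf_index):
    """Base-case TML tuples (query, query node, m, ans-node) for one data node."""
    tuples = set()
    for query, dist, leaves in leaf_index.get(tag, []):
        for v in leaves:
            # The answer node is known only when the leaf is the distinguished node.
            ans = node_id if v == dist else None   # None plays the role of "?"
            tuples.add((query, v, dist, ans))
    return tuples

tml = {u: tml_base(u, t, leaf_index) for u, t in data_tags.items()}
a_nodes = sorted(u for u in tml if ("Q", 6, 6, u) in tml[u])
```

Here `a_nodes` collects the data nodes i for which (Q,6,6,i) was produced, matching the i = 1, 5, 8, 11 listed on the slide.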

Tree Labeling Algo.: TML → PL
• (P,l,m,x) ∈ TML(u) ⇒ (P,l,m,x):child ∈ PL(parent(u)) and (P,l,m,x):desc ∈ PL(anc(u)).
• e.g.: (Q,4,6,?) ∈ TML(5). So (Q,4,6,?):child ∈ PL(4), and (Q,4,6,?):desc ∈ PL(i), i = 3, 2, 1.
• Optimizations are possible but omitted here.

Tree Labeling Algo.: TML inductive case
• (P,l,B,C,D) ∈ DI(t)[N], and ∀c ∈ C: (P,c,m,y):child ∈ PL(u), and ∀d ∈ D: (P,d,m,y):rel ∈ PL(u) ⇒ (P,l,m,x) ∈ TML(u).
• If l = m, x ← u.
• e.g.: (P,4,T,{6},{5}) ∈ DI(a)[N]. (P,6,3,?) ∈ TML(12), so (P,6,3,?):child ∈ PL(11). Similarly, (P,5,3,?):desc ∈ PL(11). So (P,4,3,?) ∈ TML(11).

Tree Labeling Algo.: QL
• TML and PL feed each other.
• QL is a special case of TML.
• P ∈ QL(x) iff (P,1,m,x) ∈ TML(u) for some data node u.
• e.g.: (P,1,3,9) ∈ TML(1), so P ∈ QL(9); and (Q,1,6,5) ∈ TML(2), so Q ∈ QL(5).

Tree Labeling: Summary
• Labeling completed in two passes
  • pass 1: compute TML/PL (bottom-up)
  • pass 2: compute QL (top-down)
• Number of I/O invocations is 2 × (number of data tree nodes).
• Other algorithms in paper:
  • chain labeling
  • chain split labeling of trees
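The two passes can be put together in a compact in-memory sketch. This is a simplified reading of the slides, not the paper's algorithm as implemented: it ignores wildcards and the boolean field of DI(t)[N], keeps all lists in memory, splits PL into child and deeper-descendant sets, and lets a query root match at any data node. The function `label_tree`, the example tree, and the query R (roughly a[b]//c, distinguished node 3) are hypothetical.

```python
from collections import defaultdict

def label_tree(tags, children, root, leaf_index, nonleaf_index, queries):
    """Return QL as {data node: set of queries answered at that node}.

    queries: {name: (root query node id, distinguished query node id)}.
    """
    tml = defaultdict(set)       # node -> {(query, query node, m, ans-node)}
    pl_child = defaultdict(set)  # tuples propagated from children
    pl_desc = defaultdict(set)   # tuples propagated from deeper descendants

    def answers(pool, query, qnode):
        # Answer nodes (None = not yet known) of tuples for (query, qnode).
        return {x for (p, l, _m, x) in pool if p == query and l == qnode}

    def visit(u):                # pass 1: bottom-up (recursive, small trees only)
        for w in children.get(u, []):
            visit(w)
            pl_child[u] |= tml[w]
            pl_desc[u] |= pl_child[w] | pl_desc[w]
        # Base case: leaf appearances of this node's tag.
        for q, dist, leaves in leaf_index.get(tags[u], []):
            for v in leaves:
                tml[u].add((q, v, dist, u if v == dist else None))
        # Inductive case: every pc-child needs a child tuple,
        # every ad-child needs a tuple from any proper descendant.
        any_rel = pl_child[u] | pl_desc[u]
        for q, l, pc, ad in nonleaf_index.get(tags[u], []):
            found = [answers(pl_child[u], q, c) for c in pc]
            found += [answers(any_rel, q, d) for d in ad]
            if not all(found):
                continue
            dist = queries[q][1]
            if l == dist:
                xs = {u}
            else:
                xs = {x for s in found for x in s if x is not None} or {None}
            tml[u] |= {(q, l, dist, x) for x in xs}

    visit(root)

    ql = defaultdict(set)        # pass 2: top-down, read off root-level matches
    stack = [root]
    while stack:
        u = stack.pop()
        for q, l, _m, x in tml[u]:
            if l == queries[q][0] and x is not None:
                ql[x].add(q)
        stack.extend(children.get(u, []))
    return dict(ql)

# Hypothetical data tree: a1(b2, d3(c4), a5(c6)).
tags = {1: "a", 2: "b", 3: "d", 4: "c", 5: "a", 6: "c"}
children = {1: [2, 3, 5], 3: [4], 5: [6]}
leaf_index = {"b": [("R", 3, {2})], "c": [("R", 3, {3})]}
nonleaf_index = {"a": [("R", 1, {2}, {3})]}
result = label_tree(tags, children, 1, leaf_index, nonleaf_index, {"R": (1, 3)})
```

In this example node 5 has a c child but no b child, so only node 1 matches the query root, and both c descendants (4 and 6) are reported as answers. Each data node is visited once per pass, in line with the 2 × (number of nodes) bound on the slide.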

Experiments
• matchMaker implementation:
  • JDK1.3 and C++
  • storage: BerkeleyDB 3.17
  • dual index stored on disk
  • lists manipulated in memory
• Intel PIII, 1GB RAM, 512K cache, Linux 7.0
• Data sets:
  • generated using IBM's XML Gen tool
  • conforming to GEDCOM DTD (genealogical data) (about 120 elements)

Experiments
• Document depth 10; avg fanout in [2, 5]
• Chain labeling algorithm is at least 5 times faster than the query-at-a-time approach
• For tree labeling, query-at-a-time doesn't produce results in reasonable time!
• Focus of experiments (for trees):
  • Direct tree labeling algorithm vs. chain split algorithm (not discussed)

Experiments

Related Work
• Document to user-profile matching (IR)
• Notion of standing queries has a long history:
  • E.g., Tapestry, TriggerMan, NiagaraCQ, etc.
• Publish-and-subscribe: Fabret et al. 00, 01.
  • Patterns: boolean combinations of relOp comparisons with values
• XFilter 00, 01.
  • Only determines if a doc contains an answer
  • Multiple answers in one doc not considered

Related Work
• XTrie approach
  • Decompose query tree into ad-free chains
  • Index using trie
  • Determine only if a doc contains an answer
• Main distinguishing features of matchMaker:
  • Answers located
  • Multiple answers per doc
  • All proposed algorithms have guaranteed resource bounds (e.g., #passes, I/O)

Summary & Future Work
• Matching large no. of queries to XML data trees (as they stream through)
• Dual to usual query processing
• Dual index (chains vs. trees)
• Algorithms for query labeling of data trees
• Making algorithms more efficient (single pass algorithm for chains: done)
• Expanding classes of queries handled
• Algebra for this dual query processing problem?
