On Efficient Matching of Streaming XML Documents and Queries
Laks V.S. Lakshmanan (1), P. Sailaja (2)
(1) University of British Columbia, Canada
(2) Indian Inst. of Tech., Bombay, India
(work performed while visiting IIT-Bombay)
Outline
• Motivating Applications
• Problem
• Dual Index
• Algorithms
• Experiments
• Summary & Future Work
EDBT 2002, Prague.
Motivating Application 1
• Information dissemination in the large
• Numerous data sources on the web
• Traditional means: search and browse
• Alternative: publish and subscribe
• System matches (new) data to subscribers’ interests
• Periodic notification
Motivating Application 2
• Supply chain automation
• Catalog of products and services from suppliers (data)
• Registered sets of requirements (subscriptions) from manufacturing units
• Notify relevant consumers upon arrival of new data
• Other applications include electronic auctioning, online shopping, etc.
Problem
• Matching specifications (of products, services, etc.) to requirements (subscriptions) efficiently.
• Specs: akin to data.
• Requirements: queries.
• Data may stream through.
• Quickly determine which subscribers/users a piece of data is relevant to.
Problem
• Traditional setting:
  • Large DB
  • One (at most a few) query at a time
• Our problem:
  • A small DB (a tuple, XML doc, etc.)
  • Large no. of queries
  • Dual to traditional problem
• Focus of this paper: data = XML docs
• Queries = a fragment of XPath
Problem (Formalized)
• Given
  • an XML document
  • a large number of XPath queries
• Determine which queries are answered by each element (formalized using matching)
• Query labeling: label each node with sets of queries answered by the subtree rooted there
• Naïve approach doesn’t scale with the number of queries
• Main challenge: a small number (1 or 2) of passes over the data tree
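To make the scaling concern concrete, here is a minimal sketch of a query-at-a-time baseline. It is not the paper's code: the `Node` class, the `(axis, tag)` step encoding, and the names `eval_chain` and `label_naive` are all assumptions made for illustration, and only chain (path) queries are handled. Each query triggers its own traversal of the data tree, so the work grows with the number of queries.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    nid: int                  # hypothetical node identifier
    tag: str
    children: list = field(default_factory=list)

def descendants(node):
    """Yield all proper descendants of node."""
    for c in node.children:
        yield c
        yield from descendants(c)

def eval_chain(root, steps):
    """Evaluate one chain query; steps is a list of (axis, tag), axis '/' or '//'."""
    # A virtual document node above root lets the first step select root itself.
    frontier = [Node(0, "#doc", [root])]
    for axis, tag in steps:
        nxt = {}
        for n in frontier:
            cands = n.children if axis == "/" else descendants(n)
            for c in cands:
                if tag == "*" or c.tag == tag:
                    nxt[c.nid] = c
        frontier = list(nxt.values())
    return {n.nid for n in frontier}

def label_naive(root, queries):
    """Label each answer node with the ids of the queries it answers."""
    labels = {}
    for qid, steps in queries.items():      # one full evaluation per query
        for nid in eval_chain(root, steps):
            labels.setdefault(nid, set()).add(qid)
    return labels

# Illustrative data: part(1)[brand(2), part(3)[name(4), brand(5)]]
doc_root = Node(1, "part", [
    Node(2, "brand"),
    Node(3, "part", [Node(4, "name"), Node(5, "brand")]),
])
queries = {
    "P": [("//", "part"), ("/", "brand")],
    "Q": [("//", "part"), ("//", "part"), ("/", "name")],
    "R": [("/", "part"), ("/", "part")],
}
labels = label_naive(doc_root, queries)
```

With thousands of registered queries, the outer loop in `label_naive` repeats the tree traversal thousands of times, which is the cost the dual approach aims to avoid.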
An Example Query
<Result>
  FOR $p IN document("catalog.xml")//part,
      $b IN $p/brand,
      $q IN $p//part
  WHERE A2D IN $q/name AND AMD IN $q/brand
  RETURN $p
</Result>
Problem (An Example)
[Figure not preserved]
Dual Index
• Traditional index: quickly localize search for data matching query pattern
• Dual index: for each primitive pattern, determine the (sub)queries to which it is relevant
• Choice of primitive patterns depends on type of data (e.g., XML vs. relational)
• And on classes of queries considered (e.g., chains vs. trees)
Tree Dual Index
• Primitive “access path” questions to be answered:
  • For a constant c, what are leaf appearances?
  • For a tag t, what are non-leaf appearances? What query nodes are its pc- and ad-children?
• Example: index entry for a:
  • DI(a)[L]: (P, 3, {}), (Q, 6, {4,6})
  • DI(a)[N]: (P, 1, F, {2,3}, {}), (P, 4, T, {6}, {5})
[Figure: query trees P and Q with numbered nodes; tree structure not recoverable]
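The access-path questions above can be sketched as an inverted index over query nodes. This is a simplified illustration, not the paper's index layout: it reads pc and ad as parent-child and ancestor-descendant edges, omits the extra components visible in the slide's entries (the set in the leaf entries and the T/F flag in the non-leaf entries), and the `QNode` class, `build_dual_index`, and the two example queries are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class QNode:
    nid: int                  # node number within its query
    label: str                # tag or constant
    axis: str = "pc"          # edge from parent: "pc" or "ad" (assumed reading)
    children: list = field(default_factory=list)

def build_dual_index(queries):
    """Map each label to its leaf (L) and non-leaf (N) appearances in the queries."""
    di = defaultdict(lambda: {"L": [], "N": []})
    for qid, root in queries.items():
        stack = [root]
        while stack:
            q = stack.pop()
            if not q.children:
                # Leaf appearance: which query, which node.
                di[q.label]["L"].append((qid, q.nid))
            else:
                # Non-leaf appearance: also record pc- and ad-children.
                pc = sorted(c.nid for c in q.children if c.axis == "pc")
                ad = sorted(c.nid for c in q.children if c.axis == "ad")
                di[q.label]["N"].append((qid, q.nid, pc, ad))
                stack.extend(q.children)
    return di

# Hypothetical queries: P = a[b][.//c[a]], Q = b[.//a]
example_queries = {
    "P": QNode(1, "a", children=[
        QNode(2, "b", "pc"),
        QNode(3, "c", "ad", [QNode(4, "a", "pc")]),
    ]),
    "Q": QNode(1, "b", children=[QNode(2, "a", "ad")]),
}
di = build_dual_index(example_queries)
```

A lookup such as `di["a"]` then answers both questions for tag a in one step, regardless of how many queries are registered.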
Tree Labeling Algorithm – 3 Lists
[Figure: data tree with nodes labeled by tag and number: a 1, b 2, c 3, c 4, a 5, d 6, b 7, a 8, c 9, d 10, a 11, c 12, d 13, b 14, b 15; edges not recoverable]
3 lists (conceptually):
• TML(u): (Query, query node, DN, ans-node)
• PL(u): (P, l, m, x):rel
• QL(u): Query Ids
Tree Labeling Algo. – TML base case
[Figure: data tree with nodes 1 to 15]
• If (P, m, {v1, …, vk}) ∈ DI(t)[L], then (P, v1, m, ?), …, (P, vk, m, ?) ∈ TML(u), whenever u.tag = t.
• If vi = m, ? is set to u.
• e.g.: DI(a)[L] has (Q, 6, {4,6}). So, add (Q, 4, 6, ?) and (Q, 6, 6, ?) to TML(u) for each node u with tag a.
• (Q, 6, 6, ?) becomes (Q, 6, 6, i), i = 1, 5, 8, 11.
Tree Labeling Algo. – TML to PL
[Figure: data tree with nodes 1 to 15]
• If (P, l, m, x) ∈ TML(u), then (P, l, m, x):child ∈ PL(parent(u)) and (P, l, m, x):desc ∈ PL(anc(u)).
• e.g.: (Q, 4, 6, ?) ∈ PL(5). So, (Q, 4, 6, ?):child ∈ PL(4). And (Q, 4, 6, ?):desc ∈ PL(i), i = 3, 2, 1.
• Optimizations possible, but suppressed.
Tree Labeling Algo. – TML inductive case
[Figure: data tree with nodes 1 to 15]
• If (P, l, B, C, D) ∈ DI(t)[N], and for every c ∈ C: (P, c, m, y):child ∈ PL(u), and for every d ∈ D: (P, d, m, y):rel ∈ PL(u), then (P, l, m, x) ∈ TML(u).
• If l = m, x is set to u.
• e.g.: (P, 4, T, {6}, {5}) ∈ DI(a)[N]. (P, 6, 3, ?) ∈ TML(12), so (P, 6, 3, ?):child ∈ PL(11). Similarly, (P, 5, 3, ?):desc ∈ PL(11). So, (P, 4, 3, ?) ∈ TML(11).
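The interplay of the base case, the propagation to parents and ancestors, and the inductive case can be sketched as a single bottom-up traversal. This is a simplified illustration under stated assumptions, not the paper's algorithm: it records only which (query, query node) pairs match at each data node, drops the DN and ans-node components of the TML tuples, and uses hypothetical dictionary encodings for the data tree and the queries.

```python
def match_bottom_up(tree, root, queries):
    """Return M, where M[u] holds the (qid, qnode) pairs whose subpattern matches at u.

    tree:    {node_id: (tag, [child ids])}
    queries: {qid: {qnode: (label, [pc-child qnodes], [ad-child qnodes])}}
    """
    # Invert the queries by label, in the spirit of a dual index lookup.
    by_label = {}
    for qid, nodes in queries.items():
        for qn, (label, pc, ad) in nodes.items():
            by_label.setdefault(label, []).append((qid, qn, pc, ad))

    M, below = {}, {}   # below[u]: matches found at proper descendants of u

    def visit(u):
        tag, kids = tree[u]
        child_m, desc_m = set(), set()
        for k in kids:
            visit(k)
            child_m |= M[k]                # analogue of the ":child" entries
            desc_m |= M[k] | below[k]      # analogue of the ":desc" entries
        below[u] = desc_m
        M[u] = set()
        for qid, qn, pc, ad in by_label.get(tag, []) + by_label.get("*", []):
            # Leaves have empty pc/ad lists and match on the tag alone (base case).
            if all((qid, c) in child_m for c in pc) and \
               all((qid, d) in desc_m for d in ad):
                M[u].add((qid, qn))        # inductive case

    visit(root)
    return M

# Hypothetical data tree: a(1)[b(2), d(3)[c(4)]]
tree = {1: ("a", [2, 3]), 2: ("b", []), 3: ("d", [4]), 4: ("c", [])}
# Hypothetical queries: P = a[b][.//c], Q = a[c]
queries = {
    "P": {1: ("a", [2], [3]), 2: ("b", [], []), 3: ("c", [], [])},
    "Q": {1: ("a", [2], []), 2: ("c", [], [])},
}
M = match_bottom_up(tree, 1, queries)
```

Every data node is visited once, and the per-node work depends on the index entries for its tag rather than on a scan of all queries. In this example P matches at node 1, while Q does not because c is a descendant of node 1 but not a child.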
Tree Labeling Algo. – QL
[Figure: data tree with nodes 1 to 15]
• TML, PL feed each other.
• QL: special case of TML
• P ∈ QL(u) iff (P, 1, m, x) ∈ TML(x).
• e.g.: (P, 1, 3, 9) ∈ TML(1), so P ∈ QL(9).
• And (Q, 1, 6, 5) ∈ TML(2), so Q ∈ QL(5).
Tree Labeling – Summary
• Labeling completed in two passes
  • pass 1: compute TML/PL (bottom-up)
  • pass 2: compute QL (top-down)
• Number of I/O invocations is 2 × the number of data tree nodes.
• Other algorithms in paper:
  • chain labeling
  • chain split labeling of trees
Experiments
• matchMaker implementation:
  • JDK1.3 and C++
  • storage: BerkeleyDB 3.17
  • dual index stored on disk
  • lists manipulated in memory
• Intel PIII, 1GB RAM, 512K cache, Linux 7.0
• Data sets:
  • generated using IBM’s XML Gen tool
  • conforming to GEDCOM DTD (genealogical data) (about 120 elements)
Experiments
• Document depth 10; avg fanout: [2, 5]
• Chain labeling algorithm is at least 5 times faster than query-at-a-time approach
• For tree labeling, query-at-a-time doesn’t produce results in reasonable time!
• Focus of experiments (for trees):
  • Direct tree labeling algorithm vs. chain split algorithm (not discussed)
Experiments
[Three slides of result plots; figures not preserved]
Related Work
• Documents: user profile match (IR)
• Notion of standing queries has a long history:
  • E.g., Tapestry, TriggerMan, NiagaraCQ, etc.
• Publish-and-subscribe: Fabret et al. 00, 01.
  • Patterns: boolean combo of relOp comp value
• XFilter 00, 01.
  • Only determine if doc contains an answer
  • Multiple answers in one doc not considered
Related Work
• XTrie approach
  • Decompose query tree into ad-free chains
  • Index using trie
  • Determine only if a doc contains an answer
• Main distinguishing features of matchMaker:
  • Answers located
  • Multiple answers per doc
  • All proposed algorithms have guaranteed resource bounds (e.g., #passes, I/O)
Summary & Future Work
• Matching large no. of queries to XML data trees (as they stream through)
• Dual to usual query processing
• Dual index (chains vs. trees)
• Algorithms for query labeling of data trees
• Making algorithms more efficient (single pass algorithm for chains: done)
• Expanding classes of queries handled
• Algebra for this dual query processing problem?