Holistic Twig Joins: Optimal XML Pattern Matching

Holistic Twig Joins: Optimal XML Pattern Matching, ACM SIGMOD 2002

In this lecture • The Problem • Idea • Preliminaries • PathStack Algorithm • TwigStack Algorithm • Conclusions

The problem • Find semantically connected data in an XML document efficiently. • Many intermediate results are produced that do not participate in the final answers.

The problem (example) • For example, we have this XQuery expression: • book[title = ‘XML’]//author[fn = ‘jane’ and ln = ‘doe’] • We can translate it into a twig (small tree) pattern • [Figure: twig pattern rooted at book, with a title branch (value XML) and an author branch whose children are fn (value jane) and ln (value doe)]

The problem (example) • To solve this problem we have to • Find all binary relationships, like (book, title) and (author, fn) • Connect all the matches we have found to compile the answer • The problem is that every book has a title, but only some of them have the title ‘XML’, so we produce many intermediate answers that do not participate in the final answer.

Idea • The main idea of the paper is to store intermediate results in a compact way. • To develop algorithms whose cost is independent of the size of the intermediate results. • A family of stack-based algorithms was invented for this purpose.

Representing position of elements • Every node in the XML document is represented as • Leaf (text value): 3-tuple (DocId, LeftPos, LevelNum) • Node (element): 3-tuple (DocId, LeftPos : RightPos, LevelNum)

Representing position of elements • For example • [Figure: XML tree of a book document with every node annotated by its position, e.g. book (1,1:31,1), title (1,2:4,2), XML (1,3,3), authors (1,5:30,2); the authors element contains three author elements, each with fn and ln children holding the values jane poe, john doe and jane doe]


Representing position of elements • Benefits: • Easy to determine • ancestor-descendant relationship • a node n1(D1,L1:R1,N1) is a descendant of node n2(D2,L2:R2,N2) iff D1=D2, L2<L1 and R1<R2 • parent-child relationship • a node n1(D1,L1:R1,N1) is the parent of node n2(D2,L2:R2,N2) iff D1=D2, L1<L2, R2<R1 and N1+1=N2 • [Figure: excerpt of the example tree: book (1,1:31,1), fn (1,7:9,3), ln (1,10:12,3), poe (1,11,4)]
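The two containment tests can be sketched in a few lines of Python. This is only an illustration: the `Pos` type and the function names are made up for this example, and a leaf is assumed to be stored with RightPos equal to LeftPos so that one tuple shape covers both cases.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    """Positional representation (DocId, LeftPos:RightPos, LevelNum)."""
    doc: int
    left: int
    right: int  # assumption: for a leaf we store right == left
    level: int


def is_ancestor(a: Pos, d: Pos) -> bool:
    """True if a is an ancestor of d: same document, a's interval strictly contains d's."""
    return a.doc == d.doc and a.left < d.left and d.right < a.right


def is_parent(p: Pos, c: Pos) -> bool:
    """True if p is the parent of c: an ancestor exactly one level above."""
    return is_ancestor(p, c) and p.level + 1 == c.level


# Positions taken from the example tree on the slides.
book = Pos(1, 1, 31, 1)
title = Pos(1, 2, 4, 2)
xml = Pos(1, 3, 3, 3)
ln = Pos(1, 10, 12, 3)
poe = Pos(1, 11, 11, 4)
```

With these values, book is an ancestor of every other node, the parent of title only, and ln is the parent of poe. No tree traversal is needed: each test is a constant number of integer comparisons.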

Representing position of elements • Available cases:

Matching stream • A stream Tq contains the positional representations of the database nodes that match the query node q • The nodes in the stream are sorted by (DocId, LeftPos)

Matching stream (example) • The operations available on the streams • eof, advance, next, nextL, nextR • [Figure: the example book tree with the streams Tauthor and Tjane highlighted; Tauthor holds the three author elements (LeftPos:RightPos 6:13, 14:21, 22:29) and Tjane holds the two jane values (LeftPos 8 and 24)]
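A stream with these five operations can be modelled as a sorted list plus a cursor. The slide only names the operations, so the semantics below are an assumption: `next` returns the current head without consuming it, `nextL` and `nextR` return the head's LeftPos and RightPos, and `advance` moves the cursor. The `Stream` class is a hypothetical helper, and elements are plain `(DocId, LeftPos, RightPos)` tuples.

```python
class Stream:
    """Cursor over the elements matching one query node, sorted by (DocId, LeftPos)."""

    def __init__(self, elems):
        # Tuples compare lexicographically, so sorting orders by DocId, then LeftPos.
        self.elems = sorted(elems)
        self.cursor = 0

    def eof(self):
        """True when every element has been consumed."""
        return self.cursor >= len(self.elems)

    def advance(self):
        """Move the cursor to the following element."""
        self.cursor += 1

    def next(self):
        """Current head of the stream (assumed not to consume it)."""
        return self.elems[self.cursor]

    def nextL(self):
        """LeftPos of the current head."""
        return self.elems[self.cursor][1]

    def nextR(self):
        """RightPos of the current head."""
        return self.elems[self.cursor][2]


# Tauthor from the example, given out of order on purpose.
t_author = Stream([(1, 14, 21), (1, 6, 13), (1, 22, 29)])
```

After construction `t_author.nextL()` is 6; calling `advance()` three times makes `eof()` true.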

Linked stacks • Idea: • Repeatedly construct stacks that contain partial and total answers • Remove partial answers that cannot be extended to total answers

Linked stacks (example) • [Figure: a query path over A, B, C; the data nodes A1, B1, A2, B2, C1; and the stack encoding of the matches] • Query results: (A1, B1, C1), (A1, B2, C1), (A2, B2, C1)

Stack based algorithms • The stack-based algorithms use a chain of linked stacks to compactly represent partial and full results

PathStack algorithm • [Figure: query path A, B, C with input streams TA, TB, TC; data nodes (LeftPos:RightPos): A1 1:9, B1 2:8, A2 3:7, B2 4:6, C1 5]

PathStack algorithm • Always take the element with the smallest LeftPos • [Figure: query path A, B, C with streams TA, TB, TC and stacks SA, SB, SC; data nodes A1 1:9, B1 2:8, A2 3:7, B2 4:6, C1 5; the stack encoding after all elements are pushed] • Query results: (A1, B1, C1), (A1, B2, C1), (A2, B2, C1)
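The walk-through above can be turned into a compact Python sketch. It is a simplified reading of PathStack, not the paper's pseudocode: it assumes a linear query whose edges are all ancestor-descendant, represents elements as `(LeftPos, RightPos)` pairs within one document, stores leaves with RightPos equal to LeftPos, and skips an element when its parent stack is empty. The name `path_stack` and the list-based stacks are choices made for this example.

```python
def path_stack(streams):
    """Match a path query q1//q2//...//qn.

    streams[i] is the list of (left, right) elements for query node i,
    sorted by left; streams[0] belongs to the query root.
    """
    n = len(streams)
    cursors = [0] * n
    # Each stack entry is (element, index of the top of the parent stack at push time).
    stacks = [[] for _ in range(n)]
    results = []

    def expand(level, idx):
        """Enumerate all root-to-level tuples encoded by stacks[level][idx]."""
        elem, ptr = stacks[level][idx]
        if level == 0:
            yield (elem,)
            return
        # Every parent-stack entry at or below the pointer is an ancestor of elem.
        for j in range(ptr + 1):
            for prefix in expand(level - 1, j):
                yield prefix + (elem,)

    # Stop once the leaf stream is exhausted: no new answer can appear after that.
    while cursors[-1] < len(streams[-1]):
        live = [i for i in range(n) if cursors[i] < len(streams[i])]
        # Always take the element with the smallest LeftPos.
        qmin = min(live, key=lambda i: streams[i][cursors[i]][0])
        elem = streams[qmin][cursors[qmin]]
        cursors[qmin] += 1
        # Pop every stacked element that ends before the new one starts.
        for stack in stacks:
            while stack and stack[-1][0][1] < elem[0]:
                stack.pop()
        if qmin == 0 or stacks[qmin - 1]:
            ptr = len(stacks[qmin - 1]) - 1 if qmin else -1
            stacks[qmin].append((elem, ptr))
            if qmin == n - 1:
                # A leaf element completes answers; emit them and drop the leaf.
                results.extend(expand(qmin, len(stacks[qmin]) - 1))
                stacks[qmin].pop()
    return results


# Data from the slide: A1 1:9, B1 2:8, A2 3:7, B2 4:6, C1 5.
TA = [(1, 9), (3, 7)]
TB = [(2, 8), (4, 6)]
TC = [(5, 5)]
answers = path_stack([TA, TB, TC])
# answers holds (A1,B1,C1), (A1,B2,C1), (A2,B2,C1) as (left, right) pairs.
```

The stacks never hold more than one root-to-leaf chain of the data at a time, which is why the space used does not depend on the number of answers: the three results above are all encoded by two entries in SA, two in SB and one in SC.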

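PathStack algorithm • Add C2 here • Elements with RightPos < LeftPos of the new element are popped from the stacks • [Figure: query path A, B, C with streams TA, TB, TC and stacks SA, SB, SC; data nodes A1 1:10, B1 2:9, A2 3:7, B2 4:6, C1 5, C2 8; stack encoding before and after C2 arrives] • Query results: (A1, B1, C1), (A1, B2, C1), (A2, B2, C1), (A1, B1, C2)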

PathStack algorithm problems • To match a twig we have to split it into several root-to-leaf paths and join the path results • Again we have intermediate results that do not participate in the final result • [Figure: query twig author with children fn (jane) and ln (doe), next to the data: authors (5:30) containing three author elements (6:13), (14:21), (22:29) with fn (7:9), (15:17), (23:25) and ln (10:12), (18:20), (26:28); values jane (8), poe (11), john (16), doe (19), jane (24), doe (27)]

TwigStack Algorithm • Idea • Before pushing a node onto its stack, check that it has descendants that satisfy the twig pattern. • While checking the children, their own children are checked too. • Now we can be sure that every path result is joinable with at least one other path result and participates in at least one full answer.
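The recursive check can be illustrated with a deliberately naive Python sketch. It is not TwigStack's streaming procedure: instead of advancing stream cursors it rescans each child stream, so it only shows which elements the check lets through. All names (`QNode`, `has_extension`) are invented for this example, every query edge is treated as ancestor-descendant, and leaves are stored with RightPos equal to LeftPos.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QNode:
    """A node of the twig pattern."""
    tag: str
    children: List["QNode"] = field(default_factory=list)


def has_extension(elem, qnode, streams):
    """True if elem (a match of qnode) has, for every child query node,
    a descendant in that child's stream which itself passes the same check."""
    left, right = elem
    for child in qnode.children:
        if not any(
            left < l and r < right and has_extension((l, r), child, streams)
            for (l, r) in streams[child.tag]
        ):
            return False
    return True


# Twig: author with fn = jane and ln = doe (values modelled as leaf query nodes).
twig = QNode("author", [QNode("fn", [QNode("jane")]), QNode("ln", [QNode("doe")])])

# Streams from the authors example, as (LeftPos, RightPos).
streams = {
    "author": [(6, 13), (14, 21), (22, 29)],
    "fn": [(7, 9), (15, 17), (23, 25)],
    "ln": [(10, 12), (18, 20), (26, 28)],
    "jane": [(8, 8), (24, 24)],
    "doe": [(19, 19), (27, 27)],
}

pushed = [a for a in streams["author"] if has_extension(a, twig, streams)]
```

Only the author at 22:29 passes. The author at 6:13 has fn = jane but ln = poe, and the author at 14:21 has ln = doe but fn = john, so neither would be pushed and neither produces a useless path result.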

TwigStack Algorithm • [Figure: the authors example again: authors (5:30) containing three author elements (6:13), (14:21), (22:29) with fn (7:9), (15:17), (23:25) and ln (10:12), (18:20), (26:28); values jane (8), poe (11), john (16), doe (19), jane (24), doe (27); query twig author with fn (jane) and ln (doe)]


Conclusions • The PathStack and TwigStack algorithms are efficient in terms of the amount of intermediate results • But: • They are only effective for finding ancestor-descendant relationships. • If the twig also contains parent-child relationships, then not all nodes that are pushed onto the stacks participate in the final result.

Break?

Querying Structured Text in an XML Database, ACM SIGMOD 2003

In this lecture • Abstract • Introduction • Motivation • Algebra • Access methods • Conclusions

Abstract • XML databases often contain documents with structured text • It is important to integrate “information retrieval” (IR) style query evaluation • IR is well studied for natural-language documents • But in the case of XML the relevant data could reside in the descendants of an element.

Introduction • Boolean style queries (XQuery) • Useful when users are aware of the underlying schema • But • Users often don’t know the schema • And collections of XML documents are frequently heterogeneous.

Introduction • So we have to use relevance ranking in order to define IR on XML • Problem: traditional IR is “document-centric” • XML IR should • Be much more fine-grained • Take document structure into account • Allow more complex analysis than just determining relevance

Motivation • We have the following XML document named article.xml • [Figure: tree of article.xml. The article (#a1) has an article-title “Internet Technologies”, an author (#a3) with fname “Jane” and sname “Doe”, and chapters with chapter titles (ct) “Caching and Replication” and “Search and Retrieval”. Chapter #a10 contains the sections #a12, #a14 and #a16 with section-titles “Search Engine”, “Information Retrieval” and “Examples”. The paragraphs (p) #a18, #a19 and #a20 include the texts “semantic information retrieval techniques are also being incorporated into some search engines …”, “Here are some IR based Search Engines: …” and “…search engine NewSearch uses a new information retrieval technology”]

Motivation • Consider the query • Find document components in article.xml that are about “search engine”. Relevance to “internet” and “information retrieval” is desirable but not necessary. • Using AND and OR predicates will not give us the desired result


Motivation • Illustrating the granularity problem • Which elements should be ranked? • If we rank whole articles • The user sees the entire article, although the relevant information is concentrated in the third chapter only • If we rank paragraphs • The paragraphs of the last section are returned separately • The semantic linkage is broken and has to be reconstructed by the user

Motivation • IR-style XML queries don’t have to stand alone • If the user knows the structure of the XML document, he can add structural constraints and limit the number of uninteresting results

Algebra • We want to fold into a database framework the notion of relevance scoring and ranking

Algebra • Scored Data Tree • Definition: • A rooted ordered tree in which each node carries attribute-value pairs, including at least a tag and a real-valued score • The score of a tree is the score of its root node • Example: a tree with the nodes article[3.6] #a1, author #a3, sname #a5 and section[3.6] #a16
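As a data structure, a scored data tree needs little more than a tag, a score and an ordered child list per node. The sketch below is one possible encoding: the class and field names are hypothetical, the score is optional because the example shows some nodes without one, and the nesting of the example nodes (sname under author, author and section under article) is an assumption based on the tags.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScoredNode:
    """One node of a scored data tree."""
    tag: str
    node_id: str
    score: Optional[float] = None  # None when the slide shows no score
    children: List["ScoredNode"] = field(default_factory=list)  # ordered


def tree_score(root: ScoredNode) -> Optional[float]:
    """The score of a tree is the score of its root node."""
    return root.score


# The example tree; the parent-child nesting is assumed, not given on the slide.
example = ScoredNode("article", "#a1", 3.6, [
    ScoredNode("author", "#a3", None, [ScoredNode("sname", "#a5")]),
    ScoredNode("section", "#a16", 3.6),
])
```

Here `tree_score(example)` is 3.6, the score of the article root.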

Algebra • Scored Pattern Tree • Definition: • P = (T, F, S) • T: a node-labeled and edge-labeled tree • F: a formula that is a boolean combination of predicates applicable to nodes • S: a set of scoring functions

Algebra • Scored Pattern Tree Example: • Query2: Find document components in article.xml that are part of an article written by an author with last name “Doe” and are about “search engine”. Relevance to “internet” and “information retrieval” is desirable but not necessary. • T: $1 has a pc edge to $2 and an ad* edge to $4; $2 has a pc edge to $3 • F: $1.tag=article & $2.tag=author & $3.tag=sname & $3.content = “Doe” • S: $4.score = { ScoreFoo({“search engine”},{“internet”,“information retrieval”})}, $1.score = $4.score

Algebra • Common operators • Selection => Scored Selection • Projection => Scored Projection • Join => Scored Join • New Operators • Threshold • Pick

Algebra (New Operators) • Threshold TC %a > ... • T: $1 has a pc edge to $2 and an ad* edge to $4; $2 has a pc edge to $3 • F: $1.tag=article & $2.tag=author & $3.tag=sname & $3.content = “Doe” • S: $4.score = { ScoreFoo({“search engine”},{“internet”,“information retrieval”})}, $1.score = $4.score • [Figure: collection of scored trees matched by the pattern, each made of article[3.6], author, sname and section[3.6] nodes]

Algebra (New Operators) • Pick PC • T: $1 has a pc edge to $2 and an ad* edge to $4; $2 has a pc edge to $3 • F: $1.tag=article & $2.tag=author & $3.tag=sname & $3.content = “Doe” • S: $4.score = { ScoreFoo({“search engine”},{“internet”,“information retrieval”})}, $1.score = $4.score • [Figure: collection of scored trees matched by the pattern, each made of article[3.6], author, sname and section[3.6] nodes]

Algebra (New Operators) • Pick Example: • Pick Condition • Data is relevant if: • score > 0.8 • more than 50% of children are relevant • its direct parent node is not picked • Data Tree: article[5.6] #a1, title[0.6] #a2, sname #a5, chapter[5.0] #a10, section[0.8] #a12, title[0.8] #a13, section[0.6] #a14, title[0.6] #a15, section[3.6] #a16, p[0.8] #a18, p[1.4] #a19, p[1.4] #a20
