
Holistic Twig Joins: Optimal XML Pattern Matching

Holistic Twig Joins: Optimal XML Pattern Matching. ACM SIGMOD 2002. In this lecture: The Problem, Idea, Preliminaries, PathStack Algorithm, TwigStack Algorithm, Conclusions. The problem: to find semantically connected data in an XML document efficiently.


Presentation Transcript


  1. Holistic Twig Joins: Optimal XML Pattern Matching ACM SIGMOD 2002

  2. In this lecture • The Problem • Idea • Preliminaries • PathStack Algorithm • TwigStack Algorithm • Conclusions

  3. The problem • To find semantically connected data in an XML document efficiently. • Many intermediate results are produced that do not participate in the final answers.

  4. The problem (example) • For example we have this XQuery expression: • book[ title = ‘XML’ ] // author [ fn = ‘jane’ and ln = ‘doe’] • We can translate it to a twig (small tree) pattern (figure: twig with root book, child title = XML, and descendant author with children fn = jane and ln = doe)

  5. The problem (example) • To solve this problem we have to • Find all binary relationships like (book, title) and (author, fn) • Join the matches we have found to compile the complete answer. • The problem is that every book has a title but only some of them have the title ‘XML’, so we produce many intermediate results that do not participate in the final answer.

  6. In this lecture • The Problem • Idea • Preliminaries • PathStack Algorithm • TwigStack Algorithm • Conclusions

  7. Idea • The main idea of the paper is to store intermediate results in a compact way. • To develop an algorithm whose cost is independent of the size of the intermediate results. • A family of stack-based algorithms was invented for this purpose.

  8. In this lecture • The Problem • Idea • Preliminaries • PathStack Algorithm • TwigStack Algorithm • Conclusions

  9. Representing position of elements • Every node in the XML document is represented as • Leaf: 3-tuple (DocId, LeftPos, LevelNum) • Node: 3-tuple (DocId, LeftPos : RightPos, LevelNum)

  10. Representing position of elements • For example (figure: document tree annotated with positions) • book (1,1:31,1) • title (1,2:4,2) containing XML (1,3,3) • authors (1,5:30,2) • author (1,6:13,2) with fn (1,7:9,3) = jane (1,8,4) and ln (1,10:12,3) = poe (1,11,4) • author (1,14:21,2) with fn (1,15:17,3) = john (1,16,4) and ln (1,18:20,3) = doe (1,19,4) • author (1,22:29,2) with fn (1,23:25,3) = jane (1,24,2) and ln (1,26:28,2) = doe (1,27,2)

  11. Representing position of elements • For example

  12. Representing position of elements • Benefits: • Easy to determine • ancestor-descendant relationship • a node n1(D1,L1:R1,N1) is a descendant of node n2(D2,L2:R2,N2) iff D1=D2, L2<L1 and R1<R2 • parent-child relationship • a node n1(D1,L1:R1,N1) is a child of node n2(D2,L2:R2,N2) iff D1=D2, L2<L1, R1<R2 and N1=N2+1 (figure: book (1,1:31,1) with descendants fn (1,7:9,3), ln (1,10:12,3) and poe (1,11,4))
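The two containment tests can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the names `Pos`, `is_descendant` and `is_child` are hypothetical, and a leaf that has only a LeftPos is assumed to have RightPos equal to LeftPos.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    """Positional representation (DocId, LeftPos:RightPos, LevelNum)."""
    doc_id: int
    left: int
    right: int  # assumption: for a leaf, right == left
    level: int


def is_descendant(n1: Pos, n2: Pos) -> bool:
    """True if n1 lies strictly inside n2 in the same document."""
    return n1.doc_id == n2.doc_id and n2.left < n1.left and n1.right < n2.right


def is_child(n1: Pos, n2: Pos) -> bool:
    """True if n1 is a descendant of n2 exactly one level below it."""
    return is_descendant(n1, n2) and n1.level == n2.level + 1


# Positions taken from the example tree on the slides.
book = Pos(1, 1, 31, 1)
title = Pos(1, 2, 4, 2)
fn = Pos(1, 7, 9, 3)
poe = Pos(1, 11, 11, 4)

print(is_descendant(poe, book))  # poe is inside book
print(is_child(title, book))     # title is one level below book
print(is_child(fn, book))        # fn is a descendant but not a child
```

Both tests are constant-time comparisons, which is why the representation makes structural joins cheap.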

  13. Representing position of elements • Available cases:

  14. Matching stream • A stream Tq contains the positional representations of the database nodes that match the query node q • The nodes in the stream are sorted by (DocId, LeftPos)

  15. Matching stream (example) • The operations available on the streams • eof, advance, next, nextL, nextR (figure: the example document tree with the matches of two streams highlighted) • Tauthor: author (1,6:13,2), author (1,14:21,2), author (1,22:29,2) • Tjane: jane (1,8,4), jane (1,24,2)
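A stream with these five operations could look roughly like the sketch below. The slide only names the operations, so their behaviour here is an assumption: `next` returns the current element without consuming it, `next_l` and `next_r` return its LeftPos and RightPos, and `advance` moves the cursor. The class name `Stream` and the tuple layout `(doc_id, left, right, level)` are hypothetical.

```python
class Stream:
    """Cursor over the nodes matching one query node, sorted by (DocId, LeftPos)."""

    def __init__(self, nodes):
        # Each node is a tuple (doc_id, left, right, level).
        self._nodes = sorted(nodes, key=lambda n: (n[0], n[1]))
        self._pos = 0

    def eof(self):
        """True once every node has been consumed."""
        return self._pos >= len(self._nodes)

    def advance(self):
        """Move the cursor to the following node."""
        self._pos += 1

    def next(self):
        """Return the current node without consuming it (assumed semantics)."""
        return self._nodes[self._pos]

    def next_l(self):
        """LeftPos of the current node."""
        return self.next()[1]

    def next_r(self):
        """RightPos of the current node."""
        return self.next()[2]


# The author matches from the example tree, given out of order on purpose.
t_author = Stream([(1, 14, 21, 2), (1, 22, 29, 2), (1, 6, 13, 2)])
lefts = []
while not t_author.eof():
    lefts.append(t_author.next_l())
    t_author.advance()
print(lefts)
```

Keeping the stream sorted is what lets the stack-based algorithms process all streams in a single forward pass.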

  16. Linked stacks • Idea: • Repeatedly construct stacks that contain partial and total answers • Remove partial answers that cannot be extended to total answers

  17. Linked stacks (example) • Query: A, B, C • Data: A1, B1, A2, B2, C1 (nested in this order) • Stack encoding: one stack per query node, holding A1 and A2, B1 and B2, and C1 • Query results: A1 B1 C1, A1 B2 C1, A2 B2 C1

  18. In this lecture • The Problem • Idea • Preliminaries • PathStack Algorithm • TwigStack Algorithm • Conclusions

  19. Stack based algorithms • The stack-based algorithms use a chain of linked stacks to compactly represent partial and full results

  20. PathStack algorithm • Query: A, B, C with streams TA, TB, TC • Data: A1 (1:9), B1 (2:8), A2 (3:7), B2 (4:6), C1 (5)

  21. PathStack algorithm • Query: A, B, C with streams TA, TB, TC and stacks SA, SB, SC • Data: A1 (1:9), B1 (2:8), A2 (3:7), B2 (4:6), C1 (5) • Always take an element with the smallest LeftPos • Stack encoding: SA holds A1 (1:9) and A2 (3:7), SB holds B1 (2:8) and B2 (4:6), SC holds C1 (5) • Query results: A1 B1 C1, A1 B2 C1, A2 B2 C1

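  22. PathStack algorithm • Query: A, B, C with streams TA, TB, TC and stacks SA, SB, SC • Data: A1 (1:10), B1 (2:9), A2 (3:7), B2 (4:6), C1 (5), C2 (8) • Add C2 here: A2 (3:7) and B2 (4:6) are removed from their stacks because their RightPos < LeftPos of C2 • Stack encoding: SA holds A1 (1:10), SB holds B1 (2:9), SC holds C2 (8) • Query results: A1 B1 C1, A1 B2 C1, A2 B2 C1, A1 B1 C2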
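The walk-through above can be reproduced with a simplified PathStack sketch in Python. It is an illustration under stated assumptions rather than the paper's pseudocode: every query edge is treated as ancestor-descendant, all nodes are assumed to come from one document, a leaf has RightPos equal to LeftPos, and an element whose parent stack is empty is skipped instead of pushed. The names `Node` and `path_stack` are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    left: int
    right: int  # assumption: for a leaf, right == left


def path_stack(streams):
    """streams[i] holds the matches of the i-th query node (root first), sorted by left."""
    n = len(streams)
    cursors = [0] * n
    # Each stack entry is (node, index of the top of the parent stack at push time).
    stacks = [[] for _ in range(n)]
    results = []

    def expand(level, idx):
        """Yield every root-to-level path encoded by entries 0..idx of stacks[level]."""
        for i in range(idx + 1):
            node, ptr = stacks[level][i]
            if level == 0:
                yield (node.name,)
            else:
                for prefix in expand(level - 1, ptr):
                    yield prefix + (node.name,)

    while cursors[n - 1] < len(streams[n - 1]):  # stop when the leaf stream is exhausted
        # Always take the element with the smallest LeftPos.
        q = min((i for i in range(n) if cursors[i] < len(streams[i])),
                key=lambda i: streams[i][cursors[i]].left)
        node = streams[q][cursors[q]]
        cursors[q] += 1
        # Pop entries that end before the new element starts (RightPos < LeftPos).
        for stack in stacks:
            while stack and stack[-1][0].right < node.left:
                stack.pop()
        if q > 0 and not stacks[q - 1]:
            continue  # no open ancestor: the element cannot be part of an answer
        ptr = len(stacks[q - 1]) - 1 if q > 0 else -1
        if q == n - 1:
            # A leaf match: output all answers it completes; it is never kept on a stack.
            prefixes = expand(q - 1, ptr) if q > 0 else [()]
            results.extend(prefix + (node.name,) for prefix in prefixes)
        else:
            stacks[q].append((node, ptr))
    return results


t_a = [Node("A1", 1, 10), Node("A2", 3, 7)]
t_b = [Node("B1", 2, 9), Node("B2", 4, 6)]
t_c = [Node("C1", 5, 5), Node("C2", 8, 8)]
answers = path_stack([t_a, t_b, t_c])
for answer in answers:
    print(" ".join(answer))
```

The stacks never hold more entries than the depth of the document, while the pointers let `expand` enumerate every answer without materialising intermediate join results.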

  23. PathStack algorithm problems • To match a twig we have to split it into several root-to-leaf paths and join the path results • Again we have intermediate results that do not participate in the final result (figure: the twig query author with fn = jane and ln = doe, next to the data) • Data: authors (5:30) • author (6:13) with fn (7:9) = jane (8) and ln (10:12) = poe (11) • author (14:21) with fn (15:17) = john (16) and ln (18:20) = doe (19) • author (22:29) with fn (23:25) = jane (24) and ln (26:28) = doe (27)

  24. In this lecture • The Problem • Idea • Preliminaries • PathStack Algorithm • TwigStack Algorithm • Conclusions

  25. TwigStack Algorithm • Idea • Before pushing a node onto its stack, check that it has descendants that satisfy the twig pattern below it. • The check is recursive: the descendants of those descendants are checked too. • Now we can be sure that every path result is joinable with at least one other path result and participates in at least one full answer.
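The recursive check behind this idea can be illustrated with a small Python sketch. It is deliberately not the streaming procedure of the paper: it rescans whole streams for every candidate, which a real implementation avoids. All edges are assumed to be ancestor-descendant, and the names `Span` and `has_extension` are hypothetical. The data is the authors subtree from the preceding example, and the query is author with fn = jane and ln = doe.

```python
from typing import NamedTuple


class Span(NamedTuple):
    left: int
    right: int  # assumption: for a leaf, right == left


def has_extension(node, q, children, streams):
    """True if, for every child of query node q, `node` contains a match that itself has an extension."""
    for child_q in children.get(q, []):
        if not any(
            node.left < cand.left and cand.right < node.right
            and has_extension(cand, child_q, children, streams)
            for cand in streams[child_q]
        ):
            return False
    return True


# Twig query: author -> fn -> jane and author -> ln -> doe.
children = {"author": ["fn", "ln"], "fn": ["jane"], "ln": ["doe"]}
streams = {
    "author": [Span(6, 13), Span(14, 21), Span(22, 29)],
    "fn": [Span(7, 9), Span(15, 17), Span(23, 25)],
    "ln": [Span(10, 12), Span(18, 20), Span(26, 28)],
    "jane": [Span(8, 8), Span(24, 24)],
    "doe": [Span(19, 19), Span(27, 27)],
}
pushed = [a for a in streams["author"] if has_extension(a, "author", children, streams)]
print(pushed)
```

Only the author at (22:29) passes the check, so only it would be pushed. A path-by-path evaluation would also report the author at (6:13) for the fn path and the author at (14:21) for the ln path, and both would be discarded in the final join.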

  26. TwigStack Algorithm (figure: the twig query author with fn = jane and ln = doe, next to the data) • Data: authors (5:30) • author (6:13) with fn (7:9) = jane (8) and ln (10:12) = poe (11) • author (14:21) with fn (15:17) = john (16) and ln (18:20) = doe (19) • author (22:29) with fn (23:25) = jane (24) and ln (26:28) = doe (27)

  27. In this lecture • The Problem • Idea • Preliminaries • PathStack Algorithm • TwigStack Algorithm • Conclusions

  28. Conclusions • The PathStack and TwigStack algorithms are effective in terms of the amount of intermediate results • But: • They are only effective for finding ancestor-descendant relationships. • If the twig also contains parent-child relationships, then not all nodes that are pushed onto the stacks participate in the final result.

  29. Break?

  30. Querying Structured Text in an XML Database ACM SIGMOD 2003

  31. In this lecture • Abstract • Introduction • Motivation • Algebra • Access methods • Conclusions

  32. Abstract • XML databases often contain documents with structured text • It is important to integrate “information retrieval” style query evaluation • IR is well studied for natural-language documents • But in the case of XML the relevant data could reside in the descendants of an element.

  33. In this lecture • Abstract • Introduction • Motivation • Algebra • Access methods • Conclusions

  34. Introduction • Boolean style queries (XQuery) • Useful when users are aware of the underlying schema • But • Users often don’t know the schema • And collections of XML documents are frequently heterogeneous.

  35. Introduction • So we have to use relevance ranking in order to define IR on XML • Problem: traditional IR is “document-centric” • XML IR should • Be much finer-grained • Take document structure into account • Allow more complex analysis than just determining relevance

  36. In this lecture • Abstract • Introduction • Motivation • Algebra • Access methods • Conclusions

  37. Motivation • We have the following XML document named article.xml (figure: document tree) • article #a1 • article-title #a2: Internet Technologies • author #a3 with fname #a4: Jane and sname #a5: Doe • chapter #a6 with ct #a7: Caching and Replication … • chapter #a10 with ct #a11: Search and Retrieval • section #a12 with section-title #a13: Search Engine • section #a14 with section-title #a15: Information Retrieval • section #a16 with section-title #a17: Examples • paragraphs p #a18, #a19, #a20, with text including: “semantic information retrieval techniques are also being incorporated into some search engines …”, “Here are some IR based Search Engines: …”, “…search engine NewSearch uses a new information retrieval technology”

  38. Motivation • Consider the query • Find document components in article.xml that are about “search engine”. Relevance to “internet” and “information retrieval” is desirable but not necessary. • Using AND and OR predicates will not give us the desired result

  39. Motivation • We have the following XML document named article.xml (figure: the same article.xml document tree as on slide 37)

  40. Motivation • Illustrating the granularity problem • What elements to rank? • If we rank articles • The user will see the whole article, while the relevant information is concentrated only in the third chapter • If we rank paragraphs • The paragraphs of the last section will be returned separately • The semantic linkage is broken and has to be reconstructed by the user

  41. Motivation • IR-style XML queries don’t have to be stand-alone • If the user knows the structure of the XML document, they can add structural constraints and limit the number of uninteresting results

  42. In this lecture • Abstract • Introduction • Motivation • Algebra • Access methods • Conclusions

  43. Algebra • We want to fold into a database framework the notion of relevance scoring and ranking

  44. Algebra • Scored Data Tree • Definition: • A rooted ordered tree, such that each node has attribute-value pairs, including at least a tag and a real-valued score • The score of a tree is the score of its root node • Example: article[3.6] #a1 with author #a3 (containing sname #a5) and section[3.6] #a16
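One possible in-memory shape for a scored data tree is sketched below in Python. The names `ScoredNode` and `tree_score` are hypothetical, and the default score of 0.0 for nodes shown without a score on the slide (author, sname) is an assumption made only so that every node carries a real number.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredNode:
    """A node of a scored data tree: at least a tag and a real-valued score."""
    tag: str
    node_id: str
    score: float = 0.0  # assumption: unscored nodes default to 0.0
    children: list["ScoredNode"] = field(default_factory=list)  # ordered children


def tree_score(root: ScoredNode) -> float:
    """The score of a tree is the score of its root node."""
    return root.score


# The example tree: article[3.6] with an author (and its sname) and a section[3.6].
example = ScoredNode("article", "#a1", 3.6, [
    ScoredNode("author", "#a3", children=[ScoredNode("sname", "#a5")]),
    ScoredNode("section", "#a16", 3.6),
])
print(tree_score(example))
```

Storing the score on every node, rather than only on the root, is what allows operators to rank and filter at any level of the tree.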

  45. Algebra • Scored Pattern Tree • Definition: • P = (T,F,S) • T: a node-labeled and edge-labeled tree • F: a formula that is a boolean combination of predicates applicable to nodes • S: a set of scoring functions

  46. Algebra • Scored Pattern Tree Example: • Query2: Find document components in article.xml that are part of an article written by an author with last name “Doe” and are about “search engine”. Relevance to “internet” and “information retrieval” is desirable but not necessary. • T: $1 has a pc edge to $2 and an ad* edge to $4; $2 has a pc edge to $3 • F: $1.tag=article & $2.tag=author & $3.tag=sname & $3.content = “Doe” • S: $4.score = { ScoreFoo({“search engine”},{“internet”,“information retrieval”})}, $1.score = $4.score

  47. Algebra • Common operators • Selection => Scored Selection • Projection => Scored Projection • Join => Scored Join • New Operators • Threshold • Pick

  48. Algebra (New Operators) • Threshold TC %a > ... • T: $1 has a pc edge to $2 and an ad* edge to $4; $2 has a pc edge to $3 • F: $1.tag=article & $2.tag=author & $3.tag=sname & $3.content = “Doe” • S: $4.score = { ScoreFoo({“search engine”},{“internet”,“information retrieval”})}, $1.score = $4.score (figure: scored trees, each an article[3.6] with an author, its sname, and a section[3.6])

  49. Algebra (New Operators) • Pick PC • T: $1 has a pc edge to $2 and an ad* edge to $4; $2 has a pc edge to $3 • F: $1.tag=article & $2.tag=author & $3.tag=sname & $3.content = “Doe” • S: $4.score = { ScoreFoo({“search engine”},{“internet”,“information retrieval”})}, $1.score = $4.score (figure: scored trees, each an article[3.6] with an author, its sname, and a section[3.6])

  50. Algebra (New Operators) • Pick Example: • Pick Condition • Data is relevant if: • score > 0.8 • more than 50% of children are relevant • its direct parent node is not picked • Data Tree: article[5.6] #a1, title[0.6] #a2, chapter[5.0] #a10, section[0.8] #a12, title[0.8] #a13, section[0.6] #a14, title[0.6] #a15, section[3.6] #a16, p[0.8] #a18, p[1.4] #a19, p[1.4] #a20
