
Efficient Keyword Search Over Virtual XML Views

Technion - Israel Institute of Technology. Efficient Keyword Search Over Virtual XML Views. Computer Science Department. Tal Herscovitz. Authors: Feng Shao, Lin Guo, Chavadar Botev, Anand Bhaskar, Muthiah Chettiar, Fan Yang. Outline. Motivation and Problem Definition


Presentation Transcript


  1. Technion - Israel Institute of Technology Efficient Keyword Search Over Virtual XML Views Computer Science Department Tal Herscovitz Authors: Feng Shao, Lin Guo, Chavadar Botev, Anand Bhaskar, Muthiah Chettiar, Fan Yang

  2. Outline • Motivation and Problem Definition • Existing Data and Data Structures • Algorithm • Experiments

  3. Personalized Portal • my.yahoo.com

  4. The Problem… • Traditional information retrieval systems rely heavily on the assumption that the set of documents being searched is materialized.

  5. Materialized XML Views?

  6. Materialized XML Views? Tradeoff

  7. Problem Example
      <books>
        <book>
          <isbn>111-11-1111</isbn>
          <title>XML Web Services</title>
          <publisher>Prentice Hall</publisher>
          <year>2004</year>
        </book>
        <book>
          <isbn>222-22-2222</isbn>
          <title>Artificial Intelligence</title>
          <publisher>Prentice Hall</publisher>
          <year>2002</year>
        </book>
        ...
      </books>
      <reviews>
        <review>
          <isbn>111-11-1111</isbn>
          <rate>Excellent</rate>
          <content>…about search…</content>
          <reviewer>John</reviewer>
        </review>
        <review>
          <isbn>111-11-1111</isbn>
          <rate>Good</rate>
          <content>Easy to read…</content>
          <reviewer>Alex</reviewer>
        </review>
        ...
      </reviews>

  8. Problem Example
      let $view :=
        for $book in fn:doc(books.xml)/books//book
        where $book/year > 1995
        return <bookrevs>
                 <book> {$book/title} </book>,
                 {for $rev in fn:doc(reviews.xml)/reviews//review
                  where $rev/isbn = $book/isbn
                  return $rev/content}
               </bookrevs>
      for $bookrev in $view
      where $bookrev ftcontains('XML' & 'Search')
      return $bookrev

  9. Problem Example
      <bookrevs>
        <book isbn="111-11-1111">
          <title>XML Web Services</title>
          <review><content>...about search...</content></review>
          <review><content>Easy to read...</content></review>
          ...
        </book>
        <book isbn="222-22-2222">
          <title>Artificial Intelligence</title>
          <review>...</review>
          ...
        </book>
      </bookrevs>

  10. Challenges • How do we efficiently compute statistics on the view from the statistics on the base data, so that the resulting scores and rank order of the query results are exactly the same as when the view is materialized? [Figure: ranking over a materialized view compared with ranking over a virtual view of the same base data]

  11. Problem Definition • Input: a set of keywords Q = {k1, k2, …, kn} and an XML view V over an XML database D • Task: ranked keyword search over virtual XML views • Output: the k view elements with the highest scores

  12. Outline • Motivation and Problem Definition • Existing Data and Data Structures • Algorithm • Experiments

  13. Scoring System • tf(e,k) • Number of distinct occurrences of keyword k in element e and its descendants (e ∈ V(D)). • idf(k) • The ratio of the number of elements in the view result V(D) to the number of elements in V(D) that contain the keyword k.
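The two statistics above can be sketched in a few lines. This is only an illustration under stated assumptions: each view element is modeled as the flat list of tokens found in it and its descendants, the names `tf`, `idf`, `score`, and `view` are hypothetical, and combining the two as a plain sum of tf × idf is an assumed scoring formula, since the slides do not spell out the exact one.

```python
from typing import Dict, List


def tf(tokens: List[str], keyword: str) -> int:
    """Occurrences of keyword among the tokens of an element and its descendants."""
    return sum(1 for t in tokens if t.lower() == keyword.lower())


def idf(view: Dict[str, List[str]], keyword: str) -> float:
    """|V(D)| divided by the number of view elements that contain the keyword."""
    containing = sum(1 for tokens in view.values() if tf(tokens, keyword) > 0)
    return len(view) / containing if containing else 0.0


def score(view: Dict[str, List[str]], element_id: str, keywords: List[str]) -> float:
    """Assumed combination: sum of tf * idf over the query keywords."""
    return sum(tf(view[element_id], k) * idf(view, k) for k in keywords)


# Hypothetical view result with two elements, loosely based on the running example.
view = {
    "bookrev1": ["XML", "Web", "Services", "about", "search", "Easy", "to", "read"],
    "bookrev2": ["Artificial", "Intelligence"],
}

print(score(view, "bookrev1", ["xml", "search"]))
print(score(view, "bookrev2", ["xml", "search"]))
```

The point of the slides is that these same numbers must be obtainable without building `view` at all, which is why the later steps carry tf values along with the pruned data.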

  14. Dewey ID • Dewey IDs are a hierarchical numbering scheme in which the ID of an element contains the ID of its parent element as a prefix.
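The prefix property makes ancestor checks and document-order sorting cheap. A minimal sketch, assuming Dewey IDs are written as dot-separated integers; the helper names `parse_dewey`, `is_ancestor`, and `parent` are hypothetical.

```python
from typing import Tuple

Dewey = Tuple[int, ...]


def parse_dewey(text: str) -> Dewey:
    """Turn '1.2.3' into (1, 2, 3) so components compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_ancestor(a: Dewey, d: Dewey) -> bool:
    """a is a proper ancestor of d exactly when a is a proper prefix of d."""
    return len(a) < len(d) and d[: len(a)] == a


def parent(d: Dewey) -> Dewey:
    """The parent ID is the ID with its last component dropped."""
    return d[:-1]


print(is_ancestor(parse_dewey("1.2"), parse_dewey("1.2.3")))
print(is_ancestor(parse_dewey("1.2"), parse_dewey("1.20.3")))
print(sorted(["1.10.1", "1.2.3", "1.2"], key=parse_dewey))
```

Comparing parsed tuples rather than raw strings matters: as strings, "1.10.1" would sort before "1.2", which is not document order.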

  15. Path Index B+ Tree

  16. Inverted Index • B+ tree index from each keyword to a list of (ID, tf) entries • Jane: (1.2.3, 1), (1.7.3, 1) • XQFT: (1.1.2, 2) • …
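The shape of the inverted index can be sketched with an in-memory dictionary. This is a stand-in only: the slide describes a B+ tree, and the names `inverted_index`, `postings`, and `tf_at` are hypothetical.

```python
from typing import Dict, List, Tuple

# keyword -> list of (Dewey ID, tf) entries, using the values from the slide
InvertedIndex = Dict[str, List[Tuple[str, int]]]

inverted_index: InvertedIndex = {
    "jane": [("1.2.3", 1), ("1.7.3", 1)],
    "xqft": [("1.1.2", 2)],
}


def postings(index: InvertedIndex, keyword: str) -> List[Tuple[str, int]]:
    """All (ID, tf) entries for a keyword, or an empty list if it is not indexed."""
    return index.get(keyword.lower(), [])


def tf_at(index: InvertedIndex, keyword: str, dewey_id: str) -> int:
    """tf recorded for the keyword at one element, 0 if there is no entry."""
    for entry_id, tf in postings(index, keyword):
        if entry_id == dewey_id:
            return tf
    return 0


print(tf_at(inverted_index, "XQFT", "1.1.2"))
print(tf_at(inverted_index, "Jane", "1.1.2"))
```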

  17. Outline • Motivation and Problem Definition • Existing Data and Data Structures • Algorithm • Experiments

  18. Algorithm – 3 Steps

  19. Step 1 QPT – Query Pattern Tree • Single line - parent/child relationship • Double line - ancestor/descendant relationship • Solid line - mandatory edge • Dotted line - optional edge • Nodes might have a predicate • C - the content of the node is propagated to the view output • V - the value of the node is required to evaluate the view
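The legend above maps naturally onto a small tree structure. The sketch below is illustrative: the class `QPTNode`, its field names, and the helper `walk` are hypothetical, and the example tree (including which edges are optional) is a guess derived from the view definition on slide 8, not a copy of the slide's figure.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class QPTNode:
    tag: str
    axis: str = "child"            # "child" (single line) or "descendant" (double line)
    mandatory: bool = True         # solid edge if True, dotted edge if False
    predicate: Optional[Callable[[str], bool]] = None
    c: bool = False                # content is propagated to the view output
    v: bool = False                # value is needed to evaluate the view
    children: List["QPTNode"] = field(default_factory=list)


def walk(node: QPTNode) -> Iterator[QPTNode]:
    """Yield the node and all of its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


# Assumed QPT for the books side of the view: books//book with isbn, title, year.
qpt = QPTNode("books", children=[
    QPTNode("book", axis="descendant", children=[
        QPTNode("isbn", mandatory=False, v=True),
        QPTNode("title", mandatory=False, c=True),
        QPTNode("year", v=True, predicate=lambda value: int(value) > 1995),
    ]),
])

print([n.tag for n in walk(qpt) if n.v])
print([n.tag for n in walk(qpt) if n.c])
```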

  20. Step 2 PDT - Pruned Document Tree
      <books>
        <book>
          <isbn id="1.2.1">121-23-1321</isbn>
          <title id="1.2.3" kwd1="xml" tf1="1" kwd2="search" tf2="0"/>
          <year id="1.2.6">1996</year>
        </book>
        ...
      </books>
      <reviews>
        <review>
          <isbn id="2.2.1">121-23-1321</isbn>
          <content id="2.1.3" kwd1="xml" tf1="0" kwd2="search" tf2="2"/>
        </review>
        ...
      </reviews>
      [Figure labels: QPT, PDT]

  21. Step 2 PDT Constraints • Each element e in the document corresponding to a node n in the QPT is selected only if:

  22. Step 2 PDT Creation
      GeneratePDT (QPT qpt, PathIndex pindex, KeywordSet kwds, InvertedIndex iindex): PDT
        pdt ← ∅
        (pathLists, invLists) ← PrepareLists(qpt, pindex, iindex, kwds)
        for idlist ∈ pathLists do
          AddCTNode(CT.root, GetMinEntry(idlist), 0)
        end for
        while CT.hasMoreNodes() do
          for all n ∈ CT.MinIDPath do
            q ← n.QPTNode
            if pathLists(q).hasNextID() ∧ there do not exist ≥ 2 IDs in pathLists(q) and also in CT then
              AddCTNode(CT.root, pathLists(q).NextMin(), 0)
            end if
          end for
          CreatePDTNodes(CT.root, qpt, pdt)
        end while
        return pdt

  23. Step 2 Prepare Lists Algorithm • Goal: prepare a list of Dewey IDs and elements required for PDT. • QPT nodes that don’t have mandatory child edges • Nodes with ’v’ annotation • Nodes that satisfy their predicate

  24. Step 2 Prepare Lists Algorithm • (books//book/isbn, (1.1.1:"111-11-1111"), (1.2.1:"121-23-1321"), ...) • (books//book/title, 1.1.4, 1.2.3, 1.9.3, …) • (books//book/year, (1.2.6, 1.5.1:"1996"), (1.6.1:"1997"), …)

  25. Step 2 Prepare Lists Algorithm • Return the relevant inverted index entries to obtain scoring information • Inverted index (keyword: (ID, tf) entries): XML: (1.2.3, 1), (1.3.4, 2); Search: (2.1.3, 2), … • ("xml", (1.2.3:1), (1.3.4:2), …) • ("search", (2.1.3:2), (2.5.1:1), …)

  26. Step 2 Prepare Lists Output • For the running example, Prepare Lists will return: • PrepareList():pathLists • (books//book/isbn, (1.1.1:"111-11-1111"), (1.2.1:"121-23-1321"), ...) • (books//book/title, 1.1.4, 1.2.3, 1.9.3, …) • (books//book/year, (1.2.6, 1.5.1:"1996"), (1.6.1:"1997"), …) • PrepareList():invLists • ("xml", (1.2.3:1), (1.3.4:2), …) • ("search", (2.1.3:2), (2.5.1:1), …)
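The list-preparation step can be approximated as two index probes: one per QPT path, one per keyword. The sketch below is a simplification under stated assumptions: dictionaries stand in for the path index and inverted index, the function `prepare_lists` and its parameters are hypothetical, predicates are applied as simple value filters, and the `("1.1.6", "1990")` entry is invented to show a value being filtered out.

```python
from typing import Callable, Dict, List, Optional, Tuple

Entry = Tuple[str, Optional[str]]        # (Dewey ID, value or None)
Posting = Tuple[str, int]                # (Dewey ID, tf)


def dewey_key(dewey_id: str) -> Tuple[int, ...]:
    return tuple(int(p) for p in dewey_id.split("."))


def prepare_lists(
    path_index: Dict[str, List[Entry]],
    inverted_index: Dict[str, List[Posting]],
    wanted_paths: Dict[str, Optional[Callable[[str], bool]]],
    keywords: List[str],
) -> Tuple[Dict[str, List[Entry]], Dict[str, List[Posting]]]:
    path_lists: Dict[str, List[Entry]] = {}
    for path, predicate in wanted_paths.items():
        entries = path_index.get(path, [])
        if predicate is not None:
            # keep only entries whose value satisfies the node's predicate
            entries = [(i, v) for i, v in entries if v is not None and predicate(v)]
        path_lists[path] = sorted(entries, key=lambda e: dewey_key(e[0]))
    inv_lists = {k: inverted_index.get(k, []) for k in keywords}
    return path_lists, inv_lists


path_index = {
    "books//book/isbn": [("1.1.1", "111-11-1111"), ("1.2.1", "121-23-1321")],
    "books//book/title": [("1.1.4", None), ("1.2.3", None), ("1.9.3", None)],
    "books//book/year": [("1.1.6", "1990"), ("1.2.6", "1996"), ("1.6.1", "1997")],
}
inverted_index = {"xml": [("1.2.3", 1), ("1.3.4", 2)], "search": [("2.1.3", 2), ("2.5.1", 1)]}
wanted = {
    "books//book/isbn": None,
    "books//book/title": None,
    "books//book/year": lambda v: int(v) > 1995,
}

path_lists, inv_lists = prepare_lists(path_index, inverted_index, wanted, ["xml", "search"])
print(path_lists["books//book/year"])
```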

  27. Step 2 PDT Creation (GeneratePDT pseudocode repeated from slide 22)

  28. Step 2 Candidate Tree • Each node cn in the CT stores sufficient information to efficiently check ancestor and descendant constraints • ID - the unique identifier of cn, which always corresponds to a prefix of a Dewey ID in pathLists • QNode - the QPT node to which cn.ID corresponds

  29. Step 2 Candidate Tree • ParentList (PL) - a list of cn’s ancestors whose QNodes are the parent node of cn.QNode • DescendantMap (DM) - maps each mandatory child/descendant of cn.QNode to 1 if it exists or 0 if not • PdtCache - the cache storing cn’s descendants that satisfy descendant restrictions but whose ancestor restrictions are yet to be checked
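The five fields of a CT node listed on the last two slides can be written down as a record. This is a sketch of the bookkeeping only; the class name `CTNode`, the Python field names, and the method `satisfies_descendant_constraints` are hypothetical, and the DM check shown is the obvious reading of "1 if it exists or 0 if not".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CTNode:
    id: Tuple[int, ...]                  # a prefix of a Dewey ID in pathLists
    qnode: str                           # the QPT node this ID corresponds to
    parent_list: List["CTNode"] = field(default_factory=list)     # PL
    descendant_map: Dict[str, int] = field(default_factory=dict)  # DM: child -> 0/1
    pdt_cache: List["CTNode"] = field(default_factory=list)       # PdtCache

    def satisfies_descendant_constraints(self) -> bool:
        """True when every mandatory child/descendant has been seen (all DM bits are 1)."""
        return all(bit == 1 for bit in self.descendant_map.values())


# Hypothetical example: a book element whose mandatory 'year' child is still missing.
book = CTNode(id=(1, 2), qnode="book", descendant_map={"year": 0})
print(book.satisfies_descendant_constraints())

# Once a matching year element (say 1.2.6) is added below it, the bit is flipped.
book.descendant_map["year"] = 1
print(book.satisfies_descendant_constraints())
```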

  30. Step 2 Candidate Tree Example

  31. Step 2 AddCTNode Algorithm • A prefix is added to the CT if it has a corresponding QPT node and is not already in the CT • If a prefix is associated with a ’c’ annotation, the tf values are retrieved from the inverted lists

  32. Step 2 PDT Creation (GeneratePDT pseudocode repeated from slide 22)

  33. Step 2 The Main Loop • Adds new Dewey IDs to the CT • Creates PDT nodes using CT nodes • Every iteration ensures that the Dewey IDs that have been processed and are known to be PDT nodes are either in the CT or in the result PDT • The result PDT only contains IDs that satisfy the PDT definition

  34. Step 2 The Main Loop • The main loop has 3 stages:

  35. Step 2 The Main Loop - Stage A (GeneratePDT pseudocode repeated from slide 22)

  36. Step 2 The Main Loop - Stage A • The algorithm adds the minimum IDs in pathLists corresponding to the QPT nodes • (books//book/isbn, (1.1.1:"111-11-1111"), (1.2.1:"121-23-1321"), ...)
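Repeatedly taking the minimum remaining ID across all path lists amounts to a merge in document order. The sketch below shows only that ordering, using `heapq.merge` from the standard library; it is a simplification rather than the CT-based algorithm itself, and the function name `merged_in_document_order` is hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def dewey_key(dewey_id: str) -> Tuple[int, ...]:
    return tuple(int(p) for p in dewey_id.split("."))


def merged_in_document_order(path_lists: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Merge per-QPT-node ID lists (each already sorted) into one stream of (ID, qnode)."""
    streams = [
        [(dewey_key(dewey_id), qnode) for dewey_id in ids]
        for qnode, ids in path_lists.items()
    ]
    return [(".".join(map(str, key)), qnode) for key, qnode in heapq.merge(*streams)]


# IDs taken from the running example's path lists (values omitted).
path_lists = {
    "isbn": ["1.1.1", "1.2.1"],
    "title": ["1.1.4", "1.2.3", "1.9.3"],
    "year": ["1.2.6", "1.6.1"],
}

for dewey_id, qnode in merged_in_document_order(path_lists):
    print(dewey_id, qnode)
```

Processing in this order is what allows the later stages to conclude that a node whose subtree has been passed will see no further descendants.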

  37. Step 2 The Main Loop - Stages B,C (GeneratePDT pseudocode repeated from slide 22)

  38. Step 2 The Main Loop - Stage B • The algorithm creates PDT nodes using CT nodes in CT.MinIDPath • From top down: • If the node satisfies the descendant constraints (DM check), then add it to its parent’s PdtCache • Recursively invoke CreatePDTNodes on the element [Figure: PdtCache: (isbn, 1.1.1)]

  39. Step 2 The Main Loop - Stage C • The algorithm starts removing nodes from bottom up • For example, after processing and removing node “title”, we will remove node “book” because it doesn’t have children and it doesn’t satisfy descendant constraints. [Figure: PdtCache: (isbn, 1.1.1), (title, 1.1.4)]

  40. Step 2 The Main Loop - Stage C • Propagating nodes in the PdtCache • Before removing book 1.2: PdtCache: (isbn, 1.2.1), (title, 1.2.3), (year, 1.2.6); PdtCache: (book, 1.2) • After removing book 1.2: PdtCache: (book, 1.2), (isbn, 1.2.1), (title, 1.2.3), (year, 1.2.6)

  41. Step 2 The Main Loop - Stage C • Since nodes are processed in ID order, a node whose descendant constraints are not satisfied at this point will never satisfy them in the future • Next, we check if nodes satisfy ancestor constraints, which is done by checking nodes in their parent lists. • If those parent nodes are known to be non-PDT nodes, then we can conclude that the nodes in the cache will not satisfy ancestor restrictions, and can hence be removed. • Otherwise the cache node still has other parents, which could be PDT nodes, and will thus be propagated to the PdtCache of the ancestor.

  42. Step 3 Query Evaluation • Once the PDTs are generated, they are fed to a traditional evaluator to produce the temporary results, which are then sent to the Scoring & Materialization Module. • tf values are encoded as XML attributes • tf-idf scores are calculated for each PDT element using tf values • The Scoring & Materialization Module then identifies the view results with top-k scores. • The contents of these results are retrieved from the document storage system
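The last two bullets, picking the top-k results and only then fetching their contents, can be sketched as follows. The names `top_k`, `scores`, and `storage` are hypothetical, the score values are made up for illustration, and a dictionary stands in for the document storage system.

```python
import heapq
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k (result ID, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Made-up scores for three temporary view results.
scores = {"bookrev1": 4.0, "bookrev2": 0.0, "bookrev3": 1.5}

# Stand-in for the document storage system: contents are fetched only for winners.
storage = {
    "bookrev1": "<bookrevs>...XML Web Services...</bookrevs>",
    "bookrev2": "<bookrevs>...Artificial Intelligence...</bookrevs>",
    "bookrev3": "<bookrevs>...</bookrevs>",
}

winners = top_k(scores, 2)
materialized = [(rid, score, storage[rid]) for rid, score in winners]
print(winners)
```

Deferring the storage lookup until after ranking is the design point: only k results ever need their full contents read.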

  43. Outline • Motivation and Problem Definition • Existing Data and Data Structures • Algorithm • Experiments

  44. Experiments
