Efficient Keyword Search Over Virtual XML Views

Efficient Keyword Search Over Virtual XML Views • Tal Herscovitz, Computer Science Department, Technion - Israel Institute of Technology • Authors: Feng Shao, Lin Guo, Chavdar Botev, Anand Bhaskar, Muthiah Chettiar, Fan Yang

Outline • Motivation and Problem Definition • Existing Data and Data Structures • Algorithm • Experiments

Personalized Portal • my.yahoo.com

The Problem… • Traditional information retrieval systems rely heavily on the assumption that the set of documents being searched is materialized.

Materialized XML Views?

Materialized XML Views? Tradeoff

Problem Example
<books>
  <book>
    <isbn>111-11-1111</isbn>
    <title>XML Web Services</title>
    <publisher>Prentice Hall</publisher>
    <year>2004</year>
  </book>
  <book>
    <isbn>222-22-2222</isbn>
    <title>Artificial Intelligence</title>
    <publisher>Prentice Hall</publisher>
    <year>2002</year>
  </book>
  ...
</books>
<reviews>
  <review>
    <isbn>111-11-1111</isbn>
    <rate>Excellent</rate>
    <content>…about search…</content>
    <reviewer>John</reviewer>
  </review>
  <review>
    <isbn>111-11-1111</isbn>
    <rate>Good</rate>
    <content>Easy to read…</content>
    <reviewer>Alex</reviewer>
  </review>
  ...
</reviews>

Problem Example
let $view :=
  for $book in fn:doc("books.xml")/books//book
  where $book/year > 1995
  return
    <bookrevs>
      <book> {$book/title} </book>,
      {for $rev in fn:doc("reviews.xml")/reviews//review
       where $rev/isbn = $book/isbn
       return $rev/content}
    </bookrevs>
for $bookrev in $view
where $bookrev ftcontains('XML' & 'Search')
return $bookrev

Problem Example
<bookrevs>
  <book isbn="111-11-1111">
    <title>XML Web Services</title>
    <review><content>...about search...</content></review>
    <review><content>Easy to read...</content></review>
    ...
  </book>
  <book isbn="222-22-2222">
    <title>Artificial Intelligence</title>
    <review>...</review>
    ...
  </book>
</bookrevs>

Challenges • How do we efficiently compute statistics on the view from the statistics on the base data, so that the resulting scores and the rank order of the query results are exactly the same as when the view is materialized? • (Diagram labels: base data, materialized view, virtual view, rank)

Problem Definition • Task: ranked keyword search over virtual XML views • Input: a set of keywords Q = {k1, k2, …, kn} and an XML view V over an XML database D • Output: the k view elements with the highest scores

Scoring System • tf(e, k): the number of distinct occurrences of keyword k in element e and its descendants, where e ∈ V(D). • idf(k): the ratio of the number of elements in the view result V(D) to the number of elements in V(D) that contain the keyword k.
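
The two statistics above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: each view element is modeled as a flat list of tokens (its own text plus that of its descendants), the function names are hypothetical, and combining the statistics as a sum of tf × idf over the query keywords is an assumption, since the slide defines only tf and idf.

```python
from typing import List

def tf(element_tokens: List[str], keyword: str) -> int:
    """Occurrences of keyword among the tokens of an element and its descendants."""
    return sum(1 for t in element_tokens if t.lower() == keyword.lower())

def idf(view_elements: List[List[str]], keyword: str) -> float:
    """|V(D)| divided by the number of view elements that contain the keyword."""
    containing = sum(1 for e in view_elements if tf(e, keyword) > 0)
    # Assumption: a keyword that appears in no element contributes nothing.
    return len(view_elements) / containing if containing else 0.0

def score(element_tokens: List[str], view_elements: List[List[str]],
          keywords: List[str]) -> float:
    # Assumption: sum of tf * idf; the slide does not give the combination.
    return sum(tf(element_tokens, k) * idf(view_elements, k) for k in keywords)

# Toy view result with two elements, loosely modeled on the running example.
view = [
    ["xml", "web", "services", "about", "search", "easy", "to", "read"],
    ["artificial", "intelligence"],
]
```

With this toy data, only the first element contains either query keyword, so it is the only one with a non-zero score for the query {XML, Search}.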

Dewey ID • Dewey IDs are a hierarchical numbering scheme in which the ID of an element contains the ID of its parent element as a prefix.
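
The prefix property can be illustrated with a short Python sketch. Representing a Dewey ID as a tuple of integers is an assumption made here for convenience; the helper names are hypothetical.

```python
from typing import Tuple

DeweyID = Tuple[int, ...]

def parse_dewey(text: str) -> DeweyID:
    """Turn '1.2.3' into (1, 2, 3)."""
    return tuple(int(part) for part in text.split("."))

def parent(dewey: DeweyID) -> DeweyID:
    """The parent's ID is the ID without its last component."""
    return dewey[:-1]

def is_ancestor(a: DeweyID, d: DeweyID) -> bool:
    """a is a proper ancestor of d when a is a strict prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a
```

Comparing components as integers rather than comparing the dotted strings avoids treating 1.2 as a prefix of 1.20.3, and sorting the tuples yields document order.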

Path Index B+ Tree

Inverted Index • B+ tree index; each keyword maps to a list of (ID, tf) entries • Jane: (1.2.3, 1), (1.7.3, 1) • XQFT: (1.1.2, 2), …
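
A rough Python sketch of such an index follows. A dictionary of sorted posting lists stands in for the B+ tree, and summing the tf values of all postings under an element's Dewey prefix is an assumption about how subtree term frequencies could be obtained; `subtree_tf` is a hypothetical helper.

```python
from bisect import bisect_left

# keyword -> postings sorted by Dewey ID, each posting is (ID, tf)
inverted_index = {
    "jane": [((1, 2, 3), 1), ((1, 7, 3), 1)],
    "xqft": [((1, 1, 2), 2)],
}

def subtree_tf(index, keyword, element_id):
    """Sum tf over postings whose ID equals element_id or has it as a prefix."""
    postings = index.get(keyword.lower(), [])
    ids = [dewey for dewey, _ in postings]
    # Postings under one prefix are contiguous in sorted order.
    start = bisect_left(ids, element_id)
    total = 0
    for dewey, tf in postings[start:]:
        if dewey[:len(element_id)] != element_id:
            break
        total += tf
    return total
```

Because IDs sharing a prefix sit next to each other in sorted order, one binary search followed by a short scan is enough.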

Algorithm – 3 Steps

Step 1 QPT – Query Pattern Tree • Single line - parent/child relationship • Double line - ancestor/descendant relationship • Solid line - mandatory edge • Dotted line - optional edge • Nodes might have a predicate • C - the content of the node is propagated to the view output • V - the value of the node is required to evaluate the view

Step 2 PDT - Pruned Document Tree
<books>
  <book>
    <isbn id="1.2.1">121-23-1321</isbn>
    <title id="1.2.3" kwd1="xml" tf1="1" kwd2="search" tf2="0"/>
    <year id="1.2.6">1996</year>
  </book>
  ...
</books>
<reviews>
  <review>
    <isbn id="2.2.1">121-23-1321</isbn>
    <content id="2.1.3" kwd1="xml" tf1="0" kwd2="search" tf2="2"/>
  </review>
  ...
</reviews>
(Diagram: QPT alongside the resulting PDT)

Step 2 PDT Constraints • Each element e in the document corresponding to a node n in the QPT is selected only if:

Step 2 PDT Creation
GeneratePDT(QPT qpt, PathIndex pindex, KeywordSet kwds, InvertedIndex iindex): PDT
  pdt ← ∅
  (pathLists, invLists) ← PrepareLists(qpt, pindex, iindex, kwds)
  for idlist ∈ pathLists do
    AddCTNode(CT.root, GetMinEntry(idlist), 0)
  end for
  while CT.hasMoreNodes() do
    for all n ∈ CT.MinIDPath do
      q ← n.QPTNode
      if pathLists(q).hasNextID() ∧ there do not exist ≥ 2 IDs in pathLists(q) and also in CT then
        AddCTNode(CT.root, pathLists(q).NextMin(), 0)
      end if
    end for
    CreatePDTNodes(CT.root, qpt, pdt)
  end while
  return pdt

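Step 2 Prepare Lists Algorithm • Goal: prepare the lists of Dewey IDs and element values required to build the PDT. • Lists are retrieved for QPT nodes that don’t have mandatory child edges • For nodes with a ’v’ annotation, the values are retrieved along with the IDs • Only elements that satisfy the predicate of their QPT node are kept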

Step 2 Prepare Lists Algorithm • (books//book/isbn, (1.1.1:"111-11-1111"), (1.2.1:"121-23-1321"), ...) • (books//book/title, 1.1.4, 1.2.3, 1.9.3, …) • (books//book/year, (1.2.6, 1.5.1:"1996"), (1.6.1:"1997"), …)

Step 2 Prepare Lists Algorithm • Return the relevant inverted index lists, which supply the scoring information as (ID:tf) entries • ("xml", (1.2.3:1), (1.3.4:2), …) • ("search", (2.1.3:2), (2.5.1:1), …)

Step 2 Prepare Lists Output • For the running example, PrepareLists will return: • pathLists: • (books//book/isbn, (1.1.1:"111-11-1111"), (1.2.1:"121-23-1321"), ...) • (books//book/title, 1.1.4, 1.2.3, 1.9.3, …) • (books//book/year, (1.2.6, 1.5.1:"1996"), (1.6.1:"1997"), …) • invLists: • ("xml", (1.2.3:1), (1.3.4:2), …) • ("search", (2.1.3:2), (2.5.1:1), …)
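
The shape of this output can be sketched in Python. Everything here is a simplified assumption: plain dictionaries stand in for the path index and inverted index, each path-list entry is flattened to one (ID, value) pair, `prepare_lists` and its parameters are hypothetical names, and the 1990 entry is made up solely to show the predicate filtering something out.

```python
def prepare_lists(qpt_paths, path_index, inverted_index, keywords):
    """Return (path_lists, inv_lists) restricted to what the QPT and query need."""
    path_lists = {}
    for path, predicate in qpt_paths.items():
        entries = path_index.get(path, [])
        if predicate is not None:
            # Keep only entries whose value satisfies the QPT node's predicate.
            entries = [(d, v) for d, v in entries if predicate(v)]
        path_lists[path] = sorted(entries, key=lambda e: e[0])
    inv_lists = {k: sorted(inverted_index.get(k, [])) for k in keywords}
    return path_lists, inv_lists

path_index = {
    "books//book/title": [((1, 1, 4), None), ((1, 2, 3), None), ((1, 9, 3), None)],
    "books//book/year": [
        ((1, 2, 6), "1996"), ((1, 5, 1), "1996"), ((1, 6, 1), "1997"),
        ((1, 1, 5), "1990"),  # made-up entry that fails year > 1995
    ],
}
inverted_index = {
    "xml": [((1, 2, 3), 1), ((1, 3, 4), 2)],
    "search": [((2, 1, 3), 2), ((2, 5, 1), 1)],
}
qpt_paths = {
    "books//book/title": None,                      # no predicate
    "books//book/year": lambda v: int(v) > 1995,    # predicate from the view
}
path_lists, inv_lists = prepare_lists(
    qpt_paths, path_index, inverted_index, ["xml", "search"]
)
```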

Step 2 Candidate Tree • Each node cn in the CT stores sufficient information to efficiently check ancestor and descendant constraints • ID - the unique identifier of cn, which always corresponds to a prefix of a Dewey ID in pathLists • QNode - the QPT node to which cn.ID corresponds

Step 2 Candidate Tree • ParentList (PL) - a list of cn’s ancestors whose QNodes are the parent node of cn.QNode • DescendantMap (DM) - maps each mandatory child/descendant of cn.QNode to 1 if it exists or 0 if not • PdtCache - the cache storing cn’s descendants that satisfy descendant restrictions but whose ancestor restrictions are yet to be checked
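
The fields of a CT node listed above might be laid out as in the following Python sketch. The class name, field names, and the choice of `year` as a mandatory child of `book` are illustrative assumptions, not taken from the authors' code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class CTNode:
    id: Tuple[int, ...]                 # prefix of a Dewey ID in pathLists
    qnode: str                          # label of the corresponding QPT node
    parent_list: List["CTNode"] = field(default_factory=list)
    # mandatory child/descendant QPT node -> 1 if found, 0 if not
    descendant_map: Dict[str, int] = field(default_factory=dict)
    # descendants whose ancestor restrictions are still unchecked
    pdt_cache: List[Tuple[str, Tuple[int, ...]]] = field(default_factory=list)

    def satisfies_descendant_constraints(self) -> bool:
        """The DM check: every mandatory child/descendant has been seen."""
        return all(flag == 1 for flag in self.descendant_map.values())

# Hypothetical example: a book whose mandatory 'year' child is not yet seen.
book = CTNode(id=(1, 2), qnode="book", descendant_map={"year": 0})
before = book.satisfies_descendant_constraints()
book.descendant_map["year"] = 1
after = book.satisfies_descendant_constraints()
```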

Step 2 Candidate Tree Example

Step 2 AddCTNode Algorithm • A prefix is added to the CT if it has a corresponding QPT node and is not already in the CT • If a prefix is associated with a ’c’ annotation, the tf values are retrieved from the inverted lists
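
The prefix rule can be sketched in Python as below. This simplified version models the CT as a dictionary keyed by Dewey prefix, takes a hypothetical lookup `qpt_node_for` that says which QPT node (if any) a prefix corresponds to, and omits the retrieval of tf values for ’c’ nodes.

```python
def add_ct_node(ct, dewey, qpt_node_for):
    """Add every prefix of dewey that has a QPT node and is not yet in the CT."""
    added = []
    for depth in range(1, len(dewey) + 1):
        prefix = dewey[:depth]
        qnode = qpt_node_for(prefix)
        if qnode is not None and prefix not in ct:
            ct[prefix] = qnode
            added.append(prefix)
    return added

# Hypothetical mapping from element IDs to QPT node labels.
element_labels = {
    (1,): "books", (1, 1): "book", (1, 1, 1): "isbn", (1, 1, 4): "title",
}
ct = {}
first = add_ct_node(ct, (1, 1, 1), element_labels.get)
second = add_ct_node(ct, (1, 1, 4), element_labels.get)
```

The second call adds only the title node, because its ancestors were already inserted by the first call.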

Step 2 The Main Loop • Adds new Dewey IDs to the CT • Creates PDT nodes using CT nodes • Every iteration ensures that the Dewey IDs that have been processed and are known to be PDT nodes are either in the CT or in the result PDT • The result PDT only contains IDs that satisfy the PDT definition

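Step 2 The Main Loop • The main loop has 3 stages: • Stage A: add the minimum IDs from pathLists to the CT • Stage B: create PDT nodes, top down, from the CT nodes in CT.MinIDPath • Stage C: remove processed nodes, bottom up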

Step 2 The Main Loop - Stage A • The algorithm adds the minimum IDs in pathLists corresponding to the QPT nodes • Example: (books//book/isbn, (1.1.1:"111-11-1111"), (1.2.1:"121-23-1321"), ...)

Step 2 The Main Loop - Stage B • The algorithm creates PDT nodes using CT nodes in CT.MinIDPath • From top down: • If the node satisfies the descendant constraints (DM check), then add it to its parent’s PdtCache • Recursively invoke CreatePDTNodes on the element • (Figure: PdtCache: isbn, 1.1.1)

Step 2 The Main Loop - Stage C • The algorithm starts removing nodes from bottom up • For example, after processing and removing node “title”, we will remove node “book” because it doesn’t have children and it doesn’t satisfy descendant constraints. • (Figure: PdtCache: isbn, 1.1.1; title, 1.1.4)

Step 2 The Main Loop - Stage C • Propagating nodes in the PdtCache • Before removing book 1.2: PdtCache: (isbn, 1.2.1), (title, 1.2.3), (year, 1.2.6) and PdtCache: (book, 1.2) • After removing book 1.2: PdtCache: (book, 1.2), (isbn, 1.2.1), (title, 1.2.3), (year, 1.2.6)

Step 2 The Main Loop - Stage C • Since nodes are processed in ID order, a node whose descendant constraints are not satisfied when it is removed will never satisfy them in the future • Next, we check whether the nodes in its PdtCache satisfy ancestor constraints, which is done by checking the nodes in their parent lists. • If those parent nodes are known to be non-PDT nodes, then we can conclude that the nodes in the cache will not satisfy ancestor restrictions, and they can hence be removed. • Otherwise the cache node still has other parents, which could be PDT nodes, and it will thus be propagated to the PdtCache of the ancestor.
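
The removal step can be sketched in Python under strong simplifying assumptions: each cached node has only one candidate parent (so a failed parent means its cached descendants are dropped), and the node itself was already placed in its parent’s PdtCache during Stage B. The `Node` class and `remove_node` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Entry = Tuple[str, Tuple[int, ...]]  # (QPT node label, Dewey ID)

@dataclass
class Node:
    label: str
    dewey: Tuple[int, ...]
    descendant_map: Dict[str, int] = field(default_factory=dict)
    pdt_cache: List[Entry] = field(default_factory=list)

def remove_node(node: Node, parent: Node) -> None:
    """Remove a fully processed node, propagating or dropping its PdtCache."""
    if all(flag == 1 for flag in node.descendant_map.values()):
        # The node may still be a PDT node: hand its cache to the parent.
        parent.pdt_cache.extend(node.pdt_cache)
    # Otherwise the node is a known non-PDT parent, so (under the
    # single-parent assumption) its cached descendants are dropped.
    node.pdt_cache = []

books = Node("books", (1,), pdt_cache=[("book", (1, 2))])
book_1_1 = Node("book", (1, 1), {"year": 0},
                [("isbn", (1, 1, 1)), ("title", (1, 1, 4))])
book_1_2 = Node("book", (1, 2), {"year": 1},
                [("isbn", (1, 2, 1)), ("title", (1, 2, 3)), ("year", (1, 2, 6))])
remove_node(book_1_1, books)   # fails the DM check: cache is dropped
remove_node(book_1_2, books)   # passes the DM check: cache is propagated
```

After both calls, the cache of `books` holds book 1.2 followed by its isbn, title, and year, matching the "after removing book 1.2" state on the previous slide.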

Step 3 Query Evaluation • Once the PDTs are generated, they are fed to a traditional evaluator to produce the temporary results, which are then sent to the Scoring & Materialization Module. • tf values are encoded as XML attributes • tf-idf scores are calculated for each PDT element using tf values • The Scoring & Materialization Module then identifies the view results with top-k scores. • The contents of these results are retrieved from the document storage system
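
The top-k selection in the last stage can be sketched with Python’s standard `heapq` module. The result IDs and scores below are invented for illustration, and `top_k` is a hypothetical helper rather than part of the described system.

```python
import heapq

def top_k(scored_results, k):
    """Return the k (result_id, score) pairs with the highest scores."""
    return heapq.nlargest(k, scored_results, key=lambda pair: pair[1])

# Hypothetical scored view results: (result identifier, tf-idf score).
scored = [("bookrev-1", 4.0), ("bookrev-2", 0.5), ("bookrev-3", 2.5)]
best = top_k(scored, 2)
```

Only the identifiers in `best` would then need their contents fetched from the document storage system, which is what lets the other results stay unmaterialized.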

Experiments
