Technion - Israel Institute of Technology

Efficient Keyword Search Over Virtual XML Views

Computer Science Department

Tal Herscovitz

Authors:

Feng Shao, Lin Guo, Chavdar Botev, Anand Bhaskar, Muthiah Chettiar, Fan Yang

Outline
  • Motivation and Problem Definition
  • Existing Data and Data Structures
  • Algorithm
  • Experiments
The Problem…
  • Traditional information retrieval systems rely heavily on the assumption that the set of documents being searched is materialized.
Problem Example

<books>
  <book>
    <isbn>111-11-1111</isbn>
    <title>XML Web Services</title>
    <publisher>Prentice Hall</publisher>
    <year>2004</year>
  </book>
  <book>
    <isbn>222-22-2222</isbn>
    <title>Artificial Intelligence</title>
    <publisher>Prentice Hall</publisher>
    <year>2002</year>
  </book>
  ...
</books>

<reviews>
  <review>
    <isbn>111-11-1111</isbn>
    <rate>Excellent</rate>
    <content>...about search...</content>
    <reviewer>John</reviewer>
  </review>
  <review>
    <isbn>111-11-1111</isbn>
    <rate>Good</rate>
    <content>Easy to read...</content>
    <reviewer>Alex</reviewer>
  </review>
  ...
</reviews>

Problem Example

let $view :=
  for $book in fn:doc("books.xml")/books//book
  where $book/year > 1995
  return
    <bookrevs>
      <book>{$book/title}</book>
      {for $rev in fn:doc("reviews.xml")/reviews//review
       where $rev/isbn = $book/isbn
       return $rev/content}
    </bookrevs>

for $bookrev in $view
where $bookrev ftcontains 'XML' ftand 'Search'
return $bookrev

Problem Example

<bookrevs>
  <book isbn="111-11-1111">
    <title>XML Web Services</title>
    <review><content>...about search...</content></review>
    <review><content>Easy to read...</content></review>
    ...
  </book>
  <book isbn="222-22-2222">
    <title>Artificial Intelligence</title>
    <review>...</review>
    ...
  </book>
</bookrevs>

Challenges
  • How do we efficiently compute statistics on the view from the statistics on the base data, so that the resulting scores and rank order of the query results are exactly the same as when the view is materialized?

(Slide diagram: ranking over the materialized view vs. ranking computed directly from the base data and the virtual view.)

Problem Definition

Input:
  • A set of keywords Q = {k1, k2, ..., kn}
  • An XML view V over an XML database D

Ranked keyword search over virtual XML views

Output:
  • The k view elements with the highest scores

Outline
  • Motivation and Problem Definition
  • Existing Data and Data Structures
  • Algorithm
  • Experiments
Scoring System
  • tf(e, k)
    • The number of distinct occurrences of keyword k in element e and its descendants, where e ∈ V(D).
  • idf(k)
    • The ratio of the number of elements in the view result V(D) to the number of elements in V(D) that contain the keyword k.
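
Putting the two statistics together, a generic unnormalized tf-idf combination is shown below purely to fix intuition; the paper's actual scoring function may weight or normalize these terms differently.

```latex
\mathrm{idf}(k) = \frac{\lvert V(D) \rvert}{\lvert \{\, e' \in V(D) : e' \text{ contains } k \,\} \rvert}
\qquad
\mathrm{score}(e, Q) = \sum_{k \in Q} \mathrm{tf}(e, k) \cdot \mathrm{idf}(k)
```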
Dewey ID
  • Dewey IDs are a hierarchical numbering scheme in which the ID of an element contains the ID of its parent element as a prefix.
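
Because of this prefix property, ancestor/descendant relationships can be tested by comparing ID components. A minimal Python sketch (the helper names are ours, not the paper's):

```python
def components(dewey_id: str) -> list:
    """Split a Dewey ID such as '1.2.3' into its components."""
    return dewey_id.split(".")

def is_ancestor(a: str, b: str) -> bool:
    """True if the element with Dewey ID a is a proper ancestor of the element with ID b."""
    pa, pb = components(a), components(b)
    return len(pa) < len(pb) and pb[:len(pa)] == pa

# book 1.2 is an ancestor of its title 1.2.3, but not of 1.3.4.
assert is_ancestor("1.2", "1.2.3")
assert not is_ancestor("1.2", "1.3.4")
```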
Path Index

(Slide figure: the path index, stored as a B+ tree.)
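
To make the structure concrete, the PrepareLists output later in the talk suggests that each root-to-leaf path maps to the sorted Dewey IDs (and, where needed, values) of its matching elements. A hypothetical sketch of a few entries:

```python
# Hypothetical path-index entries for the running example
# (path -> sorted Dewey IDs of the matching elements).
path_index = {
    "books//book/isbn":  ["1.1.1", "1.2.1"],
    "books//book/title": ["1.1.4", "1.2.3", "1.9.3"],
    "books//book/year":  ["1.2.6", "1.5.1", "1.6.1"],
}
```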

Inverted Index

B+ tree index; each keyword maps to a list of (ID, tf) entries:

  Jane: (1.2.3, 1), (1.7.3, 1)
  XQFT: (1.1.2, 2)
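
For the running example's keywords ("xml" and "search", whose postings appear on the PrepareLists slides), a toy lookup over such postings could look like this minimal sketch (container and helper names are ours):

```python
# Toy stand-in for the inverted index: keyword -> (Dewey ID, tf) postings, sorted by ID.
inverted_index = {
    "xml":    [("1.2.3", 1), ("1.3.4", 2)],
    "search": [("2.1.3", 2), ("2.5.1", 1)],
}

def tf_for(keyword: str, dewey_id: str) -> int:
    """Term frequency of keyword in the element with the given Dewey ID (0 if absent)."""
    return dict(inverted_index.get(keyword, [])).get(dewey_id, 0)

print(tf_for("search", "2.1.3"))  # -> 2
```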

Outline
  • Motivation and Problem Definition
  • Existing Data and Data Structures
  • Algorithm
  • Experiments
Step 1: QPT – Query Pattern Tree
  • Single line - parent/child relationship
  • Double line - ancestor/descendant relationship
  • Solid line - mandatory edge
  • Dotted line - optional edge
  • Nodes might have a predicate
  • C - the content of the node is propagated to the view output
  • V - the value of the node is required to evaluate the view
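
For concreteness, here is how the running-example view's QPT might be read off the view query and the PrepareLists output; the node names come from the slides, but the exact annotations are our interpretation, not a copy of the slide's figure.

```python
# Hypothetical QPT for the running example: node -> annotation.
# 'v' = value needed to evaluate the view, 'c' = content propagated to the output.
qpt = {
    "books//book": {
        "isbn":  "v",           # joined against reviews//review/isbn
        "title": "c",           # propagated to the view output
        "year":  "v, > 1995",   # predicate from the view's where clause
    },
    "reviews//review": {
        "isbn":    "v",
        "content": "c",
    },
}
```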
Step 2: PDT - Pruned Document Tree

<books>
  <book>
    <isbn id="1.2.1">121-23-1321</isbn>
    <title id="1.2.3" kwd1="xml" tf1="1"
           kwd2="search" tf2="0"/>
    <year id="1.2.6">1996</year>
  </book>
  ...
</books>

<reviews>
  <review>
    <isbn id="2.2.1">121-23-1321</isbn>
    <content id="2.1.3" kwd1="xml"
             tf1="0" kwd2="search" tf2="2"/>
  </review>
  ...
</reviews>

(Slide figure: the QPT and the corresponding PDT.)

Step 2: PDT Constraints
  • Each element e in the document corresponding to a node n in the QPT is selected only if it satisfies n's predicate (if any), its descendants can satisfy n's mandatory child/descendant edges, and its ancestors in turn satisfy their own constraints.
Step 2: PDT Creation

GeneratePDT(QPT qpt, PathIndex pindex, KeywordSet kwds, InvertedIndex iindex): PDT
  pdt ← ∅
  (pathLists, invLists) ← PrepareLists(qpt, pindex, iindex, kwds)
  for idlist ∈ pathLists do
    AddCTNode(CT.root, GetMinEntry(idlist), 0)
  end for
  while CT.hasMoreNodes() do
    for all n ∈ CT.MinIDPath do
      q ← n.QPTNode
      if pathLists(q).hasNextID() ∧ there do not exist ≥ 2 IDs in pathLists(q) and also in CT then
        AddCTNode(CT.root, pathLists(q).NextMin(), 0)
      end if
    end for
    CreatePDTNodes(CT.root, qpt, pdt)
  end while
  return pdt

Step 2: Prepare Lists Algorithm
  • Goal: prepare the lists of Dewey IDs and element values required to build the PDT, for the following QPT nodes:
    • QPT nodes that don’t have mandatory child edges
    • Nodes with ’v’ annotation
    • Nodes that satisfy their predicate
Step 2: Prepare Lists Algorithm
  • (books//book/isbn, (1.1.1:"111-11-1111"), (1.2.1:"121-23-1321"), ...)
  • (books//book/title, 1.1.4, 1.2.3, 1.9.3, ...)
  • (books//book/year, (1.2.6, 1.5.1:"1996"), (1.6.1:"1997"), ...)
Step 2: Prepare Lists Algorithm
  • Read the relevant inverted-list entries to obtain scoring information

  XML:    (1.2.3, 1), (1.3.4, 2)
  Search: (2.1.3, 2)

  • ("xml", (1.2.3:1), (1.3.4:2), ...)
  • ("search", (2.1.3:2), (2.5.1:1), ...)
Step 2: Prepare Lists Output
  • For the running example, PrepareLists will return:
  • PrepareLists(): pathLists
    • (books//book/isbn, (1.1.1:"111-11-1111"), (1.2.1:"121-23-1321"), ...)
    • (books//book/title, 1.1.4, 1.2.3, 1.9.3, ...)
    • (books//book/year, (1.2.6, 1.5.1:"1996"), (1.6.1:"1997"), ...)
  • PrepareLists(): invLists
    • ("xml", (1.2.3:1), (1.3.4:2), ...)  ("search", (2.1.3:2), (2.5.1:1), ...)
Step 2: PDT Creation

(The GeneratePDT pseudocode from the PDT Creation slide is shown again here.)

Step 2: Candidate Tree
  • Each node cn in the CT stores sufficient information to efficiently check ancestor and descendant constraints
    • ID - the unique identifier of cn, which always corresponds to a prefix of a Dewey ID in pathLists
    • QNode - the QPT node to which cn.ID corresponds
Step 2: Candidate Tree (continued)
  • ParentList (PL) - a list of cn's ancestors whose QNodes are the parent node of cn.QNode
  • DescendantMap (DM) - maps each mandatory child/descendant of cn.QNode to 1 if it exists and 0 otherwise
  • PdtCache - the cache storing cn's descendants that satisfy their descendant restrictions but whose ancestor restrictions are yet to be checked
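
Putting the fields from these two slides together, a CT node could be sketched as the following Python illustration (our naming, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class CTNode:
    """Candidate Tree node, following the fields listed on the slides."""
    id: str                                              # a prefix of a Dewey ID appearing in pathLists
    qnode: str                                           # the QPT node that this ID corresponds to
    parent_list: list = field(default_factory=list)      # ancestors whose QNode is the parent of qnode
    descendant_map: dict = field(default_factory=dict)   # mandatory child/descendant -> 1 if seen, else 0
    pdt_cache: list = field(default_factory=list)        # descendants passing descendant checks, ancestor checks pending

    def satisfies_descendant_constraints(self) -> bool:
        # The "DM check": every mandatory child/descendant must have been seen.
        return all(self.descendant_map.values())
```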
Step 2: AddCTNode Algorithm
  • A prefix is added to the CT if it has a corresponding QPT node and is not already in the CT
  • If a prefix is associated with a ’c’ annotation, the tf values are retrieved from the inverted lists
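
A minimal sketch of those two rules, reusing the hypothetical CTNode above; the real AddCTNode signature appears in the GeneratePDT pseudocode, so treat this as our reading of the bullets rather than the algorithm itself:

```python
def add_ct_node(ct_nodes: dict, prefix: str, qpt_node, annotation, inv_lists: dict) -> None:
    """Add a Dewey-ID prefix to the CT if it maps to a QPT node and is not already present."""
    if qpt_node is None or prefix in ct_nodes:
        return                                   # no corresponding QPT node, or already in the CT
    node = CTNode(id=prefix, qnode=qpt_node)
    if annotation == "c":
        # 'c'-annotated nodes: fetch tf values for each keyword from the inverted lists,
        # mirroring the kwd*/tf* attributes seen in the PDT example.
        node.tf = {kwd: dict(postings).get(prefix, 0) for kwd, postings in inv_lists.items()}
    ct_nodes[prefix] = node
```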
Step 2: PDT Creation

(The GeneratePDT pseudocode is repeated on this slide.)

Step 2: The Main Loop
  • Adds new Dewey IDs to the CT
  • Creates PDT nodes using CT nodes
  • Every iteration ensures that the Dewey IDs that have been processed and are known to be PDT nodes are either in the CT or in the result PDT
  • The result PDT only contains IDs that satisfy the PDT definition
Step 2: The Main Loop
  • The main loop has three stages, described on the following slides:
Step 2: The Main Loop - Stage A

(The GeneratePDT pseudocode is repeated on this slide.)

Step 2: The Main Loop - Stage A
  • The algorithm adds the minimum IDs in pathLists corresponding to the QPT nodes
  • (books//book/isbn, (1.1.1:"111-11-1111"), (1.2.1:"121-23-1321"), ...)
Step 2: The Main Loop - Stages B, C

(The GeneratePDT pseudocode is repeated on this slide.)

Step 2: The Main Loop - Stage B
  • The algorithm creates PDT nodes using CT nodes in CT.MinIDPath
  • From top down:
    • If the node satisfies the descendant constraints (DM check), add it to its parent's PdtCache
    • Recursively invoke CreatePDTNodes on the element

PdtCache: (isbn, 1.1.1)

Step 2: The Main Loop - Stage C
  • The algorithm then removes nodes from the bottom up
  • For example, after processing and removing node "title", node "book" is removed as well, because it has no remaining children and its descendant constraints are not satisfied.

PdtCache: (isbn, 1.1.1), (title, 1.1.4)

Step 2: The Main Loop - Stage C

Before removing book 1.2:

  PdtCache: (isbn, 1.2.1), (title, 1.2.3), (year, 1.2.6)
  PdtCache: (book, 1.2)

After removing book 1.2, the nodes in its PdtCache are propagated:

  PdtCache: (book, 1.2), (isbn, 1.2.1), (title, 1.2.3), (year, 1.2.6)

Step 2: The Main Loop - Stage C
  • Since nodes are processed in Dewey ID order, a node's unsatisfied descendant constraints can never become satisfied later
  • Next, we check if nodes satisfy ancestor constraints, which is done by checking nodes in their parent lists.
  • If those parent nodes are known to be non-PDT nodes, then we can conclude that the nodes in the cache will not satisfy ancestor restrictions, and can hence be removed.
  • Otherwise the cache node still has other parents, which could be PDT nodes, and will thus be propagated to the PdtCache of the ancestor.
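
The propagate-or-drop decision just described could be sketched as follows, reusing the hypothetical CTNode from earlier; the bookkeeping names (known_non_pdt, resolve_pdt_cache) are ours, and the real logic lives inside CreatePDTNodes:

```python
def resolve_pdt_cache(removed: CTNode, known_non_pdt: set) -> None:
    """When a CT node is removed bottom-up, resolve the entries waiting in its PdtCache."""
    ancestor = removed.parent_list[0] if removed.parent_list else None
    for cached in removed.pdt_cache:
        if all(p.id in known_non_pdt for p in cached.parent_list):
            # Every candidate parent is already known not to be a PDT node, so the cached
            # node's ancestor restriction can never be satisfied: drop it.
            continue
        if ancestor is not None:
            # Some parent may still turn out to be a PDT node, so keep waiting one level up.
            ancestor.pdt_cache.append(cached)
    removed.pdt_cache.clear()
```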
Step 3: Query Evaluation
  • Once the PDTs are generated, they are fed to a traditional evaluator to produce the temporary results, which are then sent to the Scoring & Materialization Module.
  • tf values are encoded as XML attributes
  • tf-idf scores are calculated for each PDT element using tf values
  • The Scoring & Materialization Module then identifies the view results with top-k scores.
  • The contents of these results are retrieved from the document storage system
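
As a rough sketch of the scoring step, the tf values can be read from the encoded kwd*/tf* attributes (as in the earlier PDT example), combined with idf values into the generic score from the Scoring System slide, and then the top-k results kept. The paper's module works over its own representation, so this is only illustrative:

```python
import heapq
import xml.etree.ElementTree as ET

def score(result: ET.Element, idf: dict) -> float:
    """tf-idf style score summed over a result element and its descendants,
    reading the kwd*/tf* attributes in which tf values are encoded."""
    total = 0.0
    for elem in result.iter():
        i = 1
        while f"kwd{i}" in elem.attrib:
            kwd, tf = elem.attrib[f"kwd{i}"], int(elem.attrib[f"tf{i}"])
            total += tf * idf.get(kwd, 0.0)
            i += 1
    return total

def top_k(results: list, idf: dict, k: int) -> list:
    """Keep only the k view results with the highest scores; only these are materialized."""
    return heapq.nlargest(k, results, key=lambda e: score(e, idf))
```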
Outline
  • Motivation and Problem Definition
  • Existing Data and Data Structures
  • Algorithm
  • Experiments