Efficient Keyword Search over Virtual XML Views. Feng Shao and Lin Guo and Chavdar Botev and Anand Bhaskar and Muthiah Chettiar and Fan Yang Cornell University Jayavel Shanmugasundaram Yahoo! Research 2008. 02. 14. Summarized by Dongmin Shin , IDS Lab., Seoul National University

Efficient Keyword Search over Virtual XML Views

Feng Shao and Lin Guo and Chavdar Botev

and Anand Bhaskar and Muthiah Chettiar and Fan Yang

Cornell University

Jayavel Shanmugasundaram

Yahoo! Research

2008. 02. 14.

Summarized by Dongmin Shin, IDS Lab., Seoul National University

Presented by Dongmin Shin, IDS Lab., Seoul National University


Index

Introduction

Background

System Overview

QPT Generation Module

PDT Generation Module

Experiments

Conclusion and Future Work


Introduction

  • Fundamental assumption of traditional information retrieval systems: the set of documents being searched is materialized.


Introduction

But the view is often virtual (unmaterialized)

  • The aggregator may not have the resources to materialize all the data

  • If the view is materialized, its contents may be out of date, or maintaining the view may be expensive

  • The data sources may not wish to provide their entire data


Introduction

  • Need: efficient evaluation of keyword search queries over virtual XML views


Background


TF-IDF method

XML Scoring

tf(e,k) : the number of distinct occurrences of the keyword k in element e and its descendants

idf(k) =

score(e,Q) =
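
The right-hand sides of idf and score are not given above. As a minimal sketch, assuming one common TF-IDF form (idf(k) as the number of elements divided by the number of elements that contain k, and score(e,Q) as the sum of tf(e,k) × idf(k) over the keywords in Q), the computation could look like the following. The nested-dict element layout, the function names, and both formulas are assumptions for illustration, not the definitions from the slides.

```python
# Hypothetical element layout: {"text": str, "children": [element, ...]}

def tf(e, k):
    """Occurrences of keyword k in element e and its descendants."""
    own = e["text"].lower().split().count(k.lower())
    return own + sum(tf(c, k) for c in e["children"])

def idf(elements, k):
    """ASSUMED form: |E| / |{e in E : e contains k}|."""
    containing = sum(1 for e in elements if tf(e, k) > 0)
    return len(elements) / containing if containing else 0.0

def score(e, query, elements):
    """ASSUMED form: sum over query keywords of tf(e, k) * idf(k)."""
    return sum(tf(e, k) * idf(elements, k) for k in query)

# Two toy view elements (made-up data).
book1 = {"text": "", "children": [
    {"text": "XML search", "children": []},
    {"text": "keyword search over XML", "children": []},
]}
book2 = {"text": "", "children": [
    {"text": "databases", "children": []},
]}
view = [book1, book2]

ranked = sorted(view, key=lambda e: score(e, ["xml", "search"], view), reverse=True)
```

Note that tf is computed over the whole subtree of an element, which is why a view element can score well even when the keywords only appear in its descendants.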


System Overview

(1) Keyword queries are issued over virtual views

(2) The parser redirects the query to the Query Pattern Tree (QPT) Generation Module

(3) The QPT is sent to the Pruned Document Tree (PDT) Generation Module

(4) PDTs are generated using only the path indices and inverted list indices

(5) The rewritten query and the PDTs are sent to the evaluator

(6) The evaluator produces the view that contains all view elements with pruned content

(7) Elements are scored, and only those with the highest scores are fully materialized using document storage
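
Step (7) can be pictured as a top-k selection followed by a storage lookup for the winners only. The sketch below is one possible way to express that idea; the function name, the `fetch` callback, and the toy storage dictionary are hypothetical and are not part of the described system.

```python
import heapq

def top_k_materialize(scored_ids, k, fetch):
    """Keep the k highest-scoring elements and materialize only those.

    scored_ids: iterable of (dewey_id, score) pairs for pruned view elements.
    fetch: callable mapping a dewey_id to the full element (document storage).
    """
    best = heapq.nlargest(k, scored_ids, key=lambda pair: pair[1])
    return [(dewey_id, s, fetch(dewey_id)) for dewey_id, s in best]

# Toy document storage keyed by Dewey ID (made-up data).
storage = {
    (1, 1): "<book>full content of book 1.1</book>",
    (1, 2): "<book>full content of book 1.2</book>",
    (1, 3): "<book>full content of book 1.3</book>",
}
fetched = []  # records which IDs actually hit document storage

def fetch(dewey_id):
    fetched.append(dewey_id)
    return storage[dewey_id]

scored = [((1, 1), 8.0), ((1, 2), 0.5), ((1, 3), 3.0)]
results = top_k_materialize(scored, 2, fetch)
```

Only the two best elements are read from storage here; the low-scoring element is never materialized.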


System Overview

  • XML Storage

    • Dewey IDs

      • Popular ID format

      • Hierarchical numbering scheme

      • ID of an element contains the ID of its parent
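
To illustrate the property that an element's ID contains the ID of its parent, the sketch below represents a Dewey ID as a tuple of integers. The dotted-string notation and the helper names are assumptions made for this example.

```python
from typing import Tuple

DeweyID = Tuple[int, ...]

def parse(s: str) -> DeweyID:
    """Turn a dotted string such as "1.2.1" into a tuple (1, 2, 1)."""
    return tuple(int(part) for part in s.split("."))

def parent(d: DeweyID) -> DeweyID:
    """The parent's ID is the element's ID without its last component."""
    return d[:-1]

def is_ancestor(a: DeweyID, d: DeweyID) -> bool:
    """a is an ancestor of d exactly when a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a

ids = [parse(s) for s in ["1.2.1", "1", "1.10", "1.2"]]
# Tuple comparison is component-wise, so sorting gives document order
# (a string sort would wrongly place "1.10" before "1.2").
ordered = sorted(ids)
```

Because ancestry reduces to a prefix test, structural relationships can be checked from the IDs alone, without touching the document.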


System Overview

  • XML Indexing

    • Path indices

      • Evaluate XML path and twig (i.e., branching path) queries

      • Store XML paths with values in a relational table

      • Use indices such as a B+-tree

      • One row for each unique (Path, Value) pair

      • IDList: the list of IDs of all elements on the path

      • A B+-tree index is built on the (Path, Value) pair
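
A rough in-memory picture of such a path index is sketched below. A sorted list of keys searched with `bisect` stands in for the B+-tree, and a dictionary stands in for the relational table; the class, its method names, and the sample paths are hypothetical. Values are assumed to be strings.

```python
from bisect import bisect_left
from collections import defaultdict

class PathIndex:
    """One row per unique (path, value) pair, each holding an IDList."""

    def __init__(self):
        self._rows = defaultdict(list)   # (path, value) -> IDList
        self._keys = []                  # sorted keys, stand-in for a B+-tree

    def add(self, path, value, dewey_id):
        self._rows[(path, value)].append(dewey_id)

    def build(self):
        for id_list in self._rows.values():
            id_list.sort()               # keep each IDList in document order
        self._keys = sorted(self._rows)

    def lookup(self, path, value):
        """Point lookup on a (path, value) pair."""
        return list(self._rows.get((path, value), []))

    def lookup_path(self, path):
        """Range scan: every (value, IDList) row stored for a path."""
        i = bisect_left(self._keys, (path, ""))
        rows = []
        while i < len(self._keys) and self._keys[i][0] == path:
            key = self._keys[i]
            rows.append((key[1], list(self._rows[key])))
            i += 1
        return rows

index = PathIndex()
index.add("/books/book/year", "1996", (1, 3, 2))
index.add("/books/book/year", "2004", (1, 2, 2))
index.add("/books/book/year", "1996", (1, 1, 2))
index.add("/books/book/title", "XML", (1, 1, 1))
index.build()
```

A point lookup answers a path with a value predicate, while the range scan returns the values for a path, which is what a query needs when it only asks for element values.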


System Overview

  • Inverted list indices

    • For each keyword in the document collection, store the list of XML elements that directly contain the keyword
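
As a small sketch of this structure, the function below builds one inverted list per keyword from elements given as (Dewey ID, directly contained text) pairs. Storing a term frequency next to each ID is an assumption made here for illustration, as are the whitespace tokenization and the function name.

```python
from collections import defaultdict

def build_inverted_lists(elements):
    """Map each keyword to a sorted list of (dewey_id, tf) entries.

    elements: iterable of (dewey_id, text), where text is the text that the
    element contains directly (not the text of its descendants).
    """
    lists = defaultdict(dict)
    for dewey_id, text in elements:
        for token in text.lower().split():
            lists[token][dewey_id] = lists[token].get(dewey_id, 0) + 1
    # Sort each list by Dewey ID so entries appear in document order.
    return {keyword: sorted(entries.items()) for keyword, entries in lists.items()}

elements = [
    ((1, 1, 1), "XML search"),
    ((1, 1, 2), "search over XML XML"),
    ((1, 2, 1), "databases"),
]
inverted = build_inverted_lists(elements)
```

Keeping each list sorted by Dewey ID lets it be merged with other ID lists in a single pass.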


QPT Generation Module


PDT Generation Module

  • Output

    • Only contains elements that correspond to nodes in the QPT

    • Only contains element values that are required during query evaluation

  • Advantage

    • Query evaluation is likely to be more efficient and scalable

    • Allows us to use the regular (unmodified) query evaluator


PDT Generation Module

  • Key Idea

    • An element e in the document corresponding to a node n in the QPT is selected for inclusion only if it satisfies three types of constraints

      • Ancestor constraint – an ancestor element of e that corresponds to the parent of n in the QPT should also be selected

      • Descendant constraint – for each mandatory edge from n to a child of n in the QPT, at least one child/descendant element of e corresponding to that child of n should also be selected

      • Predicate constraint – if e is a leaf node, it should satisfy all predicates associated with n
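
The three constraints can be made concrete with a naive check over an in-memory document tree. This is only a sketch of the constraints themselves, not the index-based PDT generation algorithm: the `QNode` and `Elem` classes, the treatment of every QPT edge as a descendant edge, and the sample data are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class QNode:                      # hypothetical QPT node
    tag: str
    predicate: Optional[Callable[[str], bool]] = None
    children: List[Tuple["QNode", bool]] = field(default_factory=list)  # (child, mandatory)

@dataclass
class Elem:                       # hypothetical document element
    tag: str
    value: str = ""
    children: List["Elem"] = field(default_factory=list)

def descendants(e):
    for c in e.children:
        yield c
        yield from descendants(c)

def satisfies_below(e, n):
    """Predicate and descendant constraints for element e against QPT node n."""
    if e.tag != n.tag:
        return False
    if n.predicate is not None and not n.predicate(e.value):
        return False
    return all(
        any(satisfies_below(d, child) for d in descendants(e))
        for child, mandatory in n.children if mandatory
    )

def select(e, n, out):
    """Top-down walk: matches below e are considered only after e itself
    has been selected, which enforces the ancestor constraint."""
    if not satisfies_below(e, n):
        return
    out.append(e)
    for child, _ in n.children:
        for d in descendants(e):
            select(d, child, out)

qpt = QNode("books", children=[(QNode("book", children=[
    (QNode("title"), True),
    (QNode("year", predicate=lambda v: int(v) > 1995), True),
]), True)])

doc = Elem("books", children=[
    Elem("book", children=[Elem("title", "XML"), Elem("year", "1996")]),
    Elem("book", children=[Elem("title", "DB"), Elem("year", "1990")]),   # fails predicate
    Elem("book", children=[Elem("year", "2001")]),                        # no title
])

selected = []
select(doc, qpt, selected)
```

In this toy input only the first book survives: the second fails the year predicate and the third lacks the mandatory title, so neither they nor their children are selected.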


PDT Generation Module

  • PrepareList

    (1) Issues a lookup on path indices for each QPT node that has no mandatory child edges

    (2) Identifies nodes that have a ‘v’ annotation to obtain values and ids

    (3) Looks up inverted list indices and retrieves the list of Dewey IDs containing the keywords along with tf values


PDT Generation Module

Candidate Tree (CT)


PDT Generation Module

  • Step 1 : adding new IDs

    • Adds the current minimum IDs in pathLists
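
One way to read "current minimum IDs" is as a multi-way merge: each pathList is sorted by Dewey ID, and the smallest unprocessed ID across all lists is taken next. The sketch below shows only that merge order, under the assumption that pathLists can be modeled as a dictionary of sorted lists; what the candidate tree does with each ID is not modeled, and the names are hypothetical.

```python
import heapq

def ids_in_document_order(path_lists):
    """Yield (dewey_id, list_name) pairs in ascending Dewey ID order.

    path_lists: dict mapping a list name (e.g. a QPT node) to a list of
    Dewey IDs, each list already sorted.
    """
    tagged = ([(dewey_id, name) for dewey_id in ids]
              for name, ids in path_lists.items())
    # heapq.merge consumes the sorted inputs lazily, always yielding
    # the smallest remaining head.
    yield from heapq.merge(*tagged)

path_lists = {
    "title": [(1, 1, 1), (1, 2, 1)],
    "year":  [(1, 1, 2), (1, 2, 2)],
    "isbn":  [(1, 1, 3)],
}
merged = list(ids_in_document_order(path_lists))
```

Processing IDs in this order means all IDs under one book arrive before any ID of the next book, so a consumer only needs to keep the current subtree in memory.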


PDT Generation Module

  • Step 2 : creating PDT nodes

    • Create PDT nodes using CT nodes

    • Top-down

    • Check DM value of each CT node

      • If it is “1”, create it in the pdt cache

      • If not, check the children of that node

        • If the DM value of a child node is “1”, create it in the pdt cache of the parent node


PDT Generation Module

  • Step 3 : removing CT nodes

    • Bottom-up

    • Check if each node satisfies ancestor constraints

      • If not, remove

      • If so, propagate to the pdt cache of the ancestor

    • If some node has no children and does not satisfy descendant constraints, remove


PDT Generation Module

  • When we remove the root node “books”, all IDs in its pdt cache will be propagated to the result PDT


Experiments

  • 500MB INEX dataset

  • Varying parameters

    • Size of data, # keywords, selectivity of keywords

    • # of joins, join selectivity, level of nesting

    • # of results, Avg. size of view element

  • Four alternative approaches

    • Baseline

    • GTP : general solution to integrate structure and keyword search queries

    • Efficient : proposed architecture

    • Proj : techniques of projecting XML documents


Experiments

  • The cost of generating PDTs scales gracefully

  • Overhead of post-processing (scoring and materializing) is negligible

  • The cost of the query evaluator dominates the entire cost

EFFICIENT is a scalable and efficient solution


Experiments

  • Run time for EFFICIENT increases

    • Because the cost of the query evaluation increases

  • Run time for EFFICIENT increases slightly

    • Because it accesses more inverted lists to retrieve tf values


Conclusion and Future Work

  • Conclusion

    • A general technique for evaluating keyword search queries over views

    • Efficient over a wide range of parameters

  • Future Work

    • Instead of using the regular query evaluator, we could use the techniques proposed for ranked query evaluation

    • Views may contain non-monotonic operators such as group-by

