

Relax and Adapt: Computing Top-k Matches to XPath Queries

Amélie Marian (Columbia University)

Joint work with:

Sihem Amer-Yahia (AT&T Research)

Nick Koudas (University of Toronto)

Divesh Srivastava (AT&T Research)

Example
  • Heterogeneous XML data about books
  • Query:

book[./info/title="Great Expectations"] and [./info/author="Dickens"] and [./edition="paperback"]

[Figure: the query tree pattern and several book data trees with differing structure. Nodes shown: book, info, edition (paperback), author (Dickens), title (Great Expectations). Legend: the query root node is the distinguished node.]
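
To make the heterogeneity concrete, here is a minimal Python sketch that checks the query above against three book elements. The sample documents, their `id` attributes, and the function names are hypothetical and are not taken from the slides; only the three query predicates come from the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: three books that store the same facts differently.
DOCS = """
<books>
  <book id="a"><info><title>Great Expectations</title><author>Dickens</author></info><edition>paperback</edition></book>
  <book id="b"><info><title>Great Expectations</title><author>Dickens</author><edition>paperback</edition></info></book>
  <book id="c"><title>Great Expectations</title><info><author>Dickens</author></info></book>
</books>
"""

def exact_match(book):
    """True if the book satisfies all three predicates with the exact paths."""
    return (
        book.findtext("./info/title") == "Great Expectations"
        and book.findtext("./info/author") == "Dickens"
        and book.findtext("./edition") == "paperback"
    )

def relaxed_match(book):
    """A looser check: title and author anywhere below book, edition ignored."""
    return (
        book.findtext(".//title") == "Great Expectations"
        and book.findtext(".//author") == "Dickens"
    )

root = ET.fromstring(DOCS)
exact = [b.get("id") for b in root.iter("book") if exact_match(b)]
relaxed = [b.get("id") for b in root.iter("book") if relaxed_match(b)]
print(exact, relaxed)
```

Only the first book matches exactly; the other two can only be returned as approximate matches, which is what motivates ranking.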

Amélie Marian - Columbia University

XML Query Relaxation [Amer-Yahia et al. EDBT’02]
  • Tree pattern relaxations:
    • Leaf node deletion
    • Edge generalization
    • Subtree promotion

[Figure: query and data tree patterns for book. Labels shown: Query, Data, book, info, edition?, edition (paperback), author (Dickens), title (Great Expectations).]
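
The three relaxations can be sketched as transformations on a tree pattern. This is a simplified illustration, not the cited paper's formulation: the dictionary representation and function names are assumptions, and any preconditions the paper places on when a relaxation may be applied are ignored.

```python
# Pattern: child -> (parent, axis). "/" is parent-child, "//" is ancestor-descendant.
# This encoding of the example query is an assumption made for illustration.
QUERY = {
    "info": ("book", "/"),
    "title": ("info", "/"),
    "author": ("info", "/"),
    "edition": ("book", "/"),
}

def edge_generalization(pattern, node):
    """Relax the edge above `node` from parent-child to ancestor-descendant."""
    parent, _ = pattern[node]
    return {**pattern, node: (parent, "//")}

def leaf_deletion(pattern, node):
    """Drop a leaf node from the pattern, so matches no longer need it."""
    if any(parent == node for parent, _ in pattern.values()):
        raise ValueError(f"{node} is not a leaf")
    return {k: v for k, v in pattern.items() if k != node}

def subtree_promotion(pattern, node):
    """Re-attach `node` (and its subtree) as a descendant of its grandparent."""
    parent, _ = pattern[node]
    grandparent, _ = pattern[parent]  # raises KeyError if parent is the root
    return {**pattern, node: (grandparent, "//")}

print(edge_generalization(QUERY, "edition")["edition"])  # ('book', '//')
print(sorted(leaf_deletion(QUERY, "edition")))           # ['author', 'info', 'title']
print(subtree_promotion(QUERY, "title")["title"])        # ('book', '//')
```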

Top-k Queries over XML Data: Motivations and Challenges
  • Structure heterogeneity
    • Efficient identification of approximate matches
  • Top-k
    • Ranking of approximate matches based on similarity to query
    • Early pruning
  • Query processing cost
    • Cost increases with number of matches evaluated
  • Data explosion
    • Many approximate matches
    • XML path queries akin to joins
    • Prioritization to increase pruning

Contributions
  • Whirlpool: adaptive architecture and top-k query processing strategy for XPath queries
    • Goal: early pruning of non-top-k partial matches
    • Approach: partial matches may follow different plans, and may be at different stages of query execution
  • Real prototype implementation of Whirlpool
    • Instantiation of Whirlpool for various “routing strategies” and “prioritization” alternatives

Closely Related Work
  • Adaptive query processing
    • Eddies [Avnur and Hellerstein. SIGMOD’00]:
      • Dynamic query join plans to adapt to processing environment
      • No pruning
  • Adaptive top-k query processing
    • Upper [Bruno et al. ICDE’01]:
      • Prioritization of partial matches based on maximum possible scores
      • Adaptive routing based on scores
      • No joins

Outline
  • Whirlpool Architecture
  • Query Processing
    • Strategy
    • Alternatives
  • Evaluation Settings
  • Evaluation Results

Whirlpool Architecture

[Figure: architecture diagram showing the query tree pattern (book, info, edition (paperback), author (Dickens), title (Great Expectations)), the Router, one server per query node (book server, edition server, title server, info server, author server), and the Top-k Set.]

Whirlpool Architecture: Components
  • Top-k Set
    • Only one match with a given root node
    • Used for pruning
    • Complete matches are not processed further, incomplete matches are sent to the router
  • Router
    • Router queue is ordered by the partial matches' maximum possible final scores
    • Dynamically chooses which server to send each partial match to, based on the routing strategy
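
The top-k set's two roles (one entry per root node, and a pruning threshold) can be sketched as follows. This is a minimal illustration under assumptions: the class name, method names, and the rule "prune when the maximum possible final score is below the k-th current score" are my reading of the bullets, not the prototype's actual code.

```python
class TopKSet:
    """Keeps the best current score per root node, for at most k roots."""

    def __init__(self, k):
        self.k = k
        self.best = {}  # root node id -> best score seen for a match with that root

    def offer(self, root, score):
        """Record a match; only the best match per root is kept."""
        if score > self.best.get(root, float("-inf")):
            self.best[root] = score
        if len(self.best) > self.k:
            worst = min(self.best, key=self.best.get)
            del self.best[worst]

    def threshold(self):
        """k-th best score, or -inf while fewer than k roots are held."""
        if len(self.best) < self.k:
            return float("-inf")
        return min(self.best.values())

    def can_prune(self, max_possible_final_score):
        """A partial match that cannot reach the threshold can be discarded."""
        return max_possible_final_score < self.threshold()


topk = TopKSet(k=2)
topk.offer("b1", 0.9)
topk.offer("b2", 0.5)
topk.offer("b1", 0.7)   # lower score for the same root: ignored
topk.offer("b3", 0.6)   # displaces b2
print(topk.best, topk.can_prune(0.55), topk.can_prune(0.8))
```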

Whirlpool Architecture: Components
  • Root server:
    • Generates candidate matches
  • Node servers:
    • Maintain priority queue of partial matches
    • For each partial match that is processed:
      • Compute a set of extended partial (or complete) matches
      • Compute scores of new matches
      • Check partial matches against the current top-k set
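
The per-match work at a node server, as listed above, might look like the sketch below. The tuple layout, the `extensions` callback, and the `max_rest` and `threshold` parameters are hypothetical names introduced for illustration; how the prototype actually computes scores and bounds is not described on the slides.

```python
import heapq

def server_step(queue, extensions, max_rest, threshold):
    """Process the highest-priority partial match at one node server.

    queue:      heap of (-priority, root, score) tuples (heapq is a min-heap)
    extensions: function root -> list of score gains, one per joining node here
    max_rest:   largest total gain still obtainable from servers not yet visited
    threshold:  current k-th best score in the top-k set
    Returns the extended matches that cannot yet be ruled out.
    """
    _, root, score = heapq.heappop(queue)
    kept = []
    for gain in extensions(root):
        new_score = score + gain              # score of the extended match
        if new_score + max_rest >= threshold: # could still reach the top-k
            kept.append((root, new_score))
    return kept


queue = [(-0.9, "b1", 0.5), (-0.4, "b2", 0.25)]
heapq.heapify(queue)
gains = {"b1": [0.5, 0.0]}  # hypothetical: two candidate extensions for b1
result = server_step(queue, lambda r: gains.get(r, []), max_rest=0.25, threshold=1.0)
print(result)
```

Here the extension with gain 0.5 survives, while the one with gain 0.0 is pruned because even its best possible final score stays below the threshold.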

Query Processing Alternatives
  • Prioritization Strategies (at each server)
    • FIFO
    • Current Score
    • Maximum Possible Next Score
    • Maximum Possible Final Score
  • Routing Decisions (at the router)
    • Static
    • Score-based
      • Likely to increase score the most
      • Likely to increase score the least
    • Size-based
      • Likely to produce the fewest matches
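
The alternatives above can be read as different ordering keys and selection rules. The sketch below is one possible encoding: the field names, strategy keys, and the per-server estimates are all hypothetical, and static routing (a fixed server order) is omitted because it does not depend on estimates.

```python
from dataclasses import dataclass

@dataclass
class PartialMatch:
    arrival: int      # order of arrival at the server
    current: float    # score accumulated so far
    max_here: float   # hypothetical bound: largest gain this server could add
    max_rest: float   # hypothetical bound: largest gain all later servers could add

# Prioritization at a server: the match with the largest key is processed next.
PRIORITY = {
    "fifo": lambda m: -m.arrival,
    "current_score": lambda m: m.current,
    "max_next_score": lambda m: m.current + m.max_here,
    "max_final_score": lambda m: m.current + m.max_here + m.max_rest,
}

def pick_next(queue, strategy):
    return max(queue, key=PRIORITY[strategy])

def route(estimates, strategy):
    """estimates: server name -> (expected score gain, expected number of matches)."""
    if strategy == "max_score":
        return max(estimates, key=lambda s: estimates[s][0])
    if strategy == "min_score":
        return min(estimates, key=lambda s: estimates[s][0])
    if strategy == "fewest_matches":  # size-based
        return min(estimates, key=lambda s: estimates[s][1])
    raise ValueError(strategy)


queue = [
    PartialMatch(0, 1.0, 0.0, 0.0),
    PartialMatch(1, 2.0, 0.0, 0.0),
    PartialMatch(2, 1.0, 2.0, 0.0),
    PartialMatch(3, 0.0, 1.0, 4.0),
]
picks = {name: pick_next(queue, name).arrival for name in PRIORITY}
estimates = {"title": (0.5, 3), "author": (0.25, 1), "edition": (0.125, 8)}
print(picks, route(estimates, "fewest_matches"))
```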

Evaluation Strategies
  • Lockstep (Static)
    • Partial matches follow same execution plan
    • Partial matches have gone through exactly the same number of operations
  • Whirlpool Single-threaded (Adaptive)
    • Partial matches adaptively routed
    • Process the partial match with the highest maximum final score (Query processing similar to Upper)
    • Only one partial match processed at a time
  • Whirlpool Multi-threaded (Adaptive)
    • Prioritization strategy at server decides which partial match to process next at server
    • System determines which server to process next

Evaluation Metrics
  • Parameters:
    • Query size
    • Document size
    • k
    • Parallelism
    • Scoring function (tf.idf proposed in paper)
  • Measures:
    • Query execution time
    • Number of server operations
    • Number of partial matches created

Evaluation Setting
  • C++ implementation, with POSIX threads
  • Default machine:
    • Red Hat 7.1 Linux
    • 1.4 GHz dual processor
    • 2 GB RAM
  • XML Documents generated using XMark generating tool
  • XPath Queries chosen from XMark to illustrate different relaxations
  • XML nodes stored using Dewey encoding
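
Dewey encoding labels each node with its parent's label followed by its position among its siblings, so structural relationships reduce to prefix tests on labels. The sketch below illustrates this; the function names and sample labels are assumptions, and the prototype itself is written in C++.

```python
def parse(label):
    """Turn a Dewey label such as '1.2.1' into a tuple of integers."""
    return tuple(int(part) for part in label.split("."))

def is_ancestor(a, d):
    """True if label a is a proper ancestor of label d (the '//' axis)."""
    return len(a) < len(d) and d[:len(a)] == a

def is_parent(a, d):
    """True if label a is the parent of label d (the '/' axis)."""
    return len(d) == len(a) + 1 and d[:len(a)] == a


book, info, title = parse("1.2"), parse("1.2.1"), parse("1.2.1.3")
print(is_parent(book, info), is_ancestor(book, title), is_parent(book, title))
```

Comparing integer tuples rather than raw strings avoids treating `1.2` as a prefix of `1.20.1`.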

Comparison of Adaptive Routing Strategies

Whirlpool-S and Whirlpool-M perform approximately the same number of server operations

Static Routing Strategies vs. Best Adaptive

Effect of Parallelism

Varying Query Size and k (log scale)

[Figure: chart on a log scale; annotations on the chart read 60%, 48%, and 20%.]

For large queries and high values of k, Whirlpool-M performs fewer server operations than Whirlpool-S (and is faster even on a one-processor machine). For q3 with k=75, it performs 27% fewer server operations.

Varying Query Size and Document Size

Almost twice as fast

Scalability

Percentage of partial matches created by Whirlpool-M as a function of the maximum possible number of partial matches

Conclusions
  • Efficient adaptive top-k query processing strategy
    • Minimize number of partial matches evaluated
  • Benefit from parallelism with little threading overhead
  • Adapt to different environments
    • Score distribution
    • Selectivity distribution
  • Extensive experimental evaluation
    • Good scalability
