A Platform for Efficient Full-Text SEARCH on the Web. Emiran Curtmola. IDAR 2007. Search Semi-structured Data (XML). Growing amount of XML data available for processing and exchange Need for text predicates that go beyond simple keyword search


A Platform for Efficient Full-Text SEARCH on the Web

Emiran Curtmola

IDAR 2007


Search Semi-structured Data (XML)

  • Growing amount of XML data available for processing and exchange

  • Need for text predicates that go beyond simple keyword search

  • Existing applications need to query both the structure and the text of documents

  • Full-text (FT) queries

    • query structure + text

    • complex, composable predicates on the words in the text

      • window, distance, order, times etc.
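
To make these predicates concrete, here is a minimal Python sketch. It assumes a match is simply a tuple of word positions, one per query keyword and listed in query order. The function names and the exact definitions (for example, a window measured as the number of words spanned) are illustrative assumptions, not the XQFT semantics.

```python
from typing import Sequence

# A match holds one word position per query keyword, in query order (assumed model).
Match = Sequence[int]

def window(match: Match, size: int) -> bool:
    """True if all matched words fit in a span of at most `size` words."""
    return max(match) - min(match) + 1 <= size

def distance(match: Match, limit: int) -> bool:
    """True if at most `limit` words separate neighbouring matched words."""
    positions = sorted(match)
    gaps = (b - a - 1 for a, b in zip(positions, positions[1:]))
    return all(gap <= limit for gap in gaps)

def ordered(match: Match) -> bool:
    """True if the keywords occur in the text in the same order as in the query."""
    return all(a < b for a, b in zip(match, match[1:]))

def times(positions: Sequence[int], at_least: int) -> bool:
    """True if a keyword occurs at least `at_least` times."""
    return len(positions) >= at_least

# Predicates compose with ordinary boolean operators.
match = (2, 5)  # e.g. "tsunami" at word 2 and "Asia" at word 5
close_and_ordered = window(match, 10) and ordered(match)
```

Because every predicate is a plain boolean function over positions, composing them is just a matter of combining calls with `and`, `or` and `not`.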


A Typical Scenario

  • E.g., web service discovery in P2P or Grid

    • Web services typically described using XML (e.g., WSDL standard)

    • Autonomous service providers use non-uniform descriptions, with variable structure and text comments

    • Query: “find web services providing info about <breaking news> on a possible tsunami in Asia (within 10 words)”


Existing Approaches: DB & IR

  • DB community

    • data centric (structure)

    • languages

    • efficient evaluation

    • e.g., XPath 2.0, XQuery 1.0, XSLT 1.0

  • Information Retrieval (IR) community

    • document centric (text)

    • indices

    • ranking methods

    • e.g., Yahoo!, Google, XXL, JuruXML, Elixir etc.

[Figure: example XML document tree with element labels doc, newspapers, newspaper, newspaper-name, breaking news, entertainment, overview, sightseeing, sailing clubs, museums, and text leaves]


Query Languages for Structure + Text

  • Challenge: a variety of competing proposals for querying XML on structure + text [BAS-06], with

    • variable expressive power

    • scoring methods

    • often fuzzy semantics

  • Front-runner language: XQuery Full-Text (XQFT)

    • Proposed by W3C task force

      • currently in Last Call, until June 22, 2007

      • could become a W3C Recommendation as early as 2008

    • Subsumes expressivity of most of the proposed FT languages

    • Reference implementation: GalaTex [Curtmola et al. XIME-P 2005]

    • Query in XQFT

      doc/newspapers/newspaper/breaking_news[

      .//* ftcontains “tsunami” and “Asia” window <=10 words]

      /overview


Need to Optimize FT Queries

  • Prior to our project, there was no work on FT query optimization; efficient evaluation was limited to

    • Conjunctive keyword search (no predicates)

    • Full-text predicates in isolation

  • Need for efficient evaluation of FT queries

    • universal formal techniques to optimize


Outline

  • Efficient evaluation of full-text queries

    • Query optimization

    • Impact of scoring methods on optimizations

    • Query distributed data

  • Summary and future work


A Novel Universal Optimization Framework

  • XQFT semantics in W3C proposal is given in functional language style

    • no apparent connection to (relational) database query languages

  • We provide an alternative (yet equivalent) semantics captured by

    • Formalization of XML full-text languages in terms of

      • keyword patterns

      • pattern matches

      • predicates evaluated through matches

    • XFT algebra

      • matches are treated as relational tuples


XFT Algebra

  • Example: query in XQFT

    .//* ftcontains “tsunami” and “Asia” window <=10 words

  • Algebraic plan (read bottom-up)

    • all occurrences (matches) of “tsunami”

    • all occurrences (matches) of “Asia”

    • common ancestors of match pairs

    • keep only ancestors of close matches
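
The plan can be sketched in relational style, with each match treated as a tuple. The Python below is only an illustration under stated assumptions: the tiny in-memory `INDEX`, the element paths, the positions, and the helper names `scan`, `join` and `select_window` are all hypothetical, and for brevity the join keeps only the lowest common ancestor of each match pair.

```python
from itertools import product
from typing import NamedTuple

class Match(NamedTuple):
    node: tuple[str, ...]  # path from the root to the element containing the word
    pos: int               # word position in the document

# Hypothetical inverted index: keyword -> list of matches.
INDEX = {
    "tsunami": [Match(("doc", "newspaper", "breaking_news", "overview"), 2),
                Match(("doc", "newspaper", "entertainment"), 90)],
    "asia":    [Match(("doc", "newspaper", "breaking_news", "overview"), 5),
                Match(("doc", "newspaper", "breaking_news", "details"), 40)],
}

def scan(keyword):
    """All occurrences (matches) of a keyword."""
    return INDEX.get(keyword.lower(), [])

def common_ancestor(a, b):
    """Longest shared prefix of two element paths."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

def join(left, right):
    """Pair up matches and attach the lowest common ancestor of each pair."""
    return [(common_ancestor(l.node, r.node), l.pos, r.pos)
            for l, r in product(left, right)]

def select_window(rows, size):
    """Keep only the pairs whose positions span at most `size` words."""
    return [row for row in rows if max(row[1:]) - min(row[1:]) + 1 <= size]

plan = select_window(join(scan("tsunami"), scan("Asia")), 10)
answers = sorted({node for node, *_ in plan})
```

Since the operators consume and produce lists of tuples, relational rewritings such as pushing the window selection into the join can be applied to this kind of plan without changing the set of answers.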


Benefits of the Optimization Framework [Amer-Yahia et al. SIGMOD 2006]

  • Enable leveraging the tried-and-true relational-style evaluation & optimization techniques, including

    • Join re-ordering

    • Pushing selection predicates into joins

  • Concise & clean formal semantics for all FT languages by translation to the XFT algebra

    • one-size-fits-all optimization for all FT languages

  • Efficient algorithms for operator evaluation through a novel and successful marriage of IR & DB

  • Measured speedup of at least two orders of magnitude over two reference XQFT engines


Outline

  • Efficient evaluation of full-text queries

    • Query optimization

    • Impact of scoring methods on optimizations

    • Query distributed data

  • Summary and future work


Integrate with Universal Scoring

  • Until now, scoring well understood on text only

  • Challenge: score structure + text

    • Non-trivial

    • Many scoring proposals; sometimes hardcoded in the algorithm

  • Extend the universal optimization framework to accommodate universal scoring


Requirements for Extending with Scores

  • Documents carry “scores”

    • relevance of the query matching documents

  • XFT algebraic operators manipulate scores

  • Requirements

    • Generic functions, not a particular scoring function

      • no single scoring method is better than all the others

    • Avoid re-computing scores: score of a node can be derived solely from the scores of its descendants


Preliminary Results: Scoring Scheme

  • Parameterized scoring scheme

    • scoreK( k,pos,n ) = score keyword k at position pos in node n

    • scoreM( p,m ) = score a match m with pattern p

      • aggregate scores from subpatterns of a pattern for the same node

    • scoreS( SM(n,p) ) = score a set of matches SM corresponding to node n and pattern p

      • aggregate scores from children to parent

  • The score of a node depends on scoring its set of matches

    • scoreK is used in scoring a match (scoreM)

    • scoreM is used in scoring a set of matches (scoreS)
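
One way to read the parameterized scheme is as three pluggable functions. The sketch below is an assumption-laden illustration, not the authors' implementation: `ScoringScheme`, `score_match`, `score_node`, the lookup table of keyword scores, and the choice of sum and max as combiners are all hypothetical, and scoreM is modelled as a binary combiner folded over the keyword scores of a match.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class ScoringScheme:
    score_k: Callable[[str, int, str], float]    # (keyword, position, node) -> score
    score_m: Callable[[float, float], float]     # combines two sub-pattern scores of a match
    score_s: Callable[[Sequence[float]], float]  # aggregates the match scores of a node

def score_match(scheme, keywords, positions, node):
    """Score one match by folding score_m over its keyword scores, left to right."""
    scores = [scheme.score_k(k, p, node) for k, p in zip(keywords, positions)]
    total = scores[0]
    for s in scores[1:]:
        total = scheme.score_m(total, s)
    return total

def score_node(scheme, keywords, matches, node):
    """Score a node from the set of matches found in it."""
    return scheme.score_s([score_match(scheme, keywords, m, node) for m in matches])

# Made-up instantiation: fixed keyword scores, sum within a match, max over matches.
LOOKUP = {"tsunami": 10.0, "Asia": 15.0, "danger": 2.0}
sum_max = ScoringScheme(
    score_k=lambda keyword, pos, node: LOOKUP[keyword],
    score_m=lambda a, b: a + b,
    score_s=max,
)

node_score = score_node(sum_max, ("tsunami", "Asia"), [(2, 5), (2, 60)], "node1")
```

Swapping in a different scheme only means constructing another `ScoringScheme`; the evaluation functions stay unchanged, which is the point of keeping the functions generic.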


    Example: Using the Scoring Scheme

    • Query: “tsunami” and “Asia” and “danger”

    • Keyword scores in node1

      • “tsunami” = scoreK(tsunami, 2, node1) = 10

      • “Asia” = scoreK(Asia, 5, node1) = 15

      • “danger” = scoreK(danger, 40, node1) = 2

    • Match (2, 5) for pattern (“tsunami”, “Asia”) = scoreM(10, 15)

    • Match (2, 5, 40) for pattern (“tsunami”, “Asia”, “danger”) = scoreM(scoreM(10, 15), 2)


    Impact of Scores on Optimizations

    • Challenge

      • Scoring breaks the expected relational “equivalent” query plans

        • scoring intermediate nodes might generate different score values


    Pitfall: Scoring Breaks Equivalence

    • Query: “tsunami” and “Asia” and “danger”, with keyword scores tsunami = 10, Asia = 15, danger = 2

    • Plan 1 joins “tsunami” and “Asia” first: scoreM(scoreM(10, 15), 2) = 7.25

    • Plan 2 joins “danger” and “Asia” first: scoreM(scoreM(2, 15), 10) = 9.25

    • Different values if scoreM is the pairwise average function

      • There are functions that break the relational equivalence

    • Need

      • Consistent scoring: same scores for equivalent plans

      • Consistent ranking: same ranks for equivalent plans
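
The two values can be checked with a few lines of Python. This is a minimal sketch that assumes left-deep plans, so a plan is just an order in which the keyword scores are combined; `plan_score` and `is_consistent` are hypothetical helper names.

```python
from functools import reduce
from itertools import permutations

def pairwise_average(a, b):
    return (a + b) / 2

def add(a, b):
    return a + b

def plan_score(score_m, leaf_scores):
    """Score a left-deep plan that combines the leaves in the given order."""
    return reduce(score_m, leaf_scores)

tsunami, asia, danger = 10, 15, 2

plan_1 = plan_score(pairwise_average, [tsunami, asia, danger])  # (tsunami, Asia) first
plan_2 = plan_score(pairwise_average, [danger, asia, tsunami])  # (danger, Asia) first
print(plan_1, plan_2)  # 7.25 9.25

def is_consistent(score_m, leaf_scores):
    """True if every join order yields the same score for these leaves."""
    scores = {plan_score(score_m, order) for order in permutations(leaf_scores)}
    return len(scores) == 1
```

With these leaf scores, `is_consistent(add, [10, 15, 2])` holds while `is_consistent(pairwise_average, [10, 15, 2])` does not: the pairwise average is commutative but not associative, so reordering joins changes the score.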


    Ongoing Work

    • Starting from the equivalent rewriting rules (RW): what are the properties of the scoring scheme such that the rewriting rule(s) hold?

      • scoreK properties?

      • scoreM properties?

      • scoreS properties?

    • E.g., join reordering requires associative, commutative scoring functions

    • E.g., top-K requires monotonicity


    Ongoing Work

    • Catalog all existing scoring methods for structure and text w.r.t. their compatibility with rewriting optimizations

      • Can we capture them in our framework?

      • E.g., vector space model is consistent scoring for the relational-style rewritings

    • From rewriting rules to scoring scheme: what are the properties of the scoring scheme (scoreK, scoreM, scoreS) such that the equivalent rewriting rule(s) hold?

    • From scoring scheme to rewriting rules: what rewriting rules hold under a particular scoring scheme (scoreK, scoreM, scoreS)?


    Ongoing Work

    • Smart, configurable optimizer

      • Plug in a particular scoring scheme at run time

      • Is it consistent scoring / ranking? (are the rewritings sound?)

        • If yes, use the rewritings

        • If not, identify and disable all non-sound rewritings


    Outline

    • Efficient evaluation of full-text queries

      • Query optimization

      • Impact of scoring methods on optimizations

      • Distributed access methods

    • Summary and future work


    Query on Distributed Data

    • Move from searching individual sources to searching highly distributed sources

    • Challenges

      • Consumers and producers: many, dynamic

        • completely decentralized

      • Users unaware of data location

        • completely distributed data

    • Our goal: efficient distributed computation

      • data discovery, evaluation, ranking of FT queries


    P2P Network with XML Sources

    • Each node can

      • produce and store XML data

      • answer queries over its local XML store

      • initiate queries on actual content of documents

    • Efficient and expressive querying of the global XML data?

    [Figure: P2P network of 12 numbered peers connected by network links, with local XML stores and two example queries, Query1: (tsunami, Asia) and Query2: (concerts, NYC)]


    Proposed Architecture

    • Consumer’s side: XFT Algebraic Engine

      • Return the answers to the FT query

      • Locally, post-process at a node

      • leverage the XFT engine

    • Producers’ side: Distributed access methods (index)

      • to discover the relevant sources

      • answer keyword/XPath part of the queries

    [Figure: the P2P network of 12 numbered peers with local XML stores]

    Proposal: Leverage Query Dissemination Trees

    • Route queries: move queries, not data

    • Peers self-organize in query dissemination trees

      • Every node contains summary of XML documents stored in its subtrees

    • Use the dissemination trees for query routing

      • Queries always posed at the root

      • If a node’s summary matches the query then forward query to children
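
The routing rule can be sketched as follows. This is a minimal illustration under assumptions the slides do not spell out: summaries are modelled as plain keyword sets (the union of everything stored in the subtree), queries are conjunctive keyword sets, and the `Peer` class and the five-peer tree are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Peer:
    name: str
    local_keywords: set[str]
    children: list["Peer"] = field(default_factory=list)
    summary: set[str] = field(default_factory=set)

def build_summaries(peer):
    """Bottom-up: a peer's summary covers its own store and its whole subtree."""
    peer.summary = set(peer.local_keywords)
    for child in peer.children:
        peer.summary |= build_summaries(child)
    return peer.summary

def route(peer, query, visited):
    """Forward the query down the tree; return peers whose local store may answer it."""
    visited.append(peer.name)
    if not query <= peer.summary:
        return []  # prune: nothing in this subtree can match all keywords
    hits = [peer.name] if query <= peer.local_keywords else []
    for child in peer.children:
        hits += route(child, query, visited)
    return hits

# Hypothetical tree: p1 is the root, where queries are always posed.
p4 = Peer("p4", {"sailing"})
p5 = Peer("p5", {"tsunami"})
p2 = Peer("p2", {"tsunami", "asia", "news"}, [p4])
p3 = Peer("p3", {"museums"}, [p5])
p1 = Peer("p1", {"concerts", "nyc"}, [p2, p3])
build_summaries(p1)

visited = []
hits = route(p1, {"tsunami", "asia"}, visited)
```

In this toy run the query reaches `p3` but is not forwarded to `p5`, because the summary at `p3` lacks "asia". Only the query moves through the tree; the documents stay where they are.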


    Define the Design Space

    • … but the overall throughput depends on the slowest node.

    • Challenge: relieve the traffic congestion

    • 1 tree per keyword

      • less congestion

      • more control overhead

    • 1 tree for all keywords

      • more congestion

      • less control overhead


    The Design Space To Explore

    • Optimal solution lies between the extremes

    • Proposal

      • Partition set of keywords into blocks

      • Build one tree per keyword block

        • connect all keywords from same block into one tree

    [Figure: partitioning the data space, on an axis spanning 1 tree per keyword and 1 tree for all keywords, with the optimal solution marked between the two extremes]
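
As a rough illustration of keyword blocks, the sketch below assigns keywords to blocks with a stable hash. Hash partitioning is an assumption made here for simplicity; the slides do not say how the blocks are chosen, and `block_of`, `partition` and `candidate_trees` are hypothetical names.

```python
import zlib

def block_of(keyword: str, num_blocks: int) -> int:
    """Stable hash partitioning: a keyword always lands in the same block."""
    return zlib.crc32(keyword.lower().encode("utf-8")) % num_blocks

def partition(keywords, num_blocks):
    """Split a vocabulary into `num_blocks` blocks; one tree is built per block."""
    blocks = [set() for _ in range(num_blocks)]
    for keyword in keywords:
        blocks[block_of(keyword, num_blocks)].add(keyword.lower())
    return blocks

def candidate_trees(query_keywords, num_blocks):
    """Trees (identified by block number) that index at least one query keyword."""
    return sorted({block_of(k, num_blocks) for k in query_keywords})

vocabulary = ["tsunami", "Asia", "danger", "concerts", "NYC", "museums"]
blocks = partition(vocabulary, 3)
single_tree = candidate_trees(["tsunami", "Asia"], 1)
```

Setting `num_blocks` to 1 gives the "1 tree for all keywords" extreme, and a very large `num_blocks` approaches "1 tree per keyword" (up to hash collisions). Which of the candidate trees a conjunctive query should actually use is a routing decision this sketch leaves open.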


    Forces at Cross-purposes

    • Optimization problem: find the minimum number of trees needed to relieve congestion (improve the overall throughput)

      • peak-to-average load within an approximation ε (acceptable ε = 20%)

    • Tradeoff: congestion vs. control traffic

      • 1 tree per keyword: less congestion, more control overhead

      • 1 tree for all keywords: more congestion, less control overhead

    [Figure: congestion and control traffic plotted against the number of trees, on an axis spanning 1 tree per keyword and 1 tree for all keywords (partitioning the data space)]


    Preliminary Results: Load Balancing

    • Requirement

      • a node that appears high in one tree will appear in lower levels in all the other trees

        • i.e., guarantee that a node appears on a different tree level in each tree

    • Load is balanced when each node appears in the top levels of at most one tree

    • Our approach: circular permutation of the internal nodes among the different trees

      • peak load decreases drastically

      • peak-to-average processing load is within 15%
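
The idea behind the circular permutation can be illustrated with a toy model. Everything below is an assumption made for the sketch: trees are complete binary trees stored in array order, a peer's load is the number of trees in which it is an internal (forwarding) node, and each tree's layout is the peer list shifted by a multiple of the number of internal slots. It does not reproduce the measured 15% figure.

```python
def internal_slots(num_peers: int, fanout: int = 2) -> int:
    """Number of non-leaf slots in a complete `fanout`-ary tree in array order."""
    return 0 if num_peers < 2 else (num_peers - 2) // fanout + 1

def tree_layout(peers, shift):
    """Array layout of one tree: slot 0 is the root, the first slots are internal."""
    shift %= len(peers)
    return peers[shift:] + peers[:shift]

def internal_load(peers, num_trees, rotate, fanout=2):
    """Count, per peer, in how many trees it occupies an internal slot."""
    k = internal_slots(len(peers), fanout)
    load = {peer: 0 for peer in peers}
    for tree in range(num_trees):
        layout = tree_layout(peers, tree * k if rotate else 0)
        for peer in layout[:k]:
            load[peer] += 1
    return load

def peak_to_average(load):
    values = list(load.values())
    return max(values) / (sum(values) / len(values))

peers = list(range(1, 13))  # 12 peers
without_rotation = peak_to_average(internal_load(peers, 2, rotate=False))
with_rotation = peak_to_average(internal_load(peers, 2, rotate=True))
```

In this toy setting with two trees, the unrotated layout puts the same six peers in the internal slots of both trees (peak-to-average 2.0), while the rotated layout makes every peer internal in exactly one tree (peak-to-average 1.0).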


    Future Directions

    • For conjunctive query routing

      • Query selectivity estimation

    • Scoring in distributed systems

      • E.g., IDF is inherently global

    • Need an analytical cost model to better understand parameters for XML access methods in the design space


    Summary

    • A formalized approach to full-text queries for large-scale systems

      • Efficiency

        • Relational-style optimizations of XFT algebraic plans

        • Universal scoring

          • properties of scoring functions for scoring consistency

      • Distributed computation

    • Prototype (under construction) 


