    Presentation Transcript
    1. Flexible and Efficient XML Search with Complex Full-Text Predicates • Sihem Amer-Yahia (AT&T Labs Research → Yahoo! Research) • Emiran Curtmola (University of California San Diego) • Alin Deutsch (University of California San Diego)

    2. Introduction • Need for complex full-text predicates beyond simple keyword search • Library of Congress (LoC) • Biomedical data • ACM, IEEE publications • INEX data collection • Wikipedia XML data set • Venue: SIGMOD, June 2006

    3. XML real fragment from LoC • Source: http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml • [Figure: XML tree of the bill. Element names: bill, legis-session, congress-info, legis, legis-desc, nbr, sponsors, legis-body, action, action-desc, committee-name. Text fragments (order and segmentation not reliably recoverable from the figure): “HR2739”; “109th”; “House of Representatives”; “Congress on education and workforce, comments to appropriate services.”; “Jefferson and services …”; “Current chamber on workforce and services. Committees on education are headed by Jefferson”; “Mr Column and co-sponsors Mrs Miller and Mrs Jones. Others include Jefferson”; “on May 2, 2004”; “Joe Jefferson”; “introduced the following bill. The bill was reintroduced later and was referred to the committee”; “on education and workforce sponsored by Joe Jefferson”]

    4. Query with complex FT predicates • Document fragments (nodes) that contain the keywords “Jefferson” and “education” and satisfy the predicates: • within a window of 10 words • with “Jefferson” ordered before “education”
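
The two predicates on this slide can be illustrated on a single piece of text. This is a minimal sketch, not the authors' implementation: it assumes "window of 10 words" means that the matched occurrences span at most 10 consecutive words, and the helper names `positions` and `satisfies` as well as sentence `a` are made up for illustration (sentence `b` is taken from the LoC fragment).

```python
import re
from itertools import product

def positions(tokens, keyword):
    """0-based word offsets at which keyword occurs (case-insensitive)."""
    return [i for i, t in enumerate(tokens) if t.lower() == keyword.lower()]

def satisfies(text, first, second, window):
    """True if some occurrence of `first` precedes some occurrence of `second`
    and the two occurrences span at most `window` words (assumed semantics)."""
    tokens = re.findall(r"\w+", text)
    for p, q in product(positions(tokens, first), positions(tokens, second)):
        ordered = p < q
        in_window = abs(q - p) + 1 <= window
        if ordered and in_window:
            return True
    return False

a = "Jefferson chairs the committee on education"      # hypothetical sentence
b = "Committees on education are headed by Jefferson"  # text from the LoC fragment

print(satisfies(a, "Jefferson", "education", 10))  # True
print(satisfies(b, "Jefferson", "education", 10))  # False: "education" comes first
```

Sentence `b` contains both keywords within 10 words but fails the ordering predicate, which is the kind of distinction plain conjunctive keyword search cannot express.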

    5-6. Example: LoC document • [Figure: the LoC document tree from slide 3, repeated on two slides]

    7. Existing languages • Many XML full-text search languages • expressive power, semantics, scores [BAS-06] • XQFT-class: W3C’s XQuery Full-Text (XQFT), NEXI, XIRQL, JuruXML, XSearch, XRank, XKSearch, Schema Free XQuery • Efficient query evaluation limited to: • Conjunctive keyword search (no predicates) • Full-text predicates in isolation • Need for a universal optimization framework • Guarantee the universality of the solution

    8. Contributions • Formal semantics for XQFT-class • Unified framework • Capture family of tf*idf scoring methods • Structure-aware algorithms to efficiently evaluate XQFT-class languages • XFT full-text algebra • Enable new optimizations inspired by relational rewritings

    9. Talk Outline • Motivation & Contributions • Formalization of XML full-text search • Efficient evaluation • Experiments • Conclusion

    10. Formalization: design goals • Capture existing full-text languages • Language semantics in terms of: • keyword patterns • pattern matches • predicates evaluated through matches • Manipulate tuples • enable relational query evaluation and rewritings

    11. Formalization: patterns • Pattern = tuple of simultaneously matching keywords • Query expression: “Jefferson” and “education” • within a window of 10 words • with “Jefferson” ordered before “education”

    12. Formalization: patterns • Formalization specifies: • patterns ← conjunction of keywords • set of patterns ← disjunction of keywords • exclusion patterns ← negation of keywords • No matches in the document
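
One way to picture the mapping from Boolean keyword connectives to patterns is sketched below. The representation is an assumption made for illustration only: a pattern is modelled as a pair (required keywords, excluded keywords), and the function names `keyword`, `ft_and`, `ft_or` and `ft_not` are hypothetical, not taken from the talk.

```python
from itertools import product

def keyword(k):
    """A single keyword yields one pattern that requires k."""
    return {(frozenset([k]), frozenset())}

def ft_and(p, q):
    """Conjunction: merge every pattern of p with every pattern of q."""
    return {(i1 | i2, e1 | e2) for (i1, e1), (i2, e2) in product(p, q)}

def ft_or(p, q):
    """Disjunction: the result is the union of the two pattern sets."""
    return p | q

def ft_not(k):
    """Negated keyword: an exclusion pattern, i.e. k must have no match."""
    return {(frozenset(), frozenset([k]))}

# ("Jefferson" or "Miller") and "education" and not "workforce"
query = ft_and(
    ft_or(keyword("Jefferson"), keyword("Miller")),
    ft_and(keyword("education"), ft_not("workforce")),
)

for required, excluded in sorted(query, key=lambda p: sorted(p[0])):
    print(sorted(required), "excluding", sorted(excluded))
```

Under this modelling the disjunction produces two patterns, each a conjunction of required keywords carrying the same exclusion.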

    13-16. Formalization: matches • [Figure: the LoC document tree from slide 3, repeated across four slides; any annotations on the figure are not recoverable from the transcript]

    17. Formalization: matching tables • A matching table is a nested relation that records, for each node in the document and each pattern in the query, the set of matches
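
A matching table for a single keyword can be sketched as a mapping from node identifier to a set of matches. Everything concrete here is an assumption for illustration: the mini-document `TEXT`, the assignment of text to the dotted node identifiers, and the choice to represent a match as a tuple of global word positions.

```python
import re
from collections import defaultdict

# Hypothetical mini-document: (node id, text directly under that node), in document order.
TEXT = [
    ("1.1.1", "Jefferson and services"),
    ("1.1.3", "Committees on education are headed by Jefferson"),
    ("1.2.2", "Mrs Miller and Mrs Jones"),
]

def ancestors_or_self(node_id):
    """'1.1.3' -> ['1', '1.1', '1.1.3']"""
    parts = node_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def get(keyword):
    """Matching table R_k: node id -> set of matches.
    Each match is a 1-tuple holding the global word position of one occurrence
    of `keyword` somewhere in the subtree of that node."""
    table = defaultdict(set)
    pos = 0
    for node_id, text in TEXT:
        for word in re.findall(r"\w+", text):
            if word == keyword:
                for n in ancestors_or_self(node_id):
                    table[n].add((pos,))
            pos += 1
    return dict(table)

for node, matches in sorted(get("Jefferson").items()):
    print(node, sorted(matches))
```

Every ancestor of a text node repeats the matches found below it, so the table is nested (one set per node) and redundant by construction.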

    18. Formalization: matching tables • [Figure: the LoC document tree from slide 3; remaining figure content not recoverable from the transcript]

    19. XFT Algebra • Similar to relational algebra • Manipulate matching tables • Leverage relational query evaluation + optimization techniques • XFT operators: • construct matching table Rk for each keyword k: get(k) • manipulate matching tables: R1 or R2, R1 and R2, R1 minus R2, σtimes(R), σordered(R), σwindow(R), σdistance(R)
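
The operators listed on this slide can be sketched over matching tables represented as Python dictionaries (node id to set of position tuples). This is an illustrative reading under stated assumptions, not the paper's definitions: the two input tables are hand-written and hypothetical, `ft_or` simply unions match sets, and `ft_minus` is taken to keep nodes that have no entry in the second table.

```python
def ft_and(r1, r2):
    """R1 and R2: equi-join on node id; matches are concatenated pairwise."""
    return {n: {a + b for a in r1[n] for b in r2[n]} for n in r1.keys() & r2.keys()}

def ft_or(r1, r2):
    """R1 or R2: a node qualifies if it appears in either table."""
    return {n: r1.get(n, set()) | r2.get(n, set()) for n in r1.keys() | r2.keys()}

def ft_minus(r1, r2):
    """R1 minus R2: keep nodes of R1 with no entry in R2 (assumed semantics)."""
    return {n: ms for n, ms in r1.items() if n not in r2}

def select(pred, r):
    """Selection: filter each node's matches, dropping nodes left with none."""
    out = {n: {m for m in ms if pred(m)} for n, ms in r.items()}
    return {n: ms for n, ms in out.items() if ms}

def ordered(m):
    """Positions appear in the same order as the keywords in the query."""
    return all(a < b for a, b in zip(m, m[1:]))

def window(w):
    """All positions of a match fit in a span of at most w words."""
    return lambda m: max(m) - min(m) + 1 <= w

# Hypothetical matching tables for "Jefferson" and "education".
R_jefferson = {"1": {(0,), (9,)}, "1.1": {(0,), (9,)}, "1.1.1": {(0,)}, "1.1.3": {(9,)}}
R_education = {"1": {(5,)}, "1.1": {(5,)}, "1.1.3": {(5,)}}

result = select(window(10), select(ordered, ft_and(R_jefferson, R_education)))
print(result)
```

Node 1.1.3 drops out because its only combined match has "education" before "Jefferson"; nodes 1 and 1.1 keep the match (0, 5).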

    20. XFT Algebra • Query: nodes that contain the keywords “Jefferson” and “education” within a window of 10 words, with “Jefferson” ordered before “education” • [Figure: algebra expression for the query; only the symbol × survives in the transcript] • Benefit: equivalent query rewritings
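
The benefit of equivalent rewritings can be seen on the classic relational rule of pushing a selection below a join, which the experiments later refer to as pushing selections. The sketch below is an illustration under assumptions: the tables are hypothetical, matches are tuples of word positions, and the predicate `near_and_ordered` only inspects the first two columns, which is what makes the rewriting valid.

```python
def ft_and(r1, r2):
    """Equi-join on node id; matches are concatenated pairwise."""
    return {n: {a + b for a in r1[n] for b in r2[n]} for n in r1.keys() & r2.keys()}

def select(pred, r):
    """Filter each node's matches, dropping nodes left with none."""
    out = {n: {m for m in ms if pred(m)} for n, ms in r.items()}
    return {n: ms for n, ms in out.items() if ms}

def size(r):
    """Total number of matches stored in a table."""
    return sum(len(ms) for ms in r.values())

# Hypothetical tables: word positions of three keywords under a single node "1".
RJ = {"1": {(0,), (9,), (40,)}}   # "Jefferson"
RE = {"1": {(5,), (60,)}}         # "education"
RW = {"1": {(7,), (20,), (62,)}}  # "workforce"

def near_and_ordered(m):
    """Columns 0 and 1 only: Jefferson before education, within 10 words."""
    return m[0] < m[1] and m[1] - m[0] + 1 <= 10

# Plan 1: join everything, then select.
joined_all = ft_and(ft_and(RJ, RE), RW)
late = select(near_and_ordered, joined_all)

# Plan 2: select right after the first join, then join the third table.
pushed = select(near_and_ordered, ft_and(RJ, RE))
early = ft_and(pushed, RW)

print(late == early)                  # True
print(size(joined_all), size(pushed))  # 18 1
```

Both plans return the same table, but the second one carries a single match into the last join instead of materializing 18 combined matches first.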

    21. Talk Outline • Motivation & Contributions • Formalization of XML full-text search • Efficient evaluation • Experiments • Conclusion

    22. Query evaluation: AllNodes • Straightforward implementation of the XFT algebra • Each node is considered separately • Each tuple is self-contained • Relational-style evaluation • Joins → equi-joins • Predicates → selections on set of matches

    23. Example: LoC document • [Figure: the LoC document tree from slide 3, with nodes labeled by hierarchical identifiers: 1, 1.1, 1.1.1, 1.1.2, 1.1.3, 1.2, 1.2.2, 1.2.2.2, 1.3, 1.3.1, 1.3.1.2, 1.3.2]

    24-26. [Figures: three slides of which only the symbol × survives in the transcript] • Predicate operates one tuple at a time

    27. Example: LoC document • [Figure: the same labeled LoC document tree as on slide 23]

    28. Query evaluation: SCU • AllNodes = straightforward algorithm • Reduce size of intermediate results • structural relationships between nodes • avoid redundant match representation • SCU = Smallest Containing Unit

    29. Matching tables → SCU tables • SCU tables capture the same information
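
The claim that an SCU table captures the same information can be made concrete by expanding one back into a full matching table. This sketch assumes, as the name Smallest Containing Unit suggests, that an SCU table stores each match only at the smallest node containing it and that ancestors inherit the matches of their descendants; the table contents and the function name `expand` are hypothetical.

```python
def ancestors_or_self(node_id):
    """'1.1.3' -> ['1', '1.1', '1.1.3']"""
    parts = node_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def expand(scu_table):
    """Rebuild the all-nodes matching table by copying each match to every ancestor."""
    full = {}
    for node, matches in scu_table.items():
        for n in ancestors_or_self(node):
            full.setdefault(n, set()).update(matches)
    return full

def size(table):
    """Total number of matches stored in a table."""
    return sum(len(ms) for ms in table.values())

# Hypothetical SCU table for one keyword: each match is stored exactly once.
scu = {"1.1.1": {(0,)}, "1.1.3": {(9,)}}
full = expand(scu)

print(size(scu), size(full))  # 2 6
```

The expanded table holds three times as many entries for the same two occurrences, which is the redundancy the SCU representation avoids.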

    30-32. [Figures: three slides of which only the symbol × survives in the transcript] • Equi-join does not work • Need to compute the lowest common ancestor (LCA) • Example: 1.1 is the LCA of 1.1.3 and 1.1.1
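
With dotted node identifiers like the ones in the example, the LCA is the longest common prefix of the two identifiers, and a join over SCU-style tables places each combined match at the LCA of the nodes it came from. The sketch below is a naive pairwise version written for illustration; it is not the stack-based single-scan technique the talk describes, and the tables and the name `scu_and` are hypothetical.

```python
def lca(a, b):
    """Lowest common ancestor of two dotted node ids: their longest common prefix."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)

def scu_and(r1, r2):
    """Naive quadratic join of two SCU-style tables.
    Each combined match is stored at the LCA of the two contributing nodes."""
    out = {}
    for n1, ms1 in r1.items():
        for n2, ms2 in r2.items():
            node = lca(n1, n2)
            out.setdefault(node, set()).update(a + b for a in ms1 for b in ms2)
    return out

print(lca("1.1.3", "1.1.1"))  # 1.1

# Hypothetical SCU tables for "Jefferson" and "education".
R_jefferson = {"1.1.1": {(0,)}, "1.1.3": {(9,)}}
R_education = {"1.1.3": {(5,)}}
print(scu_and(R_jefferson, R_education))
```

An equi-join on node id would only find the pair at node 1.1.3 and miss the match (0, 5), whose two keywords sit in different children of node 1.1.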

    33-36. [Figures: four slides of which only the symbol × survives in the transcript] • Postorder • Stack supports single scan

    37. SCU summary • Equivalent to AllNodes • Structure-awareness reduces size of intermediate results • Increased computation cost: • Compute LCAs of nodes • Match propagation • Stack-based techniques

    38. Related work on LCA for XML • LCA for conjunctive keyword search • XRank [GSBS-03] • Schema-free XQuery [LYJ-04] • XKSearch [XP-05] • Shortcomings • No postprocessing, not compositional • Input in document order • Output in postorder traversal • Support for complex predicates is not straightforward

    39. Talk Outline • Motivation & Contributions • Formalization of XML full-text search • Efficient evaluation • Experiments • Conclusion

    40. Experimental goals • AllNodes vs. SCU • AllNodes: redundant representation • SCU: smaller sizes, more computation • SCU Overhead • Stack • Match propagation • Benefit of Rewritings • Relational-style rewritings

    41. Experimental setup • Centrino 1.8GHz with 1GB of RAM • XMark-generated datasets • Sizes range from 50 MB to 300 MB

    42. Experiments: AllNodes vs. SCU • [Figure: varying document size (q1, query without predicates)] • q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)

    43. Experiments: SCU Overhead • Queries • q4 = σwindow>1(“See”, “internationally”, “description”, “charges”, “ship”) (q1) • q5 = σwindow>90000000(“See”, “internationally”, “description”, “charges”, “ship”) (q1) • Recall that • q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)

    44. Experiments: SCU Overhead • [Figure: varying query predicates (not pushed)] • q4 always true → no match propagation, just the stack overhead • q5 always false → propagate all matches

    45. Experiments: Benefit of Rewritings • Queries • q2 = σorderedE(“See”, “internationally”, “description”, “charges”, “ship”) (q1) • q3 = push selections in q2 • Recall that • q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)

    46. Experiments: Benefit of Rewritings • [Figure: varying document size (query with predicates)] • 40% improvement for relational-like query rewritings

    47. Conclusion • A unified logical framework for XML full-text search languages • Algebra admits: • Efficient algorithms for operator evaluation • Rewritings of queries into more efficient forms • Facilitates joint optimization of XML queries on both structure and text search • Future work • Score-aware logical framework

    48. Thank you!