
Flexible and Efficient XML Search with Complex Full-Text Predicates

Sihem Amer-Yahia (AT&T Labs Research → Yahoo! Research), Emiran Curtmola (University of California San Diego), Alin Deutsch (University of California San Diego).


Presentation Transcript


  1. Flexible and Efficient XML Search with Complex Full-Text Predicates • Sihem Amer-Yahia (AT&T Labs Research → Yahoo! Research) • Emiran Curtmola (University of California San Diego) • Alin Deutsch (University of California San Diego)

  2. Introduction • Need for complex full-text predicates beyond simple keyword search • Library of Congress (LoC) • Biomedical data • ACM, IEEE publications • INEX data collection • Wikipedia XML data set SIGMOD, June 2006

  3. Real XML fragment from LoC: http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml • [Figure: XML tree of a bill. Element names: bill, congress-info, legis-session, legis, legis-desc, nbr, sponsors, legis-body, action, action-desc, committee-name. Text values include “HR2739”, “109th”, “House of Representatives”, “May 2, 2004”, “Joe Jefferson”, and passages mentioning “Jefferson”, “education”, “workforce”, and “services”.]

  4. Query with complex FT predicates • Document fragments (nodes) that contain the keywords “Jefferson” and “education” and satisfy the predicates • within a window of 10 words, • with “Jefferson” ordered before “education”
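One way to read the two predicates on this slide is as filters over tuples of word positions. The sketch below is illustrative only: the function names, the 0-based positions, the two sample sentences, and the convention that a window of n words means max − min + 1 ≤ n are all assumptions, not definitions taken from the talk.

```python
from itertools import product

def positions(tokens, keyword):
    """Return the 0-based word positions at which keyword occurs."""
    return [i for i, tok in enumerate(tokens) if tok == keyword]

def window(match, n):
    """True if all positions of the match fit in a span of at most n words
    (assumed convention: span = max - min + 1)."""
    return max(match) - min(match) + 1 <= n

def ordered(match):
    """True if the positions appear in the same order as the query keywords."""
    return all(a < b for a, b in zip(match, match[1:]))

def evaluate(tokens, keywords, n):
    """Return the matches (one position per keyword) that pass both predicates."""
    candidates = product(*(positions(tokens, k) for k in keywords))
    return [m for m in candidates if window(m, n) and ordered(m)]

# "education" precedes "Jefferson" here, so the ordered predicate rejects it.
rejected = evaluate("Committees on education are headed by Jefferson".split(),
                    ["Jefferson", "education"], 10)

# Hypothetical sentence in which "Jefferson" comes first, within 10 words.
accepted = evaluate("Jefferson introduced a bill on education".split(),
                    ["Jefferson", "education"], 10)
```

Under these assumptions, `rejected` is empty and `accepted` holds the single match `(0, 5)`.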

  5. Example: LoC document • [Figure: the LoC bill tree from slide 3]

  6. Example: LoC document • [Figure: the LoC bill tree from slide 3]

  7. Existing languages • Many XML full-text search languages • expressive power, semantics, scores [BAS-06] • XQFT-class: W3C’s XQuery Full-Text (XQFT), NEXI, XIRQL, JuruXML, XSearch, XRank, XKSearch, Schema Free XQuery • Efficient query evaluation limited to • Conjunctive keyword search (no predicates) • Full-text predicates in isolation • Need for a universal optimization framework • Guarantee the universality of the solution

  8. Contributions • Formal semantics for XQFT-class • Unified framework • Capture family of tf*idf scoring methods • Structure-aware algorithms to efficiently evaluate XQFT-class languages • XFT full-text algebra • Enable new optimizations inspired by relational rewritings

  9. Talk Outline • Motivation & Contributions • Formalization of XML full-text search • Efficient evaluation • Experiments • Conclusion

  10. Formalization: design goals • Capture existing full-text languages • Language semantics in terms of • keyword patterns • pattern matches • predicates evaluated through matches • Manipulate tuples • enable relational query evaluation and rewritings

  11. Formalization: patterns • Pattern = tuple of simultaneously matching keywords • Query expression: “Jefferson” and “education” • within a window of 10 words, • with “Jefferson” ordered before “education”

  12. Formalization: patterns • Formalization specifies • patterns ← conjunction of keywords • set of patterns ← disjunction of keywords • exclusion patterns ← negation of keywords • an exclusion pattern must have no matches in the document
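The mapping from Boolean keyword queries to patterns can be sketched as a small normalization step. This is only one possible reading of the slide: the nested-tuple query syntax, the function name `patterns`, the use of frozensets instead of tuples, and the restriction of negation to single keywords are assumptions made for illustration.

```python
def patterns(query):
    """Translate a Boolean keyword query into a set of patterns.

    A query is a keyword string, ("not", keyword), ("and", q1, q2) or
    ("or", q1, q2). Each resulting pattern is a pair
    (keywords that must match together, excluded keywords).
    """
    if isinstance(query, str):
        return {(frozenset([query]), frozenset())}
    op = query[0]
    if op == "not":
        keyword = query[1]
        if not isinstance(keyword, str):
            raise ValueError("this sketch only negates single keywords")
        return {(frozenset(), frozenset([keyword]))}
    left, right = patterns(query[1]), patterns(query[2])
    if op == "or":
        # Disjunction: either side's patterns may match.
        return left | right
    if op == "and":
        # Conjunction: keywords of both sides must match simultaneously.
        return {(p1 | p2, e1 | e2) for p1, e1 in left for p2, e2 in right}
    raise ValueError(f"unknown operator: {op}")

disjunctive = patterns(("and", "Jefferson", ("or", "education", "workforce")))
with_exclusion = patterns(("and", "Jefferson", ("not", "services")))
```

Here `disjunctive` contains two patterns, one per alternative keyword, and `with_exclusion` contains one pattern that requires “Jefferson” and excludes “services”.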

  13. Formalization: matches • [Figure: the LoC bill tree from slide 3]

  14. Formalization: matches • [Figure: the LoC bill tree from slide 3]

  15. Formalization: matches • [Figure: the LoC bill tree from slide 3]

  16. Formalization: matches • [Figure: the LoC bill tree from slide 3]

  17. Formalization: matching tables • A matching table is a nested relation • For each node in the document • and each pattern in the query • it records the set of matches
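A matching table can be pictured as a mapping from nodes to sets of matches for one pattern. The sketch below builds such a table for a toy document; the document contents, the dotted node identifiers, the global word positions, and all function names are hypothetical and chosen only to illustrate the nested-relation idea.

```python
from itertools import product

# Hypothetical toy document: node id -> (global position, word) pairs
# for the text directly contained in that node.
doc = {
    "1.1.1": [(0, "Jefferson")],
    "1.1.3": [(1, "education"), (2, "workforce")],
    "1.2.2": [(3, "Jefferson")],
}

def ancestors_or_self(node_id):
    """'1.1.3' -> ['1', '1.1', '1.1.3']"""
    parts = node_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def matching_table(doc, pattern):
    """Map every node to its set of matches for the pattern (a keyword tuple).

    A match is a tuple with one position per keyword, all located in the
    subtree of the node. Nodes without matches are left out.
    """
    per_node = {}
    for node, words in doc.items():
        for anc in ancestors_or_self(node):
            slot = per_node.setdefault(anc, {k: [] for k in pattern})
            for pos, word in words:
                if word in slot:
                    slot[word].append(pos)
    table = {}
    for node, slot in per_node.items():
        matches = set(product(*(slot[k] for k in pattern)))
        if matches:
            table[node] = matches
    return table

table = matching_table(doc, ("Jefferson", "education"))
```

For this toy input, node `1.1` gets the match `(0, 1)` and the root `1` gets both `(0, 1)` and `(3, 1)`, which shows how the same match is repeated at every ancestor.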

  18. Formalization: matching tables • [Figure: the LoC bill tree from slide 3]

  19. XFT Algebra • Similar to relational algebra • Manipulate matching tables • Leverage relational query evaluation + optimization techniques • XFT operators • construct matching table Rk for each keyword k: get(k) • manipulate matching tables: R1 or R2, R1 and R2, R1 minus R2, σtimes(R), σordered(R), σwindow(R), σdistance(R)
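The operators listed above can be approximated over dictionaries that map a node to its set of matches. This is a rough sketch rather than the algebra as defined by the authors: the `index` contents, the trailing underscores in `and_` and `or_`, the per-node union used for `or_`, and the window convention (max − min + 1 ≤ n) are assumptions.

```python
# Hypothetical inverted index: keyword -> node id -> positions in its subtree.
index = {
    "Jefferson": {"1": [0, 9], "1.1": [0], "1.2": [9]},
    "education": {"1": [4], "1.1": [4]},
}

def get(keyword):
    """R_k: node -> set of 1-tuples, one per occurrence of the keyword."""
    return {n: {(p,) for p in ps} for n, ps in index.get(keyword, {}).items()}

def and_(r1, r2):
    """Equi-join on the node; matches are concatenated pairwise."""
    return {n: {m1 + m2 for m1 in r1[n] for m2 in r2[n]}
            for n in r1.keys() & r2.keys()}

def or_(r1, r2):
    """Per-node union of matches."""
    return {n: r1.get(n, set()) | r2.get(n, set()) for n in r1.keys() | r2.keys()}

def minus(r1, r2):
    """Keep nodes of r1 that have no entry in r2."""
    return {n: ms for n, ms in r1.items() if n not in r2}

def select(pred, r):
    """Filter matches one tuple at a time; drop nodes left without matches."""
    out = {}
    for n, ms in r.items():
        kept = {m for m in ms if pred(m)}
        if kept:
            out[n] = kept
    return out

def ordered(m):
    return all(a < b for a, b in zip(m, m[1:]))

def window10(m):
    return max(m) - min(m) + 1 <= 10

joined = and_(get("Jefferson"), get("education"))
result = select(ordered, select(window10, joined))
only_jefferson = minus(get("Jefferson"), get("education"))
```

With this index, the match `(9, 4)` at the root passes the window test but fails the ordering test, so `result` keeps `(0, 4)` at nodes `1` and `1.1`.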

  20. XFT Algebra • Query: nodes that contain the keywords “Jefferson” and “education” within a window of 10 words, with “Jefferson” ordered before “education” • [Figure: algebra expression with ×] • Benefit: equivalent query rewritings

  21. Talk Outline • Motivation & Contributions • Formalization of XML full-text search • Efficient evaluation • Experiments • Conclusion

  22. Query evaluation: AllNodes • Straightforward implementation of the XFT algebra • Each node is considered separately • Each tuple is self-contained • Relational-style evaluation • Joins → equi-joins • Predicates → selections on set of matches

  23. Example: LoC document • [Figure: the LoC bill tree with nodes labeled by hierarchical IDs 1, 1.1, 1.1.1, 1.1.2, 1.1.3, 1.2, 1.2.2, 1.2.2.2, 1.3, 1.3.1, 1.3.1.2, 1.3.2]

  24. [Figure: × operator example]

  25. [Figure: × operator example]

  26. [Figure: × operator example] • Predicate operates one tuple at a time

  27. Example: LoC document • [Figure: the LoC bill tree with nodes labeled by hierarchical IDs 1, 1.1, 1.1.1, 1.1.2, 1.1.3, 1.2, 1.2.2, 1.2.2.2, 1.3, 1.3.1, 1.3.1.2, 1.3.2]

  28. Query evaluation: SCU • AllNodes = straightforward algorithm • Reduce size of intermediate results • structural relationships between nodes • avoid redundant match representation • SCU = Smallest Containing Unit

  29. Matching tables → SCU tables • SCU tables capture the same information
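The claim that SCU tables capture the same information can be illustrated by expanding a compact table back into a full one. The sketch assumes that an SCU table stores each match only at the smallest node that contains it and that ancestors are recoverable from dotted node identifiers; both assumptions, like the names below, are made for illustration only.

```python
def ancestors_or_self(node_id):
    """'1.1.1' -> ['1', '1.1', '1.1.1']"""
    parts = node_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def scu_to_matching(scu):
    """Rebuild the all-nodes matching table by copying each match
    to every ancestor of its smallest containing node."""
    table = {}
    for node, matches in scu.items():
        for anc in ancestors_or_self(node):
            table.setdefault(anc, set()).update(matches)
    return table

def size(table):
    """Total number of stored matches."""
    return sum(len(ms) for ms in table.values())

# Hypothetical SCU table for one keyword: two occurrences, stored once each.
scu = {"1.1.1": {(0,)}, "1.2.2": {(9,)}}
full = scu_to_matching(scu)
```

In this example the SCU table stores 2 matches while the expanded table stores 6, which is the redundancy that the previous slide aims to avoid.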

  30. [Figure: × operator example]

  31. [Figure: × operator example] • Equi-join does not work • Need to compute LCA

  32. [Figure: × operator example] • 1.1 is the LCA of 1.1.3 and 1.1.1
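With dotted hierarchical identifiers, the lowest common ancestor (LCA) of two nodes is their longest common prefix. The sketch below shows that computation and a naive nested-loop join that places each combined match at the LCA of its two source nodes. The function names, the sample tables, and the nested-loop strategy are assumptions for illustration and do not reproduce the stack-based algorithm mentioned later in the talk.

```python
def lca(a, b):
    """Longest common prefix of two dotted node ids from the same document."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)

def scu_and(r1, r2):
    """Join two SCU-style tables: a combined match is stored at the LCA
    of the nodes that hold its two halves."""
    out = {}
    for n1, ms1 in r1.items():
        for n2, ms2 in r2.items():
            node = lca(n1, n2)
            out.setdefault(node, set()).update(m1 + m2 for m1 in ms1 for m2 in ms2)
    return out

# Hypothetical SCU tables for two keywords.
jefferson = {"1.1.1": {(0,)}, "1.2.2": {(9,)}}
education = {"1.1.3": {(4,)}}
joined = scu_and(jefferson, education)
```

An equi-join on the node column would find nothing here, because no node id occurs in both tables; the LCA-based join yields `(0, 4)` at `1.1` and `(9, 4)` at the root.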

  33. [Figure: × operator example]

  34. [Figure: × operator example]

  35. [Figure: × operator example]

  36. [Figure: × operator example] • Postorder • Stack supports single scan
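A stack that mirrors the current root-to-node path is a common way to process a tree in one pass and emit nodes in postorder. The sketch below uses that idea to propagate matches to ancestors in a single scan. It is a simplified illustration under stated assumptions (input sorted in document order, node ids given as integer tuples, hypothetical function name), not the algorithm presented in the talk.

```python
def propagate_postorder(entries):
    """entries: list of (node id tuple, set of positions) in document order.

    Returns (node, matches) pairs in postorder, where each node's matches
    include those of all its descendants. The input is scanned once.
    """
    stack = []   # current root-to-node path: [node id tuple, set of matches]
    output = []

    def pop():
        node, matches = stack.pop()
        output.append((node, matches))
        if stack:
            stack[-1][1] |= matches   # hand the matches to the parent

    for node, matches in entries:
        # Close every open node that is not an ancestor-or-self of this node.
        while stack and stack[-1][0] != node[:len(stack[-1][0])]:
            pop()
        # Open the missing part of the path down to this node.
        for depth in range(len(stack) + 1, len(node) + 1):
            stack.append([node[:depth], set()])
        stack[-1][1] |= matches
    while stack:
        pop()
    return output

entries = [((1, 1, 1), {3}), ((1, 1, 3), {7}), ((1, 2, 2, 2), {20})]
result = propagate_postorder(entries)
```

Each node is pushed and popped exactly once, so the work is linear in the number of nodes on the visited paths, and a node is emitted only after all of its descendants.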

  37. SCU summary • Equivalent to AllNodes • Structure-awareness reduces size of intermediate results • Increase computation cost • Compute LCAs of nodes • Match propagation • Stack-based techniques

  38. Related work on LCA for XML • LCA for conjunctive keyword search • XRank [GSBS-03] • Schema-free XQuery [LYJ-04] • XKSearch [XP-05] • Shortcomings • No postprocessing, not compositional • Input in document order • Output postorder traversal • Support for complex predicates is not straightforward

  39. Talk Outline • Motivation & Contributions • Formalization of XML full-text search • Efficient evaluation • Experiments • Conclusion

  40. Experimental goals • AllNodes vs. SCU • AllNodes: redundant representation • SCU: smaller sizes, more computation • SCU Overhead • Stack • Match propagation • Benefit of Rewritings • Relational-style rewritings

  41. Experimental setup • Centrino 1.8GHz with 1GB of RAM • XMark generated datasets • Sizes range from 50 MB to 300 MB

  42. Experiments: AllNodes vs. SCU • [Plot: varying document size (q1, query without predicates)] • q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)

  43. Experiments: SCU Overhead • Queries • q4 = σwindow>1(“See”, “internationally”, “description”, “charges”, “ship”) (q1) • q5 = σwindow>90000000(“See”, “internationally”, “description”, “charges”, “ship”) (q1) • Recall that • q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)

  44. Experiments: SCU Overhead • [Plot: varying query predicates (not pushed)] • q4 always true → no match propagation, just the stack overhead • q5 always false → propagate all matches

  45. Experiments: Benefit of Rewritings • Queries • q2 = σorderedE(“See”, “internationally”, “description”, “charges”, “ship”) (q1) • q3 = push selections in q2 • Recall that • q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)
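Pushing a selection below a join, as in q3, can be illustrated with an ordering predicate: a tuple can only be ordered if each of its prefixes is ordered, so filtering the partial join early does not change the final answer. The tables, helper names, and three-keyword query below are hypothetical and much smaller than the five-keyword queries used in the experiments.

```python
def and_(r1, r2):
    """Equi-join on the node; matches are concatenated pairwise."""
    return {n: {m1 + m2 for m1 in r1[n] for m2 in r2[n]}
            for n in r1.keys() & r2.keys()}

def select(pred, r):
    """Filter matches; drop nodes left without matches."""
    out = {}
    for n, ms in r.items():
        kept = {m for m in ms if pred(m)}
        if kept:
            out[n] = kept
    return out

def ordered(m):
    return all(a < b for a, b in zip(m, m[1:]))

def size(r):
    return sum(len(ms) for ms in r.values())

# Hypothetical matching tables for three keywords at a single node.
A = {"1": {(5,), (1,)}}
B = {"1": {(3,)}}
C = {"1": {(4,), (8,)}}

# Selection applied once, after all joins.
unpushed_inner = and_(A, B)
late = select(ordered, and_(unpushed_inner, C))

# Selection pushed below the second join.
pushed_inner = select(ordered, and_(A, B))
early = select(ordered, and_(pushed_inner, C))
```

Both plans return the same matches, but the pushed plan carries one intermediate tuple into the second join instead of two.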

  46. Experiments: Benefit of Rewritings • [Plot: varying document size (query with predicates)] • 40% improvement for relational-like query rewritings

  47. Conclusion • A unified logical framework for XML full-text search languages • Algebra admits • Efficient algorithms for operator evaluation • Rewritings of queries into more efficient forms • Facilitate joint optimization of XML queries on both structure and text search • Future work • Score-aware logical framework

  48. Thank you!
