TopX 2.0 at the INEX 2009 Ad-hoc and Efficiency tracks

Ablimit Aji, Emory University. Martin Theobald, Max Planck Institute for Informatics. Ralf Schenkel, Saarland University.


Presentation Transcript


  1. TopX 2.0 at the INEX 2009 Ad-hoc and Efficiency tracks. Ablimit Aji (Emory University), Martin Theobald (Max Planck Institute for Informatics), Ralf Schenkel (Saarland University)

  2. Outline
     • Query rewriting
     • Data & scoring model
     • Distributed indexing (new for 2009!)
     • Query processing
     • Results
       • Ad-hoc
       • Efficiency
     [Side labels on the slide: Ad-hoc Focused; Efficiency Focused]

  3. Query Rewriting I (NEXI/XPath-FT)
     • CAS queries
       • //article//(sec|p)[(about(.//header, “Yoga Lessons”) or about(.//title, +Yoga -history)) and about(.//figure, exercise)]
     • Query DAGs
       • Tag-term pairs as leaves
       • Navigational tags as support elements
     • Discard all Boolean constraints; “andish” mode for both CO and CAS
     [Figure: query DAG. Navigational nodes: article, sec, p. Edge labels: //, self. Leaves: header$yoga, header$lesson, title$yoga, figure$exercise.]

  4. Query Rewriting II (NEXI)
     • CO queries
       • “Yoga Lessons” +Yoga -history exercise
       • //*[about(., “Yoga Lessons” +Yoga -history exercise)]
     • Virtual * tag, fully precomputed and materialized in inverted lists as *-term pairs
     • Can be generalized to specific tag classes (e.g. <article|sec|p>)
     [Figure: leaves *$yoga, *$lesson, *$exercise, attached by self edges.]
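
The CO rewriting on slide 4 can be illustrated with a minimal Python sketch. It is not TopX code: the function name `co_to_tag_term_pairs`, the regular expression, and the naive plural stripping that stands in for a real stemmer are all assumptions made for illustration. It also assumes straight double quotes around phrases and drops negated terms, which matches the leaves shown on the slide.

```python
import re

def co_to_tag_term_pairs(query, tag="*"):
    """Rewrite a keyword (CO) query into tag$term pairs (hypothetical helper)."""
    pairs = []
    # Each token is an optional +/- prefix followed by a quoted phrase or a bare word.
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]*)"|(\w+))', query):
        if sign == "-":
            continue  # negated terms do not become leaves
        for term in (phrase or word).lower().split():
            # Crude stand-in for stemming (assumption): strip a trailing "s".
            if term.endswith("s") and len(term) > 3:
                term = term[:-1]
            pair = f"{tag}${term}"
            if pair not in pairs:  # keep each pair once, in first-seen order
                pairs.append(pair)
    return pairs

print(co_to_tag_term_pairs('"Yoga Lessons" +Yoga -history exercise'))
# ['*$yoga', '*$lesson', '*$exercise']
```

Passing `tag="sec"` instead of the virtual `*` tag yields pairs such as `sec$yoga`, in the spirit of the tag-class generalization mentioned on the slide.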

  5. Data Model
     • XML trees (no XLink/ID/IDRef)
     • Pre-/post-order ranges for the structure
     • Redundant full-content text nodes

     Example document:
       <article>
         <title>XML Data Management</title>
         <abs>XML management systems vary widely in their expressive power.</abs>
         <sec>
           <title>Native XML Data Bases.</title>
           <par>Native XML data base systems can store schemaless data.</par>
         </sec>
       </article>

     Tree nodes with (pre, post) labels and indexed text:
       • article (1, 6), full content: “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
       • title (2, 1): “xml data manage”
       • abs (3, 2): “xml manage system vary wide expressive power”
       • sec (4, 5), full content: “native xml data base native xml data base system store schemaless data”
       • title (5, 3): “native xml data base”
       • par (6, 4): “native xml data base system store schemaless data”

     Full-content term frequencies: ftf(“xml”, article_1) = 4; ftf(“xml”, sec_4) = 2
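
The labels and frequencies on slide 5 can be reproduced with a short Python sketch using the standard library's `xml.etree.ElementTree`. The helper names `label` and `ftf` are illustrative, and the tokenization is a simplification: it lowercases and splits on word characters without the stemming or stopword removal visible in the slide's text.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<article>
<title>XML Data Management</title>
<abs>XML management systems vary widely in their expressive power.</abs>
<sec>
<title>Native XML Data Bases.</title>
<par>Native XML data base systems can store schemaless data.</par>
</sec>
</article>"""

def label(root):
    """Return {element: (pre, post)} with 1-based pre-/post-order ranks."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1
        pre = counters["pre"]          # rank assigned on entering the node
        for child in elem:
            visit(child)
        counters["post"] += 1          # rank assigned on leaving the node
        labels[elem] = (pre, counters["post"])

    visit(root)
    return labels

def ftf(term, elem):
    """Full-content term frequency: occurrences of term in elem's whole subtree."""
    tokens = re.findall(r"\w+", " ".join(elem.itertext()).lower())
    return tokens.count(term)

root = ET.fromstring(DOC)
labels = label(root)
sec = root.find("sec")
print(labels[root], labels[sec])          # (1, 6) (4, 5)
print(ftf("xml", root), ftf("xml", sec))  # 4 2
```

An element a is an ancestor of an element d exactly when pre(a) < pre(d) and post(a) > post(d), which is why the two ranks suffice to encode the tree structure.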

  6. Scoring Model [TopX @ INEX ’05–’09]
     • XML-specific variant of Okapi BM25 (aka E-BM25, Robertson et al. [INEX ’05])
       • k1 = 2.0, b = 0.75
       • Decay factor for ftf of 0.925
     • Example: author[“gates”] vs. section[“gates”]
     [Labels on the slide's formula: Content Index (Tag-Term Pairs), Element Freq., Element Statistics]
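
The transcript does not preserve the scoring formula from slide 6, so the following Python sketch assumes the textbook Okapi BM25 shape applied per tag-term pair. The function `ebm25` and its parameters (`elem_len`, `avg_len`, `n_tag`, `ef`) are hypothetical names, and the ftf decay factor of 0.925 is deliberately not modeled because the slide does not say how it is applied.

```python
import math

K1, B = 2.0, 0.75  # parameter values quoted on the slide

def ebm25(ftf, elem_len, avg_len, n_tag, ef):
    """BM25-style score of one tag-term pair for one element (assumed form).

    ftf      full-content term frequency of the term in the element
    elem_len length of the element, avg_len average length for that tag
    n_tag    number of elements with this tag
    ef       number of those elements that contain the term
    """
    k = K1 * ((1 - B) + B * elem_len / avg_len)   # length normalization
    tf_part = (K1 + 1) * ftf / (k + ftf)          # saturating tf component
    idf_part = math.log((n_tag - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part

# Same term, same ftf, but statistics kept per tag: a term that is rare
# among author elements scores higher there than among section elements.
author = ebm25(ftf=2, elem_len=100, avg_len=100, n_tag=1000, ef=5)
section = ebm25(ftf=2, elem_len=100, avg_len=100, n_tag=1000, ef=400)
print(author > section)  # True
```

Keeping the element frequency per tag is what lets a match for author[“gates”] be weighted differently from a match for section[“gates”]; the numbers above are made up for the comparison.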

  7. How to create a full CAS index for a large XML collection efficiently?
     • TopX index statistics for Wikipedia 2009 (55 GB XML sources)
     • Go distributed!

  8. Distributed Indexing I
     • Two-level hashing:
       • At query processing time:
         • hash(t_i) → NodeId|FileId|ByteOffset (64-bit dictionary)
       • At indexing time:
         • FileId(t_i) = hash(t_i) mod f
         • NodeId(t_i) = FileId(t_i) mod p
     [Figure: a top-k engine on top of nodes Node 1, Node 2, …, Node p. Node 1 holds File[1] … File[f/p] and Docs[1, …, n/p]; Node 2 holds File[(f/p)+1] … File[2f/p] and Docs[(n/p)+1, …, 2n/p]; Node p holds File[(p-1)(f/p)+1] … File[f] and Docs[(p-1)(n/p)+1, …, n]. Each file stores inverted lists for tag$term pairs.]
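
The two indexing-time formulas on slide 8 can be sketched in Python as follows. The slide does not name the hash function, so the 64-bit BLAKE2b digest used here is an assumption, as are the helper names `term_hash` and `place` and the example values of f and p.

```python
import hashlib

def term_hash(tag_term):
    """64-bit hash of a tag$term key (the choice of BLAKE2b is an assumption)."""
    digest = hashlib.blake2b(tag_term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def place(tag_term, f, p):
    """Return (file_id, node_id) following the slide's two formulas."""
    file_id = term_hash(tag_term) % f   # FileId(t) = hash(t) mod f
    node_id = file_id % p               # NodeId(t) = FileId(t) mod p
    return file_id, node_id

# Example with f = 4096 index files spread over p = 16 nodes (made-up sizes).
file_id, node_id = place("sec$xml", f=4096, p=16)
print(file_id, node_id)
```

Because the node is derived from the file id, every indexer can decide locally where an inverted list belongs without coordinating with other nodes.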

  9. Distributed Indexing II
     • Shared dictionary maps 64-bit keys → 64-bit values
       • Using hash(t_i) as keys
       • Using 8 bits/NodeId, 12 bits/FileId, 44 bits/ByteOffset as values
     • Max. distributed index size: 4,096 × 2^44 bytes = 16 Terabytes
     • Dictionary itself takes ~4 GB for 200 million keys
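
The 64-bit dictionary value from slide 9 can be illustrated with a small bit-packing sketch in Python. The field widths come from the slide; the field order (NodeId in the high bits, then FileId, then ByteOffset) follows the NodeId|FileId|ByteOffset notation on slide 8 and is otherwise an assumption, as are the function names.

```python
NODE_BITS, FILE_BITS, OFFSET_BITS = 8, 12, 44  # 8 + 12 + 44 = 64 bits

def pack(node_id, file_id, offset):
    """Pack the three fields into one 64-bit integer (assumed layout)."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id does not fit in 8 bits")
    if not 0 <= file_id < (1 << FILE_BITS):
        raise ValueError("file_id does not fit in 12 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in 44 bits")
    return (node_id << (FILE_BITS + OFFSET_BITS)) | (file_id << OFFSET_BITS) | offset

def unpack(value):
    """Inverse of pack: return (node_id, file_id, offset)."""
    offset = value & ((1 << OFFSET_BITS) - 1)
    file_id = (value >> OFFSET_BITS) & ((1 << FILE_BITS) - 1)
    node_id = value >> (FILE_BITS + OFFSET_BITS)
    return node_id, file_id, offset

value = pack(3, 122, 122_564)
print(unpack(value))  # (3, 122, 122564)
```

With these widths there are 256 possible node ids and 4,096 possible file ids, and a 44-bit offset on its own addresses 2^44 bytes, which is 16 TiB.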

  10. Index Files: Inverted Block Structure for CAS Queries
      • Group element blocks with similar Max-Score into document blocks of fixed length (e.g. 256 KB)
      • Sort element blocks within each document block by Doc-ID
      • Supports
        • Sequential (“sorted”) access by descending max(Max-Score)
        • Merge-joins by Doc-ID
      • Dynamic top-k pruning, efficient merge-joins over large blocks
      [Figure: index lists sec[“xml”] (labeled 0) and title[“xml”] (labeled 122,564), read by sorted access (SA). Each list is a sequence of document blocks (≤ 256 KB), each with a Max-Score. A document block holds element blocks, one per Doc-ID (Doc-IDs shown: 1, 2, 5, then 3, 6), and each element block holds (pre, post, score) entries.]
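
The grouping described on slide 10 can be sketched in Python as below. This is an illustration under simplifying assumptions rather than the TopX on-disk format: the function name `build_document_blocks` is hypothetical, and a document block is capped by a number of element blocks (`docs_per_block`) instead of a byte size such as 256 KB.

```python
def build_document_blocks(postings, docs_per_block):
    """Group one index list into document blocks.

    postings: {doc_id: [(pre, post, score), ...]} for a single tag-term pair.
    Returns a list of blocks in descending Max-Score order; inside each
    block the element blocks are sorted by doc_id.
    """
    # One element block per document, tagged with its best element score.
    element_blocks = [
        (max(score for _, _, score in elems), doc_id, elems)
        for doc_id, elems in postings.items()
    ]
    # Descending Max-Score, so neighbours in this order have similar scores.
    element_blocks.sort(key=lambda eb: eb[0], reverse=True)

    blocks = []
    for i in range(0, len(element_blocks), docs_per_block):
        chunk = element_blocks[i:i + docs_per_block]
        blocks.append({
            "max_score": chunk[0][0],  # largest Max-Score in this block
            "element_blocks": sorted(
                ((doc_id, elems) for _, doc_id, elems in chunk),
                key=lambda pair: pair[0],
            ),
        })
    return blocks

postings = {
    1: [(1, 6, 0.9)], 2: [(2, 1, 0.3)], 3: [(4, 5, 0.5)],
    5: [(3, 2, 0.8)], 6: [(6, 4, 0.2)],
}
for block in build_document_blocks(postings, docs_per_block=2):
    print(block["max_score"], [doc_id for doc_id, _ in block["element_blocks"]])
# 0.9 [1, 5]
# 0.5 [2, 3]
# 0.2 [6]
```

Reading blocks in this order gives sorted access by descending max(Max-Score), while the Doc-ID order inside each block is what makes merge-joins between lists possible.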

  11. Merging Blocks Incrementally
      • //sec[about(.//, “XML”)] //par[about(.//, “retrieval”)]
      • Sorted access (SA) and efficient merge-joins on top of large document blocks from disk
      [Figure: index lists sec[“xml”] and par[“retrieval”], each drawn as a sequence of document blocks of Doc-IDs read by sorted access; Max(Max-Score) labels shown: 0.9, 1.0, 0.6, 0.8.]
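
The merge-join by Doc-ID that slide 11 refers to can be sketched as a standard two-pointer join in Python. The function name `merge_join` and the payload values are illustrative, and the sketch assumes each input holds at most one entry per Doc-ID, as is the case for the element blocks of one document block.

```python
def merge_join(left, right):
    """Join two Doc-ID-sorted lists of (doc_id, payload) pairs on doc_id."""
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_doc, right_doc = left[i][0], right[j][0]
        if left_doc < right_doc:
            i += 1                      # advance the side with the smaller id
        elif left_doc > right_doc:
            j += 1
        else:
            joined.append((left_doc, left[i][1], right[j][1]))
            i += 1
            j += 1
    return joined

sec_xml = [(1, 0.9), (4, 0.7), (7, 0.6)]        # made-up (doc_id, score) pairs
par_retrieval = [(2, 1.0), (4, 0.8), (7, 0.5)]
print(merge_join(sec_xml, par_retrieval))
# [(4, 0.7, 0.8), (7, 0.6, 0.5)]
```

Each input is scanned once, so joining two document blocks costs time linear in their combined number of element blocks.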

  12. Some more tricks…
      • Dump leading histogram blocks directly into index list headers
      • Histograms only for index lists that exceed one document block (<5% of all lists)
      • Supports probabilistic pruning and cost-based index access scheduling [Prob-Top-K, VLDB ’04; IO-Top-K, VLDB ’06]
      • Efficient on-the-fly index decompression (S16), internal caching of decompressed index lists
      • Incrementally read & process precomputed memory images for fast top-k queries on top of large disk blocks
      [Figure: index list sec[“xml”] with a histogram block (~36 bytes; freq plotted over score from 1 to 0) followed by document blocks DB1 (256 KB), DB2 (256 KB), …, each containing element blocks EB 1, EB 2, …, EB k.]

  13. Runs
      • Ad-hoc Track (Article-Only, CO & CAS)
        • Focused
        • Best-In-Context
        • Thorough
      • Efficiency
        • Type (A) Focused (same as Ad-hoc Focused)
          • Top-15, Top-150, Top-1500, Article-Only, CO & CAS
        • Type (B) Focused, CO only
          • Top-15 only, but up to 96 keywords/query

  14. Results – Ad-hoc, Focused

  15. Results – Ad-hoc, Best-In-Context

  16. Results – Ad-hoc, Thorough

  17. Results – Efficiency, Focused (Type A)

  18. Results – Efficiency, Focused (Type A)

  19. Results – Efficiency, Focused (Type B)

  20. Results – Efficiency, Focused (Type B)

  21. Future Work
      • Phrase-matching & proximity ranking (non-monotonic!)
      • “Holistic” top-k for XQuery
      • Multiple XPaths per XQuery
      • Efficient inter-document retrieval
      • Complex Boolean constraints among paths
      • Updates!
      • Full-fledged open-source platform for W3C XQuery Full-Text
