
TopX @ INEX ‘05



  1. TopX @ INEX ’05. Martin Theobald, Ralf Schenkel, Gerhard Weikum. Max Planck Institute for Informatics, Saarbrücken

  2. [Figure: two example article trees built from the elements article, title, sec, par, bib, item, inproc and url. Text snippets shown include “Current Approaches to XML Data Management”, “The XML Files”, “Native XML databases”, “The Ontology Game”, “The Dirty Little Secret”, “XML-QL: A Query Language for XML”, “Proc. Query Languages Workshop, W3C, 1998” and “w3c.org/xml”.]
     Example query: //article[//sec[about(.//, “XML retrieval”)]//par[about(.//, “native XML database”)]]//bib[about(.//item, “W3C”)]
     An Efficient and Versatile Query Engine for TopX Search

  3. TopX: Efficient XML-IR [VLDB ’05]
     Goal: Efficiently retrieve the best results of a similarity query
     • Extend top-k query processing algorithms for sorted lists [Buckley ’85; Güntzer, Balke & Kießling ’00; Fagin ’01] to XML data
     • Non-schematic, heterogeneous data sources
     • Combined inverted index for content & structure
     • Avoid full index scans, postpone expensive random accesses to large disk-resident data structures
     • Exploit cheap disk space for redundant indexing

  4. Data Model
     [Figure: element tree of the example article, each node annotated with pre/postorder labels and its full-content term list]
     Example document:
       <article>
         <title>XML-IR</title>
         <abs>IR techniques for XML</abs>
         <sec>
           <title>Clustering on XML</title>
           <par>Evaluation</par>
         </sec>
       </article>
     Full-contents per element: article “xml ir ir technique xml clustering xml evaluation”; title “xml ir”; abs “ir technique xml”; sec “clustering xml evaluation”; sec/title “clustering xml”; par “evaluation”
     ftf(“xml”, article1) = 3
     • Simplified XML model
       • disregarding IDRef & XLink/XPointer
     • Redundant full-contents
       • Per-element term frequencies ftf(t_i, e) for full-contents
     • Pre/postorder labels for each tag-term pair
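The data model above can be illustrated with a minimal Python sketch. It computes pre/postorder labels and full-content term frequencies for the example document. The tokenizer is an assumption (lowercasing and splitting on non-alphanumerics, with no stemming or stopword removal, so "techniques" and "for" remain as terms), and the names `tokenize`, `label_elements` and `is_descendant` are hypothetical helpers, not part of TopX.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import count

DOC = """<article>
  <title>XML-IR</title>
  <abs>IR techniques for XML</abs>
  <sec>
    <title>Clustering on XML</title>
    <par>Evaluation</par>
  </sec>
</article>"""

def tokenize(text):
    # Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def label_elements(root):
    """Return one row per element, in document order, with its pre/postorder
    labels and the term frequencies of its full content (own text plus the
    text of all descendants)."""
    rows = []
    pre_counter = count(1)
    post_counter = count(1)

    def visit(elem):
        row = {"tag": elem.tag, "pre": next(pre_counter)}
        row["ftf"] = Counter(tokenize("".join(elem.itertext())))
        rows.append(row)
        for child in elem:
            visit(child)
        row["post"] = next(post_counter)  # assigned once all children are done

    visit(root)
    return rows

def is_descendant(d, a):
    # d lies below a iff a starts before d (preorder) and ends after d (postorder).
    return a["pre"] < d["pre"] and d["post"] < a["post"]

rows = label_elements(ET.fromstring(DOC))
for r in rows:
    print(r["tag"], r["pre"], r["post"], dict(r["ftf"]))
```

With these labels, an ancestor/descendant test needs only two integer comparisons, which is what makes in-memory structural joins cheap.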

  5. Full-Content Scoring Model
     • Full-content scores cast into an Okapi-BM25 probabilistic model with element-specific parameterization (per-element statistics)
     • Basic scoring idea within the IR-style family of TF*IDF ranking functions
     • Additional static score mass c for relaxable structural conditions
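The scoring formula itself is not preserved in this transcript. As rough orientation only, the sketch below applies the textbook Okapi BM25 weight at element granularity. The function name, the parameter names and the defaults k1 = 1.2 and b = 0.75 are assumptions taken from common BM25 practice, not the TopX parameterization.

```python
import math

def bm25_element_score(ftf, elem_freq, num_elems, length, avg_length,
                       k1=1.2, b=0.75):
    """Textbook BM25 weight for one term in one element (assumed form).

    ftf        -- full-content term frequency of the term in the element
    elem_freq  -- number of elements (with the same tag) containing the term
    num_elems  -- total number of elements with that tag
    length     -- full-content length of this element
    avg_length -- average full-content length over elements with that tag
    """
    # Length normalization: long elements need more occurrences for the same weight.
    K = k1 * ((1 - b) + b * length / avg_length)
    tf_part = (k1 + 1) * ftf / (K + ftf)
    # Rare terms (small elem_freq) receive a larger inverse-frequency weight.
    idf_part = math.log((num_elems - elem_freq + 0.5) / (elem_freq + 0.5))
    return tf_part * idf_part

low = bm25_element_score(ftf=1, elem_freq=10, num_elems=1000, length=8, avg_length=8)
high = bm25_element_score(ftf=3, elem_freq=10, num_elems=1000, length=8, avg_length=8)
print(low, high)
```

Collecting the statistics per tag rather than per collection is one way to read "element-specific parameterization": a term that is rare in title elements can score differently there than in par elements.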

  6. Inverted Block-Index for Content & Structure
     [Figure: inverted lists for sec[clustering], title[xml] and par[evaluation]]
     • Inverted index over tag-term pairs (full-contents)
       • Benefits from increased selectivity of combined tag-term pairs
       • Accelerates child-or-descendant axis, e.g., sec//“clustering”
     • Sequential block-scans
       • Re-order elements in descending order of (maxscore, docid, score) per list
       • Fetch all tag-term pairs per doc in one sequential block-access
       • docid limits the range of in-memory structural joins
     • Stored as inverted files or database tables (B+-tree indexes)
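The block ordering described above can be sketched as follows. This is a minimal in-memory illustration, assuming postings arrive as (tag, term, docid, element id, score) tuples. The function name `build_block_index` and the tuple layout are hypothetical, and a real index would live on disk.

```python
from collections import defaultdict

def build_block_index(postings):
    """Group postings by tag-term pair, then by document.

    Each list holds one block per document. Blocks are ordered by descending
    maxscore (ties broken by docid), and elements inside a block by
    descending score.
    """
    by_key = defaultdict(lambda: defaultdict(list))
    for tag, term, docid, elem_id, score in postings:
        by_key[(tag, term)][docid].append((score, elem_id))

    index = {}
    for key, docs in by_key.items():
        blocks = []
        for docid, elems in docs.items():
            elems.sort(key=lambda e: (-e[0], e[1]))
            maxscore = elems[0][0]
            blocks.append((maxscore, docid, elems))
        blocks.sort(key=lambda blk: (-blk[0], blk[1]))
        index[key] = blocks
    return index

postings = [
    ("sec", "clustering", 1, 10, 0.5),
    ("sec", "clustering", 2, 20, 0.9),
    ("sec", "clustering", 1, 11, 0.8),
    ("title", "xml", 2, 21, 1.0),
]
index = build_block_index(postings)
for maxscore, docid, elems in index[("sec", "clustering")]:
    print(docid, maxscore, elems)
```

Because a whole document block is read in one sequential access, all elements needed for a structural join on that document are in memory at the same time.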

  7. Navigational Index
     [Figure: the query pattern sec, title[xml], par[evaluation] next to the structure-only pattern sec, title, par]
     • Additional navigational index
       • Non-redundant element directory
       • Supports element paths and branching path queries
       • Random accesses using (docid, tag) as key
       • Schema-oblivious indexing & querying

  8. TopX Query Processing [Fagin et al., PODS ’01; Güntzer et al., VLDB ’00; Buckley & Lewit, SIGIR ’85]
     • Adapt Threshold Algorithm (TA) paradigm
       • Focus on inexpensive sequential/sorted accesses
       • Postpone expensive random accesses
     • Candidate d = connected sub-pattern with element ids and scores
     • Incrementally evaluate path constraints using pre/postorder labels
       • In-memory structural joins (nested loops, staircase, or holistic twig joins)
     • Upper/lower score guarantees per candidate
       • Remember set of evaluated dimensions E(d)
         worstscore(d) = ∑_{i ∈ E(d)} score(t_i, e)
         bestscore(d) = worstscore(d) + ∑_{i ∉ E(d)} high_i
     • Early threshold termination
       • Candidate queuing
       • Stop, if [condition shown as a formula on the slide; not preserved in this transcript]
     • Extensions
       • Batching of sorted accesses & efficient queue management
       • Cost model for random access scheduling
       • Probabilistic candidate pruning for approximate top-k results [VLDB ’04]
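The worstscore/bestscore bookkeeping can be sketched in Python as below. This is a heavily simplified, document-level illustration of a TA-style scan that uses sorted accesses only: it has no structural joins, no random accesses and no candidate queue. The stopping rule used here (no other candidate and no unseen document can still exceed the current rank-k worstscore) is the usual threshold-algorithm condition and is an assumption, since the slide's own formula is missing. The function name `topk_sorted_access` is hypothetical.

```python
def topk_sorted_access(lists, k):
    """lists: one list per query dimension, each [(docid, score), ...] sorted
    by descending score. Returns the top-k docids with their worstscores,
    which are lower bounds on the final scores."""
    m = len(lists)
    seen = {}                # docid -> {dimension: score}
    pos = [0] * m            # scan position per list
    high = [lst[0][1] if lst else 0.0 for lst in lists]  # bound on unseen scores

    while True:
        progressed = False
        for i, lst in enumerate(lists):          # one round-robin sorted access each
            if pos[i] < len(lst):
                docid, score = lst[pos[i]]
                pos[i] += 1
                seen.setdefault(docid, {})[i] = score
                high[i] = score                  # later entries cannot score higher
                progressed = True
            else:
                high[i] = 0.0                    # list exhausted

        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in s)
                for d, s in seen.items()}
        top = sorted(worst, key=lambda d: (-worst[d], d))[:k]
        min_k = worst[top[-1]] if len(top) == k else 0.0

        # Best score still reachable by any non-top-k candidate or unseen document.
        rivals = [best[d] for d in seen if d not in top] + [sum(high)]
        if max(rivals) <= min_k or not progressed:
            return [(d, worst[d]) for d in top]

lists = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)],
    [("d2", 0.9), ("d1", 0.7), ("d3", 0.2)],
    [("d1", 0.8), ("d2", 0.6), ("d3", 0.3)],
]
result = topk_sorted_access(lists, k=1)
print(result)
```

In this toy run the scan stops after two rounds, before the d3 entries are read, because d2's bestscore and the bound for unseen documents have both fallen to or below d1's worstscore.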

  9. TopX Query Processing By Example
     [Figure: animated run of a top-2 query over the inverted lists sec[clustering], title[xml] and par[evaluation]. Candidates from doc1, doc2, doc3, doc5 and doc17 carry worst/best score bounds that tighten as the scan proceeds, while the threshold min-2 rises from 0.0 through 0.5 and 0.9 to 1.6. The figure also shows the top-2 results, the candidate queue and a pseudo-candidate.]

  10. CO.Thorough
      • Element-granularity
      • Turn query into pseudo CAS query using “//*”
      • No post-filtering on specific element types
      • nxCG@10 = 0.0379 (rank 22 of 55)
      • MAP = 0.008 (rank 37 of 55)
      • Old INEX_eval: MAP = 0.058 (rank 3)

  11. COS.Fetch&Browse
      • Document-granularity
      • Rank documents according to their best target element
      • Strict evaluation of support & target elements
      • Return all target elements per doc using the document score (no overlap)
      • MAP = 0.0601 (rank 4 of 19)

  12. SSCAS
      • Element-granularity with strict support & target elements (no overlap)
      • nxCG@10 = 0.45 (ranks 1 & 2 of 25)
      • MAP = 0.0322 & 0.0272 (ranks 1 & 6)

  13. Top-k Efficiency
      Method            k      epsilon  # SA       # RA       CPU sec
      Join&Sort         10     n/a      9,122,318  0          0.261
      StructIndex       10     n/a      761,970    3,25,068   0.37
      StructIndex+      10     n/a      77,482     5,074,384  1.87
      TopX – MinProbe   10     0.0      635,507    64,807     0.03
      TopX – BenProbe   10     0.0      723,169    84,424     0.07
      TopX – BenProbe   1,000  0.0      882,929    1,902,427  0.35
      Quality figures appear once per group: P@k = 0.34, MAP@k = 0.09, relPrec = 1.00 with the k = 10 runs; P@k = 0.03, MAP@k = 0.17, relPrec = 1.00 with the k = 1,000 run.

  14. Probabilistic Pruning (TopX – MinProbe)
      k   epsilon  # SA     # RA    CPU sec  P@k   MAP@k  relPrec
      10  0.00     635,507  64,807  0.03     0.34  0.09   1.00
      10  0.25     392,395  56,952  0.05     0.34  0.08   0.77
      10  0.50     231,109  48,963  0.02     0.31  0.08   0.65
      10  0.75     102,118  42,174  0.01     0.33  0.08   0.51
      10  1.00     36,936   35,327  0.01     0.30  0.07   0.38
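To make the trade-off in the pruning figures easier to read, the short Python sketch below expresses each run's access counts as a fraction of the exact run (epsilon = 0). The numbers are copied from the slide; the names `RUNS` and `relative_cost` are illustrative only.

```python
# (epsilon, sorted accesses, random accesses, P@k, relPrec) from the pruning slide
RUNS = [
    (0.00, 635_507, 64_807, 0.34, 1.00),
    (0.25, 392_395, 56_952, 0.34, 0.77),
    (0.50, 231_109, 48_963, 0.31, 0.65),
    (0.75, 102_118, 42_174, 0.33, 0.51),
    (1.00, 36_936, 35_327, 0.30, 0.38),
]

def relative_cost(runs):
    """Return (epsilon, SA fraction, RA fraction) relative to the first run."""
    base_sa, base_ra = runs[0][1], runs[0][2]
    return [(eps, sa / base_sa, ra / base_ra) for eps, sa, ra, _, _ in runs]

for eps, sa_frac, ra_frac in relative_cost(RUNS):
    print(f"epsilon={eps:.2f}  SA={sa_frac:.1%}  RA={ra_frac:.1%}")
```

At epsilon = 1.00 the sorted accesses fall to roughly 6% of the exact run, while P@k only drops from 0.34 to 0.30. relPrec falls much further, to 0.38, so the approximate run returns many different results that are nonetheless of similar quality.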

  15. Conclusions & Ongoing Work
      • Efficient and versatile TopX query processor
        • Extensible framework for text, semi-structured & structured data
      • Probabilistic Extensions
        • Probabilistic cost model for random access scheduling
        • Very good precision/runtime ratio for probabilistic candidate pruning
      • Full NEXI support
        • Phrase matching, mandatory terms “+”, negation “-”, attributes “@”
        • Query weights (e.g., relevance feedback, ontological similarities)
      • Scalability
        • Optimized for runtime, exploits cheap disk space (redundancy factor 4-5 for INEX)
        • Participated in the TREC Terabyte Efficiency Task
      • Dynamic and self-tuning query expansions [SIGIR ’05]
        • Incrementally merges inverted lists for a set of active expansions
      • Vague Content & Structure (VCAS) queries (maybe next year...)
