An Adaptive XML Retrieval System
This presentation describes an adaptive XML retrieval system. The system splits the elements of a collection into separate indices so that every element is indexed at least once (coverage) and no index contains nested elements (no overlap), which avoids the distorted element-level statistics that arise when all elements share one index. A query is run on each index and the results are merged into a single result list, supporting both the Thorough task (possibly overlapping elements of varying granularity) and the Focussed task (no overlap). In experiments on the IEEE and Wikipedia collections from INEX, the indices cover 87% of the IEEE recall base and 75% of the filtered Wikipedia recall base, although the MAep results are relatively low.
Presentation Transcript
An Adaptive XML Retrieval System
Yosi Mass, Michal Shmueli-Scheuer, IBM Haifa Research Lab
XML Ranking Querying, Dagstuhl, 9-13 Mar, 2008
The XML retrieval tasks
• Query formulation
  • CO – Content only
  • CAS – Content and structure (NEXI)
• Retrieval tasks
  • Thorough: “find all highly exhaustive and specific elements”
    • Retrieval results can be (possibly overlapping) XML elements of varying granularity that fulfill the query
  • Focussed: “find the most exhaustive and specific element in a path”
    • No overlap in returned results
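The "no overlap" requirement of the Focussed task can be illustrated with a small sketch. The slides do not say how overlap is removed, so the greedy rule below (keep the best-scoring element, drop its ancestors and descendants) is an assumption, and `remove_overlap` and `is_ancestor` are hypothetical names.

```python
from typing import List, Tuple

Hit = Tuple[float, str]  # (score, element XPath)


def is_ancestor(a: str, b: str) -> bool:
    """True if element a is a proper ancestor of element b."""
    return b.startswith(a + "/")


def remove_overlap(hits: List[Hit]) -> List[Hit]:
    """Greedy filter (an assumed rule, not taken from the slides):
    visit hits by descending score and keep a hit only if it does not
    nest with any hit that was already kept."""
    kept: List[Hit] = []
    for score, path in sorted(hits, reverse=True):
        nests = any(
            path == k or is_ancestor(path, k) or is_ancestor(k, path)
            for _, k in kept
        )
        if not nests:
            kept.append((score, path))
    return kept


thorough = [
    (0.9, "/article[1]/bdy[1]/sec[2]"),
    (0.5, "/article[1]/bdy[1]"),
    (0.7, "/article[1]/bdy[1]/sec[1]/ss1[1]"),
]
focussed = remove_overlap(thorough)
```

Here the Thorough list may contain both `/article[1]/bdy[1]` and its descendants, while the Focussed list keeps at most one element per path.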
Approaches for XML retrieval
• Index full documents
  • Score documents and then components inside the documents
  • Problem: works well for “fetch and browse” but not for the general Thorough task
• Index only leaf elements
  • Score leaves and propagate scores along the XML tree
  • Problem: weights used to propagate are either set manually by the user or set empirically
• Index all elements into the same index
  • Score all possible elements
  • Problem: distorted “element-level” statistics due to overlapping
• Can we fix the distorted statistics?
An adaptive XML retrieval system
• Split all collection elements into separate indices such that
  • Coverage: each element is indexed in at least one index
  • No overlap: elements in each index do not nest
• Run the query on each index
• Merge the results into a single result list
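The "run on each index, then merge" flow can be sketched as follows. This is a minimal illustration, not the system's implementation: it assumes scores from different indices are directly comparable, and `search_all`, the `search` callback, and the toy dictionary indices are all hypothetical.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Tuple

Hit = Tuple[float, str]  # (score, element XPath)


def search_all(query: str, indices: Iterable[object],
               search: Callable[[object, str], List[Hit]]) -> List[Hit]:
    """Run the query on every index and merge the ranked lists.

    Assumes (not stated in the slides) that scores from different
    indices can be compared without further normalization."""
    ranked = [sorted(search(index, query), reverse=True) for index in indices]
    # heapq.merge with reverse=True expects inputs sorted from largest to smallest.
    return list(heapq.merge(*ranked, reverse=True))


# Toy stand-in for an IR engine: each index maps a term to its hits.
index0: Dict[str, List[Hit]] = {"xml": [(0.4, "/article[1]")]}
index2: Dict[str, List[Hit]] = {
    "xml": [(0.1, "/article[1]/bdy[1]/sec[1]"),
            (0.9, "/article[1]/bdy[1]/sec[2]")],
}
merged = search_all("xml", [index0, index2], lambda idx, q: idx.get(q, []))
```

Because each index is searched independently, any existing engine can be plugged in through the `search` callback.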
Split to indices - example
• Document 1
  • Index 0: /article[1]
  • Index 1: /article[1]/bdy[1]
  • Index 2: /article[1]/bdy[1]/sec[1], /article[1]/bdy[1]/sec[2]
  • Index 3: /article[1]/bdy[1]/sec[2]/p[1], /article[1]/bdy[1]/sec[2]/p[3]
• Document 2
  • Index 0: /article[1]
  • Index 1: /article[1]/bdy[1]
  • Index 2: /article[1]/bdy[1]/sec[1]
  • Index 3: /article[1]/bdy[1]/sec[1]/ss1[1], /article[1]/bdy[1]/sec[1]/ss1[2]
An adaptive indexing schema
SplitToIndices(doc, minCompSize, nInd)
• Find all leaves in doc that are larger than minCompSize (the minimal leaves)
• If no minimal leaves are found, return G0 = {root}
• Let d be the length of the longest path among all those leaves
• Create groups {G0,…,Gd-1}, where each Gi contains all elements inferred from the XPath prefixes of length i of all matched leaves
• Remove repeating elements in each group
• Split the groups {G0,…,Gd-1} into indices {I0,…,InInd-1} (several strategies)
• Return {I0,…,InInd-1}
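The steps of SplitToIndices can be sketched in Python under a few stated assumptions. A document is represented as a hypothetical, non-empty mapping from leaf XPath to leaf size; group G_i is read as the set of prefixes with i + 1 steps, so that G_0 is the root; and since the slides only mention "several strategies" for the final split, the one used here (keep the top `head` levels and the deepest levels, drop the middle) is an assumption.

```python
from typing import Dict, List, Set


def steps(xpath: str) -> List[str]:
    """Split '/a[1]/b[2]' into ['a[1]', 'b[2]']."""
    return xpath.strip("/").split("/")


def split_to_indices(doc: Dict[str, int], min_comp_size: int,
                     n_ind: int, head: int = 3) -> List[Set[str]]:
    """doc maps each leaf XPath to its size (hypothetical representation,
    assumed non-empty). Returns one set of element XPaths per index."""
    root = "/" + steps(next(iter(doc)))[0]
    leaves = [p for p, size in doc.items() if size > min_comp_size]
    if not leaves:
        return [{root}]

    d = max(len(steps(p)) for p in leaves)
    # G_i holds the prefixes with i + 1 steps; sets drop repeated elements.
    groups: List[Set[str]] = [set() for _ in range(d)]
    for leaf in leaves:
        parts = steps(leaf)
        for i in range(len(parts)):
            groups[i].add("/" + "/".join(parts[: i + 1]))

    if d <= n_ind:
        return groups
    # Assumed strategy: keep the top `head` levels and the deepest
    # n_ind - head levels, dropping the levels in between.
    head = min(head, n_ind)
    tail = n_ind - head
    return groups[:head] + groups[d - tail:]


doc = {
    "/article[1]/bdy[1]/sec[1]": 50,
    "/article[1]/bdy[1]/sec[2]/p[1]": 20,
    "/article[1]/bdy[1]/sec[2]/p[3]": 30,
}
indices = split_to_indices(doc, min_comp_size=10, n_ind=7)
```

All elements in one group have the same path length, so no two of them can nest, which is what gives the "no overlap" property within each index.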
Examples – cut long paths
• Minimal element: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
• Split to indices
  • index 0: /article[1]
  • index 1: /article[1]/body[1]
  • index 2: /article[1]/body[1]/section[7]
  • index 3: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]
  • index 4: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]
  • index 5: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]
  • index 6: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
Experiments
• IEEE collection 1995-2004
  • 17,000 articles, 700MB
  • Average document length ~41K
  • Average depth 6.9
  • 29 topics from INEX 2005
• Wikipedia collection
  • 660,000 pages, 4.5GB
  • Average document length 6.8K
  • Average depth 6.72
  • 111 topics from INEX 2006
Coverage
• For nInd=7 and minCompSize=10:
  • 87% coverage for the IEEE collection recall base
  • 75% coverage for the Wikipedia collection filtered recall base
• The filtered recall base was generated by removing all link elements from the recall base
• We still miss some small elements and some in-between elements that have depth > 7
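One plausible reading of "coverage" is the fraction of recall-base elements that appear in at least one index. The slides do not give the exact definition, so the sketch below is an assumption; `coverage` and the element identifiers are hypothetical.

```python
from typing import Iterable, Set


def coverage(recall_base: Set[str], indices: Iterable[Set[str]]) -> float:
    """Fraction of recall-base elements found in at least one index
    (assumed definition; the slides report only the percentages)."""
    if not recall_base:
        return 0.0
    indexed: Set[str] = set().union(*indices)
    return len(recall_base & indexed) / len(recall_base)


# Hypothetical identifiers of the form 'document:XPath'.
recall_base = {
    "d1:/article[1]/bdy[1]",
    "d1:/article[1]/bdy[1]/sec[2]",
    "d1:/article[1]/bdy[1]/sec[2]/p[1]",
    "d1:/article[1]/bdy[1]/sec[2]/p[1]/it[1]",  # small element, not indexed
}
toy_indices = [
    {"d1:/article[1]"},
    {"d1:/article[1]/bdy[1]"},
    {"d1:/article[1]/bdy[1]/sec[2]"},
    {"d1:/article[1]/bdy[1]/sec[2]/p[1]"},
]
ratio = coverage(recall_base, toy_indices)
```

In this toy case the missed element is a small one below the indexed leaves, which mirrors the kind of miss the slide describes.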
Doc pivot
• Some low-level indices hold only partial content of the collection and thus have missing statistics
• Solution: compensate with the containing document’s score
  Score’(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)
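The doc pivot formula is a linear interpolation between the element's own score and the score of its containing document. A minimal sketch follows; the function name `pivot_score` and the restriction of docPivot to the range [0, 1] are assumptions, not taken from the slides.

```python
def pivot_score(element_score: float, doc_score: float,
                doc_pivot: float) -> float:
    """Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e).

    Restricting doc_pivot to [0, 1] is an assumption made here so that
    the result stays between the two input scores."""
    if not 0.0 <= doc_pivot <= 1.0:
        raise ValueError("doc_pivot is assumed to lie in [0, 1]")
    return doc_pivot * doc_score + (1.0 - doc_pivot) * element_score


# An element scored 0.5 inside a document scored 1.0, with docPivot = 0.5.
adjusted = pivot_score(element_score=0.5, doc_score=1.0, doc_pivot=0.5)
```

With docPivot = 0 the element keeps its original score, and with docPivot = 1 it inherits the document's score entirely.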
Elements distribution
Tuning number of indices
• Set minCompSize=10
Tuning min component size
• Set number of indices nInd=7
Summary
• Adaptive indexing schema
  • Split XML elements into separate indices
  • Same parameters for different collections
• XML retrieval system
  • Achieved by running existing IR engines on each index
  • Can be used for CAS
• Relatively low MAep results
• Does XML structure reflect any semantic structure?
Thank you!