An Adaptive XML Retrieval System

Yosi Mass, Michal Shmueli-Scheuer
IBM Haifa Research Lab
XML Ranking Querying, Dagstuhl, 9-13 Mar, 2008

The XML retrieval tasks
• Query formulation
  • CO – Content only
  • CAS – Content and structure (NEXI)
• Retrieval tasks
  • Thorough: “find all highly exhaustive and specific elements”
    • Retrieval results can be (possibly overlapping) XML elements of varying granularity that fulfill the query
  • Focussed: “find the most exhaustive and specific element in a path”
    • No overlap in returned results

Approaches for XML retrieval
• Index full documents
  • Score documents and then components inside the documents
  • Problem: works well for “fetch and browse” but not for the general thorough task
• Index only leaf elements
  • Score leaves and propagate scores along the XML tree
  • Problem: the weights used for propagation are either set manually by the user or set empirically
• Index all elements into the same index
  • Score all possible elements
  • Problem: distorted “element-level” statistics due to overlapping elements
• Can we fix the distorted statistics?

An adaptive XML retrieval system
• Split all collection elements into separate indices such that
  • Coverage: each element is indexed in at least one index
  • No overlap: elements in each index do not nest
• Run the query on each index
• Merge the results into a single result list
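The query and merge steps can be illustrated with a minimal sketch. The slide does not say how results from different indices are merged, so everything here is an assumption for illustration: the toy term-count scorer `toy_score`, the helper names, and in particular the choice to normalize each index's scores by its top score before merging.

```python
Result = tuple[str, float]  # (element XPath, score)


def toy_score(query: str, text: str) -> float:
    """Hypothetical scorer: number of query-term occurrences in the text."""
    words = text.lower().split()
    return float(sum(words.count(term) for term in query.lower().split()))


def run_query(query: str, index: dict[str, str]) -> list[Result]:
    """Score every element of one index and keep the elements that match."""
    hits = [(path, toy_score(query, text)) for path, text in index.items()]
    return [(path, score) for path, score in hits if score > 0]


def merge_results(per_index: list[list[Result]]) -> list[Result]:
    """Merge per-index result lists into a single ranked list.

    Assumption: scores are divided by the top score of their own index so
    that indices with different statistics become roughly comparable.
    """
    merged: list[Result] = []
    for hits in per_index:
        if not hits:
            continue
        top = max(score for _, score in hits)
        merged.extend((path, score / top) for path, score in hits)
    # Highest score first; ties are broken by path for a stable order.
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))


# Two toy indices; elements inside one index do not nest.
indices = [
    {"/article[1]": "xml retrieval with xml indices"},
    {
        "/article[1]/bdy[1]/sec[1]": "xml retrieval",
        "/article[1]/bdy[1]/sec[2]": "xml xml ranking",
    },
]
ranked = merge_results([run_query("xml", index) for index in indices])
```

Because each index is queried independently, any existing IR engine could stand in for `toy_score` without changing the merge step.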

Split to indices - example
• Index 0: /article[1]
• Index 1: /article[1]/bdy[1]
• Index 2: /article[1]/bdy[1]/sec[1], /article[1]/bdy[1]/sec[2]
• Index 3: /article[1]/bdy[1]/sec[1]/ss1[1], /article[1]/bdy[1]/sec[1]/ss1[2], /article[1]/bdy[1]/sec[2]/p[1], /article[1]/bdy[1]/sec[2]/p[3]
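The no-overlap property of this example can be checked mechanically. The sketch below treats element paths as plain strings; the helper names `is_ancestor` and `has_overlap` are hypothetical, not part of the presented system.

```python
def is_ancestor(a: str, b: str) -> bool:
    """True if the element at path a is a proper ancestor of the one at b."""
    # Appending "/" keeps sec[1] from matching sec[10].
    return b.startswith(a + "/")


def has_overlap(index: list[str]) -> bool:
    """True if any two elements stored in the same index nest."""
    return any(is_ancestor(a, b) for a in index for b in index)


# The four indices from the example.
indices = {
    0: ["/article[1]"],
    1: ["/article[1]/bdy[1]"],
    2: ["/article[1]/bdy[1]/sec[1]", "/article[1]/bdy[1]/sec[2]"],
    3: [
        "/article[1]/bdy[1]/sec[1]/ss1[1]",
        "/article[1]/bdy[1]/sec[1]/ss1[2]",
        "/article[1]/bdy[1]/sec[2]/p[1]",
        "/article[1]/bdy[1]/sec[2]/p[3]",
    ],
}

overlap_free = all(not has_overlap(paths) for paths in indices.values())
```

Putting two levels into one index, for instance indices 0 and 1 together, would make `has_overlap` return True, which is exactly the situation the split avoids.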

An adaptive indexing schema
SplitToIndices(doc, minCompSize, nInd)
• Find all leaves in doc that are larger than minCompSize
• If no such minimal leaves are found, return G0 = {root}
• Let d be the length of the longest path among all those leaves
• Create groups {G0,…,Gd-1}, where each Gi contains all elements given by the XPath prefixes of length i of all matched leaves
• Remove repeating elements in each group
• Split the groups {G0,…,Gd-1} to indices {I0,…,InInd-1} (several strategies)
• Return {I0,…,InInd-1}
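The group-building steps of SplitToIndices can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the document is represented as a mapping from leaf XPath to size (in whatever unit minCompSize uses), the root path is passed in explicitly, group Gi is taken to hold the prefixes with i+1 steps so that G0 holds the root, and the final assignment of groups to indices is left out because the slide leaves the strategy open. The name `build_groups` and the sizes in the example are hypothetical.

```python
def build_groups(
    leaf_sizes: dict[str, int], min_comp_size: int, root: str
) -> list[list[str]]:
    """Return groups G0..G(d-1) of element paths, one group per depth level."""
    # Keep only the leaves that are larger than the minimal component size.
    leaves = [path for path, size in leaf_sizes.items() if size > min_comp_size]
    if not leaves:
        return [[root]]  # G0 = {root}

    split = [path.strip("/").split("/") for path in leaves]
    d = max(len(steps) for steps in split)  # longest path, counted in steps

    groups = []
    for i in range(d):
        # Prefix with i+1 steps of every leaf that is deep enough.
        # dict.fromkeys removes repeated elements while keeping their order.
        prefixes = dict.fromkeys(
            "/" + "/".join(steps[: i + 1]) for steps in split if len(steps) > i
        )
        groups.append(list(prefixes))
    return groups


# Hypothetical leaf sizes; p[2] is too small and is ignored.
doc = {
    "/article[1]/bdy[1]/sec[1]/ss1[1]": 30,
    "/article[1]/bdy[1]/sec[1]/ss1[2]": 12,
    "/article[1]/bdy[1]/sec[2]/p[1]": 40,
    "/article[1]/bdy[1]/sec[2]/p[2]": 3,
    "/article[1]/bdy[1]/sec[2]/p[3]": 25,
}
groups = build_groups(doc, min_comp_size=10, root="/article[1]")
```

With these inputs the function yields four groups, one per depth level, and no element appears twice in a group.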

Examples – cut long paths
• Minimal element: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
• Split to indices
  • index 0: /article[1]
  • index 1: /article[1]/body[1]
  • index 2: /article[1]/body[1]/section[7]
  • index 3: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]
  • index 4: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]
  • index 5: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]
  • index 6: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
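In this example a path with 10 steps is mapped onto 7 indices by keeping the 3 shallowest and the 4 deepest levels and dropping the levels in between. The sketch below reproduces that outcome; the general rule is not given on the slide, so the function `select_levels` and its `head` parameter (how many shallow levels to keep) are assumptions chosen to match the example.

```python
def select_levels(d: int, n_ind: int, head: int = 3) -> list[int]:
    """Return the prefix lengths (in steps) assigned to indices 0..n_ind-1.

    Assumes head <= n_ind. If the path fits, every level gets its own index;
    otherwise the middle levels are cut.
    """
    if d <= n_ind:
        return list(range(1, d + 1))
    tail = n_ind - head  # number of deepest levels to keep
    return list(range(1, head + 1)) + list(range(d - tail + 1, d + 1))


path = (
    "/article[1]/body[1]/section[7]/table[1]"
    "/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]"
)
steps = path.strip("/").split("/")

# One prefix per index, in index order.
prefixes = ["/" + "/".join(steps[:n]) for n in select_levels(len(steps), n_ind=7)]
```

Here `select_levels(10, 7)` gives the lengths 1, 2, 3, 7, 8, 9, 10, so `prefixes[3]` is the seven-step path listed for index 3 and `prefixes[6]` is the minimal element itself.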

Experiments
• IEEE collection 1995-2004
  • 17,000 articles, 700MB
  • Average document length ~41K
  • Average depth 6.9
  • 29 topics from INEX 2005
• Wikipedia collection
  • 660,000 pages, 4.5GB
  • Average document length 6.8K
  • Average depth 6.72
  • 111 topics from INEX 2006

Coverage
• For nInd=7 and minCompSize=10:
  • 87% coverage for the IEEE collection recall base
  • 75% coverage for the Wikipedia collection filtered recall base
    • The filtered recall base was generated by removing all link elements from the recall base
• We still miss some small elements and some in-between elements that have depth > 7
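The slide does not define how coverage is computed. One plausible reading, assumed here, is the fraction of recall-base elements that appear in at least one index. The function name `coverage` and the toy paths are hypothetical and are not data from the experiments.

```python
def coverage(recall_base: set[str], indices: list[set[str]]) -> float:
    """Fraction of recall-base elements that are stored in some index."""
    indexed = set().union(*indices)
    return len(recall_base & indexed) / len(recall_base)


# Toy data: the small paragraph p[2] was never indexed.
recall_base = {
    "/article[1]",
    "/article[1]/bdy[1]/sec[1]",
    "/article[1]/bdy[1]/sec[2]/p[1]",
    "/article[1]/bdy[1]/sec[2]/p[2]",
}
indices = [
    {"/article[1]"},
    {"/article[1]/bdy[1]/sec[1]", "/article[1]/bdy[1]/sec[2]"},
    {"/article[1]/bdy[1]/sec[2]/p[1]"},
]
toy_coverage = coverage(recall_base, indices)
```

Under this reading, relevant elements that are too small or that fall on a cut level lower the value, which matches the kinds of misses the slide mentions.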

Doc pivot
• Some low-level indices hold only partial content of the collection and thus have missing statistics
• Solution: compensate using the containing document’s score
  Score’(e) = docPivot * Score(doc(e)) + (1 – docPivot) * Score(e)
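The doc pivot formula is a linear interpolation between the element's own score and the score of its containing document. A minimal sketch follows; the function name and the example values are illustrative only.

```python
def pivot_score(element_score: float, doc_score: float, doc_pivot: float) -> float:
    """Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)."""
    return doc_pivot * doc_score + (1.0 - doc_pivot) * element_score


# A weakly scored element inside a strongly scored document is pulled upward.
adjusted = pivot_score(element_score=0.2, doc_score=0.8, doc_pivot=0.5)
```

With docPivot = 0 the element score is used unchanged, and with docPivot = 1 every element inherits its document's score.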

Elements distribution
[Figure]

Tuning number of indices
• Set minCompSize=10
[Figure]

Tuning min component size
• Set num indices nInd=7
[Figure]

Summary
• Adaptive indexing schema
  • Split XML elements to separate indices
  • Same parameters for different collections
• XML retrieval system
  • Achieved by running existing IR engines on each index
  • Can be used for CAS
• Relatively low MAep results
• Does XML structure reflect any semantic structure?

Thank you!
