
XML Information Retrieval
ECIR 2006, London, UK


Presentation Transcript


  1. XML Information Retrieval
  • XML IR has been a hot DB & IR research topic for more than 6 years: XQuery/IR, XSearch, XIRQL, XXL, Niagara, …, TopX, SphereSearch, …, XPath/XQuery Full Text
  • More than 60 groups participate in the Initiative for the Evaluation of XML Retrieval (INEX 2006)
  • Most systems allow users to specify constraints on content and structure
  • Why do search engines not use XML IR? ("There is no XML data" is not a valid answer)
  • Why do end users not use XPath?

  2. Users vs. Structural XML IR
  • Structural query languages do not work in practice:
    • The schema is unknown or heterogeneous
    • The language is too complex
    • Humans don't think in XPath
    • Results are often unsatisfying
  • Example: "I need information about a professor in SB who teaches IR" becomes
    //professor[contains(., "SB") and contains(.//course, "IR")]
  • System support to generate "good" structured queries:
    • User interfaces ("advanced search")
    • Natural language processing
    • Interactive query refinement

  3. Outline
  • Relevance Feedback
  • Structural Features for Feedback on XML
  • Evaluation
  • Summary and Outlook

  4. Relevance Feedback for Interactive Query Refinement
    1. User submits a query
    2. User marks relevant and nonrelevant docs
    3. System finds the terms that best distinguish relevant from nonrelevant docs
    4. System submits the expanded query
  [Figure: feedback loop against the index; the keyword query "query evaluation" is expanded with "XML" and "not(Fagin)"]
  • Feedback for XML IR:
    • Start with a keyword query
    • Find structural expansions
    • Create a structural query
  (A sketch of this loop follows below.)
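A minimal sketch of this loop, assuming a hypothetical `search` callable that returns results as plain strings and an `is_relevant` judgment function; the term selection here is a crude stand-in for the RSV-based selection on the following slides:

```python
from collections import Counter

def select_expansion_terms(relevant, nonrelevant, k=3):
    # Crude stand-in for step 3: prefer terms frequent in relevant
    # results and rare in nonrelevant ones.
    pos = Counter(t for doc in relevant for t in doc.split())
    neg = Counter(t for doc in nonrelevant for t in doc.split())
    score = {t: pos[t] / max(len(relevant), 1) - neg[t] / max(len(nonrelevant), 1)
             for t in pos}
    return sorted(score, key=score.get, reverse=True)[:k]

def feedback_round(search, keyword_query, is_relevant):
    results = search(keyword_query)                       # step 1
    rel = [r for r in results if is_relevant(r)]          # step 2
    nonrel = [r for r in results if not is_relevant(r)]
    terms = select_expansion_terms(rel, nonrel)           # step 3
    return search(keyword_query + " " + " ".join(terms))  # step 4
```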

  5. Structural Features
  • The user marks a relevant result element, e.g. a sec ("Semistructured data…") inside an article with frontmatter, body, and backmatter, with subsec and p descendants and an author "Baeza-Yates" elsewhere in the document
  • Possible feature classes (see the sketch below):
    • C: content of the result, e.g. XML
    • D: tag+content of descendants, e.g. p[XSLT]
    • A: tag+content of ancestors, e.g. sec[data]
    • AD: tag+content of descendants of ancestors, e.g. article//author[Baeza]
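A sketch of how the four feature classes could be collected for a marked element, using Python's ElementTree; the traversal details and tokenization are our simplifications, not TopX's implementation:

```python
import xml.etree.ElementTree as ET

def structural_features(root, result):
    # Map each element to its parent (ElementTree has no parent pointers).
    parent = {c: p for p in root.iter() for c in p}
    feats = {"C": set(), "D": set(), "A": set(), "AD": set()}

    # C: terms in the result's own content (mixed/descendant text ignored here).
    feats["C"].update((result.text or "").split())

    # D: tag+content of the result's descendants.
    for d in result.iter():
        if d is not result:
            feats["D"].update((d.tag, t) for t in (d.text or "").split())

    # A and AD: walk up the ancestor chain; AD also collects the
    # descendants of each ancestor (the result's own subtree is not
    # filtered out in this sketch).
    node = result
    while node in parent:
        anc = parent[node]
        feats["A"].update((anc.tag, t) for t in (anc.text or "").split())
        for d in anc.iter():
            if d is not anc:
                feats["AD"].update((d.tag, t) for t in (d.text or "").split())
        node = anc
    return feats
```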

  6. Feature Selection
  • Order features by Robertson Selection Value:
    RSV(f) = w_f · (p_f − q_f)
    where p_f is the probability that f occurs in a relevant result and q_f the probability that f occurs in a nonrelevant result
  • Compute the Robertson-Sparck-Jones weight w_f for each feature (also used as its weight in the query):
    w_f = log( ((r_f + 0.5)(E − e_f − R + r_f + 0.5)) / ((R − r_f + 0.5)(e_f − r_f + 0.5)) )
    where r_f is the number of relevant results with f, R the number of relevant results, e_f the number of elements that contain f, and E the number of all elements
  (A sketch of both quantities follows below.)
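A sketch of both quantities under the slide's definitions, with the usual 0.5 smoothing; estimating q_f as (e_f − r_f)/(E − R), i.e. the occurrence rate of f among elements not known to be relevant, is our assumption:

```python
import math

def rsj_weight(r_f, R, e_f, E):
    # Robertson-Sparck-Jones weight: r_f relevant results containing f,
    # R relevant results, e_f elements containing f, E all elements.
    return math.log((r_f + 0.5) * (E - e_f - R + r_f + 0.5) /
                    ((R - r_f + 0.5) * (e_f - r_f + 0.5)))

def robertson_selection_value(r_f, R, e_f, E):
    # RSV(f) = w_f * (p_f - q_f), with p_f estimated as r_f / R and
    # q_f as (e_f - r_f) / (E - R)  (an assumption on our part).
    p_f = r_f / R
    q_f = (e_f - r_f) / (E - R)
    return rsj_weight(r_f, R, e_f, E) * (p_f - q_f)

# Features would be ordered by RSV; those selected enter the query
# with weight rsj_weight(...).
```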

  7. Query Construction
  • Initial query: query evaluation, i.e. *[query evaluation]
  • Content features (C: XML) extend the keyword predicate: *[query evaluation XML]
  • Structural features add constraints: AD: article//author[Baeza], A: sec[data], D: p[XSLT]
  • The resulting structural query combines the target *[query evaluation XML] with these conditions (a sketch follows below)
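One way the expansion could be assembled into an XPath-like string; the surface syntax below is illustrative, not TopX's exact grammar:

```python
def expand_query(keywords, content=(), desc=(), anc=(), anc_desc=()):
    # C features widen the target's keyword predicate.
    target = "*[" + " ".join([*keywords, *content]) + "]"
    for d in desc:                    # D: p[XSLT] must occur below the result
        target += "[.//" + d + "]"
    for a in anc:                     # A: sec[data] is an ancestor
        target = a + "//" + target
    for ad in anc_desc:               # AD: article//author[Baeza] means the
        outer, inner = ad.split("//", 1)   # result lies under an article that
        target = outer + "[.//" + inner + "]//" + target  # also contains author[Baeza]
    return "//" + target

print(expand_query(["query", "evaluation"], content=["XML"],
                   desc=["p[XSLT]"], anc=["sec[data]"],
                   anc_desc=["article//author[Baeza]"]))
# //article[.//author[Baeza]]//sec[data]//*[query evaluation XML][.//p[XSLT]]
```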

  8. Architecture
  • The query goes to the TopX search engine; results are assessed against the INEX tools & assessments
  • Query + results feed the feedback component: candidate classes are produced by the C, D, A, and AD modules, then pass through weighting + selection
  • The expanded query is sent back to the engine

  9. Outline
  • Relevance Feedback
  • Structural Features for Feedback on XML
  • Evaluation
  • Conclusion and Future Work

  10. Evaluation Settings
  • INEX collection (IEEE-CS journal and conference articles): 12,107 XML docs with 12 million elements; queries with manual relevance assessments
  • 52 keyword queries from 2003 & 2004, run with our TopX search engine [VLDB05]
  • Baseline run with MAP ≈ 0.1, Precision@20 = 0.174
  • Automatic feedback for the top-k results, using the relevance assessments
  • Evaluation ignores the results used for feedback (and descendants of those results) [INEX 2004 RF Track Homepage]

  11. Experimental Results with TopX
  • All dimensions together perform best
  • Reasonable results in the INEX 2005 RF Track

  12. Current and Future Work
  • Consider other feedback dimensions
  • Relevance feedback for queries with structure
  • Active feedback: proactively ask the user for feedback on selected elements
  • Exploit correlation of expansion candidates
  • Integration with a graphical user interface
  • Evaluation of feedback algorithms (INEX 2006 Relevance Feedback Track):
    • Eliminate the effect of "training on the data"
    • Eliminate the influence of the search engine

  13. Conclusions
  • Structural feedback is an important step towards making structural XML IR work
  • Reasonable results even with a simple choice of expansion dimensions
  • Many open problems are left for future research

  14. Thank you!

  15. Data Structures
  • Example query: Professor[SB] Course[IR] Research[XML]
  1) Build index lists for each tag-term pair, grouped by document and sorted by the maximum score in the document
  • Block-fetch all elements for the same doc
  • Create and/or update candidates, including testing path conditions (PCs) in memory
  • Maintain the score and best possible score for each candidate; prune when possible (see the sketch below)
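A simplified, NRA-style sketch [Fagin01] of the scan-and-prune loop over score-sorted index lists; the per-document block fetching and in-memory path-condition tests are omitted, and the data layout is our assumption:

```python
from collections import defaultdict

def top_k(index_lists, k):
    # index_lists: one list per tag-term pair, each holding (docid, score)
    # pairs sorted by score descending.
    pos = [0] * len(index_lists)
    high = [lst[0][1] if lst else 0.0 for lst in index_lists]  # per-list bound
    seen = defaultdict(dict)          # docid -> {list index: score}
    pruned = set()

    def worst(d):                     # score from conditions seen so far
        return sum(seen[d].values())

    def best(d):                      # plus the current bound of unseen lists
        return worst(d) + sum(h for i, h in enumerate(high) if i not in seen[d])

    while any(p < len(lst) for p, lst in zip(pos, index_lists)):
        for i, lst in enumerate(index_lists):      # round-robin scan
            if pos[i] < len(lst):
                doc, score = lst[pos[i]]
                pos[i] += 1
                high[i] = score
                if doc not in pruned:
                    seen[doc][i] = score
        top = sorted(seen, key=worst, reverse=True)[:k]
        if len(top) == k:
            threshold = worst(top[-1])
            for d in list(seen):      # prune candidates that can no longer
                if d not in top and best(d) < threshold:   # reach the top-k
                    pruned.add(d)
                    del seen[d]
            if all(best(d) <= threshold for d in seen if d not in top):
                return [(d, worst(d)) for d in top]
    return sorted(((d, worst(d)) for d in seen), key=lambda x: -x[1])[:k]
```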

  16. Outline
  • XML Information Retrieval: Area Overview and Contributions
  • The TopX Search Engine
  • Structural Relevance Feedback

  17. TopX: Efficient XML IR
  • Goal: efficiently compute the best results of a similarity query
  • Query and scoring model for similarity queries
  • Extend top-k query processing algorithms for sorted lists [Buckley85, Güntzer et al. 00, Fagin01] to XML queries & data, including similarity queries
  • Exploit cheap disk space for highly redundant indexing

  18. TopX Data Model
  • Example document: <P>Gerhard Weikum <C>IR</C> SB <R>XML</R></P>
    • docid=1, pre=1, post=3, tag="P", content="Gerhard Weikum IR SB XML"
    • docid=1, pre=2, post=1, tag="C", content="IR"
    • docid=1, pre=3, post=2, tag="R", content="XML"
  • Pure tree model, ignoring links
  • Content of descendants is replicated; per-element term scores (using tf·idf scores or a variant of the Okapi BM25 model)
  • Pre/postorder labels reflect the element hierarchy [Grust02] (see the sketch below)
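A minimal sketch of the pre/postorder labeling on the slide's example document, using Python's ElementTree:

```python
import xml.etree.ElementTree as ET

def pre_post_labels(root):
    # One depth-first pass assigns both ranks; with these labels,
    # u is an ancestor of v iff pre[u] < pre[v] and post[u] > post[v]
    # (the encoding of [Grust02]).
    pre, post, counter = {}, {}, [0, 0]
    def visit(elem):
        counter[0] += 1
        pre[elem] = counter[0]
        for child in elem:
            visit(child)
        counter[1] += 1
        post[elem] = counter[1]
    visit(root)
    return pre, post

doc = ET.fromstring("<P>Gerhard Weikum <C>IR</C> SB <R>XML</R></P>")
pre, post = pre_post_labels(doc)
# P: pre=1, post=3;  C: pre=2, post=1;  R: pre=3, post=2 -- as on the slide
```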

  19. Queries
  • Query = tree/graph pattern with
    • mandatory/optional content conditions (CC)
    • mandatory path conditions (PC)
    • a mandatory target element
  • Formulated in an XPath-like language, e.g. Professor[SB] Course[IR] Research[XML]
  • Special case: keyword query *[IR XML database]

  20. Query Scores for Content Conditions
  • Basic scoring idea within the IR-style family of TF·IDF ranking functions, computed with element statistics
  • Content-based scores cast into an Okapi-BM25 probabilistic model with element-specific parameterization:
    score(c_i, e) = ((k1 + 1) · tf(c_i, e)) / (K + tf(c_i, e)) · log((N_T − ef_T(c_i) + 0.5) / (ef_T(c_i) + 0.5))
    where tf(c_i, e) is the number of occurrences of c_i in element e, N_T the number of elements with tag T, and ef_T(c_i) the number of elements with tag T that contain c_i
  (A sketch follows below.)
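The standard Okapi BM25 form instantiated with these element statistics; the element-length normalization K and the parameters k1, b are our assumptions, as TopX's exact parameterization is not given on the slide:

```python
import math

def content_score(tf, len_e, avg_len_T, N_T, ef, k1=1.2, b=0.75):
    # tf = tf(c_i, e); N_T elements with tag T; ef = ef_T(c_i);
    # len_e / avg_len_T give the element-length normalization.
    K = k1 * ((1 - b) + b * len_e / avg_len_T)
    idf = math.log((N_T - ef + 0.5) / (ef + 0.5))
    return (k1 + 1) * tf / (K + tf) * idf
```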

  21. Query Scores
  • Example query: Professor[SB] Course[IR] Research[XML]
  • Candidate = connected sub-pattern with element ids and scores
  • Result = candidate with scores for all mandatory conditions and the target element
  • Content-based score of a result with elements e1, …, em for query q with CC T1[c1], …, Tm[cm] (some ei may be empty):
    s(e1, …, em) = Σ_{i=1..m} score(c_i, e_i)
  • Additional extensions for path conditions

  22. Dimensions for Structural Expansion
  • The user marks a relevant result element (a sec inside an article, as before, with a citation "Serge Abiteboul" and an author "Baeza-Yates" elsewhere in the document)
  • Possible dimensions:
    • C: content of the result, e.g. XML
    • P: path to the result, e.g. article/body/sec/subsec
    • D: tag+content of other elements in the document, e.g. //author[Baeza], //citation[Abiteboul]

  23. Weights for Content and Doc Dimensions
  • Compute Rocchio weights [1971] for each feature (also used as the feature's weight in the query):
    w_f = r_f / R − n_f / N
    where r_f is the number of relevant results with f, R the number of relevant results, n_f the number of nonrelevant results with f, and N the number of nonrelevant results
  • Alternatively: consider accumulated score mass instead of r_f, n_f
  • Order features by weight; break ties with the mutual information of the score and relevance distributions
  • Select the top-N_C content features and the top-N_D document features (see the sketch below)
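A sketch of the weighting and selection step under the slide's definitions; the binary-feature Rocchio form and the omission of the mutual-information tie-break and the score-mass variant are our simplifications:

```python
def rocchio_weight(r_f, R, n_f, N):
    # Fraction of relevant results containing f minus the fraction of
    # nonrelevant results containing f.
    return r_f / R - n_f / N

def select_top_features(stats, R, N, top_n):
    # stats: {feature: (r_f, n_f)}. Order by Rocchio weight and keep
    # the top_n features.
    w = {f: rocchio_weight(r_f, R, n_f, N) for f, (r_f, n_f) in stats.items()}
    return sorted(w, key=w.get, reverse=True)[:top_n]
```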

  24. Path-based Constraints
  • Tag names alone cannot enhance retrieval quality; complete paths are too strict
  • Use path fragments, e.g. for article/body/section/subsection:
    • Prefixes: article/#, article/body/#
    • Infixes: #/section/#
    • Subpaths: #/body/section/#
    • Paths with wildcards: article/#/section/#
    • Suffixes: #/subsection
    • Full paths
  • Weights based on Rocchio
  (A fragment generator is sketched below.)
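A small generator for these fragment classes; wildcard paths such as article/#/section/# are left out of the sketch:

```python
def path_fragments(path):
    # '#' marks an elided part of the path, as on the slide.
    tags = path.split("/")
    n = len(tags)
    frags = {path}                                               # full path
    frags |= {"/".join(tags[:i]) + "/#" for i in range(1, n)}    # prefixes
    frags |= {"#/" + "/".join(tags[i:]) for i in range(1, n)}    # suffixes
    frags |= {"#/" + "/".join(tags[i:j]) + "/#"                  # infixes and
              for i in range(1, n) for j in range(i + 1, n)}     # subpaths
    return frags

print(sorted(path_fragments("article/body/section/subsection")))
# includes article/#, article/body/#, #/section/#, #/body/section/#, #/subsection
```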

  25. Evaluation of Expanded Queries
  Three options:
  • Engine-based: generate an expanded query with structural constraints and submit it to a structural query engine
  • Reranking: rerank a (large) existing set of results
  • Hybrid: evaluate some of the new conditions with an engine, then rerank the resulting set of results

  26. Generating Expanded Queries
  • Initial query: query evaluation, i.e. *[query evaluation]
  • The content dimension (C: XML) extends the keyword predicate: *[query evaluation XML]
  • The document dimension (D: //author[Baeza], //citation[Abiteboul]) adds conditions along the descendant-or-self axis
  • The path dimension is handled differently

  27. Reranking Query Results
  • Basic approach:
    • Consider the set E of results (|E| ≈ 1000) for the initial keyword query, with scores s(e)
    • For each element e, compute a score w_d(e) in each dimension d: compute all features of e in dimension d, then let w_d(e) be the cosine of e's feature vector and the selected query features for dimension d
    • Normalize all scores to [−1, 1] and add the partial scores
    • Sort E by the combined score (see the sketch below)
  • Hybrid evaluation: evaluate some dimensions (like content) with the engine, the others by reranking
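A sketch of the reranking step; shifting the [0, 1] binary-feature cosine to [−1, 1] is one reading of the slide's normalization, and the per-dimension weights are our assumption:

```python
import math

def rerank(results, query_features, feature_fns, dim_weights=None):
    # results: [(element, s(e))] from the initial keyword query.
    # feature_fns: {dimension: function mapping an element to its
    # feature set in that dimension};
    # query_features: {dimension: selected query features}.
    dim_weights = dim_weights or {d: 1.0 for d in feature_fns}
    reranked = []
    for e, s in results:
        combined = s
        for d, fn in feature_fns.items():
            f, q = set(fn(e)), set(query_features[d])
            cos = (len(f & q) / math.sqrt(len(f) * len(q))
                   if f and q else 0.0)
            combined += dim_weights[d] * (2 * cos - 1)  # map [0,1] -> [-1,1]
        reranked.append((e, combined))
    return sorted(reranked, key=lambda x: -x[1])
```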

  28. Architecture
  • The query goes to the XML search engine; query + results feed the feedback dimensions (content module, path module, doc module, …)
  • The expanded query is evaluated by the engine; its results go through scoring + reranking, which produces the reranked results

  29. Experimental Results with TopX: Paths
  • Position 1 (of 15) in the INEX 2005 Relevance Feedback Track
