
Evaluation of XML Information Retrieval Systems


Presentation Transcript


  1. Evaluation of XML Information Retrieval Systems Arjen P. de Vries, CWI, Amsterdam arjen@acm.org (With significant input from Mounia Lalmas and Gabriella Kazai, QMUL)

  2. XML Retrieval: Aim (Diagram: the layout structure of a document, from Book to Chapters, Sections and Subsections, next to the World Wide Web) • Traditional information retrieval is about finding relevant documents, e.g. an entire book. • XML retrieval allows users to retrieve document components that are more focussed, e.g. a subsection of a book instead of an entire book. SEARCHING = QUERYING + BROWSING

  3. XML Retrieval: Querying • With respect to content • Standard queries but retrieving XML elements • “London tube strikes” • With respect to content and structure • Constraints on the types of XML elements to be retrieved • E.g. “Sections of an article in the Times about congestion charges” • E.g. “Articles that contain sections about congestion charges in London, and that contain a picture of Ken Livingstone”

  4. XML Retrieval: Aim Return document components (XML elements) of varying granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc.) relevant to the user’s information need, with regard to both content and structure criteria.

  5. XML Retrieval: Challenges (Diagram: an Article element with a Title and nested Section elements, each annotated with term weights for “XML”, “evaluation” and “retrieval”) No fixed retrieval unit + nested elements + element types: • how to obtain element, document and collection statistics? • which element is a good retrieval unit? • which elements contribute best to the content of the Article? • how to estimate? • how to aggregate? • …

  6. Evaluation: IR Laboratory Experiment • Given: • Collection of documents • Collection of topics • Relevance assessments for each topic-document combination: the ‘recall base’ • Or, often, only an estimate of the recall base (pooling method) • Evaluate systems: • A metric computed on the system’s result list (often Mean Average Precision) captures expected user behaviour
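
To make the standard metric concrete, a minimal Python sketch of Mean Average Precision under binary relevance; the topic and document identifiers are hypothetical and the code only illustrates the usual textbook definition, not anything INEX-specific.

    # Minimal sketch: Mean Average Precision over a set of topics,
    # assuming binary relevance and a ranked result list per topic.

    def average_precision(ranked_ids, relevant_ids):
        """Average precision for one topic: mean of precision@k over the
        ranks k at which a relevant item is retrieved."""
        hits, precision_sum = 0, 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant_ids) if relevant_ids else 0.0

    def mean_average_precision(runs, qrels):
        """runs: {topic: [doc ids in rank order]}; qrels: {topic: set of relevant ids}."""
        return sum(average_precision(runs.get(t, []), rel)
                   for t, rel in qrels.items()) / len(qrels)

    # Hypothetical toy example
    runs = {"topic1": ["d4", "d1", "d7", "d2"]}
    qrels = {"topic1": {"d1", "d2"}}
    print(mean_average_precision(runs, qrels))  # 0.5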

  7. INEX: INitiative for the Evaluation of XML Retrieval • Started in 2002 and runs each year from April to December, ending with a workshop at Schloss Dagstuhl, Germany • Funded by DELOS, the EU network of excellence in digital libraries • Documents (~500MB): 12,107 articles in XML format from the IEEE Computer Society; 8 million elements! • INEX 2002: 60 topics, inex_eval metric • INEX 2003: 66 topics, a subset of XPath, inex_eval and inex_eval_ng metrics • INEX 2004: 75 topics, a subset of the 2003 XPath subset (NEXI); official: inex_eval averaged over different ‘assumed user behaviours’; others: inex_eval_ng, XCG, t2i, ERR, …

  8. Ad hoc retrieval • Content-only task (CO): aim is to decrease user effort by pointing the user to the most specific relevant elements (2002, 2003, 2004) • No structural hints: the XML engine has to identify the most appropriate level of granularity

  9. Example Topic (CO) <title>Open standards for digital video in distance learning</title> <description>Open technologies behind media streaming in distance learning projects</description> <narrative> I am looking for articles/components discussing methodologies of digital video production and distribution that respect free access to media content through internet or via CD-ROMs or DVDs in connection to the learning process. Discussions of open versus proprietary standards of storing and sending digital video will be appreciated. </narrative> <keywords>media streaming,video streaming,audio streaming, digital video,distance learning,open standards,free access</keywords>

  10. Relevance (Diagram: an article on XML retrieval evaluation with sections s1, s2, s3 and subsections ss1, ss2, labelled “XML retrieval” and “XML evaluation”) • ‘Traditional’ notion of relevance: • A document is relevant if ‘it has significant and demonstrable bearing on the matter at hand’. • Common assumptions in laboratory experimentation: • Objectivity • Relevance captured as topicality • Binary nature of relevance • Independence between items

  11. Relevance in XML retrieval (Diagram: the same article structure as before) • Topicality not enough • Binary nature not enough • Independence is wrong

  12. Relevance in INEX (2003-) • Two relevance dimensions: • Exhaustivity (E): how exhaustively a component discusses the topic of request • Specificity (S): how focused the component is on the topic of request (i.e. discusses no other, irrelevant topics) • Multiple grades: • highly (3), • fairly (2), • marginally (1), • not (0) exhaustive/specific (Diagram: example assessments such as (3,1), (3,2), (1,3), (2,3)) (based on Chiaramella et al., FERMI fetch-and-browse model, 1996)

  13. How to score system results? • How to distinguish between (1,3) and (3,3), …, when evaluating retrieval results? • The relative merits of each answer depend on the user model!!! • Impatient: only reward retrieval of highly exhaustive and specific elements (3,3) • Relatively patient: only reward retrieval of highly specific elements (3,3), (2,3), (1,3) • … • Very patient: reward, to a different extent, the retrieval of any relevant element, i.e. everything apart from (0,0) • Use a ‘quantisation function’ to capture aspects of the user model

  14. Examples of quantisation functions (Slide shows two example functions: a strict one for the impatient user and a graded one for the very patient user)
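
A small Python sketch of what such quantisation functions look like; the strict function follows the ‘impatient’ description above, while the graded weights are illustrative values assumed here, not the official INEX table.

    # Sketch of two quantisation functions mapping (exhaustivity, specificity)
    # pairs to a single relevance value. The strict function matches the
    # "impatient" user of the slides; the graded weights below are illustrative
    # approximations of a "very patient" (generalised) quantisation.

    def q_strict(e, s):
        """Reward only highly exhaustive AND highly specific elements."""
        return 1.0 if (e, s) == (3, 3) else 0.0

    GENERALISED = {  # illustrative weights, not the official INEX table
        (3, 3): 1.00,
        (2, 3): 0.75, (3, 2): 0.75, (3, 1): 0.75,
        (1, 3): 0.50, (2, 2): 0.50, (2, 1): 0.50,
        (1, 2): 0.25, (1, 1): 0.25,
    }

    def q_generalised(e, s):
        """Reward, to a different extent, any element except (0,0)."""
        return GENERALISED.get((e, s), 0.0)

    print(q_strict(3, 1), q_generalised(3, 1))  # 0.0 0.75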

  15. Overlap in XML Retrieval (Diagram: assessments and a ranked result list over an article tree with title, author, sec, subsec and p elements, where retrieved and assessed elements nest inside one another) • Overlapping (nested) result elements in the output list • Overlapping (nested) reference elements in the recall-base

  16. Overlap in retrieval results: Official INEX 2004 results for CO topics

      Rank  System (run)                                           Avg Prec  % Overlap
      1.    IBM Haifa Research Lab (CO-0.5-LAREFIENMENT)           0.1437    80.89
      2.    IBM Haifa Research Lab (CO-0.5)                        0.1340    81.46
      3.    University of Waterloo (Waterloo-Baseline)             0.1267    76.32
      4.    University of Amsterdam (UAms-CO-T-FBack)              0.1174    81.85
      5.    University of Waterloo (Waterloo-Expanded)             0.1173    75.62
      6.    Queensland University of Technology (CO_PS_Stop50K)    0.1073    75.89
      7.    Queensland University of Technology (CO_PS_099_049)    0.1072    76.81
      8.    IBM Haifa Research Lab (CO-0.5-Clustering)             0.1043    81.10
      9.    University of Amsterdam (UAms-CO-T)                    0.1030    71.96
      10.   LIP6 (simple)                                          0.0921    64.29

      Not very focussed retrieval!
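
For intuition, a Python sketch of how such an overlap percentage could be computed, assuming results are identified by XPath-like paths and that two results overlap when one is an ancestor of the other; this illustrates the idea on the slide rather than the exact INEX formula.

    # Sketch: percentage of retrieved elements that overlap with another
    # result, assuming an element contains another when its path is a
    # prefix of the other's path.

    def overlaps(path_a, path_b):
        """True if one element contains the other (one path prefixes the other)."""
        return path_a != path_b and (
            path_b.startswith(path_a + "/") or path_a.startswith(path_b + "/"))

    def overlap_percentage(result_paths):
        """Share of retrieved elements that overlap with some other result."""
        overlapping = sum(
            1 for p in result_paths
            if any(overlaps(p, q) for q in result_paths))
        return 100.0 * overlapping / len(result_paths) if result_paths else 0.0

    run = ["/article[1]", "/article[1]/sec[2]", "/article[1]/sec[2]/p[1]",
           "/article[2]/sec[1]"]
    print(overlap_percentage(run))  # 75.0 -- three of the four results nest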

  17. Exhaustivity propagates up! • ~26,000 relevant elements on ~14,000 relevant paths • Propagated assessments: ~45% • Increase in size of recall-base: ~182%

  18. ‘Overpopulated Recall-base’ • The recall-base contains more reference items than an ideal system should in fact retrieve • Caused by the cumulative nature of exhaustivity • But… while systems could be rewarded a partial score for retrieving a near miss, AN ‘IDEAL’ SYSTEM SHOULD NOT BE PENALISED FOR NOT RETRIEVING SUCH NEAR MISSES!!!

  19. Effect of Overpopulated Recall-base • 100% recall is reached only if all relevant elements are returned (i.e., returning overlapping elements) • Precision is plotted against lower recall values than merited according to the task definition!

  20. inex_eval • Based on precall (see Raghavan et al., TOIS 1989), a variant of Mean Average Precision; also related to expected search length (see Cooper, JASIS 1968) • User model captured in a quantisation function • Overall performance as a simple average across quantisation functions (INEX 2004) • Does not consider overlap in retrieval results • Although INEX 2004 reported an ‘overlap indicator’ • Does not consider overlap in the recall-base

  21. inex_eval_ng • Exhaustivity and specificity related to the notion of an ideal concept space (see Wong & Yao, TOIS 1995) upon which precision and recall are defined • Considers size of retrieved elements • Considers overlap in retrieval results • Does not consider overlap in recall-base

  22. Does it matter much? • TIJAH: • Rank XML document elements by their score • UvA: • Rank XML document elements by a combination of element score and containing article score • MAP_UvA >> MAP_TIJAH • So, the experiment shows (convincingly) that the UvA method is better than the TIJAH method! And yet, when TIJAH also implements article weighting, MAP_UvA = MAP_TIJAH

  23. Does it matter much? Improved MAP might reflect undesirable results (from a user perspective)!!! • MAP_UvA >> MAP_TIJAH • So, the experiment shows (convincingly) that the UvA method is better than the TIJAH method!

  24. XCG Metrics (new) • Not directly dependent on size of (complete) recall-base • Explicit separation of ideal results vs. near misses • Extended Cumulated Gain (CG, see Kekäläinen and Järvelin, TOIS 2002) based metrics • User-model captured in relevance-value function • Ideal Recall-base

  25. Ideal Recall-base and Run • Ideal recall-base: • Ideal results should be retrieved; near misses could be retrieved, but a system should not be penalised for not retrieving them • Derived based on user preferences • Ideal run: • Ordering the elements of the ideal recall-base by relevance score (Diagram: example assessments such as (3,1), (3,2), (3,3), (1,2), (1,3))
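
A Python sketch of one plausible way to derive an ideal recall-base from the full set of assessments, assuming elements are identified by paths and that, among nested relevant elements, only the one preferred by the user model (highest quantised value, deeper element on ties) is kept; the helper names, the example quantisation and the tie-breaking rule are assumptions for illustration, not the exact INEX procedure.

    # Sketch of deriving an "ideal" recall-base from a full recall-base,
    # assuming each assessed element is a path with an (e, s) pair and that,
    # among nested relevant elements, only the one preferred by the user
    # model (highest quantised value, deeper element on ties) is kept.

    def contains(ancestor, descendant):
        return descendant.startswith(ancestor + "/")

    def ideal_recall_base(assessments, quantise):
        """assessments: {path: (e, s)}; quantise: function (e, s) -> value."""
        ideal = {}
        for path, (e, s) in assessments.items():
            if quantise(e, s) == 0.0:
                continue  # not relevant under this user model
            # drop this element if a nested relative is strictly preferred,
            # or equally good but more specific (deeper in the tree)
            dominated = any(
                (contains(other, path) or contains(path, other)) and
                (quantise(*es) > quantise(e, s) or
                 (quantise(*es) == quantise(e, s) and contains(path, other)))
                for other, es in assessments.items() if other != path)
            if not dominated:
                ideal[path] = (e, s)
        return ideal

    full = {"/article[1]": (3, 1),
            "/article[1]/sec[1]": (3, 2),
            "/article[1]/sec[1]/p[2]": (2, 3)}
    # illustrative quantisation: e * s / 9
    print(ideal_recall_base(full, lambda e, s: e * s / 9.0))
    # keeps only the preferred element on this relevant path: the p[2] element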

  26. Relevance-Value (RV) Functions • Model user behaviour • Result-list independent: • Based only on (e,s) value pairs (~quantisation functions) • Result-list dependent: • Considers overlap of result elements (~INEX 2003); the function takes the ranked result list as input and includes a parameter reflecting the user’s tolerance to redundant component parts

  27. Cumulated Gain • Gain vector (G) from the ranked document list • Ideal gain vector (I) from the documents in the recall-base • Cumulated gain (CG): the running sum of the gain vector • Plot CG_G of the actual run against CG_I of the ideal ranking • nCG_G[i] = CG_G[i] / CG_I[i] • Example: L = <d4,d5,d2,d3,d1>, G = <3,0,1,3,2>, I = <3,3,2,1,0>, CG_G = <3,3,4,7,9>, CG_I = <3,6,8,9,9>
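
The same example in a short Python sketch, computing cumulated gain and its normalised variant; the function name is mine, the numbers are the ones from the slide.

    # Sketch of (normalised) cumulated gain, following the example on the
    # slide: gains are accumulated along the ranking and compared against
    # the cumulated gains of an ideal ordering of the recall-base.

    def cumulated_gain(gains):
        """Running sum of the gain vector: CG[i] = G[1] + ... + G[i]."""
        out, total = [], 0
        for g in gains:
            total += g
            out.append(total)
        return out

    G = [3, 0, 1, 3, 2]          # gains of the ranked list L = <d4,d5,d2,d3,d1>
    I = [3, 3, 2, 1, 0]          # ideal gain vector (recall-base sorted by gain)

    CG_G = cumulated_gain(G)     # [3, 3, 4, 7, 9]
    CG_I = cumulated_gain(I)     # [3, 6, 8, 9, 9]
    nCG = [g / i for g, i in zip(CG_G, CG_I)]   # normalised CG at each rank
    print(CG_G, CG_I, [round(x, 2) for x in nCG])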

  28. Cumulated Gain for XML (Diagram: recall-base and ranked result list) • Ideal gain vector: I[i] = r(c_i), with r(c_i) taken from the ideal recall-base • Actual gain vector: G[i] = r(c_i), with r(c_i) taken from the full recall-base!

  29. Cumulated Gain for XML • Multiple relevance dimensions • The result-list dependent RV function handles the overlap of result elements • I derived from the ideal recall-base handles the overlap of reference elements • Retrieval of ideal results is rewarded, near misses can be rewarded a partial score, but systems are not penalised for not retrieving near misses!

  30. Cumulated Gain for XML (Diagram: example with assessments (3,1) and (3,3)) • However, the ideal recall-base has consequences for CG: • |I| < |G|, e.g. G = <1, 0.75, …> against I = <1>; so extend the ideal gain vector with irrelevant elements: I = <1, 0, …> • Max(CG_G) > Max(CG_I); so force CG_G to level off after reaching Max(CG_I)
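
A minimal Python sketch of these two adjustments, reusing the toy vectors from the slide; the padding and capping strategy follows the slide, while the helper names are mine.

    # Sketch of the two adjustments: pad the ideal gain vector with zeros up
    # to the run length, and cap the run's cumulated gain at the maximum
    # cumulated gain of the ideal ranking.

    def pad_ideal(I, run_length):
        """Extend the ideal gain vector with irrelevant (zero-gain) elements."""
        return I + [0.0] * (run_length - len(I))

    def capped_cumulated_gain(G, max_ideal):
        """Cumulated gain of the run, forced to level off at Max(CG_I)."""
        out, total = [], 0.0
        for g in G:
            total = min(total + g, max_ideal)
            out.append(total)
        return out

    G = [1.0, 0.75]              # run retrieves an ideal element and a near miss
    I = pad_ideal([1.0], len(G)) # ideal recall-base holds a single element
    print(I)                                 # [1.0, 0.0]
    print(capped_cumulated_gain(G, sum(I)))  # [1.0, 1.0] -- levels off at 1.0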

  31. XCG Summary • Unsolved issues with recall/precision due to overlap of reference elements in the recall-base • XML-CG with an ideal recall-base provides a solution for overlap of result and reference elements • Still possible to reward partial success without the side-effect • “Plug-in” user models: the RV function is used as a parameter of the metric • Limitation: Max(CG_G) = Max(CG_I)

  32. XCG: Top 10 INEX 2004 runs (Chart: XCG scores of the top 10 INEX 2004 runs, ranked by inex_eval)

  33. Ad hoc retrieval: Tasks • Content-only (CO): aim is to decrease user effort by pointing the user to the most specific relevant elements (2002, 2003, 2004) • Strict content-and-structure (SCAS): retrieve relevant elements that exactly match the structure specified in the query (2002 - CAS, 2003) • Vague content-and-structure (VCAS): • retrieve relevant elements that may not be the same as the target elements, but are structurally similar (2003) • retrieve relevant elements even if they do not exactly meet the structural conditions; treat the structure specification as hints as to where to look (2004) • Note: INEX 2005 calls this the CO+S task

  34. INEX Content-and-structure Topic: <title>//article[about(.,'formal methods verify correctness aviation systems')]//sec//* [about(.,'case study application model checking theorem proving')]</title> <description>Find documents discussing formal methods to verify correctness of aviation systems. From those articles extract parts discussing a case study of using model checking or theorem proving for the verification. </description> <narrative>To be considered relevant a document must be about using formal methods to verify correctness of aviation systems, such as flight traffic control systems, airplane or helicopter parts. From those documents a section-part must be returned (I do not want the whole section, I want something smaller). That part should be about a case study of applying a model checker or a theorem prover to the verification. </narrative> <keywords>SPIN, SMV, PVS, SPARK, CWB</keywords>

  35. Content-and-structure topics: Restrictions • Returning “attribute” type elements (e.g. author, date) is not allowed: “return authors of articles containing sections on the evaluation of XML retrieval systems” • The aboutness criterion must be specified - at least - in the target elements: “return all paragraphs contained in sections that discuss the evaluation of XML retrieval systems” • Branches are not allowed: “return sections about evaluation of XML retrieval systems that are contained in articles that contain paragraphs about the overlap problem” Are we imposing too many restrictions?

  36. Assessments: some results • Elements assessed in INEX 2003: 26% of assessments on elements in the pool (66% in INEX 2002); 68% of highly specific elements not in the pool • INEX 2002: 23 inconsistent assessments per topic for one rule • Agreement in INEX 2004: 12.19% non-zero agreements (22% at article level); 3.42% exact agreements (7% at article level); higher agreement for CAS topics (see Piwowarski & Lalmas, CORIA 2004; Kazai et al., ECIR 2004)

  37. Conclusion and future work • Difficult research issues in XML retrieval are not ‘just’ about the effective retrieval of XML documents, but also about what and how to evaluate! • The ‘Cranfield tradition’ (and therefore the TREC evaluation setup) does not apply directly! • Some (yet to be finalised) plans for INEX 2005 • More explicit statement of the targeted user task? • Participants could submit different runs for different quantisations • Reduce number of grades to simplify assessments? • Investigate consequences of ‘powerful’ NEXI queries on evaluation • E.g., //article[about(., XML evaluation) and about(.//sec, overlap)]
