Lecture 21: XML Retrieval

Presentation Transcript


  1. Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00 pm Spring 2007 http://courses.ischool.berkeley.edu/i240/s07 Lecture 21: XML Retrieval Principles of Information Retrieval

  2. Mini-TREC • Proposed Schedule • February 15 – Database and previous Queries • February 27 – report on system acquisition and setup • March 8, New Queries for testing… • April 19, Results due (Next Thursday) • April 24 or 26, Results and system rankings • May 8 Group reports and discussion

  3. Announcement • No Class on Tuesday (April 17th)

  4. Today • Review • Geographic Information Retrieval • GIR Algorithms and evaluation based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K. • XML and Structured Element Retrieval • INEX • Approaches to XML retrieval Credit for some of the slides in this lecture goes to Marti Hearst

  5. Today • Review • Geographic Information Retrieval • GIR Algorithms and evaluation based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K. • Web Crawling and Search Issues • Web Crawling • Web Search Engines and Algorithms Credit for some of the slides in this lecture goes to Marti Hearst

  6. Introduction • What is Geographic Information Retrieval? • GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research with the addition of spatially and geographically oriented indexing and retrieval. • It combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research.

  7. Example: Results display from CheshireGeo: http://calsip.regis.berkeley.edu/pattyf/mapserver/cheshire2/cheshire_init.html

  8. Other convex, conservative approximations (after Brinkhoff et al., 1993b), presented in order of increasing quality; the number in parentheses is the number of parameters needed to store the representation: 1) Minimum bounding circle (3) 2) MBR: minimum axis-aligned bounding rectangle (4) 3) Minimum bounding ellipse (5) 4) Rotated minimum bounding rectangle (5) 5) 4-corner convex polygon (8) 6) Convex hull (varies)
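To make the cheapest of these concrete, here is a minimal sketch (not from the original slides) of computing the minimum axis-aligned bounding rectangle from a polygon's vertex list; the four stored parameters are the two corner coordinates:

# Hedged illustration, not from the lecture: the MBR of a polygon is just
# the extremes of its vertex coordinates, stored as four parameters.
def mbr(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

# Example: a triangular region
print(mbr([(0.0, 0.0), (4.0, 1.0), (2.0, 3.0)]))  # -> (0.0, 0.0, 4.0, 3.0)

The higher quality approximations on the slide trade more stored parameters (or, for the convex hull, a variable number) for a tighter fit around the geometry.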

  9. Our Research Questions • Spatial Ranking • How effectively can the spatial similarity between a query region and a document region be evaluated and ranked based on the overlap of the geometric approximations for these regions? • Geometric Approximations & Spatial Ranking: • How do different geometric approximations affect the rankings? • MBRs: the most popular approximation • Convex hulls: the highest quality convex approximation

  10. Spatial Ranking: Methods for computing spatial similarity

  11. Probabilistic Models: Logistic Regression attributes • X1 = area of overlap(query region, candidate GIO) / area of query region • X2 = area of overlap(query region, candidate GIO) / area of candidate GIO • X3 = 1 – abs(fraction of overlap region that is onshore – fraction of candidate GIO that is onshore) • Where: Range for all variables is 0 (not similar) to 1 (same)
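A minimal sketch (not the authors' code) of the first two attributes, computed here for MBR approximations represented as (min_x, min_y, max_x, max_y) tuples; X3 additionally needs the onshore fractions of the two approximations, which come from coastline data and are omitted here:

# Hedged sketch of the overlap-based LR attributes X1 and X2 for MBRs.
def rect_area(r):
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def rect_overlap_area(a, b):
    """Area of intersection of two axis-aligned rectangles (0 if disjoint)."""
    left, bottom = max(a[0], b[0]), max(a[1], b[1])
    right, top = min(a[2], b[2]), min(a[3], b[3])
    if right <= left or top <= bottom:
        return 0.0
    return (right - left) * (top - bottom)

def overlap_features(query_mbr, gio_mbr):
    """X1 = overlap / query area, X2 = overlap / candidate GIO area."""
    overlap = rect_overlap_area(query_mbr, gio_mbr)
    return overlap / rect_area(query_mbr), overlap / rect_area(gio_mbr)

Both values fall between 0 (disjoint regions) and 1 (identical regions), matching the range noted on the slide.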

  12. CA Named Places in the Test Collection – complex polygons: Counties, Cities, Bioregions, National Parks, National Forests, Water QCB Regions

  13. CA Counties – Geometric Approximations (MBRs and Convex Hulls) • Avg. False Area of Approximation: MBRs: 94.61%, Convex Hulls: 26.73%

  14. CA User Defined Areas (UDAs) in the Test Collection

  15. Test Collection Query Regions: CA Counties • 42 of 58 counties referenced in the test collection metadata • 10 counties randomly selected as query regions to train the LR model • 32 counties used as query regions to test the model

  16. LR model • X1 = area of overlap(query region, candidate GIO) / area of query region • X2 = area of overlap(query region, candidate GIO) / area of candidate GIO • Where: Range for all variables is 0 (not similar) to 1 (same)

  17. Some of our Results (Mean Average Query Precision: the average of the precision values observed after each new relevant document in a ranked list; results shown for metadata indexed by CA named place regions and for all metadata in the test collection) • These results suggest: • Convex hulls perform better than MBRs • An expected result, given that the convex hull is a higher quality approximation • A probabilistic ranking based on MBRs can perform as well as, if not better than, a non-probabilistic ranking method based on convex hulls • Interesting: since any approximation other than the MBR comes at greater expense, this suggests that exploring new ranking methods based on the MBR is a good way to go

  18. Some of our Results (continued; results shown for metadata indexed by CA named place regions and for all metadata in the test collection) • BUT: the inclusion of UDA-indexed metadata reduces precision. This is because coarse approximations of onshore or coastal geographic regions will necessarily include much irrelevant offshore area, and vice versa.
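A short sketch (not from the slides) of the evaluation measure as defined above; note that the standard mean average precision divides by the total number of relevant documents for the query, while this version averages only over the relevant documents actually observed in the ranked list:

# Hedged sketch of (mean) average query precision as described above.
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of the precision values seen at each relevant document."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """runs: a list of (ranked_doc_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)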

  19. Shorefactor Model • X1 = area of overlap(query region, candidate GIO) / area of query region • X2 = area of overlap(query region, candidate GIO) / area of candidate GIO • X3 = 1 – abs(fraction of query region approximation that is onshore – fraction of candidate GIO approximation that is onshore) • Where: Range for all variables is 0 (not similar) to 1 (same)

  20. Some of our Results, with Shorefactor (Mean Average Query Precision, for all metadata in the test collection). These results suggest: • Addition of the Shorefactor variable improves the model (LR 2), especially for MBRs • The improvement is not as dramatic for convex hull approximations, because the problem that Shorefactor addresses is not as significant when areas are represented by convex hulls.

  21. Results for All Data - MBRs (precision-recall graph)

  22. Results for All Data - Convex Hull (precision-recall graph)

  23. XML Retrieval • The following slides are adapted from presentations at INEX 2003-2005 and at the INEX Element Retrieval Workshop in Glasgow 2005, with some new additions for general context, etc.

  24. INEX Organization Organized By: • University of Duisburg-Essen, Germany • Norbert Fuhr, Saadia Malik, and others • Queen Mary University of London, UK • Mounia Lalmas, Gabriella Kazai, and others • Supported By: • DELOS Network of Excellence in Digital Libraries (EU) • IEEE Computer Society • University of Duisburg-Essen

  25. XML Retrieval Issues • Using Structure? • Specification of Queries • How to evaluate?

  26. Cheshire SGML/XML Support • Underlying native format for all data is SGML or XML • The DTD defines the database contents • Full SGML/XML parsing • SGML/XML Format Configuration Files define the database location and indexes • Various format conversions and utilities available for Z39.50 support (MARC, GRS-1)

  27. SGML/XML Support • Configuration files for the Server are SGML/XML: • They include elements describing all of the data files and indexes for the database. • They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.

  28. Indexing • Any SGML/XML tagged field or attribute can be indexed: • B-Tree and Hash access via Berkeley DB (Sleepycat) • Stemming, keyword, exact keys and “special keys” • Mapping from any Z39.50 Attribute combination to a specific index • Underlying postings information includes term frequency for probabilistic searching • Component extraction with separate component indexes

  29. XML Element Extraction • A new search “ElementSetName” is XML_ELEMENT_ • Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request • The matching elements are extracted from the records matching the search and delivered in a simple format.
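As a rough illustration of the idea (this is not Cheshire code), the following sketch pulls matching elements out of a stored record given the specifier that follows XML_ELEMENT_, here treated simply as an element name:

# Hedged sketch: extract matching elements from an XML record, analogous
# to what the server does for an XML_ELEMENT_<spec> present request.
import xml.etree.ElementTree as ET

record = """<USMARC><VarFlds><VarDFlds><Titles>
              <Fld245><a>Example title</a></Fld245>
            </Titles></VarDFlds></VarFlds></USMARC>"""

def extract_elements(xml_text, element_name):
    """Return the serialized sub-elements whose tag matches element_name."""
    root = ET.fromstring(xml_text)
    return [ET.tostring(e, encoding="unicode") for e in root.iter(element_name)]

print(extract_elements(record, "Fld245"))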

  30. XML Extraction
  % zselect sherlock 372
  {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection #372}
  % zfind topic mathematics
  {OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
  % zset recsyntax XML
  % zset elementset XML_ELEMENT_Fld245
  % zdisplay
  {OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11} {RecordSyntax XML 1.2.840.10003.5.109.10}}
  { <RESULT_DATA DOCID="1"> <ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]"> <Fld245 AddEnty="No" NFChars="0"><a>Singularités à Cargèse</a></Fld245> </ITEM> </RESULT_DATA> … etc. …

  31. TREC3 Logistic Regression • Probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained from the six X attribute measures shown on the next slide.
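The estimation formula on this slide was a graphic and did not survive the transcript; a hedged reconstruction, consistent with the usual Berkeley logistic regression formulation, is:

\log O(R \mid Q, D) = b_0 + \sum_{i=1}^{6} b_i X_i,
\qquad
P(R \mid Q, D) = \frac{e^{\log O(R \mid Q, D)}}{1 + e^{\log O(R \mid Q, D)}}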

  32. TREC3 Logistic Regression • Average Absolute Query Frequency • Query Length • Average Absolute Component Frequency • Document Length • Average Inverse Component Frequency • Number of Terms in both query and Component

  33. Okapi BM25 • Where: • Q is a query containing terms T • K is k1((1-b) + b·dl/avdl) • k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000 • tf is the frequency of the term in a specific document • qtf is the frequency of the term in a topic from which Q was derived • dl and avdl are the document length and the average document length measured in some convenient unit • w(1) is the Robertson-Sparck Jones weight.
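The BM25 scoring formula itself was also a graphic on the slide; the standard Okapi form that the variable list above describes is:

\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \; \frac{(k_3 + 1)\, qtf}{k_3 + qtf},
\qquad
K = k_1 \bigl( (1 - b) + b \,\tfrac{dl}{avdl} \bigr)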

  34. Combining Boolean and Probabilistic Search Elements • Two original approaches: • Boolean approach • Non-probabilistic “Fusion Search”: a set-merger approach that performs a weighted merger of document scores from separate Boolean and probabilistic queries

  35. INEX ‘04 Fusion Search [Diagram: component query results from multiple subqueries are fused/merged into a final ranked list] • Merge multiple ranked and Boolean index searches within each query and multiple component search result sets • Major components merged are Articles, Body, Sections, subsections, paragraphs

  36. Merging and Ranking Operators • Extends the capabilities of merging to include merger operations in queries, like Boolean operators • Fuzzy Logic Operators (not used for INEX) • !FUZZY_AND • !FUZZY_OR • !FUZZY_NOT • Containment operators: restrict components to, or from, a particular parent • !RESTRICT_FROM • !RESTRICT_TO • Merge Operators • !MERGE_SUM • !MERGE_MEAN • !MERGE_NORM • !MERGE_CMBZ
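The slides do not spell out the operator semantics; a minimal sketch, assuming MERGE_NORM and MERGE_CMBZ behave roughly like the common CombSUM and CombMNZ fusion rules over min-max normalized scores:

# Hedged sketch of normalized score fusion in the spirit of the
# MERGE_NORM / MERGE_CMBZ operators listed above. The exact Cheshire
# semantics may differ; this follows the standard CombSUM / CombMNZ pattern.

def normalize(scores):
    """Min-max normalize a {doc_id: score} dict to the range 0..1."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def merge_norm(result_lists):
    """Sum of normalized scores across result lists (CombSUM-like)."""
    fused = {}
    for scores in map(normalize, result_lists):
        for d, s in scores.items():
            fused[d] = fused.get(d, 0.0) + s
    return fused

def merge_cmbz(result_lists):
    """Normalized-score sum boosted by how many lists contain the document
    (CombMNZ-like), which rewards agreement between searches."""
    fused, hits = {}, {}
    for scores in map(normalize, result_lists):
        for d, s in scores.items():
            fused[d] = fused.get(d, 0.0) + s
            hits[d] = hits.get(d, 0) + 1
    return {d: fused[d] * hits[d] for d in fused}

# Example: fuse an Okapi result list with an LR result list.
okapi = {"doc1": 12.3, "doc2": 7.1, "doc3": 2.0}
lr = {"doc1": 0.82, "doc4": 0.65}
ranked = sorted(merge_cmbz([okapi, lr]).items(), key=lambda x: -x[1])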

  37. New LR Coefficients • Estimates using INEX ‘03 relevance assessments for: • b1 = Average Absolute Query Frequency • b2 = Query Length • b3 = Average Absolute Component Frequency • b4 = Document Length • b5 = Average Inverse Component Frequency • b6 = Number of Terms in common between query and Component

  38. INEX CO Runs • Three official, one later run - all Title-only • Fusion - Combines Okapi and LR using the MERGE_CMBZ operator • NewParms (LR)- Using only LR with the new parameters • Feedback - An attempt at blind relevance feedback • PostFusion - Fusion of the new LR coefficients and Okapi

  39. Query Generation - CO • # 162 TITLE = Text and Index Compression Algorithms • QUERY: (topicshort @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @+ {Text and Index Compression Algorithms}) !MERGE_CMBZ (topicshort @ {Text and Index Compression Algorithms}) !MERGE_CMBZ (alltitles @ {Text and Index Compression Algorithms}) • @+ is Okapi, @ is LR • !MERGE_CMBZ is a normalized score summation and enhancement

  40. INEX CO Runs Strict Generalized Avg Prec FUSION = 0.0642 NEWPARMS = 0.0582 FDBK = 0.0415 POSTFUS = 0.0690 Avg Prec FUSION = 0.0923 NEWPARMS = 0.0853 FDBK = 0.0390 POSTFUS = 0.0952

  41. INEX VCAS Runs • Two official runs • FUSVCAS - Element fusion using LR and various operators for path restriction • NEWVCAS - Using the new LR coefficients for each appropriate index and various operators for path restriction

  42. Query Generation - VCAS • #66 TITLE = //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)] • Submitted query = ((topic @ {intelligent transport systems})) !RESTRICT_FROM ((sec_words @ {on-board route planning navigation system for automobiles})) • Target elements: sec|ss1|ss2|ss3

  43. VCAS Results Generalized Strict Avg Prec FUSVCAS = 0.0321 NEWVCAS = 0.0270 Avg Prec FUSVCAS = 0.0601 NEWVCAS = 0.0569

  44. Heterogeneous Track • Approach uses Cheshire’s Virtual Database options • Primarily a version of distributed IR • Each collection indexed separately • Search via Z39.50 distributed queries • Z39.50 attribute mapping used to map query indexes to appropriate elements in a given collection • Only LR used, and collection results merged using the probability of relevance for each collection result

  45. INEX 2005 Approach • Used only Logistic regression methods • “TREC3” with Pivot • “TREC2” with Pivot • “TREC2” with Blind Feedback • Used post-processing for specific tasks

  46. Logistic Regression • Probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained from some set of m statistical measures, Xi, derived from the collection and query.

  47. TREC2 Algorithm [Formula shown as a graphic on the slide; its inputs are term frequencies for the query, the document, and the collection, plus the number of matching terms]

  48. Blind Feedback • Term selection from top-ranked documents is based on the classic Robertson/Sparck Jones probabilistic model. For each term t, the contingency table relating document indexing (whether a document contains t) to document relevance is:
                          relevant      not relevant        total
      contains t          Rt            Nt - Rt             Nt
      does not contain t  R - Rt        N - Nt - R + Rt     N - Nt
      total               R             N - R               N
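The term weighting formula itself appears as a graphic in the original slides; the classic Robertson/Sparck Jones relevance weight defined by this table, with the usual 0.5 smoothing, is:

w^{(1)}_t = \log \frac{(R_t + 0.5)\,(N - N_t - R + R_t + 0.5)}{(N_t - R_t + 0.5)\,(R - R_t + 0.5)}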

  49. Blind Feedback • Top x new terms taken from top y documents • For each term in the top y assumed relevant set… • Terms are ranked by termwt and the top x selected for inclusion in the query
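A minimal sketch of this selection step (assumptions: top_docs holds the term set of each of the y assumed-relevant documents, doc_freq gives each term's document frequency over the N-document collection, and termwt is taken to be the Robertson/Sparck Jones weight above):

# Hedged sketch of blind-feedback term selection as described above.
import math

def rsj_weight(Rt, Nt, R, N):
    """Robertson/Sparck Jones relevance weight with 0.5 smoothing."""
    return math.log(((Rt + 0.5) * (N - Nt - R + Rt + 0.5)) /
                    ((Nt - Rt + 0.5) * (R - Rt + 0.5)))

def expansion_terms(top_docs, doc_freq, N, x=10):
    """Rank the terms appearing in the top y assumed-relevant documents by
    their weight and return the top x for inclusion in the query."""
    R = len(top_docs)
    weights = {}
    for t in set().union(*top_docs):
        Rt = sum(1 for d in top_docs if t in d)
        weights[t] = rsj_weight(Rt, doc_freq.get(t, Rt), R, N)
    return sorted(weights, key=weights.get, reverse=True)[:x]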

  50. Pivot method • Based on the pivot weighting used by IBM Haifa in INEX 2004 (Mass & Mandelbrod) • Used 0.50 as pivot for all cases • For TREC3 and TREC2 runs all component results weighted by article-level results for the matching article
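The combination rule is not written out on the slide; the pivot scheme of Mass & Mandelbrod that it cites interpolates, roughly, each component's score with the score of its containing article, with the pivot value of 0.50 used here for all cases:

score_{pivot}(c) = \lambda \cdot score(\mathrm{article}(c)) + (1 - \lambda) \cdot score(c), \qquad \lambda = 0.5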
