1 / 47

Context

XML Structured Document Retrieval and Distributed Resource Discovery Ray R. Larson School of Information Management & Systems University of California, Berkeley ray@sherlock.berkeley.edu. Context. NSF/JISC International Digital Library Grant

jacob
Download Presentation

Context

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. XML Structured Document Retrieval and Distributed Resource DiscoveryRay R. LarsonSchool of Information Management & SystemsUniversity of California, Berkeleyray@sherlock.berkeley.edu NASA Ames Lecture -- Ray R. Larson

  2. Context • NSF/JISC International Digital Library Grant • Cross-Domain Resource Discovery: Integrated Discovery and Use of Textual, Numeric and Spatial Data • UC Berkeley DLI2 Grant: • ReInventing Scholarly Information Access • UC Berkeley working with the University of Liverpool/Manchester Computing with participation from • DeMontfort University (MASTER) • Art and Humanities Data Service (http://ahds.ac.uk/) • OTA (Oxford), HDS (Essex), PADS (Glasgow), ADS (York), VADS (Surrey & Northumbria) • Consortium of University Research Libraries (CURL) • UC Berkeley Library (and California Digital Library) • Making of America II • Online Archive of California • British Natural History Museum, London • NESSTAR (NEtworked Social Science Tools and Resources) NASA Ames Lecture -- Ray R. Larson

  3. Research Areas • Goals are • Practical application of existing DL technologies to some large-scale cross-domain collections • Theoretical examination and evaluation of next-generation designs for systems architecture and and distributed cross-domain searching for DLs NASA Ames Lecture -- Ray R. Larson

  4. Approach • For the first goal, we are implementing a distributed search system based on international standards (Z39.50 and SGML/XML) using the Cheshire II information retrieval system • Databases include: • HE Archives hub • Arts and Humanities Data Service (AHDS) • MASTER • CURL (Consortium of University Research Libraries) • Online Archive of California (OAC) • Making of America II (MOA2) NASA Ames Lecture -- Ray R. Larson

  5. Current Usage of Cheshire II • Web clients for: • Berkeley NSF/NASA/ARPA Digital Library • World Conservation Digital Library • SunSite (UC Berkeley Science Libraries) • University of Liverpool • Higher Education Archives Hub • Glasgow, Edinburgh, Bath, Liverpool, Kings College London, University College London, Nottingham, Durham, School of Oriental and African Studies, Manchester, Southhampton, Warwick and others (to be expanded) • University of Essex, HDS (part of AHDS) • Oxford Text Archive (test only) • California Sheet Music Project • Cha-Cha (Berkeley Intranet Search Engine) • Berkeley Metadata project cross-language demo • Univ. of Virginia (test implementations) • Cheshire ranking algorithm is basis for original Inktomi NASA Ames Lecture -- Ray R. Larson

  6. Current and Upcoming Usage of Cheshire II • DIEPER Digitized European Periodicals project. • http://gdz.sub.uni-goettingen.de/dieper/ • NESSTAR (Networked Social Science Tools and Resources. • http://www.nesstar.org/ • FASTER – Flexible Access to Statistics Tables and Electronic Resources. (Continuation of NESSTAR) • http://www.faster-data.org/ • MASTER (Manuscript Access through Standards for Electronic Records. • http://www.cta.dmu.ac.uk/projects/master/ NASA Ames Lecture -- Ray R. Larson

  7. Upcoming Usage of Cheshire II • ZETOC (Prototype of the Electronic Table of Contents from the British Library) • http://zetoc.mimas.ac.uk/ • Archives Hub • http://www.archiveshub.ac.uk/ • RSLP Palaeography project • http://www.palaeography.ac.uk/ • British Natural History Museum, London • JISC data services directory hosted by MIMAS • Resource Discovery Network (RDN), where it will be used to harvest RDN records from the various hubs using OAI and provide search NASA Ames Lecture -- Ray R. Larson

  8. Client/Server Architecture • Server Supports: • Database storage • Indexing • Z39.50 access to local data • Boolean and Probabilistic Searching • Relevance Feedback • External SQL database support • Client Supports: • Programmable (Tcl/Tk) Graphical User Interface • Z39.50 access to remote servers • SGML/XML & MARC formatting • Combined Client/Server CGI scripting via WebCheshire used for web applications NASA Ames Lecture -- Ray R. Larson

  9. SGML/XML Support • Underlying native format for all data is SGML/XML • The DTD defines the file format for each file • Full SGML/XML parsing • XML Configuration Files define the database • USMARC DTD and MARC to SGML conversion (and back again) • Access to full-text via special SGML tags • Support for SGML/XML component definition and indexing NASA Ames Lecture -- Ray R. Larson

  10. SGML/XML Support • Example XML record for a DL document <ELIB-BIB> <BIB-VERSION>ELIB-v1.0</BIB-VERSION> <ID>756</ID> <ENTRY>June 12, 1996</ENTRY> <DATE>June 1996</DATE> <TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE> <ORGANIZATION>University of California</ORGANIZATION> <TYPE>report</TYPE> <AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL> <AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL> <AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL> <AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL> <PROJECT>SNEP</PROJECT> <SERIES>Vol 3</SERIES> <PAGES>40</PAGES> <TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF> <PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF> </ELIB-BIB> NASA Ames Lecture -- Ray R. Larson

  11. SGML/XML Support <USMARC Material="BK" ID="00000003"><leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader> <Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry><VarFlds> <VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds> <VarDFlds><NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode><MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty><Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles><EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt><PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc><Series></Series><Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes><SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ... • Example SGML/MARC Record NASA Ames Lecture -- Ray R. Larson

  12. SGML/XML Support • Configuration files for the Server are also SGML/XML: • They include tags describing all of the data files and indexes for the database. • They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database. • They include definition of components and component indexes NASA Ames Lecture -- Ray R. Larson

  13. Component Extraction and Retrieval • Any sub-elements of an SGML/XML document can be defined as a separately indexed “component”. • Components can be ranked and retrieved independently of the source document (but linked back to their original source) • For example paragraphs and abstracts in the full text of documents could be defined as components to provide paragraph-level search • Example: Glassier archives… NASA Ames Lecture -- Ray R. Larson

  14. Component Extraction and Retrieval • The Glassier archive is an EAD document (1.9 Mb in size) • Contains “Series, Subseries, and Item level” descriptions of things in the archive NASA Ames Lecture -- Ray R. Larson

  15. Excerpt from Glasier Archive <c level="subseries"> <did> <head>GP-1-1: General correspondence. Public letters.</head> <unitid id="gp-1-1">GP-1-1</unitid> <unittitle>Glasier Papers. General correspondence. Public letters.</unittitle> </did> <arrangement> <head>Arrangement </head> <p>Public letters arranged alphabetically within each year </p> </arrangement> <c level="item" langmaterial="eng"> <did> <unitid id="gp-1-1-0001">GP-1-1-0001</unitid> <unittitle>Letter from Richard Murray. <geogname>Glasgow</geogname>; <unitdate > 7 Apr 1879</unitdate>.</unittitle> <origination><persname>Murray, Richard</persname></origination> <physdesc><extent>1 letter</extent></physdesc> </did> <note><p>Employment reference for J.B.G. as draughtsman<subject>Glasier, John Bruce</subject></p> </note> </c> ETC…. NASA Ames Lecture -- Ray R. Larson

  16. Example Component Def … <COMPONENTS> <COMPONENTDEF> <COMPONENTNAME> /home/ray/Work/Glasier_test/indexes/COMPONENT_DB1 </COMPONENTNAME> <COMPONENTNORM>NONE</COMPONENTNORM> <COMPSTARTTAG> <tagspec> <FTAG> c </FTAG><ATTR> level <VALUE>item</VALUE></ATTR> </tagspec> </COMPSTARTTAG> <COMPONENTINDEXES> <!-- First index def --> … NASA Ames Lecture -- Ray R. Larson

  17. Components • Both individual tags and “ranges” with a starting tag and (different) ending tag can be used as components • Components permit parts of complex SGML/XML documents to be treated as separate documents NASA Ames Lecture -- Ray R. Larson

  18. Local Remote Z39.50 Z39.50 Internet Z39.50 Z39.50 Images Scanned Text Cheshire II Searching NASA Ames Lecture -- Ray R. Larson

  19. Boolean Search Capability • All Boolean operations are supported • “zfind author x and (title y or subject z) not subject A” • Named sets are supported and stored on the server • Boolean operations between stored sets are supported • “zfind SET1 and subject widgets or SET2” • Nested parentheses and truncation are supported • “zfind xtitle Alice#” NASA Ames Lecture -- Ray R. Larson

  20. Probabilistic Models • Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query • Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle) • Rely on accurate estimates of probabilities NASA Ames Lecture -- Ray R. Larson

  21. Probability Ranking Principle • If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data. Stephen E. Robertson, J. Documentation 1977 NASA Ames Lecture -- Ray R. Larson

  22. Probabilistic Models: Logistic Regression • Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables. Log odds of relevance is a linear function of attributes: Term contributions summed: Probability of Relevance is inverse of log odds: NASA Ames Lecture -- Ray R. Larson

  23. 100 - 90 - 80 - 70 - 60 - 50 - 40 - 30 - 20 - 10 - 0 - Relevance 0 10 20 30 40 50 60 Term Frequency in Document Logistic Regression NASA Ames Lecture -- Ray R. Larson

  24. Probabilistic Retrieval: Logistic Regression Probability of relevance is based on Logistic regression from a sample set of documents to determine values of the coefficients (TREC). At retrieval the probability estimate is obtained by: For the 6 X attribute measures shown on the next slide NASA Ames Lecture -- Ray R. Larson

  25. Probabilistic Retrieval: Logistic Regression attributes Average Absolute Query Frequency Query Length Average Absolute Document Frequency Document Length Average Inverse Document Frequency Inverse Document Frequency Number of Terms in common between query and document -- logged NASA Ames Lecture -- Ray R. Larson

  26. Cheshire Probabilistic Retrieval • Uses Logistic Regression ranking method developed at Berkeley with new algorithm for weigh calculation at retrieval time. • Z39.50 “relevance” operator used to indicate probabilistic search • Any index can have Probabilistic searching performed: • zfind topic @ “cheshire cats, looking glasses, march hares and other such things” • zfind title @ caucus races • Boolean and Probabilistic elements can be combined: • zfind topic @ government documents and title guidebooks NASA Ames Lecture -- Ray R. Larson

  27. Combining Search Types • It is also possible to combine the results of multiple independent searches into a single result set. (using the Z39.50 SORT service of the Cheshire system) • E.g.: • Search of Full Text (Probabilistic) • Search of Full Text (Boolean) • Search of Components (Probabilistic) • Search of Titles (Probabilistic) • Search of Subject Headings (Probabilistic) • All result sets are merged and re-ranked to produce the final list. NASA Ames Lecture -- Ray R. Larson

  28. Relevance Feedback. • Any records in a result set can be used for Relevance Feedback • Uses the “set name” to receive feedback instructions. • zfind SET1:2,5-9,30,45 • zfind SET2:6 • Chosen records are used to build a new probabilistic query • Ranked results are returned • Planned support for (modified) Rocchio RF NASA Ames Lecture -- Ray R. Larson

  29. Cheshire II - Two-Stage Retrieval (EVM generation) • Example: Using the LC Classification System • Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (subject headings, titles, etc.) • Permits searching by any term in the class • Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first. • User selects classes to feed back for the “second stage” search of documents (which includes info from first stage selections) • Can be used with any classified/Indexed collection and controlled vocabulary NASA Ames Lecture -- Ray R. Larson

  30. Automatic Class Assignment Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered clusters are order-independent, usually based on an intellectually derived scheme Doc Doc Doc Doc Search Engine Doc Doc Doc 1. Create pseudo-documents representing intellectually derived classes. 2. Search using document contents 3. Obtain ranked list 4. Assign document to N categories ranked over threshold. OR assign to top-ranked category NASA Ames Lecture -- Ray R. Larson

  31. Cheshire II - Cluster Generation • Define basis for clustering records. • Select field to form the basis of the cluster. • Evidence Fields to use as contents of the pseudo-documents. • During indexing cluster keys are generated with basis and evidence from each record. • Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields. • Pseudo-Documents (Class clusters) are indexed on combined evidence fields. NASA Ames Lecture -- Ray R. Larson

  32. Cheshire II - Two-Stage Retrieval • Using the Mesh Subject Heading System • Pseudo-Document created for each MESH heading containing terms derived from “content-rich” portions of documents in that class (other subject headings, titles, abstract, etc.) • Permits searching by any term in the class • Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first. • User selects classes to feed back for the “second stage” search of documents. • Can be used with any classified/Indexed collection. NASA Ames Lecture -- Ray R. Larson

  33. Distributed Search: The Problem • Hundreds or Thousands of servers with databases ranging widely in content, topic, format • Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results • How to select the “best” ones to search? • What to search first • Which to search next • Topical /domain constraints on the search selections • Variable contents of database (metadata only, full text…) NASA Ames Lecture -- Ray R. Larson

  34. An Approach for Cross-Domain Resource Discovery • MetaSearch • New approach to building metasearch based on Z39.50 • Instead of using broadcast search we are using two Z39.50 Services • Identification of database metadata using Z39.50 Explain • Extraction of distributed indexes using Z39.50 SCAN • Evaluation • How efficiently can we build distributed indexes? • How effectively can we choose databases using the index? • How effective is merging search results from multiple sources? • Hierarchies of servers (general/meta-topical/individual)? NASA Ames Lecture -- Ray R. Larson

  35. Z39.50 Overview UI Map Query Search Engine Map Results Map Query Internet Map Results Map Query UI Map Results NASA Ames Lecture -- Ray R. Larson

  36. Z39.50 Explain • Explain supports searches for • Server-Level metadata • Server Name • IP Addresses • Ports • Database-Level metadata • Database name • Search attributes (indexes and combinations) • Support metadata (record syntaxes, etc) NASA Ames Lecture -- Ray R. Larson

  37. Z39.50 SCAN • Originally intended to support Browsing • Query for • Database • Attributes plus Term (i.e., index and start point) • Step Size • Number of terms to retrieve • Position in Response set • Results • Number of terms returned • List of Terms and their frequency in the database (for the given attribute combination) NASA Ames Lecture -- Ray R. Larson

  38. Z39.50 SCAN Results Syntax: zscan indexname1 term stepsize number_of_terms pref_pos % zscan title cat 1 20 1 {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} … zscan topic cat 1 20 1 {SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} … NASA Ames Lecture -- Ray R. Larson

  39. MetaSearch Server Index Creation • For all servers, or a topical subset… • Get Explain information (especially DC mappings) • For each index (or each DC index) • Use SCAN to extract terms and frequency • Add term + freq + source index + database metadata to the metasearch “Collection Document” (XML) • Planned extensions: • Post-Process indexes (especially Geo Names, etc) for special types of data • e.g. create “geographical coverage” indexes NASA Ames Lecture -- Ray R. Larson

  40. MetaSearch Approach Search Engine MetaSearch Server Map Query Map Explain And Scan Queries Map Results Internet DB 1 DB2 Map Results Search Engine Map Query Distributed Index Search Engine Map Results Db 5 Db 6 NASA Ames Lecture -- Ray R. Larson DB 3 DB 4

  41. Known Problems • Not all Z39.50 Servers support SCAN or Explain • Solutions: • Probing for attributes instead of explain (e.g. DC attributes or analogs) • We also support OAI and can extract OAI metadata for servers that support OAI • Collection Documents are static and need to be replaced when the associated collection changes NASA Ames Lecture -- Ray R. Larson

  42. Evaluation • Test Environment • TREC Tipster and FT data (approx. 3.5 GB) • Partitioned into 236 smaller collections based on source and (for TIPSTER) date by month (Distributed Search Testbed built by French, et al.) • High size variability (Range from 1 to thousands of docs) • 21,225,299 Words, 142,345,670 chars total for harvested records • Efficiency (old data) • Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) • Average of 14.07 seconds excluding FT (131 seconds for FT database with 7 indexes) • Now collecting more information – so longer harvest times longer, but still under one minute on average NASA Ames Lecture -- Ray R. Larson

  43. Evaluation • Effectiveness • Still working on evaluation comparing our DB ranking with the TIPSTER relevance judgements • Can be compared with published selection methods (CORI, GlOSS, etc.) using the same testbed NASA Ames Lecture -- Ray R. Larson

  44. Future • Testing of variant algorithms for ranking collections • Application to real systems and testing in a production environment (Archives Hub) • Logically Clustering servers by topic • Meta-Meta Servers (treating the MetaSearch database as just another database) NASA Ames Lecture -- Ray R. Larson

  45. Distributed Metadata Servers Database Servers Meta-Topical Servers General Servers Replicated servers NASA Ames Lecture -- Ray R. Larson

  46. Conclusion • A lot of interesting work to be done • Redesign and development of the Cheshire II system • Evaluating new meta-indexing methods • Developing and Evaluating methods for merging cross-domain results (or, perhaps, when to keep them separate) NASA Ames Lecture -- Ray R. Larson

  47. Further Information • Full Cheshire II client and server source is available ftp://cheshire.berkeley.edu/pub/cheshire/ • Includes HTML documentation • Also on Berkeley Digital Library Software Distribution CD • Project Web Site http://cheshire.berkeley.edu/ NASA Ames Lecture -- Ray R. Larson

More Related