
XML Structured Document Retrieval and Distributed Resource Discovery
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
ray@sherlock.berkeley.edu

NASA Ames Lecture -- Ray R. Larson

Context
  • NSF/JISC International Digital Library Grant
    • Cross-Domain Resource Discovery: Integrated Discovery and Use of Textual, Numeric and Spatial Data
  • UC Berkeley DLI2 Grant:
    • ReInventing Scholarly Information Access
  • UC Berkeley working with the University of Liverpool/Manchester Computing with participation from
    • DeMontfort University (MASTER)
    • Art and Humanities Data Service (http://ahds.ac.uk/)
      • OTA (Oxford), HDS (Essex), PADS (Glasgow), ADS (York), VADS (Surrey & Northumbria)
    • Consortium of University Research Libraries (CURL)
    • UC Berkeley Library (and California Digital Library)
      • Making of America II
      • Online Archive of California
    • British Natural History Museum, London
    • NESSTAR (NEtworked Social Science Tools and Resources)

Research Areas
  • Goals are
    • Practical application of existing DL technologies to some large-scale cross-domain collections
    • Theoretical examination and evaluation of next-generation designs for systems architecture and distributed cross-domain searching for DLs

Approach
  • For the first goal, we are implementing a distributed search system based on international standards (Z39.50 and SGML/XML) using the Cheshire II information retrieval system
  • Databases include:
    • HE Archives hub
    • Arts and Humanities Data Service (AHDS)
    • MASTER
    • CURL (Consortium of University Research Libraries)
    • Online Archive of California (OAC)
    • Making of America II (MOA2)

Current Usage of Cheshire II
  • Web clients for:
    • Berkeley NSF/NASA/ARPA Digital Library
    • World Conservation Digital Library
    • SunSite (UC Berkeley Science Libraries)
    • University of Liverpool
    • Higher Education Archives Hub
      • Glasgow, Edinburgh, Bath, Liverpool, King's College London, University College London, Nottingham, Durham, School of Oriental and African Studies, Manchester, Southampton, Warwick and others (to be expanded)
    • University of Essex, HDS (part of AHDS)
    • Oxford Text Archive (test only)
    • California Sheet Music Project
    • Cha-Cha (Berkeley Intranet Search Engine)
    • Berkeley Metadata project cross-language demo
    • Univ. of Virginia (test implementations)
    • Cheshire ranking algorithm is basis for original Inktomi

Current and Upcoming Usage of Cheshire II
  • DIEPER Digitized European Periodicals project.
    • http://gdz.sub.uni-goettingen.de/dieper/
  • NESSTAR (Networked Social Science Tools and Resources)
    • http://www.nesstar.org/
  • FASTER – Flexible Access to Statistics Tables and Electronic Resources. (Continuation of NESSTAR)
    • http://www.faster-data.org/
  • MASTER (Manuscript Access through Standards for Electronic Records)
    • http://www.cta.dmu.ac.uk/projects/master/

Upcoming Usage of Cheshire II
  • ZETOC (Prototype of the Electronic Table of Contents from the British Library)
    • http://zetoc.mimas.ac.uk/
  • Archives Hub
    • http://www.archiveshub.ac.uk/
  • RSLP Palaeography project
    • http://www.palaeography.ac.uk/
  • British Natural History Museum, London
  • JISC data services directory hosted by MIMAS
  • Resource Discovery Network (RDN), where it will be used to harvest RDN records from the various hubs using OAI and provide search

Client/Server Architecture
  • Server Supports:
    • Database storage
    • Indexing
    • Z39.50 access to local data
    • Boolean and Probabilistic Searching
    • Relevance Feedback
    • External SQL database support
  • Client Supports:
    • Programmable (Tcl/Tk) Graphical User Interface
    • Z39.50 access to remote servers
    • SGML/XML & MARC formatting
  • Combined Client/Server CGI scripting via WebCheshire used for web applications

SGML/XML Support
  • Underlying native format for all data is SGML/XML
  • The DTD defines the file format for each file
  • Full SGML/XML parsing
  • XML Configuration Files define the database
  • USMARC DTD and MARC to SGML conversion (and back again)
  • Access to full-text via special SGML tags
  • Support for SGML/XML component definition and indexing

SGML/XML Support
  • Example XML record for a DL document

<ELIB-BIB>

<BIB-VERSION>ELIB-v1.0</BIB-VERSION>

<ID>756</ID>

<ENTRY>June 12, 1996</ENTRY>

<DATE>June 1996</DATE>

<TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to the Sierra Nevada</TITLE>

<ORGANIZATION>University of California</ORGANIZATION>

<TYPE>report</TYPE>

<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>

<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>

<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>

<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>

<PROJECT>SNEP</PROJECT>

<SERIES>Vol 3</SERIES>

<PAGES>40</PAGES>

<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>

<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>

</ELIB-BIB>
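
Because the record is plain XML, any XML parser can pull fields out of it. The following is a minimal sketch using Python's standard library rather than Cheshire II's own parser; the `summarize` helper and the choice of fields are illustrative assumptions, and the record is an abbreviated copy of the one above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the ELIB-BIB record shown above (only a few fields kept).
RECORD = """<ELIB-BIB>
<ID>756</ID>
<DATE>June 1996</DATE>
<TYPE>report</TYPE>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PAGES>40</PAGES>
</ELIB-BIB>"""


def summarize(record_xml):
    """Return a small dict of fields pulled from one ELIB-BIB record.

    Hypothetical helper for illustration; not part of Cheshire II.
    """
    root = ET.fromstring(record_xml)
    return {
        "id": root.findtext("ID"),
        "type": root.findtext("TYPE"),
        # Repeated elements such as AUTHOR-PERSONAL come back as a list.
        "authors": [a.text for a in root.findall("AUTHOR-PERSONAL")],
        "pages": int(root.findtext("PAGES")),
    }


summary = summarize(RECORD)
print(summary)
```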

SGML/XML Support

<USMARC Material="BK" ID="00000003"><leader><LRL>00722</LRL><RecStat>n</RecStat> <RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount> <SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel> <DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos> 5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader> <Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry><VarFlds> <VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005> <Fld008>830810 1983 nyu eng u</Fld008></VarCFlds> <VarDFlds><NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010> <Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode><MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty><Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles><EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt><PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24 cm</c></Fld300></PhysDesc><Series></Series><Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes><SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...

  • Example SGML/MARC Record

SGML/XML Support
  • Configuration files for the Server are also SGML/XML:
    • They include tags describing all of the data files and indexes for the database.
    • They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
    • They include definition of components and component indexes

Component Extraction and Retrieval
  • Any sub-elements of an SGML/XML document can be defined as a separately indexed “component”.
  • Components can be ranked and retrieved independently of the source document (but linked back to their original source)
  • For example paragraphs and abstracts in the full text of documents could be defined as components to provide paragraph-level search
  • Example: Glasier archives…

Component Extraction and Retrieval
  • The Glasier archive is an EAD document (1.9 MB in size)
  • Contains “Series, Subseries, and Item level” descriptions of things in the archive

Excerpt from Glasier Archive

<c level="subseries">

<did>

<head>GP-1-1: General correspondence. Public letters.</head>

<unitid id="gp-1-1">GP-1-1</unitid>

<unittitle>Glasier Papers. General correspondence. Public letters.</unittitle>

</did>

<arrangement>

<head>Arrangement </head>

<p>Public letters arranged alphabetically within each year </p>

</arrangement>

<c level="item" langmaterial="eng">

<did>

<unitid id="gp-1-1-0001">GP-1-1-0001</unitid>

<unittitle>Letter from Richard Murray. <geogname>Glasgow</geogname>; <unitdate>7 Apr 1879</unitdate>.</unittitle>

<origination><persname>Murray, Richard</persname></origination>

<physdesc><extent>1 letter</extent></physdesc>

</did>

<note><p>Employment reference for J.B.G. as draughtsman<subject>Glasier, John Bruce</subject></p>

</note>

</c>

ETC….

Example Component Def

<COMPONENTS>

<COMPONENTDEF>

<COMPONENTNAME> /home/ray/Work/Glasier_test/indexes/COMPONENT_DB1 </COMPONENTNAME>

<COMPONENTNORM>NONE</COMPONENTNORM>

<COMPSTARTTAG>

<tagspec>

<FTAG> c </FTAG><ATTR> level <VALUE>item</VALUE></ATTR>

</tagspec>

</COMPSTARTTAG>

<COMPONENTINDEXES>

<!-- First index def -->

Components
  • Both individual tags and “ranges” with a starting tag and (different) ending tag can be used as components
  • Components permit parts of complex SGML/XML documents to be treated as separate documents
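
The idea of treating every `<c level="item">` element as its own retrievable unit can be sketched as follows. This uses Python's standard XML parser, not the Cheshire II indexer; `extract_components` and the dictionary keys are hypothetical names, and the fragment is a cut-down version of the Glasier excerpt.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the Glasier excerpt: one subseries holding one item.
EAD_FRAGMENT = """<c level="subseries">
<did><unitid id="gp-1-1">GP-1-1</unitid></did>
<c level="item">
<did>
<unitid id="gp-1-1-0001">GP-1-1-0001</unitid>
<unittitle>Letter from Richard Murray.</unittitle>
</did>
</c>
</c>"""


def extract_components(xml_text, tag="c", attr="level", value="item"):
    """Collect elements matching tag/attribute/value as separate components.

    Each component keeps a link back to its source (here the enclosing
    unit's id) so a hit can be traced to the original document.
    """
    root = ET.fromstring(xml_text)
    source_id = root.findtext("did/unitid")
    components = []
    for elem in root.iter(tag):
        if elem.get(attr) == value:
            components.append({
                "source_id": source_id,
                "component_id": elem.findtext("did/unitid"),
                # Flatten the element's text and normalise whitespace.
                "text": " ".join("".join(elem.itertext()).split()),
            })
    return components


components = extract_components(EAD_FRAGMENT)
print(components)
```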


Cheshire II Searching

[Diagram labels: Local, Remote, Z39.50, Internet, Images, Scanned Text.]

Boolean Search Capability
  • All Boolean operations are supported
    • “zfind author x and (title y or subject z) not subject A”
  • Named sets are supported and stored on the server
  • Boolean operations between stored sets are supported
    • “zfind SET1 and subject widgets or SET2”
  • Nested parentheses and truncation are supported
    • “zfind xtitle Alice#”
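
One way to picture these operations is as set algebra over postings lists. The sketch below is an illustration in Python, not the Cheshire II implementation: the postings data, the helper names, and the reading of "not" as set difference are all assumptions.

```python
# Hypothetical postings: (index, term) -> set of matching record ids.
POSTINGS = {
    ("author", "x"): {1, 2, 3, 4},
    ("title", "y"): {2, 5},
    ("subject", "z"): {3, 4},
    ("subject", "a"): {4},
    ("xtitle", "alice"): {7},
    ("xtitle", "alices"): {8},
    ("xtitle", "alien"): {9},
}


def lookup(index, term):
    """Exact-term lookup; unknown terms give an empty set."""
    return POSTINGS.get((index, term.lower()), set())


def lookup_truncated(index, pattern):
    """Truncated lookup: 'Alice#' matches every term starting with 'alice'."""
    stem = pattern.rstrip("#").lower()
    hits = set()
    for (idx, term), ids in POSTINGS.items():
        if idx == index and term.startswith(stem):
            hits |= ids
    return hits


# Named result sets kept for later reuse, as in "zfind SET1 and ...".
NAMED_SETS = {}

# zfind author x and (title y or subject z) not subject A
NAMED_SETS["SET1"] = (
    lookup("author", "x") & (lookup("title", "y") | lookup("subject", "z"))
) - lookup("subject", "A")

# zfind xtitle Alice#
NAMED_SETS["SET2"] = lookup_truncated("xtitle", "Alice#")

print(NAMED_SETS)
```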

Probabilistic Models
  • Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
  • Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
  • Rely on accurate estimates of probabilities

Probability Ranking Principle
  • If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

Stephen E. Robertson, J. Documentation 1977

Probabilistic Models: Logistic Regression
  • Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.

Log odds of relevance is a linear function of attributes:

Term contributions summed:

Probability of Relevance is inverse of log odds:
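
The equations for this slide did not survive in the transcript. A standard logistic regression form that matches the first and third captions is sketched here; the symbols ($c_i$ for the coefficients, $X_i$ for the attributes, $n$ for their number) are assumed notation, and the per-term summation is left out because its exact form cannot be recovered from the text.

```latex
\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{n} c_i X_i

P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}
```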


Logistic Regression

[Plot: Relevance (0 to 100) on the vertical axis against Term Frequency in Document (0 to 60) on the horizontal axis.]

Probabilistic Retrieval: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients (TREC).

At retrieval the probability estimate is obtained by:

For the 6 X attribute measures shown on the next slide

Probabilistic Retrieval: Logistic Regression attributes

Average Absolute Query Frequency

Query Length

Average Absolute Document Frequency

Document Length

Average Inverse Document Frequency

Inverse Document Frequency

Number of Terms in common between query and document -- logged
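
A rough sketch of how six such measures could feed a logistic model is given below. Everything specific in it is an assumption: the exact definitions of the measures (for example, whether lengths are square-rooted or frequencies logged), the reading of "Inverse Document Frequency" as the quantity averaged in the fifth measure, and above all the coefficients, which are placeholders and not the values fitted on TREC data.

```python
import math
from collections import Counter

# Placeholder coefficients c0..c6 for illustration only; the fitted values
# are not given in these slides.
COEFFS = [-3.5, 0.4, -0.1, 0.2, -0.01, 0.3, 0.5]


def attributes(query_terms, doc_terms, doc_freq, n_docs):
    """Compute six attribute measures over terms shared by query and document.

    doc_freq maps a term to the number of documents containing it.
    Returns None when query and document share no terms.
    """
    q, d = Counter(query_terms), Counter(doc_terms)
    common = [t for t in q if t in d]
    m = len(common)
    if m == 0:
        return None
    x1 = sum(q[t] for t in common) / m            # average absolute query frequency
    x2 = len(query_terms)                         # query length
    x3 = sum(d[t] for t in common) / m            # average absolute document frequency
    x4 = len(doc_terms)                           # document length
    x5 = sum(math.log(n_docs / doc_freq[t]) for t in common) / m  # average IDF
    x6 = math.log(m)                              # number of common terms, logged
    return [x1, x2, x3, x4, x5, x6]


def probability_of_relevance(x, coeffs=COEFFS):
    """Linear log odds followed by the logistic transform."""
    log_odds = coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))


x = attributes(
    ["cheshire", "cat"],
    ["cat", "cat", "grin"],
    {"cat": 10, "cheshire": 2, "grin": 5},
    n_docs=100,
)
p = probability_of_relevance(x)
print(x, p)
```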

Cheshire Probabilistic Retrieval
  • Uses Logistic Regression ranking method developed at Berkeley with new algorithm for weight calculation at retrieval time.
  • Z39.50 “relevance” operator used to indicate probabilistic search
  • Any index can have Probabilistic searching performed:
    • zfind topic @ “cheshire cats, looking glasses, march hares and other such things”
    • zfind title @ caucus races
  • Boolean and Probabilistic elements can be combined:
    • zfind topic @ government documents and title guidebooks

Combining Search Types
  • It is also possible to combine the results of multiple independent searches into a single result set (using the Z39.50 SORT service of the Cheshire system)
    • E.g.:
    • Search of Full Text (Probabilistic)
    • Search of Full Text (Boolean)
    • Search of Components (Probabilistic)
    • Search of Titles (Probabilistic)
    • Search of Subject Headings (Probabilistic)
  • All result sets are merged and re-ranked to produce the final list.
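
The slides do not say how the merged sets are re-ranked. As a stand-in, the sketch below normalises each result set by its top score and sums the normalised scores per record; the function name, the scoring rule, and the sample scores are all assumptions made for illustration.

```python
def merge_result_sets(result_sets):
    """Merge several {record_id: score} result sets into one ranked list.

    Each set is scaled so its best score is 1.0, then scores for the same
    record are summed. Ties are broken by record id for a stable order.
    """
    combined = {}
    for results in result_sets:
        if not results:
            continue
        top = max(results.values())
        for record_id, score in results.items():
            normalised = score / top if top > 0 else 0.0
            combined[record_id] = combined.get(record_id, 0.0) + normalised
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


full_text_probabilistic = {"doc1": 0.8, "doc2": 0.4}
full_text_boolean = {"doc2": 1.0, "doc3": 1.0}   # Boolean hits carry no ranking
title_probabilistic = {"doc1": 0.5}

merged = merge_result_sets(
    [full_text_probabilistic, full_text_boolean, title_probabilistic]
)
print(merged)
```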

Relevance Feedback.
  • Any records in a result set can be used for Relevance Feedback
  • Uses the “set name” to receive feedback instructions.
    • zfind SET1:2,5-9,30,45
    • zfind SET2:6
  • Chosen records are used to build a new probabilistic query
  • Ranked results are returned
  • Planned support for (modified) Rocchio RF
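
The feedback notation in the examples (a set name, a colon, then record numbers and ranges) can be unpacked as in this sketch. The parser is a hypothetical illustration of the notation, not code from Cheshire II, and it assumes ranges are inclusive.

```python
def parse_feedback(spec):
    """Split 'SET1:2,5-9,30,45' into the set name and a list of record numbers."""
    set_name, _, ranges = spec.partition(":")
    records = []
    for part in ranges.split(","):
        if "-" in part:
            low, high = part.split("-")
            records.extend(range(int(low), int(high) + 1))  # inclusive range
        else:
            records.append(int(part))
    return set_name, records


print(parse_feedback("SET1:2,5-9,30,45"))
print(parse_feedback("SET2:6"))
```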

Cheshire II - Two-Stage Retrieval (EVM generation)
  • Example: Using the LC Classification System
    • Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (subject headings, titles, etc.)
    • Permits searching by any term in the class
    • Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
    • User selects classes to feed back for the “second stage” search of documents (which includes info from first stage selections)
  • Can be used with any classified/Indexed collection and controlled vocabulary

Automatic Class Assignment

Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered

clusters are order-independent, usually based on an intellectually derived scheme

[Diagram labels: Doc (several), Search Engine.]

1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
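
The four steps can be sketched as follows. The term-overlap score is a simple stand-in for the probabilistic ranking the system actually uses, and the class labels, threshold, and function names are hypothetical.

```python
from collections import Counter


def rank_classes(doc_terms, pseudo_docs):
    """Steps 2 and 3: use the document as a query and rank the classes.

    Score = fraction of the document's terms found in the class
    pseudo-document (a stand-in for a probabilistic ranking).
    """
    doc = Counter(doc_terms)
    scores = {}
    for label, terms in pseudo_docs.items():
        cls = Counter(terms)
        overlap = sum(min(doc[t], cls[t]) for t in doc)
        scores[label] = overlap / len(doc_terms) if doc_terms else 0.0
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def assign(doc_terms, pseudo_docs, threshold=0.5, n=2):
    """Step 4: keep up to n classes over the threshold, else the top class."""
    ranked = rank_classes(doc_terms, pseudo_docs)
    over = [label for label, score in ranked[:n] if score >= threshold]
    return over or [ranked[0][0]]


# Step 1: pseudo-documents for three hypothetical classes.
PSEUDO_DOCS = {
    "QA": ["computer", "software", "algorithm"],
    "QH": ["biology", "cell", "species"],
    "Z": ["library", "catalog", "retrieval", "software"],
}

assigned = assign(["software", "retrieval", "library", "algorithm"], PSEUDO_DOCS)
print(assigned)
```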

Cheshire II - Cluster Generation
  • Define basis for clustering records.
    • Select field to form the basis of the cluster.
    • Evidence Fields to use as contents of the pseudo-documents.
  • During indexing cluster keys are generated with basis and evidence from each record.
  • Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
  • Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
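
The sort-and-merge step can be sketched as below. Field names such as `lc_class`, the record layout, and the function name are assumptions for illustration; the indexing of the resulting pseudo-documents is not shown.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group evidence text by basis value to form one pseudo-document per class."""
    # Emit one (basis, evidence) cluster key per basis value in each record.
    keys = []
    for rec in records:
        evidence = " ".join(rec.get(f, "") for f in evidence_fields).strip()
        for basis in rec.get(basis_field, []):
            keys.append((basis, evidence))

    # Sort on basis, then merge all evidence sharing the same basis.
    keys.sort()
    merged = defaultdict(list)
    for basis, evidence in keys:
        merged[basis].append(evidence)
    return {basis: " ".join(parts) for basis, parts in merged.items()}


RECORDS = [
    {"lc_class": ["QA76"], "title": "Information systems",
     "subject": "Management information systems"},
    {"lc_class": ["QA76"], "title": "Database design", "subject": "Databases"},
    {"lc_class": ["Z699"], "title": "Information retrieval",
     "subject": "Information storage"},
]

pseudo_docs = build_pseudo_documents(RECORDS, "lc_class", ["title", "subject"])
print(pseudo_docs)
```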

Cheshire II - Two-Stage Retrieval
  • Using the MeSH (Medical Subject Headings) System
    • Pseudo-Document created for each MeSH heading containing terms derived from “content-rich” portions of documents in that class (other subject headings, titles, abstract, etc.)
    • Permits searching by any term in the class
    • Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
    • User selects classes to feed back for the “second stage” search of documents.
  • Can be used with any classified/Indexed collection.

Distributed Search: The Problem
  • Hundreds or Thousands of servers with databases ranging widely in content, topic, format
    • Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
    • How to select the “best” ones to search?
      • What to search first
      • Which to search next
    • Topical /domain constraints on the search selections
    • Variable contents of database (metadata only, full text…)

An Approach for Cross-Domain Resource Discovery
  • MetaSearch
    • New approach to building metasearch based on Z39.50
    • Instead of using broadcast search we are using two Z39.50 Services
      • Identification of database metadata using Z39.50 Explain
      • Extraction of distributed indexes using Z39.50 SCAN
  • Evaluation
    • How efficiently can we build distributed indexes?
    • How effectively can we choose databases using the index?
    • How effective is merging search results from multiple sources?
    • Hierarchies of servers (general/meta-topical/individual)?

Z39.50 Overview

[Diagram labels: UI, Map Query, Map Results, Internet, Search Engine.]

Z39.50 Explain
  • Explain supports searches for
    • Server-Level metadata
      • Server Name
      • IP Addresses
      • Ports
    • Database-Level metadata
      • Database name
      • Search attributes (indexes and combinations)
    • Support metadata (record syntaxes, etc)

Z39.50 SCAN
  • Originally intended to support Browsing
  • Query for
    • Database
    • Attributes plus Term (i.e., index and start point)
    • Step Size
    • Number of terms to retrieve
    • Position in Response set
  • Results
    • Number of terms returned
    • List of Terms and their frequency in the database (for the given attribute combination)

Z39.50 SCAN Results

Syntax: zscan indexname1 term stepsize number_of_terms pref_pos

% zscan title cat 1 20 1
{SCAN {Status 0}
{Terms 20}
{StepSize 1}
{Position 1}}
{cat 27}
{cat-fight 1}
{catalan 19}
{catalogu 37}
{catalonia 8}
{catalyt 2}
{catania 1}
{cataract 1}
{catch 173}
{catch-all 3}
{catch-up 2} …

% zscan topic cat 1 20 1
{SCAN {Status 0}
{Terms 20}
{StepSize 1}
{Position 1}}
{cat 706}
{cat-and-mouse 19}
{cat-burglar 1}
{cat-carrying 1}
{cat-egory 1}
{cat-fight 1}
{cat-gut 1}
{cat-litter 1}
{cat-lovers 2}
{cat-pee 1}
{cat-run 1}
{cat-scanners 1} …
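
Output in this brace-delimited form is straightforward to turn into a term-to-frequency table. The parser below is a hypothetical sketch written against the layout shown above, not part of the Cheshire II client, and the sample string is an abbreviated copy of the first scan (so it lists fewer terms than the header reports).

```python
import re

# Abbreviated copy of the "zscan title cat 1 20 1" output shown above.
SCAN_OUTPUT = """{SCAN {Status 0}
{Terms 20}
{StepSize 1}
{Position 1}}
{cat 27}
{cat-fight 1}
{catalan 19}
{catch 173}"""


def parse_scan(output):
    """Return (header, terms) from zscan-style output.

    header maps Status/Terms/StepSize/Position to ints;
    terms maps each index term to its frequency.
    """
    text = " ".join(output.split())  # collapse newlines to single spaces
    match = re.match(r"\{SCAN ((?:\{\w+ \d+\} ?)+)\}", text)
    if match is None:
        raise ValueError("not a SCAN response")
    header = {k: int(v) for k, v in re.findall(r"\{(\w+) (\d+)\}", match.group(1))}
    body = text[match.end():]
    terms = {t: int(f) for t, f in re.findall(r"\{(\S+) (\d+)\}", body)}
    return header, terms


header, terms = parse_scan(SCAN_OUTPUT)
print(header)
print(terms)
```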

MetaSearch Server Index Creation
  • For all servers, or a topical subset…
    • Get Explain information (especially DC mappings)
    • For each index (or each DC index)
      • Use SCAN to extract terms and frequency
      • Add term + freq + source index + database metadata to the metasearch “Collection Document” (XML)
    • Planned extensions:
      • Post-Process indexes (especially Geo Names, etc) for special types of data
        • e.g. create “geographical coverage” indexes
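
As a sketch of the last harvesting step, the code below assembles an XML "Collection Document" from already-harvested term frequencies. The element and attribute names, the host name, and the port are hypothetical; the slides do not give the actual schema, and the Explain and SCAN network calls are omitted.

```python
import xml.etree.ElementTree as ET


def build_collection_document(db_name, host, port, index_terms):
    """Build one collection document from {index_name: {term: freq}}.

    Element names (collection, index, term) are illustrative only.
    """
    root = ET.Element(
        "collection", {"database": db_name, "host": host, "port": str(port)}
    )
    for index_name, terms in sorted(index_terms.items()):
        index_elem = ET.SubElement(root, "index", {"name": index_name})
        for term, freq in sorted(terms.items()):
            term_elem = ET.SubElement(index_elem, "term", {"freq": str(freq)})
            term_elem.text = term
    return root


# Term frequencies as they might come back from SCAN on two indexes.
harvested = {
    "title": {"cat": 27, "catalan": 19},
    "topic": {"cat": 706},
}

doc = build_collection_document("example_db", "z3950.example.org", 210, harvested)
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```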

MetaSearch Approach

[Diagram labels: MetaSearch Server, Search Engine, Distributed Index, Map Query, Map Results, Map Explain And Scan Queries, Internet, DB 1, DB2, DB 3, DB 4, Db 5, Db 6.]

Known Problems
  • Not all Z39.50 Servers support SCAN or Explain
  • Solutions:
    • Probing for attributes instead of explain (e.g. DC attributes or analogs)
    • We also support OAI and can extract OAI metadata for servers that support OAI
  • Collection Documents are static and need to be replaced when the associated collection changes

Evaluation
  • Test Environment
    • TREC Tipster and FT data (approx. 3.5 GB)
    • Partitioned into 236 smaller collections based on source and (for TIPSTER) date by month (Distributed Search Testbed built by French, et al.)
      • High size variability (Range from 1 to thousands of docs)
      • 21,225,299 Words, 142,345,670 chars total for harvested records
  • Efficiency (old data)
    • Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average)
    • Average of 14.07 seconds excluding FT (131 seconds for FT database with 7 indexes)
    • Now collecting more information, so harvest times are longer, but still under one minute on average

Evaluation
  • Effectiveness
    • Still working on evaluation comparing our DB ranking with the TIPSTER relevance judgements
    • Can be compared with published selection methods (CORI, GlOSS, etc.) using the same testbed

Future
  • Testing of variant algorithms for ranking collections
  • Application to real systems and testing in a production environment (Archives Hub)
  • Logically Clustering servers by topic
  • Meta-Meta Servers (treating the MetaSearch database as just another database)

Distributed Metadata Servers

[Diagram labels: Database Servers, Meta-Topical Servers, General Servers, Replicated servers.]

Conclusion
  • A lot of interesting work to be done
    • Redesign and development of the Cheshire II system
    • Evaluating new meta-indexing methods
    • Developing and Evaluating methods for merging cross-domain results (or, perhaps, when to keep them separate)

Further Information
  • Full Cheshire II client and server source is available at ftp://cheshire.berkeley.edu/pub/cheshire/
    • Includes HTML documentation
    • Also on Berkeley Digital Library Software Distribution CD
  • Project Web Site http://cheshire.berkeley.edu/
