CS 430 / INFO 430 Information Retrieval

CS 430 / INFO 430 Information Retrieval
Lecture 13: Architecture of Information Retrieval Systems

Course Administration Assignment 2 Deadline changed to midnight on Sunday, October 9 There is major electrical work in Upson Hall on Saturday and most of the computers and labs will not be available.

Course Administration
Midterm Examination: Wednesday, October 12, 7:30 to 9:00, Upson B17
The topics to be examined are all lectures and discussion class readings before the midterm break.
See the Web site for a sample paper from a previous year.
See the Web site for instructions about laptop computers.

Course Administration
Discussion Class on October 19: This class will be held in Phillips Hall 213.

Notation
[Diagram legend: Docs = documents; Catalog = file of catalog records; Index = searchable index file; UI = user interface; User interface service. Other symbols mark a human action, an automatic process, and physical objects.]

Single Homogeneous Collection: Full Text Indexing
• Documents and indexes are held on a single computer system (which may comprise several computers).
• Information retrieval uses a full text index, which may be tuned to the specific corpus.
Examples: SMART, Lucene
[Diagram labels: Docs, Index, Build index, Search]
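As a rough illustration of full text indexing on a single system, the following is a minimal sketch of an inverted index with Boolean AND search. It is not how SMART or Lucene are implemented; the stoplist, the sample documents, and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stoplist; a real system would tune this to the corpus.
STOPLIST = {"the", "a", "of", "and", "is", "in"}

def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPLIST]

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Illustrative documents only.
docs = {
    1: "Architecture of information retrieval systems",
    2: "Full text indexing of a document collection",
    3: "Retrieval from a full text index",
}
index = build_index(docs)
print(sorted(search(index, "full text")))  # [2, 3]
```

The "build index" step runs once over the collection; the "search" step then touches only the postings for the query terms.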

Single Homogeneous Collection: Use of Catalog Records
• Documents may be digital or physical objects, e.g., books.
• Documents are described by catalog records generated manually (or sometimes automatically).
• Information retrieval uses an index of catalog records.
Example: Library catalog
[Diagram labels: Docs, Catalog, Index, Create catalog, Build index, Search]

Several Similar Collections: One Computer System
• Several more or less similar collections are held on a single computer system.
• Each collection is indexed separately using the same software, procedures, algorithms, etc. (but tuned for each collection, e.g., different stoplists).
Example: PubMed
[Diagram labels: Docs (three collections), Index (three indexes), Build indexes, Search]
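A small sketch of "same software, different tuning": one indexing class is instantiated once per collection, each with its own stoplist. The collection names, stoplists, and documents are hypothetical and are not taken from PubMed.

```python
import re
from collections import defaultdict

class CollectionIndex:
    """One index per collection; the code is shared, the stoplist is not."""

    def __init__(self, name, stoplist):
        self.name = name
        self.stoplist = set(stoplist)
        self.postings = defaultdict(set)   # term -> set of document ids

    def add(self, doc_id, text):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            if term not in self.stoplist:
                self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))

# "patient" is too common to be useful in the medical collection,
# so it is a stopword there but an ordinary term in the legal one.
medical = CollectionIndex("medical", stoplist={"the", "of", "patient"})
legal = CollectionIndex("legal", stoplist={"the", "of", "court"})
medical.add("m1", "Treatment of the patient")
legal.add("l1", "The patient sued in court")

print(medical.search("patient"))  # []
print(legal.search("patient"))    # ['l1']
```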

Distributed Architecture: Standard Search Protocols
Strict adherence to standards allows any user interface to search any conforming search service.
[Diagram labels: Index 1, Index 2]

Standard Search Protocols
Example: Z39.50 Family of Standards for Searching Library Catalogs
The Z39.50 family of standards has proved successful in a tightly knit community, where:
• There is a strong tradition of standardization, with many professionally trained people.
• The categories of material change gradually, allowing a slow-moving standardization process.
The standardization approach has failed where these two criteria are not met.
Historic note: WAIS was based on an early version of Z39.50.

Z39.50 principles
• Servers store a set of databases with searchable indexes.
• Interactions are based on a session.
• The client opens a connection with the server(s), carries out a sequence of interactions, and then closes the connection.
• During the course of the session, both the server and the client remember the state of their interaction.

State in Z39.50
• The server carries out the search and builds a result set.
• The server saves the result set.
• Subsequent messages from the client can reference the result set.
• Thus the client can narrow a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again.
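The idea of server-side state can be sketched with a toy session object that keeps named result sets. This is not the Z39.50 protocol or its message format; the class, method names, and sample records are assumptions chosen only to show how a later request can reference an earlier result set.

```python
class SearchSession:
    """Toy stateful session: the server side keeps named result sets."""

    def __init__(self, database):
        self.database = database      # record id -> record text
        self.result_sets = {}         # result set name -> list of record ids
        self.is_open = True

    def search(self, name, term, within=None):
        """Search the whole database, or only an earlier result set if `within` is given."""
        self._check_open()
        candidates = self.result_sets[within] if within else sorted(self.database)
        hits = [rid for rid in candidates if term.lower() in self.database[rid].lower()]
        self.result_sets[name] = hits  # saved for later requests
        return len(hits)

    def present(self, name, start, count):
        """Return records from a saved result set without searching again."""
        self._check_open()
        ids = self.result_sets[name][start:start + count]
        return [(rid, self.database[rid]) for rid in ids]

    def close(self):
        """End the session; the saved state is discarded."""
        self.result_sets.clear()
        self.is_open = False

    def _check_open(self):
        if not self.is_open:
            raise RuntimeError("session is closed")

# Illustrative records only.
database = {
    1: "History of library catalogs",
    2: "Library automation",
    3: "Catalog records and MARC",
}
session = SearchSession(database)
print(session.search("A", "library"))              # 2
print(session.search("B", "catalog", within="A"))  # 1
print(session.present("B", 0, 10))                 # [(1, 'History of library catalogs')]
session.close()
```

The second search examines only the two records in result set A, which is the point of keeping state on the server.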

Standards
Z39.50 Family of Standards for Searching Library Catalogs
Content: Anglo-American Cataloging Rules
Structure of Content: MARC
Encoding Rules: Base Encoding Rules (character sets, separators, etc.)
Message Passing Protocol: Z39.50
Query Format: Bib-1 (Boolean), Type 102 (full text)
In addition, there are the underlying network standards, e.g., the Internet suite of protocols.

Distributed Architecture: Meta-search (Broadcast Search)
• A user interface service broadcasts a query to several indexes and merges the results.
• Can be used with full text or catalogs.
Example: Dienst
[Diagram labels: Index 1, Index 2, Index n, UI, User interface service, Search, Searches]

Distributed Architecture: Broadcast Search
Interface Service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet).
Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z39.50, HTTP).

Distributed Architecture: Broadcast Search
Problems with Broadcast Search
• Performance: If any collection does not respond, the interface service waits for a timeout.
• Recall: If any collection does not respond, documents in that collection are not found.
• Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections.
Broadcast searching is as bad as its weakest link!
Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization.
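The three problems can be seen in a small sketch of a broadcast search, assuming each collection is represented by a Python function that returns (document id, score) pairs. The collection functions, scores, and timeout values are hypothetical. Merging by maximum score is a deliberately naive choice: scores from different collections are generally not comparable, which is exactly the ranking difficulty noted above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def broadcast_search(collections, query, timeout):
    """Send the query to every collection; merge whatever returns within `timeout` seconds."""
    executor = ThreadPoolExecutor(max_workers=len(collections))
    futures = {executor.submit(fn, query): name for name, fn in collections.items()}
    done, not_done = wait(list(futures), timeout=timeout)
    executor.shutdown(wait=False)  # do not block on slow collections

    merged = {}
    missing = [futures[f] for f in not_done]   # timed out: their documents are lost
    for future in done:
        if future.exception() is not None:
            missing.append(futures[future])
            continue
        for doc_id, score in future.result():
            # Naive duplicate handling: keep the highest score seen.
            merged[doc_id] = max(score, merged.get(doc_id, 0.0))
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked, sorted(missing)

# Hypothetical collections.
def collection_a(query):
    return [("doc1", 0.9), ("doc2", 0.4)]

def collection_b(query):
    return [("doc2", 0.7), ("doc3", 0.2)]

def collection_c(query):
    time.sleep(1.0)               # simulates a collection that responds too late
    return [("doc4", 1.0)]

collections = {"a": collection_a, "b": collection_b, "c": collection_c}
ranked, missing = broadcast_search(collections, "retrieval", timeout=0.2)
print(ranked)
print(missing)
```

Every query costs at least the timeout whenever one collection is slow, and the best-scoring document in the slow collection never reaches the user.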

Union Catalog
• Catalog records from several libraries are merged into a single union catalog.
• Information retrieval uses an index of the records in the union catalog.
Example: Harvard University's Hollis system
[Diagram labels: Docs (two collections), Union Catalog, Index to Union Catalog, Create catalog records, Build index, Search]

Use of Union Catalogs
Batch indexing: Metadata about all items is accumulated in a central system.
Real-time searching: The user (a) searches the central index, (b) retrieves catalog records, (c) retrieves documents from collections.
[Diagram labels: Docs (two collections), Union Catalog, Index to Union Catalog, Search, Retrieve]
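The batch and real-time steps can be sketched as follows, assuming each library supplies records with a shared identifier so that duplicates can be merged. The record ids, titles, library names, and field names are all hypothetical.

```python
from collections import defaultdict

def build_union_catalog(library_catalogs):
    """Batch step: merge per-library records into one catalog keyed by record id."""
    union = {}
    for library, records in library_catalogs.items():
        for record in records:
            entry = union.setdefault(
                record["id"], {"title": record["title"], "holdings": []}
            )
            entry["holdings"].append(library)
    return union

def build_index(union):
    """Batch step: index the title words of every union catalog record."""
    index = defaultdict(set)
    for record_id, entry in union.items():
        for term in entry["title"].lower().split():
            index[term].add(record_id)
    return index

# Illustrative catalogs only.
library_catalogs = {
    "library_a": [
        {"id": "rec-1", "title": "Information Retrieval"},
        {"id": "rec-2", "title": "Library Automation"},
    ],
    "library_b": [
        {"id": "rec-1", "title": "Information Retrieval"},
    ],
}
union = build_union_catalog(library_catalogs)
index = build_index(union)

# Real-time steps: (a) search the central index, (b) retrieve the catalog
# record, (c) learn which collections hold the document.
for record_id in sorted(index.get("retrieval", set())):
    record = union[record_id]
    print(record_id, record["title"], record["holdings"])
```

Only step (c) contacts the individual collections, so a slow or unavailable library does not affect searching.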

Building Union Catalogs: Harvesting
• Each collection makes a copy of its metadata (catalog records) available from a server associated with the collection.
• A search service harvests metadata from all collections on a regular cycle and builds a central search system.
Advantages ...
• Can index material from databases without explicit URLs.
• Allows authentication and selection of material.
but ...
• Requires that collections have metadata and support a harvesting protocol (e.g., the Open Archives Initiative Protocol for Metadata Harvesting).

OAI Verbs
• Identify: repository characteristics
• ListMetadataFormats: DC required
• ListSets: repository partitioning
• ListRecords: (selectively) harvest metadata
• ListIdentifiers: (selectively) harvest metadata identifiers
• GetRecord: known item retrieval

OAI-PMH Key technical features
• Simple HTTP encoding
• Built on established XML standards
• Multiple metadata formats, but Dublin Core required
• Repository partitioning (sets)
• Selective harvesting (sets and dates)
• Clean partition between core and implementation-specific extensions
• Multiple item-level metadata
• Collection-level metadata
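Because OAI-PMH requests are plain HTTP requests with the verb and its arguments as query parameters, a harvester's requests can be sketched as simple URL construction. The base URL, set name, date, and record identifier below are hypothetical; the helper only builds URLs and does not contact any repository.

```python
from urllib.parse import urlencode

OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListRecords", "ListIdentifiers", "GetRecord",
}

def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for one of the six verbs."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for key, value in arguments.items():
        if value is not None:
            # `from` is a Python keyword, so callers pass `from_` instead.
            params[key.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"

BASE = "http://repository.example.org/oai"  # hypothetical repository

# Repository characteristics.
print(oai_request_url(BASE, "Identify"))

# Selective harvesting by set and date, in Dublin Core (oai_dc).
print(oai_request_url(BASE, "ListRecords",
                      metadataPrefix="oai_dc", from_="2005-10-01", set="physics"))

# Known item retrieval.
print(oai_request_url(BASE, "GetRecord",
                      identifier="oai:repository.example.org:1234",
                      metadataPrefix="oai_dc"))
```

A harvester running on a regular cycle would typically issue ListRecords with a `from` date set to its previous harvest, so that only new or changed records are transferred.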

Open Archives Initiative Protocol for Metadata Harvesting
See: http://www.openarchives.org/
Herbert Van de Sompel and Carl Lagoze, "The Santa Fe Convention of the Open Archives Initiative." D-Lib Magazine, 6(2), 2000. http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html

Web Searching: Architecture
• Documents stored on many Web servers are indexed in a single central index. (This is similar to a union catalog.)
• The central index is implemented as a single system on a very large number of computers.
Examples: Google, Yahoo!
[Diagram labels: Docs on Web server (two servers), Index to all Web pages, Build index, Search]

Use of Web Search Service
Batch indexing: Each Web page is brought to the central location and indexed.
Real-time searching: The user (a) searches the central index, (b) retrieves documents (Web pages) from the original location.
[Diagram labels: Docs on Web server (two servers), Index to all Web pages, Search, Retrieve]

Web Searching: Building the Index
Documents are Web pages. Each document is:
• identified by Web crawling
• copied to a central location
• indexed and added to the central index
After indexing, the documents are usually discarded, but a cached copy may be retained.
Web searching is the topic of Lectures 19-21 and Discussion Classes 9 and 10.
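The three steps above can be sketched with a simulated Web held in a dictionary, so that the example runs without network access. The URLs, page texts, and the `keep_cache` option are hypothetical; a real crawler would fetch pages over HTTP and parse links from HTML.

```python
from collections import defaultdict, deque

def crawl_and_index(web, seeds, keep_cache=False):
    """Breadth-first crawl of a simulated Web, indexing each page as it is copied."""
    index = defaultdict(set)   # term -> set of URLs
    cache = {}                 # optional cached copies
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        page = web.get(url)
        if page is None:
            continue                       # dead link: nothing to copy
        text, links = page                 # "copied to a central location"
        for term in text.lower().split():  # "indexed and added to the central index"
            index[term].add(url)
        if keep_cache:
            cache[url] = text              # otherwise the copy is discarded
        for link in links:                 # "identified by Web crawling"
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index, cache

# Simulated Web: URL -> (page text, outgoing links). Page d has no inbound links.
web = {
    "http://example.org/a": ("information retrieval", ["http://example.org/b"]),
    "http://example.org/b": ("retrieval systems",
                             ["http://example.org/a", "http://example.org/c"]),
    "http://example.org/c": ("web crawling", []),
    "http://example.org/d": ("hidden database page", []),
}
index, cache = crawl_and_index(web, ["http://example.org/a"])
print(sorted(index["retrieval"]))
print("hidden" in index)
```

Page d is never indexed because no crawled page links to it, which mirrors the limitation that crawlers cannot gather material unless explicit URLs are known.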

Web Crawling
Advantages of Web crawling
• Entirely automatic, low cost. Highly efficient at gathering very large amounts of material.
but ...
• Can only gather openly accessible materials.
• Cannot gather material in databases unless explicit URLs are known.
• Cannot easily make use of metadata provided by collections.

Standardization: Function Versus Cost of Acceptance
[Figure: axes labeled "Function" and "Cost of acceptance"; regions labeled "Few adopters" and "Many adopters".]

Example: Textual Mark-up
[Figure: ASCII, HTML, XML, and SGML placed on axes labeled "Function" and "Cost of acceptance".]
