Resource Discovery Strategies

Resource Discovery Strategies CS 502 – 20030324 Carl Lagoze – Cornell University Acknowledgements: Luis Gravano Andreas Paepcke Bill Arms

Function versus cost of acceptance Cost of acceptance Z39.50 SDLIP/STARTS Metadata Harvesting google Function

Web Search Strategies – Metadata Harvesting metadata

Author Title Abstract Identifer ? Web Search Strategies – Metadata Harvesting metadata

Web Search Strategies – Automated Indexing “central” index ?

Automated Information Discovery Creating metadata records manually is labor-intensive and hence expensive. The aim of automated information discovery is for users to discover information without using skilled human effort to build indexes.

Resources for Automated Information Discovery Computer power brute force computing ranking methods automatic generation of metadata The intelligence of the user browsing relevance feedback information visualization

Similarity Ranking • Ranking methods using similarity • Measure the degree of similarity between a query and a document (or between two documents). • Basic technique is the vector space model with term weighting. Similar Requests Documents Similar: How similar is document to a request?

Vector Space Methods: Concept n-dimensional space, where n is the total number of different terms (words) in a set of documents. Each document is represented by a vector, with magnitude in each dimension equal to the (weighted) number of times that the corresponding term appears in the document. Similarity between two documents is the angle between their vectors. Much of this work was carried out by Gerald Salton and colleagues in Cornell's computer science department.

Example 1: Incidence Array terms in d1 -> ant ant bee terms in d2 -> bee hog ant dog terms in d3 -> cat gnu dog eel fox terms ant bee cat dog eel fox gnu hog d1 1 1 d2 1 1 1 1 d3 1 1 1 1 1 Weights: tij = 1 if document i contains term j and zero otherwise

Reasons for Term Weighting • Similarity using an incidence matrix measures the • occurrences of terms, but no other characteristics of • the documents. • Terms are more useful for information retrieval if • they: • appear several times in one document (weighting by term frequency) • only appear in some documents (weighting by document frequency) • appear in short document (weighting by document length)

Inverse Document Frequency Concept A term that occurs in a few documents is likely to be a better discriminator that a term that appears in most or all documents.

Why extend traditional IR for the Web • Traditional TREC benchmarks are relatively small scale (large is about 20GB) • Web queries are very short • Extreme variability of quality • Web has context and “hints” • Structure of pages (e.g. html title might be rated higher) • Implicit metadata of link context • Anchor text of citing pages • Weighting influenced by citation structure (recall Garfield)

PageRank Algorithm (Google) Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

Page Ranks Citing page P1 P2 P3 P4 P5 P6 P1 1 1 1 P2 1 P3 1 P4 1 1 1 1 P5 1 1 P6 1 1 Cited page Number 2 1 4 2 2 2

Normalize by Number of Links from Page Citing page P1 P2 P3 P4 P5 P6 P1 1 0.25 0.5 P20.25 P30.5 P40.50.25 0.5 0.5 P50.5 0.5 P60.25 0.5 = B Cited page Number 2 1 4 2 2 2

Calculating the Weights Initially all pages have weight 1 w1 = Recalculate weights w2 = Bw1 = Iterate wk+1 = Bwk until iteration convergences 1.75 0.25 0.50 1.75 1.00 0.75 1 1 1 1 1 1

Page Ranks (Basic Algorithm) Iteration converges when w = Bw This w is the high order eigenvector of B It ranks the pages by links to them, normalized by the number of citations from each page and weighted by the ranking of the cited pages

PageRank Intuitive Model A user: 1. Starts at a random page on the web 2a. With probability p, selects any random page and jumps to it 2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page 3. Repeats Step 2a and 2b a very large number of times Pages are ranked according to the relative frequency with which they are visited. The basic algorithm described on the previous slides is with p = 0

Web Search Strategies – Federated Searching Search Client ?

Z39.50 http://www.loc.gov/z3950/agency/

Aims of Z39.50 • Permits one computer, the client, to search and retrieve information on another, the database server • Important both technically and for its wide use in library systems • Most development has concentrated on bibliographic data • Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records • Built on notion of common protocol as interoperability paradigm

Technical history • Z39.50 • Developed for X.25 networks (connection orientation), conversion to run over TCP fitted later • Original concept in days when repeating a search was expensive computation (about 1980)

Z39.50 principles • Abstract view of database searching. • Server stores a set of databases with searchable indexes • Interactions are based on a session • The client opens a connection with the server, carries out a sequence of interactions and then closes the connection. • During the course of the session, both the server and the client remember the state of their interaction.

State • Z39.50 • The server carries out the search and builds a results set • Server saves the results set. • Subsequent message from the client can reference the result set. • Thus the client can modify a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching entire database.

Z 39.50 services • init -- client connects to the server and exchanges initial information, e.g., preferred message size • explain -- client inquires of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options • search -- client presents a query to a database choices of syntax for specifying searches • • only Boolean queries widely implemented • • one or more records may be returned to the client

Z 39.50 services manipulation of results sets -- e.g., sort or delete present -- requests the server to send specified records from the results set to the client in a specified format • options: for controlling content and formats for managing large records or large results sets

Problems with Z39.50 • Very difficult to implement • There are freely available implementations, but they are complex • Outdated assumptions • Searching is expensive computationally • Bandwidth is limited (ASN.1 compression) • Originally designed for bibliographic record retrieval, and not full documents or other objects • “Overspecified” • (Almost) Nobody Implements Explain! • Assumes questionable user model (stateful)

ZING – update of Z39.50 concepts • http://www.loc.gov/z3950/agency/zing/zing-home.html • SRW (Search and Retrieval for the Web) • Retain basic Z39.50 concepts (state, explain, flexible access points or metadata formats) • Simplifications and modernizations (statefulness, use of XML/SOAP) • CQL (Common Query Language) • Expressive common query language with XML definition

Web Search Strategies - Metasearching Metasearch Engine ?

What is “Metasearching”? • Given many document sources and a query, a metasearcher: • Finds the good sources for the query • Evaluates the query at these sources • Merges the results from these sources Metasearcher Existing Web Application Unindexed Documents Legacy Database / WAIS / etc.

Metasearching Issues • How to query different types of sources? • How to combine results and rankings from multiple data sources? Metasearcher http://…/getTitle? title=‘biomedical’&… SELECT title FROM articles . . . grep ‘biomedical’ *.txt

Metasearching Issues . . . Cont’d • How to choose among multiple data sources? • How to get metadata about multiple data sources? Metasearcher Best: http://….?getMetaData Worst: “Hi. What do you have?” cat *.txt SELECT SCHEMA …….

Simple Digital Library Interoperability Protocol http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/

SDLIP • Compromise between a full-scale, all encompassing search middleware design such as Z39.50 and the “anything goes” approach typical for ad-hoc search interface design on web • Support for stateful and stateless operation by the server • Build on notion of mediators as interoperability paradigm • Support for thin clients, such as handheld devices • Developed jointly by Stanford, Berkeley, and UC Santa Barbara

SDLIP – search middleware

SDLIP Interfaces • Search Interface – defines simple query language, protocol can then include other languages • Result Interface – parking meter metaphor supports varying notions of results sets • Source Metadata Interface – provides extension mechanism through discovery server capabilities

Result Access Interface • This interface allows client applications to access the set of result documents, wherever that set is maintained • Four services: • getSessionInfo • getDocs • extendStateTimeout • cancelRequest

Source Metadata Interface • Provides information about the service and server itself, such as • Collections served • Collection metadata/content information • Searchable properties • Three operations • getInterface • getSubcollectionInfo • getPropertyInfo

STARTS/SDARTS http://sdarts.cs.columbia.edu/default.html

STARTS • Stanford Protocol Proposal for Internet Retrieval and Search • Joint work of Stanford Digital Library Project and Cornell Digital Library Research Group • SDARTS – current work at Columbia to extend STARTS • Integrate with SDLIP and metadata harvesting (OAI-PMH) • Include “deep web” automated content summary concepts

Different text search engines are largely incompatible • Different query languages (the query-language problem) • Different ranking algorithms (the rank-merging problem) • No exported information about sources (the metadata problem)

Rank Merging • Return information in query result to allow rank merging: • unnormalized score of the document • statistics about each query term

We cannot merge document ranks from different sources directly • Search engines use different ranking algorithms: DB1: (doc1, 0.7), (doc2, 0.3) DB2: (doc3, 1000), (doc4, 400) Merged rank? • Some algorithms depend on the source characteristics

Extra information helps merge document ranks meaningfully Sources return query results and statistics: Query: "distributed databases" DB1: (doc1, 0.7) "distributed" appears 3 times in doc1"databases" appears 5 times in doc1

author=Hopcroft? Hopcroft doc8 Tarjan doc9 Tarjan doc6 Wilensky doc7 Hopcroft doc1, doc2 Hartmanis doc3, doc4 Motivating Source MetadataRouting Problem - Disjoint Search Sources Hopcroft I1, I3 Hartmanis I3 Tarjan I1, I2 Wilensky I2 I1,I3 doc1, doc2 doc8 Content Summary I1 I2 I3

Source Metadata • Data to help select the right sources for a query source metadata attributes - what the source engine can do source content summary - what the source engine can search • Simplified form of Z39.50 “explain” service

Source metadata attributes • Fields Supported • Modifiers Supported • Score Range • Ranking Algorithm ID

Source Content Summary For each source: • Vocabulary • Document frequency for each word • Total number of postings for each word • Number of documents • Implementation of GLOSS work: • GlOSS: Text-Source Discovery over the Internet, L. Gravano, H. Garcia-Molina, A. Tomasic, in ACM Transactions on Database Systems, vol. 24, no. 2, Jun. 1999

Deep Web Content Summary Extraction via Focused Query Probing

Resource Discovery Strategies