
Metasearching

Metasearching. CS 502 – 20020312. Carl Lagoze – Cornell University. Acknowledgements: Luis Gravano, Andreas Paepcke. Web Search Strategies – Crawling (“central” index). Web Search Strategies – Metadata Harvesting (metadata: Author, Title, Abstract, Identifier).


Presentation Transcript


  1. Metasearching. CS 502 – 20020312. Carl Lagoze – Cornell University. Acknowledgements: Luis Gravano, Andreas Paepcke. 20020307

  2. Web Search Strategies – Crawling
     [Diagram: a “central” index, queried with “?”]

  3. Web Search Strategies – Metadata Harvesting
     [Diagram: metadata]

  4. Web Search Strategies – Metadata Harvesting
     [Diagram: metadata with fields Author, Title, Abstract, Identifier, queried with “?”]

  5. Web Search Strategies - Metasearching
     [Diagram: Metasearch Engine, queried with “?”]

  6. What is “Metasearching”?
     Given many document sources and a query, a metasearcher:
     • Finds the good sources for the query
     • Evaluates the query at these sources
     • Merges the results from these sources
     [Diagram: a Metasearcher in front of three sources: Existing Web Application, Unindexed Documents, Legacy Database / WAIS / etc.]
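The three steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the source names, the word-set summaries, and the assumption that scores from different sources are directly comparable are all hypothetical (the rank-merging slides later explain why that last assumption usually fails).

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical source interface: a callable from query string to (doc_id, score) pairs.
Source = Callable[[str], List[Tuple[str, float]]]

def metasearch(query: str,
               sources: Dict[str, Source],
               summaries: Dict[str, Set[str]],
               top_sources: int = 2) -> List[Tuple[str, float, str]]:
    words = query.lower().split()

    # Step 1: find the good sources. Here "good" means the source's
    # vocabulary summary contains at least one query word (an assumption).
    def goodness(name: str) -> int:
        return sum(1 for w in words if w in summaries[name])

    ranked = sorted(sources, key=lambda n: (-goodness(n), n))
    chosen = [n for n in ranked if goodness(n) > 0][:top_sources]

    # Step 2: evaluate the query at the chosen sources.
    merged = []
    for name in chosen:
        for doc, score in sources[name](query):
            merged.append((doc, score, name))

    # Step 3: merge. Sorting by raw score assumes comparable scores.
    merged.sort(key=lambda r: -r[1])
    return merged

# Toy sources standing in for the three boxes on the slide.
sources = {
    "web": lambda q: [("w1", 0.9), ("w2", 0.4)],
    "legacy": lambda q: [("l1", 0.6)],
    "files": lambda q: [("f1", 0.8)],
}
summaries = {
    "web": {"biomedical", "search"},
    "legacy": {"biomedical"},
    "files": {"recipes"},
}
results = metasearch("biomedical search", sources, summaries)
```

The "files" source is never contacted because its summary shares no word with the query, which is the point of the selection step.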

  7. Metasearching Issues
     • How to query different types of sources?
     • How to combine results and rankings from multiple data sources?
     [Diagram: the Metasearcher sends a different request to each type of source:]
       http://…/getTitle? title=‘biomedical’&…
       SELECT title FROM articles . . .
       grep ‘biomedical’ *.txt
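The query-language problem on this slide amounts to translating one abstract query into each source's dialect. A small sketch, assuming a single field/value query; the base URL `http://example.org/getTitle`, the `LIKE` matching rule, and the function names are illustrative assumptions, since the slide elides the real URL and SQL details.

```python
import shlex
from urllib.parse import urlencode

def to_url(base: str, field: str, value: str) -> str:
    """Web application: encode the query as URL parameters."""
    return f"{base}?{urlencode({field: value})}"

def to_sql(table: str, field: str, value: str):
    """Database: return (statement, params) in DB-API 'qmark' style.
    The table and field names must come from trusted configuration,
    because they are interpolated into the statement."""
    return (f"SELECT {field} FROM {table} WHERE {field} LIKE ?", (f"%{value}%",))

def to_grep(value: str, glob: str = "*.txt") -> str:
    """Unindexed files: build a shell command line, quoting the search term."""
    return f"grep {shlex.quote(value)} {glob}"

url = to_url("http://example.org/getTitle", "title", "biomedical")  # hypothetical URL
sql, params = to_sql("articles", "title", "biomedical")
cmd = to_grep("biomedical")
```

Keeping the translators separate from the metasearcher means a new source type only needs one more function of this shape.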

  8. Metasearching Issues . . . Cont’d
     • How to choose among multiple data sources?
     • How to get metadata about multiple data sources?
     [Diagram: ways the Metasearcher can learn what a source holds:]
       Best: http://….?getMetaData
       Worst: “Hi. What do you have?”
       cat *.txt
       SELECT SCHEMA …….

  9. Function versus cost of acceptance
     [Plot: axes “Function” and “Cost of acceptance”; plotted points: Z39.50, SDLIP/STARTS, Metadata Harvesting, google]

  10. Z39.50 http://www.loc.gov/z3950/agency/

  11. Aims of Z39.50
     • Permits one computer, the client, to search and retrieve information on another, the database server
     • Important both technically and for its wide use in library systems
     • Most development has concentrated on bibliographic data
     • Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records

  12. Technical history
     • Z39.50
       • Developed for X.25 networks (connection-oriented); support for running over TCP was fitted later
       • Original concept dates from about 1980, when repeating a search was computationally expensive
     • WAIS is a stateless derivative of an early version of Z39.50

  13. Z39.50 principles
     • Abstract view of database searching
     • Server stores a set of databases with searchable indexes
     • Interactions are based on a session
       • The client opens a connection with the server, carries out a sequence of interactions, and then closes the connection
       • During the course of the session, both the server and the client remember the state of their interaction

  14. State
     • Z39.50
       • The server carries out the search and builds a result set
       • The server saves the result set
       • Subsequent messages from the client can reference the result set
       • Thus the client can narrow a large set with increasingly precise requests, or request a presentation of any record in the set, without searching the entire database again
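The stateful model on this slide can be illustrated with a toy session object that keeps named result sets on the server side. This is a sketch of the idea only and does not follow the Z39.50 wire protocol; the class, method names, and records are hypothetical.

```python
class ToySession:
    """Toy server-side session that remembers named result sets."""

    def __init__(self, records):
        self.records = records
        self.result_sets = {}  # result-set name -> list of record positions

    def search(self, name, predicate, within=None):
        """Run a search and save the result set under `name`.
        If `within` names an earlier result set, only that set is scanned,
        so the full database is not searched again."""
        if within is not None:
            pool = self.result_sets[within]
        else:
            pool = range(len(self.records))
        self.result_sets[name] = [i for i in pool if predicate(self.records[i])]
        return len(self.result_sets[name])

    def present(self, name, start, count):
        """Return `count` records from a saved result set, starting at `start`."""
        ids = self.result_sets[name][start:start + count]
        return [self.records[i] for i in ids]

session = ToySession([
    {"year": 1999, "topic": "metasearch"},
    {"year": 2001, "topic": "metasearch"},
    {"year": 2001, "topic": "crawling"},
])
n_all = session.search("A", lambda r: r["topic"] == "metasearch")
n_recent = session.search("B", lambda r: r["year"] >= 2000, within="A")
first = session.present("B", 0, 1)
```

The second search refines set "A" rather than the whole database, which is the behaviour the last bullet describes.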

  15. Z39.50 services
     • init -- client connects to the server and exchanges initial information, e.g., preferred message size
     • explain -- client inquires of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options
     • search -- client presents a query to a database
       • choices of syntax for specifying searches
       • only Boolean queries widely implemented
       • one or more records may be returned to the client

  16. Z39.50 services (continued)
     • manipulation of result sets -- e.g., sort or delete
     • present -- requests the server to send specified records from the result set to the client in a specified format
     • options:
       • for controlling content and formats
       • for managing large records or large result sets

  17. Sample query
     • In the database named "Books", find all records for which the access point title contains the value "evangeline" and the access point author contains the value "longfellow".
     • Z39.50 defines a rich variety of search access points that can be extended by implementers
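The sample query is a Boolean AND of two access-point clauses. The sketch below evaluates it against an in-memory stand-in for the "Books" database; it shows the semantics only, not Z39.50 query syntax, and the third record is invented so that one title match fails the author clause.

```python
def matches(record, clauses):
    """True if every (access_point, value) clause is contained in the record,
    compared case-insensitively (Boolean AND)."""
    return all(value.lower() in record.get(access_point, "").lower()
               for access_point, value in clauses)

def search(databases, db_name, clauses):
    """Return all records in the named database that satisfy every clause."""
    return [r for r in databases[db_name] if matches(r, clauses)]

databases = {
    "Books": [
        {"title": "Evangeline: A Tale of Acadie", "author": "Longfellow, Henry Wadsworth"},
        {"title": "The Song of Hiawatha", "author": "Longfellow, Henry Wadsworth"},
        {"title": "Evangeline's Garden", "author": "Smith, A."},  # invented record
    ]
}
hits = search(databases, "Books", [("title", "evangeline"), ("author", "longfellow")])
```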

  18. Problems with Z39.50
     • Very difficult to implement
       • There are freely available implementations, but they are complex
     • Outdated assumptions
       • Searching is expensive computationally
       • Bandwidth is limited (ASN.1 compression)
     • Originally designed for bibliographic record retrieval, and not full documents or other objects
     • “Overspecified”
       • (Almost) Nobody Implements Explain!
     • Assumes questionable user model (stateful)

  19. Simple Digital Library Interoperability Protocol http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/

  20. SDLIP
     • Compromise between a full-scale, all-encompassing search middleware design such as Z39.50 and the “anything goes” approach typical for ad-hoc search interface design on the web
     • Support for stateful and stateless operation by the server
     • Support for thin clients, such as handheld devices
     • Developed jointly by Stanford, Berkeley, and UC Santa Barbara
     • Heavily influenced by DASL from IETF

  21. SDLIP – search middleware

  22. Managing complexity through separate interfaces

  23. SDLIP Interfaces
     • Search Interface – defines a simple query language; the protocol can then include other languages
     • Result Interface – parking meter metaphor supports varying notions of result sets
     • Source Metadata Interface – provides an extension mechanism through discovery of server capabilities

  24. Result Access Interface
     • This interface allows client applications to access the set of result documents, wherever that set is maintained
     • Four services:
       • getSessionInfo
       • getDocs
       • extendStateTimeout
       • cancelRequest
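The "parking meter" idea behind the result interface is that the server holds result state only for a paid-up period, which the client may extend. The sketch below illustrates that idea under loud assumptions: the method names echo the getDocs and extendStateTimeout services listed on the slide, but the signatures, the explicit `now` clock argument, and the error handling are invented here and are not the SDLIP specification.

```python
class ResultState:
    """Toy server-side result state with an expiry time (parking meter)."""

    def __init__(self, docs, now, timeout):
        self.docs = docs
        self.expires_at = now + timeout  # the meter runs out at this time

    def _check(self, now):
        if now >= self.expires_at:
            raise KeyError("result state has expired")

    def extend_state_timeout(self, now, extra):
        """Feed the meter: keep the state alive until at least now + extra."""
        self._check(now)
        self.expires_at = max(self.expires_at, now + extra)

    def get_docs(self, now, start, count):
        """Fetch a slice of the result documents while the state is alive."""
        self._check(now)
        return self.docs[start:start + count]

state = ResultState(["d1", "d2", "d3"], now=0, timeout=10)
page1 = state.get_docs(now=5, start=0, count=2)
state.extend_state_timeout(now=8, extra=10)   # now valid until t = 18
page2 = state.get_docs(now=15, start=2, count=1)
```

A stateless server fits the same interface by returning all documents at once and never holding state, which is how one interface can support "varying notions of result sets".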

  25. Source Metadata Interface
     • Provides information about the service and server itself, such as
       • Collections served
       • Collection metadata/content information
       • Searchable properties
     • Three operations
       • getInterface
       • getSubcollectionInfo
       • getPropertyInfo

  26. STARTS/SDARTS
     http://www-db.stanford.edu/~gravano/starts_home.html
     http://sdarts.cs.columbia.edu/default.html

  27. STARTS
     • Stanford Protocol Proposal for Internet Retrieval and Search
     • Joint work of Stanford Digital Library Project and Cornell Digital Library Research Group
     • SDARTS – current work at Columbia to integrate with SDLIP and metadata harvesting (OAI-PMH)

  28. Different text search engines are largely incompatible
     • Different query languages (the query-language problem)
     • Different ranking algorithms (the rank-merging problem)
     • No exported information about sources (the metadata problem)

  29. Rank Merging
     • Return information in query result to allow rank merging:
       • unnormalized score of the document
       • statistics about each query term

  30. We cannot merge document ranks from different sources directly
     • Search engines use different ranking algorithms:
         DB1: (doc1, 0.7), (doc2, 0.3)
         DB2: (doc3, 1000), (doc4, 400)
         Merged rank?
     • Some algorithms depend on the source characteristics

  31. Extra information helps merge document ranks meaningfully
     Sources return query results and statistics:
       Query: "distributed databases"
       DB1: (doc1, 0.7)
         "distributed" appears 3 times in doc1
         "databases" appears 5 times in doc1
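One way to use the returned statistics is for the metasearcher to ignore the incomparable raw scores and re-score every document with a single formula of its own. The sketch below does this with the simplest possible formula, the sum of query-term counts. That formula is an assumption chosen for illustration, not the method the slides prescribe, and the DB2 term counts are made up (only the DB1 counts appear on the slide).

```python
def rescore(term_counts, query_terms):
    """Metasearcher-side score: total occurrences of the query terms.
    A deliberately simple stand-in for a real ranking function."""
    return sum(term_counts.get(t, 0) for t in query_terms)

def merge(results, query_terms):
    """results: (source, doc, raw_score, term_counts) tuples.
    Raw scores are dropped; documents are ordered by the recomputed score."""
    rescored = [(doc, rescore(counts, query_terms), source)
                for source, doc, _raw, counts in results]
    return sorted(rescored, key=lambda r: r[1], reverse=True)

query_terms = ["distributed", "databases"]
results = [
    ("DB1", "doc1", 0.7, {"distributed": 3, "databases": 5}),   # counts from the slide
    ("DB2", "doc3", 1000, {"distributed": 1, "databases": 2}),  # hypothetical counts
]
merged = merge(results, query_terms)
```

Here doc1 ranks above doc3 even though its raw score (0.7) is far smaller than 1000, because both documents are now measured on the same scale.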

  32. Motivating Source Metadata: Routing Problem - Disjoint Search Sources
     [Diagram, reconstructed from the slide labels:]
       Query: author=Hopcroft?
       Content Summary: Hopcroft: I1, I3; Hartmanis: I3; Tarjan: I1, I2; Wilensky: I2
       I1: Hopcroft doc8; Tarjan doc9
       I2: Tarjan doc6; Wilensky doc7
       I3: Hopcroft doc1, doc2; Hartmanis doc3, doc4
       The query is routed to I1, I3, which return doc1, doc2 and doc8
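The routing step on this slide can be reproduced directly: build a content summary that maps each author to the indexes holding that author, then send the query only to those indexes. The data below is taken from the slide; the function names are illustrative.

```python
# Index contents as shown on the slide: index name -> author -> documents.
indexes = {
    "I1": {"Hopcroft": ["doc8"], "Tarjan": ["doc9"]},
    "I2": {"Tarjan": ["doc6"], "Wilensky": ["doc7"]},
    "I3": {"Hopcroft": ["doc1", "doc2"], "Hartmanis": ["doc3", "doc4"]},
}

def build_summary(indexes):
    """Content summary: author -> sorted list of indexes that hold the author."""
    summary = {}
    for name, contents in indexes.items():
        for author in contents:
            summary.setdefault(author, []).append(name)
    return {author: sorted(names) for author, names in summary.items()}

def search_author(indexes, summary, author):
    """Route the query to the relevant indexes only and collect their documents."""
    routed = summary.get(author, [])
    docs = []
    for name in routed:
        docs.extend(indexes[name][author])
    return routed, sorted(docs)

summary = build_summary(indexes)
routed, docs = search_author(indexes, summary, "Hopcroft")
```

I2 is never contacted for this query, which is the saving the content summary buys.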

  33. Source Metadata
     • Data to help select the right sources for a query
       • source metadata attributes - what the source engine can do
       • source content summary - what the source engine can search
     • Simplified form of Z39.50 “explain” service

  34. Source metadata attributes
     • Fields Supported
     • Modifiers Supported
     • Score Range
     • Ranking Algorithm ID

  35. Source Content Summary
     For each source:
     • Vocabulary
     • Document frequency for each word
     • Total number of postings for each word
     • Number of documents
     Implementation of GlOSS work:
     • GlOSS: Text-Source Discovery over the Internet, L. Gravano, H. Garcia-Molina, A. Tomasic, in ACM Transactions on Database Systems, vol. 24, no. 2, Jun. 1999
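A content summary of this kind lets the metasearcher estimate, without contacting a source, how many documents there are likely to match a query. The sketch below uses document frequencies and the number of documents under a term-independence assumption, in the spirit of the GlOSS work cited on the slide. The summaries are made-up numbers and the code is an illustration, not the published system.

```python
def estimate_matches(summary, terms):
    """Estimated number of documents containing ALL terms, assuming terms
    occur independently: N * product(df_t / N)."""
    n = summary["num_docs"]
    if n == 0:
        return 0.0
    estimate = float(n)
    for t in terms:
        estimate *= summary["df"].get(t, 0) / n
    return estimate

def rank_sources(summaries, terms):
    """Order source names from most to least promising for the query."""
    return sorted(summaries,
                  key=lambda name: (-estimate_matches(summaries[name], terms), name))

# Hypothetical content summaries: number of documents and per-word document frequency.
summaries = {
    "A": {"num_docs": 100, "df": {"distributed": 40, "databases": 50}},
    "B": {"num_docs": 1000, "df": {"distributed": 10, "databases": 100}},
    "C": {"num_docs": 50, "df": {"distributed": 25}},
}
terms = ["distributed", "databases"]
order = rank_sources(summaries, terms)
```

Source B is ten times larger than A but is estimated to hold far fewer matching documents, so size alone is a poor guide to source quality.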

  36. Distributed Searching Issues: Query Routing to Replicated Sources

  37. Routing Problem: Replicated Distributed Indexes
     [Diagram: query author=Hopcroft?; two replicas of the index {Hopcroft doc8, Tarjan doc9} and two replicas of the index {Tarjan doc6, Wilensky doc7}]

  38. Routing Issues
     • Choice of primary?, secondary?, etc.
     • Fault-tolerance
     • Routing Factors
       • Performance-based
       • Freshness-based
       • Cost-based
       • weighted mix based on user preference
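The "weighted mix based on user preference" can be sketched as a weighted sum over the three routing factors, with the best-scoring live replica chosen as primary and the next as secondary. The factor names, the 0 to 1 normalisation (lower is better), and the `up` flag are assumptions made for this illustration.

```python
def route_score(replica, weights):
    """Weighted mix of routing factors; every factor is normalised so that
    lower means better, hence a lower total is better."""
    return sum(weights[factor] * replica[factor] for factor in weights)

def choose_replicas(replicas, weights):
    """Return live replica names ordered primary, secondary, and so on.
    Replicas marked as down are skipped (fault tolerance)."""
    live = [name for name, r in replicas.items() if r["up"]]
    return sorted(live, key=lambda name: (route_score(replicas[name], weights), name))

# Hypothetical replicas of one index.
replicas = {
    "r1": {"latency": 0.2, "staleness": 0.8, "cost": 0.5, "up": True},
    "r2": {"latency": 0.6, "staleness": 0.1, "cost": 0.5, "up": True},
    "r3": {"latency": 0.1, "staleness": 0.1, "cost": 0.1, "up": False},
}
prefer_speed = {"latency": 0.8, "staleness": 0.1, "cost": 0.1}
prefer_fresh = {"latency": 0.1, "staleness": 0.8, "cost": 0.1}

by_speed = choose_replicas(replicas, prefer_speed)
by_freshness = choose_replicas(replicas, prefer_fresh)
```

The same replicas are ordered differently under the two preference profiles, and r3 is never chosen while it is down even though its factors are the best.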

  39. Components of Replicated Routing Problem
     • Metadata Issue: metadata made available by indexer to aid in routing
     • Metadata Distribution Issue: topology of metadata repositories
     • Decision Issue: routing decision algorithms
     • Fault-tolerance: use of backup indexers

  40. Distributed Metadata for Query Routing
     [Diagram: central metadata store]

  41. Performance-based Routing - present
     [Diagram labels: 8 T; Timed low pass filter; Average response time; Predicted response time]
     New = low pass filter(T, actual response time, old)
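The slide gives only the shape of the update, New = low pass filter(T, actual response time, old). One common reading of a timed low-pass filter is exponential smoothing with time constant T, sketched below. The exact filter used in the lecture is not stated, so the formula and the `dt` parameter (time elapsed since the previous measurement) are assumptions.

```python
import math

def low_pass(T, actual, old, dt=1.0):
    """Predicted response time after observing `actual`.
    T is the filter time constant and dt the time since the last update.
    A larger T gives a smoother, slower-moving prediction."""
    alpha = 1.0 - math.exp(-dt / T)   # weight given to the new measurement
    return old + alpha * (actual - old)

# A replica that used to answer in 1.0 s starts answering in 3.0 s.
prediction = 1.0
one_step = low_pass(T=2.0, actual=3.0, old=prediction)
for _ in range(100):
    prediction = low_pass(T=2.0, actual=3.0, old=prediction)
```

A single slow response moves the prediction only part of the way, so one outlier does not immediately reroute queries, while a sustained change is eventually reflected in full.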
