SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching
Noah Green, Panagiotis G. Ipeirotis, Luis Gravano
Computer Science Dept., Columbia University
Web vs. “Hidden” Web
• Web
  • Link structure
  • Crawlable
• Individual collections (or “Hidden” Web)
  • No link structure
  • Documents “hidden” behind search forms
Columbia University Computer Science Dept.
Metasearching
Given many document sources and a query, a metasearcher:
• Finds the good sources for the query.
• Evaluates the query at these sources.
• Merges the results from these sources.
[Figure: a metasearcher in front of an existing Web application, non-indexed documents, and a legacy database / WAIS / etc.]
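The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Source` class, the document-frequency selection rule, and the merge by raw score are hypothetical stand-ins, not the SDARTS toolkit's API or the authors' ranking method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Source:
    """Hypothetical stand-in for a wrapped collection (not an SDARTS class)."""
    name: str
    doc_freq: Dict[str, int]                          # term -> number of documents containing it
    search: Callable[[str], List[Tuple[str, float]]]  # query -> [(doc_id, score)]


def select_sources(sources: List[Source], query: str, k: int = 2) -> List[Source]:
    """Step 1: keep up to k sources whose summaries mention the query terms."""
    terms = query.lower().split()

    def fit(source: Source) -> int:
        return sum(source.doc_freq.get(t, 0) for t in terms)

    ranked = sorted(sources, key=fit, reverse=True)
    return [s for s in ranked[:k] if fit(s) > 0]


def metasearch(sources: List[Source], query: str, k: int = 2) -> List[Tuple[str, str, float]]:
    """Steps 2 and 3: query the chosen sources, then merge their results."""
    results = []
    for source in select_sources(sources, query, k):
        for doc_id, score in source.search(query):
            results.append((source.name, doc_id, score))
    # Naive merge: assumes scores are comparable across sources.
    results.sort(key=lambda r: r[2], reverse=True)
    return results


med = Source("med", {"diabetes": 4}, lambda q: [("m1", 0.9), ("m2", 0.4)])
cs = Source("cs", {"xml": 750}, lambda q: [("c1", 0.7)])
news = Source("news", {"diabetes": 1, "xml": 2}, lambda q: [("n1", 0.6)])

merged = metasearch([med, cs, news], "diabetes", k=2)
print(merged)
```

In practice scores from different engines are rarely comparable, which is one reason the merging step is listed as an open issue on the next slide.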
Metasearching Issues
• How to evaluate the relevance of different sources?
• How to get metadata?
• How to query different types of sources?
• How to merge the results?
[Figure: the metasearcher must issue a different native query to each source: http://…/getTitle? title=‘biomedical’&… ; SELECT title FROM articles . . . ; grep ‘biomedical’ *.txt]
Solution: A Common Protocol
[Figure: the metasearcher reaches every source through the same two interfaces, S = Search and M = Metadata; behind them the sources are accessed natively (grep, cat, select, http://….)]
Why “SDARTS = SDLIP+STARTS”?
• NOT yet another protocol
• We combined existing efforts, keeping compatibility
• SDLIP defines a common interface for interacting with the sources
• STARTS defines expressive metadata that sources should export
SDARTS: Outline
• Description of SDLIP.
• Description of STARTS.
• Integration of SDLIP and STARTS into SDARTS.
• Implementation and configuration of SDARTS wrappers.
Developed during the DLI2 project by:
• Stanford University
• UC Berkeley
• UC San Diego
• UC Santa Barbara
• San Diego Supercomputer Center
• California Digital Library
SDLIP: An Interoperability Protocol
• Basic interfaces:
  • Search
  • Metadata
• A wrapper implements these interfaces
• Interface parameter and return types are XML
• Transport layer implementations (HTTP, CORBA)
• Flexible and adaptable
• Optimized for clients that know the source to query (i.e., simple requirements for metadata)
[Figure: a common SDLIP interface (S, M) placed in front of DB-specific interfaces]
STARTS: Informal Standard for Search Engine Interoperability
• Coordinated by Stanford in 1996
• Both search engine vendors and “users” participated:
  • Netscape
  • Microsoft Network
  • GILS
  • Infoseek
  • Harvest
  • Hewlett-Packard
  • Fulcrum
  • Verity
  • Wais
  • PLS
  • Excite
STARTS: A Metasearching Protocol
• Defines:
  • Query language
  • Results format
  • Metadata for the collection
• No specified transport layer or implementation
• Naturally complements SDLIP for metasearching purposes
Example of metadata:
  Stemming = no
  # of docs = 20,000
  …
  Diabetes  TF: 12, DF: 4
  XML  TF: 1200, DF: 750
  …
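To show how a metasearcher might use such metadata, here is a minimal Python sketch built on the example figures above. The `ContentSummary` class and the `doc_fraction` estimate are assumptions made for illustration; STARTS defines what metadata a source exports, not how a client scores sources with it.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ContentSummary:
    """Hypothetical in-memory view of a source's exported metadata."""
    stemming: bool
    num_docs: int
    # term -> (TF, DF): total occurrences, and number of documents containing the term
    terms: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def doc_fraction(self, term: str) -> float:
        """Estimated fraction of the collection's documents that contain the term."""
        _tf, df = self.terms.get(term, (0, 0))
        return df / self.num_docs if self.num_docs else 0.0


# Values taken from the example metadata on the slide.
summary = ContentSummary(
    stemming=False,
    num_docs=20000,
    terms={"Diabetes": (12, 4), "XML": (1200, 750)},
)

for term in ("Diabetes", "XML"):
    print(term, summary.doc_fraction(term))
```

With these numbers, "XML" appears in a far larger share of the collection than "Diabetes", so the source looks more promising for XML-related queries.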
SDARTS = SDLIP + STARTS
• Extends SDLIP with a richer metadata interface from STARTS
• Keeps compatibility with SDLIP (same DTDs)
• Can easily support similar protocols (transforming XML is easy)
• Makes wrapping collections easy through a toolkit
SDARTS: Implementation Details
• Defined STARTS using XML; the new version is named “STARTS XML.”
• Used getPropertyInfo() from SDLIP to extend SDLIP with STARTS metadata.
• Term frequency information is available through a different URL (faster download for metasearchers that do not use it).
Example of STARTS Metadata: “Content Summary”
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE starts:scontent-summary SYSTEM "http://www.cs.columbia.edu/~dli2test/dtd/starts.dtd">
<starts:scontent-summary
    xmlns:starts="http://www.cs.columbia.edu/~dli2test/STARTS/"
    version="Starts 1.0"
    stemming="false"
    stopwords="false"
    case-sensitive="true"
    fields="false"
    numdocs="19997">
  <starts:field-freq-info>
    …
    <starts:field type-set="basic1" name="body-of-text"/>
    <starts:term>
      <starts:value>algorithm</starts:value>
    </starts:term>
    <starts:term-freq>75</starts:term-freq>
    <starts:doc-freq>34</starts:doc-freq>
    …
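A client could read such a content summary with a standard XML parser. The Python sketch below uses `xml.etree.ElementTree` on a small sample modeled on the excerpt above. The sample is an assumption: it drops the XML declaration and DOCTYPE, and it closes the elements the slide elides by guessing that each `starts:field-freq-info` block holds one term with its two frequencies.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the slide's example.
NS = {"starts": "http://www.cs.columbia.edu/~dli2test/STARTS/"}

# Hypothetical, completed sample; the real STARTS XML layout may differ.
SAMPLE = """
<starts:scontent-summary
    xmlns:starts="http://www.cs.columbia.edu/~dli2test/STARTS/"
    version="Starts 1.0" stemming="false" stopwords="false"
    case-sensitive="true" fields="false" numdocs="19997">
  <starts:field-freq-info>
    <starts:field type-set="basic1" name="body-of-text"/>
    <starts:term><starts:value>algorithm</starts:value></starts:term>
    <starts:term-freq>75</starts:term-freq>
    <starts:doc-freq>34</starts:doc-freq>
  </starts:field-freq-info>
</starts:scontent-summary>
"""


def parse_summary(xml_text: str):
    """Return (numdocs, {term: (term_freq, doc_freq)}) from a content summary."""
    root = ET.fromstring(xml_text.strip())
    stats = {}
    for info in root.findall("starts:field-freq-info", NS):
        term = info.findtext("starts:term/starts:value", namespaces=NS)
        tf = int(info.findtext("starts:term-freq", namespaces=NS))
        df = int(info.findtext("starts:doc-freq", namespaces=NS))
        stats[term] = (tf, df)
    return int(root.get("numdocs")), stats


num_docs, stats = parse_summary(SAMPLE)
print(num_docs, stats)
```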
SDARTS Wrapper Design Rationale
• Goal: Isolate developer from parsing and generating STARTS XML requests and responses
• Goal: Reusability and simplicity
• SDARTS toolkits and reference implementations
  • Wrapping local text document collections
  • Wrapping XML collections
  • Wrapping HTTP/CGI interfaces
SDARTS Wrapping Architecture
[Figure labels: Internet; SDLIP LSP Client Program; Existing SDLIP Client; STARTS XML over HTTP/DASL; SDARTS Bean; FrontEnd LSP (S, M); STARTS XML; LSPObjects; BackEndLSP; Native Protocol / Search Engine]
SDARTS: Wrapper Implementation
• Standardize on STARTS as the XML protocol for SDLIP
• Create a standard wrapper architecture
• “Front-End”:
  • Implements SDLIP interfaces
  • Communicates with client using STARTS XML nested inside SDLIP method calls
• “Back-End”:
  • Communicates with front-end using simple container objects
  • Talks to underlying collection using native protocol
[Figure: STARTS XML enters the FrontEnd LSP (S, M), which exchanges LSPObjects with the BackEnd LSP, which in turn talks to the Native Protocol / Search Engine]
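The front-end/back-end split can be illustrated with a small Python sketch. Everything here is hypothetical: the class names, the `Hit` container, and the `<results>` XML are simplified stand-ins, not the toolkit's Java classes and not real STARTS XML. The point is only that the back-end author returns plain objects and never touches XML.

```python
from dataclasses import dataclass
from typing import Dict, List
from xml.sax.saxutils import escape


@dataclass
class Hit:
    """Hypothetical 'simple container object' passed between the two halves."""
    title: str
    score: float


class BackEnd:
    """Talks to the underlying collection in its native way."""

    def search(self, terms: List[str]) -> List[Hit]:
        raise NotImplementedError


class InMemoryBackEnd(BackEnd):
    """Toy back-end: counts term occurrences in an in-memory dict of documents."""

    def __init__(self, docs: Dict[str, str]):
        self.docs = docs

    def search(self, terms: List[str]) -> List[Hit]:
        hits = []
        for title, text in self.docs.items():
            words = text.lower().split()
            count = sum(words.count(t.lower()) for t in terms)
            if count:
                hits.append(Hit(title, float(count)))
        return sorted(hits, key=lambda h: h.score, reverse=True)


class FrontEnd:
    """Speaks XML to clients and delegates the actual search to a back-end."""

    def __init__(self, back_end: BackEnd):
        self.back_end = back_end

    def search(self, query: str) -> str:
        hits = self.back_end.search(query.split())
        rows = "".join(
            f'<hit score="{h.score}">{escape(h.title)}</hit>' for h in hits
        )
        return f"<results>{rows}</results>"  # illustrative format only


front = FrontEnd(InMemoryBackEnd({
    "a.txt": "biomedical text mining",
    "b.txt": "sports news",
}))
xml_out = front.search("biomedical")
print(xml_out)
```

Swapping in a different collection means writing only a new `BackEnd` subclass; the XML-facing half stays unchanged.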
Adding a Local Text Collection
• Write standard doc_config.xml file
• Regular expressions to describe where to find fields
• No coding or compilation needed!
[Figure labels: doc_config.xml; index; meta_attributes.xml; content_summary.xml; TextBackEndLSP; Lucene Search Engine; Non-indexed Text Documents]
Sample doc_config.xml
<doc-config re-index="true">
  <path>/home/dli2test/collections/doc1/20groups</path>
  <linkage-prefix>http://localhost/20groups</linkage-prefix>
  . . . . . . . .
  <stop-words><word>the</word><word>a</word></stop-words>
  . . . . . . . .
  <field-descriptor name="author">
    <start><regexp>^From: </regexp></start>
    <end><regexp>$</regexp></end>
  </field-descriptor>
  . . . . . . . .
</doc-config>
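The `author` field descriptor above pairs a start pattern with an end pattern. A rough Python equivalent is sketched below, assuming the field value is the text between the end of the start match and the next end match, with both patterns applied line by line. The toolkit's actual matching rules are not stated on the slide, and `extract_field` is a hypothetical helper.

```python
import re
from typing import Optional


def extract_field(text: str, start_pattern: str, end_pattern: str) -> Optional[str]:
    """Return the text between the first start match and the following end match."""
    start = re.search(start_pattern, text, re.MULTILINE)
    if start is None:
        return None
    # Look for the end pattern only after the start match.
    end = re.compile(end_pattern, re.MULTILINE).search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()]


# Made-up message in the style of a newsgroup post.
doc = "From: alice@example.com\nSubject: hello\n\nBody text\n"

author = extract_field(doc, r"^From: ", r"$")
print(author)
```

Here the start pattern `^From: ` anchors to the beginning of a line and the end pattern `$` stops at the end of that same line, so the extracted value is the rest of the `From:` header.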
Adding a Local XML Collection
• Write standard doc_config.xml file
• Write an XSL stylesheet to find fields in documents
• No coding or compilation needed!
[Figure labels: doc_style.xsl; doc_config.xml; index; meta_attributes.xml; content_summary.xml; Apache Xalan XSL Processor; Lucene Search Engine; XMLBackEndLSP; Non-indexed XML Documents]
Adding an External Web Collection
• Must code a custom wrapper to send correct CGI parameters and parse returning HTML
• No new code needed; uses XSLT for parsing the results
• Usually no metadata or content summary available
• Possible to automate metadata extraction:
  • [Callan et al., SIGMOD’99]: Automatic extraction of vocabulary statistics
  • [Ipeirotis et al., SIGMOD’01]: Automatic categorization of databases
  • [Raghavan and Garcia-Molina, VLDB 2001]: Automatic interaction with forms
[Figure labels: meta_attributes.xml; Web BackEnd LSP; HTTP/CGI Collection]
Conclusions
• SDARTS uses SDLIP interfaces and code (compatible with it).
• SDARTS enhances SDLIP and STARTS.
• Reference wrappers available for common collection types.
• Any text or XML document collection can be easily wrapped without new compiled code.
• Automatic metadata extraction for local collections
• Using XSLT for web wrappers
• Possible to automate the extraction of rich metadata for web-accessible collections
• New wrappers can be written without having to parse or generate STARTS XML.
• SDARTS is in Java and can run on multiple platforms.
We are on the Web :)
• Available for downloading:
  • SDARTS DTDs and documentation
  • Java code and search engine (Lucene) included
  • Source code documentation
  • Web client source code
  • Reference wrappers (text, XML, web)
  • Wrapped collections
• The web client is web-accessible for the public to test and query our SDARTS server
http://sdarts.cs.columbia.edu/
Related Work
• Metadata:
  • Open Archives
  • Dublin Core
  • MARC
  • …
• Interoperability Protocols:
  • Z39.50
  • GILS