Databases and Information Retrieval: Rethinking the Great Divide

SIGMOD Panel, 14 Jun 2005
Jayavel Shanmugasundaram, Cornell University

10000 Foot View of Data Management
[Figure: data management laid out along two divides. The Great Data Divide: structured vs. unstructured data. The Great Query Divide: complex and structured queries vs. ranked keyword search queries. Regions shown: database systems, information retrieval systems.]

Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
  • Example: Approaches based on SQL/MM
• Option 2: Extend existing DB systems with IR functionality, or vice versa
  • Example: Add searching and ranking to RDBMSs
• Option 3: Design a new data management system from the ground up
  • Example: Quark data management system

Why Option 1 Won’t Work
[Figure: the data/query diagram from the 10000 Foot View slide. Data: structured, unstructured. Queries: complex and structured, ranked keyword search. Regions shown: database systems, information retrieval systems.]

Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
  • Example: Approaches based on SQL/MM
  • Drawback: Not powerful enough
• Option 2: Extend existing DB systems with IR functionality, or vice versa
  • Example: Add searching and ranking to RDBMSs
• Option 3: Design a new data management system from the ground up
  • Example: Quark data management system

<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction"> Searching on structured text is becoming more important with XML … </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …

Find relevant elements in important workshops between the years 1999 and 2001 that are about ‘Ricardo’ and ‘XML’
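The query above mixes a structured predicate (the year range) with keyword search, and leaves open which element to return. A minimal Python sketch of one possible reading follows. It assumes a trimmed, well-formed copy of the fragment (elided parts stay elided), ignores the "important" and "relevant" parts of the query entirely, and returns the most specific element whose subtree contains every keyword. The helper names `year_of`, `text_of`, and `matching_elements` are hypothetical and are not taken from the talk or from any system it mentions.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the slide's fragment; "..." marks text elided on the slide.
DOC = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <editors>David Carmel, Yoelle Maarek, Aya Soffer</editors>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
      <abstract>We consider the recently proposed language ...</abstract>
      <section name="Introduction">Searching on structured text is becoming more important with XML ...</section>
    </paper>
  </proceedings>
</workshop>
"""


def year_of(workshop):
    # Structured predicate: the year is the last token of the date attribute.
    return int(workshop.get("date").split()[-1])


def text_of(elem):
    # All text in the element's subtree, lowercased for keyword matching.
    return " ".join(elem.itertext()).lower()


def matching_elements(root, keywords, lo, hi):
    """Tags of the most specific elements containing all keywords (assumed semantics)."""
    if not (lo <= year_of(root) <= hi):
        return []
    hits = []

    def visit(elem):
        if not all(k.lower() in text_of(elem) for k in keywords):
            return False
        # Visit every child; keep this element only if no child already matches.
        child_matches = [visit(child) for child in elem]
        if not any(child_matches):
            hits.append(elem.tag)
        return True

    visit(root)
    return hits


root = ET.fromstring(DOC)
print(matching_elements(root, ["Ricardo", "XML"], 1999, 2001))  # ['paper']
```

Under this reading the paper is returned rather than the workshop, because no single child of the paper contains both keywords. A different choice of result granularity would be just as defensible, which is the ambiguity the next slide raises.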

Why Extending (R)DBMSs Won’t Work
• Violates many assumptions “hardwired” into current database systems
  • Structured queries over structured fields, keyword search queries over text fields
    • Is author name a structured or text field?
  • Operators have precise, well-defined semantics
    • Even the query result is not well-defined: do we return a paper or a workshop?
  • Scoring is tacked on as a relational attribute
    • How can this scoring generalize IR scoring?
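The "structured or text field" question can be made concrete with a small sketch using Python's built-in sqlite3 module. The `paper` table and its columns are a hypothetical schema invented for illustration; the slides do not define one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the author name is stored as an ordinary structured column.
conn.execute("CREATE TABLE paper (id INTEGER PRIMARY KEY, author TEXT, abstract TEXT)")
conn.execute(
    "INSERT INTO paper VALUES (?, ?, ?)",
    (1, "Ricardo Baeza-Yates", "We consider the recently proposed language ..."),
)

# Treated as a structured field: exact equality, so the keyword 'Ricardo' misses.
exact = conn.execute("SELECT id FROM paper WHERE author = ?", ("Ricardo",)).fetchall()

# Treated as a text field: substring match finds the paper, but without any ranking.
keyword = conn.execute(
    "SELECT id FROM paper WHERE author LIKE ?", ("%Ricardo%",)
).fetchall()

print(exact)    # []
print(keyword)  # [(1,)]
```

The same column gives different answers depending on which kind of query the system assumes it supports, and neither form produces a relevance score.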

Why Extending IR Systems Won’t Work
• IR systems provide little support for structured data
  • No support for complex operators
    • How can complex queries be evaluated?
  • Scoring does not take structure into account
    • How can scoring capture both structured and unstructured data?

Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
  • Example: Approaches based on SQL/MM
  • Drawback: Not powerful enough
• Option 2: Extend existing DB systems with IR functionality, or vice versa
  • Example: Add searching and ranking to RDBMSs
  • Drawback: Shoehorns alien functionality into already complex systems
• Option 3: Design a new data management system from the ground up
  • Example: Quark data management system

Why Option 3 Will Work
• Designed from the ground up with three principles
  • Structural data independence
    • Users can issue any query (complex and keyword) over any data (structured and unstructured)
  • Generalized scoring
    • Scoring works over any mix of structured and unstructured data (e.g., XRank over HTML and XML)
  • Flexible query language
    • Allows for arbitrary return results and scores (e.g., TeXQuery, precursor to XQuery Full-Text, NEXI)
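To illustrate what "scoring over any mix of structured and unstructured data" could mean, here is a toy Python sketch. It is not XRank's formula or any scoring function from the systems named above. It assumes a made-up rule: keyword hits in an element's own text count fully, and hits in nested elements are damped by a hypothetical factor `DECAY` per level. A flat document with no nesting is then just the special case with nothing to damp.

```python
import xml.etree.ElementTree as ET

DECAY = 0.5  # hypothetical per-level damping factor, chosen only for illustration


def own_hits(elem, keyword):
    # Keyword occurrences in text directly under elem (its text plus its children's tails).
    parts = [elem.text or ""] + [child.tail or "" for child in elem]
    return sum(part.lower().count(keyword.lower()) for part in parts)


def score(elem, keywords):
    """Toy score: direct hits count fully; hits in nested elements are damped per level."""
    direct = sum(own_hits(elem, k) for k in keywords)
    return direct + DECAY * sum(score(child, keywords) for child in elem)


# Unstructured case: one flat element holding all the text.
flat = ET.fromstring("<html>XML search over XML text</html>")
# Structured case: the same keywords spread over nested elements.
nested = ET.fromstring(
    "<paper><title>XML search</title><section>XML text</section></paper>"
)

print(score(flat, ["XML"]))    # 2.0
print(score(nested, ["XML"]))  # 1.0
```

The same function scores both inputs, and in the nested case each sub-element can also be scored on its own (for example `score(nested[0], ["XML"])`), which is what lets a query return elements at any granularity.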

Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
  • Example: Approaches based on SQL/MM
  • Drawback: Not powerful enough
• Option 2: Extend existing DB systems with IR functionality, or vice versa
  • Example: Add searching and ranking to RDBMSs
  • Drawback: Shoehorns alien functionality into already complex systems
• Option 3: Design a new data management system from the ground up
  • Example: Quark data management system
  • Most promising alternative!
