XML + Databases = ? (DIMACS Workshop, 3/2000)

XML + Databases = ?(DIMACS Workshop, 3/2000) Mike Carey Exploratory Database Systems Department IBM Almaden Research Center carey@almaden.ibm.com

Plan for Today’s Talk • Thoughts on DB and web technologies • The web and web “querying” • Semistructured databases • Object-relational databases • XML and databases • XML/DB research at IBM Almaden • The XPERANTO project • Motivation and approach • Whirlwind tour of the system

The Web is Great at Supporting URL-Based Sharing • Ex:Online conference proceedings • Web browsers have given us • Universal file access (ftp++) • Universal document access (html) • Universal service access (forms) • What more could we navigational couch potatoes possibly want? • Universal platform for e-shopping!

The Web is Lousy at Supporting Parametric Searches • Ex:Find all the used Musicman Sterling bass guitars currently available for under $750 within a 50-mile radius of my San Jose home • This is hard for a number of reasons • Data buried in web pages, news groups, classified ads, store sites, auction sites, … • No schema (no metal fish, please!) • No data types (miles, US$, instruments) • No regularity within/across (good!) sites

Aren’t We Supposed to be the Experts on Data Management? • The DB community brought the world • Data models, schemas, and views • Query languages, optimizers, fast joins • Scalable parallel servers • Federated database systems • What do we have in our bag of tricks? • Semistructured databases • Object-relational database systems

Is Semistructured Database Technology the Answer? • Database characteristics • Collections of [name, value] pairs or maybe [name, type, value] triples • Collections typically set<any> or list<any> • System characteristics • “Typeloose” query languages • Indexes for nested, typeloose structures • Appropriate query processing techniques

Are Semistructured Databases the Answer? (2) • No, because schemas are critical for • Data readers • What info is in a given collection? • Thus, what queries might make sense? • Data writers • What should I call this piece of info? • Is it okay to put this kind of data here? • Efficient/effective query processors • Indexing, statistics, ... (e.g., range queries) • Integration mappings (e.g., unit conversions)

Are Semistructured Databases the Answer? (3) • It has some nice features, though • Flexible, dynamic schemas • Forgiving w.r.t. variations and exceptions • Schema evolution is not a big deal • Richer data modeling (vs. relational) • Nested structures, ordered collections • More powerful query languages • Blurring of schema and data querying • Ordering, nesting, restructuring handled

Is Object-Relational Database Technology the Answer? • Database characteristics • Base types, user-defined structured types, inheritance, reference types, collections • Collections are well-typed • System characteristics • Extended SQL-based query languages • Support for methods (fenced/unfenced) • Also triggers, LOBs, extensible indexes

Are Object-Relational Databases the Answer? (2) • No, because most O-R DBMSs have • Overly rigid schemas • Every instance is of one (known) type • Evolving a type can be a major burden • Distributed type management is hard • Crufty old storage managers • Ragged or sparse records poorly supported • Insufficient power in extended SQL • Prehistoric assumptions get in the way • Weak on restructuring, schema-querying

Is XML the Answer?(Yes!! ...What Was the Question Again?) • Structured documents (for the web) <book> <booktitle> Tables Are The Answer </booktitle> <author id = “cdate”> <name> <firstname> Chris </firstname> <lastname> Date </lastname> </name> <address> <city> Saratoga </city> <state> CA </state> </address> </author> </book>

Is XML the Answer? (2) • W3C’s XML Schema working group • Typed elements, attributes, documents • Simple types and complex types • Derived types (extension, restriction) • Facets, anonymous types, groups, … • Uniqueness, keys and key references • W3C’s XML Query working group • XML-QL, Xpath, XQL, XSL/T, XSQL, … • Recommendation due in late 2000 (?)

Is XML the Answer? (3) • XML Schema might help because • XML has achieved a huge mindshare for data interchange on the web • DTD standardization is happening for documents within vertical industries, and XML Schemas should take over • When finished, XML Schema should be a widely used schema description tool • Similar to O-R schemas, but with more flexibility (and web-based sex appeal)

Some Useful XML+DB Topics • Publish documents with XML Schemas from O-R databases • B2B e-commerce messages • B2C comparison shopping (if permitted!) • Robust O-R DB-resident web sites with XML for page content generation • Use XML Schema as the central data model for data integration middleware • I.e., web information integration

Useful XML+DB Topics (2) • Build a “native” XML Repository on top of an O-R DBMS • Map from XML Schema model to O-R DBMS modeling constructs • Map from XML queries to O-R queries (including tag variables and loose typing) • Thereby provide XML document storage management with industrial-strength robustness, scalability, and performance

Useful XML+DB Topics (3) • Evolve XML-QL into a complete web data manipulation language • Typing a la XML Schema • Ordered/unordered collections • XPath-inspired expressions • Easier grouping and aggregation • Updates (insert/delete, modify) • Etc.

The XPERANTO Project • Middleware for publishing O-R (or plain relational) DB content on the web • Provides a virtual XML document view • Based on a “pure XML” approach • Using XML-QL (as W3C placeholder) • Born at Almaden in summer of 1999 • Mike Carey, Dana Florescu, Zack Ives, Ying Lu, Jai Shanmugasundaram, Beau Shekita, Subbu Subramanian

The XPERANTO Belief System • Databases contain, and will continue to contain, the world’s “data jewels” • Transactional data (RDBMS) • Important multimedia assets (ORDBMS) • XML application developers of the future may not love SQL like we do • View databases as default XML documents • Let them define appropriate (query-able) views of these XML documents

XML Schema Generator Views XML Schema Catalog Info XML Tagger SQL Queries Data Tuples Query Translation Metadata Services View Services XML-QL Parser XQGM Type & Table Services Query Rewrite XQGM Table & Type Info SQL Translation O-R Database SQL Query Processor Stored Tables System Catalog XPERANTO Architecture

XPERANTO Components • XML-QL Parser • Neutral query representation (XQGM) • Query Rewrite • View composition and other rewrites • SQL Translation • Produce SQL query(s) to get the required data from the underlying DBMS • XML Tagger • Tag and structure the tabular results

XPERANTO Components • View Services • Repository for XML view definitions • Type & Table Services • Interface (and cache) for DB catalog info • XML Schema Generator • Give DB catalog info in XML Schema form for default views • Infer XML Schema info for queries and non-default view definitions

Consider a Simple O-R Schema Create TablebookAS (bookID CHAR(30), name VARCHAR(255), publisher VARCHAR(30)) Create TablepublisherAS (name VARCHAR(30), address VARCHAR(255)) Create Typeauthor_typeAS (bookID CHAR(30), first VARCHAR(30), last VARCHAR(30)) Create Table author OFauthor_type (REF IS ssn USER GENERATED)

Part of the Default XML View <simpleType name=”string255” source=”string”> <maxLength value=”255” /> </simpleType> <simpleType name=”string30” source=”string”> <maxLength value=”30” /> </simpleType> <complexType name=“bookTupleType”> <element name=“bookID” type=“string30” /> <element name=“name” type=“string255” /> <element name=“publisher” type=“string30” /> </complexType> <complexType name=“bookSetType”> <element name=“bookTuple” type=“bookTupleType” maxOccurs=“*” /> </complexType> <element name=“book” type=“bookSetType” /> . . .

XPERANTO’s Default Views • XPERANTO generates default O-R to XML Schema mappings • Each DB shown as an XML file • Subtyping handled via XML Schema’s refinement facilities • OIDs and references become ids/idrefs • “Don’t use this at home!” • Application developers are expected to define the real view(s) using XML-QL

Creating a Better XML View WHERE <library.book.bookTuple> <bookID> $bid </> <name> $name </> <publisher> $bpub </> </> IN “db2:xml:books/library”, $bpub = “Kluwer”CONSTRUCT <book id=$bid> <name> $bname </> {WHERE <library.publisher.publisherTuple> <name> $bpub </> <address> $addr </> </> IN “db2:xml:books/library” CONSTRUCT <publisher> <address> $addr </> </>} {WHERE <library.author.authorTuple> <bookID> $bid </> <first> $fname </> <last> $lname </> </> IN “db2:xml:books/library” CONSTRUCT <author first=$fname last=$lname/>} </> . . .

XPERANTO Query Rewrite • XML-QL queries first translated into XQGM representation • Neutral, well-poised for more features • Easier to go from XML-QL to SQL • Borrow rewrites from DB2 UDB engine • XQGM is an extension of DB2’s QGM • XML data type for “columns” • Set of XML-specific functions

SQL Generation and XML Document Tagging/Structuring • Sorted Outer Union queries are used to obtain the data • Fetch the data in one query that brings it back in the appropriate order • Tag and nest it to create XML document • Advantages of this approach • Shown to be stable as well as fast • Simple (linear-space) tagging possible • Just watch for nesting-related changes

Outer Union Query Example WITH OuterUnion (type, bookID, bookName, pubName, pubAddr, authFirst, authLast) AS ( SELECT ‘0’, b.bookID, b.name, NULL, NULL, NULL, NULL FROM book b WHERE b.publisher = “Kluwer”UNION ALL SELECT ‘1’, b.bookID, NULL, p.name, p.address, NULL, NULL FROM book b, publisher p WHERE b.publisher = “Kluwer” and b.publisher = p.nameUNION ALL SELECT ‘2’, b.bookID, NULL, NULL, NULL, a.first, a.last FROM book b, author a WHERE b.publisher = “Kluwer” and b.bookID = a.bookID)SELECT * FROM OuterUnion ORDER BY bookID

XPERANTO Project Summary • Goal is to publish O-R data in XML form • Default XML views • XML-QL for defining useful views • “Look Ma, no SQL!” • Currently (re)building our prototype • View composition is our first stop • Updates in addition to queries • Queries over both data and metadata • Other needs for XML web sites...?

A Few Closing Remarks • DB community must ensure that the web will support real queries…! • XML Schema and XML Query standards need ongoing input from DB researchers • Large-scale technologies needed for XML indexing, caching, querying, etc. • DB community should also work on important underlying technologies • Publishing XML both from and to RDBMSs and ORDBMSs, for example!

XML + Databases = ? (DIMACS Workshop, 3/2000)