A Cost-based Approach For Converting Relational Schemas To XML

A Cost-based Approach For Converting Relational Schemas To XML Ramon Lawrence University of Iowa ramon-lawrence@uiowa.edu http://www.cs.uiowa.edu/~rlawrenc/

Overview • The goal of this research is to convert existing relational schemas to XML schemas with minimal user input. • The contributions of this work are: • A fully automated relational to XML schema conversion algorithm that minimizes a user-specified cost function (in this case space efficiency) that does not require the user annotate the relational schema or map to an intermediate model. • The algorithm incorporates user preferences during the mapping if they are present. Thus, the mapping can be completed with no user involvement, or as much involvement as the user desires. • Empirical results demonstrating how XML nesting improves on flat translation by creating more space efficient schemas without introducing redundancy.

Motivation • Many applications are being migrated from using relational databases to using XML. This conversion can be very costly if the application designers must also re-engineer the schema. • The ability to automatically convert a relational schema to XML schema is valuable. Further, it is useful if the translation is not a “flat translation” but rather uses nesting to improve the efficiency and readability of the resulting XML schema. • Previous conversion algorithms required the user to specify a complete mapping or did not allow user-specified constraints.

Previous Work • The previous work can be put in 4 categories: • 1) Flat translation - converts to XML without using nesting. • 2) Query-based translation - conversion occurs by using a query extraction language that is an extension of SQL or a new XML query language. e.g. SilkRoute [Fernandez02] and XML publishing [Carey00]. • 3) Model-based translation - converts the relational model to an intermediate model and then maps to XML using translation rules. e.g. [Bird00,Du01,Embley01] • 4) Dependency-based translation - uses dependency information in schema to translate to XML. e.g. NeT and CoT [Lee02]. • No system is fully automated and yet at the same time respects user constraints on the final schema result. No previous work uses a cost function to return the optimal nesting.

Nesting and Efficiency • Exploiting nesting in XML has three major advantages: • 1) Improves readability by grouping related concepts. • 2) Improves space efficiency by avoiding repetition and encoding of foreign keys that can be implicitly defined by the hierarchical structure. • 3) Improves query efficiency by clustering concepts together and avoiding joins.

Space Efficiency Metric • Space efficient XML documents can be achieved by nesting as it eliminates foreign keys. For example, in TPC-H, order information can be nested under customer information, and the foreign key orders.o_custkey is not represented explicitly: • <!ELEMENT customer (c_custkey, c_name, ..., orders*)> • <!ELEMENT orders (o_orderkey, o_orderdate, ...)> • Amount of space savings: • Number of data values not stored is equal to the size of the relation (in this case orders) *(# of foreign key attributes). • Let |R| denote the number of tuples of R. The space savings of nesting relational schema R under S on key K is denoted by savings(R,S,K) and is approximated by sizeOf(K)*|R|, where sizeOf(K) is the maximum schema size of the foreign key K of R.

Nest Graph • A nest graph represents the nesting possibilities and their desirability by using existing relational schema information. • A Nest-GraphG = (V,E) is a directed, weighted graph consisting of a set of nodes V and a directed edge set E, such that for each relation R in the database schema, there exists a node VRV, and for each non-nullable foreign key FKR(S) there is an edge e=(VS,VR,w) E, where w=savings(R,S,FKR(S)). • Nest-graph for TPC-H:

XML Mapping Algorithm • The mapping from the relational model to an XML schema has these general steps: • Extract relational schema information including foreign keys and relation sizes from the database. • Use the information to build a Nest-Graph G. • Use the classical algorithm by Edmonds to calculate the maximum weight arborescence T of G. • The maximum weight arborescence is a maximal spanning tree (MST) T. Use T to generate the nesting in the XML schema. • Assume that given an edge e=(S,R,w) of T, the two relations are S(A1,A2,...,Am) and R(B1,B2,...,Bn) respectively. Let A1=PKS, B1=PKR, and Bk=FKR(S). • The primary keys and foreign keys are encoded as XML attributes, and all other relational attributes are encoded as XML elements. • Note: Introduce a new node r' to G, and a set of edges E' where each e=(r',Vi,0) E' for all i=1..|V|. This guarantees that a MST will exist.

XML Mapping Algorithm (2) • Using the maximal spanning tree T: • For each edge e=(S,R,w) of T, we nest R under S (and omit Bk): • <!ELEMENT S (A_2, ..., A_m, R*)> • <!ATTLIST S A_1 ID> • <!ELEMENT R (B_2, B_3,..., B_k-1, B_k+1,...B_n)> • <!ATTLIST R B_1 ID> • For each edge e=(S,R,w) where eG and eT, this relationship will be captured using ID/IDREF. Under the element R is a IDREF attribute like: • <!ELEMENT R (B_2, B_3,..., B_k-1, B_k+1,..., B_n> • <!ATTLIST R B_k IDREF>

Maximal Spanning Tree Result • Maximal spanning tree: • Space savings achieved: • For TPC-H at scale factor 1, the space savings is 57,849,820 characters or 14,462,455 data values. Equivalently, the savings is 43.5% of all foreign keys or 12.4% of all data values.

XML Result DTD • <!ELEMENT root (part*, region*)> • <!ELEMENT part (p_name, … , p_retailprice, partsupp*)> • <!ATTLIST part p_partkey ID> • <!ELEMENT region (r_name, r_comment, nation*)> • <!ATTLIST region r_regionkey ID> • <!ELEMENT partsupp (ps_availqty, … , lineitem*)> • <!ATTLIST partsupp ps_suppkey IDREF> • <!ELEMENT nation (n_name, … , customer*, supplier*)> • <!ATTLIST nation n_nationkey ID> • <!ELEMENT lineitem (l_linenumber, …, l_comment)> • <!ATTLIST lineitem l_orderkey IDREF> • <!ELEMENT customer(c_name, … , c_comment, orders*)> • <!ATTLIST customer c_custkey ID> • <!ELEMENT supplier (s_name, …, s_acctbal)> • <!ATTLIST supplier s_suppkey ID> • <!ELEMENT orders (o_orderdate, … , o_comment)> • <!ATTLIST orders o_orderkey ID>

User-Directed Mapping • The algorithm determines the optimal space efficient nesting of relational schemas into an XML schema given no constraints. • However, the user can control the XML schema output by specifying constraints on the XML schema that should be satisfied without having to specify a total mapping. • Four types of constraints can be incorporated into the mapping algorithm by appropriate modifications to the Nest-Graph: • 1) Non-nested nodes - by removing node’s incoming edges. • 2) Edges (nestings) that must be present - edge weight =  • 3) Edges (nestings) that must not be present - remove edge • 4)Handling nullable foreign keys - user decides on encoding; may use null-record. • User may also specify any cost function, just not the space efficiency one discussed here.

Experimental Results • Results summary: • 40-50% of foreign keys do not have to be encoded. Elimination of all foreign keys is only possible if the Nest-Graph is a tree. • 10 to 30% of all attributes are not encoded.

Future Work and Conclusions • Developed an efficient and user-friendly tool for automating construction of XML schemas from relational schemas. • System is an improvement over current approaches which either do not have formal methods for capturing user constraints or require the user to exactly specify the entire XML schema. • The space efficiency cost-metric proposed is useful, although other metrics are possible including those that factor in query costs as well. • Future work involves expanding the algorithm to consider cost functions and nestings that introduce redundancy to increase query performance at the sacrifice of space efficiency. Also, a investigation on nesting query results will be performed.

A Cost-based Approach For Converting Relational Schemas To XML Ramon Lawrence University of Iowa ramon-lawrence@uiowa.edu http://www.cs.uiowa.edu/~rlawrenc/ Thank You!

Extra Slides...

TPC-H Schema Table Attributes Part partkey, name, mfgr, brand, type, size, container, retailprice, comment Supplier supkey, name, address, nationkey, phone, acctbal, comment PartSupp partkey, suppkey, availqty, supplycost, comment Customer custkey, name, address, nationkey, phone, acctbal, mktsegment, comment Order orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment LineItem orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, returnflag, tax, linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment Nation nationkey, name, regionkey, comment Region regionkey, name, comment

Foreign Keys in TPC-H Table with Foreign Key Table Referenced Functional Dependencies LineItem Part l_partkey p_partkey LineItem Supplier l_suppkeys_suppkey LineItem PartSupp l_partkey, l_suppkeyps_partkey, ps_suppkey LineItem Order l_orderkeyo_orderkey PartSupp Part ps_partkey p_partkey PartSupp Supplier ps_suppkeys_suppkey Supplier Nation s_nationkey n_nationkey Customer Nation c_nationkeyn_nationkey Order Customer o_custkeyc_custkey Nation Region n_regionkeyr_regionkey Primary Keys Foreign Keys

Query Result Nesting • A related problem is the problem of converting the result of a query over a relational schema into XML. Space savings can be achieved by taking the de-normalized query result and applying some normalization by exploiting nesting. A query example: • SELECT n_name, c_name, o_orderdate, o_orderkey, • l_partkey, l_quantity, s_name • FROM customer C, orders O,lineitem L,supplier S, nation N • WHERE C.c_custkey = O.o_custkey AND L.l_orderkey = • O.o_orderkey AND L.l_suppkey = S.s_suppkey • AND C.c_nationkey = N.n_nationkey • This query returns over 6 million records. Using flat translation wastes space. For instance, since the cardinality of nation is 25, the XML encoding would redundantly encode approximately 6 million nation names. • The algorithm can be used with queries involving selection, projection, and joins between relations on primary/foreign keys.

Query Nesting Results • Results summary: • For the example query, nesting the query result reduces the number of data values in the XML document by 50%, which corresponds to 20,904,839 fewer values. • For other benchmark TPC-H queries, 20-30% of data values were not encoded. Even with compression, this is significant as it results in smaller files to compress.

A Cost-based Approach For Converting Relational Schemas To XML