An Empirical Study of XML Data Management in Business Information Systems

An Empirical Study of XML Data Management in Business Information Systems Speaker: 呂瑞麟國立中興大學資訊管理學系教授 Email: jllu@nchu.edu.tw URL: http://web.nchu.edu.tw/~jlu

動機 • 念頭起源於大約公元2000年 • XML剛剛興起 • 預期會有大量的XML資料/文件被交換或者儲存 • 應該會使用資料庫來處理吧？ • 因為速度比較快、同時多人存取（concurrency、locking）、資料一致性（consistency）等等優點

動機 • 要使用什麼樣的資料庫？ • 有沒有專門為XML設計的資料庫？（稱之為Native XML DB；XML原生型資料庫；NXD） • 查（google、線上論文、最好是conference的papers） • 常見的 RDBMS • 結構不同，會有什麼樣的問題？ • 目前有沒有任何學者對此發表過任何論文可以參考？

動機 • 可能可以進行的方向 • 文件還是資料？ • 從企業的角度 • 使用資料庫來處理XML的資料的方式有哪些？其各自的優缺點為何？原因為何？ • 真的能用嗎？ • Eric Jui-Lin Lu, Bo-Chan Wu, and Po-Yun Chuang, "An Empirical Study of XML Data Management in Business Information Systems", The Journal of Systems and Software, vol. 79, no. 7, July 2006, pp. 984-1000. [SCI,EI] • 從資訊科技的角度 • 現有方式的優缺點？是否能改進？ • 想法：使用Hierarchical、好的儲存方式、應用在行動設備、好的indexing方式、好的查詢方式、好的mapping方式、好的concurrency方式、好的存取控制結構等等等 • 同時涵蓋RDBMS以及NXD

目標 • a complete and systematic survey of the current development and challenges of processing XML data in relational and native XML databases • Insert a new (undefined) value • XML Query  SQL Query • a useful benchmark for IT practitioners who need to process XML data effectively • experimental results and detailed analysis which reveal several interesting tips that can be helpful to XML document designers • a conclusion, based on the findings of using native XML databases in EDI processes, that it is practical to use native XML databases for daily operations

Literature Review • XML Files • Relational Databases • XML  RDB • RDB  XML • 支援以上功能的資料庫稱之為 XML-enabled Databases • Native XML Databases

XML  RDB • 儲存時的挑戰－－結構不同

Storage Mapping • 目前在對應 (mapping) 的規則上已有許多研究的成果 • Tian et al. [2001] 的索引對應方法 • Bourret [2001] 的方法 • Table-based model • Object relational model • 利用DTD的圖型 • Element Graph (Shanmugasundaram et al. [1999] 所提) • Edge (Florescu and Kossmann [1999] 所提)

Tian et al. (2001) • 處理方式： • 把XML的文件或其內容的個別元素，視為單一的一個物件 • 將這個物件儲存至資料庫的一個欄位內 • 接著利用元素的長度作索引 • 優點 • 簡單 • 缺點 • 非常沒效率

Bourret (2001) – table-based • 把文件視為單一或是多個表格 • Used by MS SQL and Oracle • Cannot handle complex XML documents

Bourret (2001) – object-based • Create a table for each complex element • Based on cardinality, assign PK and FK

Element Graph • Depth first traversal on DTD • How to deal with attributes? • Create tables for root and nodes under “*”

Florescu & Kossmann [1999]

RDB  XML • Typical steps: • query data using XML query languages • translates XML queries into SQL statements and submits them to the underlying RDB. • the retrieved data will be mapped against some XML schema definitions called XML views. • There are two major problems with this approach (Suciu, 2001). • automatic conversions from XML queries into SQL statements generally create redundant joins. To eliminate redundant joins, a general purpose query minimization algorithm is required. However, query minimization is NP complete for conjunctive queries. • when storing XML documents into RDBs, some conversion is required, and, consequently, structure information may be lost during the conversion.

RDB之挑戰 • Multiple-table joins can be costly • Rich structure may be lost • Ordering • Elements vs. Attributes • PIs and Comments • Forward and backward mapping is costly • XML Query  SQL Query

原生型資料庫 • XML:DB consortium 的定義： • 一個原生型XML資料庫是被設計用來儲存和操作XML的資料，其資料的存取是經由XML和其相關的標準技術，如：XPath、XSLT、DOM和SAX。原生型XML資料庫必須能保持XML的結構和相關的metadata。其儲存的格式並不重要 (例如:物件導向技術，關聯式資料庫技術..等等)，但若要存取資料庫的資料時，就必須使用XML的相關技術。而原生型XML資料庫的基本儲存單元是一份XML的文件。

使用 NXD 可行嗎？ • 一般都會採用或者設計 benchmark • 目前已經存在的benchmarks有： • Micro Benchmark: Michigan benchmark (Runapongsa et al., 2002) • Application Benchmark: XBench (Yao et al., 2002a,b), XMach-1 (Bohme and Rahm, 2001), XMark (Schmidt et al., 2002), and XOO7 (Bressan et al., 2001, 2002; Nambiar et al., 2002a,b) • Update is not considered (except for XMach-1 which consider three simple update operations) • Number of files and the size of files are not considered • No comparisons between RDBs and NXDs

設計benchmark所考量的因素 • 各式各樣的查詢語言（例如：XPath，XQuery，XQL及XML-QL 等），使得系統的效能更難以評估 (Schmidt et al. [2001])。 • 只考慮功能，而非語法 • The design of XML queries is based on five different dimensions • query target • query path • conditional path • operators • functions

設計benchmark所考量的因素 • query target • Ex. Simple elements, complex elements, attributes, or whole documents • query path • conditional path • Ex. for /order[com/com_id = 1], /order is the query path and com/com_id = 1 is the conditional path. • to measure how the depth of both query path and conditional path affect the performance of databases. Ex. Depth(/order)=1 and depth(com/com_id = 1)=2. • Operators • Ex. AND, OR, NOT, etc. • Functions • Ex. Count(), max(), min(), etc.

設計benchmark所考量的因素 • Missing elements (or null value) • RDB contains a null value • NXD contains nothing • Insertions • In addition to previous factors, also consider the number of files and the size of files • Ex. One file for each document • Insert or append • Insert a complex element may involve several tables

設計benchmark所考量的因素 • Deletions • The number of elements to be deleted • The distribution of to-be-deleted elements • Remove empty files • Modifications • The content of elements • The element itself?

我們的評估環境 • 硬體 • 資料 • 台灣花卉市場的資料（之前的研究） • 6000 purchase orders (about 7.5 MB) were randomly generated and saved in either 1, 6, or 6000 files • 軟體 • NXDs • eXcelon 3.1 ,dbXML 1.0b2, Apache Xindicce version 1.0, Software AGs Tamino XML Server version 3.1.1.4 • RDBs • Microsoft SQL Server 2000, Software AGs Tamino XML Server version 3.1.1.4

我們的評估環境 • SQL Tables • Edge 產生1個表格 • Element Graph 產生2個表格 • Relational Object 產生4個表格

我們的評估環境 • Because of the structure differences, the update operations performed on RDBs and NXDs are quite different. • Deleting an element in NXDs means removing a node and all of its sub-nodes completely. However, in RDBs, deleting an element only means putting null values to the fields corresponding to the node and all of its subnodes. • In NXDs, to create a new element, it is required to add both element names and their values. • Index on NXDs: “index on element name” or “index on element value” • indexing on an element with unique values and on non-unique values.

Experimental Results: Queries

Experimental Results: Query Depth

Experimental Results: Elements vs. Attributes

Experimental Results: Insert

Summary • It is found that RDBs in general perform better than NXDs in processing XML data. • If the structure of XML documents is complex, the XML documents need to be flatten before further processing. • When complex elements are inserted and whole documents are inserted or deleted, NXDs run faster than RDBs.

Summary • It is shown that the time required to process attributes, including query, delete, insert, and update, is shorter than the time required to process elements. Also, the shorter the depths of query path and conditional path, the faster the databases to process XML data. • Without losing readability, it is recommended to encode data in attributes rather than elements and design XML documents as ‘‘flat’’ as possible. One example approach is to use attributes to replace elements. For example, <com><City>Taipei</City></com> can be replaced by <com City = ‘‘Taipei’’></com>. Designing ‘‘flat’’ XML documents will improve not only the performance of NXDs but also RDBs. This is because less tables are needed to represent ‘‘flat’’ XML documents in RDBs.

Summary • For NXDs, appending data to the end of files is faster than inserting data in the beginning of files in most cases. • The number of documents saved in a file and the number of files saved in NXDs do have impact on the performance. Therefore, it is a good practice for developers to evaluate different alternatives when employing NXDs. • It is shown that the element graph method outperforms the other methods

Summary • In general, MS-SQL outperforms Oracle • the query result generated by SQLXML is actually saved in a table of one field with each record contains no more than 2033 characters, while the query result generated by XDK is a string formatted in XML style • Oracle runs faster than MS-SQL if programs retrieve data directly from tables and insert tags without using either SQLXML or XDK. • In other words, the reason why MS-SQL outperforms Oracle is because SQLXML is more effective than XDK.

Issues worth further investigation • It is clear that the number of documents saved in a file has significant impact on the performance of a native XML database. However, the details of how many documents are saved in a file should be hidden from developers. • The database management system itself should be able to optimize the storage dynamically. • As shown in the experimental results on both eXcelon and Tamino, when all documents are saved in one big file, adding indexes does not improve the query performance much. • the development of effective indexing schemes become imperative. • Almost all RDBs provide some sort of utilities to automatically convert XML documents into tables. However, at the current stage, only XML documents with simple structure can be processed.

個案分析研究 • 訪問對象是電子產業中 • 生產顯示器的中光電(CTX)和誠洲(ADI) • 生產主機板的華碩(ASUS)及環電(USI)

目前業界EDI處理流程

EDI 流程 • All orders placed by customers are temporarily stored in a temporary zone. These orders are in plain text format. • For every pre-determined period of time, the orders are automatically transformed into a database. Each order is saved into three tables which are orders, order-items, and other-info. Two additional fields are added into the table order. One is status which indicates the current status of an order. • other-info contains miscellaneous information about the order such as when this order is received. The main purpose of other-info is for dispute resolution if occurred. • After the orders in the database are processed by backend systems, the processed orders are written into a table called sales-orders. The content of the table salesorders are mainly retrieved from the tables orders and order-items. Additionally, whether or not the orders are approved (either pass or fail) and the actual sales quantities are written into sales-orders. • Later, the sales orders will be delivered to the customers who placed the orders.

EDI 流程 • It is assumed that orders and sales orders are all encoded in XML format. Also, it is assumed that when company A places orders to company B, there are already 5900 orders in company B’s database. • After certain period of time, company B appends the orders in the temporary zone to the database. Supposed that 100 orders are added each time. • After appending the orders to the database, company B appends two simple elements status and date to each order. • Back-end systems then query these 100 orders. After these orders are processed, back-end systems write either pass or fail and the actual sales quantity for each sales order to the database. Moreover, the information for sales orders are retrieved from the tables orders and order-items and written into the table salesorders. • It is also assumed that company A cancels two orders. In other words, the values of status are modified from a to c.

EDI 流程 • the following operations are summarized: • 100 orders are added twice. One is in step 1, and the other is in step 3. • For each order, 4 simple elements are added, and they are status, date, and the fields for storing pass or fail as well as the actual sales quantity. • 100 orders are queried twice. All happen in step 3. • Two simple elements are modified in step 4.

情況模擬 • 如依上述情況則 • SQL (Element Graph) 6957*2+10*4+157*2+10*2=14288(毫秒) • eXcelon(1 file) 9280*2+594*4+808*2+2577*2=27706 (毫秒) • eXcelon(6 file) 6976*2+618*4+1018*2+694*2=19848 (毫秒) • eXcelon(6000 file) 9952*2+945*4+1392*2+891*2=28250 (毫秒)

情況模擬 (cont.) • 假設另一情況，把新增筆數改為10筆，而其餘流程不變則 • SQL (DTD Graph) 5985*+10*4+157*2+10*2=12344(毫秒) • eXcelon(1 file)848*2+594*4+808*2+2577*2=10842(毫秒) • eXcelon(6 file) 731*2+618*4+1018*2+694*2=7358(毫秒) • eXcelon(6000 file)1444*+945*4+1392*2+891*2=11234(毫秒)

未來工作 • The missing pieces (discussed in motivation) • Mobile databases • Document-centric applications

An Empirical Study of XML Data Management in Business Information Systems