Scalable Approach for Processing Large XML Data

A scalable approach to processing large XML data volumes

Dr. Tim Weitzel, Institute of IS, Frankfurt University, tim@xml-network.de
Dr. Thomas Tesch, Infonyte GmbH, Darmstadt, tesch@infonyte.de
Dr. Peter Fankhauser, Fraunhofer IPSI, Darmstadt, fankhauser@ipsi.fhg.de

“One half of the world uses XML ... the other half has to”
• increasing XML penetration and data volumes
  • document management, content management
  • data and process integration
  • deregulated electricity markets
  • straight-through processing in stock trading (“garage clearing”)
• challenge: develop scalable XML tools
  • IETD (3.5 GB of XML manuals)
  • trading platform integration
  • 40,000 transactions every hour
  • 1 MB SWIFT = 10 MB swiftML = 100 MB RAM consumption
• → main memory as the bottleneck

XML and main memory
• scalability is challenging even on huge systems; it is often an absolute rather than a relative problem
• try editing the 3.5 GB XML manual of a Boeing airplane with XML Spy
• reason: DOM implementations represent the entire DOM tree in main memory
• depending on the XML document and the DOM implementation, a main-memory DOM can be up to 20 times as big as the textual XML
• analogous for XSLT: a 20 MB XML document requires 200-400 MB
• EDI example: SWIFT → swiftML
• scalability problem: main memory restrictions, mobile devices, embedded systems
• many architectures don't require permanent XML storage but rather import data into an “XML warehouse” (complementary to relational systems) for subsequent processing (XSLT, XPath, XML Schema validation → aggregation, synchronization, retrieval → filter, format, transform)
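The memory overhead of an in-memory DOM can be observed with a small experiment. The following is a minimal sketch using Python's standard `xml.dom.minidom` and `tracemalloc`; the synthetic document and the helper names `make_document` and `dom_memory_ratio` are assumptions made for illustration. The measured ratio depends on the DOM implementation and the shape of the document, so it should not be expected to reproduce the figures quoted on the slide.

```python
import tracemalloc
from xml.dom import minidom


def make_document(n_records: int) -> str:
    """Build a synthetic XML document (hypothetical data, for illustration only)."""
    rows = "".join(
        f'<cd id="{i}"><title>Title {i}</title><artist>Artist {i}</artist></cd>'
        for i in range(n_records)
    )
    return f"<catalog>{rows}</catalog>"


def dom_memory_ratio(xml_text: str) -> float:
    """Return (peak bytes allocated while building the DOM) / (bytes of textual XML)."""
    text_bytes = len(xml_text.encode("utf-8"))
    tracemalloc.start()
    dom = minidom.parseString(xml_text)  # the whole tree is materialized in memory
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    dom.unlink()  # break reference cycles so the tree can be freed
    return peak / text_bytes


if __name__ == "__main__":
    ratio = dom_memory_ratio(make_document(2000))
    print(f"in-memory DOM is roughly {ratio:.1f}x the textual size")
```

Because the whole tree must exist before any node can be read, the peak grows with document size, which is the bottleneck the slide describes.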

XML processing

IDB – Infonyte Data Base
• IDB uses Persistent DOM (PDOM)
• result of more than 10 person-years of OO/XML database research at Germany's main think tank
• compact, binary, indexed XML format for representing DOM (directly processing well-formed XML)
• basic elements of IDB:
  • PDOM
  • persistent XSLT processor (PXSLT)
  • query engines for XPath, XQL
  • document collection support
  • XML workbench

PDOM
• PDOM stores and accesses XML documents according to the W3C DOM API
• binary representation of XML instances, accessed using the DOM Level 2 interface
• also: structural indices for reconstructing document sequence and increasing query performance; PDOM engine for optimizing the allocation of XML documents between main and secondary memory
• PDOM can store up to 2^30 XML nodes or 1 terabyte of XML
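Since PDOM is accessed through the W3C DOM API, client code looks like ordinary DOM traversal. The sketch below uses Python's in-memory `xml.dom.minidom`, which implements a subset of the W3C DOM, purely as a stand-in; it is not the IDB/PDOM API, and the sample document and the function name `titles_by_id` are invented for illustration.

```python
from xml.dom import minidom

# Hypothetical sample data, not taken from FreeDB.
SAMPLE = """<catalog>
  <cd id="a1"><title>Low</title><artist>David Bowie</artist></cd>
  <cd id="b2"><title>Heroes</title><artist>David Bowie</artist></cd>
</catalog>"""


def titles_by_id(xml_text: str) -> dict:
    """Walk the tree with standard DOM calls and map each cd id to its title."""
    doc = minidom.parseString(xml_text)
    result = {}
    for cd in doc.getElementsByTagName("cd"):
        title = cd.getElementsByTagName("title")[0]
        # firstChild is the text node below <title>; .data is its content
        result[cd.getAttribute("id")] = title.firstChild.data
    return result


# Node limit quoted on the slide: 2^30 nodes
MAX_PDOM_NODES = 2 ** 30  # 1,073,741,824
```

A persistent implementation of the same interface would let this kind of code run unchanged while the nodes are paged in from secondary storage instead of being held in RAM.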

Architecture
• modular (e.g. use parts of IDB as a highly scalable XML backend for the J2EE-conforming IBM WebSphere Application Server)
• PDOM
• IDB components have 400-800 KB code size and require 16 MB RAM
• access the system via command line, web server, or Java interfaces
• can use schema-less XML
• all index and storage structures are derived from the XML instance → no need to define mappings onto physical data models (as in relational systems and some XML databases)
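The idea of deriving index structures from the XML instance itself, without a schema or a mapping, can be sketched as a simple path index. This is an illustrative assumption about what such a structure might look like, not a description of PDOM's actual indices; `build_path_index` is a hypothetical helper built on Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_path_index(root):
    """Map each element path (e.g. 'catalog/cd/title') to the elements found there."""
    index = defaultdict(list)

    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}" if prefix else elem.tag
        index[path].append(elem)  # document order is preserved within each path
        for child in elem:
            visit(child, path)

    visit(root, "")
    return index


# Usage: the set of paths is discovered from the data, no schema is declared.
root = ET.fromstring(
    "<catalog><cd><title>Low</title></cd>"
    "<cd><title>Heroes</title><year>1977</year></cd></catalog>"
)
index = build_path_index(root)
```

Note that the optional `year` element shows up in the index only because it occurs in the instance; nothing had to be declared up front.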

IDB component architecture

Performance
• test using an XML-ified version of the freely available FreeDB CD database (FreeDB 2002)
• FreeDB consists of about 500,000 CD descriptions
• XML version is about 500 MB
• on a standard PC (1.8 GHz, 512 MB RAM):
  • parsing and PDOM creation (32 million XML nodes, 400 MB) including all structural indices takes about 4 minutes (~2 MB/s)
  • generating a user-defined index for all CD keys (indexes 548,000 nodes or 1.7% of the entire database) takes about 88 seconds
  • generating a full-text index (28 million nodes, 89% of the entire database) takes 17 minutes, resulting in an index size of 90 MB
  • XSLT processing (generating HTML) reaches a throughput of up to 10 MB per second
  • searching for CDs with particular titles or tracks using the full-text index delivers first results within 5-10 milliseconds; subsequent hits are analogous
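The derived rates can be recomputed from the raw figures on this slide. The short calculation below uses only the quoted numbers (which are themselves rounded); the function name `derived_rates` is an assumption for illustration.

```python
def derived_rates():
    """Recompute throughput and index rates from the figures quoted on the slide."""
    xml_size_mb = 500            # size of the XML version of FreeDB
    parse_seconds = 4 * 60       # parsing and PDOM creation
    total_nodes = 32_000_000     # nodes in the PDOM
    key_nodes = 548_000          # nodes covered by the CD key index
    key_index_seconds = 88
    fulltext_nodes = 28_000_000
    fulltext_seconds = 17 * 60

    return {
        "parse_mb_per_s": xml_size_mb / parse_seconds,
        "key_share_percent": 100 * key_nodes / total_nodes,
        "key_nodes_per_s": key_nodes / key_index_seconds,
        "fulltext_nodes_per_s": fulltext_nodes / fulltext_seconds,
    }


if __name__ == "__main__":
    for name, value in derived_rates().items():
        print(f"{name}: {value:,.2f}")
```

The parsing rate comes out at about 2.1 MB/s and the key index share at about 1.7%, matching the rounded values given above.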

Search results for “bowie” on “bbc”
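A full-text lookup of this kind can be sketched with a plain inverted index. The example below is an illustrative assumption, not the Infonyte index format: it reads the query as a conjunction of two terms, uses an invented three-record sample, and the names `build_fulltext_index` and `search_all` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample records, for illustration only.
SAMPLE = """<catalog>
  <cd id="1"><title>Bowie at the Beeb</title><track>BBC session 1968</track></cd>
  <cd id="2"><title>Low</title><artist>David Bowie</artist></cd>
  <cd id="3"><title>BBC Sessions</title><artist>Led Zeppelin</artist></cd>
</catalog>"""


def build_fulltext_index(xml_text):
    """Map each lower-cased word to the set of cd ids whose text contains it."""
    index = defaultdict(set)
    for cd in ET.fromstring(xml_text).iter("cd"):
        text = " ".join(cd.itertext())
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(cd.get("id"))
    return index


def search_all(index, *terms):
    """Boolean AND: sorted ids of the cds that contain every term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_fulltext_index(SAMPLE)
hits = search_all(index, "bowie", "bbc")
```

Once the index exists, a query only intersects small posting sets instead of scanning the documents, which is why first hits can arrive quickly even on large collections.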

Scalability

Applications I
• XML Warehouse
  • business process integration
  • consolidating data from different information systems into one common XML representation
  • all data is then reformatted, e.g. for publishing on a web server, using XSLT or XQL/XPath commands
• large US-based financial information and service provider
  • based on IDB, an application was developed for individualized messaging and for feeding a web portal that lets customers retrieve their individual transaction data in real time
  • the Infonyte system receives 10 GB of raw XML data every day, indexes it, and makes it available for ten days
  • significant savings from processing these large amounts of data directly, with access times in the millisecond range
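The warehouse pattern described here (consolidate records into one XML representation, keep them for ten days, select per customer) can be sketched as follows. Everything in this example is an assumption for illustration: the element and attribute names, the record layout, and the helper functions are hypothetical and are not taken from the deployed system. The sketch uses Python's `xml.etree.ElementTree` with its limited XPath support.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

RETENTION_DAYS = 10  # from the slide: data is kept available for ten days


def to_common_xml(records):
    """Merge (source, customer, day, amount) tuples into one <transactions> document."""
    root = ET.Element("transactions")
    for source, customer, day, amount in records:
        tx = ET.SubElement(root, "transaction", source=source, date=day.isoformat())
        ET.SubElement(tx, "customer").text = customer
        ET.SubElement(tx, "amount").text = str(amount)
    return root


def purge_expired(root, today):
    """Drop transactions older than the retention window; return how many were removed."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    expired = [
        tx for tx in root.findall("transaction")
        if date.fromisoformat(tx.get("date")) < cutoff
    ]
    for tx in expired:
        root.remove(tx)
    return len(expired)


def for_customer(root, customer):
    """Select one customer's transactions via a child-text predicate."""
    return root.findall(f"./transaction[customer='{customer}']")
```

A per-customer selection like `for_customer` corresponds to the individualized portal view; in a production system the predicate would be answered from an index rather than by scanning the tree.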

Applications II
• Interactive Electronic Technical Documentation (IETD)
  • the aviation industry has a long SGML history; many systems are now browser-based XML applications
  • main challenge: designing a distributed authoring environment with a centralized data repository and an efficient production process for compiling and formatting electronic manuals for different user groups
• Sikorsky Aircraft Corporation
  • XML-IETD system based on Infonyte
  • IDB is used for the production process as well as for providing the documents via a web server
  • production: the Infonyte XSLT processor is the key element for demand-driven compilation of large XML data volumes
  • subsequent usage of the technical manuals in a reading environment: Infonyte is used as a client-side tool that enables XML query languages to retrieve relevant document fragments
  • the architecture helped Sikorsky realize substantial cost and service improvements

Applications III
• Mobile Information Management
  • challenge
  • low memory consumption, platform independence through Java, and the compact PDOM format make Infonyte the ideal XML-based mobile application kernel
• Mobile Sales Force Automation
  • US-based Vaultus (http://www.vaultus.com) used Infonyte technology as the foundation of its mobile information platform. In addition to data management, the system offers offline capabilities, secure transactions, network independence, and remote maintenance services

Performance
• performance of IDB on mobile devices
• developed a mobile demo scenario using the full FreeDB
• with a limited version consisting only of the data server, the PDOM, and the index and collection APIs (about 300 KB in total), the full FreeDB demo runs on a Pocket PC (iPAQ Pocket PC H3800 with 64 MB RAM, 32 MB ROM, 206 MHz ARM processor, 1 GB IBM Microdrive, PersonalJava 1.2, Insignia Jeode)
• using the indices, response time for Boolean search on this limited platform is 1-2 seconds; searching for a single criterion is even faster

[Figure labels, IDB component architecture: XML Application; Command Line, Servlet, Java API; Collection API, W3C DOM API, XQL, XPath, XSLT, XQuery; Algebraic Query Optimizer; Persistent DOM (PDOM), Index Manager; Dataserver; I/O Manager; PDOM File, RDBMS, Paged I/O, Main Memory]

Performance: an EDI example
[Figure labels, process chain Source → Production → Destination: EDI, XML, Message, SWIFT, FIX; Import, Checkin, Checkout, Replace; Reuse, Search, Assembly, Validate; Formatting, Filtering, Transformation, Aggregation; PDOM, SWIFTML, FpML; Web, CD-ROM, PDF + Print]

SWIFT2XML
• processing SWIFT messages with XML
• SWIFT to XML
  • developed parser
  • fully XML-ified (i.e. no information loss)
  • generic XML
  • multi-step optimization of the process chain, trading off bandwidth and document construction time (multiple calculations like PDOM creation and full-text index)
• XML processing
  • processing of well-formed XML
  • storage as PDOM
  • access using full-text indices and data indices
  • visualization using XSLT, integration with web server
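The conversion step (tagged SWIFT fields into generic XML, without information loss) can be illustrated with a toy parser. This is a minimal sketch under stated assumptions: it handles only `:tag:value` lines with continuation lines, the sample message is invented, and the `<message>`/`<field>`/`<line>` element names are hypothetical rather than the swiftML vocabulary or the authors' parser. A round trip back to text is used to check that nothing was lost.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample in the style of a SWIFT MT text block (":tag:value" lines).
SAMPLE_MT = """:20:REF12345
:23B:CRED
:32A:030115EUR1000,00
:50K:/12345678
ACME CORP
:59:/87654321
GLOBEX LTD"""

FIELD_START = re.compile(r"^:(\d{2}[A-Z]?):(.*)$")


def mt_to_xml(text):
    """Turn each ':tag:value' field into a generic <field tag="..."> element."""
    root = ET.Element("message")
    current = None
    for line in text.splitlines():
        match = FIELD_START.match(line)
        if match:
            current = ET.SubElement(root, "field", tag=match.group(1))
            ET.SubElement(current, "line").text = match.group(2)
        elif current is not None:
            # a continuation line belongs to the most recent field
            ET.SubElement(current, "line").text = line
    return root


def xml_to_mt(root):
    """Inverse mapping, used to check that no information was lost."""
    out = []
    for field in root.findall("field"):
        lines = [ln.text or "" for ln in field.findall("line")]
        out.append(f":{field.get('tag')}:{lines[0]}")
        out.extend(lines[1:])
    return "\n".join(out)
```

Keeping the mapping generic (one element type per field, tag as an attribute) means new field tags need no parser changes, at the cost of more verbose XML, which is consistent with the size growth from SWIFT to XML noted earlier in the deck.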

tim@xml-network.de
Download IDB, FreeDB etc.: www.infonyte.com
Papers etc.: http://tim.weitzel.com
