
XPEDIA: XML Processing for Data Integration Amit Shvarchenberg and Rafi Sayag

XPEDIA: XML Processing for Data Integration, Amit Shvarchenberg and Rafi Sayag. Manish Bhide, Manoj K Agarwal, IBM India Research Lab, India {abmanish, manojkag}@in.ibm.com; Amir Bar-Or, Sriram Padmanabhan, IBM Software Group, USA {baroram, srp}@us.ibm.com


Presentation Transcript


  1. XPEDIA: XML Processing for Data Integration, Amit Shvarchenberg and Rafi Sayag. Manish Bhide, Manoj K Agarwal, IBM India Research Lab, India {abmanish, manojkag}@in.ibm.com; Amir Bar-Or, Sriram Padmanabhan, IBM Software Group, USA {baroram, srp}@us.ibm.com; Srinivas K. Mittapalli, Girish Venkatachaliah, IBM Software Group, India {smittapa, girish}@in.ibm.com

  2. XPEDIA - Introduction • XPEDIA stands for “XML Processing for Data Integration” • XML documents have become popular • XPEDIA is designed to improve data integration for XML documents • XPEDIA uses parallelization and an ELT flow

  3. ETL In Databases • Extract, transform, and load (ETL): • Extracting data from outside sources • Transforming data to fit operational needs • Loading it into the end target (database or data warehouse)

  4. Typical ETL Scenario With XML

  5. Typical ETL Scenario With XML

  6. Zoom-In Flow-1

  7. The Read_XML_Table operator simply reads the XML documents

  8. XML Hierarchy Tree

  9. The Equi-Hierarchical Join operator • The operator iterates over every “Country” sub-tree in the XML document • It finds the set of employees working in each department in that country • It creates a new element named “Dept2”, which contains a list of all employees working in that department
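The join described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the slides do not show the input document, so the `Employee` element and the `id`, `dept`, `name` and `salary` attributes are hypothetical, and the function is not XPEDIA's operator implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: element and attribute names are assumptions for illustration.
SAMPLE = """
<Company>
  <Country name="IL">
    <Dept id="d1"/>
    <Dept id="d2"/>
    <Employee name="Amit" dept="d1" salary="100"/>
    <Employee name="Shay" dept="d2" salary="80"/>
    <Employee name="Rafi" dept="d1" salary="120"/>
  </Country>
</Company>
"""

def equi_hierarchical_join(root):
    """Within each Country scope, add one Dept2 element per Dept,
    containing copies of the employees whose dept key equals the Dept id."""
    for country in root.findall("Country"):
        employees = country.findall("Employee")  # snapshot before we add nodes
        for dept in country.findall("Dept"):
            dept2 = ET.SubElement(country, "Dept2", {"id": dept.get("id")})
            for emp in employees:
                if emp.get("dept") == dept.get("id"):  # the equality condition
                    dept2.append(ET.Element("Employee", dict(emp.attrib)))
    return root

root = equi_hierarchical_join(ET.fromstring(SAMPLE))
joined = {d.get("id"): [e.get("name") for e in d.findall("Employee")]
          for d in root.iter("Dept2")}
```

The join never looks outside the current `Country` sub-tree, which is what makes it hierarchical: employees in one country are never matched against departments in another.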

  10. The Aggregation operator • The operator calculates the total salary of all the employees in a department • It adds the result to the XML document as “totalSalary”
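A minimal Python sketch of this aggregation step, assuming each “Dept2” element already holds its employees and that salaries are stored in a hypothetical `salary` attribute (the slides do not show the actual document layout):

```python
import xml.etree.ElementTree as ET

# Hypothetical joined document; attribute names are assumptions.
JOINED = """
<Company>
  <Country name="IL">
    <Dept2 id="d1">
      <Employee name="Amit" salary="100"/>
      <Employee name="Rafi" salary="120"/>
    </Dept2>
    <Dept2 id="d2">
      <Employee name="Shay" salary="80"/>
    </Dept2>
  </Country>
</Company>
"""

def add_total_salary(root):
    """Sum the salaries inside each Dept2 and store the sum as a totalSalary child."""
    # list(...) takes a snapshot so the tree can be modified safely while looping
    for dept in list(root.iter("Dept2")):
        total = sum(int(e.get("salary")) for e in dept.findall("Employee"))
        ET.SubElement(dept, "totalSalary").text = str(total)
    return root

root = add_total_salary(ET.fromstring(JOINED))
totals = {d.get("id"): int(d.findtext("totalSalary")) for d in root.iter("Dept2")}
```

The sum is computed from scratch for every department, mirroring the idea that an aggregation restarts for each item of its scope.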

  11. The Shredder Operator • The operator writes the totalSalary values from the modified XML document to the relational database.

  12. Problem • Today, databases support a limited representation of XML documents • Processing an XML document requires full extraction and parsing of the document • XML documents grow larger with time • A need for complex transformations has arisen

  13. Problem – Computational Model • Relational data is represented in the form of rows and columns • In this model, each XML document is represented as a single row and a single column. • There is a need for a technique that handles complex data flows while preserving the simple specification

  14. Problem – Scalability • In relational data, the size of a row/tuple is seldom larger than a few KB • XML documents, which are composed of many small objects, often grow to over 1 GB

  15. The Solution – XPEDIA • Computational Model • ELT Support • Scalability – parallelism

  16. XPEDIA Computational Model

  17. XPEDIA Computational Model • XPEDIA uses a dataflow model consisting of operators and edges • The key difference in XPEDIA model: • The data that flows between operators is an ordered list of XML documents that comply with a single XML schema

  18. Example <Haifa> <High_Value_Customers> <City name=“Haifa”/> <Customer name=“Amit”/> . . . <Customer name=“Shay”/> </High_Value_Customers> List: <Customer_Vector, City_Vector> <Amit, …, Shay>

  19. XPEDIA Computational Model (cont.) • Operators can iterate over a sub-vector of a document object • The iterated vector is defined as the “scope” vector of the operator

  20. XML Operators • Filter operator: • Filters one of the vectors within a scope • Project Operator: • Iterates over a single vector and generates a new output vector that is based on a set of select expressions
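The two operators can be illustrated with a short Python sketch. It reuses the customer names from the earlier example slide; the `balance` attribute, the `Label` output vector and both function names are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# One scope instance; the balance attribute is an assumption for illustration.
DOC = """
<High_Value_Customers>
  <City name="Haifa"/>
  <Customer name="Amit" balance="900"/>
  <Customer name="Shay" balance="150"/>
</High_Value_Customers>
"""

def filter_vector(scope, tag, predicate):
    """Filter: drop items of the `tag` vector inside `scope` that fail `predicate`."""
    for item in scope.findall(tag):
        if not predicate(item):
            scope.remove(item)

def project_vector(scope, tag, out_tag, selects):
    """Project: iterate over one vector and emit a new `out_tag` vector.
    `selects` maps each output attribute name to a select expression."""
    for item in scope.findall(tag):
        attrs = {name: expr(item) for name, expr in selects.items()}
        ET.SubElement(scope, out_tag, attrs)

scope = ET.fromstring(DOC)
filter_vector(scope, "Customer", lambda c: int(c.get("balance")) >= 500)
project_vector(scope, "Customer", "Label",
               {"text": lambda c: c.get("name").upper()})

customers = [c.get("name") for c in scope.findall("Customer")]
labels = [l.get("text") for l in scope.findall("Label")]
```

Only the `Customer` vector is touched by the filter; the `City` vector in the same scope is left unchanged.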

  21. XML Operators – Aggregate Operator • Produces statistics by aggregating one of the vectors. The aggregation restarts for each scope item

  22. XML Operators – Equi-Hierarchical-Join • Performs an equality-based join between two vectors that are contained within the scope instance

  23. XML Operators – Read/Write Table • Read Table Operator • Reads all the rows of a single table and outputs a relational tuple or XML document • Write Table Operator • Used for writing relational or XML data into a table

  24. XML Operators – Output Stage Operator Input: • Department: /Company/Country/Dept • Project: /Company/Country/Emplyee/PName • Emp ID: /Company/Country/Emplyee/Einfo/EmpID
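One way to read this slide is that each relational column is placed at a target path in the output XML. The Python sketch below illustrates that reading only. It assumes one `Country` sub-tree per input row, which the slide does not state, and the function names are hypothetical. The paths are copied from the slide as written.

```python
import xml.etree.ElementTree as ET

# Column -> target path, copied from the slide (spelling preserved).
MAPPING = {
    "Department": "/Company/Country/Dept",
    "Project": "/Company/Country/Emplyee/PName",
    "Emp ID": "/Company/Country/Emplyee/Einfo/EmpID",
}

def ensure_path(parent, steps):
    """Walk `steps` below `parent`, creating any missing element on the way."""
    node = parent
    for step in steps:
        child = node.find(step)
        if child is None:
            child = ET.SubElement(node, step)
        node = child
    return node

def output_stage(rows, mapping):
    """Build a hierarchical document from flat rows (assumption: one Country per row)."""
    root = ET.Element("Company")
    for row in rows:
        country = ET.SubElement(root, "Country")
        for column, path in mapping.items():
            steps = path.strip("/").split("/")[2:]  # steps below /Company/Country
            ensure_path(country, steps).text = str(row[column])
    return root

rows = [{"Department": "R&D", "Project": "XPEDIA", "Emp ID": 7}]
root = output_stage(rows, MAPPING)
xml_text = ET.tostring(root, encoding="unicode")
```

Because `ensure_path` reuses existing nodes, the two columns that share the `Emplyee` prefix end up under the same element instead of creating two siblings.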

  25. ELT Optimization In XPEDIA

  26. ELT • ELT (Extract, Load, Transform) • Takes parts of the ETL job flow and converts them into SQL/XML queries • ELT is a technique to gain efficiency and performance by shifting a significant share of the processing into the database

  27. ELT In XPEDIA • Databases such as DB2 9, Oracle 11g and SQL Server 2005 have built-in XQuery and SQL/XML query engines. • XPEDIA applies rewriting techniques to transform parts of the ETL job flow into SQL/XML

  28. How XPEDIA converts ETL to ELT • The following tasks are required for converting ETL to ELT: 1. Rewrite the ETL flow in terms of simpler operators. 2. Convert each operator into a SQL/XML query. 3. Merge the SQL/XML queries of adjacent operators into a single SQL/XML query. 4. Convert the merged SQL/XML queries to an ELT job definition which can be executed on XPEDIA.

  29. Simplify The ETL Flow • Most of the operators in XPEDIA can be directly converted to a SQL/XML query • Complex operators, like the OutputStage, are difficult to translate to SQL/XML queries directly • We need to rewrite complex operators in terms of simpler operators

  30. Example • The algorithm to convert the OutputStage operator to a set of simpler operators Step 1: Apply the XMLize operator on the relational data to obtain a flat XML document

  31. Example (cont.) Step 2:

  32. Example (cont.) Step 3: Use the Project Operator to add and drop nodes, so that every output node ends up at its correct height. Step 4: Use the Project Operator to change the names of nodes

  33. Query Generation and Merging • The XPEDIA ELT optimizer has a set of algorithms for converting operators to SQL/XML queries. • The XPEDIA ELT optimizer uses a set of rules for merging these SQL/XML queries.

  34. Generating The ELT Job Definition • The generated SQL/XML queries are mapped to the XPEDIA job definition • XPEDIA translates the job definition to a Read Table operator, and the rest of the ETL flow remains the same

  35. The Result • We can now use a single SQL/XML query to replace the operators between the XML data source and the RDBMS • ELT allows us to use only Read/Write table operators • Benefits: reduction of the size of the data that needs to be moved

  36. XPEDIA ELT Conclusion • XPEDIA is able to use the native XML processing capabilities of the database engine to greatly improve performance. • If the database does not have native XML support, or the data is present in a flat file, XPEDIA cannot use the ELT optimizer

  37. XPEDIA Parallel Processing

  38. Parallel Processing of XML Data • XPEDIA supports two types of job parallelism: • Pipeline: each operator is handled by a different resource • Partitioning: the XML document is divided into several partitions, each processed separately
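The difference between the two kinds of parallelism can be sketched in Python. This is a toy model under stated assumptions: the `parse` and `double` stages are hypothetical stand-ins for XML operators, the generator chain only models how a pipeline hands items from one stage to the next (it does not run the stages concurrently), and a thread pool stands in for separate resources.

```python
from concurrent.futures import ThreadPoolExecutor

def parse(items):
    """Stage 1 (hypothetical): turn raw strings into numbers."""
    for item in items:
        yield int(item)

def double(items):
    """Stage 2 (hypothetical): transform each parsed item."""
    for item in items:
        yield item * 2

def run_pipeline(items):
    """Pipeline shape: every item flows through every stage, one stage per operator."""
    return list(double(parse(items)))

def run_partitioned(items, workers=2):
    """Partitioning: split the input, then run the whole flow once per partition."""
    partitions = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_pipeline, partitions))

data = ["1", "2", "3", "4"]
pipeline_result = run_pipeline(data)
partitioned_result = run_partitioned(data, workers=2)
```

In the pipeline shape the degree of parallelism is bounded by the number of stages (two here), whereas the partitioned version can use as many workers as there are partitions.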

  39. Pipelining Limitations • Pipelining limits scalability: it can use only as many resources as there are operators • In pipelining, each resource needs to work on the entire data • By using partitioning, we allow better usage of available resources on large documents

  40. Partitions Generation • XPEDIA identifies which nodes are optimal for partitioning • The chosen partition is then divided between resources using one of the following methods: • Round Robin • Chunking Scheme
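The two distribution methods can be contrasted with a small Python sketch. It shows the generic round-robin and contiguous-chunk strategies, not XPEDIA's actual scheduler; the function names and the node labels are hypothetical.

```python
def round_robin(items, n):
    """Deal items to n resources in turn: item k goes to resource k % n."""
    return [items[i::n] for i in range(n)]

def chunking(items, n):
    """Give each of n resources one contiguous chunk; sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts

# Hypothetical partition nodes, in document order.
nodes = ["e1", "e2", "e3", "e4", "e5"]
rr = round_robin(nodes, 2)   # [['e1', 'e3', 'e5'], ['e2', 'e4']]
ch = chunking(nodes, 2)      # [['e1', 'e2', 'e3'], ['e4', 'e5']]
```

Round robin interleaves neighbouring nodes across resources, while chunking keeps document-order neighbours together on the same resource.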

  41. Shallow Parsing • Dividing the work requires some parsing • The parsing that is done is only partial, from the root node to the partition node • Since the shallow parsing overhead is different for every partition, load balancing is sometimes done when choosing chunk sizes
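The idea of locating partition boundaries without building a document tree can be sketched with Python's `xml.parsers.expat` module. This is an approximation under stated assumptions: expat still tokenizes the whole input, whereas a true shallow parser would stop descending at the partition node; the sample document and the `partition_offsets` helper are hypothetical.

```python
import xml.parsers.expat

def partition_offsets(data, partition_tag):
    """Return the byte offset at which each `partition_tag` element starts.
    No tree is built; only start-tag events are observed."""
    offsets = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        if name == partition_tag:
            # During a callback, CurrentByteIndex points at the start of the event.
            offsets.append(parser.CurrentByteIndex)

    parser.StartElementHandler = on_start
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return offsets

# Hypothetical document with two partition nodes.
DATA = (b"<Company><Country name='IL'><Dept/></Country>"
        b"<Country name='IN'><Dept/></Country></Company>")
offsets = partition_offsets(DATA, "Country")
```

Each offset marks where a worker could begin full parsing of its own partition, so the expensive work is deferred to the resource that owns the partition.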

  42. What have we gained with XPEDIA

  43. What Have We Gained With XPEDIA • A performance gain of up to 70% from using XPEDIA ETL tools so that more processing is done inside the database engine.

  44. What Have We Gained With XPEDIA • Partitioning the ETL job across multiple nodes with XPEDIA is scalable and can improve the processing speed of the ETL job by up to 2.9 times for a 4-processor configuration

  45. Summary • We saw how XPEDIA deals with the new problems that have arisen • Parallel processing techniques are used for handling large XML documents • The XPEDIA ELT system is able to take advantage of the native XML processing capabilities of the database engine and greatly improve performance.
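To put the 2.9 times figure in context, the standard definition of parallel efficiency (speedup divided by processor count) can be applied to it. The symbols S, p and E are introduced here for illustration and do not appear in the slides:

```latex
E = \frac{S}{p} = \frac{2.9}{4} = 0.725
```

That is, at the reported peak each of the four processors is used at roughly 72.5% of ideal linear scaling.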

  46. Questions ?
