1 / 39

7 Translating Data to XML

7 Translating Data to XML. How to translate existing data formats to XML? (and why?) XW (XML Wrapper) "XML wrapper description language“ ( kääreenkuvauskieli ) developed in XRAKE project, Univ. of Kuopio, 2001–02

len-morales
Download Presentation

7 Translating Data to XML

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 7 Translating Data to XML • How to translate existing data formats to XML? • (and why?) • XW (XML Wrapper) • "XML wrapper description language“ (kääreenkuvauskieli) • developed in XRAKE project, Univ. of Kuopio, 2001–02 • Ek, Hakkarainen, Kilpeläinen, Kuikka, Penttinen: Describing XML Wrappers for Information Integration. In Proc. of XML Finland 2001, Tampere, Finland, Nov. 2001, 38–51. • Ek, Hakkarainen, Kilpeläinen, Penttinen: Declarative XML Wrapping of Data. CS Report A/2002/2, Univ. of Kuopio. Notes 7: XML Conversion

  2. XRAKE Project • "XML-rajapintojen kehittäminen" (Developing XML-based interfaces) • Studied definition and implementation of XML-based interfaces, and their application in • integration of heterogeneous data sources • management of mass printing • assembly and manipulation of electronic patient records Notes 7: XML Conversion

  3. XRAKE - Support • National Technology Agency of Finland (TEKES) and seven local IT companies/organizations • DEIO IS • Enfo Group • JSOP Interactive • Kuopio University Hospital • Medigroup • SysOpen • TietoEnator Notes 7: XML Conversion

  4. XW: Motivation • XML-based protocols developed for e-business, medical messages, … • Legacy data formats need to be converted to XML • How? Notes 7: XML Conversion

  5. wrapper1 wrapper2 wrapper3 source1 source2 XML-wrapping • Need ”XML-wrappers” (aka extractors) • (kääre); interface/conversion program to produce an XML representation of source data XML-form-1 XML-form-2 source3 Notes 7: XML Conversion

  6. How to wrap? 1. With an interface integrated to source • E.g. XML-interfaces of database systems • OK, if available 2. With an ad-hoc written translator • E.g. JDBC+Java, or separator-encoded text form + Perl • Conversion possibly efficient • Development and maintenance tedious :-( Notes 7: XML Conversion

  7. How to wrap? (2) 3. Generic source-independent wrapping • requires a file/message/report produced by the system • normally available • development and maintenance of wrappers should become easier => Wrapper description language XW Notes 7: XML Conversion

  8. XW (XML Wrapper) • XML-based, declarative wrapper description language • To convert • from a textual or binary source • currently (XW 1.59) only text sources supported • to an XML form Notes 7: XML Conversion

  9. XW: Design principles • A concise and natural XML syntax • description of simple and typical conversion tasks should be simple • Solving the key problem: Initial conversion of a legacy data format to XML • more general post-processing with XSLT/SAX/ DOM • necessary for being able to apply XML techniques Notes 7: XML Conversion

  10. XW: Influences xmlns:xw=”http://www.cs.uku.fi/XW/2001” • XML Namespaces • for separating XW commands and result elements • XML Schema • description of alternative and repetitive structures (CHOICE, minoccurs, maxoccurs) • data types of binary source data (string, byte, int, …) • XSLT • template-based description of result documents • variables for storing result fragments Notes 7: XML Conversion

  11. How does XW look like? <xw:wrapper xw:sourcetype="text" xmlns:xw="http://www.cs.uku.fi/XW/2001" xw:inputencoding="Cp850" … > <invoice note="XW-generated" xw:starter="\^INVOICE"> <identifierdata ...>Inserted result text ... ... </identifierdata> <specification xw:starter="\^PHONE SPECIFICATION" ...> ... </specification> <data xw:starter="\^---"xw:maxoccurs="unbounded"...> ... </data> </invoice> </xw:wrapper> Notes 7: XML Conversion

  12. AA x1 x2 BB y1y2 z1 z2 XW-architecture (1) source data wrapper description result document post-processing <xw:wrapper … > … </xw:wrapper> <part-a> <e1>x1</e1> <e2>x2</e2> </part-a> <part-b> <line-1> <d1>y1</d1> <d2>y2</d2> </line-1> <d3>z2</d3> </part-b> XSLT SAX XW-engine DOM Notes 7: XML Conversion

  13. AA x1 x2 BB y1y2 z1 z2 XW-architecture (2) source data wrapper description - to use as a program component <xw:wrapper … > … </xw:wrapper> SAX events startElement(part-a, …) startElement(e1, …) characters(”x1”) … XW-engine Notes 7: XML Conversion

  14. AA x1 x2 BB y1y2 z1 z2 <part-a> <e1>x1</e1> <e2>x2</e2> </part-a> <part-b> <line-1> <d1>y1</d1> <d2>y2</d2> </line-1> <d3>z2</d3> </part-b> result document XW-architecture (3) XW-engine SAX application source data <xw:wrapper … > … </xw:wrapper> wrapper description Notes 7: XML Conversion

  15. <whole> <a> </a> • Result document = XML for the parse tree of the source <b> <b2> <b1> <b3> </b> </whole> XW: Basic Ideas • Wrapper description ~ a grammar for source • Wrapping ~ parsing the source data • split data into parts according to the description Notes 7: XML Conversion

  16. XW Syntax <xw:wrapper xw:sourcetype=”text” xmlns:xw=”http://www.cs.uku.fi/XW/2001”> <invoice … > <identifierdata ...> ... </identifierdata> <specification ...> ... </specification> </invoice></xw:wrapper> Splitting of source content into parts(-> elements) Notes 7: XML Conversion

  17. for sub-parts <identifierdata xw:childterminator="\n" … > Recognition of content parts (1) • by separators (erotin); For example: <invoice xw:starter="\^INVOICE"… • by position (within surrounding part): <invoicenumber xw:position="53 64"/> (Invoice number is in positions 53..64 of the first row of an identifierdata-part) Notes 7: XML Conversion

  18. Recognition of content parts (2) • In binary data by content data types; For example:<xw:wrapper xw:sourcetype="binary"...> <A xw:type="byte"/> <B xw:type="string" xw:stringLength="20"/> <C xw:type="int"/> </xw:wrapper> • Split input to a byte, a string of 20 charactes, and an integer; (-> elements A,BandC) Notes 7: XML Conversion

  19. Alternative parts: <xw:CHOICE xw:maxoccurs=”unbounded"><A xw:starter=”\^aa” xw:terminator=”\n” /> <B xw:starter=”\^bb” xw:terminator=”\n” /> </xw:CHOICE> • arbitrary number (at least 1) lines starting with ”aa” or ”bb” -> elements A or B Recognition of content parts (3) • Repetition:<line xw:terminator="\n" xw:minoccurs="2" xw:maxoccurs="2"/> • 2 input lines -> 2 line elements Notes 7: XML Conversion

  20. XW: Modifying the structure of data • Limited modifications supported: • discarding parts of data • collapsing levels of hierarchy • adding levels of hierarchy • generating verbatim content and attributes • re-arranging existing data Notes 7: XML Conversion

  21. Discarding parts of data Input parts not matched by wrapper elements are ignored <spec xw:starter="SPEC" xw:childterminator="\n"> <!-- Split the ”SPEC” into rows: --> <!-- Ignore the first three rows: --> <xw:ignore xw:minoccurs="3" xw:maxoccurs="3" /> . . . </spec> Notes 7: XML Conversion

  22. Collapsing hierarchy <dataxw:starter=”START” xw:terminator=”END” xw:childterminator="\n”> <!-- ’data’ is made of rows --> <xw:collapse> <date xw:position=”5 14"/> <sum xw:position=”16 21"/></xw:collapse> . . . </data> Notes 7: XML Conversion

  23. <data> . . . </data> Collapsing hierarchy (2) START 17.8.1996 95.50 END • Split source data into parts according to specified separators Notes 7: XML Conversion

  24. 17.8.1996 95.50 Collapsing hierarchy (3) <data> <xw:collapse> </xw:collapse> . . .</data> • split parts into sub-parts, according to sub-elements Notes 7: XML Conversion

  25. <data><date></date> <sum> </sum> . . .</data> 17.8.1996 17.8.1996 95.50 95.50 Collapsing hierarchy (4) <data> <xw:collapse><date></date> <sum></sum></xw:collapse> . . .</data> Notes 7: XML Conversion

  26. 17.8.1996 17.8.1996 17.8.1996 <date></date> + <xw:collapse /> 17.8.1996 default: discardwhitespace=”true” Collapsing hierarchy (5) Input part wrapper element + <date /> result Notes 7: XML Conversion

  27. Adding levels of hierarchy • Example: Recognizing IP addresses in binary data<xw:ELEMENT xw:name=”IP-address"> <a xw:type="byte"/> <b xw:type="byte"/> <c xw:type="byte"/> <d xw:type="byte"/> </xw:ELEMENT> Notes 7: XML Conversion

  28. 193 167 232 253 <IP-address> <a>193</a> <b>167</b> <c>232</c> <d>253</d> </IP-address> <a>193</a> <b>167</b> <c>232</c> <d>253</d> Adding levels of hierarchy (2) • Binary data = string of bytes Notes 7: XML Conversion

  29. Adding levels of hierarchy (3) • NB: an xw:ELEMENTdoes not correspond to parts of input data (like ordinary result elements do): <!-- Wrap first two lines as INTRO: --><data xw:childterminator="\n"/> <xw:ELEMENT xw:name="INTRO"> <!--lines are matched by these elements:--> <xw:collapse /><xw:collapse /> </xw:ELEMENT> … </data> Notes 7: XML Conversion

  30. Rearranging content • Content can be rearranged by storing results temporarily in variables:<data xw:childterminator="\n"/><xw:STORE xw:name="lines"> <!-- lines are matched by these elements :--> <line1 /><line2 /> </xw:STORE> … <xw:COPY-OF xw:select="lines" /> </data> Notes 7: XML Conversion

  31. Axy.. z$??? B.. 1one 2two 3three <a>xy..z</a> Rearranging result structures XW <whole> <xw:STORE xw:name="xx"> <a xw:starter="A" xw:terminator="$"/> </xw:STORE> <b xw:starter="B"> <b1 xw:starter="1"/> <b2 xw:starter="2"/> <xw:COPY-OF xw:select="xx"/> <b3 xw:starter="3"/> </b> </whole> <whole> <b> <b1>one</b1> <b2>two</b2> <b3>three</b3> </b> </whole> Notes 7: XML Conversion

  32. Axy.. z$??? B.. 1one 2two 3three xy..z Rearranging result content XW <whole> <xw:STORE xw:name="xx"> <a xw:starter="A" xw:terminator="$"/> </xw:STORE> <b xw:starter="B"> <b1 xw:starter="1"/> <b2 xw:starter="2"/> <xw:VALUE-OF xw:select="xx"/> <b3 xw:starter="3"/> </b> </whole> <whole> <b> <b1>one</b1> <b2>two</b2> <b3>three</b3> </b> </whole> Notes 7: XML Conversion

  33. XW: Implementation • Prototype implemented with Java • Apache Xerces 2.0.1 used as a SAX parser • to read the wrapper description, which is represented internally as .. • a wrapper tree (käärintäpuu) • guides the parsing of source data Notes 7: XML Conversion

  34. Wrapper Tree • Wrapper tree node • corresponds to an element of wrapper description • used for matching parts of source data • includes sets S, B, T and F of delimiter strings • computed from wrapper description • S: element's own starter strings • B: strings that can begin part of element = S  starters of subelements that can begin the part of the element • T: terminating delimiters for the part of element • F: strings that can follow the part of element Notes 7: XML Conversion

  35. <xw:wrapper xw:name="Wrapper tree example" xw:sourcetype="text" xmlns:xw="http://www.cs.uku.fi/XW/2001"> <doku xw:childterminator="\n" terminator="$"> <a xw:starter="\^A" xw:minoccurs="0"/> <b xw:starter="\^B" /> <c xw:starter="\^C"/> <xw:CHOICE xw:minoccurs="0" xw:maxoccurs="unbounded"> <d xw:starter="\^D"/> <e xw:starter="\^E"/> </xw:CHOICE> </doku> </xw:wrapper> doku S: B:\^A ,\^B T: $ F: xw:CHOICE S: B:\^D,\^E T: F:\^D,\^E, $ c S:\^C B:\^C T:\n F:\^D,\^E, $ b S:\^B B:\^B T:\n F:\^C a S:\^A B:\^A T:\n F:\^B d S:\^D B:\^D T:\n F:\^D,\^E, $ e S:\^E B:\^E T:\n F:\^D,\^E, $ Aaaa Bbbb Cccc Eeee Dddd Dddd $ Notes 7: XML Conversion

  36. Executing a wrapper (simplified) • Traverse the wrapper tree; In each node: • scan input until the start of corresponding part found (= a delimiter belonging to set B) • report startElement(…) • Either • process child nodes recursively, or • report characters(…) for a leaf-level element • scan input until the end of the part (using sets T and F) • report endElement(…) • if node iterative, and a string in B found, reprocess node Notes 7: XML Conversion

  37. Development history and status • Fall 2001: language designed from concrete examples • 2002: Design of implementation principles, implementation • wrapping of separator-based and positional text data implemented • wrapping of binary data (and few other details) unimplemented Notes 7: XML Conversion

  38. XW: Some possible extensions • Evaluation of expressions • for generating computed attributes (implemented ) • for guiding repetition (min/maxoccurs) by content values • Namespace support for results • Describing recursive (unlimited nesting) source structures => recognizing LL(k) languages(Usefulness for wrapping data formats?) Notes 7: XML Conversion

  39. XW: Summary • XW: a convenient "XML wrapper description language” • Developed at Univ. of Kuopio • for translating legacy data to XML • for declarative wrapper description • Easier than writing ad-hoc conversion programs • Working prototype implementation Notes 7: XML Conversion

More Related