
XML: Data Driving Business?

XML: Data Driving Business?. Laks V.S.Lakshmanan, IIT Bombay and Concordia University. XML : Data Model. What is an XML Document Linearization of a tree structure Every node of the tree can have several character strings associated


Presentation Transcript


  1. XML: Data Driving Business? Laks V.S. Lakshmanan, IIT Bombay and Concordia University

  2. XML: Data Model • What is an XML Document • Linearization of a tree structure • Every node of the tree can have several character strings associated with it • Information content of the document is the tree structure together with the character strings • Is XML just a syntax for data interchange and serialization?

  3. XML: Data Model Types of nodes • Element Eg. <p a1="A1" . . . an="An">c1 . . . cm</p> • Document Eg. <!DOCTYPE name [markup declarations]> • Processing instruction Eg. <?xml version="1.0"?> • Comment Eg. <!--This is a comment--> • Atomic data Eg. <Data>

  4. What is a DTD? • Document Type Definition (DTD) serves as a grammar • A document type definition specifies: • the elements that are permissible in a document of this type • for each element, the possible attributes, their range of values and defaults • for each element, the structure of its contents, including: • which elements can occur and in what order • whether text characters can occur

  5. Example of a DTD Eg: <!DOCTYPE Bookslist [ <!ELEMENT Bookslist (book)*> <!ELEMENT book (title, author*, publisher)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT publisher (#PCDATA)> ]>
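A minimal sketch of what the content models in this DTD constrain, written in Python. It is not a DTD validator: the `CONTENT_MODELS` table, the `conforms` helper and the two sample documents are assumptions made for illustration, and #PCDATA content is not checked. Each content model is rewritten by hand as a regular expression over the sequence of child tag names.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the slide's DTD, rewritten by hand as regular
# expressions over a space-terminated sequence of child tag names.
# This encoding is an assumption of the sketch, not part of any DTD tool.
CONTENT_MODELS = {
    "Bookslist": r"(book )*",
    "book": r"title (author )*publisher ",
}


def conforms(element):
    """Return True if element and all its descendants match CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            return False
    return all(conforms(child) for child in element)


# Hypothetical documents: the second one puts author before title.
good = ET.fromstring(
    "<Bookslist><book><title>T</title><author>A</author>"
    "<publisher>P</publisher></book></Bookslist>"
)
bad = ET.fromstring(
    "<Bookslist><book><author>A</author><title>T</title>"
    "<publisher>P</publisher></book></Bookslist>"
)
print(conforms(good), conforms(bad))
```

The second document is still well formed; it fails only because the order of children violates the (title, author*, publisher) content model.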

  6. XML and DTD • Well-formed documents • Tags must be nested properly and attribute names must be unique within an element • Valid documents • Well-formed documents that conform to a Document Type Definition (DTD) • DTDs are used to • Constrain structure • Declare entities • Provide default values for attributes
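The well-formedness conditions above can be checked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which checks well-formedness only and does not validate against a DTD; the helper name `is_well_formed` and the sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a><b></b></a>"))      # properly nested
print(is_well_formed("<a><b></a></b>"))      # tags overlap instead of nesting
print(is_well_formed('<a x="1" x="2"/>'))    # attribute name repeated
```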

  7. DTD Limitations • too document-oriented • too simple and too complicated at the same time • too limited to represent complex structures • IDREFs are not typed • No notion of inheritance/sub-typing • too many ways to represent the same thing • names are global, not local

  8. DTD vs. Database Schema • Order is of significance in DTD and not in DB • DTD does not provide for data types • DTD cannot specify keys

  9. XMLSchema • Why XMLSchema • Based on XML syntax • Can be parsed and manipulated like any XML document • Supports a variety of data types • Allows vocabularies to be extended and elements to inherit from other elements • Provides namespace integration • Provides logical grouping of attributes

  10. XMLSchema: An example <datatype name="PriceType"> <basetype name="decimal"/> <minExclusive>0.00</minExclusive> <scale>2</scale> </datatype> <element name="price" type="PriceType"> </element> <element name='Person'> ... </element> <element name='Employee'> <refines name='Person'/> ... </element>

  11. XMLSchema vs. DTD

  12. XML Data • Superset of XMLSchema • Can express database relationships too • Eg: <elementType id="booktable"> <element id="titleID" type="#title"/> <element type="#author"/> <element type="#pages"/> <key id="bookkey"> <keyPart href="#titleID"/> </key> </elementType>

  13. Semistructured data • Data that is neither raw nor as strictly typed as in databases • Examples of semistructured data • HTML file with one entry per restaurant that provides info on prices, addresses, styles • BibTeX files • Genome and scientific databases • Online documentation

  14. Semistructured data: Main aspects • Structure • Irregular • Implicit • Partial • Schema • Very large • Rapidly evolving • Distinction between data and schema is blurred

  15. Semistructured data: Data model • Object Exchange Model (OEM) • Lightweight and flexible • Data representation • As a graph with objects as vertices and labels on edges • Each object has a unique object identifier • Some objects are atomic, e.g., integer, real,… • The value of a complex object is a set of object references
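A minimal sketch of the OEM description above in Python. The class name `OemObject`, the `&oN` identifier format and the sample bibliography are assumptions for illustration; the slide fixes only the ideas of unique identifiers, atomic values and labelled edges to other objects.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Union

_next_id = count(1)  # source of unique object identifiers


@dataclass
class OemObject:
    # Atomic objects carry a value; complex objects carry labelled edges.
    value: Union[int, float, str, None] = None
    edges: list = field(default_factory=list)  # list of (label, OemObject)
    oid: str = field(default_factory=lambda: f"&o{next(_next_id)}")

    def add(self, label, child):
        """Attach child under an edge carrying label and return the child."""
        self.edges.append((label, child))
        return child

    def children(self, label):
        """All objects referenced through edges carrying label."""
        return [c for (l, c) in self.edges if l == label]


# Hypothetical data: one book with a title and two authors.
root = OemObject()
book = root.add("book", OemObject())
book.add("title", OemObject("Title 1"))
book.add("author", OemObject("Author 1"))
book.add("author", OemObject("Author 2"))

print(root.oid, book.oid)
print([a.value for a in book.children("author")])
```

Labels live on edges rather than on objects, so the same object can be reached under different labels, and two edges with the same label (the two authors) are unremarkable.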

  16. OEM: An example

  17. Semistructured data: Query Languages • Lorel • Based on OQL • Eg., • Select author:X from biblio.book.author X • Computes the set of book authors • Forms a new node and connects it with edges labelled author to nodes resulting from evaluation of the path expression
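The evaluation described above can be sketched in Python over a small labelled graph. This is not Lorel itself: the dict-based graph encoding, the helpers `atom`, `obj`, `follow` and `eval_path`, and the sample data are all assumptions for illustration.

```python
def atom(value):
    """An atomic object holding a value."""
    return {"value": value, "edges": []}


def obj(*edges):
    """A complex object whose value is a list of (label, object) edges."""
    return {"value": None, "edges": list(edges)}


def follow(nodes, label):
    """All objects reachable from nodes over one edge carrying label."""
    return [child for node in nodes
            for (edge_label, child) in node["edges"] if edge_label == label]


def eval_path(root, path):
    """Evaluate a dotted path expression such as 'biblio.book.author'."""
    nodes = [root]
    for label in path.split("."):
        nodes = follow(nodes, label)
    return nodes


# Hypothetical database with two books.
db = obj(("biblio", obj(
    ("book", obj(("title", atom("T1")),
                 ("author", atom("Smith")),
                 ("author", atom("Jones")))),
    ("book", obj(("title", atom("T2")),
                 ("author", atom("Lee")))),
)))

# select author:X from biblio.book.author X
# A new node is formed, with one edge labelled author per binding of X.
result = obj(*[("author", x) for x in eval_path(db, "biblio.book.author")])
print([child["value"] for (_, child) in result["edges"]])
```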

  18. Lorel: Salient features • Coercion • force comparison operators to handle comparisons between objects of different types like between string and integer • Eg. Select row:X from biblio.paper X where X.year=1998 Comment: ==>Year could have been string or integer
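A rough Python sketch of what coercion means for the comparison X.year=1998. The rule used here (compare as numbers when both sides convert, otherwise compare as strings) is an assumption chosen for illustration and is not taken from the Lorel definition; `coerce_eq` is a hypothetical helper.

```python
def coerce_eq(left, right):
    """Equality that tolerates operands of different atomic types."""
    if type(left) is type(right):
        return left == right
    try:
        # Try a numeric comparison first, so "1998" equals 1998.
        return float(left) == float(right)
    except (TypeError, ValueError):
        # Fall back to comparing the textual forms.
        return str(left) == str(right)


# Hypothetical papers whose year is stored with different types.
papers = [{"title": "P1", "year": 1998},
          {"title": "P2", "year": "1998"},
          {"title": "P3", "year": "unknown"}]
print([p["title"] for p in papers if coerce_eq(p["year"], 1998)])
```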

  19. Lorel: Salient Features • Path expressions • Data model allows arbitrary nesting • Queries should hence be able to probe arbitrary depth • Provided by path expressions • Eg. select title:t from chapter(.section)* s, s.title t where t like "*XML*"
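The path expression chapter(.section)* reaches a chapter and every section nested under it at any depth. The sketch below imitates that in Python over an XML tree. The sample document and the helper `section_closure` are assumptions, and the `like "*XML*"` test is approximated with shell-style matching from `fnmatch`.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

# Hypothetical document with sections nested two levels deep.
doc = ET.fromstring("""
<book>
  <chapter><title>Intro to XML</title>
    <section><title>Why XML matters</title>
      <section><title>History</title></section>
      <section><title>XML today</title></section>
    </section>
  </chapter>
</book>
""")


def section_closure(node):
    """Yield node and everything reachable over zero or more section edges."""
    yield node
    for section in node.findall("section"):
        yield from section_closure(section)


# select title:t from chapter(.section)* s, s.title t where t like "*XML*"
titles = [t.text
          for chapter in doc.findall("chapter")
          for s in section_closure(chapter)
          for t in s.findall("title")
          if fnmatchcase(t.text or "", "*XML*")]
print(titles)
```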

  20. UnQL • Based on Edge labeled Graph Model • Coercion not supported • More precise knowledge of data needed • Pattern Usage • Eg. Select title: X where {biblio: {paper: {title: X, year:Y}}} in db, Y>1998

  21. UnQL • Path variables • Can use paths as data too • Eg. Select @P from db1 @P.X where matches(".*(U|u)biquitin.*", X) ==> Determines where the string "ubiquitin" appears in db1

  22. Semistructured vs. XML • Both are schema-less, self-describing • XML is ordered and semistructured data is not • XML can mix text and elements • XML has lots of other stuff: entities, processing instructions, comments

  23. Requirements of an XML Query Language • XML Output • Server-side processing • Query operations • Selection, Extraction, Reduction, Restructuring, Combination • No schema required • Exploit available schema • Preserve order and association • Programmatic Manipulation

  24. Requirements of an XML Query Language • XML representation • Mutual embedding with XML • XLink and XPointer cognizant • Support for new data types • Suitable for metadata

  25. XML Query Languages • XQL • XML-QL • Quilt

  26. XQL • Simple expressions • //product[@maker='BSA'] : All products with attribute maker ‘BSA’ • Filters • author/address[@type='email']: Address nodes with attribute type as email • Subscripts • section[1,3 to 5]: Nodes with position 1,3,4,5
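The simple expression and the filter above have close analogues in the limited XPath subset supported by Python's `xml.etree.ElementTree`. The sketch below uses that subset, not an XQL engine, and the catalog document is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; element names follow the slide's expressions.
catalog = ET.fromstring("""
<catalog>
  <product maker="BSA"><name>Model 1</name></product>
  <product maker="Other"><name>Model 2</name></product>
  <group><product maker="BSA"><name>Model 3</name></product></group>
  <author>
    <address type="email">a@example.org</address>
    <address type="postal">1 Example Street</address>
  </author>
</catalog>
""")

# //product[@maker='BSA']: products at any depth whose maker is BSA
bsa_names = [p.findtext("name")
             for p in catalog.findall(".//product[@maker='BSA']")]

# author/address[@type='email']: address children of author, filtered on type
emails = [a.text for a in catalog.findall("author/address[@type='email']")]

print(bsa_names, emails)
```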

  27. XQL • Supports boolean and set operators • q1 and q2 • q1 union q2 • Grouping • //invoice{q1} : Using invoice groups the results of q1 • Sequence • a before b • Others : node(), text(), ...

  28. XQL: Limitations • Flattening • Not possible, as the results of patterns and filters are not modeled by an intermediate relation • Restructuring • Not possible, since it depends on flattening • Tag variables • Not supported • Sorting • Not supported

  29. XML Query Languages • XQL • XML-QL • Quilt

  30. XML-QL • Simple examples WHERE <book> <publisher> <name>Addison-Wesley</name> </publisher> <title> $t</title> <author> $a</author> </book> IN "www.a.b.c/bib.xml" CONSTRUCT <result> <author>$a</author> <title>$t</title> </result>
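To make the WHERE/CONSTRUCT reading concrete, here is a rough Python equivalent of this query. It is only a sketch: the sample bibliography and the wrapping `results` element are assumptions, and the code matches `book` elements anywhere in the document, which may be broader than what the XML-QL pattern matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography standing in for www.a.b.c/bib.xml.
bib = ET.fromstring("""
<bib>
  <book><publisher><name>Addison-Wesley</name></publisher>
    <title>Book A</title><author>Author 1</author><author>Author 2</author>
  </book>
  <book><publisher><name>Other Press</name></publisher>
    <title>Book B</title><author>Author 3</author>
  </book>
</bib>
""")


def query(bib):
    """Emit one <result> per ($t, $a) binding of the WHERE pattern."""
    results = ET.Element("results")
    for book in bib.iter("book"):
        names = [n.text.strip() for n in book.findall("publisher/name") if n.text]
        if "Addison-Wesley" not in names:
            continue  # the publisher part of the pattern does not match
        for title in book.findall("title"):        # binds $t
            for author in book.findall("author"):  # binds $a
                result = ET.SubElement(results, "result")
                ET.SubElement(result, "author").text = author.text
                ET.SubElement(result, "title").text = title.text
    return results


pairs = [(r.findtext("author"), r.findtext("title")) for r in query(bib)]
print(pairs)
```

Each (title, author) binding yields its own result element, so a book with two authors appears twice; this is the flat output that the grouping query on the next slide restructures.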

  31. XML-QL • Grouping WHERE <book> $p </> IN "www.a.b.c/bib.xml", <title> $t </>, <publisher> <name>Addison-Wesley</> </publisher> IN $p CONSTRUCT <result> <title> $t </> WHERE <author> $a </> IN $p CONSTRUCT <author> $a </> </> ==> Groups by title.

  32. XML-QL • Tag variables WHERE <$p> <title> $t </title> <year>1995 </> <$e> Smith </> </> IN "www.a.b.c/bib.xml", $e IN {author, editor} CONSTRUCT <$p> <title> $t </title> <$e> Smith </> </> ==> List of books where Smith could be either author or editor

  33. XML-QL • Regular Path Expressions WHERE <part*> <name>$r</> <brand>Ford</> </> IN "www.a.b.c/bib.xml" CONSTRUCT <result>$r</> ==> Gets list of names of parts irrespective of the nesting of parts in the document.

  34. XML-QL • Skolem functions WHERE <$> <author> <firstname> $fn </> <lastname> $ln </> </> <title> $t </> </> IN "www.a.b.c/bib.xml" CONSTRUCT <person ID=PersonID($fn, $ln)> <firstname> $fn </> <lastname> $ln </> <publicationtitle> $t </> </> ==> PersonID is a Skolem function: it generates a new ID for each distinct value of ($fn, $ln); when the pair has been seen before, the output is appended to the existing node.
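The behaviour of a Skolem function can be sketched in Python as a memoised ID generator. The `Skolem` class, the ID format and the sample rows are assumptions for illustration; XML-QL defines the concept, not this code.

```python
class Skolem:
    """Return the same ID for equal arguments and a fresh ID otherwise."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.ids = {}

    def __call__(self, *args):
        if args not in self.ids:
            self.ids[args] = f"{self.prefix}{len(self.ids) + 1}"
        return self.ids[args]


# Hypothetical bindings of ($fn, $ln, $t) produced by the WHERE clause.
rows = [("Ann", "Lee", "Paper 1"),
        ("Bob", "Ray", "Paper 2"),
        ("Ann", "Lee", "Paper 3")]

person_id = Skolem("person")
persons = {}
for fn, ln, title in rows:
    pid = person_id(fn, ln)
    # A known ID means the person node already exists, so only the title is added.
    person = persons.setdefault(
        pid, {"firstname": fn, "lastname": ln, "publicationtitle": []})
    person["publicationtitle"].append(title)

print(persons)
```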

  35. XML-QL • Allows integrating data from multiple sources • Can query order as well • Provides for embedding query within data • Allows function definitions • Is relationally complete

  36. XML-QL • Is everything fine? • Pattern specifications are too verbose • Result of the WHERE clause is a relation composed of scalar values • So cannot preserve information about hierarchy and sequence • Can hence not handle hierarchy and sequence related queries

  37. XML Query Languages • XQL • XML-QL • Quilt

  38. Quilt • Combines strengths of XML-QL and XQL • Derives ability to navigate and select nodes based on sequence from XQL • Binding of variables done like in XML-QL

  39. Quilt • An example FOR $b IN //book WHERE exists($b/title) AND NOT exists($b/author) RETURN $b/title ==> Lists the titles of those books which do not have author info

  40. Quilt • Flow of data in a Quilt expression: XML input → FOR/LET → tuples of bound variables → WHERE → selected tuples → RETURN → XML output
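The stages of this data flow can be traced in Python using the query from the previous slide (titles of books without author info). The three generator functions and the sample document are assumptions for illustration, not part of Quilt.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: only the second book has a title but no author.
doc = ET.fromstring("""
<bib>
  <book><title>Book A</title><author>Author 1</author></book>
  <book><title>Book B</title></book>
  <book><author>Author 2</author></book>
</bib>
""")


def for_clause(doc):
    """FOR $b IN //book: produce one tuple of bound variables per book."""
    for b in doc.iter("book"):
        yield {"b": b}


def where_clause(tuples):
    """WHERE exists($b/title) AND NOT exists($b/author): select tuples."""
    for t in tuples:
        b = t["b"]
        if b.find("title") is not None and b.find("author") is None:
            yield t


def return_clause(tuples):
    """RETURN $b/title: build the output from each surviving tuple."""
    for t in tuples:
        yield from t["b"].findall("title")


output = list(return_clause(where_clause(for_clause(doc))))
print([ET.tostring(e, encoding="unicode").strip() for e in output])
```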

  41. Quilt: Filtering Documents • Need to preserve the relationships among selected elements • Eg: (Figure: a document tree with nodes labelled A, B and C, and the tree produced by applying filter = A|B)

  42. Quilt • Can perform Sorting • Aggregation provided • Allows recursive functions

  43. Quilt: The real power of it • Sample document <section> <section.title>Procedure</section.title> The patient was taken to the operating room where she was placed in a supine position and <Anesthesia>induced under general anesthesia. </Anesthesia> <Prep> <action>Foley catheter was placed to decompress the bladder</action> and the abdomen was then prepped and draped in sterile fashion. </Prep> <Incision> A curvilinear incision was made <Geography>in the midline immediately infraumbilical</Geography> and the subcutaneous tissue was divided <Instrument>using electrocautery.</Instrument> </Incision> The fascia was identified and <action>#2 0 Maxon stay sutures were placed on each side of the midline.</action> <Incision> The fascia was divided using <Instrument>electrocautery</Instrument> and the peritoneum was entered. </Incision> <Observation>The small bowel was identified</Observation> and <action> the <Instrument>Hasson trocar</Instrument></action> : </section>

  44. Quilt: The real power of it • In each section with title "Procedure", what Instruments were used in the second Incision? FOR $s IN //section[section.title="Procedure"] RETURN ($s//Incision)[2]/Instrument • In each section with title "Procedure", what are the first two instruments to be used? FOR $s IN //section[section.title="Procedure"] RETURN ($s//Instrument)[1-2]
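The two positional queries can be imitated in Python to show what the subscripts select. The document below is an abbreviated version of the sample section from the previous slide, wrapped in a hypothetical `report` element, and list slicing stands in for the Quilt subscripts.

```python
import xml.etree.ElementTree as ET

report = ET.fromstring("""
<report>
  <section><section.title>Procedure</section.title>
    <Incision>A curvilinear incision was made and the subcutaneous tissue
      was divided <Instrument>using electrocautery.</Instrument></Incision>
    <Incision>The fascia was divided using
      <Instrument>electrocautery</Instrument> and the peritoneum was entered.
    </Incision>
    <action>the <Instrument>Hasson trocar</Instrument></action>
  </section>
</report>
""")

# FOR $s IN //section[section.title="Procedure"]
procedures = [s for s in report.iter("section")
              if s.findtext("section.title") == "Procedure"]

second_incision_instruments = []
first_two_instruments = []
for s in procedures:
    # ($s//Incision)[2]/Instrument: position counts over the whole section
    incisions = list(s.iter("Incision"))
    if len(incisions) >= 2:
        second_incision_instruments += [
            i.text for i in incisions[1].findall("Instrument")]
    # ($s//Instrument)[1-2]: the first two instruments in document order
    first_two_instruments += [i.text for i in list(s.iter("Instrument"))[:2]]

print(second_incision_instruments, first_two_instruments)
```

The parentheses matter: the subscript applies to the sequence of all matching descendants of the section in document order, not to each parent's children separately.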

  45. Quilt: The real power of it • In the first procedure, what happened between the first incision and the second incision? FOR $proc IN //section[section.title="Procedure"][1], $bet IN $proc//((* AFTER ($proc//Incision)[1]) BEFORE ($proc//Incision)[2]) RETURN $bet

  46. XML Storage • Text files • Simple • Would require special purpose query processor • Relational databases • Ternary relations [Florescu et al] • Inlining methods [Shanmugasundaram et al] • STORED [Mary Fernandez]

  47. XML Storage • Object-oriented databases [Sophie Cluet et al] • Native storage

  48. XML Storage • Using Ternary relations • Edge labels are maintained in a table with the object ids that the edge connects • Value of leaf nodes are stored using yet another table
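A minimal Python sketch of this storage scheme: one table of (source, label, target) triples for edges and a second table for leaf values. The function name `shred`, the table names `ref` and `val`, the `&oN` identifier format and the sample document are assumptions for illustration; attributes and mixed content are ignored.

```python
import xml.etree.ElementTree as ET
from itertools import count


def shred(root):
    """Split an XML tree into an edge table and a leaf-value table."""
    ids = count(1)
    ref = []  # (source oid, edge label, target oid)
    val = []  # (leaf oid, value)

    def visit(elem, oid):
        for child in elem:
            child_oid = f"&o{next(ids)}"
            ref.append((oid, child.tag, child_oid))
            if len(child) == 0:
                # Leaf node: its text goes into the value table.
                val.append((child_oid, (child.text or "").strip()))
            else:
                visit(child, child_oid)

    visit(root, f"&o{next(ids)}")
    return ref, val


# Hypothetical document; the author names are placeholders.
doc = ET.fromstring(
    "<bib><paper><title>The Calculus</title><author>Author A</author>"
    "<author>Author B</author><year>1986</year></paper></bib>"
)
ref, val = shred(doc)
for row in ref:
    print(row)
for row in val:
    print(row)
```

A path query then becomes a chain of self-joins on the edge table, one join per step of the path, followed by a join with the value table when leaf values are compared.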

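  49. Store XML in Ternary Relation • (Figure: an OEM graph in which &o1 has a paper edge to &o2, and &o2 has year, title, author and author edges to the leaf objects &o3, &o4, &o5 and &o6, whose values are "The Calculus", "…", "…" and "1986"; the graph is stored in two tables, Ref and Val)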

  50. XML Storage • DTDs converted into DTD graph • Inlining methods • Basic inlining • Shared inlining • Hybrid inlining
