CSE 636 Data Integration

Explore the flexible representation of data with semistructured data, which allows for sharing and integration of documents among systems and databases. Learn about the graphs and tree-like structure of semistructured data, as well as the use of XML as a popular format for data processing and exchange.



  1. CSE 636 Data Integration • XML • Semistructured Data • Document Type Definitions

  2. Semistructured Data • Another data model, based on trees • Motivation: flexible representation of data • Often, data comes from multiple sources with differences in notation, meaning, etc. • Motivation: sharing of documents among systems and databases

  3. Graphs of Semistructured Data • Nodes = objects • Labels on arcs (attributes, relationships) • Atomic values at leaf nodes (nodes with no arcs out) • Flexibility: no restriction on: • Labels out of a node • Number of successors with a given label

  4. Example: Data Graph • [Figure: data graph with a root node, two beer objects (Bud and M’lob), and the bar object for Joe’s Bar. Arc labels: beer, bar, manf, name, prize, year, award, servedAt, addr. Leaf values: A.B., Bud, M’lob, 1995, Gold, Joe’s, Maple.]
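
A data graph of this kind can be sketched as a list of labeled arcs. The node identifiers (beer1, beer2, bar1, prize1) are hypothetical, and the exact arcs are an assumption reconstructed from the labels on the slide, not a copy of the original figure.

```python
# A semistructured data graph as (source, label, target) arcs.
# Node ids such as "beer1" are hypothetical names for this sketch;
# plain strings like "Bud" stand for atomic values at leaf nodes.
arcs = [
    ("root", "beer", "beer1"),
    ("root", "beer", "beer2"),
    ("root", "bar", "bar1"),
    ("beer1", "name", "Bud"),
    ("beer1", "manf", "A.B."),
    ("beer1", "prize", "prize1"),
    ("prize1", "year", "1995"),
    ("prize1", "award", "Gold"),
    ("beer1", "servedAt", "bar1"),
    ("beer2", "name", "M'lob"),
    ("beer2", "manf", "A.B."),
    ("bar1", "name", "Joe's"),
    ("bar1", "addr", "Maple"),
]


def successors(node, label):
    """All targets reached from `node` by an arc with the given label."""
    return [t for (s, l, t) in arcs if s == node and l == label]


def is_leaf(node):
    """A leaf is a node with no arcs out."""
    return all(s != node for (s, _, _) in arcs)


print(successors("root", "beer"))
print(successors("beer2", "prize"))
```

Nothing forces the two beer objects to carry the same labels: beer1 has a prize arc and beer2 does not, which is the flexibility the model is meant to allow.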

  5. XML • HTML • Uses tags for formatting the presentation (e.g., “italic”) • Hard for applications to process • XML = Extensible Markup Language • Uses tags for semantics (e.g., “this is an address”) • Similar to labels in semistructured data • Allows you to invent your own tags • Easy for applications to process

  6. HTML vs. XML • HTML: <html> <body> <h1> Bibliography </h1> <p> <i>Foundations of Databases</i> Abiteboul, Hull, Vianu <br/> Addison Wesley, 1995 </p> <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br/> Morgan Kaufmann, 1999 </p> </body> </html> • XML: <?xml version = "1.0" standalone = "yes" ?> <bibliography> <book> <title>Foundations of Databases</title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography>
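
To illustrate why semantic tags are easier for applications to process, here is a minimal sketch that reads the first book of the bibliography with Python's standard xml.etree.ElementTree module. The choice of Python and the variable names are assumptions of this sketch, not part of the original slides.

```python
import xml.etree.ElementTree as ET

doc = """<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>"""

root = ET.fromstring(doc)

# Each tag names the meaning of its content, so extraction is a lookup by tag.
books = [
    {
        "title": book.findtext("title"),
        "authors": [a.text for a in book.findall("author")],
        "year": int(book.findtext("year")),
    }
    for book in root.findall("book")
]
print(books)
```

Getting the same fields out of the HTML version would require guessing that the italic text is the title and that the comma-separated run after it is the author list.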

  7. Why XML is of Interest to Us • XML is just syntax for data • Note: we have no syntax for relational data • But XML is not relational: semistructured • This is exciting because: • Can translate any data to XML • Can ship XML over the Web (HTTP, SOAP) • Can input XML into any application • Thus: data sharing and exchange on the Web

  8. XML Data Sharing and Exchange • [Figure: diagram with boxes labeled XML DB, Applications, XML Data, Transform, Integrate, Web (HTTP, SOAP), Warehouse, Relational DB, Web Site, Web Service]

  9. XML Tags & Elements • Tags: book, title, author, … • XML tags are case sensitive • Tags, as in HTML, are normally matched pairs: <book> … </book> • Start tag: <book>, end tag: </book> • Element: a start tag, its matching end tag, and everything between them • Example 1: <title>Foundations of Databases</title> • Example 2: <book> <title>Foundations of Databases</title> </book> • Elements may be nested arbitrarily • Empty element: <book></book>, abbreviated <book/>

  10. XML Attributes • <book price = "55" currency = "USD"> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> • Attributes are alternative ways to represent data

  11. Replacing Attributes with Elements <book> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> <price> 55 </price> <currency> USD </currency> </book>

  12. Elements vs. Attributes • Too many attributes make documents hard to read • Attributes do not specify document structure • Attributes are good for simple information
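
The two representations of the price can be compared from an application's point of view. This is a sketch using Python's standard xml.etree.ElementTree; the shortened documents and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Price and currency as attributes of <book>.
with_attrs = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title>Foundations of Databases</title>"
    "</book>"
)

# The same information as child elements.
with_elems = ET.fromstring(
    "<book>"
    "<title>Foundations of Databases</title>"
    "<price>55</price><currency>USD</currency>"
    "</book>"
)

price_from_attr = with_attrs.get("price")       # attribute lookup
price_from_elem = with_elems.findtext("price")  # child element lookup
print(price_from_attr, price_from_elem)
```

Both lookups return the string "55". The difference is structural: an attribute holds a single string, while an element can later grow children or be repeated.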

  13. More XML: CDATA Section • Syntax: <![CDATA[ .....any text here...]]> • Example: <example> <![CDATA[ some text here </notAtag> <>]]> </example>

  14. More XML: Entity References • Syntax: &entityname; • Example: <element> this is less than &lt; </element> • Some entities: &lt; (<), &gt; (>), &amp; (&), &apos; ('), &quot; (")
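
CDATA sections and entity references are two ways of getting markup characters into text content. The sketch below, which assumes Python's standard library as the parser, shows that both reach the application as plain text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Inside CDATA, "<" and ">" are ordinary characters, not markup.
cdata = ET.fromstring(
    "<example><![CDATA[ some text here </notAtag> <>]]></example>"
)

# Outside CDATA, "<" must be written as the entity reference &lt;.
entity = ET.fromstring("<element> this is less than &lt; </element>")

print(repr(cdata.text))
print(repr(entity.text))

# Helpers for producing and undoing the references when writing XML by hand.
print(escape("a < b & c"))
print(unescape("&lt;tag&gt;"))
```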

  15. More XML: Comments • Syntax: <!-- .... comment text ... --> • Yes, they are part of the data model!

  16. XML Semantics: a Tree! • <data> <person age="25"> <name> Mary </name> <address> <street> Maple </street> <no> 345 </no> <city> Seattle </city> </address> </person> <person> <name> John </name> <address>Thailand</address> <phone> 23456 </phone> </person> </data> • [Figure: the corresponding tree, with a legend for element nodes, attribute nodes, and text nodes. The root data has two person children. The first has age (25), name (Mary), and address with street (Maple), no (345), and city (Seattle). The second has name (John), address (Thailand), and phone (23456).] • Order matters!
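
The tree view of this document can be reproduced with a short recursive walk. This is a sketch using Python's standard xml.etree.ElementTree; the outline format (indentation, "@" for attributes) is an assumption chosen for readability.

```python
import xml.etree.ElementTree as ET

doc = """<data>
  <person age="25">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address>Thailand</address>
    <phone> 23456 </phone>
  </person>
</data>"""


def outline(elem, depth=0):
    """Return one line per element node, children in document order."""
    text = (elem.text or "").strip()
    attrs = "".join(f" @{k}={v}" for k, v in elem.attrib.items())
    lines = ["  " * depth + elem.tag + attrs + (f": {text}" if text else "")]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(doc)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Iterating over an element yields its children in document order, which is the sense in which order matters for elements.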

  17. Well-Formed XML • Start the document with a declaration, surrounded by <?xml … ?> • Normal declaration is: <?xml version = "1.0" standalone = "yes" ?> • standalone = "yes" means “no DTD provided” • Has a single root element surrounding nested elements • Has matching tags
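
Any XML parser rejects input that is not well formed, so a well-formedness check can be sketched as "try to parse it". The helper name is_well_formed is hypothetical; the parser is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed('<?xml version="1.0" standalone="yes"?><a><b/></a>'))
print(is_well_formed("<a><b></a></b>"))   # tags do not nest properly
print(is_well_formed("<a/><b/>"))         # two root elements
print(is_well_formed("<Book></book>"))    # tags are case sensitive
```

This checks well-formedness only. It says nothing about validity against a DTD, which is a separate and stronger condition.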

  18. XML Data • XML is self-describing • Schema elements become part of the data • Relational schema: person(name, phone) • In XML <person>, <name>, <phone> are part of the data, and are repeated many times • Consequence: XML is much more flexible • XML = semistructured data • Well-Formed XML with nested tags is exactly the same idea as trees of semistructured data • XML also enables nontree structures, as does the semistructured data model

  19. XML is Semistructured Data • Missing attributes: <person> <name>John</name> <phone>1234</phone> </person> <person> <name>Joe</name> </person> (no phone!) • Could represent in a table with nulls

  20. XML is Semistructured Data • Repeated attributes: <person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone> </person> (two phones!) • Impossible in a table: the phone column would need two values in one row

  21. XML is Semistructured Data • Attributes with different types in different objects: <person> <name> <first>John</first> <last>Smith</last> </name> <phone>1234</phone> </person> (structured name!) • Nested collections (no 1NF) • Heterogeneous collections: <db> contains both <book>s and <publisher>s
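
Missing and repeated attributes both show up naturally when the data is read as lists. In this sketch the wrapping <people> root element is an assumption added so that the three person fragments form one well-formed document.

```python
import xml.etree.ElementTree as ET

# <people> is a hypothetical root element wrapping the person examples.
doc = """<people>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</people>"""

# findall returns every matching child: zero, one, or many phones.
phones = {
    person.findtext("name"): [ph.text for ph in person.findall("phone")]
    for person in ET.fromstring(doc).findall("person")
}
print(phones)
```

A relational person(name, phone) table would need a null for Joe and either two rows or a separate phone table for Mary.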

  22. Document Type Definition (DTD) • Part of the original XML specification • An XML document may have a DTD • Valid XML: if it has a DTD and conforms to it • Validation is useful in data exchange

  23. Very Simple DTD <!DOCTYPE db [ <!ELEMENT db ((book|publisher)*)> <!ELEMENT book (title,author*,year?)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT year (#PCDATA)> <!ELEMENT publisher (#PCDATA)> ]>

  24. DTD: The Content Model • Content model: <!ELEMENT tag (CONTENT)> • Complex = a regular expression over other elements • Text-only = #PCDATA • Empty = EMPTY • Any = ANY • Mixed content = (#PCDATA | A | B | C)*

  25. DTD: Regular Expressions • sequence: <!ELEMENT name (firstName, lastName)> allows <name> <firstName>…</firstName> <lastName>…</lastName> </name> • optional: <!ELEMENT name (firstName?, lastName)> allows <name> <lastName>…</lastName> </name> and <name> <firstName>…</firstName> <lastName>…</lastName> </name> • zero or more: <!ELEMENT person (name, phone*)> allows <person> <name>…</name> <phone>…</phone> <phone>…</phone> <phone>…</phone> … </person> and <person> <name>…</name> </person> • one or more: <!ELEMENT person (name, phone+)> allows <person> <name>…</name> <phone>…</phone> <phone>…</phone> … </person> and <person> <name>…</name> <phone>…</phone> </person> • alternation: <!ELEMENT person (name, (phone|email))> allows <person> <name>…</name> <phone>…</phone> </person> and <person> <name>…</name> <email>…</email> </person>
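
Because a content model is a regular expression over child element names, it can be checked with an ordinary regex engine. The sketch below is not how a real validating parser works internally. It is a hypothetical helper that handles element-only content models (no #PCDATA, EMPTY, or ANY) by rewriting them into Python regexes.

```python
import re


def content_model_to_regex(model):
    """Translate an element-only DTD content model into a compiled regex.

    Each element name becomes a group matching "name " (the trailing space
    terminates the token). Commas disappear because concatenation is implicit
    in regexes. The operators | * + ? and parentheses carry over unchanged.
    """
    compact = re.sub(r"\s+", "", model)
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: f"(?:{re.escape(m.group())} )",
        compact,
    )
    return re.compile(pattern.replace(",", ""))


def matches(model, child_tags):
    """True if the sequence of child tag names satisfies the content model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return content_model_to_regex(model).fullmatch(sequence) is not None


print(matches("(firstName?, lastName)", ["lastName"]))
print(matches("(name, phone*)", ["name"]))
print(matches("(name, phone+)", ["name"]))
print(matches("(name, (phone|email))", ["name", "phone", "email"]))
```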

  26. DTD: Attributes • <!ELEMENT person (ssn, name, office, phone?)> <!ATTLIST person age CDATA #REQUIRED height CDATA #IMPLIED> • <person age="25" height="6"> <name> ...</name> ... </person>

  27. DTD: Attributes • <!ATTLIST tag (name type kind)+> • Types: • CDATA = string • (Mon | Wed | Fri) = enumeration • ID = key • IDREF = foreign key • IDREFS = foreign keys separated by space • others = rarely used • Kind: • #REQUIRED • #IMPLIED = optional • "value" = default value • #FIXED "value" = the only value allowed

  28. XML: IDs and References • Attributes can be pointers from one object to another • Compare to HTML’s NAME = "foo" and HREF = "#foo" • Allows the structure of an XML document to be a general graph, rather than just a tree

  29. XML: Creating ID’s • Give an element E an attribute A of type ID • When using tag <E> in an XML document, give its attribute A a unique value • Example: • <E A = “xyz”>

  30. XML: Creating References • To allow objects of type F to refer to another object with an ID attribute, give F an attribute of type IDREF • Or, let the attribute have type IDREFS, so the F-object can refer to any number of other objects

  31. XML: IDs and References • <person id="o555"> <name>Jane</name> </person> <person id="o456"> <name> Mary </name> <children idref="o123 o555"/> </person> <person id="o123" mother="o456"> <name>John</name> </person> • IDs and references in XML are just syntax
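
Since IDs and references are just syntax, following them is the application's job. The sketch below resolves the references by hand. The <people> root element is an assumption added to make the fragment a single document, and without a DTD the code simply assumes that id is the ID attribute and that idref and mother hold references.

```python
import xml.etree.ElementTree as ET

# <people> is a hypothetical root element wrapping the three persons.
doc = """<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>"""

root = ET.fromstring(doc)

# Index every person by its id so references can be followed.
by_id = {p.get("id"): p for p in root.iter("person")}


def name_of(person_id):
    return by_id[person_id].findtext("name")


# IDREFS-style value: several ids separated by spaces.
mary = by_id["o456"]
children = [name_of(ref) for ref in mary.find("children").get("idref").split()]

# IDREF-style value: a single id.
mother_of_john = name_of(by_id["o123"].get("mother"))

print(children, mother_of_john)
```

Following mother from John and children from Mary gives a cycle, which is how references turn the element tree into a general graph.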

  32. DTD: ID and IDREF(S) Attributes • <!ELEMENT person (ssn, name, office, phone?)> <!ATTLIST person age CDATA #REQUIRED id ID #REQUIRED manager IDREF #REQUIRED manages IDREFS #REQUIRED > • <person age="25" id="p29432" manager="p48293" manages="p34982 p423234"> <name> ....</name> ... </person>

  33. Use of DTDs • Set standalone = “no” • Either: • Include the DTD as a preamble of the XML document, or • Follow DOCTYPE and the <root tag> by SYSTEM and a path to the file where the DTD can be found, or • Mix the two... (e.g. to override the external definition)

  34. Example (a) • The DTD and the document: <?xml version = "1.0" standalone = "no" ?> <!DOCTYPE BARS [ <!ELEMENT BARS (BAR*)> <!ELEMENT BAR (NAME, BEER+)> <!ELEMENT NAME (#PCDATA)> <!ELEMENT BEER (NAME, PRICE)> <!ELEMENT PRICE (#PCDATA)> ]> <BARS> <BAR><NAME>Joe’s Bar</NAME> <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER> <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER> </BAR> <BAR> … </BARS>

  35. Example (b) Get the DTD from the file bar.dtd • Assume the BARS DTD is in file bar.dtd • <?xml version = "1.0" standalone = "no" ?> <!DOCTYPE BARS SYSTEM "bar.dtd"> <BARS> <BAR><NAME>Joe’s Bar</NAME> <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER> <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER> </BAR> <BAR> … </BARS>

  36. DTDs as Grammars <!DOCTYPE db [ <!ELEMENT db ((book|publisher)*)> <!ELEMENT book (title,author*,year?)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT year (#PCDATA)> <!ELEMENT publisher (#PCDATA)> ]>

  37. DTDs as Grammars • A DTD is an EBNF (Extended BNF) grammar • An XML tree is precisely a derivation tree • A valid XML document = a parse tree for that grammar • The db DTD is the same thing as: • db ::= (book|publisher)* • book ::= (title,author*,year?) • title ::= string • author ::= string • year ::= string • publisher ::= string
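
The grammar reading can be made concrete with a small recursive checker. This is a sketch only: the GRAMMAR table is translated by hand from the db DTD, is_valid is a hypothetical helper, and a real validating parser covers far more (attributes, entities, the DOCTYPE root check).

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the db DTD: element name -> regex over "tag " tokens,
# or None for #PCDATA (text only, no child elements).
GRAMMAR = {
    "db": r"(?:book |publisher )*",
    "book": r"title (?:author )*(?:year )?",
    "title": None,
    "author": None,
    "year": None,
    "publisher": None,
}


def is_valid(elem):
    """True if the subtree rooted at `elem` is a parse tree for GRAMMAR."""
    if elem.tag not in GRAMMAR:
        return False
    rule = GRAMMAR[elem.tag]
    children = list(elem)
    if rule is None:
        return not children  # #PCDATA: text only
    # Element-only content: no stray text between the children.
    if (elem.text or "").strip() or any((c.tail or "").strip() for c in children):
        return False
    tags = "".join(c.tag + " " for c in children)
    return re.fullmatch(rule, tags) is not None and all(map(is_valid, children))


good = ET.fromstring(
    "<db><book><title>T</title><author>A</author></book>"
    "<publisher>P</publisher></db>"
)
bad = ET.fromstring("<db><book><author>A</author><title>T</title></book></db>")
print(is_valid(good), is_valid(bad))
```

The second document fails only because author precedes title, which is the kind of ordering constraint a DTD imposes.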

  38. DTDs as Grammars <!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title,section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)> ]> <paper> <section> <text> </text> </section> <section> <title> </title> <section> … </section> <section> … </section> </section> </paper> • XML documents can be nested arbitrarily deep

  39. DTDs as Schemas Not so well suited: • impose unwanted constraints on order: • <!ELEMENT person (name,phone)> • references cannot be constrained • ID/IDREFS can reference any ID • can be too vague: • <!ELEMENT person ((name|phone|email)*)>

  40. DTDs as Schemas • No context-dependent typing • Cannot distinguish between used car ads and new car ads • Different structure in different contexts • [Figure: tree with dealer at the root, UsedCars and NewCars below it, an ad under each, and model and year nodes under the ads]

  41. XML APIs • Document Object Model - DOM • Manipulation of XML Data • Provides a representation of an XML Document as a tree • Reads XML Document into memory • http://www.w3.org/DOM • Many implementations (Sun JAXP, Apache Xerces, …) • Simple API for XML - SAX • Event-based framework for parsing XML data • http://www.saxproject.org/
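
The difference between the two API styles can be sketched with Python's standard library, which ships a DOM implementation (xml.dom.minidom) and a SAX implementation (xml.sax). Using Python rather than the Java implementations named on the slide is an assumption of this sketch, as is the PriceHandler class.

```python
import xml.sax
from xml.dom import minidom

doc = """<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""

# DOM: read the whole document into memory as a tree, then navigate it.
dom = minidom.parseString(doc)
dom_prices = [
    float(node.firstChild.data) for node in dom.getElementsByTagName("PRICE")
]


# SAX: the parser calls back on each event; the handler keeps only what it needs.
class PriceHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.buffer = []
        self.prices = []

    def startElement(self, name, attrs):
        if name == "PRICE":
            self.in_price = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect before converting.
        if self.in_price:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "PRICE":
            self.prices.append(float("".join(self.buffer)))
            self.in_price = False


handler = PriceHandler()
xml.sax.parseString(doc.encode("utf-8"), handler)
sax_prices = handler.prices

print(dom_prices, sax_prices)
```

DOM is convenient when the application needs random access to the tree. SAX never builds the tree, which suits documents too large to hold in memory.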

  42. References • Lecture Slides • Jeffrey D. Ullman • http://www-db.stanford.edu/~ullman/dscb/pslides/pslides.html • Dan Suciu • http://www.cs.washington.edu/homes/suciu/COURSES/590DS/02xmlsyntax.htm • http://www.cs.washington.edu/homes/suciu/COURSES/590DS/11dtd.htm • Alon Levy • http://www.cs.washington.edu/education/courses/csep544/02sp/lectures/lecture5cut.ppt • BRICS XML Tutorial • A. Moeller, M. Schwartzbach • http://www.brics.dk/~amoeller/XML/index.html • W3C's XML homepage • http://www.w3.org/XML • XML School: an XML tutorial • http://www.w3schools.com/xml
