Semi-Structured Data Models

Semi-Structured Data Models. By Chris Bennett. Semi-Structured Data. What is it? Data where structure not necessarily determined in advance (often implicit in data) Descriptive, not prescriptive Self-describing and flexible in structure Where does it come from?

Presentation Transcript


  1. Semi-Structured Data Models By Chris Bennett

  2. Semi-Structured Data • What is it? • Data whose structure is not necessarily determined in advance (often implicit in the data) • Descriptive, not prescriptive • Self-describing and flexible in structure • Where does it come from? • When the data cannot be (or simply is not) modeled naturally or usefully using a standard data model • Merging multiple data sources, sparse user annotations, rapidly evolving schemas specific to given communities • Raw data is often semi-structured • Frequently a product of a rapidly evolving schema • Examples • HTML, XML, BibTeX, integrated data sources, etc.

  3. Semi-Structured Data • This is great – infinite flexibility!! Is there a catch? Always a tradeoff… • In this case, retrieval and query performance can suffer greatly compared to more structured data models

  4. Semi-Structured Data So we know what it is – how do we… • Model it? • Directed labeled graphs • Query it? • Many proposals, all include regular path expressions…Lorel, XML Query… • Store it? • Big challenge Haystack Model

  5. Semi-Structured Data Models • What do they do? • Provide a common framework • In effect, they add some structure • Why? • Semi-structured data is often irregular or missing, similar concepts are represented using different types, heterogeneous sets are present, or object structure is not fully known • Standardize information exchange • Data verification (both internal and external) • Examples • OEM, XML DTD, XML Schema…

  6. OEM – Object Exchange Model • Developed at Stanford (mid 90s) • Precursor to today’s accepted semi-structured data formats (XML) • Each object is a tuple: (label, type, value, object-ID) • Main feature: self-describing • Requires a good bit of human intervention, though
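
The (label, type, value, object-ID) tuple can be sketched in a few lines of Python. This is only an illustration: the class name `OEMObject`, the `describe` helper, the `&n` object IDs, and the bibliography data are assumptions made for this example and are not taken from the OEM papers.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class OEMObject:
    """One OEM object: (label, type, value, object-ID)."""
    label: str
    type: str                      # e.g. "string", "integer", or "set"
    value: Union[str, int, list]   # atomic value, or a list of sub-objects
    oid: str


# Hypothetical data, invented for illustration only.
book = OEMObject("book", "set", [
    OEMObject("topic", "string", "Databases", "&3"),
    OEMObject("internal-call-no", "integer", 1234, "&4"),
], "&2")
biblio = OEMObject("biblio", "set", [book], "&1")


def describe(obj, depth=0):
    """Render an object and its nested sub-objects as indented text lines."""
    pad = "  " * depth
    if obj.type == "set":
        lines = [f"{pad}<{obj.oid}, {obj.label}, set>"]
        for child in obj.value:
            lines.extend(describe(child, depth + 1))
        return lines
    return [f"{pad}<{obj.oid}, {obj.label}, {obj.type}, {obj.value!r}>"]


print("\n".join(describe(biblio)))
```

Every object carries its own label and type, so no separate schema is needed to interpret it. That is the sense in which the model is self-describing.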

  7. Object-Oriented Model versus OEM • OEM is an information exchange model (does not specify object storage issues) • OEM is much simpler (supports object nesting…omits classes, methods, inheritance) • Uses labels in place of schema

  8. Advantages of OEM • Simple model makes transforming and merging data simpler • Advanced features can be “emulated” (implies human intervention) • More suitable for heterogeneity • Hindsight: without some structure, extreme heterogeneity demands more than a little human intervention

  9. Components of OEM • Query Language • OEM-QL – typical SELECT-FROM-WHERE • Translator • Translates OEM-QL to a specific data source and back • Mediator • Collects the results from the translators, then merges and/or combines them into OEM structures

  10. OEM-QL SELECT – FROM – WHERE • Adaptation of an SQL-like language for OO models • SELECT fetch-expression FROM object WHERE condition • Expressions in the SELECT and WHERE clauses use the notion of a path, which describes a traversal through an object using sub-object structure and labels

  11. OEM-QL • SELECT biblio.?.topic FROM root WHERE biblio.?.internal-call-no • ? denotes a match to any label • Returns the topic of books for which an internal call number exists • The question mark lets the user say that the intermediate “node” in the path through the object can have any name
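
The wildcard path above can be emulated with a short recursive traversal. This is a sketch under stated assumptions: objects are simplified to (label, value) pairs, the data is invented, `follow` and `topics_with_call_no` are hypothetical helpers, and both paths are assumed to bind the same intermediate object.

```python
# Each object is (label, value); value is atomic or a list of sub-objects.
# The data is hypothetical, invented for illustration.
root = ("root", [
    ("biblio", [
        ("book", [("topic", "Databases"), ("internal-call-no", 1234)]),
        ("book", [("topic", "Compilers")]),
        ("article", [("topic", "XML"), ("internal-call-no", 42)]),
    ]),
])


def follow(obj, path):
    """Return all objects reached from obj via path; '?' matches any label."""
    if not path:
        return [obj]
    _, value = obj
    if not isinstance(value, list):
        return []          # atomic objects have no sub-objects to descend into
    found = []
    for child in value:
        if path[0] == "?" or child[0] == path[0]:
            found.extend(follow(child, path[1:]))
    return found


def topics_with_call_no(root_obj):
    """Emulate: SELECT biblio.?.topic FROM root WHERE biblio.?.internal-call-no"""
    results = []
    for entry in follow(root_obj, ["biblio", "?"]):
        if follow(entry, ["internal-call-no"]):      # existence test
            results.extend(value for _, value in follow(entry, ["topic"]))
    return results


print(topics_with_call_no(root))
```

Because `?` matches any label, the article is returned alongside the book even though the query never names either kind of entry.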

  12. XML DTD – Document Type Definition • Let there be (a little) more structure… • DTDs define the legal building blocks of an XML document. • A DTD defines the document structure with a list of legal elements and/or attributes, and it can be declared inline or external to the XML document.

  13. XML DTD Example
      <!DOCTYPE note [
        <!ELEMENT note (to, from, heading, body)>
        <!ELEMENT to (#PCDATA)>
        <!ELEMENT from (#PCDATA)>
        <!ELEMENT heading (#PCDATA)>
        <!ELEMENT body (#PCDATA)>
      ]>

  14. XML DTD Advantages • An application can use a standard DTD to verify that data it receives from the outside world is valid. • Content models are flexible: elements can be nested and given occurrence indicators: • + means at least one occurrence • * means zero or more occurrences • ? means zero or one occurrence • Example: <!ELEMENT note (to+, from, header, message*)>
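
A content model with occurrence indicators, such as (to+, from, header, message*), behaves like a regular expression over the sequence of child element names. The sketch below illustrates that idea with Python's standard library. It is not a DTD validator, and `NOTE_MODEL` and `children_match` are names made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Content model (to+, from, header, message*) written as a regex over child
# names. Each name is followed by a space so that names cannot run together.
NOTE_MODEL = re.compile(r"(to )+(from )(header )(message )*")


def children_match(xml_text):
    """Return True if the root's children satisfy the content model above."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return NOTE_MODEL.fullmatch(names) is not None


ok = "<note><to>A</to><to>B</to><from>C</from><header>Hi</header></note>"
bad = "<note><from>C</from><header>Hi</header></note>"   # no <to> element
print(children_match(ok), children_match(bad))
```

The check covers only element order and occurrence counts. Like a DTD, it says nothing about the values inside the elements.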

  15. DTD Drawbacks • What about constraints? • DTDs do not offer much help in constraining the value of a particular attribute or element (only the use of markup) • Automated processing of XML documents requires more rigorous and comprehensive facilities in this area. • Constraints are needed on how the component parts of an application fit together, the document structure, attributes, data typing, and so on.

  16. XML Schema • Well-formed is not enough! Let there be more structure! • XML Schema is an XML-based alternative (and ultimate successor) to DTDs • XML Schemas express shared vocabularies and allow machines to carry out rules made by people. • They provide a means for defining the structure, content and semantics of XML documents

  17. Successor to DTDs • XML Schema is: • Extensible to future additions • Richer and more useful than DTDs • Written in XML • XML Schema supports: • Data types • Namespaces

  18. XML Schema Advantages • Better validation, restriction, and type conversion • Extensible – reuse and modify existing data types, reference multiple schemas

  19. XML Schema Details Defines… • Elements that can appear in a document • Attributes that can appear in a document • Which elements are child elements • Order of child elements • Number of child elements • Whether an element is empty or can include text • Data types for elements and attributes • Default and fixed values for elements and attributes

  20. XML Schema Components • Primary components: • Simple type definitions, complex type definitions, attribute declarations, and element declarations • Secondary components, which must have names: • Attribute group definitions, identity-constraint definitions, model group definitions, and notation declarations • "Helper" components, which provide small parts of other components and are not independent of their context: • Annotations, model groups, particles, wildcards, attribute uses

  21. XML Namespaces (W3C Documentation) • Collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names • XML namespaces differ from the "namespaces" conventionally used in computing disciplines in that the XML version has internal structure and is not, mathematically speaking, a set
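
A short Python sketch shows how a namespace URI becomes part of an element's name once a document is parsed. The document text and the name `Alice` are invented for illustration. The URI is the example one used in the W3C Schema Primer.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the root element carries the "po" prefix.
doc = """<po:purchaseOrder xmlns:po="http://www.example.com/PO1">
  <shipTo>Alice</shipTo>
</po:purchaseOrder>"""

root = ET.fromstring(doc)

# ElementTree expands a prefixed name to "{namespace-URI}local-name".
print(root.tag)

# The unprefixed child is in no namespace, so its tag is just the local name.
print(root[0].tag)
```

The prefix `po` is only shorthand. The identity of the name comes from the URI, which is why two documents may use different prefixes for the same namespace.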

  22. XML Schema Example (from the W3C XML Schema Primer)
      <schema xmlns="http://www.w3.org/2001/XMLSchema"
              xmlns:po="http://www.example.com/PO1"
              targetNamespace="http://www.example.com/PO1"
              elementFormDefault="unqualified"
              attributeFormDefault="unqualified">
        <element name="purchaseOrder" type="po:PurchaseOrderType"/>
        <element name="comment" type="string"/>
        <complexType name="PurchaseOrderType">
          <sequence>
            <element name="shipTo" type="po:USAddress"/>
            <element name="billTo" type="po:USAddress"/>
            <element ref="po:comment" minOccurs="0"/>
            <!-- etc. -->
          </sequence>
          <!-- etc. -->
        </complexType>
        <complexType name="USAddress">
          <sequence>
            <element name="name" type="string"/>
            <element name="street" type="string"/>
            <!-- etc. -->
          </sequence>
        </complexType>
        <!-- etc. -->
      </schema>
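
Because a schema is itself written in XML, ordinary XML tooling can read it. The sketch below parses a trimmed copy of the purchase-order schema with Python's standard library and lists what it declares. It only inspects the schema and does not validate documents against it (the standard library has no XSD validator). `global_elements` and `sequence_of` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Trimmed copy of the purchase-order schema; the elided parts are left out.
schema_text = """<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:po="http://www.example.com/PO1"
        targetNamespace="http://www.example.com/PO1">
  <element name="purchaseOrder" type="po:PurchaseOrderType"/>
  <element name="comment" type="string"/>
  <complexType name="PurchaseOrderType">
    <sequence>
      <element name="shipTo" type="po:USAddress"/>
      <element name="billTo" type="po:USAddress"/>
      <element ref="po:comment" minOccurs="0"/>
    </sequence>
  </complexType>
</schema>"""

schema = ET.fromstring(schema_text)

# Global element declarations are the direct <element> children of <schema>.
global_elements = {e.get("name"): e.get("type")
                   for e in schema.findall(XSD + "element")}


def sequence_of(type_name):
    """List (name-or-ref, minOccurs) for the ordered children of a complex type."""
    for ctype in schema.findall(XSD + "complexType"):
        if ctype.get("name") == type_name:
            return [(e.get("name") or e.get("ref"), int(e.get("minOccurs", "1")))
                    for e in ctype.findall(f"{XSD}sequence/{XSD}element")]
    return []


print(global_elements)
print(sequence_of("PurchaseOrderType"))
```

The `minOccurs="0"` on the comment reference plays the role of the DTD `?` indicator. The `type` attributes add the data typing that DTDs lack.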

  23. Querying Semi-Structured Data • Keys: • Semi-structured data modeled on directed graphs • User cannot have full knowledge of data structure, but we should exploit what structure we do know exists • Examples • Lorel • Developed at Stanford (1997) as part of the Lore (lightweight object repository) project • XPath • W3C standard • Language for addressing parts of an XML document

  24. Lore System • Successor to OEM • Fully functional DBMS for XML with: • Declarative query language, multiple indexing techniques, a cost-based query optimizer, multi-user support, logging, and recovery • Novel features include: • DataGuides • Management of external data • Proximity search

  25. Lore – Novel Features • DataGuides • Structural summary of all paths in that database • Used by query optimizer to exploit known structure • Manage External Data • Proximity Search • Ranks database objects based on their proximity to other objects • Measure proximity based on distances in the graph linking the objects together
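
The idea of a structural summary can be sketched as the set of distinct label paths found in the data. This is a simplification that assumes tree-shaped data held in nested Python dicts and lists. Lore's actual DataGuides are built over graphs and are not reproduced here. The restaurant data and the `label_paths` helper are invented for illustration.

```python
def label_paths(obj, prefix=()):
    """Collect every distinct label path reachable from obj (trees only)."""
    paths = set()
    if isinstance(obj, dict):
        for label, children in obj.items():
            path = prefix + (label,)
            paths.add(path)
            # A label may lead to one sub-object or to a list of them.
            for child in children if isinstance(children, list) else [children]:
                paths |= label_paths(child, path)
    return paths


# Hypothetical database: the two restaurants have different structure.
guide_db = {"Guide": {"restaurant": [
    {"name": "Saigon", "address": {"zipcode": "92310"}, "price": "cheap"},
    {"name": "Chef Chu", "zipcode": "94301"},
]}}

summary = sorted(".".join(path) for path in label_paths(guide_db))
for line in summary:
    print(line)
```

Each path appears once in the summary no matter how many objects share it. That is what lets a user or an optimizer see which paths exist without scanning the data.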

  26. Lorel – Lore Query Language • Based on OQL • Provides powerful path traversal operators • Makes extensive use of type coercion to help yield "intuitive" results for all queries over XML data • Permits flexible form of declarative navigational access • Particularly suited to when details of structure are not known

  27. Lorel – Coercion Rules

  28. Lorel Example • Find the names and zip codes of all “cheap” restaurants • select Guide.restaurant.name, Guide.restaurant(.address)?.zipcode where Guide.restaurant.% grep "cheap" • The ? after (.address) means the address is optional in the path expression • The % matches any subobject of restaurant • The comparison operator grep returns true if the string "cheap" appears anywhere in the subobject value
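
The optional path step and the % wildcard in this query can be emulated over nested Python dicts. This is a rough sketch, not Lorel's semantics in full: the data is invented, `zipcodes` and `is_cheap` are hypothetical helpers, and grep is approximated by a substring test on string-valued subobjects only.

```python
# Hypothetical data: restaurants differ in where (or whether) they keep a zipcode.
guide = {"restaurant": [
    {"name": "Saigon", "address": {"zipcode": "92310"}, "price": "cheap"},
    {"name": "Chef Chu", "zipcode": "94301", "comment": "cheap lunch menu"},
    {"name": "Blue Fox", "address": {"zipcode": "94025"}, "price": "expensive"},
]}


def zipcodes(restaurant):
    """Follow an optional address step: zipcode directly or under address."""
    found = []
    if "zipcode" in restaurant:
        found.append(restaurant["zipcode"])
    address = restaurant.get("address")
    if isinstance(address, dict) and "zipcode" in address:
        found.append(address["zipcode"])
    return found


def is_cheap(restaurant):
    """Wildcard plus grep: some string-valued subobject contains "cheap"."""
    return any(isinstance(value, str) and "cheap" in value
               for value in restaurant.values())


result = [(r["name"], zipcodes(r)) for r in guide["restaurant"] if is_cheap(r)]
print(result)
```

The query never says which subobject holds the word "cheap" or whether a zipcode sits under an address. That is the kind of structural uncertainty the path operators are meant to absorb.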

  29. Lorel – Another example • select X.name from John.name JN, John.child X, X.name XN where JN == XN • “Retrieve the children of John bearing his name” • == expects atomic values, so they are coerced • Rewritten: select X.name from John.child X where John.name == X.name

  30. Lorel – Constructing Results • S-F-W in Lorel has the same semantics as in SQL: the result is a bag (multiset), or a set if ‘distinct’ is used • The result is always a collection of OEM objects (duplicates are eliminated by OID) • For each assignment of the variables in the from clause that passes the condition of the where clause, a value is generated according to the expressions in the select clause • Results can refer to database objects or to new objects created by coercion

  31. Lorel – Data Updates • Create and delete database names • Delete is implicit when object becomes unreachable • Create a new atomic or complex object • Modify the value of an existing atomic or complex object • Bulk load an OEM database

  32. Lorel – Updates cont’d… • Assigning names to objects Name myFavorite := element (select Guide.Restaurant where Guide.Restaurant.name = “Saigon”) • Creating objects new_oem (int, 5) new_oem (complex, struct(a:{new_oem(int,5)}, b:{X,Y}))

  33. XPath Features • XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax • Provides basic facilities for manipulation of strings, numbers and booleans • XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values

  34. XPath – How It Works • XPath models an XML document as a tree of nodes • Root nodes, element nodes, text nodes, attribute nodes, namespace nodes, processing instruction nodes, comment nodes • Evaluation occurs with respect to a “context”, which consists of: • a node (the context node) • a pair of non-zero positive integers (the context position and the context size) • a set of variable bindings • a function library • the set of namespace declarations in scope for the expression

  35. XPath – How It Works • Location path – selects a set of nodes relative to the context node • An expression that is a location path results in a node-set • Examples of location paths • Includes functions for node-sets, strings, numbers, etc.

  36. XPath – Generic Example • Simple: employee[@secretary and @assistant] • Selects all the employee children of the context node that have both a secretary attribute and an assistant attribute
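
The same selection can be tried with Python's standard library. One caveat: `xml.etree.ElementTree` supports only a subset of XPath and does not accept `and` inside a predicate, so this sketch chains two predicates instead, which selects the same elements. The staff document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; <staff> plays the role of the context node.
staff = ET.fromstring("""<staff>
  <employee name="Ana" secretary="Bo" assistant="Cy"/>
  <employee name="Dee" secretary="Eli"/>
  <employee name="Fay"/>
</staff>""")

# Chained predicates stand in for employee[@secretary and @assistant].
both = staff.findall("employee[@secretary][@assistant]")
names = [e.get("name") for e in both]
print(names)
```

A full XPath 1.0 engine would accept the original expression unchanged. Only the predicate syntax differs here, not the result.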
