
Capturing Semantics in XML Documents

Tok Wang Ling

Department of Computer Science

National University of Singapore

KDXD 2006, Singapore


Roadmap

  • XML documents and current XML schema languages

  • ORA-SS (Object-Relationship-Attribute model for Semi-Structured data) [4]

  • The applications of ORA-SS

  • Discovering Semantics in XML documents

  • Conclusion

[4]. T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc., 2005.

1. XML – Brief introduction

  • XML (eXtensible Markup Language) is

    • Released by W3C

    • An application of SGML

    • A promising standard for publishing, integrating and exchanging data on the web

  • XML schema languages

    • DTD (Document Type Definition) [3]

    • XSD (XML Schema Definition), a W3C recommended standard [6, 7, 8]

[3]. Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February 2004.

http://www.w3.org/TR/2004/REC-xml-20040204/

[6]. XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October 2004.

http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/

[7]. XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004.

http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/

[8]. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004.

http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

1. XML – A motivating example

  • Suppose we have an XML document “psj.xml” about different parts, suppliers and projects, where

    • The document has a root element psj;

    • Under psj, there is a sequence of part elements;

    • Under part, there is a sequence of supplier elements;

    • Under supplier, there is a sequence of project elements.

Example 1. psj.xml

<?xml version="1.0" encoding="UTF-8"?>
<psj xmlns:xsi="…" xsi:noNamespaceSchemaLocation="…">
  <part>
    <pno>P001</pno> <pname>Nut</pname> <color>Silver</color>
    <supplier>
      <sno>S001</sno> <sname>Alfa</sname>
      <city>Atlanta</city> <price>5</price>
      <project>
        <jno>J001</jno> <jname>Rocket boots</jname>
        <budget>20000</budget> <qty>60</qty>
      </project>
      <project>
        <jno>J003</jno> <jname>Firework launcher</jname>
        <budget>250000</budget> <qty>650</qty>
      </project>
    </supplier>
    <supplier>
      <sno>S002</sno> <sname>Beta</sname>
      <city>Atlanta</city> <city>New York</city> <price>5.5</price>
      <project>
        <jno>J002</jno> <jname>Diving helm</jname>
        <budget>18000</budget> <qty>70</qty>
      </project>
      <project>
        <jno>J003</jno> <jname>Firework launcher</jname>
        <budget>250000</budget> <qty>50</qty>
      </project>
    </supplier>
  </part>
  <part>
    <pno>P002</pno> <pname>Nut</pname> <color>Copper</color>
    <supplier>
      <sno>S001</sno> <sname>Alfa</sname>
      <city>Atlanta</city> <price>4.6</price>
      <project>
        <jno>J002</jno> <jname>Diving helm</jname>
        <budget>18000</budget> <qty>60</qty>
      </project>
    </supplier>
    <supplier>
      <sno>S003</sno> <sname>Beta</sname>
      <city>New York</city> <price>5</price>
      <project>
        <jno>J001</jno> <jname>Rocket boots</jname>
        <budget>20000</budget> <qty>20</qty>
      </project>
      <project>
        <jno>J004</jno> <jname>Blue fireworks</jname>
        <budget>20000</budget> <qty>50</qty>
      </project>
    </supplier>
  </part>
</psj>

1. XML – the DTD of “psj.xml”

(a) “psj.dtd”, the DTD of “psj.xml”

<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>

(b) psj.dtd as a Data Guide

▼♦ psj
  ▼♦ part
    ♦ pno
    ♦ pname
    ♦ color
    ▼♦ supplier
      ♦ sno
      ♦ sname
      ♦ city
      ♦ price
      ▼♦ project
        ♦ jno
        ♦ jname
        ♦ budget
        ♦ qty

1. XML – what the DTD says

  • DTD is a simple definition of an XML document, where users can define

    • Element/Attribute types

    • Occurrence constraints (e.g. ?, +, *)

    • Containment among different element types (the structure)

  • DTD cannot express

    • Occurrence constraints in numbers (e.g. 2 to 8)

    • Uniqueness/key constraints on a combination of attributes/elements (in a DTD, ID can only be declared on one attribute at a time)

    • Relationship types among elements and their degrees

    • The difference between an attribute (or simple element) of an element type and an attribute (or simple element) of a relationship type

Note: simple elements are element types that contain only PCDATA and have no attributes.


1. XML – XSD

“psj.xsd”, the XSD schema of the motivating example data.

<xs:schema xmlns:xs="…">
  <xs:element name="psj">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="part" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="pno" type="xs:string"/>
              <xs:element name="pname" type="xs:string"/>
              <xs:element name="color" type="xs:string"/>
              <xs:element name="supplier" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="sno" type="xs:string"/>
                    <xs:element name="sname" type="xs:string"/>
                    <xs:element name="city" type="xs:string" maxOccurs="unbounded"/>
                    <xs:element name="price" type="xs:string"/>
                    <xs:element name="project" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="jno" type="xs:string"/>
                          <xs:element name="jname" type="xs:string"/>
                          <xs:element name="budget" type="xs:string"/>
                          <xs:element name="qty" type="xs:string"/>
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="PK">
      <xs:selector xpath="part"/>
      <xs:field xpath="pno"/>
    </xs:key>
  </xs:element>
</xs:schema>

Annotations:

  • maxOccurs="unbounded" is the XSD definition of an element occurrence constraint.

  • The xs:key element is the XSD definition of a key constraint. It requires that every part element has a non-nil pno element and that the values of all pno elements in the document are unique.

1. XML – what XSD can tell

  • XSD is the standard for XML schema definition, recommended by W3C and supported by most vendors, which

    • has an extensible XML syntax,

    • supports more data types (user-defined types and 44 built-in types: 19 primitive and 25 derived),

    • is able to represent uniqueness/key constraints for both attribute types and element types,

    • and has many other improvements over DTD.

1. XML – XSD still has flaws

XSD is not sufficient for expressing the relational semantics in XML data. For example:

  • A key constraint is specified by a key element. The key constraint in XSD is an extension of ID in DTD. It is totally different from the key constraint in relational databases.

    • E.g. in the previous XSD, the values of the key attribute, pno of part, must be unique within the set of part elements in the whole document.

    • Therefore, when an element type is located at a lower level, such as supplier and project, XSD cannot declare sno and jno as their key attributes (OIDs) respectively: the same supplier or project (e.g. S001 or J003 in Example 1) is repeated under several parent elements, so its sno or jno value is not unique across the document.

1. XML – XSD still has flaws (cont.)

  • The key element must contain the following (in order):

    • One and only one selector element

      • It contains an XPath expression that specifies the set of elements across which the values specified by the field elements must be unique.

    • One or more field elements

      • Each contains an XPath expression that specifies the values that must be unique for the set of elements specified by the selector element.

        - The key constraint is similar to the unique constraint, except that the column on which a unique constraint is defined can have null values.
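The scope of such a key can be made concrete with a small Python sketch that imitates the check by hand, using only the standard library. It is not an XSD validator: the helper `check_key` and the trimmed copy of psj.xml are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed copy of Example 1: only the elements relevant to keys are kept.
PSJ_XML = """
<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno></supplier>
    <supplier><sno>S002</sno></supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S001</sno></supplier>
    <supplier><sno>S003</sno></supplier>
  </part>
</psj>
"""

def check_key(scope, selector, field):
    """Imitate an xs:key declared on `scope` (hypothetical helper).

    Every element matched by `selector` must have a `field` child, and the
    field values must be unique across all selected elements.
    Returns the sorted list of duplicated values (empty if the key holds).
    """
    values = []
    for elem in scope.findall(selector):
        child = elem.find(field)
        if child is None or child.text is None:
            raise ValueError(f"<{elem.tag}> has no {field} value")
        values.append(child.text)
    return sorted(v for v, n in Counter(values).items() if n > 1)

root = ET.fromstring(PSJ_XML)
part_dups = check_key(root, "part", "pno")               # key from psj.xsd
supplier_dups = check_key(root, "part/supplier", "sno")  # attempted key on sno
print(part_dups, supplier_dups)
```

The key on pno holds, while a document-wide key on sno is violated because S001 supplies both parts, even though sno still identifies a supplier in the relational sense.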

1. XML – XSD still has flaws (cont.)

  • XSD does not support relationship types and other relational semantic constraints.

    • E.g. the ternary relationship type psj among part, supplier and project in the original data is lost in the XSD.

  • XSD cannot distinguish attributes (or simple elements) of relationship types from attributes (or simple elements) of element types.

    • E.g. price is an attribute of the binary relationship type ps between part and supplier. However, it looks the same as sname, an attribute (simple element) of the element supplier.

Reconsider the semantics in Example 1

  • The XML data in Example 1 (psj.xml) is a typical data-centric XML document, derived from structured data that is usually stored in relational or object-relational databases.

  • The semantics of the data in Example 1 can be described in the ER diagram as follows.

[Figure: ER diagram of the data in Example 1]

One of the object-relational database representations of psj.xml

[Figure: diagram relating part, supplier and project through PS and PSJ]

There are 5 tables in the relational schema:

  part (pno, pname, color)
  supplier (sno, sname, (city)+)
  project (jno, jname, budget)
  PS (pno, sno, price)
  PSJ (pno, sno, jno, qty)

2. ORA-SS in a nutshell

  • ORA-SS is a semantically rich data model for semi-structured data.

  • It can easily represent the relational semantics and constraints in XML data.

  • The ORA-SS model is also a bridge that connects the tree structure of XML and the semantics in relational and object-relational databases.

  • In comparison with a traditional ER diagram, an ORA-SS schema diagram also represents the hierarchical structure of XML data.

2. ORA-SS in a nutshell (cont.)

  • A complete ORA-SS model has 4 diagrams

    • Schema diagram

      • Represents the structure of, and constraints (business rules) on, XML documents

    • Instance diagram

      • Visually represents the graphical structure of XML data

    • Functional dependency diagram

      • Represents FDs in relationship types

    • Inheritance diagram

      • Represents the specialization/generalization relationships among different object classes in ORA-SS

2. ORA-SS data models

  • Object class

    • attributes of object class

    • ordering on object class

  • Relationship Type

    • degree of relationship type

    • participating object classes in relationship type

    • attributes of relationship type

    • disjunctive relationship type

    • recursive relationship type

    • ID dependent relationship type

2. ORA-SS data models (cont.)

  • Attribute

    • attributes of object class or relationship type

    • key attribute (OID)

    • foreign key / referential constraint (IDREF/IDREFS)

    • composite attribute

    • disjunctive attribute

    • attribute with unknown structure

    • ordering on attributes

    • fixed or default value of attribute

    • derived attribute


The ORA-SS schema diagram of Example 1.

[Figure: ORA-SS schema diagram. Object class part (pno, pname, color) is linked to supplier (sno, sname, city+) by an edge labeled "PS, 2, +, +"; supplier is linked to project (jno, jname, budget) by an edge labeled "PSJ, 3, +, +"; the attribute price is labeled PS and the attribute qty is labeled PSJ.]

Part, supplier and project are modeled as object classes.

PS is a binary relationship type between part and supplier.

PSJ is a ternary relationship type defined among part, supplier and project.

Pno, sno and jno are declared as the object IDs of part, supplier and project respectively.

Price is an attribute of the relationship type PS, and qty is an attribute of PSJ.

ORA-SS – Features

  • ORA-SS can represent the following semantics

    • Object ID attributes, which play the role of key constraints in object-relational databases, i.e. the object ID attributes functionally determine (or multi-valued determine) the object attributes of the same object class.

    • Various relationship types, including ID dependent relationship types, with their degrees and participating object classes.

    • The distinction between relationship attributes and object attributes.

3. ORA-SS applications

  • Due to the rich semantics in ORA-SS, the model can be widely used in

    • Normal form XML schema

    • Relational/object-relational storage of XML data

    • XML view creation and validation [1]

    • XML schema/data integration

    • XML data query, especially with graphical user interfaces [5]

    • XML query optimization

    • etc.

[1]. Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER2002, Tampere, Finland. Oct 7-11, 2002

[5]. W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.

3. ORA-SS applications

Store ORA-SS in object-relational databases

  • Existing storage approaches store XML in flat files (NF relations), which are long and difficult to query and update;

  • In a pure relational DBMS, the joins take much time.

  • ORA-SS reflects the nested structure of semi-structured data.

  • Nested relations need fewer joins.

3. ORA-SS applications

Store ORA-SS in object-relational databases (cont.)

Given an ORA-SS schema diagram

  • Each object class is stored as an object relation with its object ID and its object attributes. (e.g. part, supplier, project)

  • Each relationship type is stored as a relationship relation with the object IDs of participating object classes and its relationship attributes. (e.g. PS and PSJ)

  • Multi-valued attributes and composite attributes are stored as nested relations. (e.g. city)

3. ORA-SS applications

Store ORA-SS in object-relational databases (cont.)

Storage schema for ORA-SS/XML databases, for the data in Example 1.

[Figure: ORA-SS schema diagram]

Storage schema

Object relations

  part (pno, pname, color)
  supplier (sno, sname, (city)+)
  project (jno, jname, budget)

Relationship relations

  PS (pno, sno, price)
  PSJ (pno, sno, jno, qty)

Constraint:

  PSJ[pno, sno] ⊆ PS[pno, sno]
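As a rough illustration of this storage schema, the Python sketch below shreds Example 1 into the five relations and checks the inclusion constraint. The function `shred` and the in-memory dictionaries are assumptions standing in for real object-relational tables.

```python
import xml.etree.ElementTree as ET

# Example 1 (psj.xml), compacted; namespace attributes omitted.
PSJ_XML = """<psj>
<part><pno>P001</pno><pname>Nut</pname><color>Silver</color>
 <supplier><sno>S001</sno><sname>Alfa</sname><city>Atlanta</city><price>5</price>
  <project><jno>J001</jno><jname>Rocket boots</jname><budget>20000</budget><qty>60</qty></project>
  <project><jno>J003</jno><jname>Firework launcher</jname><budget>250000</budget><qty>650</qty></project>
 </supplier>
 <supplier><sno>S002</sno><sname>Beta</sname><city>Atlanta</city><city>New York</city><price>5.5</price>
  <project><jno>J002</jno><jname>Diving helm</jname><budget>18000</budget><qty>70</qty></project>
  <project><jno>J003</jno><jname>Firework launcher</jname><budget>250000</budget><qty>50</qty></project>
 </supplier>
</part>
<part><pno>P002</pno><pname>Nut</pname><color>Copper</color>
 <supplier><sno>S001</sno><sname>Alfa</sname><city>Atlanta</city><price>4.6</price>
  <project><jno>J002</jno><jname>Diving helm</jname><budget>18000</budget><qty>60</qty></project>
 </supplier>
 <supplier><sno>S003</sno><sname>Beta</sname><city>New York</city><price>5</price>
  <project><jno>J001</jno><jname>Rocket boots</jname><budget>20000</budget><qty>20</qty></project>
  <project><jno>J004</jno><jname>Blue fireworks</jname><budget>20000</budget><qty>50</qty></project>
 </supplier>
</part>
</psj>"""

def shred(root):
    """Split the nested document into the five relations of the storage schema.

    Object relations are keyed by object ID; relationship relations are keyed
    by the object IDs of the participating object classes.
    """
    part, supplier, project, ps, psj = {}, {}, {}, {}, {}
    for p in root.findall("part"):
        pno = p.findtext("pno")
        part[pno] = (p.findtext("pname"), p.findtext("color"))
        for s in p.findall("supplier"):
            sno = s.findtext("sno")
            # city is multi-valued, so it is kept as a nested set.
            cities = frozenset(c.text for c in s.findall("city"))
            supplier[sno] = (s.findtext("sname"), cities)
            ps[(pno, sno)] = s.findtext("price")
            for j in s.findall("project"):
                jno = j.findtext("jno")
                project[jno] = (j.findtext("jname"), j.findtext("budget"))
                psj[(pno, sno, jno)] = j.findtext("qty")
    return part, supplier, project, ps, psj

part, supplier, project, ps, psj = shred(ET.fromstring(PSJ_XML))
# Inclusion constraint: PSJ[pno, sno] is a subset of PS[pno, sno].
inclusion_ok = {(pno, sno) for (pno, sno, _jno) in psj} <= set(ps)
print(len(part), len(supplier), len(project), len(ps), len(psj), inclusion_ok)
```

Repeated suppliers and projects collapse into one tuple each in the object relations, while price and qty stay attached to the relationship they describe.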

3. ORA-SS applications

Store ORA-SS in object-relational databases (cont.)

An example showing the advantage of using an object-relational database instead of a relational database.

[Figure: ORA-SS schema diagram]

Storage schema in ORDB

  Employee (eno, ename, (hobby)*,
            qualification(year, degree, Univ)*,
            job_history(year, job_title, company)*)

Storage schema in traditional RDB

  Employee (eno, ename)
  E_hobby (eno, hobby)
  E_qualification (eno, year, degree, Univ)
  E_job_history (eno, year, job_title, company)

3. ORA-SS applications

Define and validate XML views

  • Valid XML views in ORA-SS

  • View definition operators: select, project/drop, swap, join

For example, consider the following swap operation, which exchanges the hierarchical levels of supplier and part:

[Figure: a valid view and an invalid view produced by the swap]

Because price is a relationship attribute, it cannot be moved up together with the supplier elements; doing so would be semantically meaningless in the resulting view.

3. ORA-SS applications

Define and validate XML views (cont.)

As another example, consider the following projection operation, which drops supplier from the structure:

[Figure: an invalid view and a valid view produced by dropping supplier]

Dropping supplier makes price and qty multi-valued attributes, so aggregation functions should be applied to obtain a meaningful view.
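The effect of the projection on qty can be sketched in a few lines of Python. The rows are the PSJ tuples of Example 1; the function name `part_project_view` and the choice of `sum` as the aggregation function are assumptions, since the slides do not prescribe a particular aggregate.

```python
from collections import defaultdict

# PSJ relation from Example 1: (pno, sno, jno, qty).
PSJ = [
    ("P001", "S001", "J001", 60),
    ("P001", "S001", "J003", 650),
    ("P001", "S002", "J002", 70),
    ("P001", "S002", "J003", 50),
    ("P002", "S001", "J002", 60),
    ("P002", "S003", "J001", 20),
    ("P002", "S003", "J004", 50),
]

def part_project_view(psj_rows, aggregate=sum):
    """Drop sno and collapse the qty values of each (pno, jno) pair with
    `aggregate`, so that qty is single-valued again in the view."""
    groups = defaultdict(list)
    for pno, _sno, jno, qty in psj_rows:
        groups[(pno, jno)].append(qty)
    return {key: aggregate(values) for key, values in groups.items()}

view = part_project_view(PSJ)
print(view[("P001", "J003")])  # 650 + 50, from suppliers S001 and S002
```

Passing `aggregate=max` or `aggregate=min` instead gives other equally well-formed views; which one is meaningful depends on the application.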

3. ORA-SS applications

Graphical XML query based on ORA-SS

A graphical XML query language has been designed on the basis of ORA-SS.

Query 1: Select and display the projects that do not have any suppliers located in Atlanta.

[Figure: screenshot of the user interface of our graphical query language]

  • The schema panel loads the ORA-SS schema diagram.

  • A graphical query can be posed either by dragging components from the diagram in the schema panel or by using the construction buttons at the top of the window.

  • Complex query logic such as quantification, negation and IF-THEN constructions can be specified in the Condition Logic Window.

3. ORA-SS applications

XML query optimization

  • The semantic information represented in ORA-SS is also helpful in optimizing XML queries.

Consider the following simple query:

(Query 2) Display the budget of project “J001”.

3. ORA-SS applications

XML query optimization (cont.)

  • Traditional processing scans the whole XML document, checking every project with jno=“J001” and finding all corresponding budget values.

  • However, in ORA-SS, jno is the object ID, so we have the functional dependency

    jno → budget

    and the optimized processing only needs to find the first project instance with jno=“J001” and return the corresponding budget value.
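The difference between the two strategies for Query 2 can be sketched as follows. Both function names are hypothetical, and the document is a trimmed copy of Example 1; a real engine would also stop reading the file early (for instance with a streaming parser), which this in-memory sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of Example 1: project J001 occurs under two suppliers.
PSJ_XML = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno>
      <project><jno>J001</jno><budget>20000</budget></project>
      <project><jno>J003</jno><budget>250000</budget></project>
    </supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S003</sno>
      <project><jno>J001</jno><budget>20000</budget></project>
    </supplier>
  </part>
</psj>"""

def budgets_full_scan(root, jno):
    """Baseline: visit every project and collect every matching budget."""
    visited, found = 0, []
    for project in root.iter("project"):
        visited += 1
        if project.findtext("jno") == jno:
            found.append(project.findtext("budget"))
    return found, visited

def budget_first_match(root, jno):
    """Relies on jno -> budget: the first matching project has the answer."""
    visited = 0
    for project in root.iter("project"):
        visited += 1
        if project.findtext("jno") == jno:
            return project.findtext("budget"), visited
    return None, visited

root = ET.fromstring(PSJ_XML)
all_budgets, scanned = budgets_full_scan(root, "J001")
budget, stopped_after = budget_first_match(root, "J001")
print(all_budgets, scanned, budget, stopped_after)
```

The full scan visits all three project elements and returns the same budget twice; the optimized version stops after the first one.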

4. Discover semantics in XML documents

  • Problem definition

    • Input: a well-formed XML document, possibly with a DTD or XSD schema

    • Output: the semantics needed to build an ORA-SS schema

  • It is a process of enriching an XML schema into an ORA-SS schema by using mining techniques.

4. Discover semantics in XML documents

  • Related issues in mining semantics

    • Object classes

      • Identify object classes

      • Identify object IDs

      • Identify object attributes and their cardinalities

      • Identify IDREF(s) attributes

    • Relationship types

      • Find relationship types with their degrees and participating object classes

      • Find attributes and their cardinalities of relationship types

4. Discover semantics in XML documents

The whole vision of the process.

[Figure: overview of the process; the legend distinguishes the main flow of the process, the input flow and the output flow]

4. Discover semantics in XML documents

  • Assumption

    • To simplify the discussion, we do not consider the order of attributes and elements.

  • User-verification

    • The findings of each step of the process should be verified by the user.

    • The verified findings of previous steps are used in later steps.

4. Discover semantics in XML documents

Find object classes

  • Identify object classes from element types:

    • Scan the XML document or, if possible, the DTD/XSD of the XML document to select all internal nodes in the document tree.

    • An internal node is a node that has child nodes, i.e. XML attribute types and/or subelement types.

    • An internal node may not be an object class, but an object class must correspond to an internal node. Therefore, internal nodes are the candidates for object classes.

4. Discover semantics in XML documents

Find object classes (cont.)

  • Distinguishing composite attributes from object classes

    • Although composite attributes are also internal nodes, some special patterns indicate that they are not object classes.

The first pattern is that all subelement types or attributes are

  • Single-valued

  • Always in the same order

  • Not related by functional dependencies: no functional dependency can be found among the component attributes of a composite attribute.

[Figure: an XML element whose children are XML elements or XML attributes holding values]

4. Discover semantics in XML documents

Find object classes (cont.)

[Figure: a student element with studNo; an XML element whose children are XML elements or XML attributes holding values]

The second pattern is that all subelement types or attributes are:

  • Of the same type (repeated)

  • Often determined, as a set of values, by other element/attribute values (e.g. studNo determines the values of the hobby elements under the “hobbies” element)

4. Discover semantics in XML documents

Find object classes (cont.)

The DTD of Example 1 (psj.dtd) and its Data Guide are the ones shown in Section 1.

From the DTD of Example 1, the element types psj, part, supplier and project are internal nodes (they are easy to spot in the Data Guide). The list {psj, part, supplier, project} therefore contains the candidate object classes. Because a well-formed XML document usually has a document root that carries no data of its own, we can drop the root node psj from the list and get the final result {part, supplier, project}.
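When no DTD is available, the same candidates can be read off the document itself. The Python sketch below is one possible reading of this step; the function name `candidate_object_classes` is hypothetical and the document is a trimmed copy of Example 1.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of Example 1: one instance of each element type is enough here.
PSJ_XML = """<psj>
  <part><pno>P001</pno><pname>Nut</pname><color>Silver</color>
    <supplier><sno>S001</sno><sname>Alfa</sname><city>Atlanta</city><price>5</price>
      <project><jno>J001</jno><jname>Rocket boots</jname><budget>20000</budget><qty>60</qty></project>
    </supplier>
  </part>
</psj>"""

def candidate_object_classes(root):
    """Return the element types that are internal nodes (they have child
    elements or XML attributes), in document order, without the root."""
    candidates = []
    for elem in root.iter():
        is_internal = len(elem) > 0 or bool(elem.attrib)
        if is_internal and elem is not root and elem.tag not in candidates:
            candidates.append(elem.tag)
    return candidates

candidates = candidate_object_classes(ET.fromstring(PSJ_XML))
print(candidates)
```

Composite attributes would also show up in this list, which is why the patterns for telling them apart and the user verification step are still needed.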

4. Discover semantics in XML documents

Identify multi-valued attributes

  • After object classes and composite attributes are identified, we pick out all multi-valued attributes for later use.

    • Multi-valued attributes can be detected by checking the occurrence constraints in the DTD/XSD, or by counting occurrences directly in the document.

    • A multi-valued attribute can belong either to an object class (e.g. city of supplier) or to a relationship type. To decide where a multi-valued attribute belongs, we need to find the object IDs first.

    • Setting the multi-valued attributes aside makes the search for object IDs easier.
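Counting directly in the document can be sketched as below. The helper `occurrence_ranges` is an assumption, and the sample is a trimmed copy of the two suppliers of part P001; the observed (min, max) pairs are only evidence from one document, not schema constraints.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Trimmed copy of Example 1: the two suppliers of part P001.
PSJ_XML = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno><city>Atlanta</city><price>5</price></supplier>
    <supplier><sno>S002</sno><city>Atlanta</city><city>New York</city><price>5.5</price></supplier>
  </part>
</psj>"""

def occurrence_ranges(root, parent_tag):
    """For each child tag of `parent_tag`, return the (min, max) number of
    occurrences observed per parent instance."""
    parents = list(root.iter(parent_tag))
    tags = {child.tag for p in parents for child in p}
    counts = defaultdict(list)
    for p in parents:
        per_parent = Counter(child.tag for child in p)
        for tag in tags:
            counts[tag].append(per_parent[tag])  # 0 if the tag is absent
    return {tag: (min(c), max(c)) for tag, c in counts.items()}

ranges = occurrence_ranges(ET.fromstring(PSJ_XML), "supplier")
# A child that occurs more than once under some parent is multi-valued.
multi_valued = sorted(tag for tag, (_lo, hi) in ranges.items() if hi > 1)
print(ranges, multi_valued)
```

The same counts give a first estimate of the cardinality of each attribute, to be confirmed by the user.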

4. Discover semantics in XML documents

Find object IDs

  • For each identified object class (after user verification)

    • If it is located at the first level below the document root, and the DTD/XSD specifies an ID attribute or key constraint for it, then the corresponding attribute/element should be an object ID.

    • Otherwise

      • Build a temporary table that contains all XML attributes and single-valued simple subelement types of the object class.

      • Find the full functional dependencies in the temporary table.

        • If all attributes/elements are fully functionally dependent on an attribute/element k, then k is most likely the object ID;

          Else,

          • find the attribute/element k’ that functionally determines the largest number of attributes/elements; k’ is suggested as the object ID,

          • and the attributes/elements that are not determined by k’ are classified as single-valued attributes of some relationship types to be determined later.

  • The result should be verified by the user.

4. Discover semantics in XML documents

Find object IDs (cont.)

Candidate object classes list

{part, supplier, project}

Three temporary tables (built from the DTD of Example 1)

part_temp (pno, pname, color)
supplier_temp (sno, sname, price)
project_temp (jno, jname, budget, qty)

Notice that, at this stage, all simple subelement types and attributes are treated the same.

Multi-valued attributes such as city are not included in the temporary tables.

4. Discover semantics in XML documents

Find object IDs (cont.)

Three temporary tables

part_temp (pno, pname, color)
supplier_temp (sno, sname, price)
project_temp (jno, jname, budget, qty)

1. In part_temp, we find that

   pno → pname, color

   thus, pno is the object ID of part.

2. In supplier_temp, we only have

   sno → sname

   thus, sno is the object ID of supplier, and price is picked out as a relationship attribute.

3. In project_temp, we only have

   jno → jname, budget

   thus, jno is the object ID of project, and qty is picked out as a relationship attribute.
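The search over the three temporary tables can be sketched in Python as follows. The rows are taken from Example 1; the helpers `fd_holds`, `suggest_object_id` and `table`, and the tie-breaking rule (the earlier column wins), are assumptions rather than part of the method as presented.

```python
def fd_holds(rows, lhs, rhs):
    """True if column `lhs` functionally determines column `rhs` in `rows`:
    equal lhs values never come with different rhs values."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

def suggest_object_id(rows, columns):
    """Return (candidate, determined, undetermined) for the column that
    determines the most other columns; ties go to the earlier column."""
    best = None
    for k in columns:
        determined = [c for c in columns if c != k and fd_holds(rows, k, c)]
        if best is None or len(determined) > len(best[1]):
            best = (k, determined)
    k, determined = best
    undetermined = [c for c in columns if c != k and c not in determined]
    return k, determined, undetermined

def table(columns, *tuples):
    """Build a list of row dictionaries from plain tuples."""
    return [dict(zip(columns, t)) for t in tuples]

PART = ("pno", "pname", "color")
SUPPLIER = ("sno", "sname", "price")
PROJECT = ("jno", "jname", "budget", "qty")

part_temp = table(PART, ("P001", "Nut", "Silver"), ("P002", "Nut", "Copper"))
supplier_temp = table(
    SUPPLIER,
    ("S001", "Alfa", "5"), ("S002", "Beta", "5.5"),
    ("S001", "Alfa", "4.6"), ("S003", "Beta", "5"),
)
project_temp = table(
    PROJECT,
    ("J001", "Rocket boots", "20000", "60"),
    ("J003", "Firework launcher", "250000", "650"),
    ("J002", "Diving helm", "18000", "70"),
    ("J003", "Firework launcher", "250000", "50"),
    ("J002", "Diving helm", "18000", "60"),
    ("J001", "Rocket boots", "20000", "20"),
    ("J004", "Blue fireworks", "20000", "50"),
)

part_result = suggest_object_id(part_temp, PART)
supplier_result = suggest_object_id(supplier_temp, SUPPLIER)
project_result = suggest_object_id(project_temp, PROJECT)
print(part_result, supplier_result, project_result, sep="\n")
```

On a sample this small, other columns can tie with the suggested one (color in part_temp and jname in project_temp determine just as many columns), which illustrates why the result still has to be verified by the user.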

4. Discover semantics in XML documents

Find object IDs (cont.)

  • After the object IDs have been identified, we know:

    • The object ID of each object class,

    • The single-valued object attributes and their corresponding object classes,

    • The single-valued relationship attributes, without yet knowing which relationship types they belong to.

4. Discover semantics in XML documents

Multi-valued attributes of object classes

  • Recall that all multi-valued attributes were identified before the search for object IDs. Given a multi-valued attribute under an object class, we check,

    • for each object ID value of the object class, whether it is always associated with one and the same set of values of the attribute

      • If so, it is a multi-valued attribute of the object class;

        Else, it is classified as a multi-valued attribute of some relationship type not known yet.

4. Discover semantics in XML documents

Multi-valued attributes of object classes (cont.)

  • For example, city is a multi-valued attribute under supplier

    • We check sno and city: since each sno value is always associated with the same set of city values, city is a multi-valued attribute of supplier.

[Table: the temporary table of sno and city]
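The check on sno and city can be written as a short Python sketch. The pairs below are read off the four supplier elements of Example 1; the function name `belongs_to_object_class` and the "Boston" variant are hypothetical and only serve to show a failing case.

```python
from collections import defaultdict

# (sno, set of city values) for each supplier element in Example 1,
# in document order.
SUPPLIER_CITIES = [
    ("S001", {"Atlanta"}),
    ("S002", {"Atlanta", "New York"}),
    ("S001", {"Atlanta"}),
    ("S003", {"New York"}),
]

def belongs_to_object_class(occurrences):
    """True if every object ID value is always paired with the same set of
    attribute values, i.e. the object ID determines the value set."""
    sets_per_id = defaultdict(set)
    for oid, values in occurrences:
        sets_per_id[oid].add(frozenset(values))
    return all(len(distinct) == 1 for distinct in sets_per_id.values())

city_is_object_attribute = belongs_to_object_class(SUPPLIER_CITIES)

# Hypothetical variation (not in psj.xml): S001 lists another city elsewhere.
variant = SUPPLIER_CITIES[:2] + [("S001", {"Boston"})]
variant_is_object_attribute = belongs_to_object_class(variant)
print(city_is_object_attribute, variant_is_object_attribute)
```

In the variant, the city set of S001 depends on where the supplier appears, so city would be handed over to the relationship type search instead.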

4. Discover semantics in XML documents

Find cardinality of object class attributes

  • For multi-valued object attributes, we should know their cardinality

    • If the DTD/XSD specifies it, reuse it.

    • Without a schema, count the minimum and maximum occurrences of the multi-valued attributes.

    • Notice that both single-valued and multi-valued attributes can be optional (e.g. ? and *). Thus, the result should be verified by the user.

4. Discover semantics in XML documents

Find IDREF/IDREFS

  • Identify IDREFs

    • If the DTD/XSD specifies IDREF/IDREFS or keyref constraints, reuse them.

    • Without a schema, we compare the object attribute values with the values of the object IDs of other object classes,

      • If all values of a single-valued attribute of objects of the same class appear as object ID values of some particular object class, then it is an IDREF;

      • If all values of a multi-valued attribute of objects of the same class appear as object ID values of some particular object class, then it is an IDREFS.

        (Note that, if it is an XML attribute, multiple values of IDREFS are separated by a blank character.)
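The value comparison can be sketched in Python as below. Example 1 contains no reference attributes, so the attributes `main_supplier` and `used_in` are invented purely for illustration, as are the helper names; only the object ID sets come from psj.xml.

```python
def split_values(raw):
    """An IDREFS-style XML attribute holds blank-separated values."""
    return raw.split()

def classify_reference(attribute_values, object_ids):
    """Classify an attribute by comparing its values with known object IDs.

    attribute_values: one list of values per object instance.
    object_ids: dict mapping object class name -> set of object ID values.
    Returns (kind, target_class); kind is 'IDREF', 'IDREFS' or None.
    """
    flat = [v for values in attribute_values for v in values]
    if not flat:
        return None, None
    for cls, ids in object_ids.items():
        if all(v in ids for v in flat):
            multi = any(len(values) > 1 for values in attribute_values)
            return ("IDREFS" if multi else "IDREF"), cls
    return None, None

# Object IDs found for Example 1.
OBJECT_IDS = {
    "supplier": {"S001", "S002", "S003"},
    "project": {"J001", "J002", "J003", "J004"},
}

# Hypothetical attributes, not present in psj.xml.
main_supplier = [["S001"], ["S003"]]
used_in = [split_values("J001 J003"), split_values("J002")]
# A real attribute of Example 1 that references nothing.
sname = [["Alfa"], ["Beta"]]

print(classify_reference(main_supplier, OBJECT_IDS))
print(classify_reference(used_in, OBJECT_IDS))
print(classify_reference(sname, OBJECT_IDS))
```

A match found this way is only a suggestion: values can coincide with object IDs by accident, so the finding goes to the user for verification like the others.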

4. Discover semantics in XML documents

Find relationship types

  • Identify relationship types (basic idea)

    • The search for relationship types is based on the object IDs and the relationship attributes (single-valued or multi-valued).

    • Along a path from the root to a leaf node in the document tree, we may pass through several object classes. The object IDs of these object classes can form a temporary table. We build such a temporary table for each single-valued relationship attribute, and use it to find the relationship types.

4. Discover semantics in XML documents

Find relationship types (cont.)

  • For each single-valued relationship attribute, there is a path from the root to the attribute. Put the object IDs of the object classes along that path into a temporary table, together with the relationship attribute.

    • Find the FDs that determine the single-valued relationship attribute in the temporary table.

  • For multi-valued relationship attributes, we should find a combination of object IDs of different object classes such that each distinct combination of object ID values corresponds to a unique set of attribute values.

4. Discover semantics in XML documents

Find relationship types (cont.)

  • From the data in Example 1, we can build a temporary table for price along the path “part/supplier/price”, with columns pno, sno, and price.

We can find that {pno, sno} → price; thus, there is a binary relationship type between part and supplier, and price is an attribute of that binary relationship type.


4. Discover semantics in XML documents

Find relationship types (cont.)

  • Similarly, we can build a temporary table for qty along the path “part/supplier/project/qty”, with columns pno, sno, jno, and qty.

We can find that {pno, sno, jno} → qty; thus, there is a ternary relationship type among part, supplier, and project, and qty is an attribute of that ternary relationship type.


4. Discover semantics in XML documents

Find relationship types (cont.)

  • Relationship types can exist without having relationship attributes.

  • To find such relationship types, we build a temporary table from the object IDs of the different object classes, based on the existing paths in the document tree.

  • Search the temporary table for MVDs (see the following example).


4. Discover semantics in XML documents

Find relationship types (cont.)

  • Suppose we have another document about project, staff, and paper. After finding their object ID attributes, i.e. J_no, St_no, and Pa_no respectively, we can create a temporary table with these three columns.

  • We have already identified:

    • the hierarchical structure;

    • object classes and their object IDs;

    • attributes of object classes.

  • However, no attribute appears to belong to a relationship type.


4. Discover semantics in XML documents

Find relationship types (cont.)

We build a temporary table which consists of J_no, St_no, and Pa_no.

CASE 1. If we find that each St_no value is associated with a unique set of Pa_no values, independent of J_no, i.e. St_no multi-determines Pa_no (St_no →→ Pa_no), then there are two binary relationship types: one between project and staff, and the other between staff and paper.

CASE 2. If there is no FD or MVD in the table, then there is a ternary relationship type among project, staff, and paper.
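The CASE 1 test can be sketched as a check for a multi-valued dependency in a three-column table. X →→ Y holds exactly when, for every X value, the observed (Y, Z) pairs form the full cross product of that value's Y set and Z set. The function name `holds_mvd` and the sample rows are assumptions for illustration. The sketch only tests the single MVD named in CASE 1, not every possible FD or MVD as CASE 2 requires.

```python
from collections import defaultdict


def holds_mvd(rows, x, y, z):
    """Return True if x ->> y holds in a table whose columns are x, y and z."""
    ys, zs, pairs = defaultdict(set), defaultdict(set), defaultdict(set)
    for row in rows:
        ys[row[x]].add(row[y])
        zs[row[x]].add(row[z])
        pairs[row[x]].add((row[y], row[z]))
    # The pairs are always a subset of the cross product, so equal sizes mean equality.
    return all(len(pairs[k]) == len(ys[k]) * len(zs[k]) for k in pairs)


# Hypothetical temporary table: st1 works on j1 and j2 and wrote pa1 and pa2.
case1 = [
    {"J_no": "j1", "St_no": "st1", "Pa_no": "pa1"},
    {"J_no": "j1", "St_no": "st1", "Pa_no": "pa2"},
    {"J_no": "j2", "St_no": "st1", "Pa_no": "pa1"},
    {"J_no": "j2", "St_no": "st1", "Pa_no": "pa2"},
    {"J_no": "j1", "St_no": "st2", "Pa_no": "pa1"},
]
# Dropping one combination makes the papers depend on the project as well.
case2 = case1[:3] + case1[4:]

print(holds_mvd(case1, "St_no", "Pa_no", "J_no"))  # True
print(holds_mvd(case2, "St_no", "Pa_no", "J_no"))  # False
```

In the first table the papers of a staff member do not depend on the project, which suggests two binary relationship types. In the second they do, which points toward a ternary relationship type, provided no other FD or MVD holds.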


4. Discover semantics in XML documents

Find participation constraints

  • The participation constraints of each relationship type can be obtained by counting the unique object ID values in the corresponding temporary table.
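One plausible reading of this counting step, sketched under stated assumptions: for a binary relationship type, count for each object on one side the distinct objects it is related to on the other side, and report the observed minimum and maximum. The function name `participation` and the sample rows are hypothetical.

```python
from collections import defaultdict


def participation(rows, side, other):
    """Observed (min, max) number of distinct `other` IDs per `side` ID."""
    partners = defaultdict(set)
    for row in rows:
        partners[row[side]].add(row[other])
    counts = [len(ids) for ids in partners.values()]
    return min(counts), max(counts)


# Hypothetical temporary table for a part-supplier relationship type.
rows = [
    {"pno": "p1", "sno": "s1"},
    {"pno": "p1", "sno": "s2"},
    {"pno": "p2", "sno": "s1"},
    {"pno": "p3", "sno": "s1"},
]
print(participation(rows, "pno", "sno"))  # (1, 2)
print(participation(rows, "sno", "pno"))  # (1, 3)
```

The observed minimum is never below 1, because an object that takes part in no relationship does not appear in the table. Detecting a minimum of 0 would require the full set of object IDs of the class. The bounds describe only the current data, so they remain candidates for user verification.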



4. Discover semantics in XML documents

User verification

  • All outputs, including those intermediate results, should be verified by users.

  • With input from users and their verification, a semi-automatic mining process can be applied to discover the semantics in XML documents that are important in designing XML databases, storing XML data, validating XML views, and processing/optimizing XML queries.

  • All the discovered semantics can be represented by ORA-SS; but some of them cannot be represented in DTD/XSD.

Roadmap

  • XML documents and current XML schema languages

  • ORA-SS (Object-Relationship-Attribute model for Semi-Structured data)

  • The applications of ORA-SS

  • Discovering Semantics in XML documents

  • Conclusion

5. Conclusion

  • We demonstrate a data-centric XML document and show the limitations of current XML schema standards in representing relational semantics and constraints.

  • We introduce ORA-SS, a semantically rich data model that can intuitively express the semantics in XML data.

  • We discuss a naïve method of mining semantics from XML data/schema to generate an ORA-SS schema. More efficient methods should be further investigated.

5. Conclusion (cont.)

  • The semantics in ORA-SS are crucial in designing XML databases, writing and interpreting XML queries, validating XML views, etc.

  • The method proposed in this presentation to discover semantics only provides candidate answers. In other words, not all the results are necessarily true, because they are inferred from the current contents of the data, which may change. Therefore, user feedback is indispensable in the process of enriching an XML schema into an ORA-SS schema.

References:

[1]. Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER2002, Tampere, Finland. Oct 7-11, 2002

[2]. C. J. Date. An Introduction to Database Systems. 3rd edition, Addison-Wesley Publishing Company (1981).

[3]. Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/

[4]. T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc. 2005

[5]. W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.

[6]. XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/

[7]. XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/

[8]. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/



Q & A



The End
