On View Support for a Native XML DBMS

On View Support for a Native XML DBMS Ting Chen , Tok Wang Ling School of Computing, National University of Singapore Daofeng Luo, Xiaofeng Meng Information School , Remin University of China

Outline • View for XML Documents • Two Main Approaches • Problems • ORA-SS: Object-Relationship-Attribute Model for Semi-structured Data • ORA-SS class diagram and instance diagram • ORA-SS for view schema definition • Element-Based Clustering (EBC) • Basic Approach • ORA-SS and EBC • XML View Transformation • Problem Definition • Algorithm • Conclusion

View for XML Documents • Two main approaches to define views • Define views in script languages like XQuery or XSLT • General but demanding from users’ point of view because XQuery and XSLT scripts are complex • Difficult to optimize from performance view point • Define views by Schema-Mapping • E.g. Clio[7] and eXeclon[3] • A declarative approach; alleviate users from writing complex scripts to perform view transformation • Schema mappings can then be translated into XQuery (or XSLT) scripts • We focus on the problem of view transformation through schema mapping

View for XML Documents • View for XML Documents via Schema-Mapping • Problem: Current XML schema formats are not able to express views with semantic constraints, resulting in ambiguity • E.g. (Next Slide) The source XML file contains information about researchers working under different projects and the publication list for each researcher.

View for XML Documents The view schema in Fig (c) of the above diagram is such an example. It has at least two possible meanings which can lead to different view results! 1. For each project, list all the papers published by project members; for each paper of the project, list all the authors of the paper. 2. For each project, list all the papers published by project members; for each paper of the project, list all the authors of the paper who work for the project.

ORA-SS Data Model • ORA-SS[2] • Object Class • Relationship Type • Attribute( Object attribute or Relationship attribute) • E.g. An ORA-SS Instance Diagram * Compared with the XML document in Slide 5, two extra fields (attribute Date and sub-element Position) are added in the above ORASS instance diagram

ORA-SS Data Model • ORA-SS Schema Diagram • There are two binary relationship types in the schema: Project-Researcher(JR) and Researcher-Paper(RP). The set of papers under a researcher doesn’t depend on the project he/she works in. • Position is an attribute of relationship type JR instead of Researcher. This means that a researcher may hold different positions across projects he works in. • Date is a single-valued attribute of object class Paper. Different occurrences of the same paper will always have the same Date value. • J_Name,R_Name and P_ID are identifiers of object classes Project , Researcher and Paper respectively as indicated by solid circles. Identifier values are used to tell if two object occurrences are identical.

ORA-SS Data Model • ORA-SS for View Schema Definition • It is able to define views with different semantics • View (a) has two binary relationship types. The intention of the view schema is to find all the papers published by researchers in a project; and for each paper to find all of its authors. • View (b) has only one ternary relationship type. The view is defined to find all the papers published by researchers in a project; however, for each paper View (b) only finds those authors working for the project. View (a) View (b)

ORA-SS Data Model • ORA-SS for View Schema Definition • The two view schemas in Slide 8 correspond to two different XSLT scripts. The following is the XSLT script for schema (a): <root> <xsl:for-each-group select="root/Project" group-by="@J_Name"> <Project> <J_Name><xsl:value-of select="@J_Name"/></J_Name> <xsl:for-each-group select="current-group()/Researcher/Paper" group-by="@P_Name"> <Paper> <xsl:variable name="vPName" select="@P_Name"/> <P_Name><xsl:value-of select="@P_Name"/></P_Name> <xsl:for-each-group select="/root/Project/Researcher[Paper/@P_Name =$vPName]" group-by="@R_Name"> <Researcher> <R_Name><xsl:value-of select="@R_Name"/></R_Name> </Researcher> </xsl:for-each-group> </Paper> </xsl:for-each-group> </Project> </xsl:for-each-group> </root>

ORA-SS Data Model <root> <xsl:for-each-group select="root/Project" group-by="@J_Name"> <Project> <J_Name><xsl:value-of select="@J_Name"/></J_Name> <xsl:for-each-group select="current-group()/Researcher/Paper" group-by="@P_Name"> <Paper> <xsl:variable name="vPName" select="@P_Name"/> <P_Name><xsl:value-of select="@P_Name"/></P_Name> <xsl:for-each-group select="current-group()/.." group-by="@R_Name"> <Researcher> <R_Name><xsl:value-of select="@R_Name"/></R_Name> </Researcher> </xsl:for-each-group> </Paper> </xsl:for-each-group> </Project> </xsl:for-each-group> </root> • The following is the XSLT script for schema (b): • The main difference of two scripts lies in the third xsl:for-each-group directive for Researcher. Script for Schema (a) needs to search the whole document to find the complete author list of a paper because authors may not work for the same project. On the other hand, script for Schema (b) avoids the global search because it only needs find authors of the paper working for the same project.

A A A A A A B B B B C C Element-Based-Clustering (EBC) • Element Based Clustering[5] • Extension of Element Based (EB[6]) Strategy • Element nodes (records) with the same tag name are clustered and organized as a list

Element-Based-Clustering (EBC) • EBC: Node labeling • EBC gives labels for nodes in a XML document • Labels for nodes in a XML data treecan be calculated in the following manner: 1. The root element has label nil 2. Perform a pre-order traversal (i.e. Document order) on the XML document For node x: (“+” here means string concatenation) label(x) = label(x.parent) + “.” + position of x in x.parent’s childList • Node A is ancestor of node B if Label(A) is the prefix of node Label(B) and viceversa

ORA-SS and EBC • How does ORA-SS schema help tune XML document storage?

ORA-SS and EBC 1. Object identifier of an object will be stored together with the object. Objects of the same class form a cluster. Each cluster is a sequential file. 2. Relationship attribute values will be stored in separate cluster. 3. For object attribute values, we need some heuristics. It is attempting to store objectattributes together with the object since they are likely to be accessed at the sametime. However, if an object class has too many objects attributes, to store all attributevalues together with the object brings us to a situation similar to Subtree-Based storage strategy. One solution is to store only those essential (this can be determined by users) object attributes with theobject and leave other to separated clusters. 4. Node (which can be object, relationship value and object attribute value) labels will be stored together with nodes.

View Transformation: Problem Definition • Problem Definition Given an XML document D1, a source schema V1 and a valid view schema V2 of V1,transform D1 to document D2, so that D2 is a valid document under V2. View Transformation Source Document View Document User Defined Schema Mapping Source Schema View Schema

View Transformation • View schema defined in DTD can have different interpretations which result in different views Proj Slide 8(a) JP;2 View Schema (Ambiguous) Paper Proj PR;2 ? Researcher Paper Slide 8(b) Proj Researcher Paper JPR;3 Researcher

View Transformation • View schema expressed in ORA-SS is unambiguous • The relationship set of an ORA-SS view schema clearly defines how view document should be constructed • Two basic techniques are used in construction of a single relationship in view schema • Structural join: SJ (based on object labels) For each relationship R in view schema, we first use structural join to find the set of paths of type R such that the object occurrences in each path locate on the same path in source document. • Value join : Merge (based on logical object identifiers) In XML document an object can have many occurrences. Two occurrences are considered as the same if they have identical object identifier. The set of paths resulted from structural join is then value joined (or merged) using logical object keys.

View Transformation • What makes things complicated? • A path in view schema can have more than one relationship and we need to join two relationship together. E.g. View schema in Slide 8(a) contains two relationships. • Two relationships are “joined” on their overlapping object classes. So the results constructed for one relationship may be used in construction of another. • Value join (merge) based on logical keys destroys the “sorted-ness” of the output path list of structural join. The consequence is that the output of value join can’t be used in subsequent structural join with paths from other relationships efficiently. (Structural join requires sorted input lists) • Solution: Duplicated-Preserving Merge (D-Merge). It keeps the structural join output path list intact. For two occurrences of the same object in the list, their child contents will be merged and then each of them will have its own copy of the merged content. D-Merge is used only when a relationship needs to join with another relationship.

View Transformation: Algorithm • The relationship set of a view schema determines the view transformation process. • Take the two view schemas in Slide 8 as an example: • View(a): • L := SJ(list(R), list(P), {P,R}) • L := D-Merge(L, {P}) • L := SJ(L, list(J), {J,P}); • L := Merge(L,{J}); • View(b): • L := SJ(list(J), list(P), {J,P}) • L := SJ(L, list(R), {J,P,R}); • L := Merge(L,{J});

View Transformation: Algorithm • Structural Join • Based on Object Label • Binary Structural Join[1] Input Two sorted (on node numbers) node lists: AList of potential ancestor nodes and DList of potential descendants nodes Output OutputList = [(ai; dj)] of join results, in which ai is the parent/ancestor of dj and ai is from AList and dj is from DList • EBC schemes stores elements with the same tag name in pre-ordered( i.e. sorted on element number) way; structural join can be naturally applied

View Transformation: Algorithm • Structural Join: • Based on Object Label • Complex Structural Join: (Example: structural join of three sorted input lists) Input Sorted node lists: A,B,C Output OutputList = [(ai; bj ; ck)] of join results, in which ai, bj , ck(from List A,B,C respectively) are located on the same path insource document • Two binary joins: • Step 1: Join A and B • OutputList AB: [(ai; bj )] sorted on ai • Step 2: Join AB (using ai as node label) and C • OutputList = [(ai; bj ; ck)] sorted on ai • Important: ck should be on the same path as both ai AND bj

View Transformation: Algorithm • A motivating example: Source Document: Source Schema View Schema Proj Proj JR;2 JP;2 Researcher Paper PR;2 RP;2 Paper Researcher

View Transformation: Algorithm • Data Structures • Record • Object label • Object key (object identifier) • ChildList: an array of children record references • Tuple • A tuple consists of an array of records • A tuple has a root record • Example • Array of Tuples Corresponding Tree: Tuple Array Index: 0 1 2 3 r1 r2 r3 r4

View Transformation: Algorithm • Operations • Structural Join (SJ) • Input • L1: Array of Tuples of type <A1,A2,A3,…, An> • L2: Array of Tuples of type <B1,B2,B3,…, Bn> • Mask M: Bit Array, which specifies the object classes which participate in structural join • Output • L3: Array of Tuples of type <A1,A2,A3,…, An,B1,B2,B3,…, Bn> For each tuple in L3, its objects whose types are specified in M MUST locate on the same path in the source document • Merge (Merge) • Input • L1: Array of Tuples of type < A1,A2,A3,…, An> • Mask M: Bit Array, which specifies a list of object classes • Output • L: Array of Tuples For any two tuples t1 and t2 in L1 which have the same set of objects whose types are specified in M, the two tuples will be merged. • Duplicate-Preserving Merge (D-Merge) • Input • L1: Array of Tuples of type < A1,A2,A3,…, An> • Mask M: Bit Array, which specifies a list of object classes • Output • L: Array of Tuples For any two tuples t1 and t2 in L1 which have the same set of objects whose types are specified in M, the contents of the two tuples will be merged. t1 and t2 will then both have the merged content. Note: Structural Join operation uses object number while Merge and D-Merge use object identifer.

View Transformation: Algorithm • Definition 1:Relationship R1 is lower than R2, or R1 < R2, if the top-most participating object class of R1 is a descendent of the top-most participating object class of R2. • Definition 2: A relationship is independent if its set of participating object classes is not included in any other relationship. Otherwise the relationship is nested. • Algorithm (Main steps): Step 0. Initialize L1 := null, L2 := null, M := { }, //M contains the set of object classes that has been structurally joined L[O] : = null for each O in the view schema Step 1. Put the independent relationships of the ORA-SS view schema into a partially ordered set (S, < ) Step 2: While (S is not empty){ Extract the first relationship R from S; If (M∩R is not null) L1 := L[O] for the highest O(O has the smallest level in the view schema) in M∩R; Else L1 := null; For each object class O in (R – M) sorted decreasingly by their depths in the view schema M := M U {O}; L2 := Cluster[O]; //Cluster[O] is an array storing objects of class O if(L1 != null ) L1 := StructuralJoin(L1,L2,M∩R); else L1: = L2; L[O] = L1; If R is the highest level relationship L1 := Merge(L1, R.top_most_object_class); Else L1 := D-Merge(L1, R – {object classes of R which doesn’t appear in any other independent relationship}); } Step 3: return L[Root]; where Root is the root of the view schema;

View Transformation: Algorithm • Example: View Schema Slide 8a • Step 1: Construct Relationship Project-Paper Step 1.1. L := SJ (list(P), list(R), {P,R}) Source Doc: View Schema Proj JP;2 Paper Step 1.2. L := D-Merge (L, {P}) PR;2 Researcher S = {PR,JP};

View Transformation: Algorithm • Example: View Schema Slide 8a • Step 2: Construct Relationship Project-Paper Step 2.1. L := SJ(L, list(J), {J,P}) Source Doc: View Schema Proj JP;2 Paper Step 2.2. L := Merge (L, {J} ) PR;2 Researcher S = {JP};

Conclusion • We demonstrate how to combine an efficient XML storage scheme (EBC) and an expressive XML data model (ORA-SS) to provide XML view support for a native DBMS system. • ORA-SS, used as view schema definitions, can express a great variety of constraints graphically. More importantly, it can avoid ambiguity, which is a typical problem if by DTD or XML Schema is used as view schema format. • In our transformation method, a relationship type is the basic unit of transformation. Both object label based structural join and logical key based value join are employed to construct view results. ORASS view schema information can guide correct view transformation.

Reference • Shurug Al-Khalifa, H. V. Jagadish, Nick Kouda, Jignesh M. Patel, Divesh Srivastava, YuqingWu. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In Proceedings of ICDE, 2002 • Gillian Dobbie, Wu Xiaoying, Tok Wang Ling, Mong Li Lee: ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data TR21/00, Technical Report, Department of Computer Science, National University of Singapore, December 2000. • eXcelon. An General XML Data Manager. http://www.exceloncorp.com/ • H. V. Jagadish, Shurug AL-Khalifa, et al. TIMBER: A Native XML Database. Technical Report, University of Michigan,April 2002. • Xiaofeng Meng, Daofeng Luo, Mong Li Lee, Jing An. OrientStore: A Schema Based Native XML Storage System. In Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003 • J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J.Widom. Lore: A Database Management System for Semistructured Data. SIGMOD Record, Vol.26(3):54-66,September 1997. • Lucian Popa, Mauricio A. Hern´andez ,Yannis Velegrakis , Ren´ee J. Miller, Felix Naumann, Howard Ho. Mapping XML and Relational Schemas with Clio. In ICDE 2002 Demo, 2002

On View Support for a Native XML DBMS

On View Support for a Native XML DBMS

Presentation Transcript

Native XML Databases for Information Systems

Native XML Database for Information Systems

SQL/XML, XQuery , and Native XML Programming Languages

XML Storage and Indexing Native XML

A ‘movingpoint’ type for a DBMS

On Relational Support for XML Publishing

Native XML Database for Information Systems

Schema Advisor for Hybrid Relational-XML DBMS

The NATIVE XML Server

Tamino – a DBMS Designed for XML

Storing XML using native storage

Sedna: A Native XML DBMS

Native XML Databases

A View Based Security Framework for XML

XML Storage and Indexing Native XML

MonetDB/XQuery: Using a Relational DBMS for XML

APPX XML Support

XML Native Query Processing

TIMBER A Native XML Database

Native XML Databases

OrientX: A Native XML Database System